Archival of Research Products
Adapted from https://reproducible-science-curriculum.github.io/publication-RR-Jupyter
In this lesson we will learn why appropriate online archival of digital research inputs and outputs is an important part of promoting reproducibility. Specifically, research inputs such as data, source code, and notebooks that become digitally unavailable or inaccessible because they were not archived for perpetuity are a major impediment to reproducibility.
Learning objectives
- Define and describe the importance of archiving research inputs and outputs.
- Select appropriate archival services for different types of research objects.
- Define the benefits of acquiring globally unique resolvable identifiers for digital research objects archived online.
Online archiving of digital research inputs and products
Why archive research inputs and products, and why use archives
A not so uncommon story: You’re a graduate student reading a paper on which you want to base your analysis approach, and you therefore need to verify and reproduce the analysis. The paper gives the lab’s website as the link for obtaining the code. However, it turns out the researcher has since left that university, and their new lab’s website no longer has a link to that code. You contact the author directly to ask about the code that the paper claims is available from the lab website. After several weeks of silence the author responds that they will try to find the code, but they’re working on a different project now. That was a month ago.
Lab websites aren’t archives. Doing online archiving well is non-trivial, and likely isn’t your line of research. Use an online archive that specializes in doing well what you need from an archive.
Journal supplemental materials are popular for digital archiving (they are typically free for the author) but often suffer from a number of deficiencies when compared to a bona fide online archive:
- Poor or missing independent indexing for findability
- Paywalled if the article is paywalled
- Lack of direct download link
- Only manuscript-oriented formats supported (PDF, MS Word)
- No separate citation and unique identifier
There are many archives, for all imaginable purposes and domains. In fact, there are so many that there is re3data, a registry of currently >2000 repositories that allows browsing them by various attributes.
Exercise 1
- Identify requirements and desirable features for an archive for a non-manuscript research product of your choice. Compare to lab website archiving and journal supplemental materials.
Exercise 2
Identify the research products that underlie and support a manuscript of yours in preparation (or one recently published if those research products became supplementary materials or were not published). Consider how well each of the following repositories fits that purpose:
- Dryad
- Zenodo
- Harvard Dataverse
- Figshare
- Journal of Open Source Software
- Your university’s Institutional Repository (IR)
- A public source code repository (GitHub, Bitbucket, GitLab, etc.)
Explain your preferences, and compare to lab website and supplemental material archiving.
Stable, globally unique, and resolvable identifiers for research products
Why globally unique resolvable identifiers for non-paper research products?
One of the key benefits of using an archive is that nearly all of them will assign a globally unique resolvable identifier to deposits. Deposit identifiers benefit both depositors and those reusing deposits, i.e., all of an archive’s primary users. They make it possible to:
- Identify and cite the deposit in a manuscript and a CV
- Track views, downloads, or more generally impact
- Identify exactly which record, and which version of it, was (re)used
Why DOIs
DOIs (digital object identifiers) are only one type of unique identifier, but they are the type most frequently used in scholarly communication and for identifying research products. Some of the benefits of DOIs include:
- Allows separating content from who hosts the content.
- Cannot be minted ad hoc, and instead requires interacting with a registration agency, which typically needs to be paid a fee. This fosters metadata quality, and assigns clear responsibility for maintaining the DOI’s continued resolution.
- Publishers, and the publishing industry, know how to deal with them.
- Practically every scholar knows how to deal with them.
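The first benefit above, separating content from who hosts it, comes from the fact that a DOI is cited through the central resolver at https://doi.org/ rather than through the host's own URL. A minimal sketch of turning a DOI string into its resolver link is shown below. The function name and the DOI are hypothetical, and the pattern check is a rough assumption rather than the full DOI syntax.

```python
import re
from urllib.parse import quote

# Prefixes people commonly put in front of a bare DOI.
_PREFIX = re.compile(r"^(https?://(dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ resolver URL for a DOI string."""
    bare = _PREFIX.sub("", doi.strip())
    # Rough shape check only: "10.", a numeric prefix, "/", then a suffix.
    if not re.match(r"^10\.\d{4,9}/\S+$", bare):
        raise ValueError(f"not a DOI: {doi!r}")
    # Keep "/" readable; percent-encode characters that are unsafe in URLs.
    return "https://doi.org/" + quote(bare, safe="/")


# Hypothetical DOI, used only to show the shape of the result.
print(doi_to_url("doi:10.1234/example.5678"))
```

If the archive later moves the deposit to a different address, the resolver link stays the same and only the target registered for the DOI changes.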
CrossRef versus DataCite
While DOIs all look the same on the surface, some expectations for their associated metadata (and programmable APIs) differ based on the issuing DOI registrar (often referred to as the “type of DOI”). In scholarly publishing and communication, the most frequently encountered DOI registrars are CrossRef (issues almost all scientific paper DOIs, works with publishers) and DataCite. The latter is used for all kinds of “other” research products, including data, software, source code, and preprints.
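One way to look at the metadata behind a DOI, whichever registrar issued it, is HTTP content negotiation against the https://doi.org/ resolver. The sketch below asks for CSL JSON, a format that, to the best of our knowledge, both CrossRef and DataCite can return. The function names and the DOI are hypothetical, and the fields in the reply will differ between registrars.

```python
import json
import urllib.request

# Media type for citation metadata in CSL JSON.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def metadata_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a content-negotiation request for a DOI."""
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": CSL_JSON},
    )


def fetch_metadata(doi: str) -> dict:
    """Send the request and decode the JSON reply (needs network access)."""
    with urllib.request.urlopen(metadata_request(doi), timeout=30) as resp:
        return json.load(resp)


# Hypothetical DOI; nothing is sent until fetch_metadata() is called.
req = metadata_request("10.1234/example.5678")
print(req.full_url, req.get_header("Accept"))
```

Calling fetch_metadata() with a real DOI of your own deposit is a quick check that the title, authors, and version registered with the DOI are what you expect.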
Publishing
Authoring & Citation
In addition to LICENSE, it is recommended to add:
- AUTHORS: file containing the list of contributors
- CITATION: file containing information on how to cite your work, including a DOI (Digital Object Identifier)
List of authors
A GitHub repository is very often created by an individual user, so the repository by itself may credit only that one person. Adding the list of authors is therefore very important.
Add a new file AUTHORS in your repository
- Go to your repository in your web browser
- Click on “Create new file” and name your file AUTHORS
- Add the list of authors/contributors and commit your changes
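If the repository already has a commit history, a first draft of the AUTHORS file can be derived from it. The sketch below is only one possible approach: it assumes you have run `git shortlog -sne HEAD` yourself and pass its output in as text. The function name and the sample names are made up for illustration.

```python
def authors_from_shortlog(shortlog: str) -> str:
    """Turn `git shortlog -sne` output into the text of an AUTHORS file."""
    names = []
    for line in shortlog.splitlines():
        if not line.strip():
            continue
        # Each line looks like "<commit count>\t<Name> <email>".
        _count, _, author = line.strip().partition("\t")
        if author.strip():
            names.append(author.strip())
    return "\n".join(names) + "\n"


# Made-up sample input; real input would come from `git shortlog -sne HEAD`.
sample = "    12\tJane Doe <jane@example.org>\n     3\tSam Roe <sam@example.org>\n"
print(authors_from_shortlog(sample), end="")
```

A commit history only lists people who committed, so contributors who supplied data, ideas, or reviews still need to be added by hand.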
Make your GitHub repository citable (DOI)
Your GitHub repository contains your scientific workflow, your programs/software, datasets (or links to your datasets), and Jupyter dashboards, so it is important to make the work you share on GitHub citable by archiving your GitHub repository to get a DOI. Your university may have a data archive, or you may use the data archiving tool Zenodo.
Login to Zenodo
- Go to https://zenodo.org/ and click on Log in (not Sign up)
- Choose Log in with GitHub
- Zenodo will redirect you back to GitHub and ask you to give Zenodo the permissions it needs. Click Authorize Application.
Source of the original screenshot: https://guides.github.com/activities/citable-code/zenodo-authorize.png
Get a DOI for your GitHub repository
- Once you have successfully logged in to Zenodo, click on your username (top right) and select GitHub
Then
- Select the repository sharing-github and flip the switch to on
- Create a release on GitHub
- Then go to your GitHub repository, click on Settings, and select Webhooks
Your GitHub repository is now linked to Zenodo and you will automatically get a DOI.
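The "Create a release" step above is normally done in the GitHub web interface, but it can also be scripted through the GitHub REST API (POST /repos/{owner}/{repo}/releases). The sketch below only builds the request. The owner `octocat`, the tag `v1.0.0`, and the token are placeholders, and the function name is hypothetical.

```python
import json
import urllib.request


def release_request(owner: str, repo: str, tag: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that creates a GitHub release."""
    body = json.dumps({"tag_name": tag, "name": tag}).encode("utf-8")
    return urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/releases",
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


# Placeholder values; sending would be urllib.request.urlopen(req).
req = release_request("octocat", "sharing-github", "v1.0.0", "YOUR_TOKEN")
print(req.get_method(), req.full_url)
```

Once the repository switch is on in Zenodo, each new release, however it is created, should trigger a new archived version.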
Add your DOI to your GitHub repository
- Get your DOI badge on Zenodo and copy your DOI information (select the Markdown version)
- Go to your GitHub repository and edit your README file to add your DOI
- Create a new file CITATION in your GitHub repository and show how to cite your repository with your DOI
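As a rough illustration of the two text snippets involved, the sketch below generates a Markdown badge line for the README and a plain-text body for the CITATION file. The badge URL follows the pattern Zenodo has used for its badges, but copy the snippet Zenodo gives you rather than relying on it. The CITATION layout is only one possible convention, and the DOI, author names, and title are placeholders.

```python
def doi_badge_markdown(doi: str) -> str:
    """Markdown for a DOI badge that links to the DOI resolver."""
    return f"[![DOI](https://zenodo.org/badge/DOI/{doi}.svg)](https://doi.org/{doi})"


def citation_text(authors, year, title, version, doi) -> str:
    """Plain-text body for a CITATION file (layout is an assumption)."""
    names = ", ".join(authors)
    return (
        "To cite this repository, please use:\n\n"
        f"{names} ({year}). {title} (version {version}). "
        f"Zenodo. https://doi.org/{doi}\n"
    )


# Placeholder DOI and metadata; replace with the values Zenodo assigned.
doi = "10.5281/zenodo.0000000"
print(doi_badge_markdown(doi))
print(citation_text(["Jane Doe", "Sam Roe"], 2024, "sharing-github", "1.0.0", doi))
```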
Limitations
Sharing your GitHub repository along with your Jupyter notebooks/dashboards is an important step for making your research reproducible. However, anyone wanting to rerun your programs/dashboards/notebooks needs to have the same computational environment (Python, additional Python packages, etc.).
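A small step toward describing that environment is to record the exact versions of the Python packages you used. The sketch below lists the packages installed in the current environment as `name==version` lines, similar in spirit to a requirements file. The function name is hypothetical, and the result does not capture the Python version itself or any non-Python dependencies.

```python
from importlib import metadata


def pinned_requirements() -> list[str]:
    """Return 'name==version' lines for packages in the current environment."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken or missing metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


# Print the pins; redirect the output to a file to keep it with the repository.
print("\n".join(pinned_requirements()))
```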
The next section (using Binder) will show you how to make your research fully reproducible by offering users the same computational environment with no effort on their part.
Go to MyBinder.org (https://mybinder.org/) to learn more about how to create an interactive computing environment from your GitHub repository.
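For orientation, a Binder launch link for a GitHub repository has the form https://mybinder.org/v2/gh/<owner>/<repository>/<ref>, where the ref is a branch, tag, or commit. The sketch below builds such a link. The owner `octocat` and the branch `main` are placeholders, and the function name is hypothetical.

```python
from urllib.parse import quote


def binder_url(owner: str, repo: str, ref: str) -> str:
    """Build a mybinder.org launch link for a GitHub repository."""
    # Percent-encode the ref so that names containing "/" stay one path segment.
    return f"https://mybinder.org/v2/gh/{owner}/{repo}/{quote(ref, safe='')}"


# Placeholder owner and branch; the repository name is the one used above.
print(binder_url("octocat", "sharing-github", "main"))
```

Pointing the link at a release tag rather than a branch keeps it tied to the same version that was archived under the DOI.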