Introduction
This lesson is inspired by and derived from Reproducible Research using Jupyter Notebooks and Reproducible Research with Interactive Jupyter Dashboards, and is released under the CC Zero public domain waiver.
What is reproducible research?
By “Reproducible Research” we mean the ability of anyone, anywhere, to repeat the same research “experiment”. Here the word “experiment” is broader than data acquisition or running a simulation: it covers the whole set of steps needed to achieve your research results.
To make your research reproducible you need:
- to document the steps of your research project, and make them easy to understand
- to publish these steps and materials along with data used for your research project
Why is it important to be reproducible?
Being able to reproduce results is becoming increasingly important to:
- confirm that results are indeed correct
- allow scientists (including students and PhDs working in your team) to build on previous work and create new scientific breakthroughs faster
- use the same scientific workflow with the same or different datasets and confirm/contradict existing results.
What kind of problems can you face if your research is not reproducible?
Science retracts paper without agreement of lead author.
- Journal retracted a study of how canvassers can sway people’s opinions about gay marriage.
- Editor-in-Chief: Original survey data not made available for independent reproduction of results
- Survey incentives misrepresented
- Sponsorship statement false
- Two grad students attempted to reproduce the study and could not.
- They concluded the data must have been fabricated (538 story).
Key Point
Methods we’ll discuss today can’t prevent fraud, but they can make it easier to discover such issues.
Retracted, but not fraud
- One researcher had seven papers retracted because of irreproducibility (Retraction Watch)
- Couldn’t find the data (Wiley)
- “Extensive” errors force retraction of lymphoma paper (JRO)
- Many, many more Irreproducible examples
Seizure study retracted after authors realized data were “Terribly mixed”
- From the authors of Low Dose Lidocaine for Refractory Seizures in Preterm Neonates: “The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness.” (IJP)
Bad spreadsheet merge kills depression paper, quick fix resurrects it
- The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes.
- Original conclusion: Lower levels of CSF IL-6 were associated with current depression and with future depression.
- Revised conclusion: Higher levels of CSF IL-6 and IL-8 were associated with current depression.
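A merge error of this kind is easier to catch when the merge is done in code that checks the identification codes before joining. The following is only a minimal sketch using plain Python dictionaries; the function name `merge_by_id`, the column name `participant_id`, and the toy values are hypothetical and are not taken from the retracted study.

```python
def merge_by_id(lab_rows, survey_rows, key="participant_id"):
    """Join two lists of dicts on `key`, refusing duplicate or unmatched IDs."""

    def index(rows, label):
        # Build an ID -> row lookup and fail loudly on repeated IDs.
        indexed = {}
        for row in rows:
            row_id = row[key]
            if row_id in indexed:
                raise ValueError(f"duplicate {key} {row_id!r} in {label}")
            indexed[row_id] = row
        return indexed

    lab = index(lab_rows, "lab results")
    survey = index(survey_rows, "survey data")

    # IDs that appear in only one of the two tables signal a broken merge.
    unmatched = sorted(set(lab) ^ set(survey))
    if unmatched:
        raise ValueError(f"IDs present in only one table: {unmatched}")

    # Combine the two rows that share each ID, in a stable order.
    return [{**survey[row_id], **lab[row_id]} for row_id in sorted(lab)]


# Hypothetical toy data, deliberately listed in different orders.
lab_rows = [
    {"participant_id": "P01", "il6": 1.2},
    {"participant_id": "P02", "il6": 0.8},
]
survey_rows = [
    {"participant_id": "P02", "depressed": False},
    {"participant_id": "P01", "depressed": True},
]

merged = merge_by_id(lab_rows, survey_rows)
```

Because rows are matched on the identifier rather than on their position, the different row order in the two tables does not matter, and a missing or repeated identifier stops the merge instead of silently producing wrong pairs.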
LIGO (Laser Interferometer Gravitational-Wave Observatory)
- All data are publicly available free of charge.
- Jupyter Notebooks running Python are produced for each publication. These notebooks allow full reproducibility: all analyses and figures can be recreated.
- In-depth tutorials are produced using Jupyter Notebooks and Python.
How can we make our research more reproducible?
One way to make our research more reproducible is to share:
- Well-documented and fully open code.
- Data used to produce our results.
- The details of our scientific workflow, allowing researchers to easily reproduce our research results step by step and go beyond the state of the art without re-inventing the wheel.
- Information on how to cite our research, so that we get valuable recognition for our work.
These are rather vague statements and we need tools to help overcome the technical challenges of doing reproducible research. The most important requirements these tools need to fulfill are:
- Collaborative authoring (scientific publications are rarely written by one person only)
- Math support to write complex formulas
- Multiple output formats: one document that can be produced on demand in several formats (Jupyter notebook, HTML, LaTeX, PDF, etc.) with no effort
- Linking data (whether they are archived in Zenodo or another Data Archive)
- Linking to software (for instance user programs stored on GitHub, pre-compiled open-source programs, or containers; containers are a topic for another workshop)
- Linking to people (social media such as ORCID, ResearchGate, LinkedIn, etc.)
- Receive credit for my research (citation)
- Commenting and possibility to add additional information (additional plots, video, sound, programs, explanation, etc.)
- Interactive figures allowing other researchers to explore my data.
- Possibility to easily archive my research work.
The purpose of this workshop is to show you why Git and GitHub, combined with Jupyter notebooks and Binder, are tools that fulfill most of these needs. With them you can state that, at a given point in time, this combination of data and code produced these results, and record what you did and why.
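The idea of tying results to a specific combination of data and code can be sketched with checksums. This is only an illustration using the Python standard library, not the workflow taught in this workshop; the function names `sha256_of` and `provenance_record`, the field names, and the sample inputs are all hypothetical. In practice a Git commit identifier would typically play the role of the code checksum.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of some bytes."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(data: bytes, code: str, note: str) -> dict:
    """Describe which data and code were used, when, and why."""
    return {
        "data_sha256": sha256_of(data),
        "code_sha256": sha256_of(code.encode("utf-8")),
        "python_version": platform.python_version(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "note": note,
    }


# Hypothetical inputs standing in for a real data file and analysis script.
data = b"sample,value\nA,1\nB,2\n"
code = "result = sum(values)"

record = provenance_record(data, code, "Summed the value column")
print(json.dumps(record, indent=2))
```

If either the data or the code changes, the corresponding checksum changes too, so a stored record makes it possible to check later whether a result was produced from the same inputs.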