Mark Woodbridge | Jupyter Notebooks and reproducible data science

lterrat's bookmarks 2017-03-09

Summary:

"One of the ideas pitched by Daniel Mietchen at the London Open Research Data do-a-thon for Open Data Day 2017 was to analyse Jupyter Notebooks mentioned in PubMed Central. This is potentially valuable exercise because these notebooks are an increasingly popular tool for documenting data science workflows used in research, and therefore play an important role in making the relevant analyses replicable. Daniel Sanz (DS) and I spent some time exploring this with help from Daniel (DM). We’re reasonably experienced Python developers, but definitely not Jupyter experts.

[...]

This whole exercise was hugely valuable in improving our understanding of Jupyter and of the challenges facing notebook authors. Based on our experiences, and acknowledging that non-trivial analyses will depend on potentially large numbers of dependencies and amounts of source data, we recommend taking the following approach to publishing notebooks:

  • If a notebook depends on external files, link to repositories rather than directly to rendered notebooks.
  • Obtain a DOI for your repository, and use this link consistently. Don’t variously link to the repository (e.g. on GitHub), the notebook (e.g. on nbviewer) and any published container (e.g. on Docker Hub).
  • Limit the use of “shelling out” from notebooks: wherever practicable, err on the side of not making assumptions about the runtime platform.
  • Take a judicious approach to bundling dependencies versus retrieving them on demand.
  • Provide a container definition, and ideally publish the container itself, with clear instructions for its execution.
  • Ensure that containers can be rebuilt: this is key for end-to-end reproducibility.
  • If you require a large amount of data, make it clear where it should be retrieved from, and provide clear instructions on how to mount it from the local drive into the container (so that it is effectively cached on the host system).
  • Double-check that your notebook/container can be run on a clean system.
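
As a rough illustration of the “limit shelling out” point above, a notebook’s JSON can be scanned for platform-dependent shell escapes before publication. This is a minimal sketch, not part of the original post: it assumes the nbformat v4 cell layout (a top-level "cells" list with "cell_type" and "source" fields), and the example notebook, including the `!wget` line, is hypothetical.

```python
import json  # .ipynb files are plain JSON; use json.load() to read one from disk

def find_shell_outs(nb):
    """Return (cell_index, line) pairs for code-cell lines that shell out
    via IPython's '!' syntax or the %%bash / %%sh cell magics."""
    hits = []
    for i, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        src = cell.get("source", "")
        if isinstance(src, list):          # source may be a list of lines or one string
            src = "".join(src)
        for line in src.splitlines():
            stripped = line.lstrip()
            if stripped.startswith("!") or stripped.startswith(("%%bash", "%%sh")):
                hits.append((i, stripped))
    return hits

# Example: a tiny in-memory notebook in the v4 layout (hypothetical content).
nb = {
    "cells": [
        {"cell_type": "code", "source": ["import pandas as pd\n"]},
        {"cell_type": "code", "source": ["!wget http://example.org/data.csv\n"]},
        {"cell_type": "markdown", "source": ["# Notes\n"]},
    ]
}
print(find_shell_outs(nb))  # -> [(1, '!wget http://example.org/data.csv')]
```

Each hit is a candidate for replacement with a portable alternative (e.g. fetching data with a Python library, or moving the step into the container build).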

[...]

Conclusions

  • Technologies such as Jupyter and Docker present great opportunities to make digital research more reproducible, and authors who adopt them should be applauded.
  • These technologies do not, by themselves, ensure replicability: we successfully executed only one of the ~25 notebooks that we downloaded.
  • Care must be taken when creating and publishing the resultant assets (notebooks, containers etc). Publishers could assist with this, and with validation. Reviewers should also take verification into account."

Link:

https://markwoodbridge.com/2017/03/05/jupyter-reproducible-science.html

From feeds:

Open Access Tracking Project (OATP) » lterrat's bookmarks

Tags:

oa.repositories

Date tagged:

03/09/2017, 19:35

Date published:

03/09/2017, 14:35