Restoring reproducibility of Jupyter notebooks

Jiawei Wang, Tzu Yang Kuo, Li Li, Andreas Zeller

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Other › peer-review

4 Citations (Scopus)

Abstract

Jupyter notebooks (documents that contain live code, equations, visualizations, and narrative text) are now among the most popular means to compute, present, discuss and disseminate scientific findings. In principle, Jupyter notebooks should make it easy to reproduce and extend scientific computations and their findings; in practice, this is not the case. The individual code cells in Jupyter notebooks can be executed in any order, with identifier usages preceding their definitions and results preceding their computations. In a sample of 936 published notebooks that would be executable in principle, we found that 73% of them would not be reproducible with straightforward approaches, requiring humans to infer (and often guess) the order in which the authors created the cells. In this paper, we present an approach to (1) automatically satisfy dependencies between code cells to reconstruct possible execution orders of the cells; and (2) instrument code cells to mitigate the impact of non-reproducible statements (i.e., random functions) in Jupyter notebooks. Our Osiris prototype takes a notebook as input and outputs the possible execution schemes that reproduce the exact notebook results. In our sample, Osiris was able to reconstruct such schemes for 82.23% of all executable notebooks, which is more than three times better than the state of the art; the resulting reordered code is valid program code and thus available for further testing and analysis.
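The two ideas named in the abstract, ordering cells so that definitions precede uses and neutralising randomness before re-execution, can be illustrated with a small sketch. This is not the Osiris implementation: the function names (`names_defined_and_used`, `reorder_cells`, `run_in_order`) are hypothetical, the analysis ignores scoping and reassignment, and seeding `random` is only one assumed way to handle random functions. It requires Python 3.9+ for `graphlib`.

```python
import ast
import builtins
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def names_defined_and_used(source):
    """Return (defined, used) names for one cell; a crude, scope-blind analysis."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            (defined if isinstance(node.ctx, ast.Store) else used).add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)  # function parameters are local, not external uses
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
    # Simplification: a name both defined and used in a cell counts as local.
    return defined, used - defined - set(dir(builtins))


def reorder_cells(cells):
    """Return cell indices ordered so that each name is defined before it is used."""
    info = [names_defined_and_used(src) for src in cells]
    definer = {}
    for idx, (defined, _) in enumerate(info):
        for name in defined:
            definer.setdefault(name, idx)  # assumption: the first defining cell wins
    graph = {idx: set() for idx in range(len(cells))}
    for idx, (_, used) in enumerate(info):
        for name in used:
            if name in definer and definer[name] != idx:
                graph[idx].add(definer[name])  # idx must run after its definer
    # Raises graphlib.CycleError if the cells depend on each other cyclically.
    return list(TopologicalSorter(graph).static_order())


def run_in_order(cells, order, seed=0):
    """Execute cells in one shared namespace after seeding `random`."""
    namespace = {}
    exec(f"import random\nrandom.seed({seed!r})", namespace)
    for idx in order:
        exec(cells[idx], namespace)
    return namespace


# Cells stored out of order, as they might appear in a saved notebook.
demo_cells = ["print(total)", "total = sum(values)", "values = [1, 2, 3]"]
print(reorder_cells(demo_cells))  # [2, 1, 0]
```

A real tool would also have to handle names redefined across cells, several candidate orders, and other sources of non-determinism such as NumPy's generators, none of which this sketch attempts.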

Original language: English
Title of host publication: Proceedings - 2020 ACM/IEEE 42nd International Conference on Software Engineering
Subtitle of host publication: Companion Proceedings, ICSE-Companion 2020
Editors: Hyunsook Do, Tien N. Nguyen
Place of Publication: New York NY USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 288-289
Number of pages: 2
ISBN (Electronic): 9781450371223
Publication status: Published - 2020
Event: International Conference on Software Engineering 2020 - Online, Seoul, Korea, Republic of (South)
Duration: 27 Jun 2020 to 19 Jul 2020
Conference number: 42nd
https://dl.acm.org/doi/proceedings/10.1145/3377811 (Proceedings)
https://conf.researchr.org/home/icse-2020 (Website)

Publication series

Name: Proceedings - International Conference on Software Engineering
Publisher: The Association for Computing Machinery
ISSN (Print): 0270-5257

Conference

Conference: International Conference on Software Engineering 2020
Abbreviated title: ICSE 2020
Country/Territory: Korea, Republic of (South)
City: Seoul
Period: 27/06/20 to 19/07/20

Keywords

  • Jupyter Notebooks
  • Osiris
  • Python
  • Reproducibility
