Creating a Reproducible Example

R-bloggers 2022-05-31

[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers].

Maintaining training materials

Over the last few years, we have increased both the number and types of training courses we offer. In addition to our usual R courses in {dplyr} and {shiny}, we also offer training on Docker, Python, Stan, TensorFlow, and others.

As the number of courses we offer increased, so did the maintenance burden of our associated training materials (lecture notes, slides, exercises, and more). To ease this burden, and to help ensure that our training materials build consistently, we developed an R package called {jrNotes2}. Amongst other things, this package ensures that all courses:

  • have identical “template files”: .gitlab-ci.yml, .gitignore, Makefiles, index.Rmd, …;
  • have the same directory structure, and
  • pass a set of quality-assurance checks.

To make a change to course content, a team member must push their suggestions to a branch on GitLab. This action launches a CI job, which runs a Docker container that performs a set of checks. The templated .gitlab-ci.yml file ensures that every course undergoes the same build process and quality-assurance checks. If the content passes these checks, and an eligible approver approves the changes, then the changes are merged into the main branch.

Cartoon showing arrows from Data scientist to GitLab to Docker container to Continuous Integration

This means course content in a main branch should never fail our checks. Well, not quite…

Why we can’t freeze all dependencies

When teaching a course, we want to teach with the exact same packages an attendee would get via an install.packages() or pip install command. This means we must always use the latest versions of packages available on CRAN and PyPI. However, always using the latest available packages has its dangers: a change to a package used by a course can suddenly cause our teaching materials to begin failing our build checks.

To try and pre-empt package changes breaking our training materials, we use scheduled CI runs. That is, at regular intervals a CI job automatically runs our tests and checks against a course’s training materials. If a course’s materials fail these checks, we are notified via a message in a Slack channel. Around early January, we started getting notifications about our Introduction to Python course:

Screenshot of slack notification showing the failed pipeline, where failed job is notes-build.


The problem

Unfortunately, the traceback given by the CI wasn’t the most enlightening:

segfault traceback screenshot

Strangely, the course materials

  • built successfully on Colin’s laptop;
  • failed to build on Jack’s laptop, and
  • failed to build on the CI runner.

As far as we could see, everything appeared roughly the same on all three systems: each ran the same operating system, the same R version, and the same package versions.

Whilst we could reproduce the error in a Docker container, the error was difficult to debug as

  • the container used a large number of internal Jumping Rivers R packages;
  • the materials build process involved a set of non-trivial Rmd files, and
  • the error wasn’t encountered until around eight minutes into the build and test process.

In short, whilst we had a reproducible example of the error, it was only reproducible by a Jumping Rivers employee, and it was far from a minimal example.

Simplifying the problem

To make progress, we had to simplify the Docker container. We asked ourselves the following questions:

  • Can we remove all unnecessary files, such as presentation slides? Yes.
  • Can we simplify the course notes? Yes: we were able to find a single Python code chunk that caused the issue.
  • Can we remove all of our custom Rmd styling? Yes: a simpler Rmd file with the same chunk gave the same error.
  • Can we reproduce the issue without R Markdown? Yes: a simple R script can reproduce the same error.
  • Does the Dockerfile need to be complex? No: we can remove most of the unnecessary Python, Debian and R related packages.
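
The search for a single offending code chunk can be sketched in Python as follows. This is a minimal illustration, not the tooling used for this post: the function name first_failing_chunk and the example chunks are hypothetical, and the sketch assumes each chunk can run on its own without state left behind by earlier chunks.

```python
import subprocess
import sys


def first_failing_chunk(chunks):
    """Run each chunk in a fresh interpreter.

    Returns (index, returncode) for the first chunk that exits with a
    non-zero status, or None if every chunk succeeds.
    """
    for index, chunk in enumerate(chunks):
        # A separate process per chunk means a hard crash in one chunk
        # cannot take the harness itself down.
        result = subprocess.run(
            [sys.executable, "-c", chunk],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            return index, result.returncode
    return None


if __name__ == "__main__":
    # Hypothetical chunks; the second one stands in for the crashing code.
    example_chunks = [
        "x = 1 + 1",
        "import sys; sys.exit(3)",
        "print('later chunk')",
    ]
    print(first_failing_chunk(example_chunks))
```

On POSIX systems, subprocess reports a process killed by a signal as a negative return code, so a segfault would typically show up as the negated SIGSEGV number instead of an ordinary exit status.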

A minimal reproducible example

After all of our simplifications, we arrived at a minimal reproducible example with the Dockerfile:

FROM rocker/r-ver:latest
RUN apt update && apt install -y python3 python3-dev python3-venv
RUN install2.r --error reticulate
COPY test.R /root/

and associated R script:

reticulate::virtualenv_create(
  envname = "./venv",
  packages = "matplotlib"
)
reticulate::use_virtualenv("./venv")
reticulate::py_run_string("import matplotlib.pyplot as plt; plt.plot([1, 2, 3], [1, 2, 3])")

By simplifying the problem, we were now in a position to ask for help from others.

As this appeared to be a bug (it used to work, but now it doesn’t), we raised an issue against the {reticulate} repository.

A (partial) solution

Soon after posting, we received a response from one of the {reticulate} developers. Their response revealed that matplotlib was nothing but an innocent bystander in our issue, and that the real culprits were the incompatible BLAS (Basic Linear Algebra Subprograms) libraries being used by R and numpy!
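
One way to see which BLAS builds end up loaded in a single process is to scan the process memory map. The sketch below is only an illustration and is not how the {reticulate} developers diagnosed the issue: it assumes a Linux system with /proc/self/maps, the function name blas_libraries is hypothetical, and the name pattern is a heuristic rather than an exhaustive list.

```python
import os
import re

# Heuristic fragments found in common BLAS/LAPACK library file names.
# This list is an assumption and may miss some builds.
BLAS_PATTERN = re.compile(r"blas|lapack|mkl|atlas", re.IGNORECASE)


def blas_libraries(maps_text):
    """Return sorted shared-library paths in maps_text that look BLAS-related.

    maps_text is expected in the /proc/<pid>/maps layout:
    address perms offset dev inode pathname
    """
    found = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5].strip()
        name = os.path.basename(path)
        if ".so" in name and BLAS_PATTERN.search(name):
            found.add(path)
    return sorted(found)


if __name__ == "__main__":
    maps_file = "/proc/self/maps"
    if os.path.exists(maps_file):
        with open(maps_file) as handle:
            for library in blas_libraries(handle.read()):
                print(library)
```

Run in a process that has loaded both R and numpy, more than one distinct BLAS path in the output would be a hint that two builds are sharing the same address space.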

The suggested solution was to compile the numpy package from source within Docker. However, compiling numpy at container runtime added around 3 minutes to the CI checks every time they ran. As such, we opted to build the numpy package from source at image build-time, effectively caching the package build, and avoiding re-compiling numpy every time our build tests ran against our training materials.
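
As a rough sketch of what a source build involves, pip can be told to skip prebuilt wheels for a single package with its --no-binary option. The helper below is hypothetical and is not the exact command used in our image: it only assembles the command, and actually running it assumes a compiler toolchain and the Python headers are available in the environment.

```python
import subprocess
import sys


def source_install_command(package, python=sys.executable):
    """Build a pip command that refuses prebuilt wheels for `package` only."""
    return [
        python, "-m", "pip", "install",
        "--no-binary", package,  # force a source build for this package
        "--force-reinstall",     # replace a wheel-based copy if one is present
        package,
    ]


if __name__ == "__main__":
    command = source_install_command("numpy")
    print(" ".join(command))
    # To perform the build for real (slow, and needs a compiler):
    # subprocess.run(command, check=True)
```

Running such a command in a Dockerfile RUN step, rather than when the container starts, is what moves the compile cost to image build-time.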

Although compiling numpy from source did fix our issue, it currently presents as more of a workaround than a long-term solution. Hopefully, a future change to the BLAS libraries used by the rocker image series or numpy can allow the two to be friends again. Here’s to hoping!

Take-aways

  • Using scheduled CI jobs allowed us to catch this issue early, and gave us plenty of time to fix it before the next time the course ran.

  • Having a CI ensured we had an (internally) reproducible example, as the CI is based on a Docker container.

  • In order to get help, it was crucial to simplify the problem.

  • Debugging is hard, and it’s okay to ask for help!
