Offload Shiny’s Workload: COVID-19 processing for the WHO/Europe

R-bloggers 2022-06-21


At Jumping Rivers, we have a wealth of experience developing and maintaining Shiny applications. Over the past year, we have been maintaining a Shiny application for the World Health Organization Europe (WHO/Europe) that presents data about COVID-19 vaccination uptake across Europe.

The great strength of Shiny is that it simplifies the production of data-focused web applications, making it relatively easy to present data to users/clients in an interactive way. However, data can be big, and data processing can be complex, time-consuming and memory-hungry. So if you bake an entire data pipeline into a Shiny application, you may end up with an application that is costly to host and doesn't provide the best user experience (slow to load, prone to crashing).

One of the best tips for ensuring your application runs smoothly is simple:

Do as little as possible.

That is … make sure your application does as little as possible.

The data upon which the application is based comes from several sources, across multiple countries, is frequently updated, and is constantly evolving. When we joined this project, the integration of these datasets was performed by the application itself. This meant that when a user opened the app, multiple large datasets were downloaded, cleaned up, and combined together (a process that might take several minutes) before the user could see the first table.


Do you require help building a Shiny app? Would you like someone to take over the maintenance burden? If so, check out our Shiny and Dash services.


Do as little processing as possible

A simple data-driven app may look as follows: it downloads some data, processes that data, and then presents a subset of the raw and processed data to the user.

Data is downloaded, processed, then presented by the app in both forms
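
As a deliberately naive sketch of that structure (the data URL, column names and `process_data()` step below are made-up placeholders, not the WHO/Europe app's code):

```r
library(shiny)

# Hypothetical helpers: fetch raw data over the internet, then process it.
# The URL and the processing step are placeholders.
download_data <- function() {
  read.csv("https://example.org/raw-data.csv")
}

process_data <- function(raw) {
  aggregate(count ~ country, data = raw, FUN = sum)
}

ui <- fluidPage(
  tableOutput("raw"),
  tableOutput("processed")
)

server <- function(input, output, session) {
  # Both steps run at the start of every user session.
  raw <- download_data()
  processed <- process_data(raw)

  output$raw <- renderTable(head(raw))
  output$processed <- renderTable(processed)
}

shinyApp(ui, server)
```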

The initial data processing steps may make the app very slow and, if it is really sluggish, may mean that users close the app before it fully loads.

Since the data processing pipeline is encoded in the app, a simple way to improve speed is for the app to cache any processed data. With the cached, processed data in place, for most users the app would only need to download or import the raw and the processed data, alleviating the need for any data processing while the app is running. But suppose the raw data had been updated. Then, when the next user opens the app, the data-processing and uploading steps would run. Though that user would have a poor experience, most users wouldn't.
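
A minimal sketch of that caching idea, reusing the hypothetical `process_data()` from above; the file paths and the timestamp comparison are assumptions, not the app's actual mechanism:

```r
# Return the processed data, rebuilding the cache only when the raw
# data file is newer than the cached copy.
get_processed_data <- function(raw_path = "data/raw.csv",
                               cache_path = "data/processed.rds") {
  cache_stale <- !file.exists(cache_path) ||
    file.mtime(raw_path) > file.mtime(cache_path)
  if (cache_stale) {
    # Slow path: only the first session after a raw-data update pays this.
    saveRDS(process_data(read.csv(raw_path)), cache_path)
  }
  readRDS(cache_path)
}
```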

For some users, data-processing will occur before the app loads

This was the structure of the WHO/Europe COVID-19 vaccination programme monitoring application before we started working with it. Raw and processed data were stored on an Azure server and the app ensured that the processed data was kept in sync with any updates to the raw data. The whole data pipeline was only run a few times a week, because the raw datasets were updated on a weekly basis. The load time for a typical user was approximately 1 minute, whereas the first user after the raw data had been updated might have to wait 3 or 4 minutes for the app to load.

Transfer as little data as possible

Data is slow. So if you need lots of it, keep it close to you, and make sure you only access the bits that you need.

There is a hierarchy of data speeds. For an app running on a server, data access is fastest when the data is stored in memory, slower when it is stored on the hard drive, and much, much slower when it is accessed via the internet. So, where possible, you should aim to store the data that is used within an app on the server(s) from which the app is deployed.

With Shiny apps, it is possible to bundle datasets alongside the source code, such that wherever the app is deployed, those datasets are available. A drawback of coupling the source code and data in this way is that the data would need to be kept in version control along with your source code, and a new deployment of the app would be required whenever the data is updated. So for datasets that are frequently updated (as for the vaccination counts that underpin the WHO/Europe app), this is impractical. But storing datasets alongside the source code (or in a separate R package that is installed on the server) may be valuable if those datasets are unlikely to change during the lifetime of a project.
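
For illustration, bundled data might be read either from a file shipped in the app directory or from an installed data package; the package and file names here are invented:

```r
# Option 1: a file shipped in the app directory, deployed with the code.
country_lookup <- readRDS("data/country_lookup.rds")

# Option 2: a dataset installed on the server via a helper R package,
# located at runtime with system.file().
path <- system.file("extdata", "country_lookup.rds", package = "whoAppData")
country_lookup <- readRDS(path)
```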

For datasets that are large, or are frequently updated, cloud storage may be the best solution. This allows collaborators to upload new data on an ad hoc basis without touching the app itself. The app would then download data from the cloud for presentation during each user session.
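
Since this app's data lives in Azure blob storage, each session might fetch files with the {AzureStor} package, along these lines (a sketch with made-up account, container and file names, not the project's actual code):

```r
library(AzureStor)

# Authenticate with a SAS token held in an environment variable,
# rather than hard-coding credentials.
endpoint <- storage_endpoint(
  "https://exampleaccount.blob.core.windows.net",
  sas = Sys.getenv("AZURE_SAS_TOKEN")
)
container <- storage_container(endpoint, "vaccination-data")

# At the start of a session, fetch only the processed dataset.
storage_download(container,
                 src = "processed/uptake.parquet",
                 dest = "uptake.parquet",
                 overwrite = TRUE)
```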

That solution might sound mildly inefficient. For each user that opens the app, the same datasets are downloaded (likely, onto the same server). How can we make this process more efficient? There are some rather technical tips that might help, like using efficient file formats to store large datasets, caching the data for the app's landing page, or using asynchronous computing to initiate downloading the data while presenting a less data-intensive landing page.
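
As an illustration of the first two tips, the sketch below reads a parquet file with {arrow} and memoises a rendered table with Shiny's `bindCache()`; the file and column names are assumptions:

```r
library(shiny)
library(arrow)

ui <- fluidPage(
  selectInput("country", "Country", choices = NULL),
  tableOutput("summary")
)

server <- function(input, output, session) {
  # Parquet is compact, fast to read, and lets us pull only the
  # columns the app actually presents.
  uptake <- read_parquet("uptake.parquet", col_select = c(country, doses))
  updateSelectInput(session, "country", choices = unique(uptake$country))

  # bindCache() memoises the rendered table per country, so repeat
  # requests for the same view skip the computation entirely.
  output$summary <- renderTable({
    uptake[uptake$country == input$country, ]
  }) |> bindCache(input$country)
}

shinyApp(ui, server)
```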

A somewhat less technical solution is to identify precisely which datasets are needed by the app and only download them.

Imagine the raw datasets could be partitioned into:

  1. those that are only required when constructing a (possibly smaller) processed dataset that is presented by the app; and
  2. those that are actually presented by the app.

If this can be done, there won't be any difference when the app runs the whole data processing pipeline: both sets of raw data would still be downloaded, and the processed data would be uploaded to the cloud. But for most users, the app would only download the processed datasets and the second set of raw datasets, as sketched below.
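
In code, the partition might look something like this; the file names are invented and `download_file()` stands in for whichever storage client the app uses:

```r
# Hypothetical file names; download_file() is a stand-in helper.
pipeline_only <- c("raw/admin_records.csv", "raw/populations.csv")
presented_raw <- c("raw/country_notes.csv")
processed     <- c("processed/uptake.parquet")

# The pipeline run needs every raw dataset.
fetch_for_pipeline <- function() {
  lapply(c(pipeline_only, presented_raw), download_file)
}

# A user session needs only the data that is actually shown.
fetch_for_session <- function() {
  lapply(c(presented_raw, processed), download_file)
}
```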

Only one set of data is downloaded by most users

In the COVID-19 vaccination programme monitoring app, evolving to this state meant that two large files (~50MB in total) were no longer downloaded per user session.

Do as little processing as possible … in the app

In the above, we showed some steps that should reduce the amount of processing and data transfer for a typical user of the app. With those changes, the data processing pipeline was still inside the app. This is undesirable. For some users, the whole data processing pipeline will run during their session, which makes for a poor user experience. But it also means that some user sessions require considerably more memory and data transfer than others. If the data pipeline could run outside of the app, these issues would be eased.

If we move the data pipeline outside of the app, where should we move it? It is possible to run processing scripts in a few places. For this project, we chose to run the data processing pipeline on GitHub on a daily schedule, as part of a GitHub Actions workflow. This was simply because the source code is hosted there. GitLab, Bitbucket, Azure and many other providers can run scripts in a similar way.
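
A scheduled workflow of that kind might look roughly like the following YAML (an illustrative sketch, not the project's actual workflow; the script path and secret name are made up):

```yaml
name: process-data
on:
  schedule:
    - cron: "0 5 * * *"   # run daily at 05:00 UTC
  workflow_dispatch:       # allow manual re-runs

jobs:
  pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - name: Run the data processing pipeline
        run: Rscript pipeline/process.R
        env:
          AZURE_SAS_TOKEN: ${{ secrets.AZURE_SAS_TOKEN }}
```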

So now, the data used within the app is processed on GitHub and uploaded to Azure before it is needed by the app.

Data is processed on GitHub and uploaded to Azure before any user session starts

The combination of these changes meant that the WHO/Europe COVID-19 vaccination programme monitoring app, which previously took ~1 minute (and occasionally ~4 minutes) to load, now takes a matter of seconds.

What complexities might this introduce?

In the simplest app presented here, all data processing was performed whenever a new user session was started. The changes described have made the app easier to use (from the user's perspective) but mean that, for the developers, coordination between the different components must be managed.

For example, if the data team upload some new raw data, there should be a mechanism to ensure that that data gets processed and incorporated into the app in a timely manner. If the source code for the app or the data processing pipeline changes, the data processing pipeline should run afresh. And if changes to the structure of the raw dataset mean that the data processing pipeline produces malformed processed data, there should be a way to log that.
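
For that last point, one option is for the pipeline to validate its output before uploading, failing loudly (and so surfacing in the workflow logs) when the structure has drifted. A minimal sketch, with assumed column names:

```r
# Fail the pipeline run, rather than uploading malformed data, when the
# processed dataset no longer matches the structure the app expects.
validate_processed <- function(processed) {
  required_cols <- c("country", "date", "doses")   # assumed column names
  missing <- setdiff(required_cols, names(processed))
  if (length(missing) > 0) {
    stop("Processed data is missing columns: ", toString(missing))
  }
  if (anyNA(processed$doses) || any(processed$doses < 0)) {
    stop("Processed data contains missing or negative dose counts")
  }
  invisible(processed)
}
```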

Summary

Working with the WHO/Europe COVID-19 vaccination programme monitoring app has posed several challenges. The data upon which it is based is constantly updated and has been restructured several times, consistent with the challenges that the international community has faced. Here we've outlined some steps that we followed to ensure that the data underpinning this COVID-19 vaccination programme monitoring app is presented to the community in an up-to-date and easy-to-access way. To do this, we streamlined the app (making it do as little as possible when a user is viewing it) by downloading only what it needs, and by removing any extensive data processing.

