Polymath proposal: clearinghouse for crowdsourcing COVID-19 data and data cleaning requests
What's new 2020-03-25
After some discussion with the applied math research groups here at UCLA (in particular the groups led by Andrea Bertozzi and Deanna Needell), one of the members of these groups, Chris Strohmeier, has produced a proposal for a Polymath project to crowdsource in a single repository (a) a collection of public data sets relating to the COVID-19 pandemic, (b) requests for such data sets, (c) requests for data cleaning of such sets, and (d) submissions of cleaned data sets. (The proposal can be viewed as a PDF, and is also available on Overleaf.) As mentioned in the proposal, this repository would differ slightly in focus from existing collections such as the COVID-19 data sets hosted on Kaggle, with an emphasis on producing high-quality cleaned data sets. (Another relevant data set that I am aware of is the SafeGraph aggregated foot traffic data, although this data set, while open, is not quite public, as it requires executing a non-commercial agreement. Feel free to mention further relevant data sets in the comments.)
This seems like a very interesting and timely proposal to me, and I would like to open it up for discussion, for instance by proposing some seed requests for data and data cleaning, and by discussing possible platforms that such a repository could be built on. In the spirit of “building the plane while flying it”, one could begin by creating a basic GitHub repository as a prototype and use the comments in this blog post to handle requests, and then migrate to a higher-quality platform once it becomes clear what direction this project might move in. (For instance, one might eventually move beyond data cleaning to more sophisticated types of data analysis.)
UPDATE, Mar 25: a prototype for such a clearinghouse is now up at this wiki page.