Deep Curation: Putting Open Science Data to Work | NYU Tandon School of Engineering
"Data in public repositories and in the scientific literature remain remarkably underused despite significant investments in open data and open science. Making data available online turns out to be the easy part; making it usable for data science requires new services that support longitudinal, multi-dataset analysis rather than keyword search alone.
In this talk, I'll describe a suite of services my group has been building to improve the utility of public data.
In the Deep Curation project, we have developed a variant of distant supervision and co-learning that can automatically label datasets with zero training data. We have applied this approach to curate gene expression data and identify figures in the scientific literature, outperforming state-of-the-art supervised methods that rely on human-provided labels.
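The core distant-supervision idea can be sketched briefly. This is a minimal illustration, not the Deep Curation system itself: the "knowledge base," the tissue labels, and the word-count classifier below are all hypothetical stand-ins for the real features and learners.

```python
# Distant-supervision sketch (hypothetical data and features): descriptions
# that mention a known knowledge-base term inherit that term's label, and
# the resulting noisy labels train a classifier with zero hand labeling.
from collections import Counter

# Hypothetical "knowledge base": term -> tissue label.
KNOWLEDGE_BASE = {"hepatocyte": "liver", "cortex": "brain", "alveolar": "lung"}

def distant_labels(descriptions):
    """Assign noisy labels by matching KB terms in free text."""
    labeled = []
    for text in descriptions:
        for term, label in KNOWLEDGE_BASE.items():
            if term in text.lower():
                labeled.append((text, label))
                break
    return labeled

def train_word_counts(labeled):
    """Toy classifier: per-label word frequencies (naive-Bayes-like)."""
    counts = {}
    for text, label in labeled:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts

def predict(counts, text):
    """Pick the label whose training words best overlap the query."""
    words = text.lower().split()
    return max(counts, key=lambda lab: sum(counts[lab][w] for w in words))

docs = ["primary hepatocyte culture, 24h", "frontal cortex biopsy",
        "alveolar macrophage sample", "liver tissue, hepatocyte-derived"]
model = train_word_counts(distant_labels(docs))
print(predict(model, "hepatocyte from liver"))  # -> liver
```

The point of the sketch is the labeling step: no human ever annotates a training example, yet a supervised learner can still be trained on the induced labels.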
In the Wide Open project, we use a simple text-based approach to identify datasets referenced in the scientific literature that are overdue for publication; our results led to the public release of 400 datasets in a one-week period.
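A simple text-based detector of this kind might look like the following sketch. The GEO-style `GSE<digits>` accession pattern and the released-accession set are assumptions for illustration; the actual Wide Open pipeline is not reproduced here.

```python
import re

# Hypothetical set of accessions known to be publicly released.
RELEASED = {"GSE10072", "GSE2034"}

# Assumed GEO-series-style accession pattern: "GSE" followed by digits.
ACCESSION_RE = re.compile(r"\bGSE\d+\b")

def overdue_accessions(paper_texts, released=RELEASED):
    """Return accessions cited in the literature but not yet public."""
    cited = set()
    for text in paper_texts:
        cited.update(ACCESSION_RE.findall(text))
    return sorted(cited - released)

papers = ["Data are deposited under accession GSE10072.",
          "Expression profiles (GSE99999) will be made available."]
print(overdue_accessions(papers))  # -> ['GSE99999']
```

Any accession that appears in the literature but not in the repository's public index is a candidate overdue dataset to report to the repository maintainers.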
In the Claim Verification project, we extract limited forms of scientific claims from the literature and automatically perform reproducibility experiments against data in public repositories.
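One limited claim form can be checked mechanically, as in this sketch. The claim tuple grammar and the two-group dataset below are illustrative assumptions, not the project's actual extraction or experiment machinery.

```python
# Sketch of verifying one limited claim form against repository data.
# Claim grammar and measurements here are illustrative assumptions.
from statistics import mean

def verify_greater_mean_claim(claim, dataset):
    """Check a claim of the form ('mean_greater', group_a, group_b)."""
    kind, a, b = claim
    if kind != "mean_greater":
        raise ValueError("unsupported claim form")
    return mean(dataset[a]) > mean(dataset[b])

# Hypothetical measurements for two sample groups from a public repository.
data = {"treated": [5.1, 6.3, 5.8], "control": [4.0, 3.7, 4.2]}
claim = ("mean_greater", "treated", "control")
print(verify_greater_mean_claim(claim, data))  # -> True
```

Restricting claims to a small, machine-checkable vocabulary is what makes the reproducibility experiment fully automatic.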
In the Viziometrics project, we are developing a platform for large-scale information extraction from the figures in the scientific literature. We have used this platform to automatically build a database of phylogenetic information from tens of thousands of tree diagrams in the literature.
Finally, in the Query2Vec project, we are designing vector embeddings of SQL query logs to automate database administration tasks such as index recommendation and result caching.
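To make the embedding idea concrete, here is a deliberately simple stand-in: a bag-of-tokens vector over a SQL log with cosine similarity, rather than the learned embeddings Query2Vec actually studies. The tokenizer, log, and query below are all hypothetical.

```python
# Toy query-similarity sketch: bag-of-tokens vectors over a SQL log.
# This stands in for learned embeddings; all data here is hypothetical.
import math
import re
from collections import Counter

def tokenize(sql):
    """Crude SQL tokenizer: lowercased keywords, identifiers, numbers."""
    return re.findall(r"[a-z_][a-z0-9_]*|\d+", sql.lower())

def vectorize(sql):
    """Sparse token-count vector for one query."""
    return Counter(tokenize(sql))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

log = ["SELECT name FROM users WHERE id = 1",
       "SELECT name FROM users WHERE id = 2",
       "DELETE FROM orders WHERE created < 2016"]
query = "SELECT name FROM users WHERE id = 3"
best = max(log, key=lambda entry: cosine(vectorize(query), vectorize(entry)))
print(best)  # -> SELECT name FROM users WHERE id = 1
```

Grouping nearby queries in such a vector space is the kind of signal that can feed administration tasks like picking indexes for hot query shapes or caching results that recur.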
Our vision is to provide a richer set of services to make data-intensive science more robust and reproducible, and ultimately improve public trust in science."