No Junk Science: Towards Automatic Validation of Scientific Claims Using Open Data | College of Information and Computer Sciences | UMass Amherst
Amyluv's bookmarks 2017-12-14
"Scientific data is increasingly available online in open data repositories, thanks to investments by funding agencies to build the platforms and progressive policies by journals to require their use. But these investments have underdelivered -- most repositories are "data graveyards" that have seen limited use by researchers. The highest level interfaces to these repositories is essentially keyword search, and even that performs poorly due to poor quality metadata. Focusing on gene expression studies, we are applying techniques from machine learning and databases to automatically curate and integrate the data in these repositories, then using the results to automatically validate claims in scientific papers. We use co-learning and distant supervision to provide high-quality metadata without requiring labeled training data. We then extract claims from papers using simple NLP techniques, use the results to derive appropriate schema mappings, and generate statistical tests against the integrated datasets to assess claim validity. I'll show that our automatic results match manual verification, and that the overall approach is robust to various sources of noise. Initial results show that while most claims are supported by prior data, we can identify weaker results that warrant further investigation. If there's time, I'll wrap up by positioning these results within a broader context of "responsible data science" that includes issues of fairness, privacy, and transparency."