Unlocking 100 years of scientific papers: How Scholarcy partnered with BMJ to further I4OC | Scholarcy
peter.suber's bookmarks 2019-05-12
"Reference mining is fundamental to the creation of citation networks and rich, discoverable digital libraries. In recent years, a number of tools have been developed to address this need, but they are often limited by input format, infrastructure requirements and runtime performance. The most recent developments in this space have focused on reference mining PDFs from arts and humanities literature, but there’s a growing need for a fast, accurate way of extracting and parsing references from a wide range of documents and formats across the full research landscape....
From requirements gathering, algorithm refinement, to the process of extracting over 2 million citations as validated XML records in CrossRef, the entire project ran for 12 weeks. Publications which particularly benefited included the British Medical Journal itself (279,000 new records), Gut (177,000), Journal of Clinical Pathology (171,000) and Journal of Neurology, Neurosurgery and Psychiatry (168,000).
99.9% of the extracted records were fully valid XML. In only 0.1% of cases, the XML required some manual correction to meet CrossRef validation standards. The records were uploaded to CrossRef and are now available as open citations for anyone to reuse...."