Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central?

abernard102@gmail.com 2012-08-20

Summary:

“The open access movement in scientific publishing has two broad aims: (i) to make scientific articles more broadly accessible and (ii) to permit unrestricted re-use of published scientific content. From its humble beginnings in 2001 with only two journals, PubMed Central (PMC) has grown to become the world’s largest repository of full-text open-access biomedical articles, containing nearly 2.4 million biomedical articles that can be freely downloaded by anyone around the world. Thus, while holding only ~11% of the total published biomedical literature, PMC can be viewed clearly as a major success in terms of making the biomedical literature more broadly accessible. However, I argue that PMC has yet to catalyze similar success on the second goal of the open-access movement: unrestricted re-use of published scientific content. This point became clear to me when writing the discussions for two papers that my lab published last year. In digging around for references to cite, I was struck by how difficult it was to find examples of projects that applied text-mining tools to the entire set of open-access articles from PubMed Central. Unsure if this was a reflection of my ignorance or the actual state of the art in the field, I canvassed the biological text mining community, the bioinformatics community and two major open-access publishers for additional examples of text-mining on the entire open-access subset of PMC. Surprisingly, I found that after a decade of existence only ~15 articles* have ever been published that have used the entire open-access subset of PMC for text-mining research... Text-miners also routinely apply their systems to MEDLINE abstracts, albeit often on a small scale, and there is a growing community of biocurators and bioinformaticians eager to consume data from full-text mining. So what is going on here?
Perhaps it is worth drawing an analogy with another major resource that was released at roughly the same time as PMC: the human genome sequence. According to many, including those in the popular media, the promise of the human genome was oversold, perhaps to leverage financial support for this major project. Unfortunately, as Greg Petsko and Jonathan Eisen have argued, overselling the human genome project has had unintended negative consequences for the understanding, and perhaps funding, of basic research. Could the goal of reuse of open access articles likewise represent an overselling of the PMC repository? If so, then the open-access movement runs the risk of failing to deliver on one of the key planks in its platform. Failing to deliver on re-use could ultimately justify funders (if no-one is using it, why should we pay) and publishers (if no-one is using it, why should we make it open) to advocate green over gold open access, which could have a devastating impact on text-mining research, since author-deposited (green) manuscripts in PMC are off-limits for text-mining research. I hope (and am actively working to prove) that re-use of the open access literature will not remain an unfulfilled promise. I suspect rather that we simply are in the lag phase before a period of explosive growth in full-text mining, akin to what happened in the field of genome-wide association studies after the publication of the human genome sequence. So text-miners, bioinformaticians, and computational biologists, do your part to maximize the utility of Varmus, Lipman and Brown’s vision of an Arxiv for biology, and prove that the twin aims of the open access movement can be fulfilled.” [The blogger provides a list of 16 “published text mining studies using the entirety of the Open Access subset of PMC.”]

Link:

http://caseybergman.wordpress.com/2012/03/02/why-are-there-so-few-efforts-to-text-mine-the-open-access-subset-of-pubmed-central/

Updated:

08/16/2012, 06:08

From feeds:

Open Access Tracking Project (OATP) » abernard102@gmail.com

Tags:

oa.medicine oa.biology oa.new oa.gold oa.pubmed oa.business_models oa.publishers oa.mining oa.comment oa.green oa.funders oa.biomedicine oa.repositories oa.journals

Authors:

abernard

Date tagged:

08/20/2012, 14:41

Date published:

03/04/2012, 11:52