Trouble at the text mine : Nature News & Comment 2012-08-20


Computers can rapidly scan through thousands of research papers to make useful connections, but work is being slowed by publishers' unease. When he was a keen young biology graduate student in 2006, Max Haeussler wrote a computer program that would scan, or 'crawl', plain text and pull out any DNA sequences. To test his invention, the naive text-miner downloaded around 20,000 research papers that his institution had paid to access, and promptly found his IP address blocked by the papers' publisher. It was not until 2009 that Haeussler, then at the University of Manchester, UK, and now at the University of California, Santa Cruz, returned to the project in earnest. He had come to realize that standard site licences do not permit systematic downloads, because publishers fear wholesale theft of their content. So Haeussler began asking for licensing terms to crawl and text-mine articles. His goal was to serve science: his program is a key part of the text2genome project, which aims to use DNA sequences in research papers to link the publications to an online record of the human genome. This could produce an annotated genome map linked to millions of research articles, so that biologists browsing a genomic region could immediately click through to any relevant papers. But Haeussler and his text2genome colleague Casey Bergman, a genomicist at the University of Manchester, have spent more than two years trying to agree terms with publishers, and often being ignored or rebuffed... Many publishers say that they will allow their subscribers to text-mine, subject to contract and the text-miners' intentions, and point to a number of successful agreements. But like many early advocates of the technology, Haeussler and Bergman complain that publishers are failing to cope with requests, and so are holding up the progress of research.
What is more, they point out, as text-mining expands, it will be impractical for individual academic teams to spend years each working out bilateral agreements with every publisher. With his frustration boiling over, Haeussler last week started a project to e-mail all the main science publishers for permission to mine their content. He will log their responses online in the hope of raising awareness of the problem... Thanks to growing computer power, software can recognize, extract and index scientific information from vast amounts of plain text, allowing computers to read and organize a body of knowledge that is expanding too fast for any human to keep up. 'Semantic software' is starting to record the relationships between scientific 'entities'... For pharmaceutical firms, text-mining is ‘a basic necessity’ that assists drug development, says Raul Rodriguez-Esteban, a computational biologist at the drug giant Boehringer Ingelheim in Ridgefield, Connecticut. Companies routinely create custom databases of proteins, drugs, cell types and the interactions between them, all gleaned from text-mining, he explains. The technology still needs human oversight, but most enthusiasts expect text-mining to be the key to a new kind of scientific discovery based on rich, computer-readable representations of knowledge gathered from plain-text research articles. But, as Haeussler has discovered, there is a major roadblock. Freely available patents and article abstracts are open for text-mining, but material behind paywalls is not, even when institutions have paid for a site licence. “The licence is oriented towards permitting the human to download and read an article, but not to text-mine it,” says John McNaught, deputy director of the National Centre for Text Mining at the University of Manchester. Even freely accessible papers may not come with permissive licences: of the 2.4 million abstracts listed by PubMedCentral, only 400,000 (17%) are licensed for text-mining...
Software programmers can circumvent publishers' detection systems... but Haeussler says that papers derived from such technically illegal text-mining have been published in leading journals... Those wishing to text-mine within the rules must agree contracts with the publishers, and sometimes pay a fee. Haeussler got permission to mine the corpus of Amsterdam-based publisher Elsevier for free. But another academic text-mining project, BioNOT, based at the University of Wisconsin–Madison, was not so fortunate. The collaboration was charged extra for its contract to search Elsevier papers to automatically extract negative results, potentially useful for showing that genes are not related to a disease, for example. Even powerful drug firms find the negotiations a burden. “When we have licensed and paid for the full text, we feel that we should also have the right to mine it,” says Henning Nielsen, head of the Library and Information Centre at the Danish pharmaceutical firm Novo Nordisk in Bagsværd, Denmark, and president of the Pharma Documentation Ring (PDR), an association of information managers covering 21 of the world's largest drug firms... Publishers deal with text-mining requests in various ways. Last year, the Publishi



08/16/2012, 06:08

From feeds:

Open Access Tracking Project (OATP)


oa.medicine oa.biology oa.npg oa.business_models oa.publishers oa.policies oa.licensing oa.mining oa.comment oa.elsevier oa.copyright oa.libraries oa.access oa.standards oa.patents oa.fees oa.pharma oa.biomedicine oa.ccc oa.ip oa.u.manchester oa.libre



Date tagged:

08/20/2012, 19:00

Date published:

03/08/2012, 08:50