Text mining: what do publishers have against this hi-tech research tool? | Science | The Guardian

abernard102@gmail.com 2012-05-24


“Professor Peter Murray-Rust was looking for new ways to make better drugs. Dr Heather Piwowar wanted to track how scientific papers were cited and shared by researchers around the world. Dr Casey Bergman wanted to create a way for busy doctors and scientists to quickly navigate the latest research in genetics, to help them treat patients and further their research.

All of them needed access to tens of thousands of research papers at once... to look for unseen patterns and associations across the millions of words in the articles. This technique, called text mining, is a vital 21st-century research method. It uses powerful computers to find links between drugs and side effects, or genes and diseases, that are hidden within the vast scientific literature. These are discoveries that a person scouring through papers one by one may never notice.

A report published by McKinsey Global Institute last year said that "big data" technologies such as text and data mining had the potential to create €250bn (£200bn) of annual value to Europe's economy, if researchers were allowed to make full use of it.

Unfortunately, in most cases, text mining is forbidden. Bergman, Murray-Rust, Piwowar and countless other academics are prevented from using the most modern research techniques because the big publishing companies such as Macmillan, Wiley and Elsevier, which control the distribution of most of the world's academic literature, by default do not allow text mining of the content that sits behind their expensive paywalls.

Any such project requires special dispensation from – and time-consuming individual negotiations with – the scores of publishers that may be involved.

‘That's the key fact which is halting progress in this field,’ said Robert Kiley, head of digital services at the Wellcome Trust.

Bergman, an evolutionary biologist at the University of Manchester, used text mining to create a tool to help scientists make sense of the ever-growing research literature on genetics. Though genetic sequences of living organisms are publicly available, discussions of what the sequences do and how they interact with each other sit within the text of scientific papers that are mostly behind paywalls.

Working with Max Haeussler, of the University of California, Santa Cruz, Bergman came up with Text2genome, which identifies strings of text in thousands of papers that look like the letters of a DNA sequence – a gene, say – and links together all papers that mention or discuss that sequence. Text2genome could allow a clinician or researcher who may not be an expert on a particular gene to access the relevant literature quickly and easily. Haeussler's attempts to scale up Text2genome, however, have hit a wall, and his blog is a litany of the problems in trying to gain permissions from the scores of publishers to download and add papers to the project. ‘If we don't have access to the papers to do this text mining, we can't make those connections,’ says Bergman.

Murray-Rust, a chemist at the University of Cambridge, has used text mining to look for ways to make chemical compounds, such as pharmaceuticals, more efficiently. ‘If you have a compound you don't know how to make and it's similar to one you do know how to make, then the machine would be able to suggest a number of methods which would allow you to do it.’

But, although his university subscribes to the journals he needs to do this work, he is forbidden from using the content in what he calls ‘a modern manner using machines’.

A member of his research group accidentally tripped the alarms of a publisher's website when he downloaded several dozen papers at once from journals to which the university had already paid subscription fees. The publisher saw it as an attempt to illegally download content and immediately blocked access to its content for the entire university.

The UK government supports open access to publicly funded research and the text mining that it would allow. In a report for the Intellectual Property
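
To illustrate the kind of matching the excerpt attributes to Text2genome (finding strings that look like DNA letters and linking the papers that share them), here is a minimal Python sketch. It is not the Text2genome implementation: the function names, the 20-letter minimum length, and the exact-match linking rule are all assumptions made for illustration, and a real tool would presumably also handle sequences broken by spaces or line breaks.

```python
import re
from collections import defaultdict


def find_dna_like_strings(text, min_length=20):
    """Return the set of unbroken runs of A/C/G/T at least min_length long.

    min_length is a hypothetical threshold chosen so that ordinary
    words are unlikely to match by accident.
    """
    pattern = re.compile(r"[ACGT]{%d,}" % min_length)
    # Upper-case first so that sequences printed in lower case also match.
    return set(pattern.findall(text.upper()))


def link_papers_by_sequence(papers, min_length=20):
    """Map each DNA-like string to the papers that contain it.

    papers is a dict of {paper_id: full_text}. Only strings found in
    more than one paper are kept, since those are the ones that link
    papers together.
    """
    index = defaultdict(set)
    for paper_id, text in papers.items():
        for sequence in find_dna_like_strings(text, min_length):
            index[sequence].add(paper_id)
    return {seq: ids for seq, ids in index.items() if len(ids) > 1}


# Toy example with made-up paper texts.
papers = {
    "paper_a": "We amplified the primer ACGTACGTACGTACGTACGTAC in vitro.",
    "paper_b": "The sequence acgtacgtacgtacgtacgtac was reported earlier.",
    "paper_c": "No sequences here, only a cat.",
}
links = link_papers_by_sequence(papers)
```

In this toy run, `links` associates the one shared 22-letter string with `paper_a` and `paper_b`, while `paper_c` is not linked to anything. The sketch also shows why access matters for this kind of work: the index can only connect papers whose full text is available to the program.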



From feeds: Open Access Tracking Project (OATP) » abernard102@gmail.com

Tags: oa.new oa.business_models oa.publishers oa.policies oa.licensing oa.mining oa.comment oa.government oa.elsevier oa.copyright oa.uk oa.wiley oa.funders oa.fees oa.wellcome oa.macmillan oa.hargreaves oa.economic_impact oa.libre

Date tagged: 05/24/2012, 17:25

Date published: 05/24/2012, 13:25