Gene data to hit milestone: Nature News & Comment 2012-07-21


“Purvesh Khatri sits in front of an oversized computer screen, trawling for treasure in a sea of genetic data. Entering the search term ‘breast cancer’ into a public repository called the Gene Expression Omnibus (GEO), the postdoctoral researcher retrieves a list of 1,170 experiments, representing nearly 33,000 samples and a hoard of gene-expression data that could reveal previously unseen patterns. That is exactly the kind of search that led Khatri’s boss, Atul Butte, a bioinformatician at the Stanford School of Medicine in California, to identify a new drug target for diabetes. After downloading data from 130 gene-expression studies in mice, rats and humans, Butte looked for genes that were expressed at higher levels in disease samples than in controls. One gene was strikingly consistent: CD44, which encodes a protein found on the surface of white blood cells, was differentially expressed in 60% of the studies (K. Kodama et al. Proc. Natl Acad. Sci. USA 109, 7049–7054; 2012). The CD44 protein is not widely investigated as a drug target for diabetes, but Butte’s team found that treating obese mice with an antibody against it caused their blood glucose levels to drop... Butte and his team are now using publicly available data to answer a diverse range of questions ... Since 2002, many scientific journals have required that data from gene-expression studies be deposited in public databases such as GEO, which is maintained by the National Center for Biotechnology Information in Bethesda, Maryland, and ArrayExpress, a large gene-expression repository at the European Bioinformatics Institute (EBI) in Hinxton, UK. Some time in the next few weeks, the number of deposited data sets will top one million (see ‘Data dump’). The result is an unprecedented resource that promises to drive down costs and speed up progress in understanding disease.
Gene-sequence data are already shared extensively, but expression data are more complex and can reveal which genes are the most active in, say, liver versus brain cells, or in diseased versus healthy tissue. And because studies often look at many genes, researchers can repurpose the data sets, asking questions other than those posed by the original researchers. It is easy to track how many data sets are being deposited — much harder is working out how they are being used. Heather Piwowar, who studies data reuse with the National Evolutionary Synthesis Center from the University of British Columbia in Vancouver, Canada, found that 20% of data sets deposited in GEO in 2005 and 17% of those in 2007 had been cited by the end of 2010. But those rates are certainly underestimates, she says... More studies are reusing data every year, she says. ‘We have every reason to believe it is game-changing...’ Having access to such data is ‘immensely valuable,’ agrees Enrico Petretto, a genomicist at Imperial College London. ‘We would never be in a position to look across multiple tissues and species with the money we have.’ But he cautions that using other people’s data can be tricky. If data sets give contradictory outcomes, it is unclear whether that is because the underlying data contradict each other or because something went wrong with the analysis. ‘That’s why people sometimes don’t trust this,’ he says. Still, few researchers are using the data to their greatest potential, says Alvis Brazma, a bioinformatician at the EBI. ‘Being able to reuse functional genomics data is a really new thing,’ he says. Researchers rarely download more than half a dozen data sets, and most use the data only to compare with their own results. Studies that use only other scientists’ data to come up with new findings are still unusual. That makes Butte and Khatri trailblazers. Another pioneer is Gustavo Stolovitzky, a computational biologist at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York, who has used publicly available data to train algorithms to recognize gene signatures for diseases such as lung cancer, chronic obstructive pulmonary disease (COPD) and psoriasis... Other efforts promise to unleash even more power from the growing repositories...”



08/16/2012, 06:08

From feeds: Open Access Tracking Project (OATP)


oa.medicine oa.mining oa.comment oa.open_science oa.costs oa.biomedicine oa.curation oa.ebi oa.ncbi oa.milestones oa.repositories



Date tagged: 07/21/2012, 07:54

Date published: 07/21/2012, 08:41