Many datasets are reused, not just an elite few 2012-07-17


“I’ve recently collected new data on data reuse. Using the same methods as our Nature letter-to-the-editor analysis, I’ve looked for reuse of gene expression microarray data in PubMed Central (PMC) by searching for dataset ID numbers in the full text of studies. Studies that mention a dataset accession number but share author last names with those who deposited the dataset are excluded.

“The new results look at datasets deposited into the Gene Expression Omnibus (GEO) repository between 2001 and 2009. The figure below has one panel for every year: each panel reflects the datasets deposited into GEO that year. The line shows the cumulative probability of the number of reuses we observed. As you can see in the first panel, almost every dataset deposited in 2001 has been mentioned in a PMC paper at least once, and most many times. The line quickly veers right: the probability of a 2001 dataset being reused only once or twice by 2010 is very small. In 2009, in contrast, the line goes mostly straight up. 90% of the datasets deposited in GEO in 2009 had 0 reuses observed by our conservative method: the probability that a 2009 dataset has 0 observed reuses by 2010 is very high!

“Results for the middle years are particularly important, since by then GEO had lots of datasets, and between then and now there has been enough time for reuse to accumulate. We observed reuse of more than 20% of the datasets deposited in 2003 and 17% of the datasets deposited in 2007.

“Note: the method used to detect reuse here is VERY CONSERVATIVE, so these are minimum estimates. It only finds reuses by papers that are in PubMed Central, and only those that are attributed by mentioning the accession number (it misses those attributed by citation to the article, for example). Nonetheless, it does serve as a lower bound.”
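The counting approach described in the quote can be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the `GSE` accession pattern, the `papers` and `depositors` data structures, and the function names `count_reuses` and `cumulative_proportion` are all assumptions made for the example. It counts third-party mentions of each accession number, skips papers that share an author last name with the depositors, and then computes the cumulative proportion of datasets with at most k observed reuses, which is the kind of curve each panel of the figure plots.

```python
import re
from collections import Counter

# Assumption: dataset IDs are GEO series accessions of the form "GSE" + digits.
ACCESSION_RE = re.compile(r"\bGSE\d+\b")


def count_reuses(papers, depositors):
    """Count third-party mentions of each dataset accession.

    papers: list of dicts with keys "text" (full text) and
            "author_last_names" (set of str). Hypothetical structure.
    depositors: dict mapping accession -> set of depositor last names.
    """
    # Start every known dataset at 0 so unreused datasets are still counted.
    reuses = Counter({acc: 0 for acc in depositors})
    for paper in papers:
        authors = {name.lower() for name in paper["author_last_names"]}
        # A paper counts at most once per dataset, however often it mentions it.
        for acc in set(ACCESSION_RE.findall(paper["text"])):
            if acc not in depositors:
                continue
            owners = {name.lower() for name in depositors[acc]}
            if authors & owners:
                continue  # shared last name: treated as the depositors' own use
            reuses[acc] += 1
    return reuses


def cumulative_proportion(reuses):
    """Return [(k, fraction of datasets with <= k observed reuses), ...]."""
    total = len(reuses)
    freq = Counter(reuses.values())
    running = 0
    curve = []
    for k in range(max(freq) + 1):
        running += freq.get(k, 0)
        curve.append((k, running / total))
    return curve
```

Grouping the datasets by deposit year and calling `cumulative_proportion` on each group would give one curve per year. Because only accession-number mentions in the searched text are counted, the result is a lower bound, as the quote notes.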



