Adventures in Data Citation: sorghum as a standard for data release 2012-05-16


“A correspondence we have contributed to has just been published in the BMC Research Notes "Data standardization, sharing and publication" series on the data-citation and data-release practices surrounding the Sorghum genome, which is available in our GigaDB database and was published last year in Genome Biology. We use Sorghum as an example to highlight the issues surrounding data release and use strong words, subtitling the paper "sorghum genome data exemplifies the new gold standard", justified in this case by the considerable efforts the authors made to go beyond the standards of the field and follow the latest best practices.
Despite genomics having a reputation as the field of biology with the best-established data-release practices and policies, compliance is still mixed. The authors of the Sorghum study went far beyond the usual minimal raw data deposition and spent six months working with the curators of four public repositories (in addition to GigaDB) to make sure that all six data types featured in the paper were in their most usable forms. Making all of the supporting data freely available to allow transparency and reproducibility of work is a key goal of GigaScience, and we felt that this demonstration of leadership in the sharing, standardization and publication of biomedical research data should be applauded and highlighted. We feel that the correspondence article fits the scope and criteria of the open-data series well, and hope that it can be used to make the wider research community, beyond the usual digital curation experts, more aware of best practices and what is currently possible with data publication.
Data citation arises from a recognition that data generated in the course of research are just as valuable to the ongoing academic discourse as papers, and DataCite (formed in 2009) provides a technical infrastructure using data DOIs to aid this.
To truly put data on a par with research publications and to credit and track their impact the same way, data DOIs need to be treated the same way as scientific articles and cited in the references section of papers... the biology community has not been citing data in this way despite published guidelines and recommendations by databases... Following our early hiccups getting our dataset DOIs into other journals, the authors worked very closely with the editors of Genome Biology (carefully following the guidelines of the DCC) to integrate data DOIs into the references of the research article, the first time that we are aware of that this has been accomplished in the field of genomics. Since this was originally highlighted in the BMC blog, there have been several more successes in this area: subsequent data DOIs have been referenced in Springer journals, one of our data DOIs made it into the references of a Nature series journal for the first time, PLoS journals are now referencing Figshare handles, and our publisher BioMed Central is using the Sorghum dataset as the example of how to cite data in their instructions for authors.
The Sorghum study is also an excellent example for future data submitters of what can be done not only to comply with but also to go beyond minimal journal data policies. On top of all of the data in the Genome Biology paper being available from GigaDB, the raw data (SRA), genome assemblies (GenBank), and processed data such as SNPs, structural variations, copy number variations and indels were also deposited in their respective NCBI databases. Furthermore, the authors not only adhered to the standard journal editorial policies for genomics studies, which insist on raw data deposition (and if possible genome assemblies) in one of the three INSDC databases, but also deposited additional processed data in the dbSNP and dbVar databases. This additional effort is at best encouraged by journals but is not currently mandated.
When the annotated data is fully integrated into these databases (detailed curation is a time-consuming process, particularly when having to get to grips with data produced by BGI's new SV-tools, and the staggered build releases mean that full integration can take several months), it will be the first plant data in the relatively new dbVar database. The advantages highlighted in the correspondence are that the GigaDB entry tied together all of these related datasets in one place and allowed them to be released rapidly in a stable and citable form before the associated analysis paper's publication. In addition to complementing the data deposited in the NCBI databases, being available in GigaDB makes the data more discoverable through other channels, such as the DataCite metadata search engine and eventually through citation indexes. In future papers that include additional data types without established public repositories, the data could be made available in GigaDB, which can provide a home for potentially any useful data type, supporting information, scripts or source code. In Sorghum's case, depositing the data in GigaDB also allowed us to give it a clear CC0 publi



08/16/2012, 06:08

From feeds:

Open Access Tracking Project (OATP) »


oa.medicine oa.biology oa.npg oa.business_models oa.publishers oa.licensing oa.comment oa.societies oa.best_practices oa.plos oa.open_science oa.figshare oa.geo oa.bmc oa.biomedicine oa.springer oa.wiley-blackwell oa.databases oa.guides oa.dois oa.datacite oa.ncbi oa.gigadb oa.bgi oa.f1000research oa.ubiquity_press oa.dcc oa.libre oa.journals



Date tagged:

05/16/2012, 13:15

Date published:

05/16/2012, 14:01