Open data sharing: what could possibly go wrong?
peter.suber's bookmarks 2025-07-20
Summary:
"Complex trial data stripped of critical context poses significant challenges for secondary analysis. Randomized trials comparing multiple interventions often contain specific design features that preclude certain types of analyses. Only 30% of shared trial datasets include sufficient metadata for replication [10]. The problem extends beyond documentation. In a striking example, 70 independent teams analyzing an identical neuroimaging dataset reached widely divergent conclusions [7]. Crucial context is even more likely to be lost when data dictionaries are created as an afterthought following study completion. This challenge grows as automated analyses of public datasets increase, raising the risk of misapplied methods [8].
This is no longer theoretical: in one study, 40% of shared trial datasets were analyzed in ways that violated their primary design constraints [19]. Clinical trial data labeled as “open” but lacking interoperability led to misinterpretation in 15% of secondary analyses due to missing context [24]. Although scientific communities eventually correct errors through letters to editors or retraction requests, retracted articles continue to be cited as valid evidence long after retraction [12]. This problem extends to journal policies as well. A recent global review of data sharing policies in health research found that only 19% (52/273) of health sciences journals required data sharing, with widely varying conditions for implementation."