BishopBlog: A proposal for data-sharing that discourages p-hacking
peter.suber's bookmarks 2022-08-17
"One problem [with open data] is p-hacking. If you put a large and complex dataset in the public domain, anyone can download it and then run multiple unconstrained analyses until they find something, which is then retrospectively fitted to a plausible-sounding hypothesis. The potential to generate non-replicable false positives by such a process is extremely high - far higher than many scientists recognise. I illustrated this with a fictitious example here.
Another problem is self-imposed publication bias: the researcher runs a whole set of analyses to test promising theories, but forgets about them as soon as they turn up a null result. With both of these processes in operation, data sharing becomes a poisoned chalice: instead of increasing scientific progress by encouraging novel analyses of existing data, it just means more unreliable dross is deposited in the literature. So how can we prevent this?
In this Commentary paper, I noted several solutions. One is to require anyone accessing the data to submit a protocol which specifies the hypotheses and the analyses that will be used to test them. In effect this amounts to preregistration of secondary data analysis. This is the method used for some big epidemiological and medical databases. But it is cumbersome and also costly - you need the funding to support additional infrastructure for gatekeeping and registration. For many psychology projects, this is not going to be feasible.
A simpler solution would be to split the data into two halves - those doing secondary data analysis only have access to part A, which allows them to do exploratory analyses, after which they can then see if any findings hold up in part B. Statistical power will be reduced by this approach, but with large datasets it may be high enough to detect effects of interest. ..."