She wants to know the best practices for flagging bad responses and cleaning survey data. Any suggestions from the tidyverse or crunch.io?
Statistical Modeling, Causal Inference, and Social Science 2024-10-13
A colleague who works in a field that uses a lot of survey research asks:
Can you recommend papers about detecting bad survey responses? We have some such methods where I work, but I’m curious what the Census Bureau and other big survey establishments do to flag bad responses. The Groves book doesn’t seem to have much.
My colleague continues:
I’ve looked through documentation and mostly see things they do during data collection to get good responses, which is of course great. I’m curious what their process is during data cleaning. Are they looking for outliers relative to what someone’s other responses would predict?
I replied that this sounds like something that would be done in the tidyverse, so maybe someone from that world can offer some suggestions. Also, I know that the people at the YouGov spinoff crunch.io do lots of data cleaning, so maybe they have some document they can point to.
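To make my colleague’s question concrete, here’s a minimal tidyverse-style sketch of one version of that check: predict each item from the respondent’s other items and flag answers with large standardized residuals. Everything here is made up for illustration—the data frame `survey`, the item names `q1` through `q5`, and the cutoff of 3—and it assumes numeric items with complete responses:

```r
library(dplyr)

# Hypothetical data: `survey` has one row per respondent and numeric
# items q1..q5 (names made up). For each item, predict it from the
# respondent's other items and flag large standardized residuals.
flag_outliers <- function(dat, items, cutoff = 3) {
  flags <- matrix(FALSE, nrow(dat), length(items),
                  dimnames = list(NULL, items))
  for (it in items) {
    others <- setdiff(items, it)
    fit <- lm(reformulate(others, response = it), data = dat)
    # assumes complete responses, so residuals line up with rows
    flags[, it] <- abs(rstandard(fit)) > cutoff
  }
  dat |> mutate(n_flagged = rowSums(flags))
}

# survey |> flag_outliers(items = paste0("q", 1:5)) |> filter(n_flagged > 0)
```

Anything a check like this flags would be a candidate for manual review, not automatic deletion.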
In my own work, I’ve had to clean data—there’s an example in Appendix A of Regression and Other Stories—but I don’t have a systematic workflow for the process.
I remember when we were analyzing a “How many X’s do you know?” survey, we somehow stumbled across one respondent who answered 7 to every question, so we threw that person out of our data. But that’s just something we happened to notice, and there could well be other bad responses that we never caught.
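In retrospect, that particular case is easy to catch mechanically: a respondent who gives the same answer to every item has zero variation across the whole battery. Here’s a rough check for that sort of straight-lining, reusing the made-up `survey` data frame and item names from the sketch above:

```r
library(dplyr)

items <- paste0("q", 1:5)  # hypothetical item names

# Respondents with only one distinct answer across all items,
# like the person who answered 7 to every question.
survey |>
  rowwise() |>
  mutate(n_distinct_answers = n_distinct(c_across(all_of(items)))) |>
  ungroup() |>
  filter(n_distinct_answers == 1)
```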
At the other extreme, what if you find yourself with possibly fabricated data? Perhaps it’s the work of some researcher such as Brian Wansink or Mary Rosh who can offer no convincing documentation that the study or survey in question ever took place. Or perhaps it’s a legitimate-sounding survey that you suspect was constructed using “curbstoning,” which is what they call it when the lazy survey interviewer doesn’t bother knocking on doors to talk to people and instead sits on the curbstone outside the house and makes up plausible responses. Some researchers have used statistical techniques to search for duplicate or near-duplicate records and have claimed that fabricated data is a big problem in international surveys, including those from respected organizations such as Afrobarometer, Arab Barometer, Americas Barometer, the International Social Survey, Pew Global Attitudes, the Pew Religion Project, the Sadat Chair, and the World Values Survey.
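One statistic used in that line of work is, for each respondent, the highest share of identical answers shared with any other respondent in the survey (a maximum “percent match”). Here’s a rough base-R sketch of the idea, with `survey` and `items` being the same placeholders as in the earlier sketches; the quadratic loop is only fine for modest sample sizes:

```r
# For each respondent, the highest fraction of items on which they give
# exactly the same answers as some other respondent. Values near 1
# suggest possible duplication and are worth a closer look.
max_percent_match <- function(resp_matrix) {
  n <- nrow(resp_matrix)
  pm <- numeric(n)
  for (i in seq_len(n)) {
    # compare row i against every other row, item by item
    target <- matrix(resp_matrix[i, ], n - 1, ncol(resp_matrix), byrow = TRUE)
    agree <- rowMeans(resp_matrix[-i, , drop = FALSE] == target)
    pm[i] <- max(agree)
  }
  pm
}

# hist(max_percent_match(as.matrix(survey[items])))  # look for a spike near 1
```

A cluster of values at or near 1 would be the signal to investigate further; as with the other checks, it’s evidence for follow-up, not proof of fabrication.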