What's smelling fishy? Maybe your data

Numbers Rule Your World 2021-12-04

In the comments to the previous post, a PhD student asked for general advice on testing data for irregularities. This topic merits a separate post, indeed multiple posts.


***

Here are some initial thoughts:

1. Your data is guilty until proven innocent

2. The top N rows of your data may be false friends

3. With experience, you develop an intuitive feel for the common types of problems to look for

4. Look for problems in slices of the data, because problems are not randomly distributed throughout your dataset

5. Avoid inferring metadata from the data - find the metadata or ask the data collector

6. Seek out contradictory statistics: e.g. if A and B have these values, then it's impossible for C to have this value

7. Pushing bad data into your analysis pipeline and then fixing problems as they surface does not save time; on the contrary, it will cost you much more time

8. Many problems are caused by data collectors who have no knowledge of how data analysts will use the collected data in the future
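To make points 2, 4 and 6 a little more concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the toy rows, the column names (`region`, `total`, `part_a`, `part_b`) and the two helper functions are hypothetical, and the rule that a part cannot exceed its total is just one example of a contradiction check. Note that the first two rows look perfectly clean; the problems only show up when you slice by region and test the impossible combinations.

```python
from collections import defaultdict

# Hypothetical toy records, invented for this sketch.
rows = [
    {"region": "east", "total": 10, "part_a": 4, "part_b": 6},
    {"region": "east", "total": 8, "part_a": 3, "part_b": 5},
    {"region": "west", "total": 7, "part_a": None, "part_b": 2},
    {"region": "west", "total": 5, "part_a": None, "part_b": 9},
    {"region": "west", "total": 9, "part_a": 4, "part_b": 5},
]


def missing_rate_by_slice(rows, slice_key, field):
    """Share of rows where `field` is missing, per value of `slice_key`."""
    counts = defaultdict(lambda: [0, 0])  # slice value -> [missing, total]
    for row in rows:
        tally = counts[row[slice_key]]
        tally[0] += 1 if row[field] is None else 0
        tally[1] += 1
    return {key: missing / total for key, (missing, total) in counts.items()}


def contradictions(rows):
    """Indices of rows where a known part exceeds the total (assumed impossible)."""
    bad = []
    for i, row in enumerate(rows):
        parts = [row["part_a"], row["part_b"]]
        if any(p is not None and p > row["total"] for p in parts):
            bad.append(i)
    return bad


slice_rates = missing_rate_by_slice(rows, "region", "part_a")
bad_rows = contradictions(rows)
print(slice_rates)  # missing rate of part_a for each region
print(bad_rows)     # row indices that violate the part <= total rule
```

In this toy data, the missing values are concentrated entirely in one region, which is the kind of non-random pattern that a glance at the top rows would never reveal.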