Data problems, coding errors…what can be done?
Statistical Modeling, Causal Inference, and Social Science 2013-04-22
This post is by Phil
A recent post on this blog discusses a prominent case of an Excel error leading to substantially wrong results from a statistical analysis. Excel is notorious for this because it is easy to add a row or column of data (or intermediate results) but forget to update equations so that they correctly use the new data. That particular error is less common in a language like R because R programmers usually refer to data by variable name (or by applying functions to a named variable), so the same code works even if you add or remove data.
Still, there is plenty of opportunity for errors no matter what language one uses. Andrew ran into problems fairly recently, and also blogged about another instance. I’ve never had to retract a paper, but that’s partly because I haven’t published a whole lot of papers. Certainly I have found plenty of substantial errors pretty late in some of my data analyses, and I obviously don’t have sufficient mechanisms in place to be sure that errors can’t persist all the way through to the end.
I have adopted some practices to try to reduce the kinds of errors that I tend to make. Below, I list some of the things that I have found to be risky, and some of the things I do to try to reduce the risk. Where I give a coding example, it’s in R.
- I have sometimes referred to the wrong column (or, occasionally, row) of data. The solution here is easy: assign column names, and refer to columns by name instead of by number. Ideally, the column headers are already in the data file; if not, and if it doesn’t make sense to edit the data file to put them in there, then read the data file normally and assign column names with the very next command (there’s a short sketch of this below, after the list).
- I sometimes make a change or fix an error in one place in my code but not another. Suppose I am comparing statistical distributions from two datasets and I decide to summarize them by the 10th, 25th, 50th, 75th, and 90th percentiles, in addition to calculating a lot of other stuff. I might have a block of code that analyzes the first dataset, and an almost-duplicated block of code that analyzes the second dataset. If I decide to use the 5th and 95th percentiles instead of the 10th and 90th, I might implement that change in the first block but forget to do it in the second. I try to make myself use two mechanisms to guard against this problem: (a) label the outputs, and look at the labels. In the example here, the quantile() function labels its results automatically: quantile(data, probs=c(0.05, 0.25, 0.5, 0.75, 0.95)) returns a vector whose names tell you which percentiles you’ve selected. Unfortunately, the way I usually work, I often won’t end up looking at those names when it comes time to do something with the results; for instance, I might make an output matrix with something like rbind(results1, results2), where results1 and results2 are outputs of the quantile function. But another option is (b) do as much as I reasonably can with user-defined functions rather than repeated blocks of code. In the current example, rather than calling quantile() in two different places, I could write a really simple function like myquantile = function(datavec) { return(quantile(datavec, probs=c(0.05, 0.25, 0.5, 0.75, 0.95))) } and call it for both datasets. If I decide to use different quantiles, I edit the function in one place. Simple, but for some reason it’s hard to make myself do it, especially for one-liners like the example I’ve given here (there’s a sketch of this below, after the list).
- I sometimes use the wrong variable name in a function, and this problem can be hard to find. Suppose I’m taking my own advice (from item 2, above): I’ve got a bunch of R code to do something, and I realize I’m going to want to repeat the same steps with some other data or with a subset of the data. Being a good boy for once, I take the code that I’ve generated and wrap a function around it. Great, except that I sometimes make an error like this: inside the function I define a variable like std = sqrt(var(dat)), and then later in the function I write something like x = (dat - mean(dat))/stdev. See how I said std in one place and stdev someplace else? Oops. Often this isn’t a problem, because it generates an error when I call the function, and I find the mistake and fix it. But sometimes it is a big problem, because I have defined stdev in the global environment, and R will happily use that value. I don’t have a solution to this other than trying to be careful. I wish R had a “strict” option or something that would make a function give an error, or at least a warning, if I use a variable that isn’t local. (There’s a sketch of this failure mode, and a partial check for it, below.)
- I tend to run into problems when I do relational operations across data frames. Say I have data on a bunch of buildings: frame1 has a list of buildings that have certain characteristics, and frame2 has a big list of lots of buildings and some data on each one. Then frame2[match(frame1$buildingname, frame2$buildingname),] gives me just the rows of frame2 that match the buildings in frame1. Right? Well, no: it should be frame2[match(frame2$buildingname, frame1$buildingname),]. Or was I right the first time? Anyway, this is a pitfall for me, but at least I know it and I know to be careful and make sure I get it right. Even so, I can run into trouble if I want to do further subsetting, like looking only at the rows of frame2 whose building name occurs in frame1 AND some other column of frame1 has a certain characteristic, or whose name occurs in frame1 and some other condition on frame2 is also true. I don’t have a full solution here, but I do some things to try to reduce the number of problems. The main one is that I try to put everything into a single data frame if at all possible. In the example here, I would probably create a new column in frame2 that indicates, for each building, whether it is listed in frame1 or not; and if there are additional columns in frame1, then I would propagate them into frame2 the same way (there’s a sketch of this below, after the list). This seems simple and obvious, and in the example I’ve given here it is indeed simple and obvious, but sometimes, when I have multiple data sources and maybe data structures that are more complicated than data frames, it’s not so easy.
- And finally, two solutions that help detect many kinds of problems. (a) Look at your tabulated data, and any tables of results, in tabular form. What I mean is, don’t just call read.table() and then immediately start referring to various rows and columns (by name or otherwise); open the data file, or write out the results from your work, and take a look. This is one thing Excel is really great for: you can read in a giant table and scroll around in it. In R, a wide data frame or matrix gets split up when you display it, so you can’t really see what is going on. (b) Make lots of plots. Many problems will show up as suspicious patterns: you’ll find that a whole column has the same value, or that something that should always be positive has some negative values somehow, or that a variable you thought you had normalized to 1 has some values over 1, or whatever. (A couple of quick examples of these checks appear right below.)
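Here are the code sketches promised above. All of them are rough illustrations rather than working analyses; the file names, column names, and data frames are made up. For item 1, assigning column names immediately after reading the file:

```
# hypothetical file with no header row
dat = read.table("buildings.csv", sep=",", header=FALSE)
# assign column names with the very next command...
names(dat) = c("buildingname", "floorarea", "kwh")
# ...and refer to columns by name from then on
mean(dat$kwh)
```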
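For item 2, defining the quantiles in exactly one place instead of in two nearly identical blocks of code (dataset1 and dataset2 stand in for the two datasets):

```
myquantile = function(datavec) {
  quantile(datavec, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
}
results1 = myquantile(dataset1)
results2 = myquantile(dataset2)
# the labels ("5%", "25%", ...) survive as column names here, so a glance
# at the output shows which percentiles were actually computed
rbind(results1, results2)
```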
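For item 3, here is the failure mode, plus a partial check. As far as I know there is no true “strict” option, but the codetools package that comes with R can at least list the variables a function picks up from outside itself:

```
stdev = 10    # some unrelated value sitting in the global environment

standardize = function(dat) {
  std = sqrt(var(dat))
  (dat - mean(dat)) / stdev   # oops: meant std, so R silently uses the global stdev
}

library(codetools)
findGlobals(standardize, merge = FALSE)$variables   # reports "stdev"
```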
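For item 4, folding everything into a single data frame. buildingname comes from the example above; the other column names (retrofit_year, kwh) are invented for illustration, and merge() is another way to do this kind of join:

```
# flag, within frame2, which buildings appear in frame1
frame2$in_frame1 = frame2$buildingname %in% frame1$buildingname

# propagate a column from frame1 into frame2 the same way;
# buildings not listed in frame1 get NA
frame2$retrofit_year =
  frame1$retrofit_year[match(frame2$buildingname, frame1$buildingname)]

# further subsetting now involves only one data frame
subset(frame2, in_frame1 & kwh > 0)
```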
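And for item 5, the kind of quick look I have in mind, nothing fancy, reusing the made-up dat and results objects from the sketches above:

```
# write intermediate results to a file you can scroll through in a spreadsheet
write.csv(results, "results_check.csv", row.names = FALSE)

# quick tabular and graphical sanity checks
summary(dat)                   # columns that are all one value, impossible ranges, etc.
hist(dat$kwh)                  # should everything here be positive?
plot(dat$floorarea, dat$kwh)   # suspicious patterns often jump out
```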
So, there’s a list of some of the common ways I screw up, and some of the things I do to try (sometimes unsuccessfully) to avoid those problems. I’d be interested in hearing other common pitfalls and solutions.