How to Consume Big Data

Numbers Rule Your World 2013-07-15

Over at the McGraw-Hill blog, I wrote about how to consume Big Data (link), which is the core theme of my new book. In that piece, I highlight two recent instances in which bloggers demonstrated numbersense in vetting other people's data analyses. (Since the McGraw-Hill link is not working as I'm writing this, I placed a copy of the post here in case you need it.)

Below is a detailed dissection of Zoë Harcombe's work.

***

Eating red meat makes us die sooner! Zoë Harcombe didn’t think so.

In March 2012, nutritional epidemiologists from Harvard University circulated new research linking red meat consumption with increased risk of death. All major mass media outlets ran the story, with headlines such as “Risks: More Red Meat, More Mortality.” (link) This high-class treatment is typical, given Harvard’s brand, the reputation of the research team, and the pending publication in a peer-reviewed journal. Readers were told that the finding came from large studies with hundreds of thousands of subjects, and that the researchers “controlled for” other potential causes of death.

Zoë Harcombe, an author of books on obesity, was one of the readers who did not buy the story. She heard that noise in her head when she reviewed the Harvard study. In a blog post titled “Red Meat & Mortality & the Usual Bad Science,” (link) Harcombe outlined how she determined the research was junk science.

How did Harcombe do this?

Alarm bells rang in her head because she had seen similar studies in which researchers commit what I call “causation creep.” (link)

    Harcombe_cite1

She then reviewed the two studies used by the Harvard researchers, looking especially for the precise definition of red meat consumption, the key explanatory variable. She discovered that the data came from dietary questionnaires administered every four years (this meant subjects who didn’t answer this question would have been dropped from the analysis). All subjects were divided into five equal-sized groups (quintiles) based on the amount of red meat consumption. Surprisingly, “unprocessed red meat” included pork, hamburgers, beef wraps, lamb curry, and so on. This part was checking off the boxes; it didn’t reveal anything too worrisome.

Harcombe suspected that the Harvard study did not prove causation, but she needed more than just a hunch. She found plenty of ammunition in Table 1 of the paper. There, she learned that the cohort of people who reported eating more red meat also reported higher levels of unhealthy behaviors, including more smoking, more drinking, and less exercise. For example,

    Harcombe_cite2

The researchers argue that their multivariate regression analysis “controlled for” these other known factors. But Harcombe understands that when effects are confounded, it is almost impossible to disentangle them. For instance, if you compare two school districts, one in a rich neighborhood and the other in a poor neighborhood, then race and income will be confounded, and there is no way to know whether the difference in educational outcomes is due to income or due to race.
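The school-district point can be made concrete with a tiny sketch. In the hypothetical data below (all numbers invented for illustration), income and race are perfectly correlated across the two districts, so a model that attributes the entire test-score gap to income fits exactly as well as one that attributes it entirely to race; the data cannot tell the two stories apart.

```python
# Hypothetical illustration of confounding: two districts where income and
# race move in lockstep, so their effects cannot be separated by the data.
# District A: high income, majority group; District B: low income, minority group.
# Each record: (income, race, test_score), with income and race coded 0/1.
data = [(1, 1, 85)] * 50 + [(0, 0, 70)] * 50

def predict(income, race, b0, b_income, b_race):
    return b0 + b_income * income + b_race * race

# Model 1 attributes the whole 15-point gap to income; Model 2 to race.
model_income = dict(b0=70, b_income=15, b_race=0)
model_race   = dict(b0=70, b_income=0,  b_race=15)

# Sum of squared errors for each model
sse_income = sum((s - predict(i, r, **model_income)) ** 2 for i, r, s in data)
sse_race   = sum((s - predict(i, r, **model_race)) ** 2 for i, r, s in data)

print(sse_income, sse_race)  # both 0: each story fits the data perfectly
```

Because both models achieve zero error, no amount of regression machinery applied to these data can say which factor drives the outcome; the same logic applies when smoking, drinking, exercise, and red meat consumption all rise together.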

Next, Harcombe looked for data to help her interpret the researchers’ central claim:

Unprocessed and processed red meat intakes were associated with an increased risk of total, CVD, and cancer mortality in men and women in the age-adjusted and fully adjusted models. When treating red meat intake as a continuous variable, the elevated risk of total mortality in the pooled analysis for a 1-serving-per-day increase was 12% for total red meat, 13% for unprocessed red meat, and 20% for processed red meat.

Her first inquiry was about the baseline mortality rate, which was 0.81%. Twenty percent of that is 0.16%, so roughly speaking, if you decide to take an extra serving of processed red meat every day, you face a less-than-2-out-of-1,000 chance of earlier death. (Whether the earlier death is due to the red meat or just to more food consumed each day is another instance of confounding.)

This also raises the issue of error bars. As Gary Taubes explained in his response to the red-meat study (link), serious epidemiologists only pay attention to effects of 300% or higher, acknowledging the limitations of the types of data being analyzed. The 12- or 20-percent effect does not inspire much confidence.

The researchers were overly confident in the statistical models used to analyze the data, Harcombe soon learned. She was able to find the raw data, allowing her to compare them with the statistically adjusted data. Here is one of her calculations.

    Harcombe_cite3

The five columns represent quintiles of red meat consumption from lowest (Q1) to highest (Q5). The last row (“Multivariate”) shows the adjusted death rates with Q1 set to 1.00. The row labeled “Death Rate (Z)” is a simple calculation performed by Harcombe, without adjustment. The key insight is that Harcombe’s line is U-shaped while the multivariate line is monotonically increasing.
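Harcombe’s check can be sketched in a few lines. The quintile counts and adjusted figures below are invented for illustration (her actual numbers are in her post and the paper’s tables); the point is the procedure: compute crude death rates, index them to Q1 = 1.00, and compare their shape with the model’s output.

```python
# Hypothetical illustration of the crude-vs-adjusted comparison.
# Deaths and person counts per quintile are invented for illustration only.
deaths  = [820, 760, 790, 850, 940]
persons = [75000, 75000, 75000, 75000, 75000]

# Crude death rate per quintile, then indexed so that Q1 = 1.00,
# matching the convention of the "Multivariate" row.
crude = [d / n for d, n in zip(deaths, persons)]
crude_indexed = [round(r / crude[0], 2) for r in crude]

# Invented stand-in for the model-adjusted relative risks (monotone by design).
adjusted = [1.00, 1.08, 1.15, 1.23, 1.30]

def is_monotone_increasing(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

print(crude_indexed)  # dips below 1.00 in the middle quintiles: a U shape
print(is_monotone_increasing(crude_indexed), is_monotone_increasing(adjusted))
```

With numbers like these, the raw data trace out a U while the adjusted series climbs steadily, which is exactly the divergence that tells you where the data end and the model takes over.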

The purpose of this analysis is not to debunk the research. What Harcombe did here is delineate where the data end and where the model assumptions take over. One of the themes of Numbersense is that every analysis combines data with theory. Knowing which is which is half the battle.

At the end of Harcombe’s piece, she checked the incentives of the researchers.

    Harcombe_cite4

Harcombe did really impressive work here, and her blog post is highly instructive on how to analyze data analysis. Chapter 2 of Numbersense looks at the quality of data analyses of the obesity crisis.

***

Reminder: You can win a copy of my new book. See here for details.