Another one of those “Psychological Science” papers (this time on biceps size and political attitudes among college students)

Statistical Modeling, Causal Inference, and Social Science 2013-05-29

Paul Alper writes:

Unless I missed it, you haven’t commented on the recent article of Michael Bang Petersen [with Daniel Sznycer, Aaron Sell, Leda Cosmides, and John Tooby]. It seems to have been reviewed extensively in the lay press. A typical example is here. This review begins with “If you are physically strong, social science scholars believe they can predict whether or not you are more conservative than other men…Men’s upper-body strength predicts their political opinions on economic redistribution, they write, and they believe that the link may reflect psychological traits that evolved in response to our early ancestral environments and continue to influence behavior today. . . . they surveyed hundreds of people in America, Denmark and Argentina about bicep size, socioeconomic status, and support for economic redistribution.”

Further, “Despite the fact that the United States, Denmark and Argentina have very different welfare systems, we still see that — at the psychological level — individuals reason about welfare redistribution in the same way,” says Petersen. “In all three countries, physically strong males consistently pursue the self-interested position on redistribution.

“Our results demonstrate that physically weak males are more reluctant than physically strong males to assert their self-interest — just as if disputes over national policies were a matter of direct physical confrontation among small numbers of individuals, rather than abstract electoral dynamics among millions.”

However, the actual journal article and its supplemental material show how shaky the all-encompassing conclusions are. For example, R-sq for each of the three countries is very low. The regression line and 90% confidence intervals drawn for each of the three countries are devoid of the individual data points and thus give no visual sense of the variability. SES assessment was highly subjective and different in each country. In the U.S. and Argentina, the study relied on college students measuring college students’ biceps, while in Denmark “a protocol was devised and presented to the subjects over the internet instructing them on how to measure their biceps correctly.” As to SES, the questions used were different in each country, supposedly justified by taking “into account country-specific factors regarding the political discussions on redistribution.”

So, while the study’s conclusion may in fact be valid, is this yet another example of overreach by social scientists, and of embrace by a lay press that seeks eye-catching studies?

My reply: This article is worth discussing, partly because it appears to be of higher quality than the other Psychological Science article we discussed recently. That other paper was full of holes, and its claimed effect size was so large as to be utterly implausible. In particular, that other article claimed to find huge within-person effects (different attitudes for women in different parts of their menstrual cycles) but estimated them entirely with a between-person study. In contrast, the Petersen et al. article linked to above is much more sensible, claiming only between-person effects.

To be more specific, they claim that physically strong low-SES 21-year-old men are more likely to favor income redistribution, compared to physically weak low-SES 21-year-old men. They don’t suggest that going to the gym will make an individual young man more in favor of redistribution; they’re only making a claim about differences in the population.

Here are my reactions to the Petersen et al. paper:

1. As noted above, the correlations in the paper do not seem completely unreasonable; that is, it seems possible that they apply to the general population of 21-year-old men, not just to the small samples analyzed in the study.

2. The statistical evidence is not as clear as the authors seem to think (given the declarative, non-caveated style of the article’s abstract and conclusions). Most obviously, the authors report a statistically significant interaction with no statistically significant main effect. But had they seen the main effect (in either direction), I’m sure they could’ve come up with a good story for that too. That’s fine—they should report what they saw in the data—but the p-values don’t have quite the clean interpretation implied by the presentation.
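To see the distinction concretely, here is a minimal sketch in Python (using statsmodels, with simulated data standing in for the study’s; the variable names strength, ses, and support are my own, not the paper’s): a regression can show a clear interaction while the corresponding main effect sits near zero, and the two need to be read together.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data standing in for the study's measurements (not the real data):
# an interaction between strength and SES, with no main effect of strength.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"strength": rng.normal(size=n), "ses": rng.normal(size=n)})
df["support"] = 0.3 * df["strength"] * df["ses"] + rng.normal(size=n)

# 'strength * ses' expands to strength + ses + strength:ses.
fit = smf.ols("support ~ strength * ses", data=df).fit()
print(fit.summary())  # the 'strength:ses' row comes out significant here,
                      # while the 'strength' main effect hovers near zero
```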

3. There also appear to be some degrees of freedom involved in the measurement. From the supplementary material:

The interaction effect is not significant when the scale from the Danish study are used to gauge the US subjects’ support for redistribution. This arises because two of the items are somewhat unreliable in a US context. Hence, for items 5 and 6, the inter-item correlations range from as low as .11 to .30. These two items are also those that express the idea of European-style market intervention most clearly and, hence, could sound odd and unfamiliar to the US subjects. When these two unreliable items are removed (alpha after removal = .72), the interaction effect becomes significant.

The scale measuring support for redistribution in the Argentina sample has a low α-level and, hence, is affected by a high level of random noise. Hence, the consistency of the results across the samples is achieved in spite of this noise. A subscale with an acceptable α=.65 can be formed from items 1 and 4.
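For readers unfamiliar with these diagnostics: Cronbach’s α is a simple function of the item variances and the variance of the summed scale, so item-dropping decisions like those quoted above are mechanically easy to make and remake. A minimal sketch (the six-item data here are simulated, not the study’s):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) array."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Simulated 6-item scale: the last two items are noisier, as in the quoted
# passage about items 5 and 6. Not the study's data.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 1))
items = np.hstack([latent + rng.normal(scale=1.0, size=(300, 4)),
                   latent + rng.normal(scale=3.0, size=(300, 2))])

print(cronbach_alpha(items))          # alpha for the full scale
print(cronbach_alpha(items[:, :4]))   # alpha with the two noisy items dropped
```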

Lots of options in this analysis. Again, these decisions may make perfect sense but they indicate the difficulty of taking these p-values at anything like face value. As always in such settings, the concern is not a simple “file-drawer effect” that a particular p-value was chosen out of some fixed number of options (so that, for example, a nominal p=0.003 should really be p=0.03) but that the data analysis can be altered at so many different points under the knowledge that low p-values are the goal. This can all be done in the context of completely reasonable scientific goals.
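A toy simulation makes the general mechanism clear (this is not a model of this particular paper, just of the problem): an analyst who can choose among several candidate scales and keep whichever one “works” will clear p < .05 far more often than 5% of the time, even when nothing is going on.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n, n_scales = 2000, 100, 5
hits = 0
for _ in range(n_sims):
    x = rng.normal(size=n)                   # predictor with no true effect
    scales = rng.normal(size=(n, n_scales))  # five candidate outcome scales
    pvals = [stats.pearsonr(x, scales[:, j])[1] for j in range(n_scales)]
    hits += min(pvals) < 0.05                # keep the best-looking analysis
print(hits / n_sims)  # about 1 - 0.95**5 = 0.23, not the nominal 0.05
```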

4. My first reaction on seeing an analysis of young men’s bicep size is that this could be a proxy for age. And, indeed, for the analyses from the two countries where the samples were college students, when age is thrown into the model, the coefficient for bicep size (or, as the authors put it, “upper-body strength”) goes away.

But then comes the big problem. The key coefficient is the interaction between bicep size and socioeconomic status. But the analyses don’t adjust for the interaction between age and socioeconomic status. Now, it’s well known that political attitudes and political commitments change around that time: people start voting, and their attitudes become more partisan. I suppose Petersen et al. might argue that all this is simply a product of changing upper-body strength, but to me such a claim would be more than a bit of a stretch.

This is a general problem with the language of regression modeling, which leads researchers to think that including a variable in a regression “controls for it” so that they can interpret the remaining coefficients causally.
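Here is a sketch of what the missing adjustment would look like (again with simulated data and my own variable names, not the paper’s): when strength is correlated with age and the real action is in age × SES, the strength × SES coefficient soaks up the age pattern until age and its interaction with SES enter the model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated college sample: strength tracks age, and it is the age x SES
# interaction (not strength x SES) that drives attitudes. Not the real data.
rng = np.random.default_rng(3)
n = 300
age = rng.uniform(18, 24, size=n)
ses = rng.normal(size=n)
strength = 0.5 * (age - 21) + rng.normal(size=n)
support = 0.3 * (age - 21) * ses + rng.normal(size=n)
df = pd.DataFrame(dict(age=age, ses=ses, strength=strength, support=support))

# The paper's style of model: strength x SES only.
m1 = smf.ols("support ~ strength * ses", data=df).fit()
# Adding age and the age x SES interaction.
m2 = smf.ols("support ~ strength * ses + age * ses", data=df).fit()
print(m1.params["strength:ses"], m2.params["strength:ses"])  # second shrinks
```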

5. I agree with Alper that the authors should’ve presented raw data. For example, Figure 1 could easily have included points showing average support for income redistribution for respondents broken into bins characterized by SES and bicep size. The dots could be connected into lines, so that each of their graphs would show three lines of average attitude vs. SES for respondents in the lower, middle, and upper terciles of bicep size. Such a graph would still have the problem of being contaminated by the correlation between age and bicep size, but at least it would show the basic patterns in the data.
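Such a binned plot is easy to produce; a sketch using pandas and matplotlib on simulated stand-in data (the real data are not public, as discussed below):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulated stand-in data (not the study's).
rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({"strength": rng.normal(size=n), "ses": rng.normal(size=n)})
df["support"] = 0.3 * df["strength"] * df["ses"] + rng.normal(size=n)

# Bin respondents: terciles of bicep size, quintiles of SES.
df["tercile"] = pd.qcut(df["strength"], 3, labels=["weak", "middle", "strong"])
df["ses_bin"] = pd.qcut(df["ses"], 5)

# One line per tercile: average support within each SES bin.
means = df.groupby(["tercile", "ses_bin"], observed=True)["support"].mean()
for tercile in ["weak", "middle", "strong"]:
    sub = means.loc[tercile]
    plt.plot([iv.mid for iv in sub.index], sub.values, marker="o", label=tercile)
plt.xlabel("SES (quintile midpoints)")
plt.ylabel("Average support for redistribution")
plt.legend(title="Bicep-size tercile")
plt.show()
```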

6. Finally, the authors engage in the usual tabloid practice of dramatically overselling their findings. What they actually found were some correlations among three samples, two of which were of college students. But their abstract says nothing about college students, instead presenting their claims entirely generally, referring only to “men,” never to “young men” or “students.” And then there is the causal language. The abstract is clean here (they use the term “predicted” rather than “caused”), but later on they unambiguously write, “Does upper-body strength influence support for economic redistribution in men? Yes.” Such a statement is simply wrong. Or, to be more precise, it could be correct, but it’s not good scientific practice to make such a causal claim based on a correlation. Later they write, “Does upper-body strength influence support for economic redistribution in women? No.” This statement is even more wrong: even if you accept their causal interpretation, lack of statistical significance is not the same as a zero effect.

Then they go deep into story time: “the results indicate that physically stronger males (rich and poor) are more prone to bargain in their own self-interest . . .” Also recall that quote above, where Petersen claimed that their results say something about how “individuals reason about welfare redistribution.”

And, from the conclusion of the paper, here are all the overstatements at once:

We showed that upper-body strength in modern adult men influences their willingness to bargain in their own self-interest over income and wealth redistribution. These effects were replicated across cultures and, as expected, found only among males. The effects in the Danish sample were especially informative because it was a large and representative national sample.

Actually, I didn’t see anything in the data about bargaining, nor are the causal claims supported by the analysis, nor do college students represent “modern adult men.” A careful reader may have noticed that the U.S. and Argentina samples were students, but the authors managed to get through the abstract, intro, and conclusion without mentioning this restriction.

Again, I don’t think any malign intent among the authors is required here. They believe their story (of course they do, that’s why they’re putting in the effort to study it), and so it’s natural for them, when reflecting on problems of measurement, causal identification, and representativeness of the sample, to see these as minor nuisances rather than as fundamental challenges to their interpretation of their data.

I have mixed feelings about criticizing this sort of study

On one hand, it’s a seriously flawed exercise in headline bait that is presented as scientifically definitive. On the other hand, you have to start somewhere. In the modern academic environment with the option of immediate publication, it’s too much to expect that a group of researchers would sit quietly for years re-designing and replicating their experiments, looking at all their claims with a critical eye, and plugging all the holes in their arguments before submitting their paper for publication. Indeed, arguably it’s even better to publish these sorts of preliminary results right away so as to engage the larger scientific research community (and to get the sort of free post-publication peer review you’re seeing right here). Ideally, I think, they’d publish this sort of thing in PLoS ONE, with space in a top journal such as Psychological Science reserved for more careful work. Or, if Psychological Science really wants to publish this material (cutting-edge research and all that), it could have a section of each issue clearly labeled Speculations, so that media and other outsiders wouldn’t be misled into taking the article’s claims too seriously.

Just to say it again: it’s easy for me to stand on the sidelines taking potshots at other people’s work. It’s pretty clear that the work of people like me, who stand around criticizing statistical analyses, is relevant only in the context of the work of the people who do the actual research studies. The question is: how can statistical understanding better work its way into the applied research community? Traditionally we rely on referees to point out issues of measurement, causal identification, and representativeness—but that approach clearly isn’t working here. This blog provides some post-publication peer review, but it’s not attached to the article itself. I could try writing this up as a letter to the editor for the journal, but my impression is that editors don’t like to run critiques of papers they have published. And, again, there’s an important role in science for speculative studies. It would be a mistake for the system to be run so strictly that only airtight findings get published.

Public data

The other big, big thing is that the data should already be public. As discussed above, one can come up with all sorts of explanations of their findings in this paper, and it would be good if other researchers (or even bloggers!) could try out their own analyses. The survey data could be anonymized (if they are not already) so that confidentiality wouldn’t be an issue.
