When the numbers don’t look right, check them! (Mississippi education update)
Statistical Modeling, Causal Inference, and Social Science 2025-12-04
Part 1: Reading what different sources say
The other day, as part of a long discussion about the estimated effects of Mississippi’s education plan, I quoted some education researchers, Wainer et al., who wrote:
The 2024 NAEP fourth-grade mathematics scores rank the state at a tie at 50th! The eighth-grade scores also qualify for 50th place.
I also quoted a different critic of the Mississippi claims, Ravitch, who wrote:
In math, [Mississippi’s test scores] zoomed from fiftieth to twenty-third. Adjusted for demographics, Mississippi now ranks near the top in fourth grade reading and math according to the Urban Institute’s America’s Gradebook report.
And I found this from the Wikipedia page on the Mississippi Miracle:
After adjusting for demographics, in 2024, Mississippi was the nation’s #1 state in Reading as well as in Mathematics.
I wrote, “But Wainer et al. say that Mississippi is tied for 50th in math. Can they really be worst in the nation, but best after demographic adjustment? I guess it’s possible.”
Part 2: Anomalies!
Wainer et al. said Mississippi’s 4th and 8th grade math scores were the nation’s worst in 2024.
Ravitch said their 4th-grade math scores have increased to 23rd in the nation and that they’re near the top when adjusted for demographics.
Wikipedia said that Mississippi’s math scores were best after adjusting for demographics.
So, Wainer et al. and Ravitch flat-out disagree on Mississippi’s absolute ranking in 4th-grade math; Ravitch and Wikipedia disagree slightly on the result after demographic adjustment (“near the top” or “the nation’s #1 state”); and I can’t be sure, but it also seems doubtful that a state could be #50 unadjusted and #1 after adjustment. As I wrote, it’s theoretically possible but it seems like a stretch.
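To see how a last-to-first flip could happen in principle, here is a minimal sketch with made-up numbers. Everything in it is an assumption for illustration: the three states, the two demographic groups, the shares, and the scores are invented, and the adjustment is simple direct standardization to a common 50/50 mix, which is not necessarily how the Urban Institute does it.

```python
# Hypothetical numbers, invented for illustration; not NAEP or Urban Institute data.
# Each state has a share of students in a lower-scoring demographic group and a
# mean score within each of the two groups.
states = {
    "M": {"share_low": 0.8, "low": 232.0, "high": 252.0},
    "X": {"share_low": 0.3, "low": 228.0, "high": 248.0},
    "Y": {"share_low": 0.5, "low": 230.0, "high": 250.0},
}

def raw_mean(s):
    """Average score using the state's own demographic mix."""
    return s["share_low"] * s["low"] + (1 - s["share_low"]) * s["high"]

def adjusted_mean(s, ref_share_low=0.5):
    """Average score if the state had the reference demographic mix
    (direct standardization; the 50/50 reference mix is an assumption)."""
    return ref_share_low * s["low"] + (1 - ref_share_low) * s["high"]

def ranking(score_fn):
    """State names ordered from highest to lowest score."""
    return sorted(states, key=lambda name: score_fn(states[name]), reverse=True)

raw_rank = ranking(raw_mean)
adj_rank = ranking(adjusted_mean)
print("raw ranking:     ", raw_rank)
print("adjusted ranking:", adj_rank)
```

In this toy example, state M scores highest within each group but has the most students in the lower-scoring group, so it comes in last on the raw average and first after adjustment. That shows the flip is arithmetically possible; it says nothing about whether it happened in the real data.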
Part 3: I do nothing.
One of my sayings is that an important characteristic of a good scientist is the capacity to be upset, to recognize anomalies for what they are, and to track them down and figure out what in our understanding is lacking.
In this case, though, I just let the anomaly sit there like a rotting fish. I went around it and I kept writing.
Why did I not explore this 4th-grade math test thing more closely? Partly because I didn’t have the data at hand. It turned out that a quick google was all that was needed, but I didn’t take that step. Another thing is that, in any investigation, many anomalies will come up (one of these was the average age of the students being tested; more on that below), and we can’t look into everything at once. In that way, it’s like an Agatha Christie-style mystery, where various inconsistencies and anomalies arise and are noted in turn, but then the story moves on, with the explanation happening later. The other day we saw the new Knives Out movie–it was really great! If the original Knives Out was a 10 and the sequel was a 3, this third installment was a solid 9–and it did that thing where anomalies would pop up and get discussed but then set aside. If you stopped the train at every anomaly, you’d never get to the destination.
And the math scores were not a key part of the story, so I just let my bafflement sit there and I did not follow up.
Part 4: Let’s look at the numbers.
In the discussion of our post, two commenters said that Wainer et al. were wrong on the math scores. Steve wrote:
You can look the data up on the 2024 NAEP report:
https://nces.ed.gov/nationsreportcard/
I have no idea how these researchers came up with these claims: “The 2024 NAEP fourth-grade mathematics scores rank the state at a tie at 50th! The eighth-grade scores also qualify for 50th place.”
My reading of the report is that Mississippi’s 8th grade math scores had trailed the national average by 18 points in 2000 but by only 3 points by 2024.
And SD wrote:
“The 2024 NAEP fourth-grade mathematics scores rank the state at a tie at 50th! The eighth-grade scores also qualify for 50th place.”
This is just literally made up
So I looked it up, and . . . yeah, Wainer et al. had it wrong! Here’s what it says on the NAEP page:
4th grade math: National avg 237, MS avg 239, above average!
8th grade math: National avg 272, MS avg 269, but rank is approx 35th, not 50th.
Also I went to the Urban Institute page to see their demographically adjusted numbers (“The demographics we use for the adjustment include gender, age, race or ethnicity, receipt of free and reduced-price lunch, special education status, and English language learner status”) for 2024:
4th grade math: MS 248.6, they are indeed #1!
8th grade math: MS 281.3, also #1!
You can make of this adjustment what you will. But, in any case, no way were they ranked #50. I contacted Wainer et al., and Dan Robinson, one of the authors on the paper, confirmed that this was a mistake and that they would remove those two sentences from their paper.
Part 5: Where are we now?
As I discussed a couple days ago, I’m coming at this from two directions.
On one side, Wainer, Grabovsky, and Robinson are experienced education researchers, and they are not impressed by the claimed large effects of Mississippi’s policies.
On the other side, Wainer et al. are making their arguments in general terms, and the specific numbers from Mississippi seem impressive. This “on the other side” point is even stronger when we consider that Wainer et al. based the math-score part of their argument on garbled numbers.
There’s also a political angle, which I did not discuss in my original post but which came up in the comments, and it’s interesting because both sides’ arguments have a politically conservative flavor. It’s a conservative vs. conservative battle. The proponents of the Mississippi plan offer the conservative argument that back-to-basics education works, also the conservative (in the U.S. context) argument that Mississippians are as good as anyone else. The skeptics of the Mississippi plan offer the conservative argument that there are no miracle cures, that schooling can’t do much to alter the natural order of things, and that government statistics can’t be trusted. I’m exaggerating the political slant in both directions here, but I do think that the arguments are taking place on conservative turf, which is interesting, and I guess reflects the discrediting in recent years of education practices associated with the left.
Before ending this discussion, though, I wanted to go back to the statistics. Not the details but more of a view from 30,000 feet.
– An intervention was done in Mississippi in the mid-2010s, and people studied state-level aggregate test scores before and after. Mississippi’s test scores improved a lot relative to the nation during this period. This was part of a longer-term improving trend.
– The estimates of the program’s effects are observational. There was no control group. The implicit control is to imagine that previous trends in the state would have continued, or that the trends in Mississippi would be like trends in other states afterward.
– We don’t have easily accessible data on individual students. Robinson asks, “For example, what students benefited most from the intervention? What happened to the scores of the retained students once they took the NAEP reading test again?”
– The critics were coming into this from a generally skeptical position based on their view of previous hype in the education field, also the clear statistical issue that if you delay the kids who are performing poorly on the test, averages will go up, also the lack of a control group. They did not do the work to quantify these concerns in this particular case, in part because relevant data were not easily accessible, but their distance from the details was a problem, as we could see with the gross error regarding the math tests.
– Mississippi’s average test scores have been going up. How much is this due to selection of who takes the test and when they take it, how much is due to changes in accommodations for disabilities (as discussed by Kelsey Piper in comments), and how much is due to targeted test preparation, I don’t know. It is a luxury of blogging that I can openly admit my uncertainty here.
– Stepping back, it’s clear to me why Wainer et al. remain skeptical, while Piper and other reporters have a more positive take on the Mississippi program.
– Finally, it’s not all about average test scores and it’s not all about the students being held back. I’m still thinking that a key outcome is reading and math ability at the time of school leaving. The idea of the program seems to be that if you hold some kids back a year, that will help them learn by keeping them in classes that are closer to the right level for them, and that this will also allow a higher level of education for the kids who are not held back. Some commenters also argued that the threat of being held back would motivate kids to learn more in third grade. I don’t know about that, but the point is that the problem is complicated enough that I can see the virtue of a “reduced-form” approach that just looks at effects on average test scores–but then you have to be concerned about the lack of control group and about compositional effects, which is where we started!
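The compositional concern in the bullets above (hold back the lowest scorers and the average of those tested goes up) can be sketched with a toy cohort. All the numbers here are hypothetical: a cohort of 40 evenly spaced scores, a 10% hold-back rate, and a 10-point gain from the extra year are assumptions chosen to make the arithmetic easy to follow, not estimates from Mississippi.

```python
from statistics import mean

# Hypothetical cohort: the scores 40 students would get if all were tested
# in 4th grade on schedule (200, 202, ..., 278).
cohort = [200 + 2 * i for i in range(40)]

HOLD_BACK_FRACTION = 0.10  # assumption: lowest-scoring 10% repeat 3rd grade
EXTRA_YEAR_GAIN = 10       # assumption: points gained from the repeated year

n_held = round(len(cohort) * HOLD_BACK_FRACTION)
ranked = sorted(cohort)
held_back, promoted = ranked[:n_held], ranked[n_held:]

# No retention policy: everyone is tested on schedule.
no_retention = mean(cohort)

# First year of the policy: held-back students are missing from the
# 4th-grade test, and nobody has yet arrived from an earlier held-back group.
first_year = mean(promoted)

# Later years: each 4th-grade class is this year's promoted students plus
# last year's held-back students, now a year older. Cohorts are assumed identical.
steady_state = mean(promoted + [s + EXTRA_YEAR_GAIN for s in held_back])

print(f"no retention:  {no_retention:.1f}")
print(f"first year:    {first_year:.1f}")
print(f"steady state:  {steady_state:.1f}")
```

In this sketch the first-year jump is pure selection: the same kids, minus the ones who were delayed. In later years the held-back kids return to the tested pool, and what remains of the bump depends entirely on the assumed gain from the extra year. If you set `EXTRA_YEAR_GAIN` to 0, the steady-state average equals the no-retention average. How the real numbers split between these pieces is exactly the thing I don’t know.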
Part 6: Summary
– I should’ve looked into those math-score claims more carefully! Once I noticed the discrepancy between different reports, that was the time to track down what was happening. I’ve criticized statisticians for just accepting unreasonable numbers without checking, so bad on me for sloppiness here.
– As before, I don’t have a strong take on what’s happening in Mississippi. I see good arguments on both sides and no easy way to resolve them. My Bayesian inclination is to split the difference and say there’s some evidence that these policies are working but not to the extent that is advertised, but I don’t really know. Indeed, I can think of this Bayesian splitting of the difference as a kind of frequentist procedure in the sense that, on average, I think we will do well by splitting the difference in this sort of dispute. In any given problem, I’ll often come down stronger on one side or another (as here, for example), but in this case, nah, I don’t really have more for you.