Why no serious researchers conduct "per protocol" analyses
Numbers Rule Your World 2022-10-26
All observational data must be adjusted to correct for biases. But our science teems with irresponsible statistical adjustments: all too often, researchers throw a bunch of variables into a regression model and declare that all known biases have been satisfactorily resolved.
In the medical literature, researchers rarely publish their adjustment models. I don't trust any adjusted model unless the researcher discloses the equations, the coefficients, and the goodness-of-fit statistics. Even better, the researcher should compare the adjusted results to the unadjusted ones, note how the adjustments shifted the outcome, and explain why the adjustments make sense.
Handwaving about the structure of the adjustment model, or pointing to other papers that supposedly use similar methods, suggests that the researchers lack confidence in their own methods.
In this blog post, I demonstrate how to evaluate statistical adjustments, using the recently released, highly controversial results from the colonoscopy clinical trial. (See my related writeup here.) It took a lot of work to put this together, because the disclosure of details in the paper is scarce (this is, unfortunately, the norm rather than the exception).
***
The colonoscopy trial was the first randomized clinical trial (RCT) conducted to measure the effect of colonoscopy screening on the risks of diagnosis of, or death from, colon cancer. The trial enrolled two groups of participants: those invited to screening, and a "usual care" group, which means no screening (the standard practice in Poland, Norway and Sweden, where the trial took place).
For this post, I focus only on the risk of colon cancer diagnoses (i.e. cases), because the disclosure on mortality outcomes is even worse, making it impossible to judge the adjustment procedure. The trial specifies an intent-to-screen analysis (more generally known as intent-to-treat, ITT). The ITT analysis shows that 0.93% of the invited group were diagnosed during the 10-year follow-up window, compared to 1.13% of the usual-care group. This result is described as an 18% reduction in the risk of colon cancer, even though the difference in diagnosis rates is only 0.2 percentage points; that is to say, out of 1,000 people invited to colonoscopy, only two fewer cases will be detected relative to usual care.
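For readers who want to check the arithmetic, here is the relative-vs-absolute calculation in a few lines of Python (the rates are the ones quoted above; everything else is just arithmetic):

```python
# Headline numbers from the trial's ITT analysis (rates quoted above)
invited_rate = 0.0093      # 0.93% of the invited group diagnosed
usual_care_rate = 0.0113   # 1.13% of the usual-care group diagnosed

relative_reduction = 1 - invited_rate / usual_care_rate
absolute_reduction = usual_care_rate - invited_rate

print(f"Relative risk reduction: {relative_reduction:.0%}")    # ~18%
print(f"Absolute risk reduction: {absolute_reduction:.2%}")    # 0.20%
print(f"Fewer cases per 1,000 invited: {absolute_reduction * 1000:.0f}")  # ~2
```

The same 0.2-percentage-point gap produces an impressive-sounding "18%" when expressed as a relative reduction.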
As discussed in the previous blog post, the medical profession was none too happy with this outcome because doctors, especially American ones, have convinced themselves that colonoscopy could reduce both cases and deaths by 70% or more. Neither were the authors of the paper disclosing the trial results - in the section of the paper in which they discussed the findings, they did not mention the primary endpoint but talked only about an "adjusted per-protocol" analysis, which moved the reported risk reduction from 18% to 31%. Still far from the 70% previously believed, but deemed more palatable.
The media then mis-reported this as a "per-protocol" analysis, dropping the all-important word "adjusted". By the end of this post, you'll understand why that word matters.
***
Intent-to-treat vs per-protocol is a long-running debate, but any serious researcher goes with the former, which is exactly what the designers of the colonoscopy trial did. Let's look at this debate using a marketing example.
Let's say Target is running a Black Friday promotion in which online customers get an extra 30% discount if they type the code SAVE30 when checking out. In order to measure the effectiveness of this promotion, Target data scientists design an A/B test in which shoppers are randomly divided into the exposed group and the non-exposed group upon visiting Target's website. They intend to analyze the results of this experiment by looking at metrics such as proportion of visitors who made a purchase, and average purchase amount, comparing the two groups. This analysis is known as an intent-to-treat (ITT) analysis.
Ouch! The ITT analysis showed that the additional sales generated by exposure to the SAVE30 promotion failed to cover the cost of giving out these extra discounts. Inevitably, there will be at least one business manager who complains that the wrong analysis was done. They argue that we should not include people who did not notice the promotion: being exposed to the promotion just means Target shows notifications while the customers browse around, and some fraction of customers do not pay attention. So, what do they think is the right analysis? We should analyze only those people who we know for sure noticed the promotion - in other words, only those who entered the promotion code at checkout (or other variants, such as only those who clicked on a promotion banner). This style of analysis is "per protocol".
Why do all serious researchers shy away from per-protocol analyses? In this example, everyone in the exposed group who entered the promotion code at checkout is someone who is surely, or almost surely, making a purchase. In the non-exposed group, there is no such thing as entering the promotion code, so we have effectively narrowed the exposed group down to near-certain buyers while the unexposed group still contains the average online shopper. Which group will show better metrics? And will the exposed group's better results be due to the promotion, or to inherent differences in group composition?
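To make the selection bias concrete, here is a minimal simulation sketch. All the numbers are made up for illustration: the promotion has zero true effect, yet the per-protocol comparison reports a huge lift, simply because code-users are the most purchase-prone shoppers.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Made-up setup: "eager" shoppers (20% of visitors) buy 30% of the
# time; everyone else buys 5% of the time. Only eager shoppers bother
# to enter the promotion code. The promotion itself changes NOTHING.
eager = rng.random(n) < 0.20
buy_prob = np.where(eager, 0.30, 0.05)

exposed = rng.random(n) < 0.50      # random assignment, as in an A/B test
bought = rng.random(n) < buy_prob   # unaffected by exposure

# ITT: compare everyone as randomized -- correctly finds no difference.
itt_lift = bought[exposed].mean() / bought[~exposed].mean() - 1

# Per protocol: keep only exposed shoppers who entered the code (the
# eager ones) and compare them against ALL unexposed shoppers.
used_code = exposed & eager
pp_lift = bought[used_code].mean() / bought[~exposed].mean() - 1

print(f"ITT lift:          {itt_lift:+.1%}")   # ~0%
print(f"Per-protocol lift: {pp_lift:+.1%}")    # ~+200%, entirely spurious
```

The per-protocol "lift" is purely an artifact of who self-selects into using the code.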
***
Back to the colonoscopy trial. In the per-protocol analysis, the data analyst threw out roughly half of the invited-to-screen group because they did not "comply". The reasoning is that they could not have benefited from the screening (read: SAVE30) and thus drag down the measured efficacy. But "compliers" typically have different characteristics from "non-compliers", and so per-protocol analysis destroys the randomization that puts the "randomized" in randomized controlled trial. In other words, the complier subset differs from the usual-care group not just in being screened but along unknowable dimensions. We've turned the RCT into an observational study.
People who advocate PP analysis say not to worry - they will throw a bunch of demographic variables into a regression model, and all biases will disappear. I find this quite fascinating. If such adjustment strategies can indeed cure all biases, why would the community embrace RCTs as the gold standard? It would appear that observational data blended in a mixer known as regression modeling is just as good as a randomized experiment!
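Here is a sketch of why I find the claim dubious. This is not the trial's actual adjustment model (which was not disclosed); it is a made-up example in which compliance and the outcome both depend on an unrecorded trait, so adjusting for the recorded demographic variable removes essentially none of the bias.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical setup: "health-consciousness" is never recorded, yet it
# drives BOTH compliance with screening AND the outcome; age is recorded.
health_conscious = rng.random(n) < 0.5
age = rng.normal(60, 8, n)
screened = rng.random(n) < np.where(health_conscious, 0.8, 0.2)

# The true effect of screening here is ZERO: the outcome depends only
# on the hidden trait and (weakly) on age.
p_outcome = np.where(health_conscious, 0.01, 0.02) + 0.0002 * (age - 60)
outcome = rng.random(n) < p_outcome

# "Adjusted" model: outcome ~ screened + age (age centered for stability).
X = np.column_stack([screened, age - 60])
fit = LogisticRegression().fit(X, outcome)
print(f"Adjusted log-odds for screening: {fit.coef_[0][0]:+.2f}")
# Comes out clearly negative (about -0.4 in this setup): screening
# "looks" protective despite a true effect of zero, because the
# adjustment variable (age) cannot absorb the hidden confounder.
```

No matter how many recorded variables go into the model, the bias from the unrecorded ones remains.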
In fact, the colonoscopy controversy clearly shows the opposite. Prior observational studies had consistently shown extremely positive results, with risk reductions of 70% or so, yet the first time a research team conducted an RCT, the outcome came in below 20%. If anything, this proves that even carefully analyzed observational studies cannot replace the gold-standard RCT.
***
Even if all biases were cured and we could believe the PP analysis, the real-world results would be guaranteed to disappoint.
Let's go back to the Target example. Let's say the PP analysis showed that customers who used the SAVE30 promotion were 80% more likely to complete a transaction than customers who were not shown the promotion. So now, the head of marketing tells everyone that if all shoppers are shown the promotion, the experiment "proves" they should see an 80% jump in the number of transactions.
Of course, reality bites. The problem is that the 80% incremental sales apply only to the subset who used the promotion code. If only 10% of those exposed used the code during the A/B test, then only about 10% of shoppers would contribute the 80% jump; the other 90% are exposed but never use the code. The overall lift would be roughly 10% of 80%, i.e. about 8%.
If half the trial population elected not to take the colonoscopy screening during the trial, it is likely that half the general population would make the same decision if colonoscopy were rolled out to all. The 30% reduction in colon cancer risk would still apply only to the subset choosing to be screened, and would not make a dent in those who forgo it.
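A back-of-envelope dilution calculation, assuming compliers and non-compliers have similar baseline rates, is simply the compliance rate times the per-protocol effect:

```python
# Rough dilution arithmetic (round numbers from the examples above)
code_usage, pp_lift = 0.10, 0.80       # Target example
print(f"Expected overall sales lift: ~{code_usage * pp_lift:.0%}")  # 8%, not 80%

compliance, pp_reduction = 0.50, 0.30  # colonoscopy example
print(f"Expected population-level risk reduction: ~{compliance * pp_reduction:.0%}")  # 15%, not 30%
```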
***
It looks like I have to defer showing you my analysis till the next post. The plan is to compare the ITT analysis and the "adjusted PP analysis", pointing out how the adjustments affected the outcomes. Come back tomorrow.
(By the way, the Covid-19 vaccine trials all relied on per-protocol analyses as well, even though such analyses are problematic.)