What is early termination? It is just what it sounds like: stopping the trial before enrolling the pre-planned number of subjects. This presupposes that the investigators were explicit, during planning, about how many subjects they would need to enroll. That number comes from the power calculation, which is based on the anticipated effect size and the uncertainty in that effect. Termination can happen for efficacy (the intervention works so splendidly that it becomes unethical not to offer it to everyone), for safety (the intervention is so dangerous that it becomes unethical to offer it to anyone) or for other reasons (e.g., recruitment is taking too long).
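
For readers who have never seen one, here is a minimal sketch of a power calculation for a trial comparing two event rates, using the textbook normal-approximation formula. The event rates below are hypothetical inputs chosen for illustration (they are not the semi-recumbent trial's own assumptions), and `n_per_group` is a name I made up; real protocols typically rely on dedicated software and may add continuity corrections or allow for dropouts.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p_control: float, p_treatment: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of subjects per arm needed to detect a difference
    between two event rates with a two-sided test (normal approximation)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = std_normal.inv_cdf(power)           # about 0.84 for 80% power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment             # anticipated effect size
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical event rates, chosen only for illustration
print(n_per_group(0.25, 0.10))  # 97 per arm
print(n_per_group(0.25, 0.20))  # a smaller anticipated effect needs far more
```

Shrinking the anticipated difference from 15 to 5 percentage points raises the required enrollment more than tenfold, which is why the effect-size assumption carries so much weight.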

Who makes the decision to terminate early, and how is the decision made? Well, under the best of circumstances, there is a Data Safety Monitoring Board (DSMB), a body set up specifically to look at the data at certain points in the recruitment process and to check for certain pre-specified differences between groups. The DSMB is firewalled from both the investigators and the patients. The interim looks at the data should also be pre-specified in the protocol, because the number of looks influences the initial power calculation: the more often you look, the more likely you are to find a difference by chance alone.
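
To see why the number of looks matters, here is a small simulation. It is a sketch under simplifying assumptions: the test statistic is approximately normal, the looks are equally spaced, and there is truly no difference between the arms. Each simulated trial is checked at every look against a fixed cutoff; the function name and its defaults are my own. The 2.413 value is Pocock's published constant boundary for five looks at an overall two-sided alpha of 0.05, included only to show what a pre-specified correction buys you.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_looks: int, critical_z: float,
                        n_trials: int = 20_000, seed: int = 1) -> float:
    """Share of simulated null trials (no true difference between arms)
    that cross |Z| > critical_z at any of n_looks equally spaced looks."""
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    stopped = 0
    for _ in range(n_trials):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            # Each stage contributes an independent N(0, 1) increment, so the
            # cumulative Z statistic at look k is running_sum / sqrt(k).
            running_sum += rng.gauss(0.0, 1.0)
            if abs(running_sum / look ** 0.5) > critical_z:
                stopped += 1  # "significant" by chance: trial would stop
                break
    return stopped / n_trials


nominal = NormalDist().inv_cdf(0.975)   # 1.96, the usual single-look cutoff
print(false_positive_rate(1, nominal))  # close to 0.05
print(false_positive_rate(5, nominal))  # roughly 0.14
print(false_positive_rate(5, 2.413))    # Pocock boundary: back near 0.05
```

Peeking five times at the conventional 1.96 cutoff nearly triples the chance of stopping on noise, while the stricter pre-specified boundary brings it back to about 5%; that stricter boundary is also what costs extra subjects in the power calculation.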

So, without going into too much detail on these interim looks, understand that they are not to be taken lightly, and that their conditions and reporting require full transparency. To their credit, the semi-recumbent position investigators reported their plan for one interim analysis upon reaching 50% enrollment. Neither the Methods section nor the Acknowledgements, however, specifies who the analyst and the decision-maker were. Most likely it was the investigators themselves who took the look and decided on the subsequent course of action, and this in itself is not that methodologically clean.

Now, let's talk about one problem with early termination. This gargantuan effort, led by the team from McMaster in Canada and published last year in JAMA, sheds much-needed light on what had been suspected before: early termination leads to inflated effect estimates. The sheer scale of the work is mind-boggling: over 2,500 studies were reviewed! The investigators elegantly paired meta-analyses of truncated RCTs with meta-analyses of matched but nontruncated ones, and compared the magnitude of the inter-group differences between the two categories of RCTs. Here is one interesting tidbit (particularly for my friend @ivanoransky):

Compared with matching nontruncated RCTs, truncated RCTs were more likely to be published in high-impact journals (30% vs 68%, P < .001).

But here is what should really grab the reader:

Of 63 comparisons, the ratio of RRs was equal to or less than 1.0 in 55 (87%); the weighted average ratio of RRs was 0.71 (95% CI, 0.65-0.77; P < .001) (Figure 2). In 39 of 63 comparisons (62%), the pooled estimates for nontruncated RCTs were not statistically significant. Comparison of the truncated RCTs with all RCTs (including the truncated RCTs) demonstrated a weighted average ratio of RRs of 0.85; in 16 of 63 comparisons (25%), the pooled estimate failed to demonstrate a significant effect. [Emphasis mine]

The authors went on to conclude the following:

In this empirical study including 91 truncated RCTs and 424 matching nontruncated RCTs addressing 63 questions, we found that truncated RCTs provide biased estimates of effects on the outcome that precipitated early stopping. On average, the ratio of RRs in the truncated RCTs and matching nontruncated RCTs was 0.71. This implies that, for instance, if the RR from the nontruncated RCTs was 0.8 (a 20% relative risk reduction), the RR from the truncated RCTs would be on average approximately 0.57 (a 43% relative risk reduction, more than double the estimate of benefit). Nontruncated RCTs with no evidence of benefit—ie, with an RR of 1.0—would on average be associated with a 29% relative risk reduction in truncated RCTs addressing the same question.
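
The arithmetic in that passage is easy to check. Below is a minimal sketch: `truncated_rr` simply multiplies a nontruncated RR by the reported average ratio of RRs (0.71), and `pooled_ratio_of_rrs` shows, with made-up inputs, what a weighted average of ratios means when it is done by inverse-variance weighting on the log scale. Both names are hypothetical, and the study's own pooling method may well differ from this fixed-effect simplification.

```python
from math import exp, log


def truncated_rr(nontruncated_rr: float, ratio_of_rrs: float = 0.71) -> float:
    """Expected RR in truncated trials: the nontruncated RR multiplied by the
    average ratio of RRs (0.71 in the quoted study)."""
    return nontruncated_rr * ratio_of_rrs


def pooled_ratio_of_rrs(log_ratios, variances) -> float:
    """Inverse-variance (fixed-effect) weighted average of log ratios of RRs,
    returned on the ratio scale. A simplified illustration only."""
    weights = [1.0 / v for v in variances]
    pooled_log = sum(w * x for w, x in zip(weights, log_ratios)) / sum(weights)
    return exp(pooled_log)


for rr in (0.8, 1.0):
    t = truncated_rr(rr)
    print(f"nontruncated RR {rr:.2f} (RRR {1 - rr:.0%}) -> "
          f"truncated RR {t:.2f} (RRR {1 - t:.0%})")
# nontruncated RR 0.80 (RRR 20%) -> truncated RR 0.57 (RRR 43%)
# nontruncated RR 1.00 (RRR 0%) -> truncated RR 0.71 (RRR 29%)

# Hypothetical: two comparisons with ratios 0.6 and 0.8 and equal variances
print(round(pooled_ratio_of_rrs([log(0.6), log(0.8)], [0.04, 0.04]), 3))  # 0.693
```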

So, what does this mean? It means that truncated RCTs do indeed tend to inflate the effect size substantially and to show differences by chance alone where none exists.
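
A small simulation makes the mechanism concrete. The sketch below uses a hypothetical function and parameters, not taken from any of the studies discussed: it generates two-arm trials with a known true RR, takes one interim look, and stops a trial early for benefit when its Z statistic crosses 1.96. It then compares the average RR across all trials at the interim look with the average among only the trials that stopped.

```python
import random
from math import exp, log, sqrt


def _geometric_mean(values):
    """Average on the log scale, reported back on the ratio scale."""
    if not values:
        return float("nan")
    return exp(sum(values) / len(values))


def simulate_truncation(true_rr=0.8, control_risk=0.30, n_interim=300,
                        z_stop=1.96, n_sims=2_000, seed=7):
    """Simulate two-arm trials with one interim look after n_interim patients
    per arm; a trial stops early for benefit if Z < -z_stop at that look.
    Returns (geometric-mean RR of all trials at the interim look,
             geometric-mean RR of the trials that stopped, share stopped)."""
    rng = random.Random(seed)
    treat_risk = control_risk * true_rr
    all_logs, stopped_logs = [], []
    for _ in range(n_sims):
        # Binomial event counts in each arm at the interim look
        a = sum(rng.random() < treat_risk for _ in range(n_interim))
        c = sum(rng.random() < control_risk for _ in range(n_interim))
        if a == 0 or c == 0:
            continue  # log RR undefined; essentially never happens here
        log_rr = log(a / c)  # equal arm sizes, so the denominators cancel
        se = sqrt(1 / a - 1 / n_interim + 1 / c - 1 / n_interim)
        all_logs.append(log_rr)
        if log_rr / se < -z_stop:  # significant benefit: trial is truncated
            stopped_logs.append(log_rr)
    return (_geometric_mean(all_logs), _geometric_mean(stopped_logs),
            len(stopped_logs) / len(all_logs))


all_rr, stopped_rr, share = simulate_truncation()
print(f"true RR 0.80 | all trials at interim: {all_rr:.2f} | "
      f"stopped early: {stopped_rr:.2f} ({share:.0%} stopped)")
```

With a true RR of 0.8, the trials that stop early report a noticeably smaller RR than the truth, simply because they were selected for looking good at the interim. Running it with `true_rr=1.0` shows the other half of the point: every trial that stops displays an apparent benefit that does not exist.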

This is concerning in general, and specifically for our example of the semi-recumbent positioning study. Let us do some calculations to see just how this effect inflation would play out in that study. Recall that microbiologically confirmed pneumonia occurred in 2 of 39 (5%) semi-recumbent patients and in 11 of 47 (23%) supine patients. The investigators calculated the adjusted odds ratio of VAP in the supine compared with the semi-recumbent group to be 6.8 (95% CI 1.7-26.7). This, as I mentioned before, is an inflated estimate, as odds ratios tend to be when the outcome is common. Furthermore, I obviously cannot do the adjusted calculation, as I would need the primary patient data for this. What we need anyway is the relative reduction in VAP due to the intervention being investigated, which runs in the opposite direction (semi-recumbent vs supine) from the reported estimate. So, I can derive the unadjusted relative risk thus: (2/39)/(11/47) = 0.22. Now, if truncation alone shrinks the relative risk by a factor of 0.71 (the 29% average inflation), then had the trial been allowed to run to completion, this relative risk would have been about 0.22/0.71, or ~0.3. In this range, the difference does not seem all that impressive. But as all of the threats to validity we discussed in the original post begin to chisel mercilessly away at this risk reduction, the 29% inflation becomes a proportionally bigger deal.
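
Here is the same back-of-the-envelope calculation as a short script, using only the counts quoted above. It reproduces the unadjusted RR and the de-inflated RR and, as a side check on the point about odds ratios, compares the unadjusted odds ratio with the risk ratio for the same supine vs semi-recumbent comparison. It cannot reproduce the adjusted 6.8, and dividing by 0.71 assumes this particular trial is inflated by exactly the average amount, which is an assumption rather than a measurement.

```python
# Counts quoted above (microbiologically confirmed pneumonia)
semi_events, semi_n = 2, 39       # semi-recumbent arm
supine_events, supine_n = 11, 47  # supine arm

# Unadjusted risk ratio, semi-recumbent vs supine
rr = (semi_events / semi_n) / (supine_events / supine_n)

# Same comparison in the published direction (supine vs semi-recumbent)
rr_supine = 1 / rr
or_supine = ((supine_events / (supine_n - supine_events))
             / (semi_events / (semi_n - semi_events)))

# Assumption: this trial's RR is inflated by exactly the average ratio of RRs
# (0.71), so the RR expected at full enrollment is the observed RR / 0.71.
RATIO_OF_RRS = 0.71
rr_full = rr / RATIO_OF_RRS

print(f"unadjusted RR, semi-recumbent vs supine: {rr:.2f}")                 # 0.22
print(f"supine vs semi-recumbent: OR {or_supine:.1f}, RR {rr_supine:.1f}")  # OR 5.7, RR 4.6
print(f"RR corrected for average truncation inflation: {rr_full:.2f}")     # 0.31
```

The unadjusted odds ratio (about 5.7) comes out larger than the corresponding risk ratio (about 4.6), which is the odds-ratio inflation with a common outcome in miniature.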

Well, that does it.

However, expecting people to continue a trial in the face of such evidence, even if that evidence arose by chance, is not reasonable. The fact that they distrust the alternative treatment could well have a negative effect on the patients assigned to it.

An alternative would be to use the treatment that seemed better but watch the results closely for regression to the mean. If that were to occur, then a new trial would have to be done with prohibitions against early termination. If regression to the mean did not occur, then that would support the decision to terminate early.