I want to resume our recent discussions of statistical testing in the context of biologic plausibility. We discussed the latter at length a few months ago here, and came to the conclusion that our mere impression of biologic plausibility is not a good litmus test for an association. The oft-cited discovery of H. pylori as the cause of peptic ulcer disease is a classic example of the knowledge we would be missing today if we had used biologic plausibility as the only yardstick for measuring the prospects of research.
At the same time, we spent a fair bit of time and energy talking about p values and how they need to be used in a Bayesian manner. To review, Bayes' theorem relies on the pre-test probability of an association to help us understand how much stock we need to put into a finding of that association. That is, the lower the pre-test probability, the more suspicious we should be of an observed association. To put it in concrete terms, the finding that intercessory prayer is associated with improved health outcomes requires much greater scrutiny than the finding that treating a bacterial infection with an antibiotic improves survival. There is a certain mechanistic elegance to the latter that is missing in the former, unless higher powers are invoked. Here is a quote from the Cochrane meta-analysis of intercessory prayer -- I especially love the last sentence [emphasis mine]:
REVIEWER'S CONCLUSIONS: Data in this review are too inconclusive to guide those wishing to uphold or refute the effect of intercessory prayer on health care outcomes. In the light of the best available data, there are no grounds to change current practices. There are few completed trials of the value of intercessory prayer, and the evidence presented so far is interesting enough to justify further study. If prayer is seen as a human endeavour it may or may not be beneficial, and further trials could uncover this. It could be the case that any effects are due to elements beyond present scientific understanding that will, in time, be understood. If any benefit derives from God's response to prayer it may be beyond any such trials to prove or disprove.

At the same time, just because we do not have a mechanistic explanation at the ready does not mean that we should discount an association. In a rather lengthy post in October I wrote about my own conflicted feelings about applying Bayesian versus frequentist thinking in research (the frequentist view treats all associations as standing on similar probabilistic ground prior to testing). Although more Bayesian in my own thinking, I recognize metacognitively that it may at times be a trap:
Yet, there is something to be said about the frequentist approach, even though it is not my way generally. The frequentist approach, which underlies the bulk of our traditional clinical research, does not rely on differential prior probabilities for different possible associations, but treats them all equally. Despite many disadvantages, one obvious advantage is that we do not discount potential associations that lack biologic plausibility, given our current understanding of biology, and we sometimes stumble on brand new hypotheses. So, clearly, there is a tension here, and I am still working out which is the better way, if any.

The last sentence here implies that there is a right and a wrong way, but having spent the last several months exploring these issues, I am beginning to think that this is incorrect. In fact, all of the p value discussions are leading me to believe that both approaches are useful, and it is the nuances of when either should predominate that need to be worked out.
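This tension can be made concrete with a back-of-the-envelope Bayesian calculation. The sketch below (the priors are invented purely for illustration) uses the Sellke-Bayarri-Berger bound, -e*p*ln(p), as the most generous Bayes factor a given p value can supply, and shows how the same "significant" result leaves a mechanistically implausible hypothesis improbable while making a plausible one quite likely:

```python
import math

def min_bayes_factor(p):
    """Sellke-Bayarri-Berger bound: the strongest evidence against the
    null that a p value < 1/e can provide, expressed as a Bayes factor
    in favor of the alternative."""
    assert 0 < p < 1 / math.e
    return 1 / (-math.e * p * math.log(p))

def posterior_probability(prior, p):
    """Update the prior probability that the association is real,
    generously crediting the study with the maximum Bayes factor."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * min_bayes_factor(p)
    return posterior_odds / (1 + posterior_odds)

p = 0.04  # the same "significant" result in both scenarios

# Hypothetical priors, chosen only to illustrate the asymmetry:
print(posterior_probability(0.001, p))  # implausible hypothesis: still tiny
print(posterior_probability(0.50, p))   # plausible hypothesis: fairly likely
```

With p = 0.04 the maximum Bayes factor is only about 2.9, so a one-in-a-thousand prior barely moves, while a 50:50 prior climbs to roughly three-to-one odds. The numbers are illustrative, but the shape of the result is exactly the pre-test-probability argument above.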
Consider my examples above -- those of intercessory prayer and antibiotic treatment of a bacterial infection. Let us transport ourselves to, say, 18th century England, where typhus, known as "gaol fever", killed more prisoners than the executioners did. How improbable would it have seemed to the medical profession of those days that (a) the disease was caused by a microorganism, and (b) it could be eradicated with an antibiotic? Why, I would guess that these assertions either would appear heretical or else confirm for the religious the divine presence. Either way, the biology was lacking and the plausibility was simply not there. Yet, this does not change the reality as we understand it today. What explanations will we have 200 years from now for the occasionally observed success of intercessory prayer? And more importantly, what do we do in the meantime to tread most sensibly that purgatory between accepting absurd associations and missing the unlikely ones that are nevertheless real?
The answer may be in the p value after all. Let us model qualitatively what things might look like for intercessory prayer. Let us pretend that we have just conducted the very first randomized controlled trial of the impact of intercessory prayer on the development of post-operative infection following coronary bypass surgery among 1,200 patients. We have found that there is indeed a lowered risk of infection in the intervention group, and the difference has a p value of 0.04. Great, right? We can walk away congratulating ourselves on a positive study. Well, of course this is absurd. Even though we can come up with some remotely plausible mechanism for this potentially causal association, our pre-test probability is still minuscule. The answer at this point should obviously be what has been suggested for genome-wide association studies: a much lower alpha level as the significance threshold. How low? This I cannot answer yet. While the rationale is similar to that of genome-wide studies (a fishing expedition without much understanding of why we should find what we find), here we are not merely engaging in multiple hypothesis testing, where the number of tests could help determine the appropriate significance level. No, here we are testing a single hypothesis whose mechanism is either absent or highly biologically implausible. So, how to determine the adequate threshold for significance under these circumstances remains unclear to me at this time. I can only say that the traditional 0.05 is highly inappropriate early in the research efforts.
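One way to get at "how low?" is to turn the usual positive-predictive-value formula around. If a fraction `prior` of such hypotheses are true and the study has a given power, then among "significant" results the share that are real is prior*power / (prior*power + (1-prior)*alpha); solving for alpha gives the largest threshold that still makes a positive finding more likely true than false. This is a sketch, and every number in it is a made-up illustration, not a recommendation:

```python
def required_alpha(prior, power, target_ppv):
    """Largest significance threshold alpha such that a 'significant'
    result is real with probability >= target_ppv, given the prior
    probability the hypothesis is true and the study's power.
    Rearranged from PPV = prior*power / (prior*power + (1-prior)*alpha)."""
    return prior * power * (1 - target_ppv) / (target_ppv * (1 - prior))

# Hypothetical numbers, for illustration only: a well-powered (80%)
# trial of a mechanistically implausible hypothesis (prior 1 in 1,000),
# demanding that a positive result be at least a coin flip to be real.
print(required_alpha(prior=0.001, power=0.8, target_ppv=0.5))
# roughly 0.0008 -- far below the conventional 0.05
```

For a plausible hypothesis with a prior near 0.5, the same formula returns an alpha well above 0.05, which is the asymmetry the post is arguing for: the threshold should move with the pre-test probability.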
As more studies are performed, their quality and the directionality of their results should influence how much stock we put in them. That is, if well done studies consistently continue to demonstrate a positive association of intercessory prayer with clinical outcomes, despite inadequate mechanistic understanding, our level of skepticism should diminish, and commensurately the acceptable alpha can creep higher. In short, the more evidence and the stronger it is, despite poor understanding of why, the more liberal we can afford to be with what we consider a significant result.
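This accumulation of evidence is itself just sequential Bayesian updating: each well done positive study multiplies the prior odds by its Bayes factor, and after enough consistent replications even a deeply skeptical prior is overwhelmed. A minimal sketch, again crediting each study with at most the Sellke-Bayarri-Berger bound and using invented p values:

```python
import math

def update_prior(prior, p_values):
    """Sequentially update the probability that an association is real
    as consistent positive studies accumulate, crediting each study
    with at most the Sellke-Bayarri-Berger Bayes factor 1/(-e*p*ln p)."""
    odds = prior / (1 - prior)
    for p in p_values:
        odds *= 1 / (-math.e * p * math.log(p))
    return odds / (1 + odds)

# Hypothetical sequence: five independent, consistent trials, each p = 0.01,
# starting from a skeptical 1-in-1,000 prior.
print(update_prior(0.001, [0.01]))      # one study: still very improbable
print(update_prior(0.001, [0.01] * 5))  # five studies: now quite probable
```

One study at p = 0.01 barely dents a 1-in-1,000 prior, but five consistent replications push the posterior above 90 percent. That is the quantitative face of "skepticism should diminish": as the evidence compounds, the alpha we demand of the next study can reasonably relax.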
So, my point? How we interpret the significance of results needs to be fluid. A p value is not a p value is not a p value. This much embattled and misunderstood statistic may yet be the bridge between Bayesian and frequentist approaches. If we get smarter about setting its thresholds, perhaps we can keep the baby while getting rid of the rancid bath water at the same time. Of course, I am not even attempting to address all of the cognitive biases that derail us in our pursuit of scientific truths. Incorporating them into our inference testing is definitely a discussion for another day.