Showing posts with label biologic plausibility.

Monday, January 3, 2011

Gaol fever and intercessory prayer: Redefining the role of p-value?

Happy 2011, everyone! I hope that it is everything you want it to be. Sorry for a brief hiatus in blogging -- needed to recharge my batteries and read others' writing for a change. Well, back now. And thanks to you all for coming back too.

I want to resume our recent discussions of statistical testing in the context of biologic plausibility. We discussed the latter at length a few months ago here, and came to the conclusion that our mere impression of biologic plausibility is not a good litmus test for an association. The oft-cited discovery of H. pylori as the cause of peptic ulcer disease is a tried and true example of the knowledge we would be missing today if we used biologic plausibility as the only yardstick for measuring the prospects of research.

At the same time, we spent a fair bit of time and energy talking about p values and how they need to be used in a Bayesian manner. To review, Bayes theorem relies on the pre-test probability of an association to help us understand how much stock to put into a finding of that association. That is, the lower the pre-test probability, the more suspicious we should be of an observed association. To put it in concrete terms, the finding that intercessory prayer is associated with improved health outcomes requires far greater scrutiny than the finding that treating a bacterial infection with an antibiotic improves survival. There is a certain mechanistic elegance to the latter that is missing in the former, unless higher powers are invoked. Here is a quote from the Cochrane meta-analysis of intercessory prayer -- I especially love the last sentence [emphasis mine]:
REVIEWER'S CONCLUSIONS: Data in this review are too inconclusive to guide those wishing to uphold or refute the effect of intercessory prayer on health care outcomes. In the light of the best available data, there are no grounds to change current practices. There are few completed trials of the value of intercessory prayer, and the evidence presented so far is interesting enough to justify further study. If prayer is seen as a human endeavour it may or may not be beneficial, and further trials could uncover this. It could be the case that any effects are due to elements beyond present scientific understanding that will, in time, be understood. If any benefit derives from God's response to prayer it may be beyond any such trials to prove or disprove.
At the same time, just because we do not have a mechanistic explanation at the ready does not mean that we should discount an association. In a rather lengthy post in October I wrote about my own conflicted feelings about applying Bayesian versus frequentist thinking in research (the frequentist approach treats all associations as standing on similar probabilistic ground prior to testing). Although more Bayesian in my own thinking, I recognize metacognitively that it may at times be a trap:
Yet, there is something to be said about the frequentist approach, even though it is not my way generally. The frequentist approach, which is what underlies the bulk of our traditional clinical research, does not rely on differential prior probabilities for different possible associations, but treats them all equally. Despite many disadvantages, one obvious advantage is that we do not discount potential associations that lack biologic plausibility given our current understanding of biology, and this sometimes helps us stumble on brand new hypotheses. So, clearly, there is a tension here, and I am still working on what is the better way, if any.
The last sentence here implies that there is a right and a wrong way, but having spent the last several months exploring these issues, I am beginning to think that this is incorrect. In fact, all of the p value discussions are leading me to believe that both approaches are useful, and it is the nuances of when either should predominate that need to be worked out.

Consider my examples above -- those of intercessory prayer and antibiotic treatment of a bacterial infection. Let us transport ourselves to, say, 18th-century England, where typhus, known as "gaol fever", killed more prisoners than the executioners did. How improbable would it have seemed to the medical profession of those days that a) the disease was caused by a microorganism, and b) it could be eradicated with an antibiotic? Why, I would guess that these assertions would either have appeared heretical or else confirmed, for the religious, the divine presence. Either way, the biology was lacking and the plausibility was simply not there. Yet this does not change the reality as we understand it today. What explanations will we have 200 years from now for the occasionally observed success of intercessory prayer? And more importantly, what do we do in the meantime to tread most sensibly that purgatory between accepting absurd associations and missing the unlikely ones that are nevertheless real?

The answer may be in the p value after all. Let us model qualitatively what things might look like for intercessory prayer. Let us pretend that we have just conducted the very first randomized controlled trial of the impact of intercessory prayer on the development of post-operative infection following coronary bypass surgery among 1,200 patients. We have found that there is indeed a lowered risk of infection in the intervention group, and the difference has a p value of 0.04. Great, right? We can walk away congratulating ourselves on a positive study. Well, of course this is absurd. Even though we can come up with some remotely plausible mechanism for this potentially causal association, our pre-test probability is still minuscule. The answer at this point should obviously be what has been suggested for genome-wide association studies: a much lower alpha level as the significance threshold. How low? This I cannot answer yet. While the rationale is similar to that of genome-wide studies -- a fishing expedition without much understanding of why we should find what we find -- here we are not merely testing multiple hypotheses, whose number could help determine the appropriate significance level. No, here we are testing a single hypothesis whose mechanism is either absent or highly biologically implausible. So, how to determine the adequate threshold for significance under these circumstances remains unclear to me at this time. I can only say that the traditional 0.05 is highly inappropriate early in such research efforts.
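
To make that intuition concrete, here is a minimal back-of-the-envelope sketch in Python. The prior probabilities, power, and alpha below are invented purely for illustration, not taken from any actual study; the sketch simply computes the post-test probability that an association is real once a "significant" result is in hand.

    # Rough sketch: probability that an association is real given a "significant"
    # result, via Bayes theorem. All numbers are assumed for illustration only.

    def posterior_given_significant(prior, alpha=0.05, power=0.80):
        """P(association is real | p < alpha)."""
        true_positive = prior * power           # real effect, and the study detects it
        false_positive = (1 - prior) * alpha    # no effect, "significant" by chance
        return true_positive / (true_positive + false_positive)

    # A biologically implausible hypothesis (e.g., intercessory prayer), prior ~1 in 1,000:
    print(posterior_given_significant(prior=0.001))                 # ~0.016
    # The same hypothesis held to a far stricter significance threshold:
    print(posterior_given_significant(prior=0.001, alpha=0.0001))   # ~0.89
    # A mechanistically obvious hypothesis (antibiotic for a bacterial infection):
    print(posterior_given_significant(prior=0.5))                   # ~0.94

In other words, with a minuscule prior, a p value of 0.04 leaves the association almost certainly false, and only a far stricter threshold does the corrective work -- which is the point of the argument above.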

As more studies are performed, their quality and the direction of their results should determine how much stock we put in them. That is, if well-done studies consistently continue to demonstrate a positive association of intercessory prayer with clinical outcomes, despite inadequate mechanistic understanding, our level of skepticism should diminish, and commensurately the acceptable alpha can creep higher. In short, the more evidence there is and the stronger it is, despite poor understanding of why, the more liberal we can afford to be with what we consider a significant result.
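
Here is a minimal sketch of that accumulation of evidence, again with assumed numbers: a starting prior of 1 in 1,000, and a string of independent, well-done studies, each with 80% power, that all come out positive at the 0.05 level.

    # Sketch of sequential updating: each independent positive study revises the
    # probability that the effect is real. Power, alpha, and the starting prior
    # are assumed for illustration only.

    def update(prior, alpha=0.05, power=0.80):
        positive_if_real = prior * power
        positive_if_null = (1 - prior) * alpha
        return positive_if_real / (positive_if_real + positive_if_null)

    prob = 0.001  # starting skepticism for a biologically implausible hypothesis
    for study in range(1, 6):
        prob = update(prob)
        print(f"after study {study}: P(effect is real) = {prob:.3f}")
    # Roughly: 0.016 after one study, 0.20 after two, 0.80 after three, >0.98 after four.

The arithmetic is the same as in the sketch above; what changes is how many consistent replications it takes before skepticism gives way.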

So, my point? How we interpret the significance of results needs to be fluid. A p value is not a p value is not a p value. This much-embattled and misunderstood statistic may yet be the bridge between the Bayesian and frequentist approaches. If we get smarter about setting its thresholds, perhaps we can keep the baby while getting rid of the rancid bath water. Of course, I am not even attempting to address all of the cognitive biases that derail us in our pursuit of scientific truths. Incorporating them into our inference testing is definitely a discussion for another day.

Monday, December 13, 2010

Can a "negative" p-value obscure a positive finding?

I am still on my p-value kick, brilliantly fueled by Dr. Steve Goodman's correspondence with me and another paper by him aptly named "A Dirty Dozen: Twelve P-Value Misconceptions". It is definitely worth a read in toto, as I will only focus on some of its more salient parts.

Perhaps the most important point that I have gleaned from my p-value quest is that the word "significance" should be taken quite literally. Here is what the Merriam-Webster dictionary says about it:

sig·nif·i·cance, noun \sig-ˈni-fi-kən(t)s\

1 a : something that is conveyed as a meaning often obscurely or indirectly
  b : the quality of conveying or implying
2 a : the quality of being important : moment
  b : the quality of being statistically significant
It is the very meaning in point 2a that the word "significance" was meant to convey in reference to statistical testing: that is, "worth noting" or "noteworthy." Nowhere do we find any references to God or truth or dogma. So, the first lesson is to drift away from taking statistical significance as a sign from God that we have discovered the absolute truth, and to focus on the fact that we need to make note of the association. The follow-up to the noting is confirmation or refutation. That is, once identified, this relationship needs to be tested again (and sometimes again and again) before we can say that there may be something to it.

As an aside, how many times have you received comments from peer reviewers saying that what you are showing has already been shown? Yet in all of our Discussion sections we are quite cautious to say that "more research is needed" to confirm what we have seen. So, it seems we -- researchers, editors and reviewers -- just need to get on the same page.

To go on, as the title of the paper states, we are initiated into the 12 common misconceptions about what the p-value means (and does not mean). Here is the table that enumerates all 12 (since the paper is easy to find for no fee, I am assuming that reproducing the table with attribution is not a problem):

[Table: the twelve p-value misconceptions, reproduced from Goodman's paper]
Even though some of them seem quite similar, it is worth understanding the degrees of difference, as they provide important insights.

The one I wanted to touch upon further today is Misconception #12, as it dovetails with our prior discussion vis-a-vis environmental risks. But before we do this, it is worth defining the elusive meaning of the p-value once again: "The p-value signifies the probability of obtaining the association (or difference) of the magnitude obtained, or one of greater magnitude, when in reality there is no association (or difference)." So, let's apply this to an everyday example of smoking and lung cancer risk. Let's say a study shows a 2-fold increase in lung cancer among smokers compared to non-smokers, and the p-value for this association is 0.06. What this really means is that "under conditions of no true association between smoking and lung cancer, there is a 6% chance that a study would find a 2-fold or greater increase in cancer associated with smoking." Make sense? Yet, according to the "rules" of statistical significance, we would call this study negative. But is this a true negative? (To the reader of this blog this is obvious, but I assure you that, given how cursory our reading of the literature tends to be, and how often I hear my peers discount findings with the careless "But the p-value was not significant," this is a point worth harping on.)
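
For those who like to see the definition in action, here is a toy simulation in Python. The sample size of 500 per group and the 5% baseline cancer risk are invented for illustration and bear no relation to real lung cancer epidemiology; the sketch simply estimates how often chance alone, with no true association, produces an apparent 2-fold or greater difference.

    # Toy simulation of the p-value definition: under a true "no association"
    # scenario, how often does a study show a 2-fold or greater risk difference?
    # Group size and baseline risk are assumed purely for illustration.
    import random

    random.seed(1)
    n_per_group, baseline_risk, n_studies = 500, 0.05, 20_000
    extreme = 0
    for _ in range(n_studies):
        cases_smokers = sum(random.random() < baseline_risk for _ in range(n_per_group))
        cases_nonsmokers = sum(random.random() < baseline_risk for _ in range(n_per_group))
        # same denominator in both groups, so the ratio of case counts is the relative risk
        if cases_nonsmokers > 0 and cases_smokers / cases_nonsmokers >= 2:
            extreme += 1
    print(extreme / n_studies)  # fraction of "null" studies showing a 2-fold or greater excess

Whatever that fraction turns out to be for a given design, that is conceptually the quantity the p-value is estimating -- nothing more.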

The bottom-line answer to this is found in the discussion of the last misconception in the Table: "A scientific conclusion or treatment policy should be based on whether or not the p-value is significant". I would like to quote directly from Goodman's paper here, as it really drives home the idiocy of this idea:
This misconception encompasses all of the others. It is equivalent to saying that the magnitude of effect is not relevant, that only evidence relevant to a scientific conclusion is in the experiment at hand, and that both beliefs and actions flow directly from the statistical results. The evidence from a given study needs to be combined with that from prior work to generate a conclusion. In some instances, a scientifically defensible conclusion might be that the null hypothesis is still probably true even after a significant result, and in other instances, a nonsignificant P value might still lead to a conclusion that a treatment works. This can be done formally only through Bayesian approaches. To justify actions, we must incorporate the seriousness of errors flowing from the actions together with the chance that the conclusions are wrong.
When the author advocates Bayesian approaches, he is referring to the idea that a positive result in the setting of a low pre-test probability still has a very low chance of describing a truly positive association. This is better illustrated by the Bayes theorem, which allows us to combine the result at hand with what the bulk of prior evidence and/or thought has indicated about the association (the "prior probability") to arrive at an updated estimate (the "posterior probability"). This implies that the lower our prior probability, the less convinced we can be by a single positive result. As a corollary, the higher our prior probability for an association, the less credence we can put in a single negative result. So the Bayesian approach to evidence, as Goodman indicates here, can merely move us in the direction of either greater or lesser doubt about our results, NOT bring us to truth or falsity.
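
The corollary is easy to put into numbers with the same kind of sketch as in the last post; the priors, power, and alpha below are assumed values for illustration only.

    # Sketch: how much should one non-significant study move our belief?
    # All inputs are assumed values for illustration only.

    def posterior_given_negative(prior, alpha=0.05, power=0.80):
        """P(effect is real | study was not significant)."""
        missed_effect = prior * (1 - power)        # real effect, but the study missed it
        true_negative = (1 - prior) * (1 - alpha)  # no effect, correctly non-significant
        return missed_effect / (missed_effect + true_negative)

    print(posterior_given_negative(prior=0.90))  # ~0.65: still more likely true than not
    print(posterior_given_negative(prior=0.10))  # ~0.02: a weak prior collapses further

A single "negative" study barely dents a strong prior, which is exactly Goodman's point that "a nonsignificant P value might still lead to a conclusion that a treatment works."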

Taken together, all these points merely confirm my prior assertion that we need to be a lot more cautious about calling results negative when deciding about potentially risky exposures than about beneficial ones. Similarly, we need to set a much higher bar for all threats to validity in studies designed to look at risky rather than beneficial outcomes (more on this in a future post). These are the principles we should be employing when evaluating environmental exposures. This becomes particularly critical in view of the startling finding from genome-wide association studies that our genes determine only a small minority of the diseases to which we are subject. This means that the ante has been upped dramatically for environmental exposures as culprits, and it demands a much more serious push for the precautionary principle as the foundation of our environmental policy.


Wednesday, November 17, 2010

Some implications of biologic plausibility

Ever since my... ahem... skirmish... with the folks over at the SBM, I have been contemplating the issue of biologic plausibility. They contend that our tax dollars are wasted by being allocated to the NCCAM to pursue research into CAM. Their reasoning is that there is no biological plausibility to any of it having any therapeutic effect. Now, this is a big bite to swallow. As I have said before, there is CAM and then there is CAM. CAM seems to be a convenient wastebasket of modalities that we feel justified in bashing as "woo" since there is limited scientific evidence behind them. But really, I am more willing to give acupuncture and massage the benefit of the doubt than, say, healing crystals (even though I confess I really like rocks!).

So, what of this biologic plausibility, and who came up with it anyway? And is it truly fiscally irresponsible, and possibly even unethical, to test interventions that do not fit our biological plausibility criteria? As a corollary, is there a level of our understanding of biology that makes testing equally wasteful or even unethical? And finally, should plausibility of benefit and harm be required to reach the same evidentiary bar?

For the definition of biologic plausibility we apparently have the milestone 1964 Surgeon General's report linking smoking to cancer to thank. This report was the first official US government document to state that there was enough evidence to implicate cigarette smoking in the rise in lung cancer and cancer deaths. Since critics had for decades used the limitations of observational research to derail such a definitive statement, the report itself does a nice job of laying out the methodologic considerations and the need to rely on the Bradford-Hill criteria. It was in the "coherence" criterion that biologic plausibility entered the picture.

A quick check of my favorite crowd-sourced information site, the Wikipedia, uncovers this treasure from Sir Austin Bradford Hill himself:

It will be helpful if the causation we suspect is biologically plausible. But this is a feature I am convinced we cannot demand. What is biologically plausible depends upon the biological knowledge of the day. To quote again from my Alfred Watson Memorial Lecture [1962], there was
"…no biological knowledge to support (or to refute) Pott’s observation in the 18th century of the excess of cancer in chimney sweeps. It was lack of biological knowledge in the 19th that led to a prize essayist writing on the value and the fallacy of statistics to conclude, amongst other “absurd” associations, that 'it could be no more ridiculous for the stranger who passed the night in the steerage of an emigrant ship to ascribe the typhus, which he there contracted, to the vermin with which bodies of the sick might be infested.' And coming to nearer times, in the 20th century there was no biological knowledge to support the evidence against rubella."

In short, the association we observe may be one new to science or medicine and we must not dismiss it too light-heartedly as just too odd. As Sherlock Holmes advised Dr. Watson, "when you have eliminated the impossible, whatever remains, however improbable, must be the truth."
Aha, so biologic plausibility is a function of the state of our current knowledge, today. By this litmus test, Marshall and Warren should have been laughed out of all funding agencies. Instead, they rewrote our understanding of what can live in the stomach, and how a microorganism can cause peptic ulcer disease and stomach cancer. And got themselves a cool Nobel to boot. So much for the ethics and finances of biologic plausibility informing meaningful research.

Now, on to the question of whether there exist relationships with such high biologic plausibility that they do not require irrefutable proof. Well, how about tobacco and its health effects? How about radiation exposure? Now, how about what we know today about the evolution of microbial resistance to antibiotics? Is it enough that the biologic plausibility for ill effects of antibiotics in our food chain is strong? Can we now stop the madness? If my colleagues over at SBM are consistent in their logic, they would say yes to this. However, extrapolating from this post about organic food production, I somehow think that they would not. So, I am guessing that, although they believe that lack of biologic plausibility should preclude attempts at study, they would nevertheless be reluctant to set a threshold of biologic plausibility that might obviate the need for further research. I am just guessing, and would love to hear what they really think.

And finally, what of the plausibility of benefit versus that of harm? Should our bar for biologic plausibility for harm be lower than that for benefit? Well, the question really boils down to this: how many bodies do we need to see lying in the streets before we concede that there is a problem? My point is that we Americans have a hard time subscribing to the precautionary principle, applied generously in other parts of the world. If we were a tad less insistent on irrefutable evidence, how many decades of equivocation about tobacco and cancer would we have avoided? How many lives might have been saved? The biologic plausibility of that connection was evident even in the 1930s, yet it took another three decades for us to act. What are we obfuscating today that will come back to bite us (and our children) tomorrow? Could it be the cynical injection of doubt about whether our food production system is causing irreversible damage to us and the life around us?

So, what I am saying is that biologic plausibility has several facets. We have to admit humbly that any judgment of it relies on our necessarily incomplete knowledge, and that denying this may keep us from awe-inspiring discoveries that advance science in leaps. However, if we feel strongly about requiring it in order to justify our research allocations, some careful soul-searching is in order about those thresholds of probability, especially of harm, at which we may admit that science has made us sure enough and, instead of awaiting perfect evidence, we must act promptly.