Monday, January 31, 2011

Reviewing medical literature, part 5: Inter-group differences and hypothesis testing

Happy almost February to everyone. It is time to resume our series. First, I am grateful for the vigorous response to my survey about interest in a webinar to cover some of this stuff. Over the next few months one of my projects will be to develop and execute one. I will keep you posted. In the meantime, if anyone has thoughts or suggestions on the logistics, etc., please reach out to me.

OK, let's talk about group comparisons and hypothesis testing. The scientific method that we generally practice demands that we articulate a hypothesis prior to conducting a study which will test this hypothesis. The hypothesis is generally advanced as the so-called "null hypothesis" (or H0), wherein we express our skepticism that there is a difference between groups or an association between the exposure and outcome. By starting out with this negative formulation, we set the stage for "disproving" the null hypothesis, or demonstrating that the data support the "alternative hypothesis" (HA, or the presence of the said association or difference). This is where all the measures of association that we have discussed previously come in, and most particularly the p value. The definition of the p value once again is "the probability that the found inter-group difference, or one that is greater than what was found, would have been found under the condition of no true difference." Following through on this reasoning, we can appreciate that the H0 can never be "proven." That is, the only thing that can be said statistically when no difference is found between groups is that we did not disprove the null hypothesis. This may be because there truly is no difference between the groups being compared (that is, the null hypothesis approximates reality) or because we did not find the difference that in fact exists. The latter is referred to as the Type II error, and can be present for various reasons, the most common of which is a sample size that is too small to detect a statistically significant difference.
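For readers who like to see the definition of the p value in action, here is a minimal sketch of a permutation test. It is only one of several ways to arrive at a p value, and the function name, the made-up outcome values, and the number of shuffles are my own assumptions for illustration. The idea is to shuffle the group labels many times (which is what "no true difference" would look like) and count how often the shuffled difference is at least as large as the one actually observed.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation p value for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(_mean(group_a) - _mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            as_extreme += 1
    # Adding 1 to both counts keeps the estimate from ever being exactly 0.
    return (as_extreme + 1) / (n_permutations + 1)


# Hypothetical outcome values, invented purely for illustration.
treated = [4.1, 5.0, 3.8, 4.4, 4.9, 5.2]
control = [5.6, 6.1, 5.9, 6.4, 5.3, 6.0]
print(permutation_p_value(treated, control))
```

A small value here says only that a difference this large would be unusual if the null hypothesis were true; a large value does not prove the null hypothesis, for the reasons given above.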

This is a good place to digress and talk a little about the distinction between "absence of evidence" and "evidence of absence." The distinction, though ostensibly semantic, is quite important. While "evidence of absence" implies that studies to look for associations have been done, done well, published, and have consistently shown the lack of association between a given exposure and outcome or a difference between two groups, "absence of evidence" means that we have just not done a good job looking for this association or difference. Absence of evidence does not absolve the exposure from causing the outcome, yet so often it is confused with the definitive evidence of absence of an effect. Nowhere is this more apparent than in the history of the tobacco debate, which is the poster child for this obfuscation. And we continue to rely on this confusion in other environmental debates, such as chemical exposures and cell phone radiation. One of the most common reasons for finding no association when one exists, or the type II error, is, as I have already mentioned, a sample size that is too small to detect the difference. For this reason, in a published study that fails to show a difference between groups it is critical to ensure that the investigators performed the power calculation. This maneuver, usually found in the Methods section of the paper, lets us know whether the sample size is adequate to detect a difference if one exists, thus minimizing the probability of type II error. The trouble is that, as we know, there is a phenomenon called "publication bias." This refers to the scientific journals' reluctance to publish negative results. And while it may be appropriate to reject studies prone to type II error due to poor design (although even these studies may be useful in the setting of a meta-analysis, where pooling of data overcomes small sample sizes), true negative results must be made public. But this is a little off topic.
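To make the power calculation a little more concrete, here is a rough sketch of the textbook normal-approximation formula for comparing two proportions. The function name, the default alpha of 0.05 and power of 80%, and the example event rates are my assumptions; real trials often use more refined methods, so treat this as an illustration rather than a recipe.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate patients needed per group to detect p1 vs p2 (two-sided)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # guards against type I error
    z_power = NormalDist().inv_cdf(power)          # guards against type II error
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical event rates: 10% in one group versus 5% in the other.
print(sample_size_per_group(0.10, 0.05))
# Asking for more power (a smaller chance of type II error) costs more patients.
print(sample_size_per_group(0.10, 0.05, power=0.90))
```

Under these assumptions, detecting a halving of a 10% event rate takes on the order of a few hundred patients per group. A study enrolling far fewer than that and reporting "no difference" is showing us absence of evidence, not evidence of absence.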

I will ask you to indulge me in one other digression. I am sure that in addition to "statistical significance" (this is simplistically represented by the p value), you have heard of "clinical significance." This is an important distinction, since even a finding that is statistically significant may have no clinical significance whatsoever. Take for example a therapy that cuts the risk of a non-fatal heart attack by 0.05 percentage points in a certain population. This means that in a population at a 10% risk for a heart attack in one year, the intervention will bring this risk on average to 9.95%. And though we can argue whether or not this is an important difference at the population level, it does not seem all that clinically important. So, if I have the vested interest and the resources to run the massive trial that will give me statistical significance for this minute difference, I can do that and then say without blushing that my treatment works. Yet, statistical significance always needs to be examined in the clinical context. This is why it is not enough to read the headlines that tout new treatments. The corollary to this is that the lack of statistical significance does not equate to the lack of clinical significance. Given what I just said above about type II error, if the difference appears significant clinically (e.g., reducing the incidence of fatal heart attacks from 10% to 5%), but does not reach statistical significance, the result should not be discarded as negative, but examined as to the probability of the type II error. This is also where Bayesian thinking must come into play, but I do not want to get into this now, as we have covered these issues in previous posts on this blog.
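Here is a small sketch that puts numbers on both halves of this digression, using a simple two-proportion z-test. The function name and the trial sizes (ten million patients per arm in one case, one hundred per arm in the other) are hypothetical choices of mine, picked only to make the contrast visible.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p value from a pooled two-proportion z-test."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# 10% versus 9.95% in an enormous (hypothetical) trial: a tiny difference.
tiny_difference = two_proportion_p_value(1_000_000, 10_000_000,
                                         995_000, 10_000_000)
# 10% versus 5% in a small (hypothetical) trial: a large difference.
large_difference = two_proportion_p_value(10, 100, 5, 100)

print(f"tiny difference, huge trial:  p = {tiny_difference:.4f}")
print(f"large difference, tiny trial: p = {large_difference:.4f}")
```

With these made-up numbers the clinically trivial difference clears the conventional 0.05 threshold, while the clinically impressive one does not. Neither p value, on its own, tells us which result matters to patients.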

OK, back to hypothesis testing. There are several rules to be aware of when reading how the investigators tested their hypotheses, as different types of variables require different methods. A categorical variable (one characterized by categories, like gender, race, death, etc.) can be compared using the chi-square test if there is an abundance of events or Fisher's exact test when values are scant. A normally distributed continuous variable (e.g., age is a continuum that is frequently distributed normally) can be tested using Student's t-test, while one that has a skewed distribution (e.g., hospital length of stay, costs) requires a "non-parametric" test: the Mann-Whitney U-test (also known as the Wilcoxon rank-sum test) when two groups are compared, or the Kruskal-Wallis test when there are more than two. You do not need to know any more than this: the test for the hypothesis depends on the variable's type and distribution. And recognizing some of the situations and test names may be helpful to you in evaluating the validity of a study.
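As an illustration of the categorical case, the sketch below picks between the chi-square test and Fisher's exact test for a 2x2 table. The cutoff I use (any expected cell count below 5 means "scant") is a common rule of thumb rather than a law, the function names are mine, the chi-square version omits the continuity correction, and the code assumes no row or column total is zero.

```python
from math import comb, erfc, sqrt


def expected_counts(table):
    """Cell counts we would expect if the row and column were unrelated."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return [[row * col / n for col in (a + c, b + d)] for row in (a + b, c + d)]


def chi_square_p(table):
    """Pearson chi-square p value for a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(stat / 2))  # upper tail of chi-square with 1 df


def fisher_exact_p(table):
    """Two-sided Fisher's exact p value for a 2x2 table."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(r1, x) * comb(r2, c1 - x) / comb(r1 + r2, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum every table that is no more likely than the one observed.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def compare_groups(table):
    """Return (test name, p value), choosing the test by expected counts."""
    if min(min(row) for row in expected_counts(table)) < 5:
        return "Fisher's exact", fisher_exact_p(table)
    return "chi-square", chi_square_p(table)


# Rows are groups, columns are (event, no event); the counts are invented.
print(compare_groups([[10, 90], [5, 95]]))  # plenty of events
print(compare_groups([[3, 1], [1, 3]]))     # scant values
```

When reading a paper, the point is simply to check that the test named in the Methods section matches the kind of data being compared.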

One final frequent computation you may encounter is survival analysis. This is often depicted as a Kaplan-Meier curve, and does not have to be limited to examining survival. This is a time-to-event analysis, regardless of what the event is. In studies of cancer therapies we frequently talk about median disease-free survival between groups, and this can be depicted by the K-M analysis. To test the difference between times to event, we employ the log-rank test.
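For the curious, here is a bare-bones sketch of how a Kaplan-Meier curve is computed. The function names and the five follow-up times are invented for illustration, an event flag of 1 means the event occurred and 0 means the patient was censored (lost to follow-up or still event-free when the study ended), and the log-rank test itself is not shown.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time an event occurs."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        occurred = 0
        leaving = 0
        # Gather everyone whose follow-up ends at this time point.
        while i < len(data) and data[i][0] == t:
            occurred += data[i][1]
            leaving += 1
            i += 1
        if occurred:
            # Multiply by the fraction of those at risk who got past time t.
            survival *= 1 - occurred / at_risk
            curve.append((t, survival))
        # Censored patients leave the risk set without lowering the curve.
        at_risk -= leaving
    return curve


def median_time(curve):
    """First time at which estimated survival is 50% or less, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical follow-up in months; 1 = event, 0 = censored.
months = [1, 2, 3, 4, 5]
event_flags = [1, 0, 1, 1, 0]
curve = kaplan_meier(months, event_flags)
print(curve)
print(median_time(curve))
```

Notice that censored patients still contribute information for as long as they were followed, which is what distinguishes a time-to-event analysis from simply counting who had the event by the end of the study.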

Well, this is a fairly complete primer for most common hypothesis testing situations. In the next post we will talk a little more about measures of association and their precision, types I and II errors, as well as measures of risk alteration.


  1. Another issue is effectiveness versus efficacy, which unfortunately relates to the issue of reading medical journals. While the suggestions you make are certainly efficacious, that is, they will increase the proficiency of the reader, I am not sure that they will be very effective, that is, they are not practical to implement in the field.

    I'm a psychiatrist and I work a lot with health-care providers. Burnout is a problem and it's getting worse, not better.

    Most physicians in primary care specialties are already overwhelmed by patient care responsibilities, over-booked by their employers, and then have to spend any free time dealing with insurance paperwork and prior-authorization requests. Asking them to devote significant amounts of quality time to each journal article in order to make sure that it is of sufficient quality is simply not going to work.

    A solution that would be both effective and efficacious would be for the reviewers and journal editors to screen out a lot of this tripe so that only high-quality studies got into press.

  2. Joe, I could not agree more. As I mentioned in my inaugural post for this series, my mission is precisely to educate peer reviewers on how to be more effective peer reviewers. If as a byproduct this series is also helpful to clinicians and patients, I will be quite happy.