Wednesday, January 12, 2011

Reviewing medical literature, part 3: Threats to validity

You have heard this a thousand times: no study is perfect. But what does this mean? In order to be explicit about why a certain study is not perfect, we need to be able to name the flaws. And let's face it: some studies are so flawed that there is no reason to bother with them, either as a reviewer or as an end-user of the information. But again, we need to identify these nails before we can hammer them into a study's coffin. It is the authors' responsibility to include a Limitations paragraph somewhere in the Discussion section, in which they lay out all of the threats to validity and offer educated guesses as to the importance of these threats and how they may be affecting the findings. I personally will not accept a paper that does not present a coherent Limitations paragraph. However, reviewers are not always as, shall we say, hard-assed about this as I am, and that is when the reader is on her own. Let us be clear: even if the Limitations paragraph is included, the authors do not always do a complete job (and this probably includes me, as I do not always think of all the possible limitations of my work). So, as in everything, caveat emptor! Let us start to become educated consumers.

There are four major threats to validity that fit into two broad categories. They are:
A. Internal validity
  1. Bias
  2. Confounding/interaction
  3. Mismeasurement or misclassification
B. External validity
  4. Generalizability
Internal validity refers to whether the study is examining what it purports to be examining, while external validity, synonymous with generalizability, gives us an idea about how broadly the results are applicable. Let us define and delve into each threat more deeply.

Bias is defined as "any systematic error in the design, conduct or analysis of a study that results in a mistaken estimate of an exposure's effect on the risk of disease" (the reference for this is Schlesselman JJ, as cited in Gordis L, Epidemiology, 3rd edition, page 238). I think of bias as something that artificially makes the exposure and the outcome occur either together or apart more frequently than they should. For example, the INTERPHONE study has been criticized for its biased design, in that it defined exposure as at least one cellular phone call every week. Enrolling such light users can result in an exposure so small that no increase in adverse events can be detected. This is an example of a selection bias, by far the most common form that bias takes. Another frequent bias is encountered in retrospective case-control studies where people are asked to recall distant exposures; it is known as recall bias. Take for example middle-aged women with breast cancer who are asked to recall their diets when they were in college. Now, ask the same of similar women without breast cancer. Women with cancer are likely to be seeking an explanation for their disease, and women without cancer are not, so the two groups will differ in what they recall eating in their youth even if their actual diets were the same. So, a bias in the design can make the association seem either stronger or weaker than it is in reality.
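To make the recall example concrete, here is a minimal sketch in Python. All of the numbers are made up for illustration: it assumes the exposure was truly present in 30% of both cases and controls (so the true odds ratio is 1), that cases report 90% of their real exposures, and that controls report only 60%. Nobody is assumed to report an exposure they never had.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio of exposure in cases versus controls for a 2x2 table."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


def recalled(exposed, unexposed, recall):
    """Expected (exposed, unexposed) counts when only a fraction `recall`
    of truly exposed people report their exposure."""
    reported = exposed * recall
    forgotten = exposed - reported
    return reported, unexposed + forgotten


# Hypothetical truth: 150 of 500 people exposed in each group, no association.
true_cases = (150, 350)      # (exposed, unexposed)
true_controls = (150, 350)
true_or = odds_ratio(*true_cases, *true_controls)

# Assumed recall fractions (illustrative, not taken from any study).
CASE_RECALL = 0.9
CONTROL_RECALL = 0.6

recalled_cases = recalled(*true_cases, CASE_RECALL)
recalled_controls = recalled(*true_controls, CONTROL_RECALL)
observed_or = odds_ratio(*recalled_cases, *recalled_controls)

print(f"true odds ratio:     {true_or:.2f}")      # 1.00
print(f"observed odds ratio: {observed_or:.2f}")  # 1.68
```

Under these assumptions the study would report an odds ratio of about 1.68 for an exposure that has no effect at all, purely because the two groups remember differently. Swap the two recall fractions and the same mechanism makes the exposure look protective.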

I want to skip over confounding and interaction at the moment, as these threats deserve a post of their own, which is forthcoming. Suffice it to say here that a confounder is a factor related to both the exposure and the outcome. An interaction is also referred to as effect modification or effect heterogeneity. This means that there may be population characteristics that alter the response to the exposure of interest. Confounders and effect modifiers are probably the trickiest concepts to grasp. So, stay tuned for a discussion of those.

For now, let us move on to measurement error and misclassification. Measurement error, resulting in misclassification, can happen at any step of the way: it can be in the primary exposure, a confounder, or the outcome of interest. I run into this problem all the time in my research. Since I rely on administrative coding for a lot of the data that I use, I am virtually certain that the codes routinely misclassify some of the exposures and confounders that I deal with. Take Clostridium difficile as an example. There is an ICD-9 code to identify it in administrative databases. However, we know from multiple studies that it is not all that sensitive or all that specific; it is merely good enough, particularly for making observations over time. But even for laboratory values there is a certain potential for measurement error, though we seem to think that lab results are sacred and immune to mistakes. And need I say more about other types of medical testing? Anyhow, the possibility of error and misclassification is ubiquitous. What needs to be determined by the investigator and the reader alike is the probability of that error. If the probability is high, one needs to understand whether it is a systematic error (for example, a coder always more likely than not to include C. diff as a diagnosis) or a random one (a coder is just as likely to include as not to include a C. diff diagnosis). And while a systematic error may result in either a stronger or a weaker association between the exposure and the outcome, a random, or non-differential, misclassification will virtually always reduce the strength of this association.
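The pull of non-differential misclassification toward a weaker association can be seen with a little arithmetic. The sketch below is in Python and uses invented numbers: a binary exposure, a hypothetical 2x2 table with a true odds ratio of about 2.67, and an exposure code that is assumed to be 80% sensitive and 90% specific in cases and controls alike. None of these figures come from a real C. diff validation study.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio of exposure in cases versus controls for a 2x2 table."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


def misclassify(exposed, unexposed, sensitivity, specificity):
    """Expected (exposed, unexposed) counts after imperfect classification."""
    observed_exposed = exposed * sensitivity + unexposed * (1 - specificity)
    observed_unexposed = exposed * (1 - sensitivity) + unexposed * specificity
    return observed_exposed, observed_unexposed


# Hypothetical true counts: (exposed, unexposed).
true_cases = (200, 300)
true_controls = (100, 400)
true_or = odds_ratio(*true_cases, *true_controls)

# Assumed accuracy of the exposure code, identical in both groups,
# which is what makes the error non-differential.
SENSITIVITY = 0.8
SPECIFICITY = 0.9

observed_cases = misclassify(*true_cases, SENSITIVITY, SPECIFICITY)
observed_controls = misclassify(*true_controls, SENSITIVITY, SPECIFICITY)
observed_or = odds_ratio(*observed_cases, *observed_controls)

print(f"true odds ratio:     {true_or:.2f}")      # 2.67
print(f"observed odds ratio: {observed_or:.2f}")  # 1.94
```

With the same error rates applied to both groups, the observed odds ratio slides from 2.67 toward 1, the value that means no association. Giving cases and controls different sensitivity or specificity turns this into a systematic error, and the observed value can then land on either side of the truth.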

And finally, generalizability is a concept that helps the reader understand what population the results may be applicable to. In other words, will the data be applied strictly to the population represented in the study? If so, is it because there are biological reasons to think that the results would be different in a different population? And if so, is it simply the magnitude of the association that can be expected to be different or is it possible that even the direction could change? In other words, could something found to be beneficial in one population be less beneficial or even harmful in another? The last question is the reason that we perseverate on this idea of generalizability. Typically, a regulatory RCT is much less likely to give us adequate generalizability than a well-designed cohort study, for example.

Well, these are the threats to validity in a nutshell. In the next post we will explore much more fully the concepts of confounding and interaction and how to deal with them either at the study design or study analysis stage.
