
Saturday, April 2, 2011

Invalidated Results Watch, Ivan?

My friend Ivan Oransky runs a highly successful blog called Retraction Watch; if you have not yet discovered it, you should! In it he and his colleague Adam Marcus document (with shocking regularity) retractions of scientific papers. While most of the studies are from the bench setting, some are in the clinical arena. One of the questions they have raised is what should happen to citations of these retracted studies by other researchers. How do we deal with this proliferation of oftentimes fraudulent and occasionally simply mistaken data?

A more subtle but no less difficult conundrum arises when cited papers are recognized to be of poor quality, yet are used to build the defense of one's thesis. The latest case in point comes from the paper I discussed at length yesterday, describing the success of the Keystone VAP prevention initiative. And even though I am very critical of the data, I do not mean to single out these particular researchers. In fact, because I am intimately familiar with the literature in this area, I can judge what is being cited. I have seen similar transgressions from other authors, and I am sure that they are ubiquitous. But let me be specific.

In the Methods section on page 306, the investigators lay out the rationale for their approach (bundles) by stating that the "ventilator care bundle has been an effective strategy to reduce VAP..." As supporting evidence they cite references #16-19. Well, it just so happens that these are the references that yours truly had included in her systematic review of the VAP bundle studies, and the conclusions of that review are largely summarized here. I hope that you will forgive me for citing myself again:
A systematic approach to understanding this research revealed multiple shortcomings. First, since all of the papers reported positive results and none reported negative ones, there is a potential for publication bias. For example, a recent story in a non-peer-reviewed trade publication questioned the effectiveness of bundle implementation in a trauma ICU, where the VAP rate actually increased directionally from 10 cases per 1,000 MV days in the period before to 11.9 cases per 1,000 MV days in the period after implementation of the bundle (24). This was in contradistinction to the medical ICU in the same institution, which achieved a reduction from 7.8 to 2.0 cases per 1,000 MV days with the same intervention (24). Since the results did not appear in a peer-reviewed form, it is difficult to judge the quality or significance of these data; however, the report does highlight the need for further investigation, particularly focusing on groups at heightened risk for VAP, such as trauma and neurological critically ill (25).             
Second, each of the four reported studies suffers from a great potential for selection bias, which was likely present in the way VAP was diagnosed. Since all of the studies were naturalistic and none was blinded, and since all of the participants were aware of the overarching purpose of the intervention, the diagnostic accuracy of VAP may have been different before as compared to after the intervention. This concern is heightened by the fact that only one study reports employing the same team approach to VAP identification in the two periods compared (23). In other studies, although all used the CDC-NNIS VAP definition, there was either no reporting of or heterogeneity in the personnel and methods of applying these definitions. Given the likely pressure to show measurable improvement to the management, it is possible that VAP classification suffered from a bias. 
Third, although interventional in nature, naturalistic quality improvement studies can suffer from confounding much in the same way that observational epidemiologic studies do. Since none of the studies addressed issues related to case mix, seasonal variations, secular trends in VAP, and since in each of the studies adjunct measures were employed to prevent VAP, there is a strong possibility that some or all of these factors, if examined, would alter the strength of the association between the bundle intervention and VAP development. Additional components that may have played a role in the success of any intervention are the size and academic affiliation of the hospital. In a study of interventions aimed at reducing the risk of CRBSI, Pronovost et al. found that smaller institutions had a greater magnitude of success with the intervention than their larger counterparts (26). Similarly, in a study looking at an educational program to reduce the risk of VAP, investigators found that community hospital staff were less likely to complete the educational module than the staff at an academic institution; in turn, the rate of VAP was correlated with the completion of the educational program (27).
Finally, although two of the studies included in this review represent data from over 20 ICUs each (20, 22), the generalizability of the findings in each remains in question. For example, the study by Unahalekhaka and colleagues was performed in the institutions in Thailand, where patient mix and the systems of care for the critically ill may differ dramatically from those in the US and other countries in the developed world (22). On the other hand, while the study by Resar and coworkers represents a cross section of institutions within the US and Canada, no descriptions are given of the particular ICUs with respect to the structure and size of their institutions, patient mix or ICU care model (e.g., open vs. closed; intensivists present vs. intensivists absent, etc.) (20). This aggregate presentation of the results gives one little room to judge what settings may benefit most and least from the described interventions. The third study includes data from only two small ICUs in two community institutions in the US (21), while the remaining study represents a single ICU in a community hospital where ICU patients are not cared for by an intensivist (23). Since it is acknowledged that a dedicated intensivist model leads to improved ICU outcomes (28, 29), the latter study has limited usefulness to institutions that have a more rigorous ICU care model.
OK, you say, maybe the investigators did not buy into my questions about the validity of the "findings." Maybe not, but evidence suggests otherwise. In the Discussion section on page 311 they actually say:
While the bundle has been published as an effective strategy for VAP prevention and is advocated by national organizations, there is significant concern about its internal validity.
And guess what they cite? Yup, you guessed it, the paper excerpted above. So, to me it feels like they are trying to have it both ways -- the evidence FOR implementing the bundle is the same evidence AGAINST its internal validity. Much like Bertrand Russell, I am not that great at dealing with paradoxes. Will this contradiction persist in our psyche, or will sense prevail? Perhaps Ivan and Adam need to start a new blog: Invalidated Results Watch. Oh? Did you say that peer review is supposed to be the answer to this? Right.  

Wednesday, January 12, 2011

Reviewing medical literature, part 3: Threats to validity

You have heard this a thousand times: no study is perfect. But what does this mean? To be explicit about why a certain study is not perfect, we need to be able to name its flaws. And let's face it: some studies are so flawed that there is no reason to bother with them, either as a reviewer or as an end-user of the information. But again, we need to identify these nails before we can hammer them into a study's coffin. It is the authors' responsibility to include a Limitations paragraph somewhere in the Discussion section, in which they lay out all of the threats to validity and offer educated guesses as to how important these threats are and how they may be affecting the findings. I personally will not accept a paper that does not present a coherent Limitations paragraph. However, reviewers are not always, shall we say, as hard-assed about this as I am, and that is when the reader is on her own. Let us be clear: even when the Limitations paragraph is included, the authors do not always do a complete job (and this probably includes me, as I do not always think of all the possible limitations of my own work). So, as in everything, caveat emptor! Let us start to become educated consumers.

There are four major threats to validity that fit into two broad categories. They are:
A. Internal validity
  1. Bias
  2. Confounding/interaction
  3. Mismeasurement or misclassification
B. External validity
  4. Generalizability
Internal validity refers to whether the study is examining what it purports to be examining, while external validity, synonymous with generalizability, gives us an idea about how broadly the results are applicable. Let us define and delve into each threat more deeply.

Bias is defined as "any systematic error in the design, conduct or analysis of a study that results in a mistaken estimate of an exposure's effect on the risk of disease" (the reference for this is Schlesselman JJ, as cited in Gordis L, Epidemiology, 3rd edition, page 238). I think of bias as something that artificially makes the exposure and the outcome either occur together or apart more frequently than they should. For example, the INTERPHONE study has been criticized for its biased design, in that it defined exposure as at least one cellular phone call every week. Enrolling such light users can dilute the exposure so much that no increase in adverse events could plausibly be detected. This is an example of selection bias, by far the most common form that bias takes. Another frequent bias is encountered in retrospective case-control studies in which people are asked to recall distant exposures. Take, for example, middle-aged women with breast cancer who are asked to recall their diets when they were in college. Now ask the same of similar women without breast cancer. Women with cancer, seeking an explanation for their disease, are likely to recall their youthful diets differently than women without cancer; this is recall bias, and it is absent in the comparison group. So, a bias in the design can make the association seem either stronger or weaker than it is in reality.
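If this seems abstract, a quick simulation makes it concrete. Here is a minimal sketch in Python, with entirely invented numbers: the true exposure prevalence is identical in cases and controls (so the true odds ratio is 1.0), but the cases falsely "remember" the exposure a little more often. That small asymmetry alone manufactures an apparent association.

```python
import random

random.seed(42)

N = 10_000            # cases and controls per group; an invented number
true_prev = 0.30      # true exposure prevalence in BOTH groups (no real association)
over_report = 0.15    # extra chance that a case falsely "remembers" the exposure

def reported_exposed(n, prev, false_recall):
    """Count participants who REPORT the exposure."""
    count = 0
    for _ in range(n):
        truly_exposed = random.random() < prev
        # the recall bias: some truly unexposed people report exposure anyway
        if not truly_exposed and random.random() < false_recall:
            truly_exposed = True
        count += truly_exposed
    return count

cases = reported_exposed(N, true_prev, over_report)  # biased recall in cases
controls = reported_exposed(N, true_prev, 0.0)       # accurate recall in controls

# odds ratio from the 2x2 table of REPORTED exposure
or_reported = (cases / (N - cases)) / (controls / (N - controls))
print(f"Reported OR: {or_reported:.2f} (the true OR is 1.0)")
```

With these particular made-up inputs, the reported odds ratio lands in the neighborhood of 1.6 even though no true association exists; the differential recall does all the work.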

I want to skip over confounding and interaction for the moment, as these threats deserve a post of their own, which is forthcoming. Suffice it to say here that a confounder is a factor related to both the exposure and the outcome. An interaction, also referred to as effect modification or effect heterogeneity, means that there may be population characteristics that alter the response to the exposure of interest. Confounders and effect modifiers are probably the trickiest concepts to grasp, so stay tuned for a dedicated discussion.
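Still, as a teaser before that post, here is what "related to both the exposure and the outcome" can do in practice. In this minimal Python sketch (again, every number is invented for illustration), severity of illness drives both who receives the exposure and who has the outcome, while the exposure itself does nothing; the crude risk ratio looks impressive until you stratify by the confounder.

```python
import random

random.seed(42)

N = 100_000  # hypothetical patients

# 2x2 counts overall (crude) and within each severity stratum
crude = {(e, d): 0 for e in (0, 1) for d in (0, 1)}
strata = {s: {(e, d): 0 for e in (0, 1) for d in (0, 1)} for s in (0, 1)}

for _ in range(N):
    severe = int(random.random() < 0.3)        # the confounder: severity of illness
    p_exposed = 0.6 if severe else 0.2         # sicker patients get the exposure more often
    exposed = int(random.random() < p_exposed)
    p_event = 0.25 if severe else 0.05         # the outcome depends ONLY on severity,
    event = int(random.random() < p_event)     # not on the exposure itself
    crude[(exposed, event)] += 1
    strata[severe][(exposed, event)] += 1

def risk_ratio(t):
    """Risk ratio from a 2x2 table keyed by (exposed, event)."""
    risk_exp = t[(1, 1)] / (t[(1, 1)] + t[(1, 0)])
    risk_unexp = t[(0, 1)] / (t[(0, 1)] + t[(0, 0)])
    return risk_exp / risk_unexp

print(f"Crude RR: {risk_ratio(crude):.2f}")  # inflated by the confounder
for s in (0, 1):
    print(f"RR within severity stratum {s}: {risk_ratio(strata[s]):.2f}")  # ~1.0
```

The crude risk ratio comes out close to 2, while within each severity stratum it hovers around 1.0; stratification (or adjustment) is exactly the kind of remedy we will take up in the forthcoming post.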

For now, let us move on to measurement error and misclassification. Measurement error, resulting in misclassification, can happen at any step of the way: in the primary exposure, in a confounder, or in the outcome of interest. I run into this problem all the time in my research. Since I rely on administrative coding for much of the data that I use, I am virtually certain that the codes routinely misclassify some of the exposures and confounders that I deal with. Take Clostridium difficile as an example. There is an ICD-9 code to identify it in administrative databases. However, we know from multiple studies that this code is neither all that sensitive nor all that specific; it is merely good enough, particularly for making observations over time. Even laboratory values carry some potential for measurement error, though we seem to think that lab results are sacred and immune to mistakes. And need I say more about other types of medical testing? Anyhow, the possibility of error and misclassification is ubiquitous. What needs to be determined by the investigator and the reader alike is the probability of that error. If the probability is high, one needs to understand whether the error is systematic (for example, a coder who is consistently more likely than not to include C. diff as a diagnosis) or random (a coder who is just as likely to include as to omit a C. diff diagnosis). And while a systematic error may result in either a stronger or a weaker association between the exposure and the outcome, a random, or non-differential, misclassification will virtually always reduce the strength of this association.
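That last point, the pull toward the null, is worth seeing with your own eyes. Here is a minimal Python sketch with hypothetical numbers: the true relative risk is 2.0, but the exposure coding is randomly wrong in both directions. The 80% sensitivity and specificity are figures I picked out of thin air for illustration, not measured properties of any real ICD-9 code.

```python
import random

random.seed(42)

N = 50_000            # patients per true-exposure group; hypothetical
risk_exposed = 0.20   # true outcome risk in the truly exposed
risk_unexposed = 0.10 # true outcome risk in the truly unexposed (true RR = 2.0)
sensitivity = 0.80    # chance a truly exposed patient is CODED as exposed
specificity = 0.80    # chance a truly unexposed patient is CODED as unexposed

events = {True: 0, False: 0}  # outcome counts by CODED exposure
totals = {True: 0, False: 0}

for truly_exposed in (True, False):
    risk = risk_exposed if truly_exposed else risk_unexposed
    for _ in range(N):
        event = random.random() < risk
        if truly_exposed:
            coded = random.random() < sensitivity   # some exposed are missed
        else:
            coded = random.random() >= specificity  # some unexposed falsely coded
        totals[coded] += 1
        events[coded] += event

print(f"True RR: {risk_exposed / risk_unexposed:.2f}")
observed_rr = (events[True] / totals[True]) / (events[False] / totals[False])
print(f"Observed RR under random miscoding: {observed_rr:.2f}")  # pulled toward 1.0
```

With these inputs the observed RR lands around 1.5 rather than the true 2.0, and the sloppier the coding, the closer to the null it drifts.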

And finally, generalizability is a concept that helps the reader understand the population to which the results may be applicable. In other words, do the results apply strictly to the population represented in the study? If so, is it because there are biological reasons to think that the results would be different in a different population? And if so, is it simply the magnitude of the association that can be expected to differ, or could even its direction change? In other words, could something found to be beneficial in one population be less beneficial or even harmful in another? This last question is the reason we perseverate on the idea of generalizability. Typically, a regulatory RCT is much less likely to give us adequate generalizability than, say, a well-designed cohort study.

Well, these are the threats to validity in a nutshell. In the next post we will explore much more fully the concepts of confounding and interaction and how to deal with them either at the study design or study analysis stage.