
Friday, April 1, 2011

Another swing at the windmill of VAP

Sorry, folks, but I have been so swamped with work that I have been unable to produce anything cogent here. I see today as a gift day, as my plans to travel to SHEA were foiled by mother nature's sense of humor. So, here I am trying to catch up on some reading and writing before the next big thing. To be sure, I have not been wasting time, but have completed some rather interesting analyses and ruminations, which, if I am lucky, I will be able to share with you in a few weeks.

Anyhow, I am finally taking a very close look at the much-touted Keystone VAP prevention study. I have written quite a bit about VAP prevention here, and my diatribes about the value proposition of "evidence" in this area are by now well known and tiresome to my readers. Yet, I must dissect the most recent installment in this fallacy-laden field, where random chance occurrences and willful reclassifications are deemed the cause of dramatic performance improvements.

So, the paper. Here is the link to the abstract, and if you subscribe to the journal, you can read the whole study. But fear not, I will describe it to you in detail.

In its design it was quite similar to the central line-associated bloodstream infection prevention study published in the New England Journal in 2006, and, similarly, the sample frame included Keystone ICUs in Michigan. Now, recall that this demonstration project happened in Michigan because of its astronomical healthcare-associated infection (HAI) rates. Just to digress briefly, I am sure you have all heard of MRSA; but have you heard of VRSA? VRSA stands for vancomycin-resistant Staphylococcus aureus, MRSA's even more troubling cousin, vancomycin being a drug to which MRSA is still susceptible. Now, thankfully, VRSA has not yet emerged as an endemic phenomenon, but of the handful of cases of this virtually untreatable scourge that have been reported, Michigan has had a plurality. So, you get the picture: Michigan is an outlier (and not in the desirable direction) when it comes to HAIs.

Why is it important to remember Michigan's outlier status? Because of the deceptively simple yet devilishly confounding concept of regression to the mean. The idea is that in an outlier situation, at least some of the extreme performance is due to random luck. Therefore, if the performance of an extreme outlier is measured twice, the second measurement will be closer to the population mean by luck alone. But I do not want to get too deeply into this somewhat muddy concept right now -- I will reserve a longer discussion of it for another post. For now I would like to focus on some of the more tangible aspects of the study. As usual, two or three features of the study design substantially reduce the likelihood that the causal inference is correct.
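
To make regression to the mean concrete, here is a minimal simulation sketch in Python (all numbers are invented for illustration and have nothing to do with the actual Keystone data): a crowd of identical ICUs is measured in two periods, the worst decile at baseline is singled out, and its follow-up mean drifts back toward the population mean with no intervention whatsoever.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical example: 500 ICUs share the same true VAP rate;
# observed rates differ only by random (Poisson) noise.
n_icus = 500
true_rate = 5.0      # events per 1,000 ventilator-days (made up)
exposure = 2.0       # thousands of ventilator-days per period (made up)

baseline = rng.poisson(true_rate * exposure, n_icus) / exposure
followup = rng.poisson(true_rate * exposure, n_icus) / exposure

# Single out the "outliers": the worst 10% of ICUs at baseline.
worst = baseline >= np.quantile(baseline, 0.9)

print(f"Baseline mean, worst decile: {baseline[worst].mean():.2f}")
print(f"Follow-up mean, same ICUs:   {followup[worst].mean():.2f}")
print(f"Population mean:             {baseline.mean():.2f}")
# With no intervention at all, the worst performers' follow-up mean
# falls back toward the population mean -- regression to the mean.
```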

The first feature is the training period. Prior to the implementation of the protocol (which, by the way, consisted of the famous VAP bundle we have discussed on this blog ad nauseam), there was intensive education of the personnel on a "culture of change," as well as on the proper definitions of the interventions and outcomes. It is at this time that the "trained hospital infection prevention personnel" became intimately focused on the definition of VAP they were using. And even though the protocol states that the surveillance definition of VAP would not change throughout the study period, what are the chances that this intensified education and emphasis did not alter at least some of the classification practices?

Skeptical? Good. Here is another piece of evidence supporting my stance. A study by Michael Klompas of Harvard examined inter-rater variability in the assessment of VAP, looking at the same surveillance definition applied in the Keystone (and many other) studies. Here is what he wrote:
Three infection control personnel assessing 50 patients for VAP disagreed on 38% of patients and reported an almost 2-fold variation in the total number of patients with VAP. Agreement was similarly limited for component criteria of the CDC VAP definition (radiographic infiltrates, fever, abnormal leukocyte count, purulent sputum, and worsening gas exchange) as well as on final determination of whether VAP was present or absent.
And here is his conclusion:
High interobserver variability in the determination of VAP renders meaningful comparison of VAP rates between institutions and within a single institution with multiple observers questionable. More objective measures of ventilator-associated complication rates are needed to facilitate benchmarking and quality improvement efforts. 
Yet, the Keystone team writes this in their Methods section:
Using infection preventionists minimized the potential for diagnosis bias because they are trained to conduct surveillance for VAP and other healthcare-associated infections by using standardized definitions and methods provided by the CDC in its National Healthcare Safety Network (NHSN).
Really? Am I cynical to invoke circular reasoning here? Have I convinced you yet that VAP diagnosis is a moving target? And as such it can be moved by cognitive biases, such as the one introduced by the pre-implementation training of study personnel? No? OK, consider this additional piece from the Keystone study. The investigators state that "teams were instructed to submit at least 3 months of baseline VAP data." What they do not state is whether this collection was retrospective or prospective, and this matters, at least a little. Retrospective reporting in this case would be far more representative of what had actually been happening, since those VAP rates were already recorded for posterity and presumably could not be altered. If, on the other hand, the reporting was prospective, I can still conceive of ways to introduce a bias into this baseline measure. Imagine, if you will, that you are employed by a hospital that is under scrutiny for a particular transgression, and that you know the hospital will look bad if you do not demonstrate improvement following a very popular and "common-sense" intervention. Might you be a tad more liberal in identifying these transgressive episodes in your baseline period than after the intervention has been instituted? This is a subtle, yet all too real, conflict of interest, which, as we know so well, can introduce a substantial bias into any study. Still don't believe me? OK, come to my office after school and we will discuss. In the meantime, let's move on.

The next nugget is in Figure 1, where VAP trends over the pre-specified time periods are plotted (you can find the identical graph in this presentation on slide #20). Look at the mean, rather than the median, line. (The reason I want you to look at the mean is that the median is zero, and therefore not credible. Additionally, if we want to assess the overall impact of the intervention, we need to embrace the outliers, which the median ignores.) What is tremendously interesting to me is that there is a precipitous drop in VAP during the period called "intervention," followed by much smaller fluctuations around the new mean across the subsequent time periods. This, to me, points to reclassification (and the Hawthorne effect), rather than an actual improvement in VAP rates, as the likely cause of the drop.
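
As a quick aside, the zero-median problem is easy to see with a toy example. The numbers below are invented, not taken from the paper; they just show how a rate distribution piled up at zero leaves the median blind to the very outliers we care about.

```python
import statistics

# Hypothetical quarterly VAP rates for ten ICUs: most report zero,
# a few report large values (numbers invented for illustration).
rates = [0, 0, 0, 0, 0, 0, 0, 3.1, 6.4, 9.8]

print(statistics.median(rates))  # 0.0  -- blind to the outliers
print(statistics.mean(rates))    # 1.93 -- moved by the outliers we care about
```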

Another piece of data makes me think that it was not the bundle that "did it." Figure 2 in the paper depicts the rates of compliance with all 5 of the bundle components over the corresponding time periods. Here, as in the VAP rates graph, the greatest jump in adherence to all 5 strategies is observed in the intervention period. However, there is still a substantial linear increase in this metric from the intervention period through the 25-to-27-month period. Yet, looking back at the VAP data, no commensurately robust reduction is observed. While this is somewhat circumstantial, it makes me that much more wary of trusting this study.

So, does this study add anything to our understanding of what bundles do for VAP prevention? I would say not; if anything, it muddies the waters. What would have been helpful to see is whether any of the downstream outcomes, such as antibiotic administration, time on the ventilator and length of stay, were impacted. Without impacting these outcomes, our efforts are quixotic: merely swinging at windmills, mistaking them for a real threat.



Wednesday, January 12, 2011

Reviewing medical literature, part 3: Threats to validity

You have heard this a thousand times: no study is perfect. But what does this mean? In order to be explicit about why a certain study is not perfect, we need to be able to name its flaws. And let's face it: some studies are so flawed that there is no reason to bother with them, either as a reviewer or as an end-user of the information. But again, we need to identify these nails before we can hammer them into a study's coffin. It is the authors' responsibility to include a Limitations paragraph somewhere in the Discussion section, in which they lay out all of the threats to validity and offer educated guesses as to how important these threats are and how they may be impacting the findings. I personally will not accept a paper that does not present a coherent Limitations paragraph. However, reviewers are not always, shall we say, as hard-assed about this as I am, and that is when the reader is on her own. Let us be clear: even when the Limitations paragraph is included, the authors do not always do a complete job (and this probably includes me, as I do not always think of all the possible limitations of my work). So, as in everything, caveat emptor! Let us start to become educated consumers.

There are four major threats to validity that fit into two broad categories. They are:
A. Internal validity
  1. Bias
  2. Confounding/interaction
  3. Mismeasurement or misclassification
B. External validity
  4. Generalizability
Internal validity refers to whether the study is examining what it purports to be examining, while external validity, synonymous with generalizability, gives us an idea about how broadly the results are applicable. Let us define and delve into each threat more deeply.

Bias is defined as "any systematic error in the design, conduct or analysis of a study that results in a mistaken estimate of an exposure's effect on the risk of disease" (the reference for this is Schlesselman JJ, as cited in Gordis L, Epidemiology, 3rd edition, page 238). I think of bias as something that artificially makes the exposure and the outcome either occur together or apart more frequently than they should. For example, the INTERPHONE study has been criticized for a biased design because it defined exposure as at least one cellular phone call every week. Enrolling such light users can dilute the exposure so much that no increase in adverse events could ever be detected. This is an example of selection bias, by far the most common form that bias takes. Another frequent bias is encountered in retrospective case-control studies in which people are asked to recall distant exposures. Take, for example, middle-aged women with breast cancer who are asked to recall their diets when they were in college. Now, ask the same of similar women without breast cancer. The women with cancer, seeking an explanation for their disease, are likely to remember their youthful diets differently than the women without cancer do: this is recall bias. So, a bias in the design can make an association appear either stronger or weaker than it is in reality.
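
To put some hypothetical numbers on that recall scenario, here is a small simulation sketch (the sample size, exposure prevalence, and recall probabilities are all invented): the true exposure is equally common in cases and controls, yet differential recall manufactures an association out of thin air.

```python
import numpy as np

rng = np.random.default_rng(1)

def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table: a = exposed cases, b = unexposed cases,
    c = exposed controls, d = unexposed controls."""
    return (a * d) / (b * c)

n = 10_000                # cases and controls per group (made up)
p_exposed = 0.30          # same true exposure prevalence in both groups

cases_truth    = rng.random(n) < p_exposed
controls_truth = rng.random(n) < p_exposed

# Differential recall (invented numbers): cases dig harder for an explanation
# and report 95% of their true exposures; controls report only 70%.
cases_recall    = cases_truth & (rng.random(n) < 0.95)
controls_recall = controls_truth & (rng.random(n) < 0.70)

or_truth = odds_ratio(cases_truth.sum(), (~cases_truth).sum(),
                      controls_truth.sum(), (~controls_truth).sum())
or_recall = odds_ratio(cases_recall.sum(), (~cases_recall).sum(),
                       controls_recall.sum(), (~controls_recall).sum())

print(f"OR based on true exposure:     {or_truth:.2f}")   # hovers around 1.0 -- no association
print(f"OR based on recalled exposure: {or_recall:.2f}")  # inflated purely by differential recall
```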

I want to skip over confounding and interaction for the moment, as these threats deserve a post of their own, which is forthcoming. Suffice it to say here that a confounder is a factor related to both the exposure and the outcome. An interaction is also referred to as effect modification or effect heterogeneity: there may be population characteristics that alter the response to the exposure of interest. Confounders and effect modifiers are probably the trickiest concepts to grasp, so stay tuned for a discussion of them.

For now, let us move on to measurement error and misclassification. Measurement error, resulting in misclassification, can happen at any step along the way: it can affect the primary exposure, a confounder, or the outcome of interest. I run into this problem all the time in my research. Since I rely on administrative coding for much of the data that I use, I am virtually certain that the codes routinely misclassify some of the exposures and confounders I deal with. Take Clostridium difficile as an example. There is an ICD-9 code to identify it in administrative databases. However, we know from multiple studies that the code is neither all that sensitive nor all that specific; it is merely good enough, particularly for making observations over time. Even laboratory values carry some potential for measurement error, though we tend to treat lab results as sacred and immune to mistakes. And need I say more about other types of medical testing? Anyhow, the possibility of error and misclassification is ubiquitous. What needs to be determined by the investigator and the reader alike is the probability of that error. If the probability is high, one needs to understand whether the error is systematic (for example, a coder who is always more likely than not to include C. diff as a diagnosis) or random (a coder who is just as likely as not to include a C. diff diagnosis). And while a systematic error may make the association between the exposure and the outcome look either stronger or weaker, a random, or non-differential, misclassification will virtually always weaken the observed association.
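
Here is a minimal sketch of that last claim, again with invented numbers: a real association between an exposure (think of a C. diff code) and an outcome is diluted once the recorded exposure randomly misses or mislabels people in the same way regardless of their outcome.

```python
import numpy as np

rng = np.random.default_rng(7)

def odds_ratio(exposure, outcome):
    """Odds ratio for a binary exposure and binary outcome."""
    a = np.sum(exposure & outcome)    # exposed, with outcome
    b = np.sum(exposure & ~outcome)   # exposed, without outcome
    c = np.sum(~exposure & outcome)   # unexposed, with outcome
    d = np.sum(~exposure & ~outcome)  # unexposed, without outcome
    return (a * d) / (b * c)

n = 200_000
true_exposure = rng.random(n) < 0.10             # 10% truly exposed (made up)
p_outcome = np.where(true_exposure, 0.30, 0.10)  # exposure truly triples the risk
outcome = rng.random(n) < p_outcome

# Non-differential misclassification: the recorded code is 70% sensitive and
# 95% specific, and errs the same way regardless of the outcome (numbers invented).
sens, spec = 0.70, 0.95
coded_exposure = np.where(true_exposure,
                          rng.random(n) < sens,   # true exposures captured 70% of the time
                          rng.random(n) > spec)   # 5% false positives among the unexposed

print(f"OR with true exposure:  {odds_ratio(true_exposure, outcome):.2f}")
print(f"OR with coded exposure: {odds_ratio(coded_exposure, outcome):.2f}")
# The randomly misclassified exposure yields a smaller OR:
# non-differential misclassification biases the association toward the null.
```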

And finally, generalizability is the concept that helps the reader understand to which populations the results may be applicable. In other words, should the results be applied strictly to the population represented in the study? If so, is it because there are biological reasons to think that the results would differ in a different population? And if so, is it simply the magnitude of the association that can be expected to differ, or could even its direction change? In other words, could something found to be beneficial in one population be less beneficial, or even outright harmful, in another? That last question is the reason we perseverate on this idea of generalizability. Typically, a regulatory RCT is much less likely to give us adequate generalizability than, for example, a well-designed cohort study.

Well, these are the threats to validity in a nutshell. In the next post we will explore much more fully the concepts of confounding and interaction and how to deal with them either at the study design or study analysis stage.