Friday, February 25, 2011

Guidelines: What really constitutes level I evidence?

There has been some interesting buzz in the blogosphere about where evidence-based guideline recommendations come from, and I wanted to add a little fuel to that fire today.

As you know, I think a lot about the nature of evidence, about the "science" in clinical science, and about pneumonia, specifically ventilator-associated pneumonia or VAP. Last week I wrote here and here about a specific recommended intervention to prevent VAP consisting of semi-recumbent, as opposed to supine, positioning. This recommendation, one of 21 maneuvers aimed at modifiable risk factors for VAP, had level I evidence behind it. Given my recent deconstruction of this level I evidence, consisting of a single unblinded RCT in a single academic urban center in Spain, and given that we already know that level I data represent a very small proportion of all the evidence behind guideline recommendations, I got curious about this level I stuff. How is level I really defined? Is there a lot of room for subjective judgment? So, I went to the source.

In its HAP/VAP guideline, the ATS and IDSA committee define the levels of evidence in the following way:
Level I (high): Evidence comes from well conducted, randomized controlled trials.

Level II (moderate): Evidence comes from well designed, controlled trials without randomization (including cohort, patient series, and case-control studies). Level II studies also include any large case series in which systematic analysis of disease patterns and/or microbial etiology was conducted, as well as reports of new therapies that were not collected in a randomized fashion.

Level III (low): Evidence comes from case studies and expert opinion. In some instances therapy recommendations come from antibiotic susceptibility data without clinical observations.

So, well conducted, randomized controlled trials. But what does "well conducted" mean? Seems to me that one person's well conducted may be another person's garbage. Well, I went to the text of the document for clarification:
The grading system for our evidence-based recommendations was previously used for the updated ATS Community-acquired Pneumonia (CAP) statement, and the definitions of high-level (Level I), moderate-level (Level II), and low-level (Level III) evidence are summarized in Table 1 (8). 
OK, then. We have to go to reference #8, or the CAP guideline to get to the bottom of the definition. And here is what that document states:
Therefore, in grading the evidence supporting our recommendations, we used the following scale, similar to the approach used in the recently updated Canadian CAP statement (46): Level I evidence comes from well-conducted randomized controlled trials; Level II evidence comes from well-designed, controlled trials without randomization (including cohort, patient series, and case control studies); Level III evidence comes from case studies and expert opinion. Level II studies included any large case series in which systematic analysis of disease patterns and/or microbial etiology was conducted, as well as reports of new therapies that were not collected in a randomized fashion. In some instances therapy recommendations come from antibiotic susceptibility data, without clinical observations, and these constitute Level III recommendations.
Again, we are faced with the nebulous "well-conducted" descriptor with no further defining guidance on how to discern this quality. I resigned myself to going to the next source citation, #46 above, the Canadian CAP statement:
We applied a hierarchical evaluation of the strength of evidence modified from the Canadian Task Force on the Periodic Health Examination [4]. Well-conducted randomized, controlled trials constitute strong or level I evidence; well-designed controlled trials without randomization (including cohort and case-control studies) constitute level II or fair evidence; and expert opinion, case studies, and before-and-after studies are level III (weak) evidence. Throughout these guidelines, ratings appear as roman numerals in parentheses after each recommendation.
Another "well-conducted" construct, another reference, another wild goose chase. The reference #4 above clarified the definition for me thus:
OK, so, now we have "at least one properly randomized controlled trial." So, having gotten to the origin of this broken telephone game, it looks like proper randomization trumps all other markers for a well-done trial. The price of such neglect is giving up generalizability, confirmation, appropriate analyses, and many other important properties that need to be evaluated before stamping the intervention with a seal of approval. 

And this is just one guideline for one syndrome. The bigger point I wanted to illustrate is that, even though we now know that only 14% of all IDSA guideline recommendations have so-called level I evidence behind them, the value and validity of even that highest designation are dubious, given the room for subjectivity and misclassification. So, what does all of this mean? Well, for me it means no foreseeable shortage of fodder for blogging. But for our healthcare policy and our public's health? Big doo-doo.

Thursday, February 24, 2011

New treatments: What benefits at what costs

Yesterday brought quite a bit of press coverage to a small biotechnology company in Cambridge called Vertex. All this attention was spurred by the results of their trial of a drug targeting a specific genetic mutation in cystic fibrosis. The treatment, aimed at a mutation present in about 4% of all CF sufferers, was able to improve the volume a patient can force out of his lungs in 1 second by more than 10 percentage points, from about 65% to about 75% of predicted. Matthew Herper of Forbes, on his blog, while duly impressed by the results, also cautioned that the annual price tag for this medicine is likely to reach $250,000 per patient. So, what does all of this mean in the context of our ongoing national discussion about the value of therapies? Well, let's break things down a bit.

First, let's talk about CF. This is a genetic disorder that essentially makes mucus very sticky. Among its many effects, in its most familiar manifestation this mucus plugs up the airways, making it difficult to breathe and predisposing the person to frequent and serious lung infections. When I was a resident back in the early '90s, I remember a devastating case of a young man in his late teens with CF whom we all knew so well from his frequent admissions for exacerbations. Though he was pretty high on the lung transplant list, he ended up succumbing to a devastating pneumonia in our ICU, leaving behind a devoted sister who had been fortunate enough to benefit from a transplant several years earlier. This was a typical course in those days: a brief life punctuated by frequent exacerbations, hospitalizations, antibiotics, gastrointestinal complications, and early death in the second or at best third decade of life, with very little hope of procreation. Over the last 20 years things have changed dramatically in the treatment of CF: fewer exacerbations, much lengthened life expectancy and a good chance of having children. Yet we cannot attribute most of these changes to dramatic new breakthrough therapies. To be sure, there have been tweaks to how we give antibiotics and how pancreatic enzymes are administered to replace the digestive enzymes that the pancreas in CF is unable to produce, but most of the progress can be attributed to increased attention to detail and the advent of almost ruthless care coordination at specialty centers. As a fellow in the '90s I participated in a clinic where CF patients were transitioning from care by pediatric pulmonologists to care by adult physicians. The CF specialist running this clinic not only knew all of his patients and their family members by name, but was also available 24/7 to them and to his staff for consultation. This is the kind of dedication and vigilance necessary to improve the outcomes in CF.

Now, let's talk about the lesion addressed in the Vertex trial. The type of chronic lung disease caused by CF is called "obstructive." Simply put, it makes exhaling the air in the lungs difficult. On lung testing, one manifestation of obstruction is the amount of air one is able to force out of the lungs in the first second of the effort, called the FEV1, or forced expiratory volume in 1 second. Another important measure of the degree of obstruction is the proportion that this 1-second volume represents of all the air that can be forcibly exhaled from the lungs, known as the FVC, or forced vital capacity. We say that if the FEV1/FVC ratio is under 75%, then obstruction is present; the size of the FEV1 tells us how bad the obstruction is.
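
For readers who like to see the arithmetic, here is a toy sketch (Python) of how these two numbers are combined; the lung volumes are invented, the 75% ratio cutoff is the one used in this post, and the severity labels are rough, illustrative bins rather than any official classification.

    # Toy illustration of the two spirometry numbers described above.
    # Volumes (liters) are invented; cutoffs follow the numbers used in this post.
    def spirometry_summary(fev1_liters, fvc_liters, fev1_pct_predicted):
        ratio = fev1_liters / fvc_liters
        obstruction = "obstruction" if ratio < 0.75 else "no obstruction"
        if fev1_pct_predicted >= 80:
            severity = "FEV1 in the normal range"
        elif fev1_pct_predicted >= 60:
            severity = "mild-to-moderate reduction in FEV1"
        else:
            severity = "more severe reduction in FEV1"
        return f"FEV1/FVC = {ratio:.2f} ({obstruction}); {severity}"

    # Hypothetical volumes chosen so that FEV1 sits at ~65% and ~75% of predicted:
    print(spirometry_summary(2.4, 3.8, 65))
    print(spirometry_summary(2.8, 3.8, 75))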

With this as a background, the primary outcome in many obstructive lung disease trials is the improvement in the FEV1. In the specific trial discussed, the average starting FEV1 in the intervention group was about 65%, which falls in the mild-to-moderate category of obstruction. What this means in terms of symptoms can vary widely. The 10% absolute improvement seen in the intervention group resulted in an average FEV1 of about 75% after treatment, definitely representing fairly mild obstruction (generally an FEV1 over 80% is considered to be in the normal range). And this truly is impressive. However, equally interesting is the information that is not in the press coverage, which is largely based on press releases and sound bites from company executives, since the peer-reviewed study is not available at this point. We are not told, but are led to assume, that the control group started out on average with a similar deficit in lung function. We are informed that the treatment patients were 55% less likely than placebo patients to have an exacerbation of their disease, yet we do not know what the absolute numbers are; that is, we are not told what proportion in each group had an exacerbation, how frequently, or how severely. So, this 55%, in the absence of context, while an attention grabber, is not a substantive number. Herper does tell us that there was a remarkable difference in weight gain (a desirable outcome in the CF population): on average 6.8 lb in the treatment vs. 0.9 lb in the placebo group. This is truly impressive, though it would be even more so if I knew that the trial was double blind, a piece of information I did not notice in any of the reports. Some of the reports have also alluded to symptomatic improvement in shortness of breath, though nowhere did I see this quantified.
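
To see why a bare "55% less likely" is not interpretable without the absolute numbers, here is a small sketch (Python) in which the control-group exacerbation rates are entirely invented, since the coverage did not report them.

    # The trial reportedly showed a 55% relative reduction in exacerbations but no
    # absolute rates. These control-group rates are hypothetical, purely to show
    # how much the absolute benefit depends on them.
    relative_reduction = 0.55

    for control_rate in (0.40, 0.10, 0.02):        # hypothetical exacerbation rates
        treated_rate = control_rate * (1 - relative_reduction)
        arr = control_rate - treated_rate          # absolute risk reduction
        nnt = 1 / arr                              # number needed to treat to prevent one exacerbation
        print(f"control {control_rate:.0%} -> treated {treated_rate:.0%}, "
              f"ARR {arr:.1%}, NNT ~{nnt:.0f}")

The same 55% can mean one prevented exacerbation for every 5 patients treated or one for every 90, which is exactly the context the press coverage leaves out.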

The most important piece of data, however, is conspicuously absent from all the stories. What is the proportion of patients who responded to therapy? Why is this important? Well, we know that far from everyone responds to every treatment they ostensibly qualify for; this is referred to as heterogeneity of the treatment effect, or HTE. It is very likely that the 10% improvement in the FEV1 is at once an inflated estimate relative to the non- or under-responders and a muted one for those patients with a terrific response. The question of a minimal clinically significant change in the FEV1 has haunted the lung trials community for a long time now. Yet, without setting some threshold for a minimum FEV1 improvement that correlates with a meaningful improvement in symptoms, one cannot quantify how well the drug works and hence articulate its value. This is crucial when trying to justify the ostensibly exorbitant price tag anticipated for this drug. How many patients will we need to treat in order to have one of them respond meaningfully, with an improvement not just in a laboratory number, but also in their lives? If this targeted drug produces a desirable response even in 50% of all patients with the specific mutation it targets, then we need to spend $500,000 annually (two patients treated at $250,000 each) to obtain a meaningful improvement in symptoms in one CF patient. But what if it only works this way in 20%? Then we will need to treat 5 patients with this drug to obtain 1 meaningful response, at a price of $1.25 million annually. This becomes a bit more daunting, particularly given that the costs will have to be covered through some kind of public or pooled funds, and given that this is one of many therapies in the pipeline likely to come with a similar conundrum.
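
The back-of-the-envelope cost arithmetic above can be written down explicitly. A minimal sketch, assuming the quoted $250,000 annual price and treating the responder proportions as unknowns:

    # Cost per meaningful responder = price per patient-year x number needed to
    # treat (NNT), where NNT = 1 / responder proportion. The $250,000 figure is
    # the anticipated annual price quoted in the press coverage; the responder
    # proportions are hypothetical.
    annual_price = 250_000

    for responder_fraction in (1.0, 0.5, 0.2):
        nnt = 1 / responder_fraction
        cost_per_responder = annual_price * nnt
        print(f"{responder_fraction:.0%} respond -> NNT {nnt:.0f}, "
              f"${cost_per_responder:,.0f} per meaningful response per year")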

I am not implying that improving a single life is not worth $1.25 million annually. In fact, it may well be a bargain. My point is that these are the serious discussions we need to have as a society, so that when the time comes to make these choices, the discussion will not be subverted by a few loud voices sensationalizing "death panel" slogans. Manufacturers need to know that they should disclose full data, not just selective tidbits that highlight benefits only, but also those difficult pieces of information that shed light on their costs. On our part, we need to understand the gargantuan effort and resources these companies expend to tame these elusive wild therapies that hold so much more promise in the abstract than they end up embodying.

We tread a fine line here. Information and how we assimilate it are the next frontier for cogent decision making. We need to get educated about this now because this train is leaving the station regardless of how we feel about it.                            

Tuesday, February 15, 2011

The rose-colored glasses of early trial termination

The other day I did a post on semi-recumbent positioning to prevent VAP. The point I wanted to make was that an already existing quality measure for a condition that is well on its way to becoming a CMS "never event" is based on one unreplicated single-center small unblinded randomized controlled trial that was terminated early for efficacy. In my post I cited several issues with the study that question its validity. Today I want to touch upon the issue of early termination, which in and of itself is problematic.

What is early termination? It is just that: stopping the trial before enrolling the pre-planned number of subjects. First, it is important to be explicit in the planning phases about how many subjects will need to be enrolled. This is known as the power (or sample size) calculation, and it is based on the anticipated effect size, its expected variability, and the error rates one is willing to accept. Termination can happen for efficacy (the intervention works so splendidly that it becomes unethical not to offer it to everyone), for safety (the intervention is so dangerous that it becomes unethical to offer it to anyone), or for other reasons (e.g., recruitment is taking too long).

Who makes the decision to terminate early, and how is the decision made? Well, under the best of circumstances there is a Data Safety Monitoring Board (DSMB), a body specifically in place to look at the data at certain points in the recruitment process and to look for certain pre-specified differences between groups. The DSMB is fire-walled from both the investigators and the patients. The interim looks at the data should also be pre-specified in the protocol, since the number of these looks influences the initial power calculation: the more you look, the more differences you are likely to find by chance alone.
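
The "more looks, more chance findings" point is easy to demonstrate by simulation. Below is a small sketch (Python) comparing the false-positive rate of a single final analysis with that of a trial peeked at five times with no statistical correction; the trial size, event rate and number of looks are arbitrary choices for illustration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    def trial_rejects(n_per_arm, true_rate, looks, alpha=0.05):
        """Simulate one two-arm trial with NO true difference; test at each look."""
        a = rng.random(n_per_arm) < true_rate     # events in arm A
        b = rng.random(n_per_arm) < true_rate     # events in arm B
        for n in looks:                            # interim (and final) analyses
            ra, rb = a[:n].mean(), b[:n].mean()
            se = np.sqrt(ra * (1 - ra) / n + rb * (1 - rb) / n)
            if se > 0:
                p = 2 * stats.norm.sf(abs(ra - rb) / se)
                if p < alpha:
                    return True                    # "stop early for efficacy"
        return False

    n, rate, sims = 200, 0.3, 5000
    one_look   = sum(trial_rejects(n, rate, [n]) for _ in range(sims)) / sims
    five_looks = sum(trial_rejects(n, rate, [40, 80, 120, 160, 200]) for _ in range(sims)) / sims
    print(f"False-positive rate, single final analysis: {one_look:.3f}")
    print(f"False-positive rate, five uncorrected looks: {five_looks:.3f}")

In this simulation the nominal 5% false-positive rate more than doubles with five uncorrected looks, which is why the number and timing of interim analyses have to be built into the design, and why a DSMB uses stopping boundaries far stricter than p < 0.05.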

So, without going into too much detail on these interim looks, understand that they are not to be taken lightly, and their conditions and reporting require full transparency. To their credit, the semi-recumbent position investigators reported their plan for one interim analysis upon reaching 50% enrollment. Neither the Methods section nor the Acknowledgements, however, specifies who the analyst and the decision-maker were. Most likely it was the investigators themselves who ended up taking the look and deciding on the subsequent course of action, and that in itself is not all that methodologically clean.

Now, let's talk about one problem with early termination. This gargantuan effort, led by the team from McMaster in Canada and published last year in JAMA, sheds the needed light on what had been suspected before: early termination leads to inflated effect estimates. The sheer massiveness of the work is mind-boggling -- over 2,500 studies were reviewed! The investigators elegantly paired meta-analyses of truncated RCTs with meta-analyses of matched but nontruncated ones, and compared the magnitude of the inter-group differences between the two categories of RCTs. Here is one interesting tidbit (particularly for my friend @ivanoransky):
Compared with matching nontruncated RCTs, truncated RCTs were more likely to be published in high-impact journals (30% vs 68%, P<.001).
But here is what should really grab the reader:

Of 63 comparisons, the ratio of RRs was equal to or less than 1.0 in 55 (87%); the weighted average ratio of RRs was 0.71 (95% CI, 0.65-0.77; P<.001) (Figure 2). In 39 of 63 comparisons (62%), the pooled estimates for nontruncated RCTs were not statistically significant. Comparison of the truncated RCTs with all RCTs (including the truncated RCTs) demonstrated a weighted average ratio of RRs of 0.85; in 16 of 63 comparisons (25%), the pooled estimate failed to demonstrate a significant effect. [Emphasis mine]
The authors went on to conclude the following:

In this empirical study including 91 truncated RCTs and 424 matching nontruncated RCTs addressing 63 questions, we found that truncated RCTs provide biased estimates of effects on the outcome that precipitated early stopping. On average, the ratio of RRs in the truncated RCTs and matching nontruncated RCTs was 0.71. This implies that, for instance, if the RR from the nontruncated RCTs was 0.8 (a 20% relative risk reduction), the RR from the truncated RCTs would be on average approximately 0.57 (a 43% relative risk reduction, more than double the estimate of benefit). Nontruncated RCTs with no evidence of benefit—ie, with an RR of 1.0—would on average be associated with a 29% relative risk reduction in truncated RCTs addressing the same question.

So, what does this mean? It means that truncated RCTs do indeed tend to inflate the effect size substantially, and to show differences by chance alone where none exist.

This is concerning in general, and specifically for our example of the semi-recumbent positioning study. Let us do some calculations to see just how this effect inflation would play out in said study. Recall that microbiologically confirmed pneumonia occurred in 2 of 39 (5%) semi-recumbent cases and in 11 of 47 (23%) supine cases. The investigators calculated the adjusted odds ratio of VAP in the supine compared to the semi-recumbent group to be 6.8 (95% CI 1.7-26.7). This, as I mentioned before, is an inflated estimate, as odds ratios tend to be when events are frequent. Furthermore, I obviously cannot redo the adjusted calculation, as I would need the primary patient data for that. What we really want anyway is the relative reduction in VAP due to the intervention, which points in the opposite, protective direction from the odds ratio we are given. So, I can derive the unadjusted relative risk thusly: (2/39)/(11/47) = 0.22. Now, if truncation alone deflates the observed relative risk by 29% on average (the ratio of RRs of 0.71 from the JAMA analysis), then, had the trial been allowed to go to completion, this relative risk would have been roughly 0.22/0.71, or ~0.3. In this range, the difference does not seem all that impressive. But as all of the threats to validity we discussed in the original post begin to chisel mercilessly away at this risk reduction, the 29% inflation becomes a proportionally bigger deal.
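
Here is that arithmetic spelled out (Python); the 0.71 ratio of relative risks is the average from the JAMA analysis quoted above, so applying it to a single trial is only a rough, illustrative correction.

    # Unadjusted relative risk of microbiologically confirmed VAP,
    # semi-recumbent vs. supine, from the trial's raw counts:
    rr_observed = (2 / 39) / (11 / 47)           # ~0.22

    # Average ratio of RRs (truncated / nontruncated) from the JAMA
    # meta-analysis of stopped-early trials:
    ratio_of_rrs = 0.71

    # A crude "de-inflated" estimate of what a completed trial might have shown:
    rr_deflated = rr_observed / ratio_of_rrs     # ~0.31
    print(f"observed RR {rr_observed:.2f}, truncation-corrected RR ~{rr_deflated:.2f}")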

Well, that does it.  

Monday, February 14, 2011

Redefining compassion: "A spiritual technology"

This is a TEDxUN talk by Krista Tippett. It is fantastic!
If you are in a rush, just go to around minute 10:00 or so. But really the whole talk is well worth considering.

Friday, February 11, 2011

CMS never events: Evidence of smoke and mirrors?

Let me tell you a fascinating story. In 1999, I was still fresh out of my Pulmonary and Critical Care Fellowship, struggling for breath in the vortex of private practice, when a cute little paper appeared in the Lancet from a great group of researchers in Spain, describing a study performed in one large academic urban medical center's two ICUs: one respiratory and one medical. Its modest aim was to see whether semi-recumbent (partly sitting up), as opposed to supine (lying flat on the back), positioning could reduce the incidence of that bane of the ICU, ventilator-associated pneumonia (VAP). The study was a well done randomized controlled trial, and the investigators even went so far as to calculate the necessary sample size (the number needed to enroll in order to detect a pre-determined magnitude of effect [in this case an ambitious 50% reduction in clinically suspected VAP]); this number was 182, based on the assumption of a 40% VAP prevalence in the control (supine) group. The primary endpoint was the prevalence (the percentage of all mechanically ventilated [MV] patients developing it) of clinically suspected VAP, based on the CDC criteria, and the secondary endpoints were its incidence density (the number of cases among all MV patients spread over all the cumulative days of MV [patient-days of MV]) and microbiologically confirmed VAP (also rigorously defined).
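
For the curious, this kind of sample-size calculation can be reproduced approximately with standard tools. Here is a minimal sketch (Python, using statsmodels), assuming a two-sided alpha of 0.05 and 80% power; those are my assumptions rather than the investigators' stated ones, so the answer will not necessarily land exactly on 182.

    # Two-group sample-size sketch for the trial's stated assumptions: 40% VAP
    # in the supine group and a 50% relative reduction (i.e., 20%) with
    # semi-recumbent positioning. Alpha and power below are my assumptions.
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    p_control, p_intervention = 0.40, 0.20
    effect = proportion_effectsize(p_control, p_intervention)   # Cohen's h
    n_per_group = NormalIndPower().solve_power(effect_size=effect,
                                               alpha=0.05, power=0.80,
                                               alternative='two-sided')
    print(f"~{n_per_group:.0f} patients per group, ~{2 * n_per_group:.0f} total")

This lands in the same ballpark as the reported 182; any gap presumably reflects the investigators' exact choices of test, power and dropout allowance, which are not detailed here.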

Here is what they found. The study was stopped early due to efficacy (this means that the intervention was so superior to the control in reaching the endpoint that it was deemed unethical after the interim look to continue the study), enrolling only 86 patients, 39 in the intervention and 47 in the control groups. And here are the results for the primary and secondary outcomes:

So, this is great! No matter how you slice it, VAP is reduced substantially: the prevalence of microbiologically confirmed VAP drops more than 4-fold (nearly 6-fold if expressed as unadjusted odds; and remember, this is unadjusted for potential differences between groups, and there were differences!). Well, you know what's coming next. That's right, the "not so fast" warning. Let's examine the numbers in context.

First of all, if we look at the evidence-based guideline on HCAP, HAP and VAP from the ATS and IDSA, the prevalence of VAP is generally between 5 and 15%; in the current study the control group exceeds 20%. Now, for the incidence density, for years now the CDC has been keeping and reporting these numbers in the US, and the rate in patients comparable to the ones in the study should be around 2-4 cases per 1,000 MV days. In this study, no matter how you slice it, clinically or microbiologically, the incidence density is exceedingly high, more in line with some of the ex-US numbers reported in other studies. So, they started high and ended high, albeit with a substantial reduction.

Second of all, there is a wonderful flow chart in the paper that shows the enrollment algorithm. One small detail has always been somewhat obscure to me: the 4 patients in the semi-recumbent group that were excluded from analysis due to reintubation (this means that they were taken off MV, but had to go back on it within a day or two), which was deemed a protocol violation. Now, you might think that 4 patients is a pretty small number to worry about. But look at the total number of patients in the group: 39. If the excluded 4 all had microbiologically confirmed VAP, that would bring our prevalence from 5% to 14% (6 out of 43). This would certainly be a less than 6-fold reduction in VAP.
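
A quick sensitivity analysis makes the point concrete (Python); the "worst case" simply assumes, for the sake of argument, that all 4 excluded semi-recumbent patients would have developed microbiologically confirmed VAP.

    # Reported analysis: 2 of 39 semi-recumbent vs. 11 of 47 supine patients
    # with microbiologically confirmed VAP.
    reported = 2 / 39                        # ~5%
    supine = 11 / 47                         # ~23%

    # Worst-case assumption (for argument's sake): add the 4 excluded
    # semi-recumbent patients back in and count all of them as VAP cases.
    worst_case = (2 + 4) / (39 + 4)          # 6/43 ~ 14%

    print(f"reported: {reported:.0%} vs supine {supine:.0%} "
          f"({supine / reported:.1f}-fold difference in risk)")
    print(f"worst case: {worst_case:.0%} vs supine {supine:.0%} "
          f"({supine / worst_case:.1f}-fold difference in risk)")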

Thirdly, and this I think is critical, the study was not blinded. In other words, the people who took care of the patients knew the group assignment. So what, you ask. Well remember that VAP is a pretty difficult, elusive and unclear diagnosis. So, let us pretend that I am a doc who is also an investigator on the study, and I am really invested in showing how marvelous semi-recumbent positioning is for VAP prevention. I am likely to have a much lower threshold for suspecting and then diagnosing VAP in the comparator group than in my pet intervention group. And this is not an indictment of anyone's judgment or integrity; it is just how our brains are wired.

Next, there were indeed important differences between groups in their baseline risk factors for VAP. For example, more patients in the control (38%) than in the intervention (26%) group were on MV for a week or longer, the single most important risk factor for developing VAP. Likewise, the baseline severity of illness was higher in the control than in the intervention group. To be sure, the authors did statistical analyses to adjust these differences away, and still found the adjusted odds ratio of VAP in the supine group, relative to the semi-recumbent group, to be 6.8, with a 95% confidence interval of 1.7 to 26.7. This is generally taken to mean that, on average, the odds of VAP increase nearly 7-fold with supine as opposed to semi-recumbent positioning, and that if the trial were repeated many times, 95% of the confidence intervals so constructed would be expected to contain the true value. OK, so we can accept this as a possible viable strategy, right?

But wait, there is more. Remember what we said about the odds ratio? When the event happens in more than 10% of the sample, the odds ratio vastly overestimates the relative risk of the event. 28.4%, anyone?
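
Two lines of arithmetic on the trial's own unadjusted counts show the size of the gap. Here is a sketch (Python); the second 2x2 table is purely hypothetical, keeping the same proportions among ten times as many patients, just to show that odds ratios and risk ratios converge when events are rare.

    # Unadjusted 2x2 table for microbiologically confirmed VAP:
    #   supine:          11 of 47
    #   semi-recumbent:   2 of 39
    risk_supine, risk_semi = 11 / 47, 2 / 39
    odds_supine, odds_semi = 11 / (47 - 11), 2 / (39 - 2)

    rr = risk_supine / risk_semi       # relative risk, ~4.6
    or_ = odds_supine / odds_semi      # odds ratio, ~5.7
    print(f"unadjusted RR {rr:.1f} vs unadjusted OR {or_:.1f}")

    # Hypothetical rare-event version: same counts, ten times the denominators.
    rare_rr = (11 / 470) / (2 / 390)
    rare_or = (11 / 459) / (2 / 388)
    print(f"with rare events: RR {rare_rr:.1f} vs OR {rare_or:.1f}")

With VAP occurring in roughly a quarter of the supine group, the odds ratio runs well ahead of the risk ratio; make the event rare and the two essentially agree.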

Now, let's put it all together. A single center study from a Spanish academic hospital, among respiratory and medical ICU patients, with a minuscule sample size, yet halted early for efficacy, an exceedingly high baseline rate of VAP, a substantial number of patients excluded for a nebulous reason, unblinded and therefore prone to biased diagnosis, reporting an inflated reduction in VAP development in the intervention group. It would be very easy to write this off as a flawed study (like all studies tend to be in one way or another) in need of confirmatory evidence, if it were not so critical in the current punitive environment of quality improvement. (By the way, to the best of my knowledge, there is no study that replicates these results). The ATS/IDSA guideline includes semi-recumbent positioning as a level I (highest possible level of evidence) recommendation for VAP prevention, and it is one of the elements of the MV bundle, as promoted by the Institute for Healthcare Improvement, which demands 95% compliance with all 5 elements of the bundle in order to get the "compliant" designation. And even this is not the crux of the matter. The diabolical detail here is that CMS is creeping up on making VAP into one of their magical "never" events, and the efforts by hospitals will most assuredly be including this intervention. So, ICU nurses are already expected to fall in step with this deceptively simple yet not-so-easily executable practice.

And this is what is under the hood of just one simple level I recommendation by two reputable professional organizations in their evidence-based guidelines. One shudders to think...              

Wednesday, February 9, 2011

Evidence and profit: An unhealthy alliance

My JAMA Commentary came out this week, and I am getting e-mail about it. It seems to have resonated with many docs who feel that the research enterprise is broken and its output fails them at the office. But what I want to do is tie a few ideas together, ideas that I have been exploring on this blog and elsewhere, ideas that may hold the key to our devastating healthcare safety problem.

The last four decades can be viewed as a nexus between the growth of evidence-based medicine (EBM) on the one hand, and the unbridled proliferation of the biopharmaceutical industry and its technologies on the other. The result has been rapid development, maximization of profit, and a juggernaut of poorly thought-out and completely uncoordinated research geared initially toward regulatory approval and subsequently toward market growth. It is not that the clinical research has been of poor quality, no. It is that our research tools are primitive and allow us to see only slivers of reality. And these slivers are prone to many of our cognitive biases to boot. So, the drive to produce evidence and the drive to grow business colluded to bring us to where we are today: inundated with evidence of unclear validity, unbalanced with regard to where the biggest difference to public health can be made. Yet we are constantly poked and prodded by an eager bureaucracy to do better at implementing this evidence, while the system continues to perform in a devastatingly suboptimal fashion, causing more deaths every year than stroke.

A byproduct of this technological and financial race has been the rapid escalation of healthcare spending, with the consequent drive to contain it. The containment measures have, of course, had the "unintended consequence" of increased patient volume for providers and of the incredible shrinking appointment, all just to make a living. The end result for clinicians and patients is the relentless pressure of time and the straitjacket of "evidence-based" interventions in the name of quality improvement. And in this mad race against the clock and demoralization, very few have had the opportunity to think rationally and holistically about the root causes of our status quo. The reality is that we are now madly spinning our wheels at the margins, getting bogged down in infinitesimal details and losing the forest for the trees (pardon all the metaphor mixing). Our evidence-based quality improvement efforts, while commendable, are like trying to plug holes in a ship's hull with band-aids: costly and, overall, making little if any difference.

But if we step back and stop squinting, we can see the big picture: a stagnant and outdated research enterprise still rewarding spending over substance, embattled clinicians trying to stay afloat, and a $2.5 trillion healthcare gorilla feeding the economy at the expense of human lives. Will technology fix this mess? Not by itself, no. Will more "evidence" be the answer? No, not if we continue to generate it as usual. Is throwing more money at the HHS the solution? I doubt it. A radical change of course is in order. Take profit out of evidence generation, or at least blunt its influence (this will reduce the clutter of marginal, hair-splitting technologies occupying clinicians' collective consciousness); develop new tools for better patient care rather than for maximizing the bottom line; give clinicians more time to think about their patients' needs rather than about how to maintain enough income to pay for the overhead. These are some of the obvious yet challenging solutions to the current crisis. Challenging because there needs to be political will to implement them, and because we are currently so invested in the path we are on that it is difficult, and perhaps impossible, to stray without losing face. But what is the alternative?

Tuesday, February 8, 2011

Medical decision making: More signal less noise, please!

It's official, I'm a country bumpkin! Driving in Boston last week I was distracted, annoyed, made anxious and confused by the constant traffic, billboards and signs. Even highway markings confused me, particularly one indicating a detour to Storrow Drive East, which never materialized. Despite the fact that I know the geography of Boston like the back of my hand, I nearly went down the wrong streets multiple times, including driving the wrong way on some one-way roads. Yes, I am now the menace I used to save my prize driving language for in my younger days.

But it seems that over the years of my living away, there has been a sharp increase in the information thrown at me from all directions, accompanied by a decline in places to rest my gaze without suffering the perseveration of conscious processing. And while the value of this information is at best questionable, the sum total of this overstimulation is clearly confusion, wrong road choices and possibly a reduction in the safety of my driving. This whole experience reminded me of Thomas Goetz's distaste for how medical results are reported. If you have not seen him preach about it, you really should. Here is his excellent TED talk on the subject.


It is ironic that during this overwhelming city visit I also had the chance to speak to a doctor about "routine" preoperative testing and its value. Before surgery, it is recommended that a patient get a screening evaluation. Yet the components of this evaluation vary widely, and may include blood work, urinalysis, an electrocardiogram, a chest X-ray and the like. Although evidence suggests that most of the components of this evaluation are useless at best, many institutions continue to order a shotgun panel of preoperative testing for everyone. This one-size-fits-all medicine results in reams of useless and distracting information, a high frequency of abnormal findings of questionable significance, a potential for harm and worry, and needless healthcare spending. In my particular conversation I asked the anesthesiologist what the pre-test probability for someone with my characteristics was for a useful chest X-ray result, for example, and whether the fancy electronic medical record used by the hospital could help her determine this. While the answer to the former question was "probably exceedingly low," the answer to the latter was a definitive "no." So, given some elementary thinking, it became clear that a patient like me should not in fact be subjected to a chest X-ray, since any pathology found on one would likely represent a false positive finding, which would nevertheless require potentially invasive follow-up. And guess what? By focusing on the particular individual in the office, rather than on all comers, we could have gone through the entire menu of preoperative tests "routinely" ordered and eliminated most if not all of them. But my bet is that not all patients, not even all e-patients, know to initiate, or are able to initiate, this type of critical discussion. And yet what tests to obtain, if any, should always be a thoughtful and individualized decision. To approach testing in any other way is to risk generating noise, distraction and harm.
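
The "probably exceedingly low" answer can be made concrete with Bayes' rule. Here is a minimal sketch (Python); the sensitivity, specificity and pre-test probabilities below are entirely made up for illustration, since no such numbers came up in the conversation.

    # Positive predictive value of a screening test from pre-test probability,
    # sensitivity and specificity (all numbers below are hypothetical).
    def ppv(pretest, sensitivity, specificity):
        true_pos = pretest * sensitivity
        false_pos = (1 - pretest) * (1 - specificity)
        return true_pos / (true_pos + false_pos)

    sens, spec = 0.80, 0.90   # assumed operating characteristics of the test
    for pretest in (0.001, 0.01, 0.10):
        print(f"pre-test probability {pretest:.1%} -> "
              f"probability of disease given a positive result {ppv(pretest, sens, spec):.1%}")

At a pre-test probability of a fraction of a percent, the overwhelming majority of "abnormal" films would be false positives, which is precisely the noise, distraction and harm described above.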

And this brings me back to Thomas Goetz's idea of redesigning how test results are reported. I love his idea. But to me, what needs to happen before we make the data patient-friendly is making the decision making provider-friendly. So, great idea, Mr. Goetz, but let us move it upstream, to the office, where the decision to get chest X-rays, cholesterols and urinalyses is made, and help the doctor visualize her patient's risk for a disease being present, the characteristics of the test about to be ordered, the probability of a positive test result, and all the downstream probabilities that stem from this testing, so as to put a positive result in the context of the individual's risk for having the disease. Because getting the results of tests that perhaps should never have been obtained in the first place is the GIGO principle in action. It generates noise, distraction and detours the wrong way down one-way roads. And when applied to medicine, these are definitely unwelcome metaphors.
    

Wednesday, February 2, 2011

Intervention in ICU reduces hospital mortality, but by how much?

Addendum #2, 12:09 PM EST, 2/2/11:
So, here is the whole story. Stephanie Desmon, the author of the JH press release, e-mailed me back and pointed me to Peter Pronovost as the source for the 10% reduction information. I e-mailed Peter, and he got back to me, confirming that 
"The 10 percent is the rounded differences in differences in odds ratios"
Moral of the story: The devil is in the details.


And speaking of details, I must admit to an error of my own. If you look at the figure reproduced below, I called out the wrong points. For adjusted data, you need to look at the open circles (for the intervention group) and squares (for the control group). In fact, the adjusted mortality went from about 20% at baseline to 16% in the 13-22 months interval for the Keystone cohort, while for the control group it went from a little over 20% to a little under 18%. This makes the absolute reduction a tad more impressive, though there is still less than a 2% absolute difference between the reduction seen in the intervention vs. the control group, leaving all of my other points still in need of addressing. 
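
Pulling those corrected numbers together, the "difference in differences" arithmetic looks roughly like this (Python); the percentages are my reading of the plot, as described above, so treat them as approximate.

    # Approximate adjusted mortality read off the figure (see text):
    keystone_baseline, keystone_late = 0.20, 0.16      # intervention cohort
    control_baseline, control_late = 0.205, 0.178      # comparator: a little over 20% to a little under 18%

    keystone_drop = keystone_baseline - keystone_late  # ~4 percentage points
    control_drop = control_baseline - control_late     # ~2.7 percentage points
    diff_in_diff = keystone_drop - control_drop        # what the intervention may have added

    print(f"Keystone drop ~{keystone_drop:.1%}, control drop ~{control_drop:.1%}, "
          f"difference in differences ~{diff_in_diff:.1%}")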


Addendum #1, 11:00 AM EST, 2/2/11:
I just found what I think is the origin of the 10% mortality reduction rumor in this press release from Johns Hopkins. I just e-mailed Stephanie Desmon, the author of the release, to see where the 10% came from. Will update again should I hear from either Maggie Fox or Stephanie Desmon.  

Remember the Keystone project? A number of years ago, when we started to pay close attention to healthcare-associated infections (HAI) and hospitals started to take introspective looks at their records, it turned out that the ICUs in the state of Michigan, for one reason or another, had very high rates of HAIs. As this information percolated through our collective consciousness, the stars aligned in such a way as to release funding from the AHRQ in Washington, DC, for a group of ICU investigators at the Johns Hopkins University School of Medicine in Baltimore, MD, headed by Peter Pronovost, to design and implement a study employing IHI-style (Boston, MA) bundled interventions to prevent catheter-associated blood stream infections (CABSI) and ventilator-associated pneumonia (VAP) across a consortium of ICUs in MI. Whew! This poly-geographic collaboration resulted in a landmark paper in 2006 in the New England Journal of Medicine, wherein the authors showed that the bundled interventions, directed by a checklist aimed at CABSI, were indeed associated with a satisfying reduction in CABSI. Since 2006 the ICU community has been eagerly awaiting the results of the VAP intervention from Keystone, but none have come out. When there is a void of information, rumors fill it, and plenty of rumors have circulated about the alleged failure of the VAP trial.

I do not want to belabor here what I have written before with regard to VAP and its prevention, what makes the latter so difficult, and how little evidence there really is that the IHI bundle actually does anything. You can find at least some of my thoughts on that here. But why am I bringing up the Keystone project again anyway? Well, it is because Pronovost's group has just published a new paper in BMJ, and this time their aim was even more ambitious: to show the impact of this state-wide QI intervention on hospital mortality and length of stay. This is a really reasonable question, mind you, since we could argue that, if the intervention reduces HAIs, it should also do something to those important downstream events that are driven by the particular HAI, namely mortality and LOS. But here are a couple of issues that I found of great interest.

First, as we have discussed before, whether VAP itself causes death in the ICU population (that is, patients die from VAP), or whether VAP tends to attack those who are sicker and therefore more likely to die anyway (patients die with VAP), remains unclear in our literature. There is some evidence that late VAP may be associated with an attributable increase in mortality, but not early VAP, and these data need to be confirmed. Why is this important? Because if VAP does not impart an increase in mortality, then trying to decrease mortality by reducing VAP is just tilting at windmills.

So, let's talk about the study and what it showed as reported in the BMJ paper. You will be pleased that I will not here go through the traditional list of potential threats to validity, but take the data at face value (well, almost). The authors took an interesting approach of comparing the performance of all eligible ICUs regardless of whether they actually chose to take part in the project. Of all the admissions examined in the intervention group, 88% came from Keystone participants. This is a really sound way to define the intervention cohort, and it actually biases the data away from showing an effect. So, kudos to the investigators. The comparator cohort came from ICUs in the hospitals surrounding Michigan, those that were not eligible for Keystone participation. One point about these institutions also requires clarification: I did not see in the paper whether the authors actually looked at the control hospitals' QI initiatives. Why is this important? Well, if many of the comparator hospitals had successful QI initiatives, then one could expect to see even less difference between the Keystone intervention and the control group. So, again, good on them that they biased the data against themselves.

This is the line of thinking that brings me to my second point. Reuters' Maggie Fox covered this paper in an article a couple of days ago, an article whose lede (thanks for the correction, @ivanoransky) floored me:
(Reuters) - A U.S. program to help make sure hospital staff maintain strict hygiene standards lowered death rates in intensive care units by 10 percent, U.S. researchers reported on Monday.
Mind you, I read the article before delving into the peer-reviewed paper, so my surprise came out of just knowing how supremely difficult it is to reduce ICU mortality by 10% with any intervention. In the ICU we celebrate when we see even a 2% absolute mortality reduction. So, it became obvious to me that something got lost in translation here. And indeed, it did. Here is how I read the data.

There are multiple places to look for the mortality data. One is found in this figure:

Now, look at the top panel and focus on the solid circles -- these depict the adjusted mortality in the Keystone intervention group. What do you see? I see mortality going from about 14% at baseline to about 13.5% in the implementation phase to about 13% at 13-22 months post implementation. I do not see a 10% reduction, but at best about a 1% mortality advantage. What is also of interest is that the adjusted mortality in the control group (solid squares) also went down, albeit not by as much. But at almost every point of measurement it was already lower than in the intervention group.
Then there is this table, where the adjusted odds ratios of death are given for the two groups at various time points:
And this is where things get interesting. If you look at the last line of the table, the adjusted odds ratios indeed look impressive, and, furthermore, the AOR for the intervention group is lower than that for the control group. And this is pleasing to any investigator. But what does it mean? Well, it means that the odds of death in the intervention group went down by roughly 24% (give or take the 95% confidence interval) and by 16% in the control group, each compared to itself at baseline. This is impressive, no?

Well, yes, it is. But not as impressive as it sounds. A relative reduction of 24% with a baseline mortality of 14% would seem to mean an absolute reduction in mortality of 14% x 24% = 3.4%. But, you notice, we did not actually observe even this magnitude of mortality reduction in the graph. What gives? There is an excellent explanation for this. It is a little known fact to the average reader (and only slightly less obscure to the average researcher and peer reviewer) that the odds ratio, while a fairly solid way to express risk when the absolute risk is small (say, under 10%), tends to overestimate the effect when the risk is higher than 10%. I know we have not yet covered the ins and outs of odds ratios, relative risks and the like in the "reviewing literature" series, but let me explain briefly. The difference between odds and risk is in the denominator. While the denominator for the latter is the entire cohort at risk for the event (here, all patients at risk of dying in the hospital), the denominator for the former is only that part of the cohort that did not experience the event. See the difference? By definition, the denominator for the odds calculation is smaller than that for the risk calculation, thus yielding a more impressive, yet inaccurate, apparent reduction in mortality.
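
To make the denominator point concrete, here is a small numerical sketch (Python) using round numbers in the vicinity of those in the paper: a 14% baseline mortality and a 24% reduction in the odds of death. The cohort size is arbitrary.

    # Hypothetical cohort of 1,000 patients with 14% baseline mortality.
    n = 1_000
    deaths = 140

    risk = deaths / n                      # denominator: everyone at risk -> 0.14
    odds = deaths / (n - deaths)           # denominator: only the survivors -> ~0.16

    # Apply a 24% reduction to the ODDS (as the adjusted odds ratios do):
    new_odds = odds * (1 - 0.24)
    new_risk = new_odds / (1 + new_odds)   # convert odds back to risk -> ~0.11

    print(f"baseline: risk {risk:.3f}, odds {odds:.3f}")
    print(f"after a 24% odds reduction: risk {new_risk:.3f}")
    print(f"absolute risk reduction {risk - new_risk:.1%}, "
          f"relative risk reduction {(risk - new_risk) / risk:.1%}")

Even done carefully, the odds-based figure overstates the corresponding reduction in risk, which is part of why the impressive-looking last line of the table and the modest shift in the figure do not tell quite the same story.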

Bottom line? Interesting results. It is not clear whether the actual intervention is what produced the 1% mortality reduction -- it could have been secular trends, regression to the mean or the Hawthorne effect, to name just a few alternatives. But regardless, preventing death is good. The question is whether these improvements in mortality were sustained after hospital discharge, or whether these patients were merely kept alive long enough to die elsewhere. Also, what is the value balance here in terms of the resources expended on the intervention versus results that may not even be due to the particular intervention in question?

All of this is to say that I am really not sure what the data are showing. What I am sure of is that I did not find any evidence of the 10% reduction in mortality reported by Reuters (I did e-mail Maggie Fox and am at this time still awaiting a reply; I will update if and when I get one). In this time of aggressive efforts to bend the healthcare expenditure curve, we need to pay attention to what we invest in and the return on this investment, even if the intervention is all "motherhood and apple pie."

Tuesday, February 1, 2011

The beautiful uncertainty of science

I am so tired of this all-or-nothing discussion about science! On the one hand there is a chorus singing praises to science and calling people who are skeptical of certain ideas unscientific idiots. On the other, with equal penchant for eminence-based thinking, are the masses convinced of conspiracies and nefarious motives of science and its perpetrators. And neither will stop and listen to the other side's objections, and neither will stop the name-calling. So, is it any wonder we are not getting any closer to the common ground? And if you are not a believer in the common ground, let me say that we are only getting farther away from the truth, if such a thing exists, by retreating further into our cognitive corners. These corners are comfortable places, with our comrades-in-arms sharing our, shall we say, passionate opinions. Yet this is not the way to get to a better understanding.

Because I spend so much time contemplating our larger understanding of science, the title "Are We Hard-Wired to Doubt Science" proved to be a really inflammatory way to suck me into thinking about everything I am interested in integrating: scientific method, science literacy and communication, and brain science. The author, on the heels of doing a story on the opposition to smart meters in California, was led to try to understand why we are so quick to reject science:
But some very intelligent people I interviewed had little use for the existing (if sparse) science. How, in a rational society, does one understand those who reject science, a common touchstone of what is real and verifiable?
The absence of scientific evidence doesn’t dissuade those who believe childhood vaccines are linked to autism, or those who believe their headaches, dizziness and other symptoms are caused by cellphones and smart meters. And the presence of large amounts of scientific evidence doesn’t convince those who reject the idea that human activities are disrupting the climate.
She goes on to think about the different ways of perceiving risk, and how our brains play tricks on us by perpetuating our many cognitive biases. In essence, new data are unable to sway our opinion because of rescue bias, or our drive to preserve what we think we know to be true and to reject what our intuition tells us is false. If we follow this argument to its logical conclusion, it means that we just need to throw our hands up in the air and accept the status quo, whatever it is.

I happen to think that the author missed an opportunity to educate her readers about why we need to come to a better understanding and how to get there. The public (and even some of my fellow scientists) needs to understand what science is and, even more importantly, what it is not.

First, science is not dogma. Karl Popper had a very simple litmus test for scientific thinking: He asked how you would go about disproving a particular idea. If you think that the idea is above being disproved, then you are engaging in dogma and not science. The essence of the scientific method is developing a hypothesis from either a systematically observed pattern or a theoretical model. The hypothesis is necessarily formulated as the null, making the assumption of no association the departure point for proving the contrary. So, to "prove" that the association is present, you need to rule out any other potential explanation for what may appear to be an association. For example, if thunder were always followed by rain, it might be easy to engage in the "post hoc ergo propter hoc" fallacy and conclude that thunder caused rain. But before this could become a scientific theory, you would have to show that there was no other explanation for this apparent association.

So, the second point is that science is driven by postulating and then trying to disprove null hypotheses. By definition, a null hypothesis can only be rejected if 1) the association exists, and 2) the constellation of phenomena is not explained by something else. And here is the third and critical point, the point that produces equal parts frustration and inspiration to learn more: that "something else" as the explanation of a certain association is by definition informed only by what we know today. It is this very quality of knowledge production, the constancy of the pursuit, that lends the only certain property to science: the property of uncertainty. And our brains have a hard time holding and living with this uncertainty.

The tension between uncertainty and the need to make public policy has taken on a political life of its own. What started out as a modest storm of subversion of science by politics in the tobacco debate has now escalated into a cyclone of everyday leveraging of scientific uncertainties for political and economic gain. After all, how can we balance the accounting between the theoretical models predicting climate doom in the future and the robust present-day economic gains produced by the very pollution that feeds those models? How can we even conceive that our food production system, yielding more abundant and cheaper food than ever before, is driving the epidemic of obesity and the catastrophe of antimicrobial resistance? And because we are talking about science, and because, as that populist philosopher Yogi Berra famously quipped, "Predictions are hard, especially about the future," the uncertainty of our estimates overshadows the probability of their correctness. Yet by the time the future becomes the present, we will be faced with the potentially insurmountable challenges of a new world.

I have heard some scientists express reluctance about "coming clean" to the public about just how uncertain our knowledge is. Nonsense! What we need under the circumstances is greater transparency, public literacy and engagement. Science is not something that happens only in the bastions of higher education or behind the thick walls of corporations. Science is all around and within us. And if you believe in God, you have to believe that God is a scientist, a tinkerer, always looking for a more elegant solution. The language of science may seem daunting and obfuscatory. Yet do not be afraid -- the patterns of a language are easy to decipher with some willingness and a dictionary. Our brains are attuned to the most beautiful explanations of the universe. Science is what provides them.

Self-determination is predicated upon knowledge and understanding. Abdicating our ability to understand the scientific method leaves us subject to political demagoguery. Don't be a puppet. We are all born scientists. Embrace your curiosity; tune out the noise of those at the margins who are not willing to engage in a sensible dialogue, and leave them to their schoolyard brawling. Likewise, leave behind the politicians, corporate interests and, alas, many a journalist, and start learning the basics of scientific philosophy and thought. Allow the uncertainty of knowledge to excite and delight you. You will not be disappointed.