Showing posts with label reviewing lit. Show all posts

Monday, January 31, 2011

Reviewing medical literature, part 5: Inter-group differences and hypothesis testing

Happy almost February to everyone. It is time to resume our series. First, I am grateful for the vigorous response to my survey about interest in a webinar to cover some of this stuff. Over the next few months one of my projects will be to develop and execute one. I will keep you posted. In the meantime, if anyone has thoughts or suggestions on the logistics, etc., please reach out to me.

OK, let's talk about group comparisons and hypothesis testing. The scientific method that we generally practice demands that we articulate a hypothesis prior to conducting the study that will test it. The hypothesis is generally advanced as the so-called "null hypothesis" (or H0), wherein we express our skepticism that there is a difference between groups or an association between the exposure and outcome. By starting out with this negative formulation, we set the stage for "disproving" the null hypothesis, or demonstrating that the data support the "alternative hypothesis" (HA, or the presence of the said association or difference). This is where all the measures of association that we have discussed previously come in, and most particularly the p value. The definition of the p value once again is "the probability that the found inter-group difference, or one that is greater than what was found, would have been found under the condition of no true difference." Following through on this reasoning, we can appreciate that the H0 can never be "proven." That is, the only thing that can be said statistically when no difference is found between groups is that we did not disprove the null hypothesis. This may be because there truly is no difference between the groups being compared (that is, the null hypothesis approximates reality) or because we did not find the difference that in fact exists. The latter is referred to as the type II error, and can be present for various reasons, the most common of which is a sample size that is too small to detect a statistically significant difference.
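
For readers who like to see the definition of the p value in action, here is a minimal sketch of a permutation test. It is only one of several ways to get a p value, and the blood pressure readings, the function name and the number of shuffles are all made up for illustration. The idea is to pretend the null hypothesis is true by shuffling the group labels many times, and then count how often a difference at least as large as the observed one turns up.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Estimate a two-sided p value for a difference in means by shuffling group labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # under H0 the labels are interchangeable
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed - 1e-12:  # small tolerance for floating-point noise
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

# Hypothetical systolic blood pressure readings (mmHg), invented for this example
treated = [118, 122, 125, 119, 121, 117, 124, 120]
control = [128, 131, 126, 133, 129, 127, 135, 130]

p = permutation_p_value(treated, control)
print(f"Estimated p value: {p:.4f}")
```

A small p value here says only that a difference this large would rarely arise if the labels did not matter. A large one would not prove the null hypothesis, for the reasons given above.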

This is a good place to digress and talk a little about the distinction between "absence of evidence" and "evidence of absence." The distinction, though ostensibly semantic, is quite important. While "evidence of absence" implies that studies to look for associations have been done, done well, published, and have consistently shown the lack of association between a given exposure and outcome or a difference between two groups, "absence of evidence" means that we have just not done a good job looking for this association or difference. Absence of evidence does not absolve the exposure from causing the outcome, yet so often it is confused with the definitive evidence of absence of an effect. Nowhere is this more apparent than in the history of the tobacco debate, which is the poster child for this obfuscation. And we continue to rely on this confusion in other environmental debates, such as chemical exposures and cell phone radiation. One of the most common reasons for finding no association when one exists, or the type II error, is, as I have already mentioned, a sample size that is too small to detect the difference. For this reason, in a published study that fails to show a difference between groups it is critical to make sure that the investigators performed a power calculation. This maneuver, usually found in the Methods section of the paper, lets us know whether the sample size was adequate to detect a difference if one exists, and therefore how worried we should be about a type II error. The trouble is that, as we know, there is a phenomenon called "publication bias." This refers to the scientific journals' reluctance to publish negative results. And while it may be appropriate to reject studies prone to type II error due to poor design (although even these studies may be useful in the setting of a meta-analysis, where pooling of data overcomes small sample sizes), true negative results must be made public. But this is a little off topic.
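
To make the power calculation a little less mysterious, here is a rough sketch of one common textbook version: the normal-approximation formula for comparing two proportions. The two-sided alpha of 0.05 and the 80% power are conventional defaults that I am assuming, not something dictated by any particular study, and the function name is my own.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect p_control vs p_treated.

    Uses the normal approximation for two independent proportions;
    alpha is two-sided. Both defaults are assumed conventions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    effect = p_control - p_treated
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# A large effect (10% vs 5%) needs a few hundred subjects per group...
n_large_effect = sample_size_per_group(0.10, 0.05)
# ...while a tiny effect (10% vs 9.95%) needs millions.
n_tiny_effect = sample_size_per_group(0.10, 0.0995)
print(n_large_effect, n_tiny_effect)
```

If a "negative" study enrolled far fewer subjects than a calculation like this suggests, a type II error is a very live possibility.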

I will ask you to indulge me in one other digression. I am sure that in addition to "statistical significance" (this is simplistically represented by the p value), you have heard of "clinical significance." This is an important distinction, since even a finding that is statistically significant may have no clinical significance whatsoever. Take for example a therapy that cuts the absolute risk of a non-fatal heart attack by 0.05 percentage points in a certain population. This means that in a population at a 10% risk for a heart attack in one year, the intervention will bring this risk on average to 9.95%. And though we can argue whether or not this is an important difference, at the population level, this does not seem all that clinically important. So, if I have the vested interest and the resources to run the massive trial that will give me this minute statistical significance, I can do that and then say without blushing that my treatment works. Yet, statistical significance always needs to be examined in the clinical context. This is why it is not enough to read the headlines that tout new treatments. The corollary to this is that the lack of statistical significance does not equate to the lack of clinical significance. Given what I just said above about type II error, if the difference appears significant clinically (e.g., reducing the incidence of fatal heart attacks from 10% to 5%), but does not reach statistical significance, the result should not be discarded as negative, but examined as to the probability of the type II error. This is also where Bayesian thinking must come into play, but I do not want to get into this now, as we have covered these issues in previous posts on this blog.
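
Here is a small sketch of both halves of that argument, using a plain two-proportion z-test. The trial sizes (10 million per arm and 100 per arm) are hypothetical numbers I picked to make the point, and the function name is mine.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p value from a pooled two-proportion z-test (normal approximation)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical massive trial: 10% vs 9.95% with 10 million subjects per arm
p_tiny_effect = two_proportion_p_value(1_000_000, 10_000_000, 995_000, 10_000_000)

# Hypothetical small trial: 10% vs 5% with 100 subjects per arm
p_big_effect = two_proportion_p_value(10, 100, 5, 100)

print(f"Tiny effect, huge trial:  p = {p_tiny_effect:.4f}")
print(f"Big effect, small trial:  p = {p_big_effect:.4f}")
```

Under these assumptions the clinically trivial difference clears the usual 0.05 bar, while the halving of risk does not. The p value alone cannot tell you which result matters.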

OK, back to hypothesis testing. There are several rules to be aware of when reading how the investigators tested their hypotheses, as different types of variables require different methods. A categorical variable (one characterized by categories, like gender, race, death, etc.) can be compared using the chi-square test if there is an abundance of events, or Fisher's exact test when values are scant. A normally distributed continuous variable (e.g., age is a continuum that is frequently distributed normally) can be tested using Student's t-test, while one that has a skewed distribution (e.g., hospital length of stay, costs) requires testing with the Mann-Whitney U test (also known as the Wilcoxon rank-sum test) when two groups are compared, or the Kruskal-Wallis test when there are three or more groups. Each of these "non-parametric" tests is appropriate in the setting of a skewed distribution. You do not need to know any more than this: the test for the hypothesis depends on the variable's distribution. And recognizing some of the situations and test names may be helpful to you in evaluating the validity of a study.
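
As an illustration of the categorical case, here is a minimal sketch of a chi-square test on a 2x2 table, done by hand so that nothing is hidden. The death counts are invented. The cutoff of 5 for the smallest expected count is a widely used rule of thumb for when to switch to Fisher's exact test; it is my addition, not a hard law.

```python
from math import sqrt
from statistics import NormalDist

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (chi-square statistic, p value, smallest expected cell count).
    Assumes no row or column total is zero.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    expected = [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]
    chi2 = sum(
        (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # With 1 degree of freedom, chi-square is the square of a standard normal
    p_value = 2 * (1 - NormalDist().cdf(sqrt(chi2)))
    min_expected = min(min(row) for row in expected)
    return chi2, p_value, min_expected

# Hypothetical trial: 30 deaths of 200 on control, 15 deaths of 200 on treatment
chi2, p_value, min_expected = chi_square_2x2(30, 170, 15, 185)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")

# Hypothetical tiny study: expected counts are too small for chi-square
_, _, small_min_expected = chi_square_2x2(3, 7, 1, 9)
if small_min_expected < 5:  # rule-of-thumb threshold, an assumption
    print("Scant values: look for Fisher's exact test instead")
```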

One final frequent computation you may encounter is survival analysis. This is often depicted as a Kaplan-Meier curve, and does not have to be limited to examining survival. This is a time-to-event analysis, regardless of what the event is. In studies of cancer therapies we frequently talk about median disease-free survival between groups, and this can be depicted by the K-M analysis. To test the difference between times to event, we employ the log-rank test.
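
To show what sits behind a Kaplan-Meier curve, here is a bare-bones sketch of the estimator. The follow-up times and event flags are invented, the function names are mine, and I have left out the log-rank test and confidence bands, which a real analysis would include.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with at least one event.

    times: follow-up time for each subject
    events: 1 if the event was observed at that time, 0 if the subject was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Gather everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving  # events and censored subjects both leave the risk set
    return curve

def median_survival(curve):
    """First time at which estimated survival drops to 50% or below, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Hypothetical follow-up in months; 0 marks a censored subject
times = [2, 3, 3, 5, 8, 8, 10, 12]
events = [1, 1, 0, 1, 1, 0, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"month {t}: {s:.3f}")
print("median time to event:", median_survival(curve))
```

Note how censored subjects still count in the denominator until the moment they drop out. That is the whole trick of the method.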

Well, this is a fairly complete primer for most common hypothesis testing situations. In the next post we will talk a little more about measures of association and their precision, types I and II errors, as well as measures of risk alteration.                                         

Wednesday, January 26, 2011

Webinar survey results

Last week I posted a survey link to gauge interest in and potential content for a webinar on how to review medical literature critically. I had a great response, and wanted to share the data with you.

The web page got 302 hits, resulting in 82 survey responses. This is a 27% rate of response, which certainly sets the results up to be biased and non-generalizable. But what the heck? I was looking to hear from people with some interest in this, not all-comers. So, here are the questions and the aggregated answers.

Q1: I am thinking about creating a webinar based on some of the posts I have done on how to review medical literature. Would this be of interest to you?
R1: 82 people responded, of whom 81 (99%) answered "yes".

Q2: Are you a healthcare professional/researcher, an e-patient, or just an innocent bystander?
R2: 82 responses, 60 (73%) healthcare professionals/researchers, 5 (6%) e-patients, 17 (21%) innocent bystanders

Q3: Why do you feel the need to understand how to review medical literature?
R3: This was a free text field, and I got 73 responses. Of these, many had to do with gaining a better understanding of the subject in order to help others (patients, clients, trainees) learn how to read and understand medical literature.

Q4: This question was only for those who responded "yes" to being a healthcare professional/researcher: Do you engage in journal peer review as a reviewer?
R4: Of 60 responses, only 7 (12%) were "yes".

Q5: Similar to Q4, this question was for only those who responded "yes" to Q4: Have you had formal training on how to be an effective peer reviewer?
R5: All 7 responded, of whom only 2 (29%) had formal training through a journal or a professional society. The remaining 5 (71%) gained pertinent knowledge through reading about it. None of the responders got any reviewing courses during their medical training. Although the sample size is small, the responses are revealing and go along with my experience.

Q6: This question was targeted to only those responders who identified themselves as e-patients: How technical do you want the webinar information to get?
R6: All 5 e-patients answered this question, of whom 2 were comfortable with some degree of technicality, while the remaining 3 were comfortable with a greater degree of it.

Q7: This question was for all responders who expressed interest in having a webinar: Would you want one session or multiple sessions?
R7: Of the 80 responders, 21 (26%) felt that 1 session would suffice, 40 (50%) would be amenable to up to 3 sessions, and 11 (14%) would do up to 5 sessions. The remaining 8 (10%) of the responders chose "other", where their replies ranged from "no clue" to "as many as you see fit" to "let's start an ongoing discussion."

Q8: This was for those who would prefer a single session: How long should the session be?
R8: Of the 20 responses, 10 (50%) indicated 1 hour, while the majority of the rest indicated 2 hours.

Q9: If you are a part of an institution, do you think this would be of interest to your institution?
R9: 70 people responded, with 39 (56%) saying "yes" and 31 (44%) saying "no".

Q10: This was for those responding "yes" to Q9: What type of an institution are you a part of?
R10: All 39 people responded, and there was a range of institutions from medical schools to hospitals to government organizations to academic libraries. What was interesting here was that none of the "yes" responses to Q9 came from anyone in Biopharma or a professional organization or a patient advocacy organization. This I found surprising.

Overall, I am very pleased with the response. I am grateful to Janice McCallum (@janicemccallum on Twitter) for spreading the word to a listserv of medical librarians. It certainly looks like there is enough interest in a webinar, and now I have to figure out how to execute one. If anyone has ideas, please let me know in comments here or via e-mail.

Thanks again to all who took the time to respond!                

Friday, January 21, 2011

A webinar survey

Hi, folks,

I am conducting a survey to see how much interest there may be in a webinar on reviewing medical literature. This should take no more than 10 minutes of your time and would be enormously helpful to me to a) gauge interest and b) create appropriate content.

Thank you so much for doing this!
To get to the survey, click on this url:

Friday, January 14, 2011

Reviewing medical literature, part 4: Statistical analyses -- measures of central tendency

Well, we have come to the part of the series you have all been waiting for: discussion of statistics. What, you are not as excited about it as I am? Statistics are not your favorite part of the study? I am frankly shocked! But seriously, I think this is the part that most people, both lay public and professionals, find off-putting. But fear not, for we will deconstruct it all in simple terms here. Or obfuscate further, one or the other.

So, let's begin with a motto of mine: If you have good results, you do not need fancy statistics. This goes along with the ideas in math and science that truth and computational beauty go hand in hand. So, if you see something very fancy that you have never heard of, be on guard for less than important results. This, of course, is just a rule of thumb, and, as such, will have exceptions.

The general questions I like to ask about statistics are 1) Are the analyses appropriate to the study question(s)? and 2) Are the analyses optimal for the study question(s)? The first thing to establish is the integrity and completeness of the data. If the authors enrolled 365 subjects but were only able to analyze 200 of them, this is suspicious. So, you should be able to discern how complete the dataset was, and how many analyzable cases there were. A simple litmus test is that if more than 15% of the enrolled cases did not have complete data for analysis or dropped out of the study for other reasons, the study becomes suspect for a selection bias. The greater the proportion of dropouts, the greater the suspicion.
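
The litmus test is simple arithmetic, sketched here with the 365 and 200 from the example above. The 15% threshold is my rule of thumb from this paragraph, not a universal standard, and the function name is just for illustration.

```python
def dropout_fraction(enrolled, analyzed):
    """Fraction of enrolled subjects who did not make it into the analysis."""
    return (enrolled - analyzed) / enrolled

SUSPICION_THRESHOLD = 0.15  # rule-of-thumb cutoff from the text

fraction = dropout_fraction(365, 200)
suspect = fraction > SUSPICION_THRESHOLD
print(f"{fraction:.0%} of enrolled subjects were not analyzed; suspect: {suspect}")
```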

Once you have established that the set is fairly complete, move on to the actual analyses. Here, first things first: the authors need to describe their study group(s); hence, descriptive statistics. Usually this includes so-called "baseline characteristics", consisting of demographics (age, gender, race), comorbidities (heart failure, lung disease, etc.), and some measure of the primary condition in question (e.g., pneumonia severity index [PSI] in a study of patients with pneumonia). Other relevant characteristics may be reported as well, and this is dependent on the study question. As you can imagine, categorical variables (once again, these are variables that have categories, like gender or death) are expressed as proportions or percentages, while continuous ones (those that are on a continuum, like age) are represented by their measures of central tendency.

It is important to understand the latter well. There are three major measures of central tendency: mean, median and mode. The mean is the sum of all individual values of a particular variable divided by the number of values. So, mean age among a group of 10 subjects would be calculated by adding all 10 individual ages and then dividing by 10. The median is the value that occurs in the middle of a distribution. So, if there are 25 subjects with ages ranging from 5 to 65, the median value is the one that occurs in subject number 13 when subjects are arranged in ascending or descending order by age. The mode, a measure used least frequently in clinical studies, signifies, somewhat paradoxically, the value in a distribution that occurs most frequently.
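
Here is a quick sketch of all three measures on a group of 10 subjects, using Python's standard `statistics` module. The ages are invented for the example.

```python
import statistics

# Hypothetical ages of 10 subjects, already sorted for readability
ages = [34, 41, 41, 47, 52, 55, 58, 60, 63, 79]

mean_age = statistics.mean(ages)      # sum of all ages divided by 10
median_age = statistics.median(ages)  # with an even count, the average of the two middle values
mode_age = statistics.mode(ages)      # the most frequently occurring value

print(f"mean = {mean_age}, median = {median_age}, mode = {mode_age}")
```

With an even number of subjects there is no single middle subject, so the median is taken as the average of the two middle values.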

So, let's focus on the mean and the median. The mean is a good representation of the central value in a normal distribution. Also referred to as a bell curve (yes, because of its shape), or a Gaussian distribution, in this type of a distribution there are roughly equal numbers of points to the left and to the right of the mean value. It looks like this:

[Figure: a normal (bell-shaped) distribution]

For a distribution like the one above it hardly matters which central value is reported, the mean or the median, as they are the same or very similar to one another. Alas, most descriptors of human physiology are not normally distributed, but are more likely to be skewed. Skewed means that there is a tail at one end of the curve or the other:

[Figure: skewed distributions in two panels]

For example, in my world of health economics, many values for such variables as length of stay and costs spread out to the right of the center, similar to the blue curve in the right panel of the above figure. In this type of a distribution the mean and the median values are not the same, and they tell you different things. While the median gives you an idea of where the majority of the distribution, tightly clustered at the end opposite the tail, is centered, the mean is pulled toward the tail by the relatively few extreme values. For a distribution similar to the one in the right panel, the mean will therefore overestimate the typical value.

To round out the discussion of central values, we need to say a few words about scatter around these values. Because they represent a population and not a single individual, measures of central tendency will have some variation around them that is specific to the population. For a mean value, this variation is usually represented by standard deviation (SD), though sometimes you will see a 95% confidence interval as the measure of the scatter. Variation around the median is usually expressed as the range of values falling into the central one-half of all the values in the distribution, discarding the 25% at each end, or the interquartile range (IQR 25, 75) around the median. These values represent the stability and precision of our estimates and are important to look for in studies.
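
A short sketch of these measures of scatter, again with the standard `statistics` module. The length-of-stay values are invented and deliberately right-skewed. Note that `statistics.quantiles` uses one of several accepted conventions for computing quartiles, so other software may give slightly different cut points.

```python
import statistics

# Hypothetical hospital lengths of stay in days, with a long right tail
length_of_stay = [2, 3, 3, 4, 4, 5, 5, 6, 7, 9, 14, 30]

mean_los = statistics.mean(length_of_stay)
sd_los = statistics.stdev(length_of_stay)  # sample standard deviation
median_los = statistics.median(length_of_stay)
q1, q2, q3 = statistics.quantiles(length_of_stay, n=4)  # quartile cut points

print(f"mean (SD): {mean_los:.1f} ({sd_los:.1f}) days")
print(f"median (IQR 25, 75): {median_los} ({q1}, {q3}) days")
```

In this made-up sample the two long stays drag the mean well above the median, which is why the median with its IQR is the more natural summary for data shaped like this.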

We'll end this discussion here for the moment. In the next post we will tackle inter-group differences and hypothesis testing.

Thursday, January 13, 2011

Reviewing medical literature part 3 continued: threats to validity

As promised, today we talk about confounding and interaction.

A confounder is a factor related to both the exposure and the outcome. Take for example the relationship between alcohol and head and neck cancer. While we know that heavy alcohol consumption is associated with a heightened risk of head and neck cancer, we also know that people who consume a lot of alcohol are also more likely to be smokers, and smoking in turn raises the risk of H&N CA. So, in this case smoking is a confounder of the relationship between alcohol consumption and the development of H&N CA. It is virtually impossible to get rid of all confounding completely in any study design, save for possibly in a well designed RCT, where randomization presumably assures equal distribution of all characteristics; and even there you need an element of luck. In observational studies our only hope to deal with confounding is through statistical manipulation we call "adjustment", as it is virtually impossible to chase it away any other way. And in the end we still sigh and admit to the possibility of residual confounding. Nevertheless, going through the exercise is still necessary in order to get closer to the true association of the main exposure and the outcome of interest.

There are multiple ways of dealing with the confounding conundrum. The techniques used are matching, stratification, regression modeling, propensity scoring and instrumental variables. By far the most commonly used method is regression modeling. This is a rather complex computation that requires much forethought (in other words, "Professional driver on a closed circuit; don't try this at home"). The frustrating part is that, just because the investigators did the regression, does not mean that they did it right. Yet word limits for journal articles often preclude authors from giving enough detail on what they did. At the very least they should tell you what kind of a regression they ran and how they chose the terms that went into it. Regression modeling relies on all kinds of assumptions about the data, and it is my personal belief, though I have no solid evidence to prove it, that these assumptions are not always met.

And here are the specific commonly encountered types of regressions and when each should be used:
1. Linear regression. This is a computation used for outcomes that are continuous variables (i.e., variables represented by a continuum of numbers, like age, for example). This technique's main assumption is that the exposure and outcome are related to each other in a linear fashion. The resulting beta coefficient is the slope of this relationship if it is graphed.
2. Logistic regression. This is done when the outcome variable is categorical (i.e., one of two or more categories, like gender, for example, or death). The result of a logistic regression is an adjusted odds ratio (OR). It is interpreted as an increase or a decrease in the odds of the outcome occurring due to the presence of the main exposure. Thus, an OR of 0.66 means that there is a 34% reduction in the odds (used interchangeably with risk, though this is not quite accurate) of the outcome due to the presence of the exposure. Conversely, an OR of 1.34 means the opposite, or a 34% increase in the odds of the outcome if the exposure is present.
3. Cox proportional hazards. This is a common type of a model developed for a time to event, also known as "survival analysis" (even if not done for survival per se as the outcome). The resulting value is a hazard ratio (HR). For example, if we are talking about a healthcare-associated infection's impact on the risk of remaining in the hospital longer, an HR of, say, 1.8 means that an HAI increases the risk of being in the hospital by 80% at any time during the hospitalization. To me this tends to be the most problematic technique in terms of assumptions, as it requires that the ratio of the hazards between the groups being compared stay constant throughout the time frame of the analysis, and how often does this hold true? For this reason the investigators should be explicit about whether or not they tested for the assumption of proportional hazards and whether this was met.
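
Since I admitted above that using odds and risk interchangeably is not quite accurate, here is a small sketch of how far apart they can drift. It converts an odds ratio into the corresponding risk ratio for a given baseline risk in the unexposed group. The baseline risks of 10% and 50% are hypothetical, and the function name is mine.

```python
def risk_ratio_from_odds_ratio(odds_ratio, baseline_risk):
    """Risk ratio implied by an odds ratio, given the risk in the unexposed group."""
    baseline_odds = baseline_risk / (1 - baseline_risk)
    exposed_odds = odds_ratio * baseline_odds
    exposed_risk = exposed_odds / (1 + exposed_odds)
    return exposed_risk / baseline_risk

# The same OR of 0.66 under two hypothetical baseline risks
rr_uncommon_outcome = risk_ratio_from_odds_ratio(0.66, 0.10)
rr_common_outcome = risk_ratio_from_odds_ratio(0.66, 0.50)

print(f"Baseline risk 10%: risk ratio = {rr_uncommon_outcome:.2f}")
print(f"Baseline risk 50%: risk ratio = {rr_common_outcome:.2f}")
```

The rarer the outcome, the closer the odds ratio sits to the risk ratio. When the outcome is common, reading a 34% reduction in odds as a 34% reduction in risk overstates the effect.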

Let's now touch upon the other techniques that help us to unravel confounding. Matching is just that: it is a process of matching subjects with the primary exposure to those without in a cohort study or subjects with the outcome to those without in a case-control study, based on certain characteristics, such as age, gender, comorbidities, disease severity, etc.; you get the picture. By its nature, matching reduces the amount of analyzable data, and thus reduces the power of the study. So, it is most efficiently applied in a case-control setting, where it actually improves the efficiency of enrollment.

Stratification is the next technique. The word "stratum" means "layer", and stratification refers to describing what happens to the layers of the population of interest with and without the confounding characteristic. In the above example of smoking confounding the alcohol and H&N CA relationship, stratifying the analyses by smoking (comparing the H&N CA rates among drinkers and non-drinkers in the smoking group separately from the non-smoking group) can divorce the impact of the main exposure from that of the confounder on the outcome. This method has some distinct intuitive appeal, though its cognitive effectiveness and efficiency dwindle the more strata we need to examine.
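
To see stratification at work, here is a minimal sketch with entirely made-up counts for the alcohol, smoking and H&N CA example. In this invented cohort, drinkers are much more likely to be smokers, so the crude comparison exaggerates the effect of alcohol.

```python
def odds_ratio(cases_exposed, noncases_exposed, cases_unexposed, noncases_unexposed):
    """Odds ratio from the four cells of a 2x2 table."""
    return (cases_exposed * noncases_unexposed) / (noncases_exposed * cases_unexposed)

# Hypothetical counts: (cases, non-cases) of H&N CA by drinking status in each stratum
smokers = {"drinkers": (60, 340), "non_drinkers": (10, 90)}
non_smokers = {"drinkers": (3, 97), "non_drinkers": (8, 392)}

def stratum_or(stratum):
    return odds_ratio(*stratum["drinkers"], *stratum["non_drinkers"])

# Crude analysis: ignore smoking and add the strata together
crude_drinkers = tuple(s + n for s, n in zip(smokers["drinkers"], non_smokers["drinkers"]))
crude_non_drinkers = tuple(s + n for s, n in zip(smokers["non_drinkers"], non_smokers["non_drinkers"]))
crude_or = odds_ratio(*crude_drinkers, *crude_non_drinkers)

or_smokers = stratum_or(smokers)
or_non_smokers = stratum_or(non_smokers)

print(f"Crude OR:              {crude_or:.2f}")
print(f"OR among smokers:      {or_smokers:.2f}")
print(f"OR among non-smokers:  {or_non_smokers:.2f}")
```

With these numbers the association within each layer is much weaker than the crude one, which is the signature of confounding by smoking.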

Propensity scoring is gaining popularity as an adjustment method in the medical literature. A propensity score is essentially a number, usually derived from a regression analysis, giving the propensity of each subject for a particular exposure. So, in terms of smoking, we can create a propensity score based on other common characteristics that predict smoking. Interestingly, some of these characteristics will be present also in people who are not smokers, yielding a similar propensity score in the absence of this exposure. Matching smokers to non-smokers based on the propensity score and examining their respective outcomes allows us to understand the independent impact of smoking on, say, the development of coronary artery disease. As in regression modeling, the devil is in the details. Some studies have indicated that most papers that employ propensity scoring as the adjustment method do not do this correctly. So, again, questions need to be asked and details of the technique elicited. There is just no shortcut to statistics.

Finally, a couple of words about instrumental variables. This method comes to us from econometrics. An instrumental variable is one that is related to the exposure but not the outcome. One of the most famous uses of this method was published by a fellow you may have heard of, Mark McClellan, where he looked at the proximity to a cardiac intervention center as the instrumental variable in the outcomes of acute coronary events. Essentially, he argued, the randomness of whether or not you are close to a center randomizes you to the type of treatment you get. Incidentally, in this study he showed that invasive interventions were responsible for a very small fraction of the long-term outcomes of heart attacks. I have not seen this method used that much in the literature I read or review, but am intrigued by its potential.

And now, to finish out this post, let's talk about interaction. "Interaction" is a term mostly used by statisticians to describe what epidemiologists call "effect modification" or "effect heterogeneity". It is just what the name implies: there may be certain secondary exposures that either potentiate or diminish the impact of the main exposure of interest on the outcome. Take the triad of smoking, asbestos and lung cancer. We know that the risk of lung cancer among smokers who are also exposed to asbestos is far higher than among those who have not been exposed to asbestos. Thus, asbestos modifies the effect of smoking on lung cancer. So, to analyze those smokers exposed to asbestos together with those who were not will result in an inaccurate measure of the association of smoking with lung cancer. More importantly, it will fail to recognize this very important potentiator of tobacco's carcinogenic activity. To deal with this, we need to be aware of the potentially interacting exposures, and either stratify our analyses based on the effect modifier or work the interaction term (usually constructed as a product of the two exposures, in our case smoking and asbestos) into the regression modeling. In my experience as a peer reviewer, interactions are rarely explored adequately. In fact, I am not even sure that some investigators understand the importance of recognizing this phenomenon. Yet, the entire idea of heterogeneous treatment effect (HTE) and our pathetic lack of understanding of its impact on our current bleak therapeutic landscape, is the result of this very lack of awareness. The future of medicine truly hinges on understanding interaction. Literally. Seriously. OK, at least in part.
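
Here is a sketch of what effect modification looks like in numbers. The risks are invented purely for illustration and are not real epidemiologic estimates. The sketch compares the effect of smoking within each asbestos stratum, and then shows how the product term for a regression would be built from two 0/1 exposure indicators.

```python
# Hypothetical lung cancer risks per 100,000 person-years, invented for illustration
risk = {
    ("nonsmoker", "no asbestos"): 10,
    ("smoker", "no asbestos"): 100,
    ("nonsmoker", "asbestos"): 50,
    ("smoker", "asbestos"): 900,
}

def smoking_effect(asbestos_status):
    """Return (risk ratio, risk difference) for smoking within one asbestos stratum."""
    exposed = risk[("smoker", asbestos_status)]
    unexposed = risk[("nonsmoker", asbestos_status)]
    return exposed / unexposed, exposed - unexposed

rr_without, rd_without = smoking_effect("no asbestos")
rr_with, rd_with = smoking_effect("asbestos")

print(f"No asbestos: RR = {rr_without:.0f}, excess risk = {rd_without} per 100,000")
print(f"Asbestos:    RR = {rr_with:.0f}, excess risk = {rd_with} per 100,000")

# Building the interaction (product) term from 0/1 indicators for four example subjects
subjects = [(1, 1), (1, 0), (0, 1), (0, 0)]  # (smoker, asbestos)
interaction_terms = [smoker * asbestos for smoker, asbestos in subjects]
print(interaction_terms)  # the term is 1 only when both exposures are present
```

Because the effect of smoking differs between the two strata, a single pooled estimate would describe neither group well.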

In the next installment(s) of the series we will start tackling study analyses. Thanks for sticking with me.        

Wednesday, January 12, 2011

Reviewing medical literature, part 3: Threats to validity

You have heard this a thousand times: no study is perfect. But what does this mean? In order to be explicit about why a certain study is not perfect, we need to be able to name the flaws. And let's face it: some studies are so flawed that there is no reason to bother with them, either as a reviewer or as an end-user of the information. But again, we need to identify these nails before we can hammer them into a study's coffin. It is the authors' responsibility to include a Limitations paragraph somewhere in the Discussion section, in which they lay out all of the threats to validity and offer educated guesses as to the importance of these threats and how they may be impacting the findings. I personally will not accept a paper that does not present a coherent Limitations paragraph. However, reviewers are not always, as, shall we say, hard assed about this as I am, and that is when the reader is on her own. Let us be clear: even if the Limitations paragraph is included, the authors do not always do a complete job (and this probably includes me, as I do not always think of all the possible limitations of my work). So, as in everything, caveat emptor! Let us start to become educated consumers.

There are four major threats to validity that fit into two broad categories. They are:
A. Internal validity
  1. Bias
  2. Confounding/interaction
  3. Mismeasurement or misclassification
B. External validity
  4. Generalizability
Internal validity refers to whether the study is examining what it purports to be examining, while external validity, synonymous with generalizability, gives us an idea about how broadly the results are applicable. Let us define and delve into each threat more deeply.

Bias is defined as "any systematic error in the design, conduct or analysis of a study that results in a mistaken estimate of an exposure's effect on the risk of disease" (the reference for this is Schlesselman JJ, as cited in Gordis L, Epidemiology, 3rd edition, page 238). I think of bias as something that artificially makes the exposure and the outcome either occur together or apart more frequently than they should. For example, the INTERPHONE study has been criticized for its biased design, in that it defined exposure as at least one cellular phone call every week. Now enrolling such light users can really result in such a small exposure as not to be able to detect any increase in adverse events. This is an example of a selection bias, by far the most common form that bias takes. Another example of a frequent bias is encountered in retrospective case-control studies where people are asked to recall distant exposures. Take for example middle-aged women with breast cancer who are asked to recall their diets when they were in college. Now, ask the same of similar women without breast cancer. What you are likely to get is the effect, absent in women without cancer, of seeking an explanation for the cancer that expresses itself in a bias in what women with cancer recall eating in their youth. This is known as recall bias. So, a bias in the design can make the association seem either stronger or weaker than it is in reality.

I want to skip over confounding and interaction at the moment, as these threats deserve a post of their own, which is forthcoming. Suffice it to say here that a confounder is a factor related to both the exposure and the outcome. An interaction is also referred to as effect modification or effect heterogeneity. This means that there may be population characteristics that alter the response to the exposure of interest. Confounders and effect modifiers are probably the trickiest concepts to grasp. So, stay tuned for a discussion of those.

For now, let us move on to measurement error and misclassification. Measurement error, resulting in misclassification, can happen at any step of the way: it can be in the primary exposure, a confounder, or the outcome of interest. I run into this problem all the time in my research. Since I rely on administrative coding for a lot of the data that I use, I am virtually certain that the codes routinely misclassify some of the exposures and confounders that I deal with. Take Clostridium difficile as an example. There is an ICD-9 code to identify it in administrative databases. However, we know from multiple studies that it is not all that sensitive or all that specific; it is merely good enough, particularly for making observations over time. But even for laboratory values there is a certain potential for measurement error, though we seem to think that lab results are sacred and immune to mistakes. And need I say more about other types of medical testing? Anyhow, the possibility of error and misclassification is ubiquitous. What needs to be determined by the investigator and the reader alike is the probability of that error. If the probability is high, one needs to understand whether it is a systematic error (for example, a coder always more likely than not to include C. diff as a diagnosis) or a random one (a coder is just as likely to include as not to include a C. diff diagnosis). And while a systematic error may result in either a stronger or a weaker association between the exposure and the outcome, a random, or non-differential, misclassification will virtually always reduce the strength of this association.
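
The last point, that random misclassification waters down an association, can be checked with a little arithmetic. In this sketch the true counts, the 80% sensitivity and the 90% specificity are all hypothetical, and the same error rates are applied to cases and controls alike, which is what makes the misclassification non-differential.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio of exposure, cases versus controls."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

def misclassify(exposed, unexposed, sensitivity, specificity):
    """Expected observed counts after imperfect classification of exposure."""
    observed_exposed = exposed * sensitivity + unexposed * (1 - specificity)
    observed_unexposed = exposed * (1 - sensitivity) + unexposed * specificity
    return observed_exposed, observed_unexposed

# Hypothetical true exposure counts
true_cases = (60, 40)     # (exposed, unexposed) among cases
true_controls = (30, 70)  # (exposed, unexposed) among controls

# Same hypothetical error rates in both groups: non-differential misclassification
SENSITIVITY, SPECIFICITY = 0.80, 0.90
observed_cases = misclassify(*true_cases, SENSITIVITY, SPECIFICITY)
observed_controls = misclassify(*true_controls, SENSITIVITY, SPECIFICITY)

true_or = odds_ratio(*true_cases, *true_controls)
observed_or = odds_ratio(*observed_cases, *observed_controls)

print(f"True OR:     {true_or:.2f}")
print(f"Observed OR: {observed_or:.2f}")
```

The observed odds ratio lands closer to 1 than the true one, so the real association looks weaker than it is.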

And finally, generalizability is a concept that helps the reader understand what population the results may be applicable to. In other words, will the data be applied strictly to the population represented in the study? If so, is it because there are biological reasons to think that the results would be different in a different population? And if so, is it simply the magnitude of the association that can be expected to be different or is it possible that even the direction could change? In other words, could something found to be beneficial in one population be either less beneficial or even more harmful in another? The last question is the reason that we perseverate on this idea of generalizability. Typically, a regulatory RCT is much less likely to give us adequate generalizability than a well designed cohort study, for example.

Well, these are the threats to validity in a nutshell. In the next post we will explore much more fully the concepts of confounding and interaction and how to deal with them either at the study design or study analysis stage.            

Monday, January 10, 2011

Reviewing medical literature, part 2b: Study design continued

To synthesize what we have addressed so far with regard to reading medical literature critically:
1. Always identify the question addressed by the study first. The question will inform the study design.
2. Two broad categories of studies are observational and interventional.
3. Some observational designs, such as cross-sectional and ecological, are adequate only for hypothesis generation and NOT for hypothesis testing.
4. Hypothesis testing does not require an interventional study, but can be done in an appropriately designed observational study.

In the last post, where we addressed at length both cross-sectional and ecologic studies, we introduced the following scheme to help us navigate study designs:
Let's now round out our discussion of the observational studies and move on to the interventional ones.

Case-control studies are done when the outcome of interest is rare. These are typically retrospective studies, taking advantage of already existing data. By virtue of this they are quite cost-effective. Cases are defined by the presence of a particular outcome (e.g., bronchiectasis), and controls have to come from a similar underlying population. The exposures (e.g., chronic lung infection) are identified backwards, if you will. In all honesty, case-control studies are very tricky to design well, analyze well and interpret well. Furthermore, it has been my experience that many authors frequently confuse case-control with cohort designs. I cannot tell you how many times as a peer-reviewer I have had to point out to the authors that they have erroneously pegged their study as a case-control when in reality it was a cohort study. And in the interest of full disclosure, once, many years ago, an editor pointed out a similar error to me in one of my papers. The hallmark of case-control is that the selection criteria are the end of the line, or the presence of a particular outcome, and all other data are collected backwards from this point.

Cohort studies, on the other hand, are characterized by defining exposure(s) and examining outcomes occurring after these exposures. Like case-control studies, retrospective cohort studies are opportunistic in that they look at already collected data (e.g., administrative records, electronic medical records, microbiology data). So, although retrospective here means that we are using data collected in the past, the direction of the events of interest is forward. This is why they are named cohort studies, to evoke a vision of Caesar's army advancing on its enemy.

Some well-known examples of prospective cohort studies are the Framingham Study and the Nurses' Health Study, among many others. These are bulky and enormously expensive undertakings, going on over decades, addressing myriad hypotheses. But the returns can be pretty impressive -- just look at how much we have learned about coronary disease, its risk factors and modifiers from the Framingham cohort!

Although these observational designs have been used to study therapeutic interventions and their consequences, the HRT story is a vivid illustration of the potential pitfalls of these designs to answer such questions. Case-control and cohort studies are better left for answering questions about such risks as occupational, behavioral and environmental exposures. Caution is to be exercised when testing hypotheses about the outcomes of treatment -- these hypotheses are best generated in observational studies, but tested in interventional trials.

Which brings us to interventional designs, the most commonly encountered of which is a randomized controlled trial (RCT). I do not want to belabor this, as RCT has garnered its (un)fair share of attention. Suffice it to say that matters of efficacy (does a particular intervention work statistically better than the placebo) are best addressed with an RCT. One of the distinct shortcomings of this design is its narrow focus on very controlled events, frequently accompanied by examining surrogate (e.g., blood pressure control) rather than meaningful clinical (e.g., death from stroke) outcomes. This feature makes the results quite dubious when translated to the real world. In fact, it is well appreciated that we are prone to see much less spectacular results in everyday practice. What happens in the real world is termed "effectiveness", and, though ideally also addressed via an RCT, is, pragmatically speaking, less amenable to this design. You may see mention of pragmatic clinical trials of effectiveness, but again they are pragmatic in name only, being impossibly labor- and resource-intensive.

Just a few words about before-and-after studies, as this is the design pervasive in quality literature. You may recall the Keystone project in Michigan, which put checklists and Peter Pronovost on the map. The most publicized portion of the project was aimed at eradication of central line-associated blood stream infections (CLABSI) (you will find a detailed description in this reference, Pronovost et al. N Engl J Med 2006;355:2725-32). The exposure was a comprehensive evidence-based intervention bundle geared ultimately at building a "culture of safety" in the ICU. The authors call this a cohort design, but the deliberate nature of the intervention arguably puts it into an interventional trial category. Regardless of what we call it, the "before" refers to measurement of CLABSI rates prior to the intervention, while the "after", of course, is following it. There are many issues with this type of design, ranging from confounding to the Hawthorne effect, and I hope to address these in later posts. For now, just be aware that this is a design that you will encounter a lot if you read quality and safety literature.

I will not say much about the cross-over design, as it is fairly self-explanatory and is relatively infrequently used. Suffice it to say that subjects can serve as their own controls in that they get to experience both the experimental treatment and the comparator in sequence. This design is also fraught with many methodologic issues, which we will be touching upon in future posts.

The broad category of "Other" in the above schema is basically a wastebasket for me to put designs that are not amenable to being categorized as observational or interventional. Cost effectiveness studies frequently fall into this category, as do decision and Markov models.

Let's stop here for now. In the next post we will start to address threats to study validity. I welcome your questions and comments -- they will help me to optimize this series' usefulness.                

Friday, January 7, 2011

Reviewing medical literature, part 2a: Study design

It is true that the study question should inform the study design. I am sure you are aware of the broadest categorization of study design -- observational vs. interventional. When I read a study, after identifying the research question I go through a simple 4-step exercise:
1. I look for what the authors say their study design is. This should be pretty easily accessible early in the Methods section of the paper, though that is not always the case. If it is available,
2. I mentally judge whether or not it is feasible to derive an answer to the posed question using the current study design. For example, I spend a lot of time thinking about issues of therapeutic effectiveness and cost-effectiveness, and a randomized controlled trial exploring efficacy of a therapy cannot adequately answer the effectiveness questions.
If the design of the study appears appropriate,
3. I structure my reading of the paper in such a way as to verify that the stated design is, in fact, the actual design. If it is, then I move on to evaluate other components of the paper. If it is not what the authors say,
4. I assign my own understanding to the actual design at hand and go through the same mental list as above with that understanding in mind.

Here is a scheme that I often use to categorize study designs:
As already mentioned, the first broad division is between observational studies and interventional trials. An anecdote from my course this past semester illustrates that this is not always a straightforward distinction to make. In my class we were looking at this sub-study of the Women's Health Initiative (WHI), that pesky undertaking that sank the post-menopausal hormone replacement enterprise. The data for the study were derived from the 3 randomized controlled trials (RCT) of HRT, diet and calcium and vitamin D, as well as from the observational component of the WHI. So, is it observational or interventional? The answer to this is confusing to the point of pulling the wool over even experienced clinicians' eyes, as became obvious in my class. To answer the question, we need to go back to definitions of "interventional" and "observational". To qualify as interventional, a study needs to have the intervention be a deliberate part of the study design. A common example of this type of study is the randomized controlled trial, the sine qua non of the drug evaluation and approval process. Here the drug is administered as a part of the study, not as a background of regular treatment. In contradistinction, an observational study is just that: an opportunistic observation of what is happening to a group of people under ordinary circumstances. Here no specific treatment is predetermined by the study design. Given that the above study looked at multivitamin supplementation as the main exposure, despite its utilization of the data from RCTs, the study was observational. So, the moral of this tale is to be vigilant and examine the design carefully and thoroughly.

We often hear that observational designs are well suited to hypothesis generation only. Well, this is both true and false. Some studies actually can test hypotheses, while others are relegated to generation only. For example, cross-sectional and ecological studies are well suited to generating hypotheses to be tested by another design. To take a recent controversy as an example, the debunked link between vaccinations and autism initially gained steam from the observation that as the vaccination rates were rising, so was the incidence of autism. The type of a study that shows two events changing at the group/population level either in the same or in the opposite direction is called "ecologic". Similar types of studies gave rise to the vitamin D and cancer association hypothesis, showing geographic variation in cancer rates based on the availability of sun exposure. But, as demonstrated well by the vaccine-autism debacle, running with the links from ecological studies is dangerous, as they are prone to a so-called "ecological fallacy". It occurs when, despite the finding in groups of a linked change of the two factors under investigation, there is absolutely no connection between them at the individual level. So, don't let anyone tell you that they tested an hypothesis in an ecological study!
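A small numerical sketch may help show how the ecological fallacy can arise. The three regions and all of their counts below are hypothetical, constructed so that regions with more exposed people also have more disease, even though within every region exposed and unexposed individuals carry exactly the same risk.

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)


# Hypothetical regions, invented for illustration. Each entry holds counts of
# (exposed with outcome, exposed total, unexposed with outcome, unexposed total).
regions = {
    "A": (10, 100, 90, 900),    # 10% exposed, 10% risk in both subgroups
    "B": (60, 300, 140, 700),   # 30% exposed, 20% risk in both subgroups
    "C": (150, 500, 150, 500),  # 50% exposed, 30% risk in both subgroups
}

exposure_prevalence = []
outcome_rate = []
risk_ratios = {}
for name, (exp_cases, exp_n, unexp_cases, unexp_n) in regions.items():
    total = exp_n + unexp_n
    exposure_prevalence.append(exp_n / total)                # what an ecologic study sees
    outcome_rate.append((exp_cases + unexp_cases) / total)   # what an ecologic study sees
    risk_ratios[name] = (exp_cases / exp_n) / (unexp_cases / unexp_n)  # individual level

group_r = pearson(exposure_prevalence, outcome_rate)
print(f"group-level correlation: {group_r:.2f}")
for name, rr in risk_ratios.items():
    print(f"region {name}: risk ratio, exposed vs. unexposed = {rr:.2f}")
```

At the group level the correlation between exposure prevalence and outcome rate is essentially perfect, yet the risk ratio inside each region is 1. In this toy setup something else about the regions drives the outcome rate, which is exactly the kind of explanation an ecologic study cannot rule out.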

Similarly, in cross-sectional studies an hypothesis cannot be tested, and, therefore, causation cannot be "proven". This is due to the fundamental property of "a snapshot in time" that defines a cross-sectional study. Since all events (with few minor exceptions) are measured at the same time, it is not possible to tell whether the exposure preceded the outcome, and so causation cannot be assigned to the exposure-outcome couplet. These studies can merely help us think of further questions to test.

So, to connect the design back to the question, if a study purports to "explore a link between exposure X and outcome Y", either an ecologic or a cross-sectional design is OK. On the other hand, if you see one of these designs used to "test the hypothesis that exposure X causes outcome Y", run the other way screaming.

We will stop here for now, and in the next post will continue our discussion of study designs. Not sure yet if we can finish it in one more post, or if it will require multiple postings. Start praying to the goddess of conciseness now!


Reviewing medical literature, part 1: The study question

Let's start at the beginning. Why do we do research and write papers? No, not just to get famous, tenured or funded. The fundamental task of science is to answer questions. The big questions of all time get broken down into infinitesimally small chunks that can be answered with experimental or observational scientific methods. These answers integrated together provide the model for life as we understand it.

Clearly, the question is the most important part of the equation, and this is why in my semester-long graduate epidemiology course on the evaluative sciences we spend fully the first four to five weeks talking about how to develop a valid and answerable question. The cornerstone of this validity is its importance. Hence, the first question that we pose is: Is the study question important?

This is a bit of a loaded question, though. Important to whom? How is "important" defined? This is somewhat subjective, yet needs to be scrutinized nevertheless. In the context of an individual patient, the question may become: Is the study question important to me? So, importance is dependent on perspective. Nevertheless, there are questions upon whose importance we can all agree. For example, the importance of the question of whether our current fast-food life style promotes obesity and diabetes is hard to dispute.

Regardless of how we feel about the importance of the question, we must first identify the said research question. At least some of the time you will be able to find it in the primary paper, buried in the last paragraph of the Introduction section. Most of the questions we ask relate to etiologic relationships ("etiology" is medicalese for "causation"). Now, you have heard many times that an observational study cannot answer a causal question. Yet, why do we bother with the time, energy and money needed to run observational studies? Without getting too much into the weeds, philosophers of science tell us that no single study design can give us unequivocal evidence of causality. We can merely come close to it. What does this mean in practical terms? It means that, although most observational studies are still interested in causality rather than a mere association, we have to be more circumspect in how we interpret the results from such studies than from interventional ones. But I am jumping ahead.

Once we have identified and established the importance of the question, we need to evaluate its quality. A question of high quality is 1) clear, 2) specific, and 3) answerable. The question that I posed above regarding fast food and obesity possesses none of these characteristics. It is too broad and open to interpretation. If I were really posing a question in this vein, I would choose a single well-defined exposure (consuming 3 cans of soda per day) influencing a single outcome (10% body weight gain) over a specific period of time (30 weeks). While this is a much narrower question than the one I proposed above, it is only by answering bundles of such narrow questions and putting the information together that we can arrive at the big picture.

A general principle that I like to teach to my students is the PICO or PECOT model (I did not come up with it, but am its avid user). In PICO, P=population, I=intervention or exposure, C=comparator, and O=outcome. The PECOT model is an adaptation of PICO for observations over time, resulting in P=population, E=exposure, C=comparator, O=outcome, T=time. These models can help not only to pose the question, but also to unravel the often mysterious and far from transparent intent of the investigators.
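One way to use these models while reading is as a checklist: fill in each element from the paper and see what is left blank. Below is a minimal sketch of that idea. The class and method names are hypothetical, and the example is the narrow soda question from above, with population and comparator deliberately left empty because that question did not specify them.

```python
from dataclasses import dataclass


@dataclass
class StudyQuestion:
    """A research question broken into PICO/PECOT elements."""
    population: str = ""
    exposure: str = ""    # the "I" (intervention) in PICO, the "E" in PECOT
    comparator: str = ""
    outcome: str = ""
    time: str = ""        # only PECOT has this element

    def missing_elements(self, model="PECOT"):
        """Return the names of required elements that are still blank."""
        required = ["population", "exposure", "comparator", "outcome"]
        if model == "PECOT":
            required.append("time")
        return [name for name in required if not getattr(self, name)]


# The narrow soda question posed earlier; P and C were never stated.
soda = StudyQuestion(
    exposure="consuming 3 cans of soda per day",
    outcome="10% body weight gain",
    time="30 weeks",
)

print("Still to be pinned down:", soda.missing_elements())
```

Run on the soda question, the checklist flags population and comparator, a reminder that even a narrowed-down question needs to say in whom and compared with what.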

Once you have identified the question and dealt with its importance, you are ready to move on to the next step: evaluating the study design as it relates to the question at hand. We will discuss this in the next post.

Series launch: Critical review of medical literature

Today I am launching a series of posts on how to read medical literature critically. The series should provide a solid foundation for this task and dovetail nicely with some of the denser methods themes that occur on this blog. Who should read the series? Everyone. Although the current model of dissemination of medical information relies on a layer of translators (journalists and clinicians), it is my belief that every educated patient must at the very least understand how these interpreters of medical knowledge (should) examine it to arrive at the information imparted to the public. At the same time, both journalists and clinicians may benefit from this refresher. Finally, my own pet project is to get to a better place with peer reviews -- you know how variable the quality of those can be from my previous posts. So, I particularly encourage new peer reviewers for clinical journals to read this series.

First, a conflict of interest statement. What comes first -- the chicken or the egg? What comes first -- expertise in something or a company hiring you to develop a product? Well, in my case I would like to think that it was the expertise that came first and that Pfizer asked me to develop this content based on what I know, not on the fact that they funded the effort. At any rate, this is my disclaimer: I developed this presentation about three years ago with (modest) funding from Pfizer, and they had it on a web site intended for physician access. Does this mere fact invalidate what I have to say? I don't think so, but you be the judge.

Roughly, the series will examine how to evaluate the following components of any study:
1. Study question
2. Study design
3. Study analyses
4. Study results
5. Study reporting
6. Study conclusions
I am not trying to give you a comprehensive course on how all of this is done, but merely to make the reader aware of what a critical review of a paper entails.

Look for the first installment of the series shortly.