We got a little into the weeds last week about significance testing and test characteristics. Because information is power, I realized that it may be prudent to back up a bit and do a very explicit primer on medical testing. I am hoping that this will provide some vocabulary for improved patient-clinician communication. But, alas, please do not be surprised if your practitioner looks at you as if you were an alien -- it is safe to say that most clinicians do not think in these terms in their everyday practices. So, educate them!

Let's dig a little deeper into some of the ideas we batted around last week, specifically those pertaining to testing. Let's start by explicitly establishing the purpose of a medical test: to detect disease when such disease is present. This fact alone should underscore the importance of your physician's ability to arrive at the most likely reasons for your symptoms. The exercise that every doc should go through as he/she is evaluating you is called "differential diagnosis". When I was a practicing MD, my strategy was to come up with the 3-5 most likely diagnoses and the 3-5 that would be most deadly if missed, and to explore them further with appropriate testing. Arranging these possible diagnoses as a hierarchy helps the clinician assign informal probabilities to each, a task that is central to Bayesian thinking. From this hierarchy then follows the tactical, sequential work-up, avoiding the frantic shotgun approach.

So, having established a hierarchy of diagnoses, we now engage in adjunctive testing. And here is where we really need to be aware not only of our degree of suspicion for each diagnosis, but also of the test characteristics as they are reported in the literature and the test characteristics as they exist in the local center where the testing takes place. Why do I differentiate between the literature and practice? We know very well that the mere fact of observation, not to mention the experimental cleanliness of trials, tends to exaggerate the benefits of an intervention. In other words, the real world is much messier than the laboratory of clinical research (which is itself messy enough). So, it is this compounded messiness that each clinician has to contend with when making testing decisions.

OK, so let us now deconstruct test characteristics even further. We have used the terms sensitivity, specificity, and positive and negative predictive values. We've even explored their meanings to an extent. But let's break them down a bit further. Epidemiologists find it helpful to construct 2-by-2 (or 2 x 2) tables to think through some of these constructs, so let's engage in that briefly. Below you see a typical 2 x 2 table.

                     Disease present     Disease absent
    Test positive    true positive       false positive
    Test negative    false negative      true negative

In it we traditionally situate disease information in columns and test information in rows. A good test picks up signal when the signal is there while adding minimal noise. The signal is the disease, while the noise is the imprecise nature of all tests. Even simple blood tests, whose "objective accuracy" we take for granted, are subject to these limitations.
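As a concrete sketch, the table can be held in code. The counts below are hypothetical, chosen purely for illustration; only the 80-out-of-100 detection rate echoes last week's mammography figure.

```python
# A minimal 2 x 2 table: disease status in columns, test result in rows.
# All counts are hypothetical, for illustration only.
table = {
    ("test+", "disease+"): 80,   # true positives
    ("test+", "disease-"): 95,   # false positives
    ("test-", "disease+"): 20,   # false negatives
    ("test-", "disease-"): 805,  # true negatives
}

# Print the table row by row, the way epidemiologists lay it out.
for row in ("test+", "test-"):
    print(row, [table[(row, col)] for col in ("disease+", "disease-")])
```

Note that each column sums to a group of people: the first column is everyone with the disease (here 100), the second everyone without it (here 900).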

It is easiest to think of sensitivity as how well the test picks up the corresponding disease. In the case of mammography from last week, this number is 80%. This means that among 100 women who actually harbor breast cancer, a mammogram will recognize 80. These are the "true positives", or disease actually present when the test is positive. What about the remaining 20? Well, those real cancers will be missed by this test, and we call them "false negatives". If you look at the 2 x 2 table, it should become obvious that the true positives and the false negatives add up to the total number of people with the disease. Are you shocked that our wonderful tests may miss so much disease? Well, stay tuned.
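The arithmetic takes only a couple of lines of Python, using the mammography numbers from the example:

```python
# Sensitivity = true positives / (true positives + false negatives).
# Mammography example: 80 cancers detected out of 100 women who have cancer.
true_positives = 80
false_negatives = 20

sensitivity = true_positives / (true_positives + false_negatives)
print(sensitivity)  # 0.8 -- the test finds 80% of real cancers and misses 20%
```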

The flip side of sensitivity is "specificity". Specificity refers to whether or not the test is identifying what we think it is identifying. The noise in this value comes from the test in effect hallucinating disease when the person does not have the disease. A test with high specificity will be negative in the overwhelming proportion of people without the disease, so the "true negative" cell of the table will contain almost the entire group of people without disease. Alas, for any test we develop we walk a tightrope between sensitivity and specificity. That is, depending on our priorities for testing, we have to give up some accuracy in either the sensitivity or the specificity. The more sensitive the test, the higher our confidence that we will not miss the disease when it is there. Unfortunately, what we gain in sensitivity we usually lose in specificity, thus creating higher odds for a host of false positive results. So, there really is no free lunch when it comes to testing. In fact, it is this very tension between sensitivity and specificity that is the crux of the mammography debate. Not as straightforward as we had thought, right? And this is not even getting into pre-test probabilities or positive and negative predictive values!
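The tension can be made concrete with a small simulation. Assume, purely hypothetically, that some continuous test value tends to run higher in diseased patients, and slide the cutoff that defines a "positive" result:

```python
import random

random.seed(0)

# Hypothetical continuous test values: diseased patients tend to score higher,
# but the two groups overlap -- that overlap is the "noise" of the test.
diseased = [random.gauss(6, 2) for _ in range(1000)]
healthy = [random.gauss(3, 2) for _ in range(1000)]

def sens_spec(threshold):
    # Call the test "positive" when the value exceeds the threshold.
    sens = sum(v > threshold for v in diseased) / len(diseased)
    spec = sum(v <= threshold for v in healthy) / len(healthy)
    return sens, spec

# A lower cutoff catches more disease (sensitivity up) but also flags more
# healthy people (specificity down) -- the no-free-lunch tradeoff.
for t in (2, 4.5, 7):
    s, sp = sens_spec(t)
    print(f"threshold={t}: sensitivity={s:.2f}, specificity={sp:.2f}")
```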

Well, let's get into these ideas now. I believe that the positive and negative predictive values of the test are fairly well understood at this point, no? Just to reiterate: the positive predictive value is the ratio of true positives to all positive test results (the latter being the sum of the true and false positives, or the sum across the top row of the 2 x 2 table). It tells us how confident we can be that a positive test result corresponds to disease being present. Similarly, the negative predictive value is the ratio of true negatives to all negative test results (the latter being the sum across the bottom row of the 2 x 2 table, or true and false negatives). It tells us how confident we can be that a negative test result really represents the absence of disease. The higher the positive and negative predictive values, the more useful the test becomes. However, when one is likely to be quite high but the other quite low, it is a pitfall of our irrationality to rush headfirst into the test in hopes of obtaining the answer with the high value (as in the case of the negative predictive value for mammography in women aged 40-50 years), since the opposite test result creates a potentially difficult conundrum. This is where the pre-test probability of disease comes in.
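A minimal sketch of the arithmetic, using hypothetical 2 x 2 counts chosen only for illustration, shows how the two values can diverge sharply for the same test:

```python
# PPV = TP / (TP + FP): confidence that a positive result means disease.
# NPV = TN / (TN + FN): confidence that a negative result means no disease.
# All counts are hypothetical, for illustration only.
tp, fp = 80, 95   # top row of the 2 x 2 table: all positive results
fn, tn = 20, 805  # bottom row: all negative results

ppv = tp / (tp + fp)
npv = tn / (tn + fn)
print(f"PPV = {ppv:.2f}")  # positive results are right less than half the time here
print(f"NPV = {npv:.2f}")  # negative results are almost always right
```

With these counts a negative result is very reassuring, but a positive one is closer to a coin flip, which is exactly the conundrum described above.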

Now, what is this pre-test probability and how do we calculate it? Ah, this is the pivotal question. The pre-test probability is estimated based on population epidemiology data. In other words, given the type of person you are (no, I do not mean nice or nasty or funny or droll) in terms of your demographics, heredity, chronic disease burden and current symptoms, which category of risk do you fit into based on these population studies of disease? This approach relies on filing you into a particular cubbyhole with other subjects whose characteristics are most similar to yours. Are you beginning to appreciate the complexity of this task? And the imprecision of it? Add to this barely functioning crystal ball the clinician's personal cognitive biases, and is it any wonder that we do not do better? And need I even overlay this with another bugaboo, that of the overwhelming amount of information in the face of the incredible shrinking appointment, to demonstrate to you just how NOT straightforward any of this medicine stuff is?
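To see why the pre-test probability matters so much, here is a minimal Bayes'-rule sketch. The 80% sensitivity echoes the mammography example; the 90% specificity and the pre-test probabilities are assumptions chosen only for illustration:

```python
def post_test_probability(pre_test, sensitivity, specificity):
    """Bayes' rule: probability of disease after a positive test result."""
    p_pos_given_disease = sensitivity        # true positive rate
    p_pos_given_healthy = 1 - specificity    # false positive rate
    numerator = pre_test * p_pos_given_disease
    denominator = numerator + (1 - pre_test) * p_pos_given_healthy
    return numerator / denominator

# The same positive result means very different things depending on which
# epidemiological "cubbyhole" the patient starts in.
for pre in (0.01, 0.10, 0.50):
    post = post_test_probability(pre, 0.8, 0.9)
    print(f"pre-test {pre:.0%} -> post-test {post:.0%}")
```

At a 1% pre-test probability, even a positive result leaves disease unlikely; at 50%, the same result makes it near-certain enough to act on.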

OK, get your fingers out of that Prozac bottle -- it is not all bad! Yes, these are significant barriers to good healthcare. But guess what? The mere fact that you now know these challenges and can call them by their appropriate names gives you more power to be your own steward of your healthcare. Next time a doc appears certain and recommends some sexy new test, you will know that you cannot just say OK and await further results. Your healthcare is a chess match: you and your healthcare provider need to plan 10 moves ahead and play out many different contingencies.

On our end, researchers, policy makers and software developers all need to do a better job of developing more useful and individualizable information, integrating this information into user-friendly systems, and encouraging thoughtful healthcare encounters. I am convinced that patient empowerment with this information, followed by teaming up with providers in advocacy in this vein, is the only thing that can assure a course correction for our mammoth, unruly, dangerous and irrational healthcare system.

These posts are extremely interesting. First, I am reminded of a compulsive radiologist who, a number of years ago, insisted on follow-up mammograms for every bump... and would have had me biopsied if we weren't moving to Texas a few days after his last go-round. In Texas (San Antonio) I immediately went for follow-up. There, the radiologist looked at all the mammograms, saw nothing but fibrous cysts, ordered a sonogram (she said, to reassure me), and dismissed me. This was maybe fourteen years ago.

Now I am seeing a Mexican doctor who seems to have an uncanny ability to order fairly basic tests (bloodwork, urinalysis) on an individual basis that lead him down a useful trail. After reading your columns, I am now learning a lot more Spanish, since my medical Spanish is limited and I have to look up words in order to ask him about what led him to do what he did.

Anyway, mil gracias for your blog!

One problem with figuring out how to apply medical data is that the data give conditional probabilities, and if any of the underlying information changes, the conditional probabilities change.

The change in probability is often counter-intuitive, and often seemingly irrelevant information changes the probabilities.

For example: If Mary and John have two children, then the probability that both children are girls is 0.25. If I learn that one child is a girl, then the probability that both are girls is 0.33. If I learn that the older child is a girl, the probability that both are girls is 0.50. Notice that as my prior information changes, the probability of the outcome changes.

Now I learn that one of the children is a girl with curly hair. What is the probability that both are girls? There is no exact answer to that question. The probability will be in the range 0.5 to 0.33: the rarer curly hair is among girls, the more nearly the description singles out one particular child, pushing the answer toward 0.5. If I learn that Mary and John are natives of Seoul, then the probability that both their children are girls will be close to 0.5. If Mary and John are natives of Nairobi, then the probability that both their children are girls will be close to 0.33.

This just doesn't make sense, but if you work out the sample spaces this is what you come up with. And here we are dealing with a very simple situation compared with medical testing and analysis.
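The commenter's numbers check out if you enumerate the sample spaces. Here is a quick Python sketch; the closed-form (2 - p)/(4 - p), where p is assumed to be the fraction of girls with the trait, is my own derivation from the same sample space:

```python
from itertools import product

# All equally likely two-child families, ordered by age: (older, younger).
families = list(product("GB", repeat=2))  # GG, GB, BG, BB

both_girls = sum(f == ("G", "G") for f in families)
at_least_one_girl = sum("G" in f for f in families)
older_is_girl = sum(f[0] == "G" for f in families)

print(both_girls / len(families))       # 0.25: no extra information
print(both_girls / at_least_one_girl)   # 0.333...: "one child is a girl"
print(both_girls / older_is_girl)       # 0.5: "the older child is a girl"

# "One child is a girl with curly hair": if a fraction p of girls have the
# trait, P(both girls | at least one curly-haired girl) = (2 - p) / (4 - p),
# sliding from 0.5 (rare trait, p -> 0) to 1/3 (universal trait, p -> 1).
def p_both_girls_given_trait(p):
    gg = 1 - (1 - p) ** 2  # P(at least one curly-haired girl | two girls)
    one_girl = p           # P(curly-haired girl | exactly one girl)
    return gg / (gg + 2 * one_girl)

print(p_both_girls_given_trait(0.01))  # close to 0.5
print(p_both_girls_given_trait(0.99))  # close to 1/3
```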

Another problem is that as I get more precise in assigning my patient to a category, the fewer patients there are in that category, and the greater the range of error in my estimates. It's like Heisenberg's uncertainty principle applied to statistics.

For a popular, though not "dumbed-down", read see The Drunkard's Walk by Leonard Mlodinow.

Dag frickin nab, I wrote a nice juicy comment and tried to post via my WordPress profile; it told me to go sign in, which I was already, and when I came back here the comment was gone.

Twice. I'd be *astounded* if the WordPress login thingie is broken in Blogger's competing platform (not).

Oh well. Long story short, it said I just spent a week at a seminar on informed medical decision making, and yeah, these tests and their interpretation are important, and not trivial to understand!

On a different note, my first up-close-and-personal encounter with test craziness was about 15 years ago. I was self-pay at the time. I had something that the local clinic figured was a bladder infection. He sent out a specimen. I came back later for the results, which said no infection. The doc said "That's crazy" and set them aside. I said "The lab screwed up?" and he said "Looks like it." I said "So I won't have to pay for it, right?"

And he looked at me like *I* was crazy.

In what other field do we routinely pay for stuff that's obviously wrong?

Seems to me we've gone collectively crazy, not even EXPECTING things to be reliable, because it's all beyond our comprehension, but also thinking SOME genius out there must be managing things, despite evidence to the contrary. : )

(And that's not to mention how we interpret results, even if we presume they ARE executed perfectly.)

So if I wanted to learn the sensitivity and specificity of a routine test, say A1C, for example, where would I look?

Please don't tell me to ask my doctor. By the way, what did your post score relative to its health literacy? Just kidding... great post.

Steve Wilkins

www.healthecommunications.wordpress.com

Thanks, Dave, for your anecdote -- I am not even sure we know how often we get gobbledegooky results and still charge for them! More worrisome is the situation when we get an error, yet act on it with further testing instead of repeating the current test.

Steve, you ask a great question. I am not sure there is a central repository of this information that all of us can access. Also, I did not get into this, but when a variable is continuous (that is, it takes a range of values), like the HbA1C, the PPV and NPV may not be the best calculations. This is where a likelihood ratio or the area under the receiver operating characteristic (ROC) curve may be of greater service. Regardless of the statistical test, I will look into where, other than the weeds of the medical literature, you might be able to access these values. AHRQ comes to mind, though I have not looked there.

If any of my readers know of a place like that, please, educate us.
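For readers curious about the likelihood ratios mentioned in my reply to Steve, here is a minimal sketch. The 80% sensitivity echoes the mammography example; the 90% specificity and the 10% pre-test probability are assumptions chosen only for illustration:

```python
# Likelihood ratios summarize a test independently of disease prevalence.
# LR+ = sensitivity / (1 - specificity): how much a positive result raises the odds.
# LR- = (1 - sensitivity) / specificity: how much a negative result lowers them.
sensitivity, specificity = 0.8, 0.9  # hypothetical test characteristics

lr_pos = sensitivity / (1 - specificity)
lr_neg = (1 - sensitivity) / specificity

# Applying LR+ to a pre-test probability goes through odds:
pre_test = 0.10
pre_odds = pre_test / (1 - pre_test)
post_odds = pre_odds * lr_pos
post_test = post_odds / (1 + post_odds)

print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
print(f"post-test probability after a positive result: {post_test:.0%}")
```

For a continuous test like HbA1C, one would compute such ratios at each candidate cutoff, which is exactly what tracing out an ROC curve does.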