With all data, we need to exercise caution. Where did it come from? How was it obtained? And what is it really saying? These are all good questions that we should ask when presented with data. Sometimes what the data seem to be saying is not really the case. Consider the very surprising case of Simpson’s paradox.
A paradox is something that on the surface seems contradictory. Paradoxes help to reveal underlying truth beneath the surface of what appears to be absurd. In particular, Simpson's paradox demonstrates the kinds of problems that result from combining data from several groups.
An Overview of the Paradox
Suppose we are observing several groups, and establish a relationship or correlation for each of these groups. Simpson’s paradox says that when we combine all of the groups together, and look at the data in aggregate form, the correlation that we noticed before may reverse itself. This is most often due to lurking variables that have not been considered, but sometimes it is due to the numerical values of the data.
To make a little more sense of Simpson's paradox, let's look at the following example. In a certain hospital there are two surgeons. Surgeon A operates on 100 patients, and 95 survive. Surgeon B operates on 80 patients, and 72 survive. We are considering having surgery performed in this hospital, and surviving the operation is important to us. We want to choose the better of the two surgeons.
We look at the data and use it to calculate what percentage of surgeon A's patients survived their operations and compare it to the survival rate of the patients of surgeon B.
- 95 patients out of 100 survived with surgeon A, so 95/100 = 95% of them survived.
- 72 patients out of 80 survived with surgeon B, so 72/80 = 90% of them survived.
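The two percentages above can be reproduced with a few lines of Python. This is only a minimal sketch: the names `overall`, `survival_rate`, and `overall_rates` are chosen here for illustration, and the counts are the ones given in the example.

```python
# (survivors, patients operated on) for each surgeon, from the example.
overall = {"A": (95, 100), "B": (72, 80)}


def survival_rate(survived, total):
    """Return the survival rate as a percentage."""
    return 100 * survived / total


# Survival percentage per surgeon, ignoring the type of surgery.
overall_rates = {
    surgeon: survival_rate(survived, total)
    for surgeon, (survived, total) in overall.items()
}

for surgeon, rate in overall_rates.items():
    print(f"Surgeon {surgeon}: {rate:.1f}% of patients survived")
```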
From this comparison, surgeon A appears to be the safer choice. But what if we did some further research into the data and found that the hospital had originally tracked two different types of surgery, but then lumped all of the data together to report on each of its surgeons? Not all surgeries are equal: some were considered high-risk emergency surgeries, while others were of a more routine nature and had been scheduled in advance.
Of the 100 patients that surgeon A treated, 50 were high risk, of which three died. The other 50 were considered routine, and of these two died. This means that for a routine surgery, a patient treated by surgeon A has a 48/50 = 96% survival rate.
Now we look more carefully at the data for surgeon B and find that of 80 patients, 40 were high risk, of which seven died. The other 40 were routine, and only one died. This means that a patient has a 39/40 = 97.5% survival rate for a routine surgery with surgeon B.
Now which surgeon seems better? If your surgery is to be a routine one, then surgeon B is actually the better surgeon. However, if we look at all surgeries performed by the surgeons, A is better. This is quite counterintuitive. In this case, the lurking variable of the type of surgery affects the combined data of the surgeons.
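To see how the type of surgery changes the picture, the breakdown can be tabulated in Python. This is a sketch under the figures given in the example; the names `records`, `survival_pct`, `by_type`, and `combined` are illustrative choices, not part of any library.

```python
# (deaths, patients) per surgeon and surgery type, as given in the example.
records = {
    ("A", "high risk"): (3, 50),
    ("A", "routine"): (2, 50),
    ("B", "high risk"): (7, 40),
    ("B", "routine"): (1, 40),
}


def survival_pct(deaths, patients):
    """Return the percentage of patients who survived."""
    return 100 * (patients - deaths) / patients


# Survival percentage within each (surgeon, surgery type) group.
by_type = {key: survival_pct(*counts) for key, counts in records.items()}

# Survival percentage per surgeon after pooling both surgery types.
combined = {}
for surgeon in ("A", "B"):
    deaths = sum(d for (s, _), (d, _) in records.items() if s == surgeon)
    patients = sum(n for (s, _), (_, n) in records.items() if s == surgeon)
    combined[surgeon] = survival_pct(deaths, patients)

for (surgeon, kind), pct in by_type.items():
    print(f"Surgeon {surgeon}, {kind}: {pct:.1f}%")
for surgeon, pct in combined.items():
    print(f"Surgeon {surgeon}, all surgeries: {pct:.1f}%")
```

With these figures, the routine group favors surgeon B (97.5% against 96%), while the pooled data favor surgeon A (95% against 90%). The same counts also give the high-risk rates, 47/50 = 94% for surgeon A and 33/40 = 82.5% for surgeon B.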
Simpson’s paradox is named after Edward Simpson, who first described this paradox in the 1951 paper "The Interpretation of Interaction in Contingency Tables." Pearson and Yule each observed a similar paradox half a century earlier than Simpson, so Simpson’s paradox is sometimes also referred to as the Simpson-Yule effect. There are many wide-ranging applications of the paradox in areas as diverse as sports statistics and unemployment data. Any time that data are aggregated, watch out for this paradox to show up.