Within sets of data there are a variety of descriptive statistics. The mean, median and mode all give measures of the center of the data, but they calculate this in different ways:
- The mean is calculated by adding all of the data values together, then dividing by the total number of values.
- The median is calculated by listing the data values in ascending order, then finding the middle value in the list.
- The mode is calculated by counting how many times each value occurs. The value that occurs with the highest frequency is the mode.
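The three calculations above can be sketched in a few lines of Python using the standard library's statistics module. The sample values are hypothetical and chosen only for illustration.

```python
from statistics import mean, median, mode

# Hypothetical sample data, chosen only for illustration.
data = [2, 3, 3, 5, 7, 10]

# Mean: add all of the values, then divide by how many there are.
data_mean = mean(data)

# Median: sort the values and take the middle one. With an even
# number of values, the two middle values are averaged.
data_median = median(data)

# Mode: the value that occurs most often.
data_mode = mode(data)

print("mean:", data_mean)
print("median:", data_median)
print("mode:", data_mode)
```

For this sample the three measures of center all differ, which is typical of data that is not symmetric.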
Theoretical vs. Empirical
Before we go on, it is important to understand what we are talking about when we refer to an empirical relationship, and contrast this with theoretical studies. Some results in statistics and other fields of knowledge can be derived from some previous statements in a theoretical manner. We begin with what we know, and then use logic, mathematics and deductive reasoning and see where this leads us. The result is a direct consequence of other known facts.
Contrasting with the theoretical is the empirical way of acquiring knowledge. Rather than reasoning from already established principles, we can observe the world around us. From these observations we can then formulate an explanation of what we have seen. Much of science is done in this manner. Experiments give us empirical data. The goal then becomes to formulate an explanation that fits all of the data.
In statistics there is a relationship between the mean, median and mode that is empirically based. Observations of countless data sets have shown that most of the time the difference between the mean and the mode is approximately three times the difference between the mean and the median. This relationship in equation form is:
Mean – Mode = 3(Mean – Median).
To see the above relationship with real world data, let’s take a look at the U.S. state populations in 2010. In millions, the populations were: California - 36.4, Texas - 23.5, New York - 19.3, Florida - 18.1, Illinois - 12.8, Pennsylvania - 12.4, Ohio - 11.5, Michigan - 10.1, Georgia - 9.4, North Carolina - 8.9, New Jersey - 8.7, Virginia - 7.6, Massachusetts - 6.4, Washington - 6.4, Indiana - 6.3, Arizona - 6.2, Tennessee - 6.0, Missouri - 5.8, Maryland - 5.6, Wisconsin - 5.6, Minnesota - 5.2, Colorado - 4.8, Alabama - 4.6, South Carolina - 4.3, Louisiana - 4.3, Kentucky - 4.2, Oregon - 3.7, Oklahoma - 3.6, Connecticut - 3.5, Iowa - 3.0, Mississippi - 2.9, Arkansas - 2.8, Kansas - 2.8, Utah - 2.6, Nevada - 2.5, New Mexico - 2.0, West Virginia - 1.8, Nebraska - 1.8, Idaho - 1.5, Maine - 1.3, New Hampshire - 1.3, Hawaii - 1.3, Rhode Island - 1.1, Montana - 0.9, Delaware - 0.9, South Dakota - 0.8, Alaska - 0.7, North Dakota - 0.6, Vermont - 0.6, Wyoming - 0.5.
The mean population is 6.0 million. The median population is 4.25 million. The mode is 1.3 million. Now we will calculate the differences from the above:
- Mean – Mode = 6.0 million – 1.3 million = 4.7 million.
- 3(Mean – Median) = 3(6.0 million – 4.25 million) = 3(1.75 million) = 5.25 million.
These two numbers do not match exactly, but they are relatively close to one another.
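The comparison above can be reproduced with a short Python sketch. The list holds the fifty population figures quoted in the text (in millions), and the variable names are our own.

```python
from statistics import mean, median, mode

# State populations in millions, as listed in the text.
populations = [
    36.4, 23.5, 19.3, 18.1, 12.8, 12.4, 11.5, 10.1, 9.4, 8.9,
    8.7, 7.6, 6.4, 6.4, 6.3, 6.2, 6.0, 5.8, 5.6, 5.6,
    5.2, 4.8, 4.6, 4.3, 4.3, 4.2, 3.7, 3.6, 3.5, 3.0,
    2.9, 2.8, 2.8, 2.6, 2.5, 2.0, 1.8, 1.8, 1.5, 1.3,
    1.3, 1.3, 1.1, 0.9, 0.9, 0.8, 0.7, 0.6, 0.6, 0.5,
]

pop_mean = mean(populations)
pop_median = median(populations)
pop_mode = mode(populations)

# The two sides of the empirical relationship.
mean_minus_mode = pop_mean - pop_mode
three_times_mean_minus_median = 3 * (pop_mean - pop_median)

print(f"mean = {pop_mean:.2f}, median = {pop_median:.2f}, mode = {pop_mode:.2f}")
print(f"Mean - Mode       = {mean_minus_mode:.2f}")
print(f"3(Mean - Median)  = {three_times_mean_minus_median:.2f}")
```

Because the code keeps the unrounded mean (about 5.98 rather than 6.0), the two sides come out at roughly 4.68 and 5.18 instead of 4.7 and 5.25. Either way, the two sides are close but not equal.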
There are a couple of applications for the above formula. Suppose that we do not have a list of data values, but do know any two of the mean, median or mode. The above formula could be used to estimate the third unknown quantity.
For instance, if we know that we have a mean of 10 and a mode of 4, what is the median of our data set? Since Mean – Mode = 3(Mean – Median), we can say that 10 – 4 = 3(10 – Median). Dividing both sides by 3 gives 2 = 10 – Median, and so the estimated median of our data is 8.
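Solving Mean – Mode = 3(Mean – Median) for each quantity in turn gives three small estimating formulas. The sketch below is one way to write them in Python. The function names are hypothetical, and the results are only estimates, since the relationship itself is approximate.

```python
def estimate_median(mean, mode):
    """Estimate the median from Mean - Mode = 3(Mean - Median)."""
    return (2 * mean + mode) / 3


def estimate_mode(mean, median):
    """Estimate the mode from Mean - Mode = 3(Mean - Median)."""
    return 3 * median - 2 * mean


def estimate_mean(median, mode):
    """Estimate the mean from Mean - Mode = 3(Mean - Median)."""
    return (3 * median - mode) / 2


# The worked example from the text: mean of 10, mode of 4.
print(estimate_median(mean=10, mode=4))
```

Feeding the estimated median of 8 back into the other two functions recovers the mode of 4 and the mean of 10, which is a quick check that the three rearrangements agree with one another.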
Another application of the above formula is in calculating skewness. Since skewness can be measured by the difference between the mean and the mode, we could instead calculate 3(Mean – Median). To make this quantity dimensionless, we can divide it by the standard deviation, which gives an alternative to calculating the skewness from moments.
A Word of Caution
As seen above, this is not an exact relationship. Instead it is a good rule of thumb, similar to the range rule, which establishes an approximate connection between the standard deviation and the range. The mean, median and mode may not fit exactly into the above empirical relationship, but there’s a good chance that they will be reasonably close.