The calculation of a sample variance or standard deviation is typically stated as a fraction. The numerator of this fraction involves a sum of squared deviations from the mean. The formula for this total sum of squares is
$\sum (x_i - \bar{x})^2$.
Here the symbol $\bar{x}$ refers to the sample mean, and the symbol $\sum$ tells us to add the squared differences $(x_i - \bar{x})^2$ for all $i$.
While this formula is correct and will work for calculations, there is an equivalent shortcut formula that does not require us to first calculate the sample mean. This shortcut formula for the sum of squares is
$\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$.
Here the variable n refers to the number of data points in our sample.
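As a minimal sketch, the two formulas can be written as short Python functions. The function names and the sample 1, 2, 3, 4, 5 are assumptions chosen only for illustration; they do not come from the text above.

```python
def sum_of_squares_standard(data):
    """Standard formula: sum of squared deviations from the sample mean."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data)


def sum_of_squares_shortcut(data):
    """Shortcut formula: sum of the squares minus (square of the sum) / n."""
    n = len(data)
    return sum(x ** 2 for x in data) - sum(data) ** 2 / n


# Hypothetical sample, used only to compare the two functions.
sample = [1, 2, 3, 4, 5]
print(sum_of_squares_standard(sample))  # 10.0
print(sum_of_squares_shortcut(sample))  # 10.0
```

Only the first function computes the mean; the second works directly from the sum of the data and the sum of their squares.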
An Example – Standard Formula
To see how this shortcut formula works, we will consider an example that is calculated using both formulas. Suppose our sample is 2, 4, 6, 8. The sample mean is (2 + 4 + 6 + 8)/4 = 20/4 = 5. Now we calculate the difference of each data point with the mean 5.
- 2 – 5 = -3
- 4 – 5 = -1
- 6 – 5 = 1
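- 8 – 5 = 3
We now square each of these differences and add them together: $(-3)^2 + (-1)^2 + 1^2 + 3^2 = 9 + 1 + 1 + 9 = 20$.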
An Example – Shortcut Formula
Now we will use the same set of data, 2, 4, 6, 8, with the shortcut formula to determine the sum of squares. We first square each data point and add the results together: $2^2 + 4^2 + 6^2 + 8^2 = 4 + 16 + 36 + 64 = 120$.
The next step is to add together all of the data and square this sum: $(2 + 4 + 6 + 8)^2 = 20^2 = 400$. We divide this by the number of data points to obtain 400/4 = 100.
We now subtract this number from 120: 120 - 100 = 20. The sum of the squared deviations is therefore 20, exactly the number given by the standard formula.
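The two worked examples can be retraced step by step in Python. This is only a sketch of the arithmetic for the sample 2, 4, 6, 8; the variable names are chosen here for readability.

```python
data = [2, 4, 6, 8]
n = len(data)

# Standard formula: subtract the mean from each value, square, then add.
mean = sum(data) / n                                # 5.0
deviations = [x - mean for x in data]               # [-3.0, -1.0, 1.0, 3.0]
standard = sum(d ** 2 for d in deviations)

# Shortcut formula: the mean is never computed.
sum_of_squared_values = sum(x ** 2 for x in data)   # 120
square_of_sum = sum(data) ** 2                      # 400
shortcut = sum_of_squared_values - square_of_sum / n

print(standard, shortcut)  # 20.0 20.0
```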
How Does This Work?
Many people simply accept the formula at face value without knowing why it works. With a little algebra, we can see why this shortcut formula is equivalent to the standard way of calculating the sum of squared deviations.
Although a real-world data set may contain hundreds, if not thousands, of values, we will assume that there are only three data values: $x_1$, $x_2$, $x_3$. What we see here could be extended to a data set that has thousands of points.
We begin by noting that $x_1 + x_2 + x_3 = 3\bar{x}$. The expression $\sum (x_i - \bar{x})^2 = (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2$.
We now use the fact from basic algebra that $(a + b)^2 = a^2 + 2ab + b^2$. This means that $(x_1 - \bar{x})^2 = x_1^2 - 2x_1\bar{x} + \bar{x}^2$. We do this for the other two terms of our summation, and we have:
$x_1^2 - 2x_1\bar{x} + \bar{x}^2 + x_2^2 - 2x_2\bar{x} + \bar{x}^2 + x_3^2 - 2x_3\bar{x} + \bar{x}^2$.
We rearrange this and have:
$x_1^2 + x_2^2 + x_3^2 + 3\bar{x}^2 - 2\bar{x}(x_1 + x_2 + x_3)$.
By rewriting $(x_1 + x_2 + x_3) = 3\bar{x}$ the above becomes:
$x_1^2 + x_2^2 + x_3^2 - 3\bar{x}^2$.
Now since $3\bar{x}^2 = (x_1 + x_2 + x_3)^2/3$, our formula becomes:
$x_1^2 + x_2^2 + x_3^2 - (x_1 + x_2 + x_3)^2/3$.
This is the special case, with $n = 3$, of the general shortcut formula:
$\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$.
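The same steps carry over to a sample of any size $n$. The following is a sketch of the general argument, using only the fact that $\sum x_i = n\bar{x}$; the explicit summation limits are added here for clarity.

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} \left( x_i^2 - 2 x_i \bar{x} + \bar{x}^2 \right) \\
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}.
\end{align*}
```

The last line follows because $n\bar{x}^2 = n \left( \frac{1}{n} \sum x_i \right)^2 = \frac{1}{n} \left( \sum x_i \right)^2$.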
Is It Really a Shortcut?
It may not seem like this formula is truly a shortcut. After all, in the example above it seems that there are just as many calculations. Part of this has to do with the fact that we only looked at a small sample. As we increase the size of our sample, the savings become clearer: the shortcut formula does not require us to subtract the mean from each data point before squaring, which saves one subtraction for every value in the sample.
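One practical consequence of not needing the mean first is that the sum of squares can be accumulated in a single pass over the data. The sketch below assumes a hypothetical function name, `sum_of_squares_one_pass`, and keeps only three running totals.

```python
def sum_of_squares_one_pass(values):
    """Shortcut formula computed from running totals in a single pass."""
    n = 0
    total = 0.0             # running sum of the data
    total_of_squares = 0.0  # running sum of the squared data
    for x in values:
        n += 1
        total += x
        total_of_squares += x * x
    if n == 0:
        raise ValueError("need at least one data point")
    return total_of_squares - total ** 2 / n


# Works on any iterable, including one that can only be read once.
print(sum_of_squares_one_pass(iter([2, 4, 6, 8])))  # 20.0
```

The standard formula would need two passes here, one to find the mean and one to sum the squared deviations. In floating-point arithmetic the final subtraction of two large, nearly equal totals can lose precision, so this sketch is best suited to small or exactly representable inputs.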