What Is Correlation in Statistics?

Find Patterns Hiding in Data

A scatterplot of dinosaur bone lengths. C.K.Taylor

Updated on February 19, 2018

Sometimes numerical data comes in pairs. Perhaps a paleontologist measures the lengths of the femur (leg bone) and humerus (arm bone) in five fossils of the same dinosaur species. It might make sense to consider the arm lengths separately from the leg lengths, and calculate statistics such as the mean or the standard deviation of each. But what if the researcher is curious to know whether there is a relationship between these two measurements? Then it's not enough to look at the arms separately from the legs. Instead, the paleontologist should pair the lengths of the bones for each skeleton and use an area of statistics known as correlation.

What is correlation? In the example above, suppose that the researcher studied the data and reached the not very surprising result that dinosaur fossils with longer arms also had longer legs, and fossils with shorter arms had shorter legs. A scatterplot of the data showed that the data points were all clustered near a straight line. The researcher would then say that there is a strong straight-line relationship, or correlation, between the lengths of arm bones and leg bones of the fossils. Saying how strong the correlation is requires some more work.

Correlation and Scatterplots

Since each data point represents two numbers, a two-dimensional scatterplot is a great help in visualizing the data. Suppose we actually have our hands on the dinosaur data, and the five fossils have the following measurements:

Femur 50 cm, humerus 41 cm
Femur 57 cm, humerus 61 cm
Femur 61 cm, humerus 71 cm
Femur 66 cm, humerus 70 cm
Femur 75 cm, humerus 82 cm
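
To get a feel for these five pairs, we can sketch a rough scatterplot in plain text. The snippet below is only a minimal sketch using the Python standard library; the function name ascii_scatter and the grid size are our own choices for illustration, and in practice a plotting library or spreadsheet chart would normally be used instead. It assumes the x-values are not all equal and the y-values are not all equal.

```python
# Femur and humerus lengths (cm) for the five fossils listed above.
femur = [50, 57, 61, 66, 75]
humerus = [41, 61, 71, 70, 82]

def ascii_scatter(xs, ys, width=40, height=12):
    """Return a crude text scatterplot: x runs left to right, y bottom to top.

    Hypothetical helper for illustration only.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column/row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip rows so larger y sits higher
    return "\n".join("".join(line) for line in grid)

plot = ascii_scatter(femur, humerus)
print(plot)
```

With femur length running left to right and humerus length running bottom to top, the five stars climb from the lower left toward the upper right, which is the pattern we expect when longer legs go with longer arms.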

Correlation Coefficient

To objectively measure how closely the data follow a straight line, the correlation coefficient comes to the rescue. The correlation coefficient, typically denoted r, is a real number from -1 to 1, inclusive. Because r is computed from a formula, it measures the strength of a correlation without any subjectivity. There are several guidelines to keep in mind when interpreting the value of r.

If r = 0, then there is no straight-line relationship between the data. The points may look like a complete jumble, or they may follow a pattern that is not a line.
If r = -1 or r = 1, then all of the data points lie exactly on a straight line.
If r is a value other than these extremes, then the result is a less-than-perfect fit of a straight line. The closer r is to -1 or 1, the closer the points are to a line. In real-world data sets, this is the most common result.
If r is positive, then the line goes up with a positive slope. If r is negative, then the line goes down with a negative slope.
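
These guidelines can be checked on a few small data sets. The following is a minimal sketch, not a definitive implementation: pearson_r is a hypothetical helper written for this illustration, and the x-values and the three sets of y-values are made up to show each case.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers.

    Hypothetical helper for illustration; assumes neither list is constant.
    """
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]    # exactly on a line with positive slope
falling = [10 - 3 * x for x in xs]  # exactly on a line with negative slope
noisy = [3, 4, 8, 7, 12]            # made-up values that only roughly follow a line

for name, ys in [("rising", rising), ("falling", falling), ("noisy", noisy)]:
    print(f"{name}: r = {pearson_r(xs, ys):.3f}")
```

The two exact lines give r equal to 1 and -1 (up to floating-point rounding), while the noisy data give a positive value short of 1.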

The Calculation of the Correlation Coefficient

The formula for the correlation coefficient r is complicated. The ingredients of the formula are the means and standard deviations of both sets of numerical data, as well as the number of data points. For most practical applications r is tedious to compute by hand. If our data has been entered into a calculator or spreadsheet program with statistical commands, then there is usually a built-in function to calculate r.
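
One common way to write the formula uses exactly these ingredients: standardize each x-value and each y-value by subtracting its mean and dividing by its sample standard deviation, multiply the standardized values pair by pair, add the products, and divide by n - 1, where n is the number of data points. The sketch below applies this to the dinosaur measurements from earlier in the article. It assumes sample standard deviations (dividing by n - 1), and the variable names are ours.

```python
from statistics import mean, stdev

# Femur and humerus lengths (cm) for the five fossils.
femur = [50, 57, 61, 66, 75]
humerus = [41, 61, 71, 70, 82]

n = len(femur)
x_bar, y_bar = mean(femur), mean(humerus)
s_x, s_y = stdev(femur), stdev(humerus)  # sample standard deviations (n - 1)

# Standardize each measurement, multiply the paired values, sum, divide by n - 1.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(femur, humerus)) / (n - 1)

print(f"r = {r:.3f}")  # prints "r = 0.939"
```

A value of about 0.94 agrees with what the measurements suggest: the points lie close to a rising straight line, but not exactly on one.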

Limitations of Correlation

Although correlation is a powerful tool, there are some limitations in using it:

Correlation does not tell us everything about the data. Means and standard deviations continue to be important.
The data may be described by a curve more complicated than a straight line, but this will not show up in the calculation of r.
Outliers strongly influence the correlation coefficient. If we see any outliers in our data, we should be careful about what conclusions we draw from the value of r.
Correlation does not imply causation. Just because two sets of data are correlated, it doesn't mean that one is the cause of the other.
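
Two of these limitations, curved relationships and outliers, are easy to see numerically. The sketch below is illustrative only: pearson_r is a hypothetical helper, the parabola data are made up, and the sixth "fossil" (femur 90 cm, humerus 30 cm) is an invented outlier that is not part of the measurements in this article.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient (hypothetical helper for illustration)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# A perfect curved relationship: every point lies on y = x^2, yet r is 0.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)

# The five real fossils, then the same data plus one invented outlier.
femur = [50, 57, 61, 66, 75]
humerus = [41, 61, 71, 70, 82]
r_clean = pearson_r(femur, humerus)
r_outlier = pearson_r(femur + [90], humerus + [30])  # hypothetical sixth point

print(f"parabola:     r = {r_curve:.3f}")
print(f"five fossils: r = {r_clean:.3f}")
print(f"with outlier: r = {r_outlier:.3f}")
```

The parabola shows that r = 0 does not rule out a strong relationship that is not a straight line. The single invented outlier is enough to pull r from a strong positive value to below zero, which is why outliers call for caution.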