In the study of mathematics it is often important to look for and make connections, and statistics is no exception. One example encountered early in statistics involves the correlation coefficient, denoted by r, and the least squares equation for a line of best fit. Both are tied to describing our data with a linear model. In other words, both relate to whether the two variables that we are studying exhibit some kind of straight line relationship. Since both concepts involve straight lines, a natural question is, "How are the correlation coefficient and the least squares line related?"
It is important to remember the details pertaining to the correlation coefficient. This statistic is used when we have paired quantitative data. From a scatterplot of this paired data we can look for trends in the overall distribution of data. Some paired data exhibits a linear or straight line pattern. But in practice the data almost never falls exactly along a straight line.
Several people looking at the same scatterplot of paired data might disagree about how close it is to showing an overall linear trend. The scale that we use could also affect our perception of the data. For these reasons and more we need some kind of objective measure of how close our paired data is to being linear. The correlation coefficient provides this measure.
A few basic facts about r include:
- The value of r ranges from -1 to 1.
- Values of r close to 0 imply that there is little to no linear relationship between the two variables.
- Values of r close to 1 imply that there is a positive linear relationship between the two variables. This means that as x increases, y tends to increase as well.
- Values of r close to -1 imply that there is a negative linear relationship between the two variables. This means that as x increases, y tends to decrease.
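The facts in this list can be checked on small data sets. The following is a minimal sketch in Python, not a definitive implementation: the function name `correlation` and the three sample data sets are made up for illustration, and r is computed from sums of deviations about the means.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r for paired data (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Illustrative data only; these values are assumptions, not from a real study.
xs = [1, 2, 3, 4, 5]
r_up = correlation(xs, [2, 4, 5, 4, 5])    # y generally rises with x
r_down = correlation(xs, [5, 4, 5, 2, 1])  # y generally falls as x rises
r_flat = correlation(xs, [3, 1, 3, 1, 3])  # no overall linear trend

print(round(r_up, 3), round(r_down, 3), round(r_flat, 3))
```

For these data sets, r is positive for the rising data, negative for the falling data, and essentially 0 for the data with no linear trend, and every value lies between -1 and 1.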
Slope of the Least Squares Line
The last two items in the above list point us toward the slope of the least squares line of best fit. Recall that the slope of a line is a measurement of how many units it goes up or down for every unit we move to the right. Sometimes this is stated as the rise of the line divided by the run, or the change in y values divided by the change in x values.
In general, straight lines have slopes that are positive, negative or zero. If we were to examine our least squares regression lines and compare the corresponding values of r, we would notice that every time our data has a negative correlation coefficient, the slope of the regression line is negative. Similarly, every time we have a positive correlation coefficient, the slope of the regression line is positive.
It should be evident from this observation that there is definitely a connection between the sign of the correlation coefficient and the slope of the least squares line. It remains to explain why this is true.
Formula for the Slope
The reason for the connection between the value of r and the slope of the least squares line has to do with the formula that gives us the slope of this line. For paired data (x,y) we denote the standard deviation of the x data by sx and the standard deviation of the y data by sy. The formula for the slope a of the regression line is a = r(sy/sx).
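As a rough check of this formula, the sketch below computes the slope in two ways for a small, made-up data set: once as r(sy/sx), and once directly from the usual least squares expression (the sum of cross-products of deviations divided by the sum of squared x deviations). The function name `slope_two_ways` and the data are assumptions for illustration, and the standard deviations are taken to be sample standard deviations with n - 1 in the denominator.

```python
import math

def slope_two_ways(xs, ys):
    """Return (r * sy / sx, direct least squares slope) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # Sample standard deviations (assumed convention: divide by n - 1).
    s_x = math.sqrt(sxx / (n - 1))
    s_y = math.sqrt(syy / (n - 1))
    r = sxy / math.sqrt(sxx * syy)

    slope_from_r = r * s_y / s_x   # the formula a = r(sy/sx)
    slope_direct = sxy / sxx       # least squares slope computed directly
    return slope_from_r, slope_direct

# Illustrative data only.
slope_from_r, slope_direct = slope_two_ways([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(slope_from_r, slope_direct)
```

Up to floating-point rounding, the two values agree, which is what the formula predicts.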
The calculation of a standard deviation involves taking the positive square root of a nonnegative number. As a result, both standard deviations in the formula for the slope must be nonnegative. If we assume that there is some variation in our data, we can disregard the possibility that either of these standard deviations is zero, so the ratio sy/sx is positive. Therefore the sign of the correlation coefficient will be the same as the sign of the slope of the regression line.
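One way to see where the slope formula comes from, and why the signs agree, is sketched below in LaTeX. The symbols S_xx, S_yy and S_xy are notation introduced here for illustration: they stand for the sums of squared deviations of x, of y, and of cross-products of deviations about the means. The sketch assumes sample standard deviations with n - 1 in the denominator and the standard least squares slope a = S_xy / S_xx.

```latex
\begin{align*}
a &= \frac{S_{xy}}{S_{xx}}, \qquad
r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}, \qquad
s_x = \sqrt{\frac{S_{xx}}{n-1}}, \qquad
s_y = \sqrt{\frac{S_{yy}}{n-1}} \\
r\,\frac{s_y}{s_x}
  &= \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} \cdot \sqrt{\frac{S_{yy}}{S_{xx}}}
   = \frac{S_{xy}}{S_{xx}} = a \\
s_x > 0,\ s_y > 0 &\implies \operatorname{sgn}(a) = \operatorname{sgn}(r)
\end{align*}
```

In particular, the slope is zero exactly when r is zero.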