This is “The Least Squares Regression Line”, section 10.4 from the book Beginning Statistics (v. 1.0).
Once the scatter diagram of the data has been drawn and the model assumptions described in the previous sections at least visually verified (and perhaps the correlation coefficient r computed to quantitatively verify the linear trend), the next step in the analysis is to find the straight line that best fits the data. We will explain how to measure how well a straight line fits a collection of points by examining how well the line $y=\frac{1}{2}x-1$ fits the data set
$$\begin{array}{cccccc}x& 2& 2& 6& 8& 10\\ y& 0& 1& 2& 3& 3\end{array}$$
(which will be used as a running example for the next three sections). We will write the equation of this line as $\widehat{y}=\frac{1}{2}x-1$ with an accent on the y to indicate that the y-values computed using this equation are not from the data. We will do this with all lines approximating data sets. The line $\widehat{y}=\frac{1}{2}x-1$ was selected as one that seems to fit the data reasonably well.
The idea for measuring the goodness of fit of a straight line to data is illustrated in Figure 10.6 "Plot of the Five-Point Data and the Line $\widehat{y}=\frac{1}{2}x-1$", in which the graph of the line $\widehat{y}=\frac{1}{2}x-1$ has been superimposed on the scatter plot for the sample data set.
Figure 10.6 Plot of the Five-Point Data and the Line $\widehat{y}=\frac{1}{2}x-1$
To each point in the data set there is associated an “error,” the positive or negative vertical distance from the point to the line: positive if the point is above the line and negative if it is below the line. The error can be computed as the actual y-value of the point minus the y-value $\widehat{y}$ that is “predicted” by inserting the x-value of the data point into the formula for the line:
$$\text{error at data point }(x,y)=(\text{true }y)-(\text{predicted }y)=y-\widehat{y}$$
The computation of the error for each of the five points in the data set is shown in Table 10.1 "The Errors in Fitting Data with a Straight Line".
Table 10.1 The Errors in Fitting Data with a Straight Line
| | x | y | $\widehat{y}=\frac{1}{2}x-1$ | $y-\widehat{y}$ | ${\left(y-\widehat{y}\right)}^{2}$ |
|---|---|---|---|---|---|
| | 2 | 0 | 0 | 0 | 0 |
| | 2 | 1 | 0 | 1 | 1 |
| | 6 | 2 | 2 | 0 | 0 |
| | 8 | 3 | 3 | 0 | 0 |
| | 10 | 3 | 4 | −1 | 1 |
| Σ | - | - | - | 0 | 2 |
A first thought for a measure of the goodness of fit of the line to the data would be simply to add the errors at every point, but the example shows that this cannot work well in general. The line does not fit the data perfectly (no line can), yet because of cancellation of positive and negative errors the sum of the errors (the fourth column of numbers) is zero. Instead, goodness of fit is measured by the sum of the squares of the errors. Squaring eliminates the minus signs, so no cancellation can occur. For the data and line in Figure 10.6 "Plot of the Five-Point Data and the Line $\widehat{y}=\frac{1}{2}x-1$" the sum of the squared errors (the last column of numbers) is 2. This number measures the goodness of fit of the line to the data.
The goodness of fit of a line $\widehat{y}=mx+b$ to a set of n pairs $\left(x,y\right)$ of numbers in a sample is the sum of the squared errors
$$\mathrm{\Sigma}{\left(y-\widehat{y}\right)}^{2}$$(n terms in the sum, one for each data pair).
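The definition can be tried out numerically. The following is a minimal Python sketch, not part of the original text; the function name `sum_squared_errors` is an illustrative choice. It recomputes the two column totals of Table 10.1 for the line $\widehat{y}=\frac{1}{2}x-1$.

```python
# Five-point data set from the running example.
xs = [2, 2, 6, 8, 10]
ys = [0, 1, 2, 3, 3]

def sum_squared_errors(xs, ys, slope, intercept):
    """Return the sum of (y - y_hat)^2 for the line y_hat = slope*x + intercept."""
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = slope * x + intercept  # value predicted by the line
        total += (y - y_hat) ** 2      # squared error at this data point
    return total

# Signed errors y - y_hat for the trial line y_hat = (1/2)x - 1.
errors = [y - (0.5 * x - 1) for x, y in zip(xs, ys)]

print(sum(errors))                          # signed errors cancel
print(sum_squared_errors(xs, ys, 0.5, -1))  # squared errors do not cancel
```

The first printed number is the total of the fourth column of the table and the second is the total of the last column.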
Given any collection of pairs of numbers (except when all the x-values are the same) and the corresponding scatter diagram, there always exists exactly one straight line that fits the data better than any other, in the sense of minimizing the sum of the squared errors. It is called the least squares regression line. Moreover there are formulas for its slope and y-intercept.
Given a collection of pairs $\left(x,y\right)$ of numbers (in which not all the x-values are the same), there is a line $\widehat{y}={\widehat{\mathit{\beta}}}_{1}x+{\widehat{\mathit{\beta}}}_{0}$ that best fits the data in the sense of minimizing the sum of the squared errors. It is called the least squares regression line. Its slope ${\widehat{\mathit{\beta}}}_{1}$ and y-intercept ${\widehat{\mathit{\beta}}}_{0}$ are computed using the formulas
$${\widehat{\mathit{\beta}}}_{1}=\frac{S{S}_{xy}}{S{S}_{xx}}\quad\text{and}\quad{\widehat{\mathit{\beta}}}_{0}=\stackrel{-}{y}-{\widehat{\mathit{\beta}}}_{1}\stackrel{-}{x}$$
where
$$S{S}_{xx}=\mathrm{\Sigma}{x}^{2}-\frac{1}{n}{\left(\mathrm{\Sigma}x\right)}^{2},\quad S{S}_{xy}=\mathrm{\Sigma}xy-\frac{1}{n}\left(\mathrm{\Sigma}x\right)\left(\mathrm{\Sigma}y\right)$$
$\stackrel{-}{x}$ is the mean of all the x-values, $\stackrel{-}{y}$ is the mean of all the y-values, and n is the number of pairs in the data set.
The equation $\widehat{y}={\widehat{\mathit{\beta}}}_{1}x+{\widehat{\mathit{\beta}}}_{0}$ specifying the least squares regression line is called the least squares regression equation.
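The formulas in the definition translate directly into a few lines of code. The sketch below is an illustration only: the function name `least_squares_line` and the three-point data set are made up for this purpose and do not come from the text.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line.

    Uses SS_xx = sum(x^2) - (sum x)^2 / n and
         SS_xy = sum(xy)  - (sum x)(sum y) / n.
    """
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    if ss_xx == 0:
        # All x-values are equal, so no unique best-fitting line exists.
        raise ValueError("all x-values are the same")
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    slope = ss_xy / ss_xx
    intercept = sum_y / n - slope * sum_x / n  # y-bar minus slope times x-bar
    return slope, intercept

# Hypothetical data lying exactly on the line y = 2x - 1.
slope, intercept = least_squares_line([1, 2, 3], [1, 3, 5])
print(slope, intercept)
```

Because the made-up points lie exactly on a line, the function should recover that line's slope and intercept.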
Remember from Section 10.3 "Modelling Linear Relationships with Randomness Present" that the line with the equation $y={\mathit{\beta}}_{1}x+{\mathit{\beta}}_{0}$ is called the population regression line. The numbers ${\widehat{\mathit{\beta}}}_{1}$ and ${\widehat{\mathit{\beta}}}_{0}$ are statistics that estimate the population parameters ${\mathit{\beta}}_{1}$ and ${\mathit{\beta}}_{0}.$
We will compute the least squares regression line for the five-point data set, then for a more practical example that will be another running example for the introduction of new concepts in this and the next three sections.
Find the least squares regression line for the five-point data set
$$\begin{array}{cccccc}x& 2& 2& 6& 8& 10\\ y& 0& 1& 2& 3& 3\end{array}$$
and verify that it fits the data better than the line $\widehat{y}=\frac{1}{2}x-1$ considered in Section 10.4.1 "Goodness of Fit of a Straight Line to Data".
Solution:
In actual practice computation of the regression line is done using a statistical computation package. In order to clarify the meaning of the formulas we display the computations in tabular form.
| | x | y | $x^{2}$ | $xy$ |
|---|---|---|---|---|
| | 2 | 0 | 4 | 0 |
| | 2 | 1 | 4 | 2 |
| | 6 | 2 | 36 | 12 |
| | 8 | 3 | 64 | 24 |
| | 10 | 3 | 100 | 30 |
| Σ | 28 | 9 | 208 | 68 |
In the last line of the table we have the sum of the numbers in each column. Using them we compute:
$$\begin{array}{rl}S{S}_{xx}& =\mathrm{\Sigma}{x}^{2}-\frac{1}{n}{(\mathrm{\Sigma}x)}^{2}=208-\frac{1}{5}{(28)}^{2}=51.2\\ S{S}_{xy}& =\mathrm{\Sigma}xy-\frac{1}{n}(\mathrm{\Sigma}x)(\mathrm{\Sigma}y)=68-\frac{1}{5}(28)(9)=17.6\\ \stackrel{-}{x}& =\frac{\mathrm{\Sigma}x}{n}=\frac{28}{5}=5.6\\ \stackrel{-}{y}& =\frac{\mathrm{\Sigma}y}{n}=\frac{9}{5}=1.8\end{array}$$
so that
$${\widehat{\mathit{\beta}}}_{1}=\frac{S{S}_{xy}}{S{S}_{xx}}=\frac{17.6}{51.2}=0.34375\quad\text{and}\quad{\widehat{\mathit{\beta}}}_{0}=\stackrel{-}{y}-{\widehat{\mathit{\beta}}}_{1}\stackrel{-}{x}=1.8-(0.34375)(5.6)=-0.125$$
The least squares regression line for these data is
$$\widehat{y}=0.34375x-0.125$$
The computations for measuring how well it fits the sample data are given in Table 10.2 "The Errors in Fitting Data with the Least Squares Regression Line". The sum of the squared errors is the sum of the numbers in the last column, which is 0.75. It is less than 2, the sum of the squared errors for the fit of the line $\widehat{y}=\frac{1}{2}x-1$ to this data set.
Table 10.2 The Errors in Fitting Data with the Least Squares Regression Line
x | y | $\widehat{y}=0.34375x-0.125$ | $y-\widehat{y}$ | ${\left(y-\widehat{y}\right)}^{2}$ |
---|---|---|---|---|
2 | 0 | 0.5625 | −0.5625 | 0.31640625 |
2 | 1 | 0.5625 | 0.4375 | 0.19140625 |
6 | 2 | 1.9375 | 0.0625 | 0.00390625 |
8 | 3 | 2.6250 | 0.3750 | 0.14062500 |
10 | 3 | 3.3125 | −0.3125 | 0.09765625 |
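The comparison between the two lines can also be checked by machine. The following Python sketch is an illustration, not part of the original solution; the helper name `sse` and the sizes of the small "nudges" are arbitrary choices. It totals the last column of the table for the regression line, does the same for the trial line, and then perturbs the regression coefficients slightly to see whether the total can be lowered.

```python
# Five-point data set from the running example.
xs = [2, 2, 6, 8, 10]
ys = [0, 1, 2, 3, 3]

def sse(slope, intercept):
    """Sum of squared errors of the line y_hat = slope*x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

best = sse(0.34375, -0.125)  # least squares regression line
trial = sse(0.5, -1)         # the line considered earlier

# Shift the slope and intercept by small amounts (including no shift at all)
# and record the resulting sums of squared errors.
nudged = [
    sse(0.34375 + d_slope, -0.125 + d_intercept)
    for d_slope in (-0.01, 0, 0.01)
    for d_intercept in (-0.05, 0, 0.05)
]

print(best, trial)
print(min(nudged) >= best)  # no nudged line does better than the regression line
```

A handful of perturbations is only a spot check, of course; the guarantee that no line at all does better comes from the theory behind the formulas, not from the experiment.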
Table 10.3 "Data on Age and Value of Used Automobiles of a Specific Make and Model" shows the age in years and the retail value in thousands of dollars of a random sample of ten automobiles of the same make and model.
Table 10.3 Data on Age and Value of Used Automobiles of a Specific Make and Model
x | 2 | 3 | 3 | 3 | 4 | 4 | 5 | 5 | 5 | 6 |
y | 28.7 | 24.8 | 26.0 | 30.5 | 23.8 | 24.6 | 23.8 | 20.4 | 21.6 | 22.1 |
Solution:
Figure 10.7 Scatter Diagram for Age and Value of Used Automobiles
We must first compute $S{S}_{xx}$, $S{S}_{xy}$, $S{S}_{yy}$, which means computing $\mathrm{\Sigma}x$, $\mathrm{\Sigma}y$, $\mathrm{\Sigma}{x}^{2}$, $\mathrm{\Sigma}{y}^{2}$, and $\mathrm{\Sigma}xy.$ Using a computing device we obtain
$$\mathrm{\Sigma}x=40\quad\mathrm{\Sigma}y=246.3\quad\mathrm{\Sigma}{x}^{2}=174\quad\mathrm{\Sigma}{y}^{2}=6154.15\quad\mathrm{\Sigma}xy=956.5$$
Thus
$$\begin{array}{ll}S{S}_{xx}& =\mathrm{\Sigma}{x}^{2}-\frac{1}{n}{(\mathrm{\Sigma}x)}^{2}=174-\frac{1}{10}{(40)}^{2}=14\\ S{S}_{xy}& =\mathrm{\Sigma}xy-\frac{1}{n}(\mathrm{\Sigma}x)(\mathrm{\Sigma}y)=956.5-\frac{1}{10}(40)(246.3)=-28.7\\ S{S}_{yy}& =\mathrm{\Sigma}{y}^{2}-\frac{1}{n}{(\mathrm{\Sigma}y)}^{2}=6154.15-\frac{1}{10}{(246.3)}^{2}=87.781\end{array}$$
so that
$$r=\frac{S{S}_{xy}}{\sqrt{S{S}_{xx}\cdot S{S}_{yy}}}=\frac{-28.7}{\sqrt{(14)(87.781)}}=-0.819$$
The age and value of this make and model automobile are moderately strongly negatively correlated. As the age increases, the value of the automobile tends to decrease.
Using the values of $\mathrm{\Sigma}x$ and $\mathrm{\Sigma}y$ computed in part (b),
$$\stackrel{-}{x}=\frac{{\displaystyle \mathrm{\Sigma}x}}{n}=\frac{40}{10}=4\text{\hspace{1em}}\text{and}\text{\hspace{1em}}\stackrel{-}{y}=\frac{{\displaystyle \mathrm{\Sigma}y}}{n}=\frac{246.3}{10}=24.63$$Thus using the values of $S{S}_{xx}$ and $S{S}_{xy}$ from part (b),
$${\widehat{\mathit{\beta}}}_{1}=\frac{S{S}_{xy}}{S{S}_{xx}}=\frac{-28.7}{14}=-2.05\quad\text{and}\quad{\widehat{\mathit{\beta}}}_{0}=\stackrel{-}{y}-{\widehat{\mathit{\beta}}}_{1}\stackrel{-}{x}=24.63-(-2.05)(4)=32.83$$
The equation $\widehat{y}={\widehat{\mathit{\beta}}}_{1}x+{\widehat{\mathit{\beta}}}_{0}$ of the least squares regression line for these sample data is
$$\widehat{y}=-2.05x+32.83$$
Figure 10.8 "Scatter Diagram and Regression Line for Age and Value of Used Automobiles" shows the scatter diagram with the graph of the least squares regression line superimposed.
Figure 10.8 Scatter Diagram and Regression Line for Age and Value of Used Automobiles
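The text notes that such sums are obtained with a computing device. As one possible illustration (the variable names are our own, and any statistical package would serve equally well), the following Python sketch recomputes the correlation coefficient and the regression coefficients from the data in Table 10.3.

```python
from math import sqrt

# Table 10.3: age in years and retail value in thousands of dollars.
ages = [2, 3, 3, 3, 4, 4, 5, 5, 5, 6]
values = [28.7, 24.8, 26.0, 30.5, 23.8, 24.6, 23.8, 20.4, 21.6, 22.1]

n = len(ages)
sum_x, sum_y = sum(ages), sum(values)

# The three sums of squares used throughout the chapter.
ss_xx = sum(x * x for x in ages) - sum_x ** 2 / n
ss_xy = sum(x * y for x, y in zip(ages, values)) - sum_x * sum_y / n
ss_yy = sum(y * y for y in values) - sum_y ** 2 / n

r = ss_xy / sqrt(ss_xx * ss_yy)            # linear correlation coefficient
slope = ss_xy / ss_xx                      # beta-hat-1
intercept = sum_y / n - slope * sum_x / n  # beta-hat-0

print(round(r, 3), round(slope, 2), round(intercept, 2))
```

Up to rounding, the printed values should agree with the hand computation above.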
Since we know nothing about the automobile other than its age, we assume that it is of about average value and use the average value of all four-year-old vehicles of this make and model as our estimate. The average value is simply the value of $\widehat{y}$ obtained when the number 4 is inserted for x in the least squares regression equation:
$$\widehat{y}=-2.05\left(4\right)+32.83=24.63$$
which corresponds to \$24,630.
Now we insert $x=20$ into the least squares regression equation, to obtain
$$\widehat{y}=-2.05\left(20\right)+32.83=-8.17$$
which corresponds to −\$8,170. Something is wrong here, since a negative value makes no sense. The error arose from applying the regression equation to a value of x outside the range of x-values in the original data, which runs from two to six years.
Applying the regression equation $\widehat{y}={\widehat{\mathit{\beta}}}_{1}x+{\widehat{\mathit{\beta}}}_{0}$ to a value of x outside the range of x-values in the data set is called extrapolation. It is an invalid use of the regression equation and should be avoided.
For emphasis we highlight the points raised by parts (f) and (g) of the example.
The process of using the least squares regression equation to estimate the value of y at a value of x that does not lie in the range of the x-values in the data set that was used to form the regression line is called extrapolation. It is an invalid use of the regression equation that can lead to errors, and hence should be avoided.
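One way to guard against accidental extrapolation when predictions are made by computer is to have the prediction routine check the x-value first. The sketch below is only an illustration of that idea; the function `predict` and its parameters are hypothetical, and the coefficients and range are those of the automobile example.

```python
def predict(x, slope, intercept, x_min, x_max):
    """Return y_hat = slope*x + intercept, refusing to extrapolate.

    x_min and x_max are the smallest and largest x-values in the data
    that were used to fit the line.
    """
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x = {x} lies outside the data range [{x_min}, {x_max}]; "
            "using the regression equation here would be extrapolation"
        )
    return slope * x + intercept

# Automobile example: ages in the sample run from 2 to 6 years.
print(predict(4, -2.05, 32.83, 2, 6))  # inside the range: a valid estimate

try:
    predict(20, -2.05, 32.83, 2, 6)    # outside the range
except ValueError as err:
    print(err)
```

Raising an error is a deliberately strict design choice; a gentler routine might return the value together with a warning instead.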
In general, in order to measure the goodness of fit of a line to a set of data, we must compute the predicted y-value $\widehat{y}$ at every point in the data set, compute each error, square it, and then add up all the squares. In the case of the least squares regression line, however, the line that best fits the data, the sum of the squared errors can be computed directly from the data using the following formula.
The sum of the squared errors for the least squares regression line is denoted by $SSE.$ It can be computed using the formula
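$$SSE=S{S}_{yy}-{\widehat{\mathit{\beta}}}_{1}S{S}_{xy}$$
Find the sum of the squared errors $SSE$ for the least squares regression line for the five-point data set
$$\begin{array}{cccccc}x& 2& 2& 6& 8& 10\\ y& 0& 1& 2& 3& 3\end{array}$$
Do so in two ways: using the definition $\mathrm{\Sigma}{\left(y-\widehat{y}\right)}^{2}$, and using the formula $SSE=S{S}_{yy}-{\widehat{\mathit{\beta}}}_{1}S{S}_{xy}.$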
Solution:
The numbers $S{S}_{xy}$ and ${\widehat{\mathit{\beta}}}_{1}$ were already computed in Note 10.18 "Example 2" in the process of finding the least squares regression line. So was the number $\mathrm{\Sigma}y=9.$ We must compute $S{S}_{yy}.$ To do so it is necessary to first compute $\mathrm{\Sigma}{y}^{2}=0+{1}^{2}+{2}^{2}+{3}^{2}+{3}^{2}=23.$ Then
$$S{S}_{yy}=\mathrm{\Sigma}{y}^{2}-\frac{1}{n}{\left(\mathrm{\Sigma}y\right)}^{2}=23-\frac{1}{5}{\left(9\right)}^{2}=6.8$$
so that
$$SSE=S{S}_{yy}-{\widehat{\mathit{\beta}}}_{1}S{S}_{xy}=6.8-(0.34375)(17.6)=0.75$$
Find the sum of the squared errors $SSE$ for the least squares regression line for the data set, presented in Table 10.3 "Data on Age and Value of Used Automobiles of a Specific Make and Model", on age and values of used vehicles in Note 10.19 "Example 3".
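For problems of this kind, the shortcut formula can be cross-checked numerically against the definition. The Python sketch below is an illustration under our own naming (the function `sse_two_ways` is not from the text); it fits the least squares line to a data set and returns $SSE$ computed both ways, shown here on the five-point data.

```python
def sse_two_ways(xs, ys):
    """Return (SSE by definition, SSE by shortcut formula) for the
    least squares regression line fitted to the pairs (x, y)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_yy = sum(y * y for y in ys) - sum_y ** 2 / n

    slope = ss_xy / ss_xx
    intercept = sum_y / n - slope * sum_x / n

    # Definition: add up the squared errors point by point.
    by_definition = sum((y - (slope * x + intercept)) ** 2
                        for x, y in zip(xs, ys))
    # Shortcut: SSE = SS_yy - beta-hat-1 * SS_xy.
    by_formula = ss_yy - slope * ss_xy
    return by_definition, by_formula

print(sse_two_ways([2, 2, 6, 8, 10], [0, 1, 2, 3, 3]))
```

The two numbers may differ in the last few decimal places because of floating-point rounding. The shortcut is valid only for the least squares regression line itself, not for an arbitrary line.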
Solution:
From Note 10.19 "Example 3" we already know that
$$S{S}_{xy}=-28.7,\quad{\widehat{\mathit{\beta}}}_{1}=-2.05,\quad\text{and}\quad\mathrm{\Sigma}y=246.3$$
To compute $S{S}_{yy}$ we first compute
$$\mathrm{\Sigma}{y}^{2}={28.7}^{2}+{24.8}^{2}+{26.0}^{2}+{30.5}^{2}+{23.8}^{2}+{24.6}^{2}+{23.8}^{2}+{20.4}^{2}+{21.6}^{2}+{22.1}^{2}=6154.15$$
Then
$$S{S}_{yy}=\mathrm{\Sigma}{y}^{2}-\frac{1}{n}{(\mathrm{\Sigma}y)}^{2}=6154.15-\frac{1}{10}{(246.3)}^{2}=87.781$$
Therefore
$$SSE=S{S}_{yy}-{\widehat{\mathit{\beta}}}_{1}S{S}_{xy}=87.781-(-2.05)(-28.7)=28.946$$
For the Basic and Application exercises in this section use the computations that were done for the exercises with the same number in Section 10.2 "The Linear Correlation Coefficient".
Compute the least squares regression line for the data in Exercise 1 of Section 10.2 "The Linear Correlation Coefficient".
Compute the least squares regression line for the data in Exercise 2 of Section 10.2 "The Linear Correlation Coefficient".
Compute the least squares regression line for the data in Exercise 3 of Section 10.2 "The Linear Correlation Coefficient".
Compute the least squares regression line for the data in Exercise 4 of Section 10.2 "The Linear Correlation Coefficient".
For the data in Exercise 5 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 6 of Section 10.2 "The Linear Correlation Coefficient"
Compute the least squares regression line for the data in Exercise 7 of Section 10.2 "The Linear Correlation Coefficient".
Compute the least squares regression line for the data in Exercise 8 of Section 10.2 "The Linear Correlation Coefficient".
For the data in Exercise 9 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 10 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 11 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 12 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 13 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 14 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 15 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 16 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 17 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 18 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 19 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 20 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 21 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 22 of Section 10.2 "The Linear Correlation Coefficient"
Verify that no matter what the data are, the least squares regression line always passes through the point with coordinates $\left(\stackrel{-}{x},\stackrel{-}{y}\right).$ Hint: Find the predicted value of y when $x=\stackrel{-}{x}.$
In Exercise 1 you computed the least squares regression line for the data in Exercise 1 of Section 10.2 "The Linear Correlation Coefficient".
Reverse the roles of x and y and compute the least squares regression line for the new data set
$$\begin{array}{cccccc}x& 2& 4& 6& 5& 9\\ y& 0& 1& 3& 5& 8\end{array}$$
Large Data Set 1 lists the SAT scores and GPAs of 1,000 students.
http://www.gone.2012books.lardbucket.org/sites/all/files/data1.xls
Large Data Set 12 lists the golf scores on one round of golf for 75 golfers first using their own original clubs, then using clubs of a new, experimental design (after two months of familiarization with the new clubs).
http://www.gone.2012books.lardbucket.org/sites/all/files/data12.xls
Large Data Set 13 records the number of bidders and sales price of a particular type of antique grandfather clock at 60 auctions.
http://www.gone.2012books.lardbucket.org/sites/all/files/data13.xls
$\widehat{y}=0.743x+2.675$
$\widehat{y}=-0.610x+4.082$
$\widehat{y}=0.625x+1.25$, $SSE=5$
$\widehat{y}=0.6x+1.8$
$\widehat{y}=-1.45x+2.4$, $SSE=50.25$ (cannot use the definition to compute)