This is “Estimation and Prediction”, section 10.7 from the book Beginning Statistics (v. 1.0).


10.7 Estimation and Prediction

Learning Objectives

  1. To learn the distinction between estimation and prediction.
  2. To learn the distinction between a confidence interval and a prediction interval.
  3. To learn how to implement formulas for computing confidence intervals and prediction intervals.

Consider the following pairs of problems, in the context of Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line", the automobile age and value example.

  1.
    a. Estimate the average value of all four-year-old automobiles of this make and model.
    b. Construct a 95% confidence interval for the average value of all four-year-old automobiles of this make and model.
  2.
    a. Shylock intends to buy a four-year-old automobile of this make and model next week. Predict the value of the first such automobile that he encounters.
    b. Construct a 95% confidence interval for the value of the first such automobile that he encounters.

The method of solution and the answer to the first question in each pair, (1a) and (2a), are the same. When we set x equal to 4 in the least squares regression equation \(\hat{y} = -2.05x + 32.83\) that was computed in part (c) of Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line", the number returned,

\[\hat{y} = -2.05(4) + 32.83 = 24.63\]

which corresponds to value $24,630, is an estimate of precisely the number sought in question (1a): the mean E(y) of all y values when x = 4. Since nothing is known about the first four-year-old automobile of this make and model that Shylock will encounter, our best guess as to its value is the mean value E(y) of all such automobiles, the number 24.63 or $24,630, computed in the same way.
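
The point made above, that questions (1a) and (2a) are answered by the same computation, can be sketched in a few lines of Python. This is only an illustration: the names `SLOPE`, `INTERCEPT`, and `fitted_value` are made up for this sketch, and the coefficients are the slope −2.05 and intercept 32.83 of the regression line quoted above.

```python
# Least squares line from the automobile example:
# value (in thousands of dollars) as a function of age (in years).
SLOPE = -2.05
INTERCEPT = 32.83


def fitted_value(age):
    """Return y-hat, the fitted value in thousands of dollars, at the given age."""
    return SLOPE * age + INTERCEPT


# Question (1a): estimate of the mean value of all four-year-old automobiles.
estimate_of_mean = fitted_value(4)

# Question (2a): prediction for the single automobile Shylock encounters.
prediction_for_one_car = fitted_value(4)

print(round(estimate_of_mean, 2), round(prediction_for_one_car, 2))
```

Both quantities come from the same call; the two questions start to differ only when an interval is attached to the number.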

The answers to the second part of each question differ. In question (1b) we are trying to estimate a population parameter: the mean of all the y-values in the sub-population picked out by the value x = 4, that is, the average value of all four-year-old automobiles. In question (2b), however, we are not trying to capture a fixed parameter, but the value of the random variable y in one trial of an experiment: examine the first four-year-old car Shylock encounters. In the first case we seek to construct a confidence interval in the same sense that we have done before. In the second case the situation is different, and the interval constructed has a different name: prediction interval. Here we are trying to “predict” the value that a random variable will take.

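\(100(1-\alpha)\%\) Confidence Interval for the Mean Value of \(y\) at \(x = x_p\)

\[\hat{y}_p \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}\]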

where

  1. \(x_p\) is a particular value of \(x\) that lies in the range of x-values in the sample data set used to construct the least squares regression line;
  2. \(\hat{y}_p\) is the numerical value obtained when the least squares regression equation is evaluated at \(x = x_p\); and
  3. the number of degrees of freedom for \(t_{\alpha/2}\) is \(df = n - 2\).

The assumptions listed in Section 10.3 "Modelling Linear Relationships with Randomness Present" must hold.

The formula for the prediction interval is identical except for the presence of the number 1 underneath the square root sign. This means that the prediction interval is always wider than the confidence interval at the same confidence level and value of x. In practice the presence of the number 1 tends to make it much wider.
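
A sketch of where the extra 1 comes from, under the assumptions of Section 10.3 and writing \(\sigma\) for the common standard deviation of the errors and \(y_{\text{new}}\) for the new observation (both symbols are notation assumed here for illustration). The fitted value \(\hat{y}_p\) has its own sampling variance, and a new observation, being independent of the sample, contributes an additional \(\sigma^2\):

```latex
\begin{aligned}
\operatorname{Var}(\hat{y}_p)
  &= \sigma^2\left(\frac{1}{n} + \frac{(x_p-\bar{x})^2}{SS_{xx}}\right),\\
\operatorname{Var}(y_{\text{new}} - \hat{y}_p)
  &= \sigma^2 + \operatorname{Var}(\hat{y}_p)
   = \sigma^2\left(1 + \frac{1}{n} + \frac{(x_p-\bar{x})^2}{SS_{xx}}\right).
\end{aligned}
```

Replacing \(\sigma\) by its estimate \(s_\varepsilon\) and taking square roots gives the two standard errors. The first variance shrinks toward zero as \(n\) grows, while the second can never fall below \(\sigma^2\), which is why the prediction interval stays wide even for large samples.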

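\(100(1-\alpha)\%\) Prediction Interval for an Individual New Value of \(y\) at \(x = x_p\)

\[\hat{y}_p \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}\]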

where

  1. \(x_p\) is a particular value of \(x\) that lies in the range of x-values in the data set used to construct the least squares regression line;
  2. \(\hat{y}_p\) is the numerical value obtained when the least squares regression equation is evaluated at \(x = x_p\); and
  3. the number of degrees of freedom for \(t_{\alpha/2}\) is \(df = n - 2\).

The assumptions listed in Section 10.3 "Modelling Linear Relationships with Randomness Present" must hold.
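
The two formulas can be sketched as a pair of Python functions that work from summary statistics. This is an illustrative sketch, not code from the book: the function and parameter names are invented here, and the critical value of t is passed in as a number read from a table rather than computed. The inputs in the usage lines are the ones quoted in Example 12 of this section (n = 10, mean age 4, SSxx = 14, sε = 1.902169814, t = 2.306 for 8 degrees of freedom), applied to the four-year-old automobiles of questions (1b) and (2b).

```python
import math


def mean_confidence_interval(y_hat_p, t_crit, s_eps, n, x_p, x_bar, ss_xx):
    """Interval for the mean of y at x = x_p (no extra 1 under the root)."""
    half_width = t_crit * s_eps * math.sqrt(1 / n + (x_p - x_bar) ** 2 / ss_xx)
    return y_hat_p - half_width, y_hat_p + half_width


def prediction_interval(y_hat_p, t_crit, s_eps, n, x_p, x_bar, ss_xx):
    """Interval for one new value of y at x = x_p (extra 1 under the root)."""
    half_width = t_crit * s_eps * math.sqrt(1 + 1 / n + (x_p - x_bar) ** 2 / ss_xx)
    return y_hat_p - half_width, y_hat_p + half_width


# Four-year-old automobiles (x_p = 4) at the 95% level, values in $1000s.
stats = dict(y_hat_p=24.63, t_crit=2.306, s_eps=1.902169814,
             n=10, x_p=4, x_bar=4, ss_xx=14)

ci_low, ci_high = mean_confidence_interval(**stats)
pi_low, pi_high = prediction_interval(**stats)
print(f"confidence interval: ({ci_low:.3f}, {ci_high:.3f})")
print(f"prediction interval: ({pi_low:.3f}, {pi_high:.3f})")
```

With these inputs the confidence interval is roughly 24.63 ± 1.39 and the prediction interval roughly 24.63 ± 4.60. Keeping the two functions separate, rather than using one function with a flag, makes the single difference between them (the 1 under the square root) easy to see.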

Example 12

Using the sample data of Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line", recorded in Table 10.3 "Data on Age and Value of Used Automobiles of a Specific Make and Model", construct a 95% confidence interval for the average value of all three-and-one-half-year-old automobiles of this make and model.

Solution:

Solving this problem is merely a matter of finding the values of \(\hat{y}_p\), \(\alpha\) and \(t_{\alpha/2}\), \(s_\varepsilon\), \(\bar{x}\), and \(SS_{xx}\) and inserting them into the confidence interval formula given just above. Most of these quantities are already known. From Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line", \(SS_{xx} = 14\) and \(\bar{x} = 4\). From Note 10.31 "Example 7" in Section 10.5 "Statistical Inferences About \(\beta_1\)", \(s_\varepsilon = 1.902169814\).

From the statement of the problem \(x_p = 3.5\), the value of x of interest. The value of \(\hat{y}_p\) is the number given by the regression equation, which by Note 10.19 "Example 3" is \(\hat{y} = -2.05x + 32.83\), when \(x = x_p\), that is, when x = 3.5. Thus here \(\hat{y}_p = -2.05(3.5) + 32.83 = 25.655\).

Lastly, confidence level 95% means that \(\alpha = 1 - 0.95 = 0.05\), so \(\alpha/2 = 0.025\). Since the sample size is n = 10, there are \(n - 2 = 8\) degrees of freedom. By Figure 12.3 "Critical Values of t", \(t_{0.025} = 2.306\). Thus

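\[
\begin{aligned}
\hat{y}_p \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}
&= 25.655 \pm (2.306)(1.902169814)\sqrt{\frac{1}{10} + \frac{(3.5 - 4)^2}{14}} \\
&= 25.655 \pm 4.386403591\sqrt{0.1178571429} \\
&= 25.655 \pm 1.506
\end{aligned}
\]

which gives the interval (24.149, 27.161).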

We are 95% confident that the average value of all three-and-one-half-year-old vehicles of this make and model is between $24,149 and $27,161.

Example 13

Using the sample data of Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line", recorded in Table 10.3 "Data on Age and Value of Used Automobiles of a Specific Make and Model", construct a 95% prediction interval for the value of a randomly selected three-and-one-half-year-old automobile of this make and model.

Solution:

The computations for this example are identical to those of the previous example, except that now there is the extra number 1 beneath the square root sign. Since we were careful to record the intermediate results of that computation, we have immediately that the 95% prediction interval is

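\[
\begin{aligned}
\hat{y}_p \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}
&= 25.655 \pm 4.386403591\sqrt{1.1178571429} \\
&= 25.655 \pm 4.638
\end{aligned}
\]

which gives the interval (21.017, 30.293).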

We are 95% confident that the value of a randomly selected three-and-one-half-year-old vehicle of this make and model is between $21,017 and $30,293.

Note what an enormous difference the presence of the extra number 1 under the square root sign made. The prediction interval is about three times as wide as the confidence interval at the same level of confidence (a half-width of 4.638 versus 1.506).
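
The size of this effect can be checked directly. Because both intervals share the factor formed by the critical value of t and sε, the ratio of their widths depends only on the two square roots. The following is a minimal sketch using the values from the two examples above; the function name `root_factors` is made up for illustration.

```python
import math


def root_factors(n, x_p, x_bar, ss_xx):
    """Return the square-root factors of the confidence and prediction intervals."""
    core = 1 / n + (x_p - x_bar) ** 2 / ss_xx
    return math.sqrt(core), math.sqrt(1 + core)


# Values from Examples 12 and 13: n = 10, x_p = 3.5, mean age 4, SSxx = 14.
ci_root, pi_root = root_factors(n=10, x_p=3.5, x_bar=4, ss_xx=14)

# The common factor (critical t times s_epsilon) cancels in the ratio.
print(f"width ratio (prediction / confidence): {pi_root / ci_root:.2f}")
```

The quantity `core` shrinks as the sample grows, so the confidence interval can be made as narrow as desired, whereas the prediction factor never drops below 1.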

Key Takeaways

  • A confidence interval is used to estimate the mean value of y in the sub-population determined by the condition that x have some specific value \(x_p\).
  • The prediction interval is used to predict the value that the random variable y will take when x has some specific value \(x_p\).

Exercises

    Basic

    For the Basic and Application exercises in this section use the computations that were done for the exercises with the same number in previous sections.

  1. For the sample data set of Exercise 1 of Section 10.2 "The Linear Correlation Coefficient"

    a. Give a point estimate for the mean value of y in the sub-population determined by the condition x = 4.
    b. Construct the 90% confidence interval for that mean value.
  2. For the sample data set of Exercise 2 of Section 10.2 "The Linear Correlation Coefficient"

    a. Give a point estimate for the mean value of y in the sub-population determined by the condition x = 4.
    b. Construct the 90% confidence interval for that mean value.
  3. For the sample data set of Exercise 3 of Section 10.2 "The Linear Correlation Coefficient"

    a. Give a point estimate for the mean value of y in the sub-population determined by the condition x = 7.
    b. Construct the 95% confidence interval for that mean value.
  4. For the sample data set of Exercise 4 of Section 10.2 "The Linear Correlation Coefficient"

    a. Give a point estimate for the mean value of y in the sub-population determined by the condition x = 2.
    b. Construct the 80% confidence interval for that mean value.
  5. For the sample data set of Exercise 5 of Section 10.2 "The Linear Correlation Coefficient"

    a. Give a point estimate for the mean value of y in the sub-population determined by the condition x = 1.
    b. Construct the 80% confidence interval for that mean value.
  6. For the sample data set of Exercise 6 of Section 10.2 "The Linear Correlation Coefficient"

    a. Give a point estimate for the mean value of y in the sub-population determined by the condition x = 5.
    b. Construct the 95% confidence interval for that mean value.
  7. For the sample data set of Exercise 7 of Section 10.2 "The Linear Correlation Coefficient"

    a. Give a point estimate for the mean value of y in the sub-population determined by the condition x = 6.
    b. Construct the 99% confidence interval for that mean value.
    c. Is it valid to make the same estimates for x = 12? Explain.
  8. For the sample data set of Exercise 8 of Section 10.2 "The Linear Correlation Coefficient"

    a. Give a point estimate for the mean value of y in the sub-population determined by the condition x = 12.
    b. Construct the 80% confidence interval for that mean value.
    c. Is it valid to make the same estimates for x = 0? Explain.
  9. For the sample data set of Exercise 9 of Section 10.2 "The Linear Correlation Coefficient"

    a. Give a point estimate for the mean value of y in the sub-population determined by the condition x = 0.
    b. Construct the 90% confidence interval for that mean value.
    c. Is it valid to make the same estimates for x = −1? Explain.
  10. For the sample data set of Exercise 10 of Section 10.2 "The Linear Correlation Coefficient"

    a. Give a point estimate for the mean value of y in the sub-population determined by the condition x = 8.
    b. Construct the 95% confidence interval for that mean value.
    c. Is it valid to make the same estimates for x = 0? Explain.

    Applications

  1. For the data in Exercise 11 of Section 10.2 "The Linear Correlation Coefficient"

    a. Give a point estimate for the average number of words in the vocabulary of 18-month-old children.
    b. Construct the 95% confidence interval for that mean value.
    c. Is it valid to make the same estimates for two-year-olds? Explain.
  2. For the data in Exercise 12 of Section 10.2 "The Linear Correlation Coefficient"

    a. Give a point estimate for the average braking distance of automobiles that weigh 3,250 pounds.
    b. Construct the 80% confidence interval for that mean value.
    c. Is it valid to make the same estimates for 5,000-pound automobiles? Explain.
  3. For the data in Exercise 13 of Section 10.2 "The Linear Correlation Coefficient"

    a. Give a point estimate for the resting heart rate of a man who is 35 years old.
    b. One of the men in the sample is 35 years old, but his resting heart rate is not what you computed in part (a). Explain why this is not a contradiction.
    c. Construct the 90% confidence interval for the mean resting heart rate of all 35-year-old men.
  4. For the data in Exercise 14 of Section 10.2 "The Linear Correlation Coefficient"

    a. Give a point estimate for the wave height when the wind speed is 13 miles per hour.
    b. One of the wind speeds in the sample is 13 miles per hour, but the height of waves that day is not what you computed in part (a). Explain why this is not a contradiction.
    c. Construct the 90% confidence interval for the mean wave height on days when the wind speed is 13 miles per hour.
  5. For the data in Exercise 15 of Section 10.2 "The Linear Correlation Coefficient"

    a. The business owner intends to spend $2,500 on advertising next year. Give an estimate of next year’s revenue based on this fact.
    b. Construct the 90% prediction interval for next year’s revenue, based on the intent to spend $2,500 on advertising.
  6. For the data in Exercise 16 of Section 10.2 "The Linear Correlation Coefficient"

    a. A two-year-old girl is 32.3 inches long. Predict her adult height.
    b. Construct the 95% prediction interval for the girl’s adult height.
  7. For the data in Exercise 17 of Section 10.2 "The Linear Correlation Coefficient"

    a. Lodovico has a 78.6 average in his physics class just before the final. Give a point estimate of what his final exam grade will be.
    b. Explain whether an interval estimate for this problem is a confidence interval or a prediction interval.
    c. Based on your answer to (b), construct an interval estimate for Lodovico’s final exam grade at the 90% level of confidence.
  8. For the data in Exercise 18 of Section 10.2 "The Linear Correlation Coefficient"

    a. This year 86.2 million acres of corn were planted. Give a point estimate of the number of acres that will be harvested this year.
    b. Explain whether an interval estimate for this problem is a confidence interval or a prediction interval.
    c. Based on your answer to (b), construct an interval estimate for the number of acres that will be harvested this year, at the 99% level of confidence.
  9. For the data in Exercise 19 of Section 10.2 "The Linear Correlation Coefficient"

    a. Give a point estimate for the blood concentration of the active ingredient of this medication in a man who has consumed 1.5 ounces of the medication just recently.
    b. Gratiano consumed 1.5 ounces of this medication 30 minutes ago. Construct a 95% prediction interval for the concentration of the active ingredient in his blood right now.
  10. For the data in Exercise 20 of Section 10.2 "The Linear Correlation Coefficient"

    a. You measure the girth of a free-standing oak tree five feet off the ground and obtain the value 127 inches. How old do you estimate the tree to be?
    b. Construct a 90% prediction interval for the age of this tree.
  11. For the data in Exercise 21 of Section 10.2 "The Linear Correlation Coefficient"

    a. A test cylinder of concrete three days old fails at 1,750 psi. Predict what the 28-day strength of the concrete will be.
    b. Construct a 99% prediction interval for the 28-day strength of this concrete.
    c. Based on your answer to (b), what would be the minimum 28-day strength you could expect this concrete to exhibit?
  12. For the data in Exercise 22 of Section 10.2 "The Linear Correlation Coefficient"

    a. Tomorrow’s average temperature is forecast to be 53 degrees. Estimate the energy demand tomorrow.
    b. Construct a 99% prediction interval for the energy demand tomorrow.
    c. Based on your answer to (b), what would be the minimum demand you could expect?

    Large Data Set Exercises

  1. Large Data Set 1 lists the SAT scores and GPAs of 1,000 students.

    http://www.gone.2012books.lardbucket.org/sites/all/files/data1.xls

    a. Give a point estimate of the mean GPA of all students who score 1350 on the SAT.
    b. Construct a 90% confidence interval for the mean GPA of all students who score 1350 on the SAT.
  2. Large Data Set 12 lists the golf scores on one round of golf for 75 golfers first using their own original clubs, then using clubs of a new, experimental design (after two months of familiarization with the new clubs).

    http://www.gone.2012books.lardbucket.org/sites/all/files/data12.xls

    a. Thurio averages 72 strokes per round with his own clubs. Give a point estimate for his score on one round if he switches to the new clubs.
    b. Explain whether an interval estimate for this problem is a confidence interval or a prediction interval.
    c. Based on your answer to (b), construct an interval estimate for Thurio’s score on one round if he switches to the new clubs, at 90% confidence.
  3. Large Data Set 13 records the number of bidders and sales price of a particular type of antique grandfather clock at 60 auctions.

    http://www.gone.2012books.lardbucket.org/sites/all/files/data13.xls

    a. There are seven likely bidders at the Verona auction today. Give a point estimate for the price of such a clock at today’s auction.
    b. Explain whether an interval estimate for this problem is a confidence interval or a prediction interval.
    c. Based on your answer to (b), construct an interval estimate for the likely sale price of such a clock at today’s sale, at 95% confidence.

Answers

    Basic

  1.
    a. 5.647
    b. 5.647 ± 1.253
  3.
    a. −0.188
    b. −0.188 ± 3.041
  5.
    a. 1.875
    b. 1.875 ± 1.423
  7.
    a. 5.4
    b. 5.4 ± 3.355
    c. invalid (extrapolation)
  9.
    a. 2.4
    b. 2.4 ± 1.474
    c. valid (−1 is in the range of the x-values in the data set)

    Applications

  1.
    a. 31.3 words
    b. 31.3 ± 7.1 words
    c. not valid, since two years is 24 months, hence this is extrapolation
  3.
    a. 73.2 beats/min
    b. The man’s heart rate is not the predicted average for all men his age.
    c. 73.2 ± 1.2 beats/min
  5.
    a. $224,562
    b. $224,562 ± $28,699
  7.
    a. 74
    b. Prediction (one person, not an average for all who have average 78.6 before the final exam)
    c. 74 ± 24
  9.
    a. 0.066%
    b. 0.066 ± 0.034%
  11.
    a. 4,656 psi
    b. 4,656 ± 321 psi
    c. 4,656 − 321 = 4,335 psi

    Large Data Set Exercises

  1.
    a. 2.19
    b. (2.1421, 2.2316)
  3.
    a. 7771.39
    b. A prediction interval.
    c. (7410.41, 8132.38)