Extra Exercise 20
-----------------

Minitab commands and output:

MTB > WOpen "H:\VHM\VHM801\Datasets\Minitab\Xtra\anscombe.mtw".
Retrieving worksheet from file: 'H:\VHM\VHM801\Datasets\Minitab\Xtra\anscombe.mtw'
Worksheet was saved on 16/11/2014

MTB > Fitline 'y1' 'x1_3';
SUBC>   Confidence 95.0.

Regression Analysis: y1 versus x1_3

The regression equation is
y1 = 3.000 + 0.5001 x1_3

S = 1.23660   R-Sq = 66.7%   R-Sq(adj) = 62.9%

Analysis of Variance

Source      DF       SS       MS      F      P
Regression   1  27.5100  27.5100  17.99  0.002
Error        9  13.7627   1.5292
Total       10  41.2727

Fitted Line: y1 versus x1_3

MTB > Fitline 'y2' 'x1_3';
SUBC>   Confidence 95.0.

Regression Analysis: y2 versus x1_3

The regression equation is
y2 = 3.001 + 0.5000 x1_3

S = 1.23721   R-Sq = 66.6%   R-Sq(adj) = 62.9%

Analysis of Variance

Source      DF       SS       MS      F      P
Regression   1  27.5000  27.5000  17.97  0.002
Error        9  13.7763   1.5307
Total       10  41.2763

Fitted Line: y2 versus x1_3

MTB > Fitline 'y3' 'x1_3';
SUBC>   Confidence 95.0.

Regression Analysis: y3 versus x1_3

The regression equation is
y3 = 3.002 + 0.4997 x1_3

S = 1.23631   R-Sq = 66.6%   R-Sq(adj) = 62.9%

Analysis of Variance

Source      DF       SS       MS      F      P
Regression   1  27.4700  27.4700  17.97  0.002
Error        9  13.7562   1.5285
Total       10  41.2262

Fitted Line: y3 versus x1_3

MTB > Fitline 'y4' 'x4';
SUBC>   Confidence 95.0.

Regression Analysis: y4 versus x4

The regression equation is
y4 = 3.002 + 0.4999 x4

S = 1.23570   R-Sq = 66.7%   R-Sq(adj) = 63.0%

Analysis of Variance

Source      DF       SS       MS      F      P
Regression   1  27.4900  27.4900  18.00  0.002
Error        9  13.7425   1.5269
Total       10  41.2325

Fitted Line: y4 versus x4

MTB > Correlation 'x1_3'-'y4';
SUBC>   NoPValues.

Correlation: x1_3, y1, y2, y3, x4, y4

        x1_3      y1      y2      y3      x4
y1     0.816
y2     0.816   0.750
y3     0.816   0.469   0.588
x4    -0.400  -0.297  -0.451  -0.289
y4     0.003   0.065  -0.014   0.023   0.817

Cell Contents: Pearson correlation

Answers to questions:
---------------------

All four datasets have the same correlation and regression line to two decimals: r = 0.82 (0.816 for datasets 1-3 and 0.817 for dataset 4), and y = 3.00 + 0.50 x.

The prediction for x=10 is the same for all datasets: yhat = 3 + 0.5*10 = 8.

Only the first dataset (A) is suitable for linear regression. The second one is clearly curved and is not described well by a straight line. The third one has a clear outlier, which pulls the fitted regression line away from the linear pattern of the remaining points. After removal of the outlier, the data may be fitted well by a linear regression. In the fourth dataset, the line is drawn between two clusters (which can always be joined by a straight line), and it is not clear whether the line has any meaning for values in between.

Anscombe commented on the regressions as follows (in his 1973 paper):

"The data sets are graphed in the figures, together with the fitted line. Figure 1, corresponding to data set 1, is the kind of thing most people would see in their mind's eye, if they were presented with the above calculated summary. The theoretical description (A) seems to be perfectly appropriate here, and the calculated summary seems fair and adequate.

Figure 2 suggests forcefully that data set 2 does not conform with the theoretical description (A), but rather y has a smooth curved relation with x, possibly quadratic, and there is little residual variability.

Figure 3 similarly suggests that (A) is not a good description for data set 3: all but one of the observations lie close to a straight line (not the one yielded by the standard regression calculation), namely y = 4 + 0.346 x; and one observation is far from this line. Those are the essential facts that need to be understood and reported.

Figure 4, like Figure 1, shows data apparently conforming well with the theoretical description (A). If all observations are considered genuine and reliable, data set 4 is just as informative about the regression relation as data set 1; there is no reason to prefer either to the other. Yet in most circumstances we should feel that there was something unsatisfactory about data set 4. All the information about the slope of the regression line resides in one observation -- if that observation were deleted the slope could not be estimated. In most circumstances we are not quite sure that every observation is reliable. If any one observation were discredited and therefore deleted from data set 1, the remainder would tell much the same story. That is not so for data set 4. Thus the standard regression calculation ought to be accompanied by a warning that one observation has played a critical role."
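
The statements about data sets 3 and 4 can be checked numerically. The following is a minimal sketch in Python rather than Minitab, so it is not part of the original session. It assumes the worksheet columns y3, x1_3, x4 and y4 hold the values published by Anscombe (1973), which are typed in by hand here; the helper names fit_line and drop are hypothetical. The outlier in data set 3 is taken to be the point with the largest absolute residual from the full fit, which is one reasonable choice among several.

```python
# Anscombe's published values, typed in by hand (assumed to match anscombe.mtw).
x1_3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def fit_line(x, y):
    """Least-squares fit; returns (intercept, slope), or None if all x are equal."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    if sxx == 0:
        return None  # no spread in x, so the slope cannot be estimated
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def drop(values, index):
    """Return a copy of values with the observation at index removed."""
    return values[:index] + values[index + 1:]


# Data set 3: full fit, then refit without the largest-residual point.
full3 = fit_line(x1_3, y3)
worst = max(range(len(y3)),
            key=lambda i: abs(y3[i] - (full3[0] + full3[1] * x1_3[i])))
trimmed3 = fit_line(drop(x1_3, worst), drop(y3, worst))

# Data set 4: full fit, then refit without the single observation at x = 19.
full4 = fit_line(x4, y4)
lone = x4.index(19)
trimmed4 = fit_line(drop(x4, lone), drop(y4, lone))

print("data set 3, all points:     ", full3)
print("data set 3, outlier removed:", trimmed3)
print("data set 4, all points:     ", full4)
print("data set 4, x = 19 removed: ", trimmed4)
```

Under these assumptions, the full fits should agree with the Fitline output above, the trimmed fit for data set 3 should land close to the line y = 4 + 0.346 x quoted by Anscombe, and the trimmed fit for data set 4 returns None because every remaining x equals 8.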