Extra Exercise 20
-----------------

Minitab commands and output:

MTB > WOpen "H:\VHM\VHM801\Datasets\Minitab\Xtra\anscombe.mtw".
Retrieving worksheet from file:
'H:\VHM\VHM801\Datasets\Minitab\Xtra\anscombe.mtw'
Worksheet was saved on 16/11/2014

MTB > Fitline 'y1' 'x1_3';
SUBC>   Confidence 95.0.
Regression Analysis: y1 versus x1_3 

The regression equation is
y1 = 3.000 + 0.5001 x1_3

S = 1.23660   R-Sq = 66.7%   R-Sq(adj) = 62.9%

Analysis of Variance

Source      DF       SS       MS      F      P
Regression   1  27.5100  27.5100  17.99  0.002
Error        9  13.7627   1.5292
Total       10  41.2727
 
Fitted Line: y1 versus x1_3 

MTB > Fitline 'y2' 'x1_3';
SUBC>   Confidence 95.0.
Regression Analysis: y2 versus x1_3 

The regression equation is
y2 = 3.001 + 0.5000 x1_3

S = 1.23721   R-Sq = 66.6%   R-Sq(adj) = 62.9%

Analysis of Variance

Source      DF       SS       MS      F      P
Regression   1  27.5000  27.5000  17.97  0.002
Error        9  13.7763   1.5307
Total       10  41.2763
 
Fitted Line: y2 versus x1_3 

MTB > Fitline 'y3' 'x1_3';
SUBC>   Confidence 95.0.
Regression Analysis: y3 versus x1_3 

The regression equation is
y3 = 3.002 + 0.4997 x1_3

S = 1.23631   R-Sq = 66.6%   R-Sq(adj) = 62.9%

Analysis of Variance

Source      DF       SS       MS      F      P
Regression   1  27.4700  27.4700  17.97  0.002
Error        9  13.7562   1.5285
Total       10  41.2262
 
Fitted Line: y3 versus x1_3 

MTB > Fitline 'y4' 'x4';
SUBC>   Confidence 95.0.
Regression Analysis: y4 versus x4 

The regression equation is
y4 = 3.002 + 0.4999 x4

S = 1.23570   R-Sq = 66.7%   R-Sq(adj) = 63.0%

Analysis of Variance

Source      DF       SS       MS      F      P
Regression   1  27.4900  27.4900  18.00  0.002
Error        9  13.7425   1.5269
Total       10  41.2325
 
Fitted Line: y4 versus x4 

MTB > Correlation 'x1_3'-'y4';
SUBC>   NoPValues.
Correlation: x1_3, y1, y2, y3, x4, y4 

        x1_3      y1      y2      y3      x4
y1     0.816
y2     0.816   0.750
y3     0.816   0.469   0.588
x4    -0.400  -0.297  -0.451  -0.289
y4     0.003   0.065  -0.014   0.023   0.817

Cell Contents: Pearson correlation


Answers to questions:
---------------------
All four datasets have the same correlation and regression line (to two
decimals):
    r=0.82, and  y = 3.00 + 0.50 x
The prediction for x=10 is the same for all datasets: yhat=3+0.5*10=8.
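
The agreement can also be checked outside Minitab. The sketch below is in
Python (not part of the original exercise) and uses the data values
published in Anscombe (1973), which are assumed to be the contents of
anscombe.mtw. The helper name fit_line is hypothetical.

```python
from math import sqrt

# Values as published in Anscombe (1973); assumed to match anscombe.mtw.
x1_3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def fit_line(x, y):
    """Return least-squares intercept, slope and Pearson r for y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r


datasets = {"y1": (x1_3, y1), "y2": (x1_3, y2), "y3": (x1_3, y3), "y4": (x4, y4)}
results = {name: fit_line(x, y) for name, (x, y) in datasets.items()}

for name, (b0, b1, r) in results.items():
    yhat10 = b0 + b1 * 10  # prediction at x = 10
    print(f"{name}: y = {b0:.3f} + {b1:.4f} x, r = {r:.3f}, yhat(10) = {yhat10:.2f}")
```

Under that assumption about the data, the printed coefficients should
agree with the Minitab regression equations above up to rounding.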

Only the first dataset (A) is suitable for linear regression.
The second one is clearly curved and not described well by a straight
line. The third one has a clear outlier, which pulls the fitted regression
line away from the linear pattern of the other points. After removal of the
outlier, the data can be fitted well by a linear regression. In the fourth
dataset, the line is drawn between a cluster of points and a single remote
observation (two groups can always be joined by a straight line), and it is
not clear whether the line has any meaning for values in between.
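
The remarks on the second and third datasets can be illustrated
numerically. The following Python sketch is an illustration only, again
assuming the published Anscombe (1973) values; fit_line is a hypothetical
helper. It lists the residuals of the straight-line fit for dataset 2 in
order of x, then refits dataset 3 after dropping the observation with the
largest absolute residual.

```python
# Values as published in Anscombe (1973); assumed to match anscombe.mtw.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_line(x, y):
    """Return the least-squares (intercept, slope) for y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Dataset 2: residuals from the straight line, sorted by x.
# A curved relation shows up as a systematic sign pattern.
b0_2, b1_2 = fit_line(x, y2)
resid2 = sorted((xi, yi - (b0_2 + b1_2 * xi)) for xi, yi in zip(x, y2))
for xi, e in resid2:
    print(f"x = {xi:2d}  residual = {e:+.2f}")

# Dataset 3: find the observation with the largest absolute residual,
# drop it, and refit the line to the remaining points.
b0_3, b1_3 = fit_line(x, y3)
worst = max(range(len(x)), key=lambda i: abs(y3[i] - (b0_3 + b1_3 * x[i])))
x_kept = [xi for i, xi in enumerate(x) if i != worst]
y_kept = [yi for i, yi in enumerate(y3) if i != worst]
b0_kept, b1_kept = fit_line(x_kept, y_kept)
print(f"dropped point: x = {x[worst]}, y = {y3[worst]}")
print(f"refit: y = {b0_kept:.2f} + {b1_kept:.3f} x")
```

For dataset 2 the residuals are negative at both ends of the x range and
positive in the middle, which is the signature of curvature. For dataset 3
the refitted line is close to the one Anscombe quotes for the remaining
ten points.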

Anscombe commented on the regressions as follows (in his 1973 paper):

"The data sets are graphed in the figures, together with the fitted line.
Figure 1, corresponding to data set 1, is the kind of thing most people
would see in their mind's eye, if they were presented with the above
calculated summary. The theoretical description (A) seems to be
perfectly appropriate here, and the calculated summary seems fair and
adequate. Figure 2 suggests forcefully that data set 2 does not
conform with the theoretical description (A), but rather y has a smooth
curved relation with x, possibly quadratic, and there is little residual
variability. Figure 3 similarly suggests that (A) is not a good
description for data set 3: all but one of the observations lie close to
a straight line (not the one yielded by the standard regression
calculation), namely 
   y = 4 + 0.346 x; 
and one observation is far from this line. Those are the essential facts 
that need to be understood and reported. 

Figure 4, like Figure 1, shows data apparently conforming
well with the theoretical description (A). If all observations are
considered genuine and reliable, data set 4 is just as informative about
the regression relation as data set 1; there is no reason to prefer
either to the other. Yet in most circumstances we should feel that there
was something unsatisfactory about data set 4. All the information about
the slope of the regression line resides in one observation -- if that
observation were deleted the slope could not be estimated. In most
circumstances we are not quite sure that every observation is reliable.
If any one observation were discredited and therefore deleted from data
set 1, the remainder would tell much the same story. That is not so for
data set 4. Thus the standard regression calculation ought to be
accompanied by a warning that one observation has played a critical
role."
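
Anscombe's point about data set 4 can be made concrete with leverage
values. The Python sketch below is an illustration, not part of the
original exercise; it assumes the x4 values published in Anscombe (1973)
and uses the standard leverage formula for simple linear regression,
h_i = 1/n + (x_i - xbar)^2 / Sxx.

```python
# x4 values as published in Anscombe (1973); assumed to match anscombe.mtw.
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

n = len(x4)
xbar = sum(x4) / n
sxx = sum((xi - xbar) ** 2 for xi in x4)

# Leverage of each observation in a simple linear regression on x4.
leverage = [1 / n + (xi - xbar) ** 2 / sxx for xi in x4]

# Remove the observation with the highest leverage and recompute Sxx.
remote = max(range(n), key=lambda i: leverage[i])
x_rest = [xi for i, xi in enumerate(x4) if i != remote]
xbar_rest = sum(x_rest) / len(x_rest)
sxx_rest = sum((xi - xbar_rest) ** 2 for xi in x_rest)

for xi, h in zip(x4, leverage):
    print(f"x = {xi:2d}  leverage = {h:.3f}")
print(f"Sxx without the remote observation: {sxx_rest}")
```

A leverage of 1 means the fitted line is forced through that observation.
Once it is removed, all remaining x values coincide, so Sxx is zero and the
slope (Sxy / Sxx) cannot be estimated, as the quotation states.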