Supplementary Exercises 2.2 and 10.7 of IPS7e
---------------------------------------------

Golf scores for 12 members of a college women's golf team in two rounds
of tournament play (low scores are better). The scores in round 1 and
round 2 are both response variables. Although scores only take integer
values, it may be quite reasonable to assume normal distributions
because the range of scores can be quite wide.


2.2:
-----
Minitab commands for requested plot (we put round 1 on the x-axis because 
it is more natural to think of the round 2 following round 1 than the 
other way around) and calculation of correlation coefficient:

MTB > WOpen "H:\VHM\VHM801\Datasets\Minitab\Chapter 2\ex02_002.mtw".
Retrieving worksheet from file: 'H:\VHM\VHM801\Datasets\Minitab\Chapter
2\ex02_002.mtw'
Worksheet was saved on 15/11/2014

MTB > Plot 'round2'*'round1';
SUBC>   Symbol.
Scatterplot of round2 vs round1 


Answers to questions in (b):
----------------------------
- The association is positive; lower scores in round 1 are associated
with lower scores in round 2 as well.
- We would expect a positive association because the scores reflect the
skills of each player. A good player would be expected to have a low score 
in both rounds, whereas a less skilled player would be expected to have
higher scores in both rounds. Of course players do not always perform to
their level, but fluctations in scores should be random.

Answers to questions in (c):
----------------------------
The values for Player 8 fall outside the pattern: she had a very high
(poor) score in the first round (105), and an average score (89) in the 
second round. We have no way of telling which of these two scores
represents the "normal" level of this player; thus we cannot say whether
round 1 represented a "bad day" or round 2 represented an unusually good
round. Note that the scores for Player 7 are both high but quite similar,
so this point does not fall outside the pattern.


10.7:
-----
Minitab commands and output (continuing on the same worksheet as above):

MTB > Correlation  'round1' 'round2'.
Correlation: round1, round2 

Pearson correlation of round1 and round2 = 0.687
P-Value = 0.014

MTB > Copy 'round1' c4;
SUBC>   Varnames.
MTB > let c4(8)='*'
MTB > Correlation  'round1_1' 'round2'.
Correlation: round1_1, round2 

Pearson correlation of round1_1 and round2 = 0.842
P-Value = 0.001

Answers to questions:
---------------------
(a) answered above

(b) The sample correlation is 0.687. If we assume a joint normal
distribution for the two scores and that the women are an i.i.d.
sample from a meaningful population, we can test the hypothesis
  H0: rho=0   versus   Ha: rho<>0
by a t-test. Manual calculation gives t=2.99 with df=10. The Minitab 
listing gives P=0.014 for this test, and we conclude there is evidence 
to say that a non-zero correlation exists in the population. The sample 
correlation shows that the population correlation must be positive, as 
expected.

(c) Without the outlier we get a sample correlation of 0.842 and strong
statistical significance against H0 (P=0.001); manual calculation gives
t=4.68 and df=9. It is not surprising that removal of the outlier has 
this effect, because without this observation the linear relationship 
becomes considerably more apparent and convincing. 

Extra question:
---------------
Although the scores for Player 7 correspond well to the linear pattern,
removal of this point affects the correlation as well. Without Player 7 
alone the correlation drops down to 0.550 (P=0.080), and is no longer
statistically significant. Without both Players 7 and 8 we get a sample 
correlation of 0.661 (P=0.037), an even stronger impact. We know that 
extreme points can affect the correlation strongly, and without the 
scores of Player 8 we would indeed consider the scores for Player 7 
as somewhat extreme (in both rounds). 

Whether these findings mean that any of the two players should be
dropped from the data, is hard to say without additional information
about the population these data are supposed to represent. A cautious
conclusion is that the association is significant in any case, and that
one would need a larger sample size to decide about whether these points
do belong with the others or not. An alternative analytical approach is
to use the Spearman rank correlation coefficient which is less sensitive
to extreme points.

Minitab commands and output for extra question:

MTB > Copy 'round2' c5;
SUBC>   Varnames.
MTB > let c5(7)='*'
MTB > Correlation  'round1' 'round2_1'.
Correlation: round1, round2_1 

Pearson correlation of round1 and round2_1 = 0.550
P-Value = 0.080

MTB > Correlation  'round1_1' 'round2_1'.
Correlation: round1_1, round2_1 

Pearson correlation of round1_1 and round2_1 = 0.661
P-Value = 0.037