Solution file for additional exercise 4.3
-----------------------------------------

Data: presence or absence of disease (liver cirrhosis) in patients from two
cities, classified as alcoholics or non-alcoholics.
Notation:
  y_ijk = presence (1) or absence (0) of disease
for the k'th person in city i, i=1,0 (New York, Philadelphia), and
in alcoholic group j, j=1,0 (alcoholic, non-alcoholic), k=1,...,n_ij.

Alternative notation:
  y_i, with i referring to patient number 1,...,4600.

Full model for grouped data:
  Pr(y_ijk=1)=p_ij, with no restrictions on the p_ij's
  all y_ijk's are independent.

Logistic regression model:
  Pr(y_ijk=1)=p_ij
  logit(p_ij) = mu + alpha_i + beta_j
( or logit(p_i) = mu + alpha_city(i) + beta_alcohol(i) )
  all y_ijk's are independent.
Effectively, this model assumes no interaction on the logit scale between
the effects of city and alcoholic group.
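
As a rough numerical illustration of the additive model, the Python sketch
below plugs the coefficient estimates from the Minitab output further down
into logit(p_ij) = mu + alpha_i + beta_j. It assumes that New York and
non-alcoholic are the reference levels (alpha = beta = 0 there), which is
what the output table suggests. The names expit, fitted_probability and
odds are illustrative and not part of the exercise.

```python
import math

# Estimates copied from the Minitab "Logistic Regression Table".
# Assumption: New York / non-alcoholic is the reference cell.
MU = -2.88590                   # Constant
ALPHA_PHILADELPHIA = -0.681352  # city effect, Philadelphia vs New York
BETA_ALCOHOLIC = 2.20349        # alcoholic vs non-alcoholic


def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def fitted_probability(philadelphia, alcoholic):
    """Fitted Pr(disease) for 0/1 indicators of city and alcohol group."""
    eta = MU + ALPHA_PHILADELPHIA * philadelphia + BETA_ALCOHOLIC * alcoholic
    return expit(eta)


def odds(p):
    return p / (1.0 - p)


fitted = {(c, a): fitted_probability(c, a) for c in (0, 1) for a in (0, 1)}
for (c, a), p in sorted(fitted.items()):
    city = "Philadelphia" if c else "New York"
    group = "alcoholic" if a else "non-alcoholic"
    print(f"{city:12s} {group:14s} p = {p:.4f}")

# No interaction on the logit scale: the alcohol odds ratio is the same
# in both cities and equals exp(beta).
or_alcohol_ny = odds(fitted[(0, 1)]) / odds(fitted[(0, 0)])
or_alcohol_ph = odds(fitted[(1, 1)]) / odds(fitted[(1, 0)])
print(f"alcohol odds ratio: NY {or_alcohol_ny:.2f}, Ph {or_alcohol_ph:.2f}")
```

The fitted probabilities differ between the cities, but the odds ratio for
alcohol is identical in both, which is exactly the no-interaction assumption.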

MTB > WOpen "h:\VHM\VHM802\Data_csv\hs04_3.csv";
SUBC>   FType;
SUBC>     CSV;
SUBC>   DecSep;
SUBC>     Period;
SUBC>   Field;
SUBC>     Comma;
SUBC>   TDelimiter;
SUBC>     DoubleQuote.
Retrieving worksheet from file: 'h:\VHM\VHM802\Data_csv\hs04_3.csv'
Worksheet was saved on 27/01/2011

MTB > Blogistic 'disease' = city alcoholic; 
SUBC>   Frequency 'n';
SUBC>   Factors 'city' 'alcoholic';
SUBC>   Logit;
SUBC>   Brief 2.
Binary Logistic Regression: disease versus city, alcoholic 

Link Function: Logit

Response Information

Variable  Value  Count
disease   1        210  (Event)
          0       4390
          Total   4600

Frequency: n

Logistic Regression Table
                                                    Odds     95% CI
Predictor           Coef   SE Coef       Z      P  Ratio  Lower  Upper
Constant        -2.88590  0.163839  -17.61  0.000
city
 Philadelphia  -0.681352  0.171601   -3.97  0.000   0.51   0.36   0.71
alcoholic
 1               2.20349  0.160458   13.73  0.000   9.06   6.61  12.40

Log-Likelihood = -756.954
Test that all slopes are zero: G = 192.772, DF = 2, P-Value = 0.000

Goodness-of-Fit Tests

Method           Chi-Square  DF      P
Pearson            0.248164   1  0.618
Deviance           0.249304   1  0.618
Hosmer-Lemeshow    0.101655   1  0.750

Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

Value           1                   0
Group  Observed  Expected  Observed  Expected  Total
    1       105     103.6      3667    3668.4   3772
    2        25      26.4       475     473.6    500
    3        80      80.0       248     248.0    328

Measures of Association:
(Between the Response Variable and Predicted Probabilities)

Pairs       Number  Percent  Summary Measures
Concordant  429440     46.6  Somers' D              0.37
Discordant   85040      9.2  Goodman-Kruskal Gamma  0.67
Ties        407420     44.2  Kendall's Tau-a        0.03
Total       921900    100.0

Comments:
---------
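The deviance and Pearson goodness-of-fit statistics are non-significant
(P = 0.618). They compare the logistic regression model (3 parameters)
with the full model (4 parameters), and the single degree of freedom is
the city-by-alcohol interaction, so there is no evidence of interaction
between the effects of city and alcohol group. The Hosmer-Lemeshow test
is less useful here, because with only four covariate patterns it simply
regroups the same cells (into three groups in the table above).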
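
The P-values of the goodness-of-fit tests can be checked by hand. The
Python sketch below is only a verification aid: it uses the fact that a
chi-square variable with 1 degree of freedom is the square of a standard
normal, so its upper-tail probability is erfc(sqrt(x/2)). The function
name chi2_sf_df1 is made up for this illustration.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    # P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
    return math.erfc(math.sqrt(x / 2.0))


# Statistics copied from the Minitab "Goodness-of-Fit Tests" table (DF = 1).
gof = {
    "Pearson": 0.248164,
    "Deviance": 0.249304,
    "Hosmer-Lemeshow": 0.101655,
}

p_values = {name: chi2_sf_df1(stat) for name, stat in gof.items()}
for name, p in p_values.items():
    print(f"{name:16s} chi-square = {gof[name]:.6f}  P = {p:.3f}")
```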

The z-tests for city and alcohol are both clearly significant, so we do
not bother calculating the likelihood ratio statistics (by fitting the
relevant submodels and computing the differences in deviance).
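
For reference, the z-statistics and the overall G statistic in the output
can be reproduced from the printed numbers alone. The Python sketch below
does not fit any submodel, so it gives no likelihood ratio tests for the
individual terms. It only recomputes z = Coef / SE Coef, and rebuilds G
from the log-likelihood of the intercept-only model, which depends only on
the 210 events among 4600 patients. Variable names are illustrative.

```python
import math


def two_sided_normal_p(z):
    """Two-sided P-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# (Coef, SE Coef) copied from the Minitab "Logistic Regression Table".
estimates = {
    "city (Philadelphia)": (-0.681352, 0.171601),
    "alcoholic (1)": (2.20349, 0.160458),
}

wald = {name: coef / se for name, (coef, se) in estimates.items()}
for name, z in wald.items():
    print(f"{name:20s} z = {z:6.2f}  P = {two_sided_normal_p(z):.2e}")

# Intercept-only model: every patient has probability events / total.
events, non_events = 210, 4390
total = events + non_events
loglik_null = (events * math.log(events / total)
               + non_events * math.log(non_events / total))

loglik_fitted = -756.954  # "Log-Likelihood" line of the output
g_stat = 2.0 * (loglik_fitted - loglik_null)

# For a chi-square variable with 2 df the upper tail is exp(-x / 2).
g_p_value = math.exp(-g_stat / 2.0)
print(f"G = {g_stat:.3f} on 2 df, P = {g_p_value:.2e}")
```

The same difference-in-log-likelihood calculation would give the likelihood
ratio statistic for city or for alcohol, once the corresponding submodel
has been fitted.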

The odds ratio for city (Philadelphia vs New York) is 0.51 (95% CI 0.36
to 0.71), which means that the odds of being diseased are about twice as
high (1/0.51 = 1.96) in the patient group in New York as in Philadelphia.
It is not clear what that really means, because it may very well relate
to the selection of the patients. Maybe cities should be considered as
blocks, and not be given any particular interpretation.

The odds ratio for alcohol group (alcoholic vs. non-alcoholic) is 9.06
(95% CI 6.61 to 12.40), which means that the odds of disease are much
higher in the alcoholic group. Maybe not surprising, with today's
understanding, but these data are from before 1942.
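
The odds ratios and confidence limits quoted above follow directly from
the coefficients and standard errors in the output. A minimal Python
sketch, assuming the usual Wald interval exp(Coef +/- 1.96 * SE Coef)
(the helper name odds_ratio_ci is made up for this illustration):

```python
import math

Z_975 = 1.959964  # 97.5% quantile of the standard normal distribution


def odds_ratio_ci(coef, se, z=Z_975):
    """Return (odds ratio, lower limit, upper limit) on the odds scale."""
    return (math.exp(coef),
            math.exp(coef - z * se),
            math.exp(coef + z * se))


# (Coef, SE Coef) copied from the Minitab "Logistic Regression Table".
or_city = odds_ratio_ci(-0.681352, 0.171601)    # Philadelphia vs New York
or_alcohol = odds_ratio_ci(2.20349, 0.160458)   # alcoholic vs non-alcoholic

print("city:    OR = %.2f, 95%% CI %.2f to %.2f" % or_city)
print("alcohol: OR = %.2f, 95%% CI %.2f to %.2f" % or_alcohol)

# Reversing the comparison (New York vs Philadelphia) inverts the odds ratio.
ny_vs_ph = 1.0 / or_city[0]
print("New York vs Philadelphia: OR = %.2f" % ny_vs_ph)
```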