Solution file for additional exercise 4.3
-----------------------------------------

Data: presence or absence of disease (liver cirrhosis) in patients from
two cities, classified as alcoholics or non-alcoholics.
Notation:
  y_ijk = presence (1) or absence (0) of disease
for the k'th person in city i, i=1,0 (New York, Philadelphia), and
in alcoholic group j, j=1,0 (alcoholic, non-alcoholic), k=1,...,n_ij.

Alternative notation:
  y_i, with i referring to patient number 1,...,4600.

Full model for grouped data:
  Pr(y_ijk=1)=p_ij, with no restrictions on the p_ij's
  all y_ijk's are independent.

Logistic regression model:
  Pr(y_ijk=1)=p_ij
  logit(p_ij) = mu + alpha_i + beta_j
  (or logit(p_i) = mu + alpha_city(i) + beta_alcohol(i))
  all y_ijk's are independent.
Effectively, this model assumes no interaction on the logistic scale
between the effects of city and alcoholic group.
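
A minimal sketch of what "no interaction on the logistic scale" means:
under the additive model the odds ratio for alcoholic group is the same
in both cities and equals exp(beta). The parameter values below are
hypothetical, chosen only for illustration, and the 0/1 reference-cell
coding (effect of the reference level set to 0) is an assumption.

```python
import math

def expit(x):
    """Inverse logit: map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# Hypothetical parameter values (not estimates from the data).
mu, alpha, beta = -3.0, 0.5, 2.0

def prob(city, alcoholic):
    """Pr(disease) under logit(p) = mu + alpha*city + beta*alcoholic."""
    return expit(mu + alpha * city + beta * alcoholic)

# Odds ratio alcoholic vs. non-alcoholic, computed within each city.
or_city0 = odds(prob(0, 1)) / odds(prob(0, 0))
or_city1 = odds(prob(1, 1)) / odds(prob(1, 0))

print(or_city0, or_city1, math.exp(beta))
```

An interaction term would let the two within-city odds ratios differ.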

MTB > WOpen "H:\VHM\VHM802\Data_csv\hs04_3.csv";
SUBC>   FType;
SUBC>     CSV;
SUBC>   DecSep;
SUBC>     Period;
SUBC>   Field;
SUBC>     Comma;
SUBC>   TDelimiter;
SUBC>     DoubleQuote.
Retrieving worksheet from file: ‘H:\VHM\VHM802\Data_csv\hs04_3.csv’
Worksheet was saved on 27/01/2011

MTB > Gzlm;
SUBC>   Nodefault;
SUBC>   REvent 1;
SUBC>   Response 'disease';
SUBC>   Frequency 'n';
SUBC>   Categorical 'city' 'alcoholic';
SUBC>   Terms city alcoholic;
SUBC>   Constant;
SUBC>   Binomial;
SUBC>     Logit;
SUBC>   TOdds;
SUBC>   Tmethod;
SUBC>   Trinfo;
SUBC>   Tstep;
SUBC>   Tdeviance;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation;
SUBC>   Tgoodness;
SUBC>   Thosmer;
SUBC>   Tassociation.
Binary Logistic Regression: disease versus city, alcoholic 

Method
Link function                 Logit
Frequency                     n
Categorical predictor coding  (1, 0)
Rows used                     8

Response Information
Variable  Value  Count
disease   1        210  (Event)
          0       4390
          Total   4600

Deviance at Each Iterative Step
Step     Deviance
   1  1781.965789
   2  1544.116803
   3  1514.829898
   4  1513.908991
   5  1513.907631
   6  1513.907631

Deviance Table
Source         DF  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression      2   192.77    96.386      192.77    0.000
  city          1    14.53    14.526       14.53    0.000
  alcoholic     1   155.52   155.520      155.52    0.000
Error        4597  1513.91     0.329
Total        4599  1706.68
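
The P-values in the Deviance Table are printed as 0.000. A small sketch
of how they can be recovered, reading each Adj Dev as a chi-square
statistic on 1 DF (an assumption about how the table is built). For
1 DF the upper tail probability is erfc(sqrt(x/2)), so no statistics
library is needed.

```python
import math

def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 DF."""
    return math.erfc(math.sqrt(x / 2.0))

total_dev = 1706.68   # Total deviance from the table
error_dev = 1513.91   # Error (residual) deviance from the table

# The Regression line is the drop in deviance from the null model.
regression_dev = total_dev - error_dev

p_city = chi2_sf_1df(14.53)        # Adj Dev for city
p_alcoholic = chi2_sf_1df(155.52)  # Adj Dev for alcoholic

print(round(regression_dev, 2), p_city, p_alcoholic)
```

Both tail probabilities are far below 0.001, consistent with the
printed 0.000.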

Model Summary
Deviance   Deviance
    R-Sq  R-Sq(adj)      AIC
  11.30%     11.18%  1519.91
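
The printed Deviance R-Sq and AIC are consistent with the following
arithmetic on the Deviance Table. This is a sketch, not a statement of
Minitab's internal formulas: it assumes R-Sq = 1 - error deviance /
total deviance and AIC = error deviance + 2 * (number of parameters),
with three parameters (constant, city, alcoholic).

```python
error_dev = 1513.91   # Error deviance from the Deviance Table
total_dev = 1706.68   # Total deviance from the Deviance Table
n_params = 3          # constant + city + alcoholic (assumed count)

# Proportion of the total deviance explained by the model.
r_sq = 1.0 - error_dev / total_dev

# AIC under the assumed relation AIC = deviance + 2 * parameters.
aic = error_dev + 2 * n_params

print(round(100 * r_sq, 2), round(aic, 2))
```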

Coefficients
Term              Coef  SE Coef   VIF
Constant        -2.886    0.164
city
  Philadelphia  -0.681    0.172  1.04
alcoholic
  1              2.203    0.160  1.04

Odds Ratios for Categorical Predictors
Level A         Level B   Odds Ratio        95% CI
city
  Philadelphia  New York      0.5059  (0.3614,  0.7082)
alcoholic
  1             0             9.0565  (6.6127, 12.4035)
Odds ratio for level A relative to level B
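
The odds ratios and their confidence intervals can be reproduced from
the Coefficients table. The sketch below assumes the usual Wald
interval, exp(Coef +/- 1.96 * SE Coef). Because it starts from the
rounded coefficients, the results agree with the output only to about
two or three significant digits.

```python
import math

def odds_ratio_ci(coef, se, z=1.96):
    """Return (odds ratio, lower, upper) for a Wald 95% interval."""
    return (math.exp(coef),
            math.exp(coef - z * se),
            math.exp(coef + z * se))

# Coef and SE Coef as printed in the Coefficients table.
or_city, lo_city, hi_city = odds_ratio_ci(-0.681, 0.172)
or_alc, lo_alc, hi_alc = odds_ratio_ci(2.203, 0.160)

print("city:     ", or_city, lo_city, hi_city)
print("alcoholic:", or_alc, lo_alc, hi_alc)
```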

Regression Equation
P(1)  =  exp(Y')/(1 + exp(Y'))

Y' = -2.886 + 0.0 city_New York - 0.681 city_Philadelphia + 0.0 alcoholic_0
     + 2.203 alcoholic_1
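
As an illustration, the regression equation can be evaluated for the
four city-by-alcohol groups. The function and variable names are
introduced here for the sketch only; the coefficients are the rounded
values printed above.

```python
import math

def fitted_prob(philadelphia, alcoholic):
    """P(disease = 1) from the regression equation (0/1 indicators)."""
    y = -2.886 - 0.681 * philadelphia + 2.203 * alcoholic
    return math.exp(y) / (1.0 + math.exp(y))

p_ny_non = fitted_prob(0, 0)   # New York, non-alcoholic
p_ny_alc = fitted_prob(0, 1)   # New York, alcoholic
p_ph_non = fitted_prob(1, 0)   # Philadelphia, non-alcoholic
p_ph_alc = fitted_prob(1, 1)   # Philadelphia, alcoholic

for name, p in [("NY non-alc", p_ny_non), ("NY alc", p_ny_alc),
                ("Ph non-alc", p_ph_non), ("Ph alc", p_ph_alc)]:
    print(name, round(p, 3))
```

The fitted probabilities range from about 0.027 to about 0.336.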

Goodness-of-Fit Tests
Test               DF  Chi-Square  P-Value
Deviance         4597     1513.91    1.000
Pearson          4597     4621.91    0.395
Hosmer-Lemeshow     1        0.10    0.750

Observed and Expected Frequencies for Hosmer-Lemeshow Test
            Event
         Probability       disease = 1         disease = 0
Group       Range      Observed  Expected  Observed  Expected
    1  (0.000, 0.027)       105     103.6      3667    3668.4
    2  (0.027, 0.053)        25      26.4       475     473.6
    3  (0.053, 0.336)        80      80.0       248     248.0
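
The Hosmer-Lemeshow statistic can be checked from this table as the sum
of (Observed - Expected)^2 / Expected over all six cells. The sketch
below uses the rounded expected counts, so the P-value differs slightly
from the printed 0.750. With 1 DF the chi-square tail probability is
erfc(sqrt(x/2)).

```python
import math

# (observed, expected) pairs from the table: events, then non-events.
cells = [
    (105, 103.6), (3667, 3668.4),   # group 1
    (25, 26.4), (475, 473.6),       # group 2
    (80, 80.0), (248, 248.0),       # group 3
]

# Pearson-type sum over all cells.
hl = sum((obs - exp) ** 2 / exp for obs, exp in cells)

# Upper tail probability of a chi-square variable with 1 DF.
p_value = math.erfc(math.sqrt(hl / 2.0))

print(round(hl, 2), round(p_value, 2))
```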

Measures of Association
Pairs       Number  Percent  Summary Measures       Value
Concordant  429440     46.6  Somers’ D               0.37
Discordant   85040      9.2  Goodman-Kruskal Gamma   0.67
Ties        407420     44.2  Kendall’s Tau-a         0.03
Total       921900    100.0

Association is between the response variable and predicted probabilities
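
The summary measures follow from the pair counts. The sketch below uses
the standard definitions (assumed to be the ones behind the output):
Somers' D = (C - D) / all pairs, Gamma = (C - D) / (C + D), and
Tau-a = (C - D) / (N(N - 1)/2), where the pairs are formed from one
diseased and one non-diseased patient.

```python
concordant, discordant, ties = 429440, 85040, 407420
n_events, n_nonevents = 210, 4390          # from Response Information
n_total = n_events + n_nonevents

# Every (event, non-event) pair is concordant, discordant or tied.
pairs = n_events * n_nonevents
assert concordant + discordant + ties == pairs

somers_d = (concordant - discordant) / pairs
gamma = (concordant - discordant) / (concordant + discordant)
tau_a = (concordant - discordant) / (n_total * (n_total - 1) / 2)

print(round(somers_d, 2), round(gamma, 2), round(tau_a, 2))
```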

Comments:
---------
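The deviance and Pearson goodness-of-fit statistics are non-significant
(P = 1.000 and P = 0.395), indicating no evidence of interaction between
the effects of city and alcohol group. The Hosmer-Lemeshow test
(P = 0.750), which here pools the four city-by-alcohol groups into three
groups, is less useful.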

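The z-tests for city and alcohol (Coef / SE Coef, about -3.96 and 13.8)
are both clearly significant, and the Deviance Table already lists the
corresponding likelihood ratio statistics (14.53 and 155.52, each on
1 DF), so we do not bother fitting the relevant submodels and computing
the differences in deviance by hand.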
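
The Coefficients table prints Coef and SE Coef but no z-values, so here
is a short sketch of the z-tests: z = Coef / SE Coef, with a two-sided
P-value from the standard normal distribution, computed as
erfc(|z| / sqrt(2)).

```python
import math

def wald_z(coef, se):
    """Return (z, two-sided P-value) for a Wald z-test."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Coef and SE Coef as printed in the Coefficients table.
z_city, p_city = wald_z(-0.681, 0.172)
z_alc, p_alc = wald_z(2.203, 0.160)

print("city:     ", round(z_city, 2), p_city)
print("alcoholic:", round(z_alc, 2), p_alc)
```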

The odds ratio for city (Philadelphia vs. New York) is 0.51, which means
that the odds of disease are about twice as high (1/0.506 = 1.98) in the
NY patient group as in the Ph patient group. It is not clear what that
really means, because it may very well reflect how the patients were
selected. Maybe cities should be considered as blocks, and not be given
any particular interpretation.

The odds ratio for alcohol group (alcoholic vs. non-alcoholic) is 9.1,
which means that the odds of disease are about nine times as high in the
alcoholic group. Maybe not surprising, with today's understanding, but
these data are from before 1942.