Solution file for additional exercise 4.1
-----------------------------------------

Data: presence or absence of CHD condition in 100 patients with recorded
age. Notation:
  y_i = presence (1) or absence (0) of CHD,
  x_i = age,
for i'th patient, i=1,...,100.

Logistic regression model:
  Pr(y_i=1)=p_i
  logit(p_i) = beta_0 + beta_1*x_i
  y_1,...,y_100 are independent.

MTB > WOpen "h:\VHM\VHM802\Data_csv\hs04_1.csv";
SUBC>   FType;
SUBC>     CSV;
SUBC>   DecSep;
SUBC>     Period;
SUBC>   Field;
SUBC>     Comma;
SUBC>   TDelimiter;
SUBC>     DoubleQuote.
Retrieving worksheet from file: 'h:\VHM\VHM802\Data_csv\hs04_1.csv'
Worksheet was saved on 27/01/2011

MTB > Blogistic 'chd' = age;
SUBC>   Logit;
SUBC>   Brief 2;
SUBC>   Step.
Binary Logistic Regression: chd versus age 

Step  Log-Likelihood
   0        -68.3315
   1        -54.1706
   2        -53.6816
   3        -53.6765
   4        -53.6765
   5        -53.6765

Link Function: Logit

Response Information

Variable  Value  Count
chd       1         43  (Event)
          0         57
          Total    100

Logistic Regression Table

                                               Odds     95% CI
Predictor      Coef    SE Coef      Z      P  Ratio  Lower  Upper
Constant   -5.30945    1.13365  -4.68  0.000
age        0.110921  0.0240598   4.61  0.000   1.12   1.07   1.17

Log-Likelihood = -53.677
Test that all slopes are zero: G = 29.310, DF = 1, P-Value = 0.000

Goodness-of-Fit Tests

Method           Chi-Square  DF      P
Pearson             21.3113  41  0.995
Deviance            23.7543  41  0.986
Hosmer-Lemeshow      0.8900   8  0.999

Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

                             Group
Value    1    2    3    4    5    6    7    8    9   10  Total
1
  Obs    1    1    2    3    4    5    5   10    8    4     43
  Exp  0.8  1.3  1.9  3.0  4.1  4.7  5.8  9.3  7.9  4.3
0
  Obs    9    9    8    8    7    5    5    3    2    1     57
  Exp  9.2  8.7  8.1  8.0  6.9  5.3  4.2  3.7  2.1  0.7
Total   10   10   10   11   11   10   10   13   10    5    100

Measures of Association:
(Between the Response Variable and Predicted Probabilities)

Pairs       Number  Percent  Summary Measures
Concordant    1936     79.0  Somers' D              0.60
Discordant     466     19.0  Goodman-Kruskal Gamma  0.61
Ties            49      2.0  Kendall's Tau-a        0.30
Total         2451    100.0

Answers to question 1:
----------------------
The estimated parameters are
  beta_0: -5.309 (1.134), 95% CI: -5.309 +- 1.96*1.134 = -5.31 +- 2.22
  beta_1: 0.111 (0.024),  95% CI: 0.111 +- 1.96*0.024 = 0.111 +- 0.047

Interpretation: 
  beta_1 = slope of the linear regression on logit-scale,
           an increase in age of 1 unit (year) correspond to an increase
           in logit(p) of beta_1 units,
  exp(beta_1) = 1.12 = odds-ratio for comparison of two patients one year apart,
                       the CHD-risk (odds) of the older is 1.12 larger

  beta_0 = intercept of the linear regression on logit-scale and
           corresponds to value on logit-scale for age=0 - meaningless
           because age 0 is way beyond the range of the data

Test of slope=0 (no impact of age on CHD-risk)
  Walds test z = 4.61, P < 0.0005
  likelihood-ratio test G = 29.3, P < 0.0005 (because only one slope in model)
The data show a clear impact of age on CHD-risk.


Answers to question 2:
----------------------
Log-Likelihood - value that can sometimes be used to compute likelihood-ratio tests
Goodness-of-fit tests: Pearson and Deviance tests are meaningless (ungrouped data).
                       Hosmer-Lemeshow test is useful: 
it is the chi-square test for the shown table of observed and expected
values.

From the Minitab help menus:
"The table of concordant, discordant, and tied pairs is calculated by
pairing the observations with different response values. Here, you have
43 individuals with CHD and 57 with no disease, resulting in
43 * 57 = 2451 pairs with different response values. Based on the model,
pair is concordant if the individual with a low pulse rate has a higher
probability of having a low pulse, discordant if the opposite is true,
and tied if the probabilities are equal. In this example, 79% of pairs
are concordant and 19% are discordant. You can use these values as a
comparative measure of prediction, for example in comparing fits with
different sets of predictors or with different link functions."

The summary measures for concordant, discordant and tied pairs just play
around with these numbers, and are not terribly important.


Question 3:
-----------
For ungrouped data, the goodness-of-fit statistics based on the deviance
and the Pearson chi-square are useless. The Hosmer-Lemeshow test is
okay, but not always particularly powerful. One simple method is to add
a quadratic term in the linear relation, as shown below.

MTB > Name C3 'age2'
MTB > Let 'age2' = 'age'**2
MTB > Blogistic 'chd' = age age2;
SUBC>   Logit;
SUBC>   Brief 2;
SUBC>   Step.
Binary Logistic Regression: chd versus age, age2 

Step  Log-Likelihood
   0        -68.3315
   1        -54.0734
   2        -53.6501
   3        -53.6440
   4        -53.6440
   5        -53.6440

Link Function: Logit

Response Information

Variable  Value  Count
chd       1         43  (Event)
          0         57
          Total    100

Logistic Regression Table

                                                Odds     95% CI
Predictor       Coef    SE Coef      Z      P  Ratio  Lower  Upper
Constant    -4.24076    4.29018  -0.99  0.323
age        0.0613177   0.194640   0.32  0.753   1.06   0.73   1.56
age2       0.0005482  0.0021437   0.26  0.798   1.00   1.00   1.00

Log-Likelihood = -53.644
Test that all slopes are zero: G = 29.375, DF = 2, P-Value = 0.000

Goodness-of-Fit Tests

Method           Chi-Square  DF      P
Pearson             20.8487  40  0.995
Deviance            23.6893  40  0.981
Hosmer-Lemeshow      0.9891   8  0.998

Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

                             Group
Value    1    2    3    4    5    6    7    8    9   10  Total
1
  Obs    1    1    2    3    4    5    5   10    8    4     43
  Exp  0.9  1.4  1.9  2.9  3.9  4.6  5.7  9.3  8.0  4.4
0
  Obs    9    9    8    8    7    5    5    3    2    1     57
  Exp  9.1  8.6  8.1  8.1  7.1  5.4  4.3  3.7  2.0  0.6
Total   10   10   10   11   11   10   10   13   10    5    100

Measures of Association:
(Between the Response Variable and Predicted Probabilities)

Pairs       Number  Percent  Summary Measures
Concordant    1936     79.0  Somers' D              0.60
Discordant     466     19.0  Goodman-Kruskal Gamma  0.61
Ties            49      2.0  Kendall's Tau-a        0.30
Total         2451    100.0


Answers to question 3:
----------------------
It is seen that the improvement in logL from adding the quadratic term
is very small. To compare the models with and without the quadratic term
we may compute the likelihood-ratio test G2
  G2 = 2*(-53.644+53.677) = 0.066, df=1, Pr(chi^2(df)>0.07) >> 0.05
The LR-test shows no evidence whatsoever of lack of fit due to curvature
in the relation. We would reach the same conclusion by looking at the
Wald test for the quadratic term (z=0.26, P=0.80).


Question 4:
-----------
Another test of lack of fit may be obtained by grouping the quantitative
variable, as described in the text. The Data-Code menu is used to
create the age groups, and the Stat-Basic-Store menu is used to compute
the age group means and to aggregate the data to the groups.

MTB > Name c4 "agegrp"
MTB > Code (20:29) 1 (30:34) 2 (35:39) 3 (40:44) 4 (45:49) 5 (50:54) 6 (55 &
CONT>     :59) 7 (60:69) 8 'age' 'agegrp'
MTB > Name c5 "ByVar1" c6 "Mean1"
MTB > Statistics 'age';
SUBC>   By 'agegrp';
SUBC>   GValues 'ByVar1';
SUBC>   Mean 'Mean1'.
MTB > name c6 'agemean'
MTB > Name c7 "ByVar2" c8 "Sum2" c9 "Count2"
MTB > Statistics  'chd';
SUBC>   By 'agegrp';
SUBC>   GValues 'ByVar2';
SUBC>   Sums 'Sum2';
SUBC>   Count 'Count2'.
MTB > name c8 'chdpos' c9 'total'
MTB > Blogistic 'chdpos' 'total' =  agemean; 
SUBC>   ST;
SUBC>   Logit;
SUBC>   Brief 2;
SUBC>   Step.
Binary Logistic Regression: chdpos, total versus agemean 

Step  Log-Likelihood
   0        -68.3315
   1        -54.5984
   2        -54.1827
   3        -54.1791
   4        -54.1791
   5        -54.1791

Link Function: Logit

Response Information

Variable  Value      Count
chdpos    Event         43
          Non-event     57
total     Total        100

Logistic Regression Table

                                               Odds     95% CI
Predictor      Coef    SE Coef      Z      P  Ratio  Lower  Upper
Constant   -5.20386    1.11799  -4.65  0.000
agemean    0.108652  0.0237330   4.58  0.000   1.11   1.06   1.17

Log-Likelihood = -54.179
Test that all slopes are zero: G = 28.305, DF = 1, P-Value = 0.000

Goodness-of-Fit Tests

Method           Chi-Square  DF      P
Pearson            0.401068   6  0.999
Deviance           0.396712   6  0.999
Hosmer-Lemeshow    0.383176   5  0.996

Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

                          Group
Value        1     2    3    4    5     6    7  Total
Event
  Obs        1     2    3    5    6    18    8     43
  Exp      0.8   2.3  2.8  5.3  6.3  17.2  8.4
Non-event
  Obs        9    13    9   10    7     7    2     57
  Exp      9.2  12.7  9.2  9.7  6.7   7.8  1.6
Total       10    15   12   15   13    25   10    100

Measures of Association:
(Between the Response Variable and Predicted Probabilities)

Pairs       Number  Percent  Summary Measures
Concordant    1828     74.6  Somers' D              0.59
Discordant     386     15.7  Goodman-Kruskal Gamma  0.65
Ties           237      9.7  Kendall's Tau-a        0.29
Total         2451    100.0


Answers to question 4:
----------------------
The deviance goodness-of-fit test is the LR-test comparing a full model
with agegrp as a factor with the reduced model with agegrp as a continuous
predictor. The value of 0.396 is absolutely non-significant in a chi-square
distribution with 6 df. There is no indication of lack of fit.
Additionally, one could compute residuals and inspect these, but with
such a non-significant goodness-of-fit test it is almost certain that
the residuals would show no major model deviations.


Answers to question 5:
----------------------
The interpretation of beta_1 in terms of odds-ratio has already been
mentioned. The p's for the grouped data may be plotted together with the
fitted curve (from the grouped analysis). The fitted curve from the
ungrouped analysis may be plotted alone (without data points) as
follows:

MTB > Name c10 "ages"
MTB > Set 'ages'
DATA>   1( 29 : 60 / 1 )1
DATA>   End.
MTB > Name C11 'fitp'
MTB > Let 'fitp' = exp(-5.30945+.110921*c10)/(1+exp(-5.30945+.110921*c10))
MTB > Plot 'fitp'*'ages';
SUBC>   Symbol.
Scatterplot of fitp vs ages