Solution file for additional exercise 4.1
-----------------------------------------

Data: presence or absence of CHD condition in 100 patients with recorded
age. Notation:
  y_i = presence (1) or absence (0) of CHD,
  x_i = age,
for i'th patient, i=1,...,100.

Logistic regression model:
  Pr(y_i=1)=p_i
  logit(p_i) = beta_0 + beta_1*x_i
  y_1,...,y_100 are independent.

MTB > WOpen "H:\VHM\VHM802\Data_csv\hs04_1.csv";
SUBC>   FType;
SUBC>     CSV;
SUBC>   DecSep;
SUBC>     Period;
SUBC>   Field;
SUBC>     Comma;
SUBC>   TDelimiter;
SUBC>     DoubleQuote.
Retrieving worksheet from file: ‘H:\VHM\VHM802\Data_csv\hs04_1.csv’
Worksheet was saved on 27/01/2011

MTB > Gzlm;
SUBC>   Nodefault;
SUBC>   REvent 1;
SUBC>   Response 'chd';
SUBC>   Continuous 'age';
SUBC>   Terms age;
SUBC>   Constant;
SUBC>   Binomial;
SUBC>     Logit;
SUBC>   TOdds;
SUBC>     Increment 1;
SUBC>   Unstandardized;
SUBC>   TExpand;
SUBC>   Tmethod;
SUBC>   Trinfo;
SUBC>   Tdeviance;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation;
SUBC>   Tgoodness;
SUBC>   Thosmer;
SUBC>   Tassociation.
Binary Logistic Regression: chd versus age 

Method
Link function  Logit
Rows used      100


Response Information
Variable  Value  Count
chd       1         43  (Event)
          0         57
          Total    100

Deviance Table
Source      DF  Seq Dev  Contribution  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression   1    29.31        21.45%    29.31    29.310       29.31    0.000
  age        1    29.31        21.45%    29.31    29.310       29.31    0.000
Error       98   107.35        78.55%   107.35     1.095
Total       99   136.66       100.00%

Model Summary
Deviance   Deviance
    R-Sq  R-Sq(adj)     AIC
  21.45%     20.72%  111.35

Coefficients
Term        Coef  SE Coef       95% CI       Z-Value  P-Value   VIF
Constant   -5.31     1.13  ( -7.53,  -3.09)    -4.68    0.000
age       0.1109   0.0241  (0.0638, 0.1581)     4.61    0.000  1.00

Odds Ratios for Continuous Predictors
     Odds Ratio       95% CI
age      1.1173  (1.0658, 1.1713)

Regression Equation
P(1)  =  exp(Y')/(1 + exp(Y'))
Y' = -5.31 + 0.1109 age


Goodness-of-Fit Tests
Test             DF  Chi-Square  P-Value
Deviance         98      107.35    0.243
Pearson          98      101.94    0.372
Hosmer-Lemeshow   8        0.89    0.999

Observed and Expected Frequencies for Hosmer-Lemeshow Test
            Event
         Probability         chd = 1             chd = 0
Group       Range      Observed  Expected  Observed  Expected
    1  (0.000, 0.110)         1       0.8         9       9.2
    2  (0.110, 0.161)         1       1.3         9       8.7
    3  (0.161, 0.211)         2       1.9         8       8.1
    4  (0.211, 0.318)         3       3.0         8       8.0
    5  (0.318, 0.394)         4       4.1         7       6.9
    6  (0.394, 0.504)         5       4.7         5       5.3
    7  (0.504, 0.639)         5       5.8         5       4.2
    8  (0.639, 0.734)        10       9.3         3       3.7
    9  (0.734, 0.827)         8       7.9         2       2.1
   10  (0.827, 0.912)         4       4.3         1       0.7

Measures of Association
Pairs       Number  Percent  Summary Measures       Value
Concordant    1936     79.0  Somers’ D               0.60
Discordant     466     19.0  Goodman-Kruskal Gamma   0.61
Ties            49      2.0  Kendall’s Tau-a         0.30
Total         2451    100.0

Association is between the response variable and predicted probabilities


Answers to question 1:
----------------------
The estimated parameters are
  beta_0: -5.309 (1.134), 95% CI: -5.309 +- 1.96*1.134 = -5.31 +- 2.22
  beta_1: 0.111 (0.024),  95% CI: 0.111 +- 1.96*0.024 = 0.111 +- 0.047

Interpretation: 
  beta_1 = slope of the linear regression on logit-scale,
           an increase in age of 1 unit (year) correspond to an increase
           in logit(p) of beta_1 units,
  exp(beta_1) = 1.12 = odds-ratio for comparison of two patients one year apart,
                       the CHD-risk (odds) of the older is 1.12 larger

  beta_0 = intercept of the linear regression on logit-scale and
           corresponds to value on logit-scale for age=0 - meaningless
           because age 0 is way beyond the range of the data

Test of slope=0 (no impact of age on CHD-risk)
  Walds test z = 4.61, P < 0.0005
  likelihood-ratio test G = 29.31, P < 0.0005
The data show a clear impact of age on CHD-risk.


Answers to question 2:
----------------------
Log-Likelihood - value that can sometimes be used to compute likelihood-ratio tests
Goodness-of-fit tests: Pearson and Deviance tests are meaningless (ungrouped data).
                       Hosmer-Lemeshow test is useful: 
it is the chi-square test for the shown table of observed and expected
values.

From the Minitab help menus:
"The table of concordant, discordant, and tied pairs is calculated by
pairing the observations with different response values. Here, you have
43 individuals with CHD and 57 with no disease, resulting in
43 * 57 = 2451 pairs with different response values. Based on the model,
pair is concordant if the individual with a low pulse rate has a higher
probability of having a low pulse, discordant if the opposite is true,
and tied if the probabilities are equal. In this example, 79% of pairs
are concordant and 19% are discordant. You can use these values as a
comparative measure of prediction, for example in comparing fits with
different sets of predictors or with different link functions."

The summary measures for concordant, discordant and tied pairs just play
around with these numbers, and are not terribly important.


Question 3:
-----------
For ungrouped data, the goodness-of-fit statistics based on the deviance
and the Pearson chi-square are useless. The Hosmer-Lemeshow test is
okay, but not always particularly powerful. One simple method is to add
a quadratic term in the linear relation, as shown below.

MTB > Gzlm;
SUBC>   Nodefault;
SUBC>   REvent 1;
SUBC>   Response 'chd';
SUBC>   Continuous 'age';
SUBC>   Terms age age*age;
SUBC>   Constant;
SUBC>   Binomial;
SUBC>     Logit;
SUBC>   TOdds;
SUBC>     Increment 1;
SUBC>   Unstandardized;
SUBC>   TExpand;
SUBC>   Tmethod;
SUBC>   Trinfo;
SUBC>   Tdeviance;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation;
SUBC>   Tgoodness;
SUBC>   Thosmer.
Binary Logistic Regression: chd versus age 

... (same output as above omitted)

Deviance Table
Source      DF  Seq Dev  Contribution  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression   2   29.375        21.49%   29.375   14.6875       29.37    0.000
  age        1   29.310        21.45%    0.101    0.1009        0.10    0.751
  age*age    1    0.065         0.05%    0.065    0.0650        0.07    0.799
Error       97  107.288        78.51%  107.288    1.1061
Total       99  136.663       100.00%

Model Summary
Deviance   Deviance
    R-Sq  R-Sq(adj)     AIC
  21.49%     20.03%  113.29

Coefficients
Term         Coef  SE Coef         95% CI        Z-Value  P-Value    VIF
Constant    -4.24     4.29  (  -12.65,    4.17)    -0.99    0.323
age         0.061    0.195  (  -0.320,   0.443)     0.32    0.753  66.13
age*age   0.00055  0.00214  (-0.00365, 0.00475)     0.26    0.798  66.13

Odds Ratios for Continuous Predictors
      Odds    95%
     Ratio    CI
age      *  (*, *)
Odds ratios are not calculated for predictors that are included in interaction terms
     because these ratios depend on values of the other predictors in the interaction
     terms.

Regression Equation
P(1)  =  exp(Y')/(1 + exp(Y'))
Y' = -4.24 + 0.061 age + 0.00055 age*age

Goodness-of-Fit Tests
Test             DF  Chi-Square  P-Value
Deviance         97      107.29    0.223
Pearson          97      100.99    0.371
Hosmer-Lemeshow   8        0.99    0.998

Observed and Expected Frequencies for Hosmer-Lemeshow Test
            Event
         Probability         chd = 1             chd = 0
Group       Range      Observed  Expected  Observed  Expected
    1  (0.000, 0.119)         1       0.9         9       9.1
    2  (0.119, 0.165)         1       1.4         9       8.6
    3  (0.165, 0.210)         2       1.9         8       8.1
    4  (0.210, 0.309)         3       2.9         8       8.1
    5  (0.309, 0.382)         4       3.9         7       7.1
    6  (0.382, 0.491)         5       4.6         5       5.4
    7  (0.491, 0.634)         5       5.7         5       4.3
    8  (0.634, 0.738)        10       9.3         3       3.7
    9  (0.738, 0.841)         8       8.0         2       2.0
   10  (0.841, 0.931)         4       4.4         1       0.6


Answers to question 3:
----------------------
It is seen that the improvement in logL from adding the quadratic term
is very small. To compare the models with and without the quadratic term
we may compute the likelihood-ratio test G2
  G2 = 2*(-53.644+53.677) = 0.066, df=1, Pr(chi^2(df)>0.07) >> 0.05
Note that this test is already given in the Analysis of Deviance table.
The LR-test shows no evidence whatsoever of lack of fit due to curvature
in the relation. We would reach the same conclusion by looking at the
Wald test for the quadratic term (z=0.26, P=0.80).


Question 4:
-----------
Another test of lack of fit may be obtained by grouping the quantitative
variable, as described in the text. The Data-Code menu is used to
create the age groups, and the Stat-Basic-Store menu is used to compute
the age group means and to aggregate the data to the groups.

MTB > Code (20:29) 1 (30:34) 2 (35:39) 3 (40:44) 4 (45:49) 5 (50:54) 6 (55:59) 7 &
CONT>     (60:69) 8 'age' 'agegrp';
SUBC>   Endpoints 3;
SUBC>   TSummary.
Code 

Summary
Lower  Upper  Recoded   Number
  End    End    Value  of Rows
   20     29        1       10
   30     34        2       15
   35     39        3       12
   40     44        4       15
   45     49        5       13
   50     54        6        8
   55     59        7       17
   60     69        8       10

Source data column   age
Recoded data column  agegrp
Each interval includes its lower and upper ends.

MTB > Name c4 "ByVar1" c5 "Mean1"
MTB > Statistics 'age';
SUBC>   By 'agegrp';
SUBC>   GValues 'ByVar1';
SUBC>   Mean 'Mean1'.
MTB > name c5 'agemean'
MTB > Name c6 "ByVar2" c7 "Sum2" c8 "N2"
MTB > Statistics  'chd';
SUBC>   By 'agegrp';
SUBC>   GValues 'ByVar2';
SUBC>   Sums 'Sum2';
SUBC>   N 'N2'.
MTB > name c7 'chdpos'
MTB > name c8 'total'

MTB > Gzlm;
SUBC>   Nodefault;
SUBC>   Response chdpos total;
SUBC>   Evname "Event";
SUBC>   Continuous 'agemean';
SUBC>   Terms agemean;
SUBC>   Constant;
SUBC>   Binomial;
SUBC>     Logit;
SUBC>   TOdds;
SUBC>     Increment 1;
SUBC>   Unstandardized;
SUBC>   TExpand;
SUBC>   Tmethod;
SUBC>   Trinfo;
SUBC>   Tdeviance;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation;
SUBC>   Tgoodness;
SUBC>   Thosmer.
Binary Logistic Regression: chdpos versus agemean 

Method
Link function  Logit
Rows used      8

Response Information
                            Event
Variable  Value      Count  Name
chdpos    Event         43  Event
          Non-event     57
total     Total        100

Deviance Table
Source      DF  Seq Dev  Contribution  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression   1  28.3048        98.62%  28.3048   28.3048       28.30    0.000
  agemean    1  28.3048        98.62%  28.3048   28.3048       28.30    0.000
Error        6   0.3967         1.38%   0.3967    0.0661
Total        7  28.7015       100.00%

Model Summary
Deviance   Deviance
    R-Sq  R-Sq(adj)     AIC
  98.62%     95.13%  112.36

Coefficients
Term        Coef  SE Coef       95% CI       Z-Value  P-Value   VIF
Constant   -5.20     1.12  ( -7.40,  -3.01)    -4.65    0.000
agemean   0.1087   0.0237  (0.0621, 0.1552)     4.58    0.000  1.00

Odds Ratios for Continuous Predictors
         Odds Ratio       95% CI
agemean      1.1148  (1.0641, 1.1679)

Regression Equation
P(Event)  =  exp(Y')/(1 + exp(Y'))
Y' = -5.20 + 0.1087 agemean

Goodness-of-Fit Tests
Test             DF  Chi-Square  P-Value
Deviance          6        0.40    0.999
Pearson           6        0.40    0.999
Hosmer-Lemeshow   5        0.38    0.996

Observed and Expected Frequencies for Hosmer-Lemeshow Test
            Event
         Probability          Event             Non-event
Group       Range      Observed  Expected  Observed  Expected
    1  (0.000, 0.080)         1       0.8         9       9.2
    2  (0.080, 0.151)         2       2.3        13      12.7
    3  (0.151, 0.233)         3       2.8         9       9.2
    4  (0.233, 0.353)         5       5.3        10       9.7
    5  (0.353, 0.482)         6       6.3         7       6.7
    6  (0.482, 0.726)        18      17.2         7       7.8
    7  (0.726, 0.838)         8       8.4         2       1.6


Answers to question 4:
----------------------
The deviance goodness-of-fit test is the LR-test comparing a full model
with agegrp as a factor with the reduced model with agegrp as a continuous
predictor; the difference in deviance can also be found under Error in the 
Deviance table. The value of 0.396 is absolutely non-significant in a chi-square
distribution with 6 df. There is no indication of lack of fit.
Additionally, one could compute residuals and inspect these, but with
such a non-significant goodness-of-fit test it is almost certain that
the residuals would show no major model deviations.


Answers to question 5:
----------------------
The interpretation of beta_1 in terms of odds-ratio has already been
mentioned. The p's for the grouped data may be plotted together with the
fitted curve (from the grouped analysis). The fitted curve from the
ungrouped analysis may be plotted alone (without data points) as
follows:

MTB > Name c9 "ages"
MTB > Set 'ages'
DATA>   1( 29 : 60 / 1 )1
DATA>   End.
MTB > Name C10 'fitp'
MTB > Let 'fitp' = exp(-5.30945+.110921*c10)/(1+exp(-5.30945+.110921*c10))
MTB > Plot 'fitp'*'ages';
SUBC>   Symbol.
Scatterplot of fitp vs ages