Solution file for additional exercise 4.2 ----------------------------------------- Data: death or survival of beetles subjected to doses of CS2 Notation: y_ij = presence (1) or absence (0) of CHD, x_i = dose, for j'th beetle in dose group i, i=1,...,8 and j=1,...,n_i. Full model for grouped data: Pr(y_ij=1)=p_i, with no restrictions on the p_i's all y_ij's are independent. Logistic regression model: Pr(y_ij=1)=p_i logit(p_i) = beta_0 + beta_1*x_i all y_ij's are independent. MTB > WOpen "h:\VHM\VHM802\Data_csv\hs04_2.csv"; SUBC> FType; SUBC> CSV; SUBC> DecSep; SUBC> Period; SUBC> Field; SUBC> Comma; SUBC> TDelimiter; SUBC> DoubleQuote. Retrieving worksheet from file: 'h:\VHM\VHM802\Data_csv\hs04_2.csv' Worksheet was saved on 27/01/2011 MTB > Name c4 "DRES1" MTB > Blogistic 'dead' 'total' = dose; SUBC> ST; SUBC> Logit; SUBC> Dresiduals 'DRES1'; SUBC> Brief 2. Binary Logistic Regression: dead, total versus dose Link Function: Logit Response Information Variable Value Count dead Event 291 Non-event 190 total Total 481 Logistic Regression Table 95% CI Predictor Coef SE Coef Z P Odds Ratio Lower Constant -60.7175 5.18071 -11.72 0.000 dose 34.2703 2.91214 11.77 0.000 7.64563E+14 2.53834E+12 Predictor Upper Constant dose 2.30291E+17 Log-Likelihood = -186.235 Test that all slopes are zero: G = 272.970, DF = 1, P-Value = 0.000 Goodness-of-Fit Tests Method Chi-Square DF P Pearson 10.0268 6 0.124 Deviance 11.2322 6 0.081 Hosmer-Lemeshow 10.0268 6 0.124 Table of Observed and Expected Frequencies: (See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic) Group Value 1 2 3 4 5 6 7 8 Total Event Obs 6 13 18 28 52 53 61 60 291 Exp 3.5 9.8 22.5 33.9 50.1 53.3 59.2 58.7 Non-event Obs 53 47 44 28 11 6 1 0 190 Exp 55.5 50.2 39.5 22.1 12.9 5.7 2.8 1.3 Total 59 60 62 56 63 59 62 60 481 Measures of Association: (Between the Response Variable and Predicted Probabilities) Pairs Number Percent Summary Measures Concordant 48093 87.0 Somers' D 0.80 Discordant 3741 6.8 Goodman-Kruskal Gamma 0.86 Ties 3456 6.3 Kendall's Tau-a 0.38 Total 55290 100.0 MTB > Plot 'DRES1'*'dose'; SUBC> Symbol. Scatterplot of DRES1 vs dose Comments: --------- The estimated logistic regression line is logit(p) = -60.717 + 34.270*dose The parameter estimates are very large, reflecting that the doses are in a narrow range far away from zero. Predictions are only going to be sensible within (or very close to) the range of observed doses. There is a strong impact of dose on the mortality (which is obvious from the data as well). The odds-ratio for an increase in dose of 0.1 is exp(3.247) = 30.8 (which is huge). The goodness-of-fit statistics give P-values in the range 0.08-0.12, indicating that there is possibly some problem with the model. However, values in that range are not prohibitive for using the model. The plots of delta deviance and delta Pearson chi-square show rather large values (not too far from 3.84) but no clear pattern. It is much more informative to look at the residuals themselves, which can be stored in a column and plotted, e.g. versus dose. Such a residual plot shows a clear pattern, indicating lack of fit due to curvature. MTB > Name C5 'dose2' MTB > Let 'dose2' = 'dose'**2 MTB > Name c6 "DRES2" MTB > Blogistic 'dead' 'total' = dose dose2; SUBC> ST; SUBC> Logit; SUBC> Dresiduals 'DRES2'; SUBC> Brief 2. Binary Logistic Regression: dead, total versus dose, dose2 Link Function: Logit Response Information Variable Value Count dead Event 291 Non-event 190 total Total 481 Logistic Regression Table 95% CI Predictor Coef SE Coef Z P Odds Ratio Lower Constant 431.106 180.654 2.39 0.017 dose -520.615 204.523 -2.55 0.011 0.00 0.00 dose2 156.412 57.8630 2.70 0.007 8.48556E+67 4.72798E+18 Predictor Upper Constant dose 0.00 dose2 1.52295E+117 Log-Likelihood = -182.217 Test that all slopes are zero: G = 281.008, DF = 2, P-Value = 0.000 Goodness-of-Fit Tests Method Chi-Square DF P Pearson 3.00387 5 0.699 Deviance 3.19491 5 0.670 Hosmer-Lemeshow 3.00387 6 0.808 Table of Observed and Expected Frequencies: (See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic) Group Value 1 2 3 4 5 6 7 8 Total Event Obs 6 13 18 28 52 53 61 60 291 Exp 7.0 10.5 19.0 30.0 49.2 54.7 60.8 59.7 Non-event Obs 53 47 44 28 11 6 1 0 190 Exp 52.0 49.5 43.0 26.0 13.8 4.3 1.2 0.3 Total 59 60 62 56 63 59 62 60 481 Measures of Association: (Between the Response Variable and Predicted Probabilities) Pairs Number Percent Summary Measures Concordant 48093 87.0 Somers' D 0.80 Discordant 3741 6.8 Goodman-Kruskal Gamma 0.86 Ties 3456 6.3 Kendall's Tau-a 0.38 Total 55290 100.0 MTB > Plot 'DRES2'*'dose'; SUBC> Symbol. Scatterplot of DRES2 vs dose Comments: --------- The residual plot for the residuals of the model with added quadratic term looks much better. A comparison of the two models using the likelihood-ratio test proceeds as follows: G = 11.232-3.195 = 8.037, df=1 (one extra parameter), P + Pr(chi^2(df)>8.037) = 0.005 Alternatively, one may use the z-statistic for the regression coefficient for dose2, with a similar result. However, in general caution should be exercised for "large" estimates and standard errors, but in this case the values are okay. The conclusion is clear: a quadratic term is needed in the equation. The enthusiastic student may convince herself about that by plotting the data and fitted curves. The resulting statistical model is Logistic regression model with quadratic term: Pr(y_ij=1)=p_i logit(p_i) = beta_0 + beta_1*x_i + beta_2*(x_i^2) all y_ij's are independent and the estimated regression equation (on logistic scale) is logit(p) = 431.1 - 520.6*dose + 156.41*dose^2 As noted above, prediction from this model is only possible in a narrow range of dose values. Finally, sample commands to plot the estimated mortality curve. MTB > Name c10 'doses' MTB > Set 'doses' DATA> 1( 1.69 : 1.88 / 0.005 )1 DATA> End. MTB > Name c11 'fit mortality' MTB > let 'fit mortality' = expo(431.1-520.6*c10+156.41*c10*c10)/(1+expo(431.1-520.6*c10+156.41*c10*c10)) MTB > Plot 'fit mortality'*'doses'; SUBC> Symbol.