Solution file for additional exercise 4.1 ----------------------------------------- Data: presence or absence of CHD condition in 100 patients with recorded age. Notation: y_i = presence (1) or absence (0) of CHD, x_i = age, for i'th patient, i=1,...,100. Logistic regression model: Pr(y_i=1)=p_i logit(p_i) = beta_0 + beta_1*x_i y_1,...,y_100 are independent. MTB > WOpen "H:\VHM\VHM802\Data_csv\hs04_1.csv"; SUBC> FType; SUBC> CSV; SUBC> DecSep; SUBC> Period; SUBC> Field; SUBC> Comma; SUBC> TDelimiter; SUBC> DoubleQuote. Retrieving worksheet from file: ‘H:\VHM\VHM802\Data_csv\hs04_1.csv’ Worksheet was saved on 27/01/2011 MTB > Gzlm; SUBC> Nodefault; SUBC> REvent 1; SUBC> Response 'chd'; SUBC> Continuous 'age'; SUBC> Terms age; SUBC> Constant; SUBC> Binomial; SUBC> Logit; SUBC> TOdds; SUBC> Increment 1; SUBC> Unstandardized; SUBC> TExpand; SUBC> Tmethod; SUBC> Trinfo; SUBC> Tdeviance; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation; SUBC> Tgoodness; SUBC> Thosmer; SUBC> Tassociation. Binary Logistic Regression: chd versus age Method Link function Logit Rows used 100 Response Information Variable Value Count chd 1 43 (Event) 0 57 Total 100 Deviance Table Source DF Seq Dev Contribution Adj Dev Adj Mean Chi-Square P-Value Regression 1 29.31 21.45% 29.31 29.310 29.31 0.000 age 1 29.31 21.45% 29.31 29.310 29.31 0.000 Error 98 107.35 78.55% 107.35 1.095 Total 99 136.66 100.00% Model Summary Deviance Deviance R-Sq R-Sq(adj) AIC 21.45% 20.72% 111.35 Coefficients Term Coef SE Coef 95% CI Z-Value P-Value VIF Constant -5.31 1.13 ( -7.53, -3.09) -4.68 0.000 age 0.1109 0.0241 (0.0638, 0.1581) 4.61 0.000 1.00 Odds Ratios for Continuous Predictors Odds Ratio 95% CI age 1.1173 (1.0658, 1.1713) Regression Equation P(1) = exp(Y')/(1 + exp(Y')) Y' = -5.31 + 0.1109 age Goodness-of-Fit Tests Test DF Chi-Square P-Value Deviance 98 107.35 0.243 Pearson 98 101.94 0.372 Hosmer-Lemeshow 8 0.89 0.999 Observed and Expected Frequencies for Hosmer-Lemeshow Test Event Probability chd = 1 chd = 0 Group Range Observed Expected Observed Expected 1 (0.000, 0.110) 1 0.8 9 9.2 2 (0.110, 0.161) 1 1.3 9 8.7 3 (0.161, 0.211) 2 1.9 8 8.1 4 (0.211, 0.318) 3 3.0 8 8.0 5 (0.318, 0.394) 4 4.1 7 6.9 6 (0.394, 0.504) 5 4.7 5 5.3 7 (0.504, 0.639) 5 5.8 5 4.2 8 (0.639, 0.734) 10 9.3 3 3.7 9 (0.734, 0.827) 8 7.9 2 2.1 10 (0.827, 0.912) 4 4.3 1 0.7 Measures of Association Pairs Number Percent Summary Measures Value Concordant 1936 79.0 Somers’ D 0.60 Discordant 466 19.0 Goodman-Kruskal Gamma 0.61 Ties 49 2.0 Kendall’s Tau-a 0.30 Total 2451 100.0 Association is between the response variable and predicted probabilities Answers to question 1: ---------------------- The estimated parameters are beta_0: -5.309 (1.134), 95% CI: -5.309 +- 1.96*1.134 = -5.31 +- 2.22 beta_1: 0.111 (0.024), 95% CI: 0.111 +- 1.96*0.024 = 0.111 +- 0.047 Interpretation: beta_1 = slope of the linear regression on logit-scale, an increase in age of 1 unit (year) correspond to an increase in logit(p) of beta_1 units, exp(beta_1) = 1.12 = odds-ratio for comparison of two patients one year apart, the CHD-risk (odds) of the older is 1.12 larger beta_0 = intercept of the linear regression on logit-scale and corresponds to value on logit-scale for age=0 - meaningless because age 0 is way beyond the range of the data Test of slope=0 (no impact of age on CHD-risk) Walds test z = 4.61, P < 0.0005 likelihood-ratio test G = 29.31, P < 0.0005 The data show a clear impact of age on CHD-risk. Answers to question 2: ---------------------- Log-Likelihood - value that can sometimes be used to compute likelihood-ratio tests Goodness-of-fit tests: Pearson and Deviance tests are meaningless (ungrouped data). Hosmer-Lemeshow test is useful: it is the chi-square test for the shown table of observed and expected values. From the Minitab help menus: "The table of concordant, discordant, and tied pairs is calculated by pairing the observations with different response values. Here, you have 43 individuals with CHD and 57 with no disease, resulting in 43 * 57 = 2451 pairs with different response values. Based on the model, pair is concordant if the individual with a low pulse rate has a higher probability of having a low pulse, discordant if the opposite is true, and tied if the probabilities are equal. In this example, 79% of pairs are concordant and 19% are discordant. You can use these values as a comparative measure of prediction, for example in comparing fits with different sets of predictors or with different link functions." The summary measures for concordant, discordant and tied pairs just play around with these numbers, and are not terribly important. Question 3: ----------- For ungrouped data, the goodness-of-fit statistics based on the deviance and the Pearson chi-square are useless. The Hosmer-Lemeshow test is okay, but not always particularly powerful. One simple method is to add a quadratic term in the linear relation, as shown below. MTB > Gzlm; SUBC> Nodefault; SUBC> REvent 1; SUBC> Response 'chd'; SUBC> Continuous 'age'; SUBC> Terms age age*age; SUBC> Constant; SUBC> Binomial; SUBC> Logit; SUBC> TOdds; SUBC> Increment 1; SUBC> Unstandardized; SUBC> TExpand; SUBC> Tmethod; SUBC> Trinfo; SUBC> Tdeviance; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation; SUBC> Tgoodness; SUBC> Thosmer. Binary Logistic Regression: chd versus age ... (same output as above omitted) Deviance Table Source DF Seq Dev Contribution Adj Dev Adj Mean Chi-Square P-Value Regression 2 29.375 21.49% 29.375 14.6875 29.37 0.000 age 1 29.310 21.45% 0.101 0.1009 0.10 0.751 age*age 1 0.065 0.05% 0.065 0.0650 0.07 0.799 Error 97 107.288 78.51% 107.288 1.1061 Total 99 136.663 100.00% Model Summary Deviance Deviance R-Sq R-Sq(adj) AIC 21.49% 20.03% 113.29 Coefficients Term Coef SE Coef 95% CI Z-Value P-Value VIF Constant -4.24 4.29 ( -12.65, 4.17) -0.99 0.323 age 0.061 0.195 ( -0.320, 0.443) 0.32 0.753 66.13 age*age 0.00055 0.00214 (-0.00365, 0.00475) 0.26 0.798 66.13 Odds Ratios for Continuous Predictors Odds 95% Ratio CI age * (*, *) Odds ratios are not calculated for predictors that are included in interaction terms because these ratios depend on values of the other predictors in the interaction terms. Regression Equation P(1) = exp(Y')/(1 + exp(Y')) Y' = -4.24 + 0.061 age + 0.00055 age*age Goodness-of-Fit Tests Test DF Chi-Square P-Value Deviance 97 107.29 0.223 Pearson 97 100.99 0.371 Hosmer-Lemeshow 8 0.99 0.998 Observed and Expected Frequencies for Hosmer-Lemeshow Test Event Probability chd = 1 chd = 0 Group Range Observed Expected Observed Expected 1 (0.000, 0.119) 1 0.9 9 9.1 2 (0.119, 0.165) 1 1.4 9 8.6 3 (0.165, 0.210) 2 1.9 8 8.1 4 (0.210, 0.309) 3 2.9 8 8.1 5 (0.309, 0.382) 4 3.9 7 7.1 6 (0.382, 0.491) 5 4.6 5 5.4 7 (0.491, 0.634) 5 5.7 5 4.3 8 (0.634, 0.738) 10 9.3 3 3.7 9 (0.738, 0.841) 8 8.0 2 2.0 10 (0.841, 0.931) 4 4.4 1 0.6 Answers to question 3: ---------------------- It is seen that the improvement in logL from adding the quadratic term is very small. To compare the models with and without the quadratic term we may compute the likelihood-ratio test G2 G2 = 2*(-53.644+53.677) = 0.066, df=1, Pr(chi^2(df)>0.07) >> 0.05 Note that this test is already given in the Analysis of Deviance table. The LR-test shows no evidence whatsoever of lack of fit due to curvature in the relation. We would reach the same conclusion by looking at the Wald test for the quadratic term (z=0.26, P=0.80). Question 4: ----------- Another test of lack of fit may be obtained by grouping the quantitative variable, as described in the text. The Data-Code menu is used to create the age groups, and the Stat-Basic-Store menu is used to compute the age group means and to aggregate the data to the groups. MTB > Code (20:29) 1 (30:34) 2 (35:39) 3 (40:44) 4 (45:49) 5 (50:54) 6 (55:59) 7 & CONT> (60:69) 8 'age' 'agegrp'; SUBC> Endpoints 3; SUBC> TSummary. Code Summary Lower Upper Recoded Number End End Value of Rows 20 29 1 10 30 34 2 15 35 39 3 12 40 44 4 15 45 49 5 13 50 54 6 8 55 59 7 17 60 69 8 10 Source data column age Recoded data column agegrp Each interval includes its lower and upper ends. MTB > Name c4 "ByVar1" c5 "Mean1" MTB > Statistics 'age'; SUBC> By 'agegrp'; SUBC> GValues 'ByVar1'; SUBC> Mean 'Mean1'. MTB > name c5 'agemean' MTB > Name c6 "ByVar2" c7 "Sum2" c8 "N2" MTB > Statistics 'chd'; SUBC> By 'agegrp'; SUBC> GValues 'ByVar2'; SUBC> Sums 'Sum2'; SUBC> N 'N2'. MTB > name c7 'chdpos' MTB > name c8 'total' MTB > Gzlm; SUBC> Nodefault; SUBC> Response chdpos total; SUBC> Evname "Event"; SUBC> Continuous 'agemean'; SUBC> Terms agemean; SUBC> Constant; SUBC> Binomial; SUBC> Logit; SUBC> TOdds; SUBC> Increment 1; SUBC> Unstandardized; SUBC> TExpand; SUBC> Tmethod; SUBC> Trinfo; SUBC> Tdeviance; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation; SUBC> Tgoodness; SUBC> Thosmer. Binary Logistic Regression: chdpos versus agemean Method Link function Logit Rows used 8 Response Information Event Variable Value Count Name chdpos Event 43 Event Non-event 57 total Total 100 Deviance Table Source DF Seq Dev Contribution Adj Dev Adj Mean Chi-Square P-Value Regression 1 28.3048 98.62% 28.3048 28.3048 28.30 0.000 agemean 1 28.3048 98.62% 28.3048 28.3048 28.30 0.000 Error 6 0.3967 1.38% 0.3967 0.0661 Total 7 28.7015 100.00% Model Summary Deviance Deviance R-Sq R-Sq(adj) AIC 98.62% 95.13% 112.36 Coefficients Term Coef SE Coef 95% CI Z-Value P-Value VIF Constant -5.20 1.12 ( -7.40, -3.01) -4.65 0.000 agemean 0.1087 0.0237 (0.0621, 0.1552) 4.58 0.000 1.00 Odds Ratios for Continuous Predictors Odds Ratio 95% CI agemean 1.1148 (1.0641, 1.1679) Regression Equation P(Event) = exp(Y')/(1 + exp(Y')) Y' = -5.20 + 0.1087 agemean Goodness-of-Fit Tests Test DF Chi-Square P-Value Deviance 6 0.40 0.999 Pearson 6 0.40 0.999 Hosmer-Lemeshow 5 0.38 0.996 Observed and Expected Frequencies for Hosmer-Lemeshow Test Event Probability Event Non-event Group Range Observed Expected Observed Expected 1 (0.000, 0.080) 1 0.8 9 9.2 2 (0.080, 0.151) 2 2.3 13 12.7 3 (0.151, 0.233) 3 2.8 9 9.2 4 (0.233, 0.353) 5 5.3 10 9.7 5 (0.353, 0.482) 6 6.3 7 6.7 6 (0.482, 0.726) 18 17.2 7 7.8 7 (0.726, 0.838) 8 8.4 2 1.6 Answers to question 4: ---------------------- The deviance goodness-of-fit test is the LR-test comparing a full model with agegrp as a factor with the reduced model with agegrp as a continuous predictor; the difference in deviance can also be found under Error in the Deviance table. The value of 0.396 is absolutely non-significant in a chi-square distribution with 6 df. There is no indication of lack of fit. Additionally, one could compute residuals and inspect these, but with such a non-significant goodness-of-fit test it is almost certain that the residuals would show no major model deviations. Answers to question 5: ---------------------- The interpretation of beta_1 in terms of odds-ratio has already been mentioned. The p's for the grouped data may be plotted together with the fitted curve (from the grouped analysis). The fitted curve from the ungrouped analysis may be plotted alone (without data points) as follows: MTB > Name c9 "ages" MTB > Set 'ages' DATA> 1( 29 : 60 / 1 )1 DATA> End. MTB > Name C10 'fitp' MTB > Let 'fitp' = exp(-5.30945+.110921*c10)/(1+exp(-5.30945+.110921*c10)) MTB > Plot 'fitp'*'ages'; SUBC> Symbol. Scatterplot of fitp vs ages