Solution file for additional exercise 4.1 ----------------------------------------- Data: presence or absence of CHD condition in 100 patients with recorded age. Notation: y_i = presence (1) or absence (0) of CHD, x_i = age, for i'th patient, i=1,...,100. Logistic regression model: Pr(y_i=1)=p_i logit(p_i) = beta_0 + beta_1*x_i y_1,...,y_100 are independent. MTB > WOpen "h:\VHM\VHM802\Data_csv\hs04_1.csv"; SUBC> FType; SUBC> CSV; SUBC> DecSep; SUBC> Period; SUBC> Field; SUBC> Comma; SUBC> TDelimiter; SUBC> DoubleQuote. Retrieving worksheet from file: 'h:\VHM\VHM802\Data_csv\hs04_1.csv' Worksheet was saved on 27/01/2011 MTB > Blogistic 'chd' = age; SUBC> Logit; SUBC> Brief 2; SUBC> Step. Binary Logistic Regression: chd versus age Step Log-Likelihood 0 -68.3315 1 -54.1706 2 -53.6816 3 -53.6765 4 -53.6765 5 -53.6765 Link Function: Logit Response Information Variable Value Count chd 1 43 (Event) 0 57 Total 100 Logistic Regression Table Odds 95% CI Predictor Coef SE Coef Z P Ratio Lower Upper Constant -5.30945 1.13365 -4.68 0.000 age 0.110921 0.0240598 4.61 0.000 1.12 1.07 1.17 Log-Likelihood = -53.677 Test that all slopes are zero: G = 29.310, DF = 1, P-Value = 0.000 Goodness-of-Fit Tests Method Chi-Square DF P Pearson 21.3113 41 0.995 Deviance 23.7543 41 0.986 Hosmer-Lemeshow 0.8900 8 0.999 Table of Observed and Expected Frequencies: (See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic) Group Value 1 2 3 4 5 6 7 8 9 10 Total 1 Obs 1 1 2 3 4 5 5 10 8 4 43 Exp 0.8 1.3 1.9 3.0 4.1 4.7 5.8 9.3 7.9 4.3 0 Obs 9 9 8 8 7 5 5 3 2 1 57 Exp 9.2 8.7 8.1 8.0 6.9 5.3 4.2 3.7 2.1 0.7 Total 10 10 10 11 11 10 10 13 10 5 100 Measures of Association: (Between the Response Variable and Predicted Probabilities) Pairs Number Percent Summary Measures Concordant 1936 79.0 Somers' D 0.60 Discordant 466 19.0 Goodman-Kruskal Gamma 0.61 Ties 49 2.0 Kendall's Tau-a 0.30 Total 2451 100.0 Answers to question 1: ---------------------- The estimated parameters are beta_0: -5.309 (1.134), 95% CI: -5.309 +- 1.96*1.134 = -5.31 +- 2.22 beta_1: 0.111 (0.024), 95% CI: 0.111 +- 1.96*0.024 = 0.111 +- 0.047 Interpretation: beta_1 = slope of the linear regression on logit-scale, an increase in age of 1 unit (year) correspond to an increase in logit(p) of beta_1 units, exp(beta_1) = 1.12 = odds-ratio for comparison of two patients one year apart, the CHD-risk (odds) of the older is 1.12 larger beta_0 = intercept of the linear regression on logit-scale and corresponds to value on logit-scale for age=0 - meaningless because age 0 is way beyond the range of the data Test of slope=0 (no impact of age on CHD-risk) Walds test z = 4.61, P < 0.0005 likelihood-ratio test G = 29.3, P < 0.0005 (because only one slope in model) The data show a clear impact of age on CHD-risk. Answers to question 2: ---------------------- Log-Likelihood - value that can sometimes be used to compute likelihood-ratio tests Goodness-of-fit tests: Pearson and Deviance tests are meaningless (ungrouped data). Hosmer-Lemeshow test is useful: it is the chi-square test for the shown table of observed and expected values. From the Minitab help menus: "The table of concordant, discordant, and tied pairs is calculated by pairing the observations with different response values. Here, you have 43 individuals with CHD and 57 with no disease, resulting in 43 * 57 = 2451 pairs with different response values. Based on the model, pair is concordant if the individual with a low pulse rate has a higher probability of having a low pulse, discordant if the opposite is true, and tied if the probabilities are equal. In this example, 79% of pairs are concordant and 19% are discordant. You can use these values as a comparative measure of prediction, for example in comparing fits with different sets of predictors or with different link functions." The summary measures for concordant, discordant and tied pairs just play around with these numbers, and are not terribly important. Question 3: ----------- For ungrouped data, the goodness-of-fit statistics based on the deviance and the Pearson chi-square are useless. The Hosmer-Lemeshow test is okay, but not always particularly powerful. One simple method is to add a quadratic term in the linear relation, as shown below. MTB > Name C3 'age2' MTB > Let 'age2' = 'age'**2 MTB > Blogistic 'chd' = age age2; SUBC> Logit; SUBC> Brief 2; SUBC> Step. Binary Logistic Regression: chd versus age, age2 Step Log-Likelihood 0 -68.3315 1 -54.0734 2 -53.6501 3 -53.6440 4 -53.6440 5 -53.6440 Link Function: Logit Response Information Variable Value Count chd 1 43 (Event) 0 57 Total 100 Logistic Regression Table Odds 95% CI Predictor Coef SE Coef Z P Ratio Lower Upper Constant -4.24076 4.29018 -0.99 0.323 age 0.0613177 0.194640 0.32 0.753 1.06 0.73 1.56 age2 0.0005482 0.0021437 0.26 0.798 1.00 1.00 1.00 Log-Likelihood = -53.644 Test that all slopes are zero: G = 29.375, DF = 2, P-Value = 0.000 Goodness-of-Fit Tests Method Chi-Square DF P Pearson 20.8487 40 0.995 Deviance 23.6893 40 0.981 Hosmer-Lemeshow 0.9891 8 0.998 Table of Observed and Expected Frequencies: (See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic) Group Value 1 2 3 4 5 6 7 8 9 10 Total 1 Obs 1 1 2 3 4 5 5 10 8 4 43 Exp 0.9 1.4 1.9 2.9 3.9 4.6 5.7 9.3 8.0 4.4 0 Obs 9 9 8 8 7 5 5 3 2 1 57 Exp 9.1 8.6 8.1 8.1 7.1 5.4 4.3 3.7 2.0 0.6 Total 10 10 10 11 11 10 10 13 10 5 100 Measures of Association: (Between the Response Variable and Predicted Probabilities) Pairs Number Percent Summary Measures Concordant 1936 79.0 Somers' D 0.60 Discordant 466 19.0 Goodman-Kruskal Gamma 0.61 Ties 49 2.0 Kendall's Tau-a 0.30 Total 2451 100.0 Answers to question 3: ---------------------- It is seen that the improvement in logL from adding the quadratic term is very small. To compare the models with and without the quadratic term we may compute the likelihood-ratio test G2 G2 = 2*(-53.644+53.677) = 0.066, df=1, Pr(chi^2(df)>0.07) >> 0.05 The LR-test shows no evidence whatsoever of lack of fit due to curvature in the relation. We would reach the same conclusion by looking at the Wald test for the quadratic term (z=0.26, P=0.80). Question 4: ----------- Another test of lack of fit may be obtained by grouping the quantitative variable, as described in the text. The Data-Code menu is used to create the age groups, and the Stat-Basic-Store menu is used to compute the age group means and to aggregate the data to the groups. MTB > Name c4 "agegrp" MTB > Code (20:29) 1 (30:34) 2 (35:39) 3 (40:44) 4 (45:49) 5 (50:54) 6 (55 & CONT> :59) 7 (60:69) 8 'age' 'agegrp' MTB > Name c5 "ByVar1" c6 "Mean1" MTB > Statistics 'age'; SUBC> By 'agegrp'; SUBC> GValues 'ByVar1'; SUBC> Mean 'Mean1'. MTB > name c6 'agemean' MTB > Name c7 "ByVar2" c8 "Sum2" c9 "Count2" MTB > Statistics 'chd'; SUBC> By 'agegrp'; SUBC> GValues 'ByVar2'; SUBC> Sums 'Sum2'; SUBC> Count 'Count2'. MTB > name c8 'chdpos' c9 'total' MTB > Blogistic 'chdpos' 'total' = agemean; SUBC> ST; SUBC> Logit; SUBC> Brief 2; SUBC> Step. Binary Logistic Regression: chdpos, total versus agemean Step Log-Likelihood 0 -68.3315 1 -54.5984 2 -54.1827 3 -54.1791 4 -54.1791 5 -54.1791 Link Function: Logit Response Information Variable Value Count chdpos Event 43 Non-event 57 total Total 100 Logistic Regression Table Odds 95% CI Predictor Coef SE Coef Z P Ratio Lower Upper Constant -5.20386 1.11799 -4.65 0.000 agemean 0.108652 0.0237330 4.58 0.000 1.11 1.06 1.17 Log-Likelihood = -54.179 Test that all slopes are zero: G = 28.305, DF = 1, P-Value = 0.000 Goodness-of-Fit Tests Method Chi-Square DF P Pearson 0.401068 6 0.999 Deviance 0.396712 6 0.999 Hosmer-Lemeshow 0.383176 5 0.996 Table of Observed and Expected Frequencies: (See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic) Group Value 1 2 3 4 5 6 7 Total Event Obs 1 2 3 5 6 18 8 43 Exp 0.8 2.3 2.8 5.3 6.3 17.2 8.4 Non-event Obs 9 13 9 10 7 7 2 57 Exp 9.2 12.7 9.2 9.7 6.7 7.8 1.6 Total 10 15 12 15 13 25 10 100 Measures of Association: (Between the Response Variable and Predicted Probabilities) Pairs Number Percent Summary Measures Concordant 1828 74.6 Somers' D 0.59 Discordant 386 15.7 Goodman-Kruskal Gamma 0.65 Ties 237 9.7 Kendall's Tau-a 0.29 Total 2451 100.0 Answers to question 4: ---------------------- The deviance goodness-of-fit test is the LR-test comparing a full model with agegrp as a factor with the reduced model with agegrp as a continuous predictor. The value of 0.396 is absolutely non-significant in a chi-square distribution with 6 df. There is no indication of lack of fit. Additionally, one could compute residuals and inspect these, but with such a non-significant goodness-of-fit test it is almost certain that the residuals would show no major model deviations. Answers to question 5: ---------------------- The interpretation of beta_1 in terms of odds-ratio has already been mentioned. The p's for the grouped data may be plotted together with the fitted curve (from the grouped analysis). The fitted curve from the ungrouped analysis may be plotted alone (without data points) as follows: MTB > Name c10 "ages" MTB > Set 'ages' DATA> 1( 29 : 60 / 1 )1 DATA> End. MTB > Name C11 'fitp' MTB > Let 'fitp' = exp(-5.30945+.110921*c10)/(1+exp(-5.30945+.110921*c10)) MTB > Plot 'fitp'*'ages'; SUBC> Symbol. Scatterplot of fitp vs ages