Solution file for additional exercise 4.3
-----------------------------------------

Data: presence or absence of disease (liver cirrhosis) of patients in two
cities and classified as alcoholics/non-alcoholics.

Notation:
  y_ijk = presence (1) or absence (0) of disease, for k'th person
  in city i, i=1,0 (New York, Philadelphia), and
  in alcoholic group j, j=1,0 (alcoholic, non-alcoholic), k=1,...,n_ij.
Alternative notation: y_i, with i referring to patient number 1,...,4600.

Full model for grouped data:
  Pr(y_ijk=1)=p_ij, with no restrictions on the p_ij's
  all y_ijk's are independent.

Logistic regression model:
  Pr(y_ijk=1)=p_ij
  logit(p_ij) = mu + alpha_i + beta_j
  ( or logit(p_i) = mu + alpha_city(i) + beta_alcohol(i) )
  all y_ijk's are independent.
Effectively, this model assumes no interaction on logistic scale between
effects of city and alcoholic groups.

MTB > WOpen "H:\VHM\VHM802\Data_csv\hs04_3.csv";
SUBC> FType;
SUBC> CSV;
SUBC> DecSep;
SUBC> Period;
SUBC> Field;
SUBC> Comma;
SUBC> TDelimiter;
SUBC> DoubleQuote.
Retrieving worksheet from file: ‘H:\VHM\VHM802\Data_csv\hs04_3.csv’
Worksheet was saved on 27/01/2011

MTB > Gzlm;
SUBC> Nodefault;
SUBC> REvent 1;
SUBC> Response 'disease';
SUBC> Frequency 'n';
SUBC> Categorical 'city' 'alcoholic';
SUBC> Terms city alcoholic;
SUBC> Constant;
SUBC> Binomial;
SUBC> Logit;
SUBC> TOdds;
SUBC> Tmethod;
SUBC> Trinfo;
SUBC> Tstep;
SUBC> Tdeviance;
SUBC> Tsummary;
SUBC> Tcoefficients;
SUBC> Tequation;
SUBC> Tgoodness;
SUBC> Thosmer;
SUBC> Tassociation.

Binary Logistic Regression: disease versus city, alcoholic

Method

Link function                 Logit
Frequency                     n
Categorical predictor coding  (1, 0)
Rows used                     8

Response Information

Variable  Value  Count
disease   1        210  (Event)
          0       4390
          Total   4600

Deviance at Each Iterative Step

Step     Deviance
   1  1781.965789
   2  1544.116803
   3  1514.829898
   4  1513.908991
   5  1513.907631
   6  1513.907631

Deviance Table

Source         DF  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression      2   192.77    96.386      192.77    0.000
  city          1    14.53    14.526       14.53    0.000
  alcoholic     1   155.52   155.520      155.52    0.000
Error        4597  1513.91     0.329
Total        4599  1706.68

Model Summary

Deviance   Deviance
    R-Sq  R-Sq(adj)      AIC
  11.30%     11.18%  1519.91

Coefficients

Term              Coef  SE Coef   VIF
Constant        -2.886    0.164
city
  Philadelphia  -0.681    0.172  1.04
alcoholic
  1              2.203    0.160  1.04

Odds Ratios for Categorical Predictors

Level A         Level B   Odds Ratio  95% CI
city
  Philadelphia  New York      0.5059  (0.3614, 0.7082)
alcoholic
  1             0             9.0565  (6.6127, 12.4035)

Odds ratio for level A relative to level B

Regression Equation

P(1)  =  exp(Y')/(1 + exp(Y'))

Y'  =  -2.886 + 0.0 city_New York - 0.681 city_Philadelphia
       + 0.0 alcoholic_0 + 2.203 alcoholic_1

Goodness-of-Fit Tests

Test               DF  Chi-Square  P-Value
Deviance         4597     1513.91    1.000
Pearson          4597     4621.91    0.395
Hosmer-Lemeshow     1        0.10    0.750

Observed and Expected Frequencies for Hosmer-Lemeshow Test

       Event Probability      disease = 1         disease = 0
Group  Range               Observed  Expected  Observed  Expected
    1  (0.000, 0.027)           105     103.6      3667    3668.4
    2  (0.027, 0.053)            25      26.4       475     473.6
    3  (0.053, 0.336)            80      80.0       248     248.0

Measures of Association

Pairs       Number  Percent  Summary Measures       Value
Concordant  429440     46.6  Somers’ D               0.37
Discordant   85040      9.2  Goodman-Kruskal Gamma   0.67
Ties        407420     44.2  Kendall’s Tau-a         0.03
Total       921900    100.0

Association is between the response variable and predicted probabilities

Comments:
---------
The deviance and Pearson goodness-of-fit statistics are non-significant,
indicating no evidence of interaction between effects of city and
alcohol-group.
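As a check on the goodness-of-fit output, the Hosmer-Lemeshow statistic can be recomputed by hand from the observed and expected frequencies printed by Minitab. The following is a minimal Python sketch, not part of the original Minitab session. It assumes the usual form of the statistic (sum of (observed - expected)^2 / expected over both outcome columns, with groups - 2 degrees of freedom), and the names HL_TABLE, hl_statistic and chi2_sf_1df are made up for illustration. Because the expected counts are rounded to one decimal in the output, the result should agree with Minitab only approximately.

```python
import math

# (observed, expected) counts copied from the Hosmer-Lemeshow table in the
# Minitab output: first pair is disease = 1, second pair is disease = 0.
HL_TABLE = [
    ((105, 103.6), (3667, 3668.4)),  # group 1
    ((25, 26.4), (475, 473.6)),      # group 2
    ((80, 80.0), (248, 248.0)),      # group 3
]


def hl_statistic(table):
    """Sum (observed - expected)^2 / expected over all groups and outcomes."""
    return sum(
        (obs - exp) ** 2 / exp
        for group in table
        for obs, exp in group
    )


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


hl_stat = hl_statistic(HL_TABLE)
hl_df = len(HL_TABLE) - 2          # assumed convention: groups minus 2
hl_p = chi2_sf_1df(hl_stat)        # valid here because hl_df equals 1

print(f"Hosmer-Lemeshow chi-square = {hl_stat:.2f} on {hl_df} DF, p = {hl_p:.2f}")
```

The recomputed values should be close to the 0.10 and 0.750 reported in the Goodness-of-Fit Tests table; small differences come from the rounding of the expected counts.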
The Hosmer-Lemeshow test is less useful here.

The z-tests for city and alcohol (Coef divided by SE Coef, about -3.96 and
13.8) are both clearly significant, so we do not bother calculating the
likelihood ratio statistics (by fitting the relevant submodels and computing
the differences in deviance).

The odds-ratio for city (Philadelphia vs. New York) is 0.51, which means that
the risk (odds) of being diseased is about twice as high (1/0.51, or about
2.0) in the New York patient group as in the Philadelphia group. It is not
clear what that really means, because it may very well relate to the
selection of the patients. Maybe cities should be considered as blocks, and
not be given any particular interpretation.

The odds-ratio for alcohol group (alcoholic vs. non-alcoholic) is 9.1, which
means that the risk (odds) of disease is about nine times as high in the
alcoholic group. Maybe not surprising, with today's understanding, but these
data are from before 1942.
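
The figures quoted in these comments can be reproduced from the Coefficients table alone. The sketch below is in Python rather than Minitab and is only an illustration. It assumes that the reported intervals are Wald intervals on the log-odds scale, exp(Coef +/- 1.96 * SE Coef), and the names COEFS and wald_summary are made up for this example. Since the coefficients are rounded to three decimals in the output, the results should match the Minitab odds ratios and confidence limits only to about two or three significant digits.

```python
import math

# Coefficients and standard errors copied from the Minitab "Coefficients"
# table (rounded there to three decimals).
COEFS = {
    "city Philadelphia": (-0.681, 0.172),
    "alcoholic 1": (2.203, 0.160),
}

Z_975 = 1.959964  # 97.5% standard normal quantile, for a 95% Wald interval


def wald_summary(coef, se):
    """Return Wald z, two-sided p-value, odds ratio and 95% CI for one term."""
    z = coef / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    odds_ratio = math.exp(coef)
    ci = (math.exp(coef - Z_975 * se), math.exp(coef + Z_975 * se))
    return {"z": z, "p": p_value, "or": odds_ratio, "ci": ci}


results = {term: wald_summary(c, s) for term, (c, s) in COEFS.items()}

for term, r in results.items():
    print(
        f"{term}: z = {r['z']:.2f}, p = {r['p']:.2g}, "
        f"OR = {r['or']:.3f}, 95% CI = ({r['ci'][0]:.3f}, {r['ci'][1]:.3f})"
    )

# Odds ratio in the other direction (New York vs. Philadelphia).
ny_vs_ph = 1.0 / results["city Philadelphia"]["or"]
print(f"New York vs. Philadelphia: OR = {ny_vs_ph:.2f}")
```

Inverting the city odds ratio in the last step is what justifies reading 0.51 as "about twice as high" for New York.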