Solution file for additional exercise 2.3
-----------------------------------------
(Minitab version 15)

This file gives the commands and output for the different regression
analyses. It does not include the graphs, which in this case are an
essential part of the information.

Statistical models:

  I:   mort_i     = beta_0 + beta_1*age_i     + eps_i
  II:  ln(mort_i) = beta_0 + beta_1*age_i     + eps_i
  III: ln(mort_i) = beta_0 + beta_1*ln(age_i) + eps_i

where for all models eps_1,...,eps_13 are i.i.d. N(0,sigma^2).

MTB > WOpen "R:\data_csv\hs02_3.csv";
SUBC>   FType;
SUBC>     CSV;
SUBC>   DecSep;
SUBC>     Period;
SUBC>   Field;
SUBC>     Comma;
SUBC>   TDelimiter;
SUBC>     DoubleQuote.
Retrieving worksheet from file: ‘R:\data_csv\hs02_3.csv’
Worksheet was saved on 12/01/2011

MTB > Name C3 'lnage'
MTB > Let 'lnage' = ln('age')
MTB > Name C4 'lnmort'
MTB > Let 'lnmort' = ln('mort')

MTB > Fitline 'mort' 'age';
SUBC>   GFourpack;
SUBC>   RType 2;
SUBC>   Confidence 95.0.

Regression Analysis: mort versus age

The regression equation is
mort = - 254.5 + 7.201 age

S = 71.4596   R-Sq = 80.8%   R-Sq(adj) = 79.0%

Analysis of Variance

Source      DF      SS      MS      F      P
Regression   1  235958  235958  46.21  0.000
Error       11   56171    5106
Total       12  292129

Fitted Line: mort versus age

Residual Plots for mort

MTB > Fitline 'lnmort' 'age';
SUBC>   GFourpack;
SUBC>   RType 2;
SUBC>   Confidence 95.0.

Regression Analysis: lnmort versus age

The regression equation is
lnmort = - 2.408 + 0.1114 age

S = 0.384285   R-Sq = 97.2%   R-Sq(adj) = 96.9%

Analysis of Variance

Source      DF       SS       MS       F      P
Regression   1  56.4323  56.4323  382.14  0.000
Error       11   1.6244   0.1477
Total       12  58.0567

Fitted Line: lnmort versus age

Residual Plots for lnmort

MTB > Fitline 'lnmort' 'lnage';
SUBC>   GFourpack;
SUBC>   RType 2;
SUBC>   Confidence 95.0.

Regression Analysis: lnmort versus lnage

The regression equation is
lnmort = - 17.35 + 5.346 lnage

S = 0.100240   R-Sq = 99.8%   R-Sq(adj) = 99.8%

Analysis of Variance

Source      DF       SS       MS        F      P
Regression   1  57.9462  57.9462  5766.93  0.000
Error       11   0.1105   0.0100
Total       12  58.0567

Fitted Line: lnmort versus lnage

Residual Plots for lnmort

Question 1:
-----------
The three analyses and fitted line plots show clearly that the regression
of ln(mort) on ln(age) is the best, and the only one that is reasonably
acceptable. The two other models show an appreciable lack of fit.
Even the log-log model shows a clear pattern of the points around the
line, which indicates that a further refinement of the model would be
necessary. For this exercise, however, we accept the present fit as
satisfactory: with R^2 = 99.8% the model already explains almost all of
the variation in the data.

The estimated regression equation for the log-log model is

  ln(mort) = -17.3 + 5.35*ln(age)

or

  mort = exp(-17.3468) * age^5.34556

The cancer mortality thus increases as a power function of age, with an
estimated power of about 5.35. (Note that many decimals are needed to
give precise predictions.)

Next, we rerun the models with prediction intervals:

MTB > Regress 'mort' 1 'age';
SUBC>   Constant;
SUBC>   Predict 'age';
SUBC>   Brief 2.

Regression Analysis: mort versus age

The regression equation is
mort = - 254 + 7.20 age

Predictor     Coef  SE Coef      T      P
Constant   -254.47    59.04  -4.31  0.001
age          7.201    1.059   6.80  0.000

S = 71.4596   R-Sq = 80.8%   R-Sq(adj) = 79.0%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  235958  235958  46.21  0.000
Residual Error  11   56171    5106
Total           12  292129

Unusual Observations

Obs   age   mort    Fit  SE Fit  Residual  St Resid
 13  82.5  462.0  339.6    37.5     122.4     2.01R

R denotes an observation with a large standardized residual.

Predicted Values for New Observations

New Obs    Fit  SE Fit       95% CI            95% PI
      1  -92.4    37.5  (-174.9, -10.0)  (-270.0,  85.1)
      2  -56.4    33.1  (-129.2,  16.4)  (-229.7, 116.9)
      3  -20.4    29.0  ( -84.3,  43.4)  (-190.2, 149.3)
      4   15.6    25.4  ( -40.3,  71.5)  (-151.3, 182.5)
      5   51.6    22.5  (   2.1, 101.1)  (-113.3, 216.5)
      6   87.6    20.5  (  42.4, 132.7)  ( -76.0, 251.2)
      7  123.6    19.8  (  80.0, 167.2)  ( -39.6, 286.8)
      8  159.6    20.5  ( 114.5, 204.8)  (  -4.0, 323.2)
      9  195.6    22.5  ( 146.2, 245.1)  (  30.7, 360.5)
     10  231.6    25.4  ( 175.7, 287.5)  (  64.7, 398.5)
     11  267.6    29.0  ( 203.8, 331.5)  (  97.9, 437.4)
     12  303.6    33.1  ( 230.8, 376.4)  ( 130.3, 477.0)
     13  339.6    37.5  ( 257.2, 422.1)  ( 162.1, 517.2)

Values of Predictors for New Observations

New Obs   age
      1  22.5
      2  27.5
      3  32.5
      4  37.5
      5  42.5
      6  47.5
      7  52.5
      8  57.5
      9  62.5
     10  67.5
     11  72.5
     12  77.5
     13  82.5

MTB > Regress 'lnmort' 1 'age';
SUBC>   Constant;
SUBC>   Predict 'age';
SUBC>   Brief 2.

Regression Analysis: lnmort versus age

The regression equation is
lnmort = - 2.41 + 0.111 age

Predictor      Coef   SE Coef      T      P
Constant    -2.4075    0.3175  -7.58  0.000
age        0.111367  0.005697  19.55  0.000

S = 0.384285   R-Sq = 97.2%   R-Sq(adj) = 96.9%

Analysis of Variance

Source          DF      SS      MS       F      P
Regression       1  56.432  56.432  382.14  0.000
Residual Error  11   1.624   0.148
Total           12  58.057

Predicted Values for New Observations

New Obs    Fit  SE Fit      95% CI           95% PI
      1  0.098   0.201  (-0.345, 0.542)  (-0.857, 1.053)
      2  0.655   0.178  ( 0.264, 1.047)  (-0.277, 1.587)
      3  1.212   0.156  ( 0.869, 1.555)  ( 0.299, 2.125)
      4  1.769   0.137  ( 1.468, 2.069)  ( 0.871, 2.666)
      5  2.326   0.121  ( 2.060, 2.592)  ( 1.439, 3.212)
      6  2.882   0.110  ( 2.640, 3.125)  ( 2.002, 3.762)
      7  3.439   0.107  ( 3.205, 3.674)  ( 2.562, 4.317)
      8  3.996   0.110  ( 3.753, 4.239)  ( 3.116, 4.876)
      9  4.553   0.121  ( 4.287, 4.819)  ( 3.666, 5.440)
     10  5.110   0.137  ( 4.809, 5.410)  ( 4.212, 6.007)
     11  5.667   0.156  ( 5.323, 6.010)  ( 4.754, 6.579)
     12  6.223   0.178  ( 5.832, 6.615)  ( 5.291, 7.155)
     13  6.780   0.201  ( 6.337, 7.224)  ( 5.825, 7.735)

Values of Predictors for New Observations

New Obs   age
      1  22.5
      2  27.5
      3  32.5
      4  37.5
      5  42.5
      6  47.5
      7  52.5
      8  57.5
      9  62.5
     10  67.5
     11  72.5
     12  77.5
     13  82.5

MTB > Regress 'lnmort' 1 'lnage';
SUBC>   Constant;
SUBC>   Predict 'lnage';
SUBC>   Brief 2.

Regression Analysis: lnmort versus lnage

The regression equation is
lnmort = - 17.3 + 5.35 lnage

Predictor      Coef  SE Coef       T      P
Constant   -17.3468   0.2751  -63.05  0.000
lnage       5.34556  0.07039   75.94  0.000

S = 0.100240   R-Sq = 99.8%   R-Sq(adj) = 99.8%

Analysis of Variance

Source          DF      SS      MS        F      P
Regression       1  57.946  57.946  5766.93  0.000
Residual Error  11   0.111   0.010
Total           12  58.057

Predicted Values for New Observations

New Obs      Fit  SE Fit        95% CI              95% PI
      1  -0.7033  0.0612  (-0.8381, -0.5685)  (-0.9618, -0.4448)
      2   0.3694  0.0491  ( 0.2614,  0.4774)  ( 0.1238,  0.6150)
      3   1.2624  0.0399  ( 1.1745,  1.3503)  ( 1.0249,  1.4999)
      4   2.0273  0.0334  ( 1.9537,  2.1010)  ( 1.7948,  2.2599)
      5   2.6964  0.0295  ( 2.6315,  2.7613)  ( 2.4665,  2.9264)
      6   3.2910  0.0279  ( 3.2296,  3.3523)  ( 3.0620,  3.5200)
      7   3.8260  0.0283  ( 3.7638,  3.8882)  ( 3.5968,  4.0552)
      8   4.3123  0.0301  ( 4.2461,  4.3785)  ( 4.0819,  4.5426)
      9   4.7580  0.0328  ( 4.6859,  4.8301)  ( 4.5259,  4.9901)
     10   5.1694  0.0359  ( 5.0903,  5.2485)  ( 4.9350,  5.4038)
     11   5.5514  0.0393  ( 5.4648,  5.6379)  ( 5.3144,  5.7884)
     12   5.9079  0.0428  ( 5.8137,  6.0020)  ( 5.6680,  6.1478)
     13   6.2421  0.0462  ( 6.1404,  6.3438)  ( 5.9992,  6.4850)

Values of Predictors for New Observations

New Obs  lnage
      1   3.11
      2   3.31
      3   3.48
      4   3.62
      5   3.75
      6   3.86
      7   3.96
      8   4.05
      9   4.14
     10   4.21
     11   4.28
     12   4.35
     13   4.41

Question 1 (cont):
------------------
The three tables above give prediction intervals for all three models.
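
Models II and III are fitted on the ln(mort) scale, so their fits and
interval limits must be exponentiated before they can be read as
mortalities. The following is a minimal sketch in Python (not part of the
Minitab session) that does this for three rows of the log-log prediction
table above. The function name back_transform is illustrative only.

```python
import math

# (age, fit, PI lower, PI upper) on the ln(mort) scale, copied from the
# "lnmort versus lnage" prediction table for new observations 2, 7 and 12.
LOGLOG_PI = [
    (27.5, 0.3694, 0.1238, 0.6150),
    (52.5, 3.8260, 3.5968, 4.0552),
    (77.5, 5.9079, 5.6680, 6.1478),
]


def back_transform(fit, low, upp):
    """Exponentiate a log-scale fit and its interval limits."""
    return math.exp(fit), math.exp(low), math.exp(upp)


for age, fit, low, upp in LOGLOG_PI:
    est, lo, hi = back_transform(fit, low, upp)
    print(f"age {age}: estimate {est:.3g}, 95% PI ({lo:.3g}, {hi:.3g})")
```

Because exp is monotone, the transformed limits still form a 95%
prediction interval for mort, but the interval is no longer symmetric
about the estimate.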
We compare the predictions of the two best models for a few ages, with
95% prediction intervals, all back-transformed to the original scale of
mort:

               log(mort) on log(age)       log(mort) on age
 age    obs    estim  low PI  upp PI    estim  low PI  upp PI
27.5   1.42     1.45    1.13    1.85     1.93    0.76    4.89
52.5   45.7     45.9    36.5    57.7     31.2    13.0    75.0
77.5    369      368     289     468      504     199    1280

The difference between observed and fitted values is very small for the
log-log model, and its prediction intervals are much narrower than those
of the other model.

Question 2:
-----------
MTB > Regress 'lnmort' 2 'age' 'lnage';
SUBC>   GFourpack;
SUBC>   RType 2;
SUBC>   Constant;
SUBC>   Brief 2.

Regression Analysis: lnmort versus age, lnage

The regression equation is
lnmort = - 16.2 + 0.00897 age + 4.93 lnage

Predictor      Coef   SE Coef       T      P
Constant    -16.191     1.115  -14.53  0.000
age        0.008971  0.008388    1.07  0.310
lnage        4.9273    0.3973   12.40  0.000

S = 0.0995910   R-Sq = 99.8%   R-Sq(adj) = 99.8%

Analysis of Variance

Source          DF      SS      MS        F      P
Regression       2  57.958  28.979  2921.73  0.000
Residual Error  10   0.099   0.010
Total           12  58.057

Source  DF  Seq SS
age      1  56.432
lnage    1   1.525

Unusual Observations

Obs   age  lnmort     Fit  SE Fit  Residual  St Resid
 13  82.5  6.1356  6.2919  0.0654   -0.1563    -2.08R

R denotes an observation with a large standardized residual.

Residual Plots for lnmort

Comments:
---------
The fitted model with both an age term and a log(age) term is shown
above. The age term has the wrong sign (lambda should be > 0), but it is
clearly non-significant (P = 0.310), so there is no evidence against the
regression coefficient being zero or slightly negative. The age
coefficient estimates -lambda, so the estimated value of lambda is
-0.009 (s.e. 0.008).

The estimated regression coefficient of log(age) is 4.9 (s.e. 0.4), so
the estimated value of n is 4.9 + 1 = 5.9, or 6 mutating genes. Without
the age term, the estimated n would be 5.3 + 1 = 6.3, or 6 as well.

The residual plot still shows some lack of fit, but without knowledge of
the underlying data (the data analysed here are aggregated) it is
difficult to improve the model.
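
The statements about lambda and n can be checked numerically from the
coefficient table of the two-term model. The following is a minimal
sketch in Python (not part of the Minitab session). It assumes the
relations used in the comments above (age coefficient = -lambda, lnage
coefficient = n - 1) and hard-codes 2.228, the 97.5% quantile of the
t distribution with 10 degrees of freedom. The helper name conf_int is
illustrative only.

```python
# Estimates and standard errors copied from the Question 2 coefficient table.
COEF_AGE, SE_AGE = 0.008971, 0.008388
COEF_LNAGE, SE_LNAGE = 4.9273, 0.3973
T_975_DF10 = 2.228  # 97.5% quantile of the t distribution with 10 d.f.


def conf_int(coef, se, tq=T_975_DF10):
    """Two-sided 95% confidence interval: coef +/- tq * se."""
    return coef - tq * se, coef + tq * se


# Assumed relations: lambda = -(age coefficient), n = (lnage coefficient) + 1.
age_lo, age_hi = conf_int(COEF_AGE, SE_AGE)
lam_hat = -COEF_AGE
lam_lo, lam_hi = -age_hi, -age_lo  # negating the coefficient swaps the limits

ln_lo, ln_hi = conf_int(COEF_LNAGE, SE_LNAGE)
n_hat, n_lo, n_hi = COEF_LNAGE + 1, ln_lo + 1, ln_hi + 1

t_age = COEF_AGE / SE_AGE
print(f"t for age term: {t_age:.2f}")
print(f"lambda: {lam_hat:.4f}, 95% CI ({lam_lo:.4f}, {lam_hi:.4f})")
print(f"n: {n_hat:.2f}, 95% CI ({n_lo:.2f}, {n_hi:.2f})")
```

Under these assumptions the interval for lambda covers zero and small
positive values, in line with the non-significant age term, and the
interval for n contains 6.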