Solution file for additional exercise 7.9
-----------------------------------------

Data on length and height of English cathedrals (or parts thereof), of
either Roman or Gothic architectural style. We will explore models for
the logarithmic length as a function of the logarithmic height,

  y_ij = (natural log of) length in feet
  x_ij = (natural log of) height in feet

of the j'th cathedral (part) in architectural group i, i=R,G; j=1,...,n_i.

A plot of length against height shows the cathedrals in Bath and Ripon to be
somewhat outside the pattern of the others, and we omit them from the analysis.

Question 1)
-----------

Analysis of covariance model (parallel regression lines), using x as a covariate:

  y_ij = mu_i + beta * x_ij + eps_ij

where the eps_ij's are assumed i.i.d. and N(0,sigma^2).

We first fit the model including the interaction between style and lnh
(separate slopes), in order to check the assumption of parallel lines.

MTB > WOpen "F:\VHM\VHM802\Data_csv\hs07_9.csv";
SUBC>   FType;
SUBC>   CSV;
SUBC>   DecSep;
SUBC>   Period;
SUBC>   Field;
SUBC>   Comma;
SUBC>   TDelimiter;
SUBC>   DoubleQuote.
Retrieving worksheet from file: 'F:\VHM\VHM802\Data_csv\hs07_9.csv'
Worksheet was saved on 20/02/2014

MTB > Name C5 'lnh'
MTB > Let 'lnh' = ln('h')
MTB > Name C6 'lnl'
MTB > Let 'lnl' = ln('l')
MTB > Plot 'lnl'*'lnh';
SUBC>   Symbol 'style'.

Scatterplot of lnl vs lnh

MTB > Copy 'lnl' c7;
SUBC>   Varnames.
MTB > let c7(11)='*'
MTB > let c7(19)='*'
MTB > Name c8 "SRES1" c9 "TRES1" c10 "HI1" c11 "COOK1"
MTB > GReg 'lnl_1' = lnh| style;
SUBC>   Categorical 'style';
SUBC>   Constant;
SUBC>   Confidence 95.0;
SUBC>   Coding -1;
SUBC>   GFourpack;
SUBC>   RType 2 ;
SUBC>   TEquation;
SUBC>   TCoef;
SUBC>   TSummary;
SUBC>   TANOVA;
SUBC>   TDiag 0;
SUBC>   SResiduals 'SRES1';
SUBC>   TResiduals 'TRES1';
SUBC>   Hi 'HI1';
SUBC>   CookD 'COOK1'.
General Regression Analysis: lnl_1 versus lnh, style

Regression Equation

style
G      lnl_1  =  1.44891 + 1.06366 lnh
R      lnl_1  =  3.34201 + 0.652705 lnh

23 cases used, 2 cases contain missing values

Coefficients

Term             Coef  SE Coef         T      P
Constant      2.39546  1.36658   1.75289  0.096
lnh           0.85818  0.31737   2.70407  0.014
style G      -0.94655  1.36658  -0.69265  0.497
lnh*style G   0.20548  0.31737   0.64745  0.525

Summary of Model

S = 0.147621     R-Sq = 74.47%        R-Sq(adj) = 70.44%
PRESS = 0.779432 R-Sq(pred) = 51.94%

Analysis of Variance

Source         DF   Seq SS   Adj SS    Adj MS        F         P
Regression      3  1.20785  1.20785  0.402617  18.4755  0.000007
  lnh           1  1.11505  0.15934  0.159343   7.3120  0.014065
  style         1  0.08367  0.01045  0.010455   0.4798  0.496911
  lnh*style     1  0.00913  0.00913  0.009135   0.4192  0.525089
Error          19  0.41405  0.41405  0.021792
  Lack-of-Fit  17  0.39325  0.39325  0.023132   2.2242  0.354748
  Pure Error    2  0.02080  0.02080  0.010400
Total          22  1.62190

Fits and Diagnostics for Unusual Observations

Obs    lnl_1      Fit    SE Fit   Residual  St Resid
  4  5.84064  6.05654  0.103583  -0.215898  -2.05268  R
 22  5.20401  5.49791  0.081684  -0.293902  -2.39018  R

R denotes an observation with a large standardized residual.

Residual Plots for lnl_1

MTB > NormTest 'SRES1'.

Probability Plot of SRES1

The P-value of the Anderson-Darling test is 0.474.

Comments:
---------

In this model, there is no significant interaction between style and lnh
(P=0.53), that is, the lines for Roman and Gothic cathedrals are roughly
parallel. As the sequential sum of squares (SS) for style is substantially
larger than the partial SS, we should refit the model without the interaction
term to get better estimates for the parallel lines model.

The model has one large residual (deletion residual = -2.78) for observation 22.
It is the cathedral with the smallest length and height. The Bonferroni-corrected
P-value is 0.28, so it is far from significant. It has, however, a large value
of Cook's statistic (0.63), so it may be expected to be rather influential.
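The Bonferroni correction quoted for observation 22 can be checked by hand. The sketch below is not part of the Minitab session; it assumes that the deletion residual is referred to a t-distribution with n - p - 1 = 23 - 4 - 1 = 18 degrees of freedom and that the correction multiplies the two-sided P-value by the number of cases used (23). The helper names are made up for this illustration, and the tail probability is obtained by simple numerical integration, so the result is approximate.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t_obs, df, steps=2000):
    """P(|T| > |t_obs|), using Simpson's rule on [0, |t_obs|] (steps must be even)."""
    b = abs(t_obs)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central = total * h / 3          # P(0 < T < |t_obs|)
    return 1 - 2 * central

n_obs = 23                           # cases used in the fit
n_params = 4                         # constant, lnh, style, lnh*style
df_deleted = n_obs - n_params - 1    # assumed df for a deletion residual

p_raw = two_sided_p(-2.78, df_deleted)
p_bonferroni = min(1.0, n_obs * p_raw)
print(round(p_raw, 4), round(p_bonferroni, 2))
```

Under these assumptions the corrected value should land close to the 0.28 quoted in the comments.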
The normal probability plot does not look very straight, but the A-D normality
test does not indicate a significant deviation from normality.

MTB > GReg 'lnl_1' = lnh style;
SUBC>   Categorical 'style';
SUBC>   Constant;
SUBC>   Confidence 95.0;
SUBC>   Coding 1;
SUBC>   GFourpack;
SUBC>   RType 2 ;
SUBC>   TEquation;
SUBC>   TCoef;
SUBC>   TSummary;
SUBC>   TANOVA;
SUBC>   TDiag 0.

General Regression Analysis: lnl_1 versus lnh, style

Regression Equation

style
G      lnl_1  =  1.55201 + 1.03953 lnh
R      lnl_1  =  1.67602 + 1.03953 lnh

23 cases used, 2 cases contain missing values

Coefficients

Term         Coef   SE Coef        T      P
Constant  1.55201  0.629376  2.46595  0.023
lnh       1.03953  0.147057  7.06884  0.000
style R   0.12401  0.062364  1.98852  0.061

Summary of Model

S = 0.145462     R-Sq = 73.91%        R-Sq(adj) = 71.30%
PRESS = 0.612406 R-Sq(pred) = 62.24%

Analysis of Variance

Source         DF   Seq SS   Adj SS   Adj MS        F         P
Regression      2  1.19872  1.19872  0.59936  28.3263  0.000001
  lnh           1  1.11505  1.05729  1.05729  49.9685  0.000001
  style         1  0.08367  0.08367  0.08367   3.9542  0.060612
Error          20  0.42318  0.42318  0.02116
  Lack-of-Fit  18  0.40238  0.40238  0.02235   2.1494  0.364673
  Pure Error    2  0.02080  0.02080  0.01040
Total          22  1.62190

Fits and Diagnostics for Unusual Observations

Obs    lnl_1      Fit     SE Fit   Residual  St Resid
 22  5.20401  5.50913  0.0786558  -0.305124  -2.49363  R

R denotes an observation with a large standardized residual.

Comments:
---------

In the additive model, there is a close to significant difference in lnl
between styles (P=0.061), with Roman cathedrals being 0.124 units longer on
the log scale at any given height.
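Because the response is the natural log of length, the style effect of 0.124 is easier to read after back-transformation, where it becomes a ratio of lengths at equal height. The following sketch (not part of the Minitab session) takes the estimate and standard error from the coefficient table of the additive model; the t quantile 2.086 for 20 error degrees of freedom is a standard table value entered by hand, and the variable names are made up for this illustration.

```python
import math

# Estimate and SE for "style R" in the additive (parallel lines) model.
style_effect = 0.12401
style_se = 0.062364
t_crit = 2.086        # assumed: 97.5% point of t with 20 df, from tables

# Back-transform from the log scale to a ratio of lengths (Roman / Gothic).
ratio = math.exp(style_effect)
lower = math.exp(style_effect - t_crit * style_se)
upper = math.exp(style_effect + t_crit * style_se)

print(f"ratio = {ratio:.3f}, approx. 95% CI = ({lower:.3f}, {upper:.3f})")
```

The ratio is about 1.13, i.e. Roman cathedrals are estimated to be roughly 13% longer than Gothic ones of the same height. The interval just includes 1, in line with the P-value of 0.061.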
Question 2)
-----------

For this question we add a quadratic term in ln(h) to the model:

  y_ij = mu_i + beta_1 * x_ij + beta_2 * (x_ij)^2 + eps_ij

MTB > Name c13 "SRES2" c14 "TRES2" c15 "HI2" c16 "COOK2"
MTB > GReg 'lnl_1' = style lnh lnh2;
SUBC>   Categorical 'style';
SUBC>   Constant;
SUBC>   Confidence 95.0;
SUBC>   Coding 1;
SUBC>   GFourpack;
SUBC>   RType 2 ;
SUBC>   TEquation;
SUBC>   TCoef;
SUBC>   TSummary;
SUBC>   TANOVA;
SUBC>   TDiag 0;
SUBC>   SResiduals 'SRES2';
SUBC>   TResiduals 'TRES2';
SUBC>   Hi 'HI2';
SUBC>   CookD 'COOK2'.

General Regression Analysis: lnl_1 versus lnh, lnh2, style

Regression Equation

style
G      lnl_1  =  -26.3264 + 14.1707 lnh - 1.54063 lnh2
R      lnl_1  =  -26.2909 + 14.1707 lnh - 1.54063 lnh2

23 cases used, 2 cases contain missing values

Coefficients

Term          Coef  SE Coef         T      P
Constant  -26.3264  9.71374  -2.71022  0.014
style R     0.0355  0.06165   0.57592  0.571
lnh        14.1707  4.57001   3.10080  0.006
lnh2       -1.5406  0.53598  -2.87442  0.010

Summary of Model

S = 0.124590     R-Sq = 81.82%        R-Sq(adj) = 78.94%
PRESS = 0.463819 R-Sq(pred) = 71.40%

Analysis of Variance

Source         DF   Seq SS   Adj SS    Adj MS        F         P
Regression      3  1.32697  1.32697  0.442323  28.4954  0.000000
  style         1  0.14143  0.00515  0.005149   0.3317  0.571425
  lnh           1  1.05729  0.14925  0.149249   9.6150  0.005885
  lnh2          1  0.12825  0.12825  0.128253   8.2623  0.009709
Error          19  0.29493  0.29493  0.015523
  Lack-of-Fit  17  0.27413  0.27413  0.016125   1.5505  0.462953
  Pure Error    2  0.02080  0.02080  0.010400
Total          22  1.62190

Fits and Diagnostics for Unusual Observations

Obs    lnl_1      Fit    SE Fit   Residual  St Resid
  5  6.00881  6.24443  0.044719  -0.235613  -2.02612  R
 22  5.20401  5.29176  0.101279  -0.087754  -1.20937     X

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

Residual Plots for lnl_1

Comments:
---------

This model has one added parameter compared to the parallel regression lines
model, and it is significantly better (the partial F-statistic for lnh2 is
8.26, corresponding to t = -2.87 and P = 0.010).
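The comparison between the parallel lines model and the quadratic model can be reproduced from the two ANOVA tables alone. The sketch below (not part of the Minitab session; variable names are made up for this illustration) computes the extra-sum-of-squares F-statistic for the single added term and checks that it agrees with the square of the t-statistic reported for lnh2.

```python
# Error sums of squares taken from the two ANOVA tables.
sse_parallel = 0.42318     # additive model: lnh + style, 20 error df
sse_quadratic = 0.29493    # quadratic model: style + lnh + lnh2, 19 error df
df_error_quadratic = 19
df_added = 1               # one extra parameter (lnh2)

mse_quadratic = sse_quadratic / df_error_quadratic
f_partial = (sse_parallel - sse_quadratic) / df_added / mse_quadratic

# For a single added term the partial F equals the squared t-statistic.
t_lnh2 = -2.87442
print(round(f_partial, 2), round(t_lnh2 ** 2, 2))
```

Both numbers agree with the F-value listed for lnh2 in the ANOVA table, up to rounding of the tabulated sums of squares.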
In this model, however, there is virtually no difference between the two
architectural styles (P=0.57). Observation 22 is fitted much better by the
model, but still has a high leverage and a high Cook's statistic. As this
cathedral is the smallest one, this is not surprising after adding a quadratic
term in height.

MTB > Plot 'SRES2'*'lnh';
SUBC>   Symbol 'style'.

Scatterplot of SRES2 vs lnh

Question 3)
-----------

Plotting standardized residuals against ln(h) with different symbols for the
architectural styles shows that the heights are in a very narrow range for the
Roman cathedrals, and that the residuals have an inverted U-shape. This leads us
to suspect that very different models would apply for the two architectural
styles if fitted separately. We can achieve this either by separating the
observations into two columns, or by fitting a model with all interactions
with style:

MTB > Name c17 "SRES3" c18 "TRES3" c19 "HI3" c20 "COOK3" c21 "FITS3"
MTB > GReg 'lnl_1' = style lnh lnh2 style*lnh style*lnh2;
SUBC>   Categorical 'style';
SUBC>   Constant;
SUBC>   Confidence 95.0;
SUBC>   Coding 1;
SUBC>   GFourpack;
SUBC>   RType 2 ;
SUBC>   TEquation;
SUBC>   TCoef;
SUBC>   TSummary;
SUBC>   TANOVA;
SUBC>   TDiag 0;
SUBC>   SResiduals 'SRES3';
SUBC>   TResiduals 'TRES3';
SUBC>   Hi 'HI3';
SUBC>   CookD 'COOK3';
SUBC>   Fits 'FITS3'.
General Regression Analysis: lnl_1 versus lnh, lnh2, style

Regression Equation

style
G      lnl_1  =  -23.7754 + 12.9499 lnh - 1.39517 lnh2
R      lnl_1  =  -383.551 + 181.051 lnh - 21.0213 lnh2

23 cases used, 2 cases contain missing values

Coefficients

Term              Coef  SE Coef         T      P
Constant       -23.775   7.5377  -3.15422  0.006
style R       -359.775  95.6984  -3.75947  0.002
lnh             12.950   3.5476   3.65037  0.002
lnh2            -1.395   0.4162  -3.35188  0.004
style*lnh R    168.101  44.6190   3.76749  0.002
style*lnh2 R   -19.626   5.1993  -3.77474  0.002

Summary of Model

S = 0.0962574    R-Sq = 90.29%        R-Sq(adj) = 87.43%
PRESS = 0.321084 R-Sq(pred) = 80.20%

Analysis of Variance

Source         DF   Seq SS   Adj SS    Adj MS        F         P
Regression      5  1.46438  1.46438  0.292877  31.6094  0.000000
  style         1  0.14143  0.13095  0.130955  14.1336  0.001562
  lnh           1  1.05729  0.12346  0.123464  13.3252  0.001980
  lnh2          1  0.12825  0.10410  0.104098  11.2351  0.003782
  style*lnh     1  0.00539  0.13151  0.131514  14.1940  0.001535
  style*lnh2    1  0.13202  0.13202  0.132021  14.2487  0.001512
Error          17  0.15751  0.15751  0.009265
  Lack-of-Fit  15  0.13671  0.13671  0.009114   0.8763  0.654308
  Pure Error    2  0.02080  0.02080  0.010400
Total          22  1.62190

Fits and Diagnostics for Unusual Observations

Obs    lnl_1      Fit     SE Fit   Residual  St Resid
  4  5.84064  5.83004  0.0876361  0.0105972  0.266146  X

X denotes an observation whose X value gives it large leverage.

Residual Plots for lnl_1

MTB > Plot 'SRES3'*'lnh';
SUBC>   Symbol 'style'.

Scatterplot of SRES3 vs lnh

MTB > Plot 'FITS3'*'lnh';
SUBC>   Symbol 'style';
SUBC>   Connect 'style'.

Scatterplot of FITS3 vs lnh

Comments:
---------

The model again shows a much improved fit to the data (compare the MSE's,
0.0093 here against 0.0155 for the previous model, or refer to the strongly
significant F-statistics for all terms in the model). The leverage and Cook's D
for obs. 22 are even higher now, and obs. 4 also has high leverage, but the
model fit seems very good (as assessed from the residual and normal plots).
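The improvement over the common quadratic model can also be summarised in a single test of the two interaction terms together. The sketch below (not part of the Minitab session) computes the extra-sum-of-squares F-statistic from the error lines of the two ANOVA tables. The function name is made up for this illustration, and the 5% critical value of about 3.59 for F(2, 17) is an assumed table value entered by hand, not something computed here.

```python
def extra_ss_f(sse_reduced, sse_full, df_added, df_error_full):
    """Extra-sum-of-squares F-statistic for terms added to a nested model."""
    mse_full = sse_full / df_error_full
    return (sse_reduced - sse_full) / df_added / mse_full

# Error SS from the ANOVA tables of the two quadratic models.
sse_common_quadratic = 0.29493    # style + lnh + lnh2, 19 error df
sse_separate_quadratic = 0.15751  # all interactions with style, 17 error df

f_interactions = extra_ss_f(sse_common_quadratic, sse_separate_quadratic,
                            df_added=2, df_error_full=17)
f_crit_5pct = 3.59                # assumed: F(2, 17) 5% point from tables

print(round(f_interactions, 2), f_interactions > f_crit_5pct)
```

The numerator uses the same quantity as the sum of the sequential SS for style*lnh and style*lnh2 (0.00539 + 0.13202), so this is simply the joint version of the tests listed in the ANOVA table.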
Estimated quadratic regression equations for the 2 architectural styles:

  G:  -23.77 +  12.95*ln(h) -  1.395*ln(h)^2
  R: -383.55 + 181.05*ln(h) - 21.021*ln(h)^2

The estimates and the plot show that the equations are very different. The
first two models assumed the same form of the relation between height and
length in the two groups (differing only in level), which does not seem to be
supported by the data at all.
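One way to see how different the two fitted curves are is to compare where each quadratic peaks and how strongly it bends. The sketch below (not part of the Minitab session) uses the coefficients from the regression equations of the full interaction model; the function and variable names are made up for this illustration. Since the Roman heights cover only a very narrow range, the Roman curve should not be read outside that range.

```python
import math

# (constant, lnh, lnh2) coefficients from the fitted regression equations.
coef = {
    "G": (-23.7754, 12.9499, -1.39517),
    "R": (-383.551, 181.051, -21.0213),
}

def peak_lnh(b1, b2):
    """ln(h) at which b0 + b1*x + b2*x^2 is maximal (requires b2 < 0)."""
    return -b1 / (2 * b2)

peaks = {style: peak_lnh(b1, b2) for style, (_, b1, b2) in coef.items()}
curvature_ratio = coef["R"][2] / coef["G"][2]

for style, x_peak in peaks.items():
    print(f"{style}: peak at ln(h) = {x_peak:.3f}, h = {math.exp(x_peak):.0f} ft")
print(f"curvature ratio R/G = {curvature_ratio:.1f}")
```

The Gothic curve peaks at ln(h) of about 4.64 and the Roman curve at about 4.31, and the Roman curve is roughly 15 times more sharply curved, which fits with the inverted U-shape seen over the narrow range of Roman heights.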