Supplementary Exercises 2.11, 2.48, 10.38 and 10.39 of IPS7e
------------------------------------------------------------

Measurements on 38 patients of MA and HAV angles (two deformations of the foot, the former of which is less severe and is hypothesized to be useful as a predictor of the latter). There are two response variables, and the samples can be assumed i.i.d. from a population for which the 38 patients are considered representative.

2.11:
------

Minitab commands:

MTB > WOpen "R:\Chapter 2\ex02_011.mtw".
Retrieving worksheet from file: ‘R:\Chapter 2\ex02_011.mtw’
Worksheet was saved on 07/11/2014

MTB > Plot 'hav'*'ma';
SUBC> Symbol.

Scatterplot of hav vs ma

Answers to questions:

(a) We put the MA value on the x-axis because there is interest in predicting HAV from MA. This is, however, of primary interest in the context of regression; for the purpose of studying the strength and direction of association between the two variables, the axes might as well be reversed. Both variables are measured as responses, so neither of them would (in our usage) be labeled as explanatory.

(b) The association is clearly positive, and maybe approximately linear; the points are so scattered that it is hard to tell whether it is linear or not. There is one observation (no. 31; MA=12 and HAV=50) which seems to fall outside the pattern of the other observations; we may consider this observation a suspected outlier. Even without this observation the association is rather weak. To quantify our impression of the association, we compute the correlation.

MTB > Correlation 'hav' 'ma'.

Correlation: hav, ma

Pearson correlation of hav and ma = 0.302
P-Value = 0.065

(c) There may be a positive association between the two measurements, but it seems too weak to allow any useful prediction of HAV from an MA-value.

2.48:
------

(a) Minitab commands and output:

MTB > Fitline 'hav' 'ma';
SUBC> Confidence 95.0.
Regression Analysis: hav versus ma

The regression equation is
hav = 19.72 + 0.3388 ma

S = 7.22371   R-Sq = 9.1%   R-Sq(adj) = 6.6%

Analysis of Variance

Source      DF       SS       MS     F      P
Regression   1   188.71  188.714  3.62  0.065
Error       36  1878.55   52.182
Total       37  2067.26

Fitted Line: hav versus ma

(b) The listing above gives the prediction equation as HAV=19.72 + 0.3388*MA. Therefore, from an MA-value of 25 we predict the HAV to be 19.72+0.3388*25=28.19.

(c) The numerical measure of the "accuracy" of the prediction that the question has in mind is probably R^2=0.091. (Note that if we distinguish between accuracy and precision, this would be referring to the precision.) However, there are two better ways of quantifying the precision of the prediction. One is the standard deviation, s=7.224, which measures the spread of the points about the line. Roughly speaking, the points should be within a band of +- 2*s (here about +- 14.4) of the line; this band is very wide and includes almost the entire range of HAV-values. Even better, we can compute a 95% prediction interval for a new observation (using the methods of Chapter 10).

MTB > Regress;
SUBC> Response 'hav';
SUBC> Nodefault;
SUBC> Continuous 'ma';
SUBC> Terms ma;
SUBC> Constant;
SUBC> Unstandardized;
SUBC> Tmethod;
SUBC> Tanova;
SUBC> Tsummary;
SUBC> Tcoefficients;
SUBC> Tequation;
SUBC> TDiagnostics 0.
Regression Analysis: hav versus ma

Analysis of Variance

Source         DF  Adj SS  Adj MS  F-Value  P-Value
Regression      1   188.7  188.71     3.62    0.065
  ma            1   188.7  188.71     3.62    0.065
Error          36  1878.5   52.18
  Lack-of-Fit  17  1105.9   65.05     1.60    0.161
  Pure Error   19   772.6   40.66
Total          37  2067.3

Model Summary

      S   R-sq  R-sq(adj)  R-sq(pred)
7.22371  9.13%      6.60%       0.60%

Coefficients

Term       Coef  SE Coef  T-Value  P-Value   VIF
Constant  19.72     3.22     6.13    0.000
ma        0.339    0.178     1.90    0.065  1.00

Regression Equation

hav = 19.72 + 0.339 ma

Fits and Diagnostics for Unusual Observations

Obs    hav    Fit  Resid  Std Resid
  5  38.00  30.90   7.10       1.09     X
 31  50.00  23.79  26.21       3.70  R
 37  32.00  32.26  -0.26      -0.04     X

R  Large residual
X  Unusual X

MTB > Predict 'hav';
SUBC> Nodefault;
SUBC> KPredictors 25;
SUBC> TEquation;
SUBC> TPrediction.

Prediction for hav

Variable  Setting
ma             25

    Fit   SE Fit        95% CI              95% PI
28.1942  1.87073  (24.4001, 31.9882)  (13.0605, 43.3278)

Comment:
--------

The Minitab listing shows the 95% prediction interval to be (13.1,43.3), which is too wide to be of any practical use.

10.38:
-------

(a)+(b) Answered above.

(c) The linear regression model is:

HAV_i = beta0 + beta1*MA_i + eps_i,

where beta0 is the intercept, beta1 is the slope and the errors eps_1,...,eps_38 are assumed i.i.d. from N(0,sigma).

(d) The question of interest (that can be translated into hypotheses) is whether there exists an association between MA and HAV in the population. The null and alternative hypotheses are:

H0: beta1=0 (no association between MA and HAV)
Ha: beta1<>0 (some association, but no particular direction)

(e) The listings above give the t-test for H0: t=1.90, df=36, P=0.065. We therefore do not reject H0 at the 5% level. However, the P-value is close to 0.05, so we may say that there is some indication of a weak association. Note that we could equivalently have based our conclusion on the F-statistic in the ANOVA table: F=3.62, df=(1,36), P=0.065.
The only advantage of the t-test over the F-test is that it can be used with a one-sided alternative hypothesis.

10.39:
-------

The 95% CI for beta1 is computed as

estimate +- t(.975,36)*SE(estimate) = 0.3388 +- 2.028*0.1782 = 0.339 +- 0.361 = (-0.023,0.700)

(The t-value is obtained from Minitab.) Because zero is contained in the confidence interval, we can conclude that the two-sided test of H0: beta1=0 computed previously is non-significant at the 5% level. Confidence intervals for the regression parameters can be displayed in Minitab by choosing the Expanded tables under Results.

-----

Answer to additional questions:

Minitab commands and output for an analysis without the suspected outlier.

MTB > Copy 'ma' c3;
SUBC> Varnames.
MTB > let c3(31)='*'
MTB > Correlation 'hav' 'ma_1'.

Correlation: hav, ma_1

Pearson correlation of hav and ma_1 = 0.443
P-Value = 0.006

MTB > Fitline 'hav' 'ma_1';
SUBC> Confidence 95.0.

Regression Analysis: hav versus ma_1

The regression equation is
hav = 17.66 + 0.4189 ma_1

S = 5.76345   R-Sq = 19.6%   R-Sq(adj) = 17.3%

Analysis of Variance

Source      DF       SS       MS     F      P
Regression   1   284.20  284.205  8.56  0.006
Error       35  1162.61   33.217
Total       36  1446.81

Fitted Line: hav versus ma_1

Comments:
---------

Without the suspected outlier, the association between MA and HAV is clearly significant (P=0.006). The correlation is still relatively low, and even though the standard deviation about the line dropped to 5.76, it is still quite large. The slope of the line increased to 0.42 and the intercept dropped to 17.7. Thus, the fitted line starts a little lower and is somewhat steeper, but the difference is not huge when comparing the plots.

As to the question of whether the suspected outlier should be deleted from the data, it is possible to show (using methods beyond this course) that the observation is very unlikely to have occurred by chance alone. A test of whether this observation can be described by the same model as the others gives a P-value of 0.002.
This still does not mean that it MUST be dropped, but it provides justification to do so. The practical difference between the results of the two analyses is only minor, because even the model without the suspected outlier is not useful for prediction.
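
As a cross-check on the arithmetic in 2.48 and 10.39, the reported prediction, confidence interval and prediction interval can be recomputed by hand from the summary numbers in the Minitab listings for the full data set. The following is only a sketch in Python (not part of the Minitab analysis). All variable names are our own, the inputs are the rounded values copied from the listings, t(.975,36)=2.028 is the value quoted in 10.39, and the prediction standard error uses the standard formula sqrt(s^2 + SE(fit)^2).

```python
import math

# Values copied from the Minitab listings (full data set, n = 38).
b0, b1 = 19.72, 0.3388            # intercept and slope of the fitted line
se_b1 = 0.1782                    # standard error of the slope
s = 7.22371                       # standard deviation about the line
se_fit = 1.87073                  # SE of the fitted mean at MA = 25
ss_reg, ss_tot = 188.71, 2067.26  # regression and total sums of squares
t_star = 2.028                    # t(.975, 36), as quoted in 10.39

# 2.48 (b): predicted HAV for MA = 25.
fit = b0 + b1 * 25

# 10.39: 95% confidence interval for the slope.
slope_ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)

# 2.48 (c): 95% prediction interval for a new observation at MA = 25.
# The prediction SE combines the scatter about the line with the
# uncertainty in the fitted mean.
se_pred = math.sqrt(s**2 + se_fit**2)
pred_int = (fit - t_star * se_pred, fit + t_star * se_pred)

# R^2 as the share of the total sum of squares explained by the line,
# and the t statistic for H0: beta1 = 0 (its square is the F statistic).
r_sq = ss_reg / ss_tot
t_stat = b1 / se_b1

print(f"fit at MA=25:  {fit:.2f}")
print(f"95% CI slope:  ({slope_ci[0]:.3f}, {slope_ci[1]:.3f})")
print(f"95% PI:        ({pred_int[0]:.2f}, {pred_int[1]:.2f})")
print(f"R^2 = {r_sq:.3f}, t = {t_stat:.2f}, t^2 = {t_stat**2:.2f}")
```

Because the inputs are rounded, the results agree with the Minitab output only to within about 0.01; Minitab works with the unrounded coefficients and t-value.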