Solution file for additional exercise 2.6 ----------------------------------------- The purpose of this solution file is to demonstrate and comment on Minitab commands that correspond to the Stata solutions presented in VHM 812 for linear regression exercises 1-3. Output is only included for demonstration of differences between Stata and Minitab. Exercise 1: ----------- * open the btb_episodes data set use btb_episodes.dta, clear MTB > WOpen "h:\VHM\VHM802\Data_csv\hs02_6.csv"; SUBC> FType; SUBC> CSV; SUBC> DecSep; SUBC> Period; SUBC> Field; SUBC> Comma; SUBC> TDelimiter; SUBC> DoubleQuote. Retrieving worksheet from file: 'h:\VHM\VHM802\Data_csv\hs02_6.csv' Worksheet was saved on 12/01/2011 * Q1a: distribution of outcome -intvl- codebook intvl histogram intvl MTB > GSummary 'intvl'. Summary for intvl * Q1b: natural log transformed outcome generate intvl_ln=ln(intvl) histogram intvl_ln MTB > Name C7 'intvl_ln' MTB > Let 'intvl_ln' = ln('intvl') MTB > GSummary 'intvl_ln'. Summary for intvl_ln * Q2: simple linear regressions regress intvl_ln p_rct regress intvl_ln p_year regress intvl_ln hdsize MTB > Regress 'intvl_ln' 1 'p_rct'; SUBC> Constant; SUBC> Brief 1. Regression Analysis: intvl_ln versus p_rct MTB > Regress 'intvl_ln' 1 'p_year'; SUBC> Constant; SUBC> Brief 1. Regression Analysis: intvl_ln versus p_year MTB > Regress 'intvl_ln' 1 'hdsize'; SUBC> Constant; SUBC> Brief 1. Regression Analysis: intvl_ln versus hdsize * Q3: multiple linear regression regress intvl_ln p_rct p_year hdsize MTB > Regress 'intvl_ln' 3 'p_rct' 'p_year' 'hdsize'; SUBC> Constant; SUBC> Brief 1. Regression Analysis: intvl_ln versus p_rct, p_year, hdsize * added calculation of correlation between p_rct and hdsize pwcorr p_rct hdsize MTB > Correlation 'p_rct' 'hdsize'. Correlations: p_rct, hdsize regress intvl_ln hdsize estimates store smpl regress intvl_ln p_rct p_year hdsize estimates store mltpl estimates table smpl mltpl Note: It is possible in Minitab to store estimates in a column in the worksheet, but any further processing needs to be done manually. * Q4: predicion intervals for means and individuals (based only on hdsize) regress intvl_ln hdsize * doing the calculations yourself predict pv, xb predict pv_mean_se, stdp scalar tstar=invttail(2985,.025) /* using DFE */ generate pv_mean_u=pv + tstar*pv_mean_se generate pv_mean_l=pv - tstar*pv_mean_se predict pv_ind_se, stdf generate pv_ind_u=pv + tstar*pv_ind_se generate pv_ind_l=pv - tstar*pv_ind_se twoway (scatter intvl_ln hdsize, msize(vsmall)) (line pv hdsize) /// (line pv_mean_u hdsize) (line pv_mean_l hdsize) /// (line pv_ind_u hdsize) (line pv_ind_l hdsize) * using the built in graphics capabilities (only simple linear regression) twoway (scatter intvl_ln hdsize, sort msize(vsmall)) /// (lfitci intvl_ln hdsize, ciplot(rline)) /// (lfitci intvl_ln hdsize, stdf ciplot(rline)) MTB > Fitline 'intvl_ln' 'hdsize'; SUBC> Confidence 95.0; SUBC> Ci; SUBC> Pi. Regression Analysis: intvl_ln versus hdsize MTB > Name c8 "PFIT1" c9 "CLIM1" c10 "CLIM2" c11 "PLIM1" c12 "PLIM2" MTB > Regress 'intvl_ln' 1 'hdsize'; SUBC> Constant; SUBC> Predict 'hdsize'; SUBC> PFits 'PFIT1'; SUBC> CLimits 'CLIM1'-'CLIM2'; SUBC> PLimits 'PLIM1'-'PLIM2'; SUBC> Brief 1. Regression Analysis: intvl_ln versus hdsize * Q5: R2 regress intvl_ln p_year hdsize regress intvl_ln p_rct p_year hdsize MTB > Regress 'intvl_ln' 2 'p_rct' 'hdsize'; SUBC> Constant; SUBC> Brief 1. Regression Analysis: intvl_ln versus p_rct, hdsize MTB > Regress 'intvl_ln' 3 'p_rct' 'p_year' 'hdsize'; SUBC> Constant; SUBC> Brief 1. Regression Analysis: intvl_ln versus p_rct, p_year, hdsize Exercise 2: ----------- * open the btb_episodes data set use btb_episodes.dta, clear MTB > WOpen "h:\VHM\VHM802\Data_csv\hs02_6.csv"; SUBC> FType; SUBC> CSV; SUBC> DecSep; SUBC> Period; SUBC> Field; SUBC> Comma; SUBC> TDelimiter; SUBC> DoubleQuote. Retrieving worksheet from file: 'h:\VHM\VHM802\Data_csv\hs02_6.csv' Worksheet was saved on 12/01/2011 * compute natural log transformed outcome generate intvl_ln=ln(intvl) MTB > Name C7 'intvl_ln' MTB > Let 'intvl_ln' = ln('intvl') * Q2: -p_year- as sole predictor regress intvl_ln p_year /* continuous predictor */ regress intvl_ln i.p_year /* categorical predictor */ MTB > Regress 'intvl_ln' 1 'p_year'; SUBC> Constant; SUBC> Brief 1. Regression Analysis: intvl_ln versus p_year MTB > GReg 'intvl_ln' = 'p_year'; SUBC> Categorical 'p_year'; SUBC> Constant; SUBC> Confidence 95.0; SUBC> Coding 1; SUBC> References 'p_year' 1989; SUBC> TCoef; SUBC> CI; SUBC> TSummary; SUBC> TANOVA. General Regression Analysis: intvl_ln versus p_year Coefficients Term Coef SE Coef T P 95% CI Constant 6.95797 0.062422 111.466 0.000 ( 6.83557, 7.08036) p_year 1990 0.02120 0.078812 0.269 0.788 (-0.13334, 0.17573) 1991 -0.00877 0.081922 -0.107 0.915 (-0.16940, 0.15186) 1992 0.12456 0.085360 1.459 0.145 (-0.04281, 0.29193) 1993 0.27241 0.094199 2.892 0.004 ( 0.08771, 0.45711) 1994 0.21709 0.106276 2.043 0.041 ( 0.00871, 0.42548) 1995 0.01496 0.110415 0.136 0.892 (-0.20153, 0.23146) 1996 0.12827 0.105811 1.212 0.226 (-0.07920, 0.33574) 1997 -0.02352 0.117306 -0.200 0.841 (-0.25352, 0.20649) 1998 -0.38491 0.118468 -3.249 0.001 (-0.61720, -0.15262) 1999 -0.35333 0.104693 -3.375 0.001 (-0.55861, -0.14805) 2000 -0.21752 0.113130 -1.923 0.055 (-0.43934, 0.00430) 2001 -0.14647 0.134597 -1.088 0.277 (-0.41038, 0.11744) 2002 -0.45370 0.130723 -3.471 0.001 (-0.71001, -0.19738) 2003 -0.51631 0.133918 -3.855 0.000 (-0.77889, -0.25373) 2004 -0.69396 0.150912 -4.598 0.000 (-0.98987, -0.39806) 2005 -0.94842 0.169650 -5.590 0.000 (-1.28107, -0.61578) 2006 -1.38073 0.169650 -8.139 0.000 (-1.71337, -1.04809) 2007 -1.91050 0.607348 -3.146 0.002 (-3.10136, -0.71963) Summary of Model S = 1.04639 R-Sq = 7.39% R-Sq(adj) = 6.83% PRESS = 3296.20 R-Sq(pred) = 6.47% Analysis of Variance Source DF Seq SS Adj SS Adj MS F P Regression 18 260.39 260.39 14.4659 13.2118 0 p_year 18 260.39 260.39 14.4659 13.2118 0 Error 2981 3263.97 3263.97 1.0949 Total 2999 3524.35 Note: In order to get the model parametrized in the same way as in Stata with the first category as the reference level, one needs to set the "Type of coding for categorical predictors" to (1,0), in the Options submenu. Otherwise the parameters will be determined from the restriction that the sum of all parameters across categories equals 0. This parametrization is also used in General Linear Model menu (with no alternatives provided). * improving interpretability of intercept generate pyear_1989=p_year-1989 regress intvl_ln pyear_1989 MTB > Let c8 = 'p_year'-1989 MTB > Name C8 'pyear_1989' MTB > GReg 'intvl_ln' = 'pyear_1989'; SUBC> Constant; SUBC> Confidence 95.0; SUBC> TCoef; SUBC> CI; SUBC> TSummary; SUBC> TANOVA. General Regression Analysis: intvl_ln versus pyear_1989 Coefficients Term Coef SE Coef T P 95% CI Constant 7.14404 0.0294498 242.584 0.000 ( 7.08630, 7.20179) pyear_1989 -0.04784 0.0041802 -11.445 0.000 (-0.05604, -0.03965) Summary of Model S = 1.06130 R-Sq = 4.19% R-Sq(adj) = 4.15% PRESS = 3380.97 R-Sq(pred) = 4.07% Analysis of Variance Source DF Seq SS Adj SS Adj MS F P Regression 1 147.55 147.55 147.547 130.996 0.0000000 pyear_1989 1 147.55 147.55 147.547 130.996 0.0000000 Error 2998 3376.81 3376.81 1.126 Lack-of-Fit 17 112.84 112.84 6.638 6.062 0.0000000 Pure Error 2981 3263.97 3263.97 1.095 Total 2999 3524.35 * manual computation of lack of fit test scalar F=(3376.80672-3263.96797)/(2998-2981)/1.09492384 display "lack-of-fit F-test: F=" F " P-value=" Ftail(2998-2981,2981,F) The lack of fit test was already computed above (listed in the ANOVA table). The P-value could also be computed manually in Minitab as MTB > CDF 6.0621326; SUBC> F 17 2981. Cumulative Distribution Function F distribution with 17 DF in numerator and 2981 DF in denominator x P( X <= x ) 6.06213 1 * Q3: linearity of -p_year- vs -intvl_ln- relationship twoway (scatter intvl_ln p_year, msize(vsmall)) (lowess intvl_ln p_year) generate pyear_sq = p_year^2 regress intvl_ln p_year pyear_sq MTB > Plot 'intvl_ln'*'p_year'; SUBC> Symbol; SUBC> Lowess. Scatterplot of intvl_ln vs p_year MTB > Name C9 'pyear_sq' MTB > Let 'pyear_sq' = 'p_year'**2 MTB > GReg 'intvl_ln' = 'p_year' 'pyear_sq'; SUBC> Constant; SUBC> Confidence 95.0; SUBC> TCoef; SUBC> VIF; SUBC> TSummary; SUBC> TANOVA. General Regression Analysis: intvl_ln versus p_year, pyear_sq * Q4: evaluating collinearity (VIFs) regress intvl_ln p_year pyear_sq estat vif generate pyear_ct=p_year-1999 replace pyear_sq=pyear_ct^2 regress intvl_ln pyear_ct pyear_sq estat vif MTB > Name C10 'pyear_ct' MTB > Let 'pyear_ct' = 'p_year'-1999 MTB > Name C11 'pyear_ct_sq' MTB > Let 'pyear_ct_sq' = 'pyear_ct'**2 MTB > GReg 'intvl_ln' = 'pyear_ct' 'pyear_ctsq'; SUBC> Constant; SUBC> Confidence 95.0; SUBC> TCoef; SUBC> VIF; SUBC> CI; SUBC> TSummary; SUBC> TANOVA. General Regression Analysis: intvl_ln versus pyear_ct, pyear_ctsq Note: see above for the VIF in the uncentered model. * Q5: evaluating confounding regress intvl_ln hdsize /* model without year */ regress intvl_ln pyear_ct pyear_sq hdsize /* model with year */ display "percent change=" abs(-.0002863-(-.0010355))/abs(-.0010355)*100 * added assessment of association hdsize vs year (should really adjust for herd) regress hdsize p_year MTB > Regress 'intvl_ln' 1 'hdsize'; SUBC> Constant; SUBC> Brief 1. Regression Analysis: intvl_ln versus hdsize MTB > Regress 'intvl_ln' 3 'hdsize' 'pyear_ct' 'pyear_ct_sq'; SUBC> Constant; SUBC> Brief 1. Regression Analysis: intvl_ln versus hdsize, pyear_ct, pyear_ct_sq MTB > Regress 'hdsize' 1 'p_year'; SUBC> Constant; SUBC> Brief 1. Regression Analysis: hdsize versus p_year * Q6: interaction egen pyear_c2=cut(p_year), at(0 1999 2999) icodes tab pyear_c2 p_year /* just checking */ regress intvl_ln pyear_c2##c.p_rct MTB > name c12 'pyear_c2' MTB > Code (0:1998) 0 (1999:2999) 1 'p_year' 'pyear_c2' MTB > GReg 'intvl_ln' = 'pyear_c2' 'p_rct' 'pyear_c2'* 'p_rct'; SUBC> Constant; SUBC> Confidence 95.0; SUBC> TCoef; SUBC> CI; SUBC> TSummary; SUBC> TANOVA. General Regression Analysis: intvl_ln versus pyear_c2, p_rct * Q7: model building regress intvl_ln hdsize pyear_c2##c.p_rct margins pyear_c2, at(hdsize=50 p_rct=(0(10)40)) expression(exp(predict(xb))) marginsplot MTB > Name c13 "newyear" MTB > Set 'newyear' DATA> 1( 0 : 1 / 1 )5 DATA> End. MTB > Name c14 'newrct' MTB > Set 'newrct' DATA> 2( 0 : 40 / 10 )1 DATA> End. MTB > Name c15 "newhds" MTB > Set 'newhds' DATA> 1( 50 : 50 / 1 )10 DATA> End. MTB > Name c16 "PFIT1" c17 "CLIM1" c18 "CLIM2" c19 "PLIM1" c20 "PLIM2" MTB > GReg 'intvl_ln' = 'pyear_c2' 'p_rct' 'pyear_c2'* 'p_rct' hdsize; SUBC> Constant; SUBC> Confidence 95.0; SUBC> PContinuous 'newyear' 'newrct' 'newhds'; SUBC> PFits 'PFIT1'; SUBC> CLimit 'CLIM1'-'CLIM2'; SUBC> PLimit 'PLIM1'-'PLIM2'; SUBC> TCoef; SUBC> CI; SUBC> TSummary; SUBC> TANOVA. General Regression Analysis: intvl_ln versus pyear_c2, p_rct, hdsize This produces columns with predicted values as well as lower and upper bounds for confidence and prediction intervals for the sets of predictor values specified in columns c13-c15. These can be plotted on transformed scale or with some further manipulation on original scale. Exercise 3: ----------- * open the btb_episodes data set use btb_episodes_data.dta, clear * compute extra variables needed generate intvl_ln=ln(intvl) generate pyear_ct=p_year-2000 generate pyear_sq=pyear_ct^2 sort herd p_year gen id=_n MTB > WOpen "h:\VHM\VHM802\Data_csv\hs02_6.csv"; SUBC> FType; SUBC> CSV; SUBC> DecSep; SUBC> Period; SUBC> Field; SUBC> Comma; SUBC> TDelimiter; SUBC> DoubleQuote. Retrieving worksheet from file: 'h:\VHM\VHM802\Data_csv\hs02_6.csv' Worksheet was saved on 12/01/2011 MTB > Sort 'herd'-'p_year' 'herd'-'p_year'; SUBC> By 'herd' 'p_year'. MTB > Set c7 DATA> 1( 1 : 3000 / 1 )1 DATA> End. MTB > name c7 'id' MTB > Name C8 'intvl_ln' MTB > Let 'intvl_ln' = ln('intvl') MTB > Name C9 'pyear_ct' MTB > Let 'pyear_ct' = 'p_year'-2000 MTB > Name C10 'pyear_sq' MTB > Let 'pyear_sq' = 'pyear_ct'**2 * run model regress intvl_ln p_rct hdsize pyear_ct pyear_sq MTB > Regress 'intvl_ln' 4 'p_rct' 'hdsize' 'pyear_ct' 'pyear_sq'; SUBC> Constant; SUBC> Brief 2. Regression Analysis: intvl_ln versus p_rct, hdsize, pyear_ct, pyear_sq * Q1: evaluate homoscedasticity estat hettest estat imtest * residual plot using standardised residuals predict stdres, rstandard predict fit, xb scatter stdres fit MTB > Name c11 "SRES1" MTB > Regress 'intvl_ln' 4 'p_rct' 'hdsize' 'pyear_ct' 'pyear_sq'; SUBC> SResiduals 'SRES1'; SUBC> GFourpack; SUBC> RType 2; SUBC> Constant; SUBC> Brief 1. Regression Analysis: intvl_ln versus p_rct, hdsize, pyear_ct, pyear_sq Note: Minitab does not offer tests for heteroscedasticity; this is no big deficit because such tests are often less useful than graphical tools based on the residuals. * Q2: evaluate normality * first graphically histogram stdres, normal qnorm stdres summarize stdres, detail * then with a formal test swilk stdres MTB > GSummary 'SRES1'. Summary for SRES1 MTB > NormTest 'SRES1'; SUBC> RJTest. Probability Plot of SRES1 Note: The normal plot was produced above. Normal plots in Minitab differ from those in Stata by having the y-axis and x-axis switched. Also, Minitab does not offer Shapiro-Wilks test for normality, but the Ryan-Joiner test is "similar". P-values for normality tests on residuals can never be trusted exactly anyway because they require independent observations, and residuals are not independent. * Q3: attempts to correct the overall fit * starting with the original -intvl- variable * use Box-Cox analysis to identify the optimal transformation boxcox intvl p_rct hdsize pyear_ct pyear_sq ... Note: We only illustrate here only the Box-Cox analysis in Minitab. The subsequent model checks are done in Minitab in the same way as before, just with a different outcome. MTB > Name c12 "SRES2" MTB > GReg 'intvl' = 'p_rct' hdsize 'pyear_ct' 'pyear_sq'; SUBC> Constant; SUBC> Confidence 95.0; SUBC> Coding -1; SUBC> Boxcox; SUBC> GFourpack; SUBC> RType 2 ; SUBC> TEquation; SUBC> TCoef; SUBC> CI; SUBC> TSummary; SUBC> TANOVA; SUBC> SResiduals 'SRES2'. General Regression Analysis: intvl versus p_rct, hdsize, pyear_ct, pyear_sq Box-Cox transformation of the response with rounded lambda = 0.164565 The 95% CI for lambda is (0.125, 0.195) Regression Equation intvl^0.164565 = 3.05155 + 0.00680397 p_rct - 0.000462158 hdsize - 0.0544861 pyear_ct - 0.00373416 pyear_sq 2987 cases used, 13 cases contain missing values Coefficients Term Coef SE Coef T P 95% CI Constant 3.05155 0.0222785 136.973 0.000 ( 3.00786, 3.09523) p_rct 0.00680 0.0017604 3.865 0.000 ( 0.00335, 0.01026) hdsize -0.00046 0.0002229 -2.073 0.038 (-0.00090, -0.00003) pyear_ct -0.05449 0.0041686 -13.071 0.000 (-0.06266, -0.04631) pyear_sq -0.00373 0.0004756 -7.851 0.000 (-0.00467, -0.00280) Summary of Model S = 0.527968 R-Sq = 7.28% R-Sq(adj) = 7.16% PRESS = 833.808 R-Sq(pred) = 6.99% Analysis of Variance Source DF Seq SS Adj SS Adj MS F P Regression 4 65.260 65.260 16.3150 58.529 0.000000 p_rct 1 1.192 4.164 4.1639 14.938 0.000113 hdsize 1 4.103 1.198 1.1979 4.297 0.038260 pyear_ct 1 42.782 47.623 47.6227 170.844 0.000000 pyear_sq 1 17.183 17.183 17.1832 61.644 0.000000 Error 2982 831.234 831.234 0.2788 Lack-of-Fit 2434 665.376 665.376 0.2734 0.903 0.939719 Pure Error 548 165.858 165.858 0.3027 Total 2986 896.494 Residual Plots for intvl Note: The Box-Cox analysis results in a choice of the "preferable" transformation, and a model is then fitted for the transformed variable. In this case, the choice did not involve any rounding off (of lambda), so the listing includes the strictly optimal transformation. * Q4: diagnostics * refit model regress intvl_ln p_rct hdsize pyear_ct pyear_sq * examine detailed aspects of fit for the original model * generate necessary residuals and fit statistics capture drop fit stdres predict fit, xb predict stdres, rstandard predict delres, rstudent predict cook, cooksd predict lev, leverage predict dfit, dfit MTB > Name c13 "SRES3" c14 "TRES3" c15 "HI3" c16 "COOK3" c17 "DFIT3" MTB > Regress 'intvl_ln' 4 'p_rct' 'hdsize' 'pyear_ct' 'pyear_sq'; SUBC> SResiduals 'SRES3'; SUBC> Tresiduals 'TRES3'; SUBC> Hi 'HI3'; SUBC> Cookd 'COOK3'; SUBC> DFits 'DFIT3'; SUBC> Constant; SUBC> Brief 1. Regression Analysis: intvl_ln versus p_rct, hdsize, pyear_ct, pyear_sq Note: Minitab does not offer dfbeta statistics. The list of "extreme" residuals (defined as abs(stdres)>=2) and leverages has been turned off because for a large dataset, the list becomes very long and mostly a nuisance. * largest outlier(s) summarize stdres, d sort stdres list id intvl_ln p_rct hdsize p_year fit stdres delres if abs(stdres)>2.7 & stdres~=., noobs MTB > Describe 'SRES3' 'TRES3' 'HI3' 'COOK3' 'DFIT3'; SUBC> Mean; SUBC> Median; SUBC> Minimum; SUBC> Maximum; SUBC> N; SUBC> NMissing. Descriptive Statistics: SRES3, TRES3, HI3, COOK3, DFIT3 Note: Minitab does not allow easy listing of observations subject to certain conditions. One option is to subset the dataset suitably, e.g. as follows: MTB > Subset; SUBC> Where "abs('SRES3')>2.5 And abs('SRES3') <> '*'"; SUBC> Name "Subset of hs02_6.csv"; SUBC> NoMatrices; SUBC> NoConstants; SUBC> Include. Subset worksheet Subset of hs02_6.csv created. * outlier test based on deletion residuals display 2*nobs*ttail(nobs-nparam-1, 2.967837) /* P=1, so NS */ display invttail(nobs-nparam-1,.025/nobs) Note: These tests can be computed semi-manually in Minitab as well. To illustrate, we compute the cut-off for significance of the test (using DF=2987-6=2981 and p=0.025/2987=0.00000837) MTB > InvCDF .00000837; SUBC> T 2981. Inverse Cumulative Distribution Function * compute threshold values display "leverage cutoff: " 2*nparam/nobs display "conservative leverage cutoff: " 3*nparam/nobs display "Cook's D cutoff: " 4/nobs display "DFITS cutoff: " 2*sqrt(nparam/nobs)*(nobs>=120)+1*(nobs<120) display "DFBETA cutoff: " 2/sqrt(nobs)*(nobs>=120)+1*(nobs<120) MTB > let k1=2*5/2987 MTB > name k1 'thres_lev' MTB > let k2=3*5/2987 MTB > name k2 'thres2_lev' MTB > let k3=4/2987 MTB > name k3 'thres_cookd' MTB > let k4=2*sqrt(5/2987)*(2987>=120)+(2987<120) MTB > Print 'thres_lev' 'thres2_lev' 'thres_cookd' 'thres_dfits'. Data Display Note: One could also introduce constants 'nobs' and 'nparam' and use those in the formulae. * refit the model two ways * first, leave out cooksd > 0.007 (n=5) regress intvl_ln p_rct hdsize pyear_ct pyear_sq if cook<=0.007 * R2 goes up slightly, coef. don't change * second, leave out herds with high values of p_rct regress intvl_ln p_rct hdsize pyear_ct pyear_sq if p_rct<=50 * very little change in model * ADVANCED MATERIAL - suggest using robust standard errors (Chapter 20) * to counteract problems of heteroscedasticity and non-normality regress intvl_ln p_rct hdsize pyear_ct pyear_sq, robust Note: Minitab does not offer robust standard errors, but this is indeed an advanced topic and such procedures should be used with utmost care.