Solution file for additional exercise 2.10 ------------------------------------------ The purpose of this solution file is to demonstrate and comment on Minitab commands that correspond to the Stata solutions presented in VHM 812 for the linear regression exercise 4 (model building exercise). Output is only included for demonstration of differences between Stata and Minitab. * open the pig_adg data set use pig_adg.dta, clear * create dichotomous predictor ar_sev egen ar_sev=cut(ar), at(0 4.5 99) icodes MTB > WOpen "H:\VHM\VHM802\Data_csv\pig_adg.csv"; SUBC> FType; SUBC> CSV; SUBC> DecSep; SUBC> Period; SUBC> Field; SUBC> Comma; SUBC> TDelimiter; SUBC> DoubleQuote. Retrieving worksheet from file: ‘H:\VHM\VHM802\Data_csv\pig_adg.csv’ Worksheet was saved on 24/01/2013 MTB > Let c14 = 'ar'>4 MTB > name c14 'ar_sev' * Q6: descriptive statistics, unconditional associations, correlation analysis codebook adg pn ar_sev worms farm sex histogram adg , bin(15) tab farm regress adg i.farm regress adg sex regress adg pn regress adg ar_sev regress adg worms pwcorr sex adg pn ar_sev worms MTB > GSummary 'adg' 'worms'. MTB > Tally 'pn' 'ar_sev' 'sex' 'farm'; SUBC> Counts. MTB > OneWay; SUBC> Response 'adg'; SUBC> Categorical 'farm'; SUBC> IType 0; SUBC> GMCI; SUBC> GIntPlot; SUBC> TMethod; SUBC> TFactor; SUBC> TANOVA; SUBC> TSummary; SUBC> TMeans; SUBC> Nodefault. MTB > TwoT 'adg' 'pn'. MTB > TwoT 'adg' 'sex'. MTB > TwoT 'adg' 'ar_sev'. MTB > Fitline 'adg' 'worms'; SUBC> Confidence 95.0. MTB > Correlation 'adg' 'sex' 'pn' 'ar_sev' 'worms'. Comments: The commands above correspond to univariate analyses (unconditional associations) by the most appropriate method for each type of predictor. It is possible, but less informative and hence not recommended, to run regression models for all binary predictors. Caution should be exercised when interpreting correlations between variables that are not both quantitative or do not show a linear relation. * Q8: forward, backward, stepwise regression * first evaluate the pn * ar_sev interaction xi:stepwise, lockterm1 pe(0.05): regress adg (i.farm sex) worms (i.pn*ar_sev) xi:stepwise, lockterm1 pr(0.05): regress adg (i.farm sex) worms (i.pn*ar_sev) xi:stepwise, lockterm1 pe(0.05) pr(0.051): regress adg (i.farm sex) worms (i.pn*ar_sev) MTB > Regress; SUBC> Response 'adg'; SUBC> Nodefault; SUBC> Continuous 'worms'; SUBC> Categorical 'sex' 'pn' 'ar_sev' 'farm'; SUBC> Terms worms sex pn C14 farm pn*C14; SUBC> Constant; SUBC> Unstandardized; SUBC> Stepwise; SUBC> Hierarchical; SUBC> Always; SUBC> Tmsdetails; SUBC> Full; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients. MTB > Regress; SUBC> Response 'adg'; SUBC> Nodefault; SUBC> Continuous 'worms'; SUBC> Categorical 'sex' 'pn' 'ar_sev' 'farm'; SUBC> Terms worms sex pn C14 farm pn*C14; SUBC> Constant; SUBC> Unstandardized; SUBC> Stepwise; SUBC> Enter worms sex pn C14 farm pn*C14; SUBC> Hierarchical; SUBC> Always; SUBC> Tmsdetails; SUBC> Full; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients. Comments: --------- These two analyses correspond to forward and backwise stepwise selection, with the default P-value cut-offs at 0.15 and without forcing any predictors into the models (as was done in Stata with farm and sex). It is also possible to modify the settings to match the analysis in Stata exactly. The results are the same anyway for the settings used. MTB > Let c15 = 'pn'*'ar_sev' MTB > name c15 'pn*ar_sev' MTB > Name C16 "farm_1" C17 "farm_2" C18 "farm_3" C19 "farm_4" C20 "farm_5" & CONT> C21 "farm_6" C22 "farm_7" C23 "farm_8" C24 "farm_9" C25 "farm_10" & CONT> C26 "farm_11" C27 "farm_12" C28 "farm_13" C29 "farm_14" C30 "farm_15". MTB > Indicator 'farm' 'farm_1' 'farm_2' 'farm_3' 'farm_4' 'farm_5' 'farm_6' 'farm_7' & CONT> 'farm_8' 'farm_9' 'farm_10' 'farm_11' 'farm_12' 'farm_13' 'farm_14' & CONT> 'farm_15'. MTB > BReg 'adg' 'sex' 'pn' 'worms' 'ar_sev' 'pn*ar_sev' & CONT> 'farm_2'-'farm_15'; SUBC> Include 'farm_2'-'farm_15'; SUBC> NVars 1 5; SUBC> Best 2; SUBC> TExpand; SUBC> Constant. Best Subsets Regression: adg versus sex, pn, ... Response is adg p n * a a w r r o _ _ s r s s Total R-Sq R-Sq Mallows e p m e e Vars R-Sq (adj) PRESS (pred) Cp S x n s v v 15 58.3 56.4 787571.6 54.0 64.7 46.856 X 15 58.2 56.3 785610.8 54.1 65.8 46.929 X 16 62.8 60.9 708854.0 58.6 27.0 44.368 X X 16 61.8 59.9 726816.8 57.6 35.3 44.916 X X 17 63.6 61.7 696687.3 59.3 21.5 43.937 X X X 17 63.6 61.6 697074.2 59.3 21.7 43.949 X X X 18 64.2 62.2 693282.7 59.5 18.2 43.642 X X X X 18 63.6 61.6 701958.6 59.0 23.4 43.998 X X X X 19 64.2 62.1 698259.1 59.2 20.0 43.699 X X X X X At your request, the best subsets procedure included these variables in every model: farm_2 farm_3 farm_4 farm_5 farm_6 farm_7 farm_8 farm_9 farm_10 farm_11 farm_12 farm_13 farm_14 farm_15 Comments: --------- The Best Subsets regression gives as the model with lowest Mallow Cp (and highest predictive R-square) the the model also selected by the stepwise procedures. Note that even though the interaction pn*ar_sev can be constructed and included in the model selection process as shown above, this procedure will not work if one of the main effects turn out to be marginally non-significant. In this case, the stepwise procedures will want to drop (or not include) this term, but in practice one would rarely want to include an interaction term without both main effects present. * Q9: forcing deleted variables back and comparing models regress adg i.farm sex worms pn##ar_sev estimates store full regress adg i.farm sex pn##ar_sev estimates store red estimates table full red MTB > GLM 'adg' = farm sex worms pn 'ar_sev' pn* 'ar_sev'; SUBC> Covariates 'worms'; SUBC> Brief 3 . MTB > GLM 'adg' = farm sex pn 'ar_sev' pn* 'ar_sev'; SUBC> Covariates 'worms'; SUBC> Brief 3 . Comments: The F-test to compare the two models must be computed manually, except that in this particular case the reduced model has just one predictor less than the full model. The assessment should also be based on the parameter estimates which are therefore listed in full. * Q11: reliability * build model with 60% of data, predict for other 40% generate rand=uniform() regress adg i.farm sex pn##ar_sev if rand<0.6 predict pv, xb pwcorr adg pv if rand <0.6 pwcorr adg pv if rand >=0.6 MTB > Regress; SUBC> Response 'adg'; SUBC> Nodefault; SUBC> Categorical 'sex' 'pn' 'ar_sev' 'farm'; SUBC> Terms sex pn C14 farm pn*C14; SUBC> Constant; SUBC> TExpand; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients. Regression Analysis: adg versus sex, pn, ar_sev, farm Method Categorical predictor coding (1, 0) Analysis of Variance Source DF Seq SS Contribution Adj SS Adj MS F-Value P-Value Regression 18 1099088 64.19% 1099088 61060 32.06 0.000 sex 1 59738 3.49% 70941 70941 37.25 0.000 pn 1 50602 2.96% 18557 18557 9.74 0.002 ar_sev 1 132468 7.74% 10593 10593 5.56 0.019 farm 14 846018 49.41% 842940 60210 31.61 0.000 pn*ar_sev 1 10263 0.60% 10263 10263 5.39 0.021 Error 322 613280 35.81% 613280 1905 Lack-of-Fit 58 132367 7.73% 132367 2282 1.25 0.122 Pure Error 264 480913 28.08% 480913 1822 Total 340 1712368 100.00% Model Summary S R-sq R-sq(adj) PRESS R-sq(pred) 43.6417 64.19% 62.18% 693283 59.51% ... Comments: --------- The predictive R2 is 59.5% compared to the usual R2=64.2%, and thus loss in predictive ability with leave-one-out cross-validation is not very substantial. For information, we also include Minitab code to carry out the split- sample analysis, although one would in most cases be happy with referring to the predictive R2. MTB > Base 160131. MTB > Name c31 "ranuni" MTB > Random 341 'ranuni'; SUBC> Uniform 0.0 1.0. MTB > Name c32 'randadg1' MTB > Let 'randadg1' = 'adg'*('ranuni'<0.6) MTB > Code (0) '*' 'randadg1' 'randadg1' MTB > Name C33 'randadg2' MTB > Let 'randadg2' = 'adg'*('ranuni'>=0.6) MTB > Code (0) '*' 'randadg2' 'randadg2' MTB > Name C34 "FITS". MTB > Regress; SUBC> Response 'randadg1'; SUBC> Nodefault; SUBC> Categorical 'sex' 'pn' 'ar_sev' 'farm'; SUBC> Terms sex pn C14 farm pn*C14; SUBC> Constant; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Fits 'FITS'. Regression Analysis: randadg1 versus sex, pn, ar_sev, farm Method Categorical predictor coding (1, 0) Rows unused 123 Analysis of Variance Source DF Adj SS Adj MS F-Value P-Value Regression 18 677690 37649 19.05 0.000 sex 1 27598 27598 13.97 0.000 pn 1 7313 7313 3.70 0.056 ar_sev 1 19584 19584 9.91 0.002 farm 14 498107 35579 18.00 0.000 pn*ar_sev 1 1849 1849 0.94 0.335 Error 199 393244 1976 Lack-of-Fit 54 111431 2064 1.06 0.382 Pure Error 145 281813 1944 Total 217 1070934 Model Summary S R-sq R-sq(adj) R-sq(pred) 44.4533 63.28% 59.96% 55.70% Coefficients Term Coef SE Coef T-Value P-Value VIF Constant 564.5 12.3 45.78 0.000 sex 1 23.40 6.26 3.74 0.000 1.08 pn 1 -13.51 7.02 -1.92 0.056 1.36 ar_sev 1 -47.5 15.1 -3.15 0.002 2.64 farm 2 15.2 15.6 0.98 0.328 1.71 3 36.6 16.0 2.29 0.023 1.81 4 -100.2 15.6 -6.42 0.000 1.83 5 -31.8 18.3 -1.74 0.083 1.46 6 -69.4 16.6 -4.17 0.000 1.59 7 -12.7 16.3 -0.78 0.437 1.64 8 -49.7 15.9 -3.12 0.002 2.23 9 -53.8 16.0 -3.36 0.001 1.70 10 -81.7 16.4 -4.99 0.000 1.66 11 -89.5 16.0 -5.61 0.000 1.69 12 -109.6 17.1 -6.39 0.000 1.55 13 -112.9 16.0 -7.06 0.000 1.81 14 -51.9 15.3 -3.39 0.001 1.96 15 25.8 15.7 1.64 0.103 1.86 pn*ar_sev 1 1 -18.6 19.2 -0.97 0.335 2.62 MTB > Correlation 'randadg1' 'FITS'; SUBC> NoPValues. Correlation: randadg1, FITS Pearson correlation of randadg1 and FITS = 0.795 MTB > Correlation 'randadg2' 'FITS'; SUBC> NoPValues. Correlation: randadg2, FITS Pearson correlation of randadg2 and FITS = 0.798 Comments: --------- Quite unusually, the correlation obtained on the validation subset is larger than the one in the estimation subset. This can happen by chance, but is not the general rule. As for the Minitab coding, in most cases the easiest way to get Minitab to analyze a subset of the data is to subset the entire worksheet (Data-Subset Worksheet menu). However, as the predictions are required for the entire dataset, it is better to keep the data in one worksheet and generate two new variables with missing values corresponding to whether the observations are used for analysis or validation.