Solution file for additional exercise 2.10 ------------------------------------------ The purpose of this solution file is to demonstrate and comment on Minitab commands that correspond to the Stata solutions presented in VHM 812 for the linear regression exercise 4 (model building exercise). Output is only included for demonstration of differences between Stata and Minitab. * open the pig_adg data set use pig_adg.dta, clear * create dichotomous predictor ar_sev egen ar_sev=cut(ar), at(0 4.5 99) icodes MTB > WOpen "h:\VHM\VHM802\Data_csv\hs02_6.csv"; SUBC> FType; SUBC> CSV; SUBC> DecSep; SUBC> Period; SUBC> Field; SUBC> Comma; SUBC> TDelimiter; SUBC> DoubleQuote. Retrieving worksheet from file: 'h:\VHM\VHM802\Data_csv\hs02_6.csv' Worksheet was saved on 27/01/2011 MTB > Let c14 = 'ar'>4 MTB > name c14 'ar_sev' * Q6: descriptive statistics, unconditional associations, correlation analysis codebook adg pn ar_sev worms farm sex histogram adg , bin(15) tab farm regress adg i.farm regress adg sex regress adg pn regress adg ar_sev regress adg worms pwcorr sex adg pn ar_sev worms MTB > GSummary 'adg' 'worms'. MTB > Tally 'pn' 'ar_sev' 'sex' 'farm'; SUBC> Counts. MTB > Oneway 'adg' 'farm'. MTB > TwoT 'adg' 'pn'. MTB > TwoT 'adg' 'sex'. MTB > TwoT 'adg' 'ar_sev'. MTB > Regress 'adg' 1 'worms'; SUBC> Constant; SUBC> Brief 2. MTB > Correlation 'adg' 'sex' 'pn' 'ar_sev' 'worms'. Comments: The commands above correspond to univariate analyses (unconditional associations) by the most appropriate method for each type of predictor. It is possible, but less informative and hence not recommended, to run regression models for all binary predictors. Caution should be exercised when interpreting correlations between variables that are not both quantitative or do not show a linear relation. * Q8: forward, backward, stepwise regression * first evaluate the pn * ar_sev interaction xi:stepwise, lockterm1 pe(0.05): regress adg (i.farm sex) worms (i.pn*ar_sev) xi:stepwise, lockterm1 pr(0.05): regress adg (i.farm sex) worms (i.pn*ar_sev) xi:stepwise, lockterm1 pe(0.05) pr(0.051): regress adg (i.farm sex) worms (i.pn*ar_sev) MTB > Let c15 = 'pn' * 'ar_sev' MTB > name c15 'pn*ar_sev' MTB > Stepwise 'adg' 'sex' 'pn' 'worms' 'ar_sev' 'pn*ar_sev'; SUBC> AEnter 0.15; SUBC> ARemove 0.15; SUBC> Best 0; SUBC> Constant. MTB > Stepwise 'adg' 'sex' 'pn' 'worms' 'ar_sev' 'pn*ar_sev'; SUBC> Enter 'sex' 'pn' 'worms' 'ar_sev' 'pn*ar_sev'; SUBC> AEnter 0.15; SUBC> ARemove 0.15; SUBC> Best 0; SUBC> Constant. MTB > BReg 'adg' 'sex' 'pn' 'ar_sev' 'pn*ar_sev' 'worms' & CONT> 'farm_1'-'farm_14'; SUBC> Include 'farm_1'-'farm_14'; SUBC> NVars 1 5; SUBC> Best 2; SUBC> Constant. Best Subsets Regression: adg versus sex, pn, ... Response is adg The following variables are included in all models: farm_1 farm_2 farm_3 farm_4 farm_5 farm_6 farm_7 farm_8 farm_9 farm_10 farm_11 farm_12 farm_13 farm_14 p n * a a r r w _ _ o s s s r Mallows e p e e m Vars R-Sq R-Sq(adj) Cp S x n v v s 1 58.3 56.4 64.7 46.856 X 1 58.2 56.3 65.8 46.929 X 2 62.8 60.9 27.0 44.368 X X 2 61.8 59.9 35.3 44.916 X X 3 63.6 61.7 21.5 43.937 X X X 3 63.6 61.6 21.7 43.949 X X X 4 64.2 62.2 18.2 43.642 X X X X 4 63.6 61.6 23.4 43.998 X X X X 5 64.2 62.1 20.0 43.699 X X X X X Comments: The analysis reveals some restrictions in Minitab. First, it's not possible to force indicator (dummy) variables for all farms into the stepwise regression models, perhaps because the number of farms is too large. This means that one must ignore farms because allowing the procedure to include only a few selected farm indicator variables is not meaningful. The Best Subsets regression (above) does work fine and gives as the model with lowest Mallow Cp the model also selected in Stata. Second, the interaction pn*ar_sev can be constructed and included in the model selection process as shown above, but this procedure will not work if one of the main effects turn out to be marginally non-significant. In this case, the stepwise procedures will want to drop (or not include) this term, but in practice one would rarely want to include an interaction term without both main effects present. * Q9: forcing deleted variables back and comparing models regress adg i.farm sex worms pn##ar_sev estimates store full regress adg i.farm sex pn##ar_sev estimates store red estimates table full red MTB > GLM 'adg' = farm sex worms pn 'ar_sev' pn* 'ar_sev'; SUBC> Covariates 'worms'; SUBC> Brief 3 . MTB > GLM 'adg' = farm sex pn 'ar_sev' pn* 'ar_sev'; SUBC> Covariates 'worms'; SUBC> Brief 3 . Comments: The F-test to compare the two models must be computed manually, except that in this particular case the reduced model has just one predictor less than the full model. The assessment should also be based on the parameter estimates which are therefore listed in full. * Q11: reliability * build model with 60% of data, predict for other 40% generate rand=uniform() regress adg i.farm sex pn##ar_sev if rand<0.6 predict pv, xb pwcorr adg pv if rand <0.6 pwcorr adg pv if rand >=0.6 MTB > Name c31 "ranuni" MTB > Random 341 'ranuni'; SUBC> Uniform 0.0 1.0. MTB > Name c32 'randadg1' MTB > Let 'randadg1' = 'adg'*('ranuni'<0.6) MTB > Code (0) '*' 'randadg1' 'randadg1' MTB > Name C33 'randadg2' MTB > Let 'randadg2' = 'adg'*('ranuni'>=0.6) MTB > Code (0) '*' 'randadg2' 'randadg2' MTB > Name c34 "FITS1" MTB > Regress 'randadg' 18 'sex' 'pn' 'ar_sev' 'pn*ar_sev' & CONT> 'farm_2'-'farm_15'; SUBC> Fits 'FITS1'; SUBC> Constant; SUBC> Brief 2. MTB > Correlation 'randadg1' 'FITS1'; SUBC> NoPValues. Correlations: randadg1, FITS1 Pearson correlation of randadg and FITS1 = 0.817 MTB > Correlation 'randadg2' 'FITS1'; SUBC> NoPValues. Correlations: randadg2, FITS1 Pearson correlation of randadg2 and FITS1 = 0.720 Comments: In most cases, the easiest way to get Minitab to analyze a subset of the data is to subset the entire worksheet (Data-Subset Worksheet menu). However, as the predictions are required for the entire dataset, it is better to keep the data in one worksheet and generate two new variables with missing values corresponding to whether the observations are used for analysis or validation. Additional note: Instead of the reliability analysis, one may compute the PRESS statistic (General Regression menu) based on fitting the model repeatedly without each observation individually. MTB > GReg 'adg' = farm sex pn| 'ar_sev'; SUBC> Categorical 'farm' 'pn' 'ar_sev'; SUBC> Constant; SUBC> Confidence 95.0; SUBC> Coding 1; SUBC> TEquation; SUBC> TCoef; SUBC> TSummary; SUBC> TANOVA; SUBC> TDiag 0. General Regression Analysis: adg versus sex, farm, pn, ar_sev ... Summary of Model S = 43.6417 R-Sq = 64.19% R-Sq(adj) = 62.18% PRESS = 693283 R-Sq(pred) = 59.51% Here the correlation drops from sqrt(.6419)=.80 to sqrt(.5951)=0.77 when based on the PRESS instead of the usual R^2.