Solution file for additional exercise 2.10
------------------------------------------

The purpose of this solution file is to demonstrate and comment on
Minitab commands that correspond to the Stata solutions presented in VHM
812 for the linear regression exercise 4 (model building exercise). Output 
is only included for demonstration of differences between Stata and Minitab.

* open the pig_adg data set
use pig_adg.dta, clear
* create dichotomous predictor ar_sev
egen ar_sev=cut(ar), at(0 4.5 99) icodes

MTB > WOpen "H:\VHM\VHM802\Data_csv\pig_adg.csv";
SUBC>   FType;
SUBC>     CSV;
SUBC>   DecSep;
SUBC>     Period;
SUBC>   Field;
SUBC>     Comma;
SUBC>   TDelimiter;
SUBC>     DoubleQuote.
Retrieving worksheet from file: ‘H:\VHM\VHM802\Data_csv\pig_adg.csv’
Worksheet was saved on 24/01/2013
MTB > Let c14 = 'ar'>4
MTB > name c14 'ar_sev'


* Q6: descriptive statistics, unconditional associations, correlation analysis
codebook adg pn ar_sev worms farm sex
histogram adg , bin(15)
tab farm
regress adg i.farm
regress adg sex
regress adg pn
regress adg ar_sev
regress adg worms
pwcorr sex adg pn ar_sev worms

MTB > GSummary 'adg' 'worms'.
MTB > Tally 'pn' 'ar_sev' 'sex' 'farm';
SUBC>   Counts.
MTB > OneWay;
SUBC>   Response 'adg';
SUBC>   Categorical 'farm';
SUBC>   IType 0;
SUBC>   GMCI;
SUBC>   GIntPlot;
SUBC>   TMethod;
SUBC>   TFactor;
SUBC>   TANOVA;
SUBC>   TSummary;
SUBC>   TMeans;
SUBC>   Nodefault.
MTB > TwoT 'adg' 'pn'.
MTB > TwoT 'adg' 'sex'.
MTB > TwoT 'adg' 'ar_sev'.
MTB > Fitline 'adg' 'worms';
SUBC>   Confidence 95.0.
MTB > Correlation 'adg' 'sex' 'pn' 'ar_sev' 'worms'.

Comments: The commands above correspond to univariate analyses
(unconditional associations) by the most appropriate method for each
type of predictor. It is possible, but less informative and hence not
recommended, to run regression models for all binary predictors. Caution
should be exercised when interpreting correlations between variables
that are not both quantitative or do not show a linear relation.


* Q8: forward, backward, stepwise regression
* first evaluate the pn * ar_sev interaction
xi:stepwise, lockterm1 pe(0.05): regress adg (i.farm sex) worms (i.pn*ar_sev)
xi:stepwise, lockterm1 pr(0.05): regress adg (i.farm sex) worms (i.pn*ar_sev)
xi:stepwise, lockterm1 pe(0.05) pr(0.051): regress adg (i.farm sex) worms (i.pn*ar_sev)

MTB > Regress;
SUBC>   Response 'adg';
SUBC>   Nodefault;
SUBC>   Continuous 'worms';
SUBC>   Categorical 'sex' 'pn' 'ar_sev' 'farm';
SUBC>   Terms worms sex pn C14 farm pn*C14;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Stepwise;
SUBC>     Hierarchical;
SUBC>       Always;
SUBC>   Tmsdetails;
SUBC>     Full;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients.

MTB > Regress;
SUBC>   Response 'adg';
SUBC>   Nodefault;
SUBC>   Continuous 'worms';
SUBC>   Categorical 'sex' 'pn' 'ar_sev' 'farm';
SUBC>   Terms worms sex pn C14 farm pn*C14;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Stepwise;
SUBC>     Enter worms sex pn C14 farm pn*C14;
SUBC>     Hierarchical;
SUBC>       Always;
SUBC>   Tmsdetails;
SUBC>     Full;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients.

Comments:
---------
These two analyses correspond to forward and backwise stepwise
selection, with the default P-value cut-offs at 0.15 and without forcing
any predictors into the models (as was done in Stata with farm and
sex). It is also possible to modify the settings to match the analysis
in Stata exactly. The results are the same anyway for the settings used.

MTB > Let c15 = 'pn'*'ar_sev'
MTB > name c15 'pn*ar_sev'
MTB > Name C16 "farm_1" C17 "farm_2" C18 "farm_3" C19 "farm_4" C20 "farm_5" &
CONT>     C21 "farm_6" C22 "farm_7" C23 "farm_8" C24 "farm_9" C25 "farm_10" &
CONT>     C26 "farm_11" C27 "farm_12" C28 "farm_13" C29 "farm_14" C30 "farm_15".
MTB > Indicator 'farm' 'farm_1' 'farm_2' 'farm_3' 'farm_4' 'farm_5' 'farm_6' 'farm_7' &
CONT>     'farm_8' 'farm_9' 'farm_10' 'farm_11' 'farm_12' 'farm_13' 'farm_14' &
CONT>     'farm_15'.
MTB > BReg 'adg' 'sex' 'pn' 'worms' 'ar_sev' 'pn*ar_sev'  &
CONT>     'farm_2'-'farm_15';
SUBC>   Include 'farm_2'-'farm_15';
SUBC>   NVars 1 5;
SUBC>   Best 2;
SUBC>   TExpand;
SUBC>   Constant.
Best Subsets Regression: adg versus sex, pn, ... 

Response is adg
                                                               p
                                                               n
                                                               *
                                                             a a
                                                           w r r
                                                           o _ _
                                                       s   r s s
Total         R-Sq              R-Sq  Mallows          e p m e e
 Vars  R-Sq  (adj)     PRESS  (pred)       Cp       S  x n s v v
   15  58.3   56.4  787571.6    54.0     64.7  46.856          X
   15  58.2   56.3  785610.8    54.1     65.8  46.929  X
   16  62.8   60.9  708854.0    58.6     27.0  44.368  X       X
   16  61.8   59.9  726816.8    57.6     35.3  44.916  X     X
   17  63.6   61.7  696687.3    59.3     21.5  43.937  X X   X
   17  63.6   61.6  697074.2    59.3     21.7  43.949  X X     X
   18  64.2   62.2  693282.7    59.5     18.2  43.642  X X   X X
   18  63.6   61.6  701958.6    59.0     23.4  43.998  X X X X
   19  64.2   62.1  698259.1    59.2     20.0  43.699  X X X X X

At your request, the best subsets procedure included these variables in every model:
     farm_2 farm_3 farm_4 farm_5 farm_6 farm_7 farm_8 farm_9 farm_10 farm_11 farm_12
     farm_13 farm_14 farm_15

Comments: 
---------
The Best Subsets regression gives as the model with lowest Mallow Cp (and highest
predictive R-square) the the model also selected by the stepwise procedures.
Note that even though the interaction pn*ar_sev can be constructed and included 
in the model selection process as shown above, this procedure will not work
if one of the main effects turn out to be marginally non-significant. In this
case, the stepwise procedures will want to drop (or not include) this term,
but in practice one would rarely want to include an interaction term without 
both main effects present.


* Q9: forcing deleted variables back and comparing models
regress adg i.farm sex worms pn##ar_sev
estimates store full
regress adg i.farm sex pn##ar_sev
estimates store red
estimates table full red

MTB > GLM 'adg' = farm sex worms pn 'ar_sev' pn* 'ar_sev';
SUBC>   Covariates 'worms';
SUBC>   Brief 3 .
MTB > GLM 'adg' = farm sex pn 'ar_sev' pn* 'ar_sev';
SUBC>   Covariates 'worms';
SUBC>   Brief 3 .

Comments: The F-test to compare the two models must be computed
manually, except that in this particular case the reduced model has just
one predictor less than the full model. The assessment should also be
based on the parameter estimates which are therefore listed in full.


* Q11: reliability
* build model with 60% of data, predict for other 40%
generate rand=uniform()
regress adg i.farm sex pn##ar_sev if rand<0.6
predict pv, xb
pwcorr adg pv if rand <0.6
pwcorr adg pv if rand >=0.6

MTB > Regress;
SUBC>   Response 'adg';
SUBC>   Nodefault;
SUBC>   Categorical 'sex' 'pn' 'ar_sev' 'farm';
SUBC>   Terms sex pn C14 farm pn*C14;
SUBC>   Constant;
SUBC>   TExpand;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients.
Regression Analysis: adg versus sex, pn, ar_sev, farm 

Method
Categorical predictor coding  (1, 0)

Analysis of Variance
Source          DF   Seq SS  Contribution   Adj SS  Adj MS  F-Value  P-Value
Regression      18  1099088        64.19%  1099088   61060    32.06    0.000
  sex            1    59738         3.49%    70941   70941    37.25    0.000
  pn             1    50602         2.96%    18557   18557     9.74    0.002
  ar_sev         1   132468         7.74%    10593   10593     5.56    0.019
  farm          14   846018        49.41%   842940   60210    31.61    0.000
  pn*ar_sev      1    10263         0.60%    10263   10263     5.39    0.021
Error          322   613280        35.81%   613280    1905
  Lack-of-Fit   58   132367         7.73%   132367    2282     1.25    0.122
  Pure Error   264   480913        28.08%   480913    1822
Total          340  1712368       100.00%

Model Summary
      S    R-sq  R-sq(adj)   PRESS  R-sq(pred)
43.6417  64.19%     62.18%  693283      59.51%
...

Comments:
---------
The predictive R2 is 59.5% compared to the usual R2=64.2%, and thus loss in
predictive ability with leave-one-out cross-validation is not very substantial.

For information, we also include Minitab code to carry out the split-
sample analysis, although one would in most cases be happy with
referring to the predictive R2.

MTB > Base 160131.
MTB > Name c31 "ranuni"
MTB > Random 341 'ranuni';
SUBC>   Uniform 0.0 1.0.
MTB > Name c32 'randadg1'
MTB > Let 'randadg1' = 'adg'*('ranuni'<0.6)
MTB > Code (0) '*' 'randadg1' 'randadg1' 
MTB > Name C33 'randadg2'
MTB > Let 'randadg2' = 'adg'*('ranuni'>=0.6)
MTB > Code (0) '*'  'randadg2'  'randadg2'
MTB > Name C34 "FITS".
MTB > Regress;
SUBC>   Response 'randadg1';
SUBC>   Nodefault;
SUBC>   Categorical 'sex' 'pn' 'ar_sev' 'farm';
SUBC>   Terms sex pn C14 farm pn*C14;
SUBC>   Constant;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Fits 'FITS'.
Regression Analysis: randadg1 versus sex, pn, ar_sev, farm 

Method
Categorical predictor coding  (1, 0)
Rows unused                   123

Analysis of Variance
Source          DF   Adj SS  Adj MS  F-Value  P-Value
Regression      18   677690   37649    19.05    0.000
  sex            1    27598   27598    13.97    0.000
  pn             1     7313    7313     3.70    0.056
  ar_sev         1    19584   19584     9.91    0.002
  farm          14   498107   35579    18.00    0.000
  pn*ar_sev      1     1849    1849     0.94    0.335
Error          199   393244    1976
  Lack-of-Fit   54   111431    2064     1.06    0.382
  Pure Error   145   281813    1944
Total          217  1070934

Model Summary
      S    R-sq  R-sq(adj)  R-sq(pred)
44.4533  63.28%     59.96%      55.70%


Coefficients
Term         Coef  SE Coef  T-Value  P-Value   VIF
Constant    564.5     12.3    45.78    0.000
sex
  1         23.40     6.26     3.74    0.000  1.08
pn
  1        -13.51     7.02    -1.92    0.056  1.36
ar_sev
  1         -47.5     15.1    -3.15    0.002  2.64
farm
  2          15.2     15.6     0.98    0.328  1.71
  3          36.6     16.0     2.29    0.023  1.81
  4        -100.2     15.6    -6.42    0.000  1.83
  5         -31.8     18.3    -1.74    0.083  1.46
  6         -69.4     16.6    -4.17    0.000  1.59
  7         -12.7     16.3    -0.78    0.437  1.64
  8         -49.7     15.9    -3.12    0.002  2.23
  9         -53.8     16.0    -3.36    0.001  1.70
  10        -81.7     16.4    -4.99    0.000  1.66
  11        -89.5     16.0    -5.61    0.000  1.69
  12       -109.6     17.1    -6.39    0.000  1.55
  13       -112.9     16.0    -7.06    0.000  1.81
  14        -51.9     15.3    -3.39    0.001  1.96
  15         25.8     15.7     1.64    0.103  1.86
pn*ar_sev
  1 1       -18.6     19.2    -0.97    0.335  2.62

MTB > Correlation 'randadg1' 'FITS';
SUBC>   NoPValues.
Correlation: randadg1, FITS 
Pearson correlation of randadg1 and FITS = 0.795

MTB > Correlation 'randadg2' 'FITS';
SUBC>   NoPValues.
Correlation: randadg2, FITS 
Pearson correlation of randadg2 and FITS = 0.798

Comments: 
---------
Quite unusually, the correlation obtained on the validation subset is
larger than the one in the estimation subset. This can happen by chance,
but is not the general rule.

As for the Minitab coding, in most cases the easiest way to get Minitab 
to analyze a subset of the data is to subset the entire worksheet (Data-Subset
Worksheet menu). However, as the predictions are required for the entire
dataset, it is better to keep the data in one worksheet and generate two
new variables with missing values corresponding to whether the
observations are used for analysis or validation.