Solution file for additional exercise 2.6
-----------------------------------------
The purpose of this solution file is to demonstrate and comment on
Minitab commands that correspond to the Stata solutions presented in VHM
812 for linear regression exercises 1-3. Output is only included for
demonstration of differences between Stata and Minitab.

Exercise 1:
-----------
* open the btb_episodes data set
use btb_episodes.dta, clear

MTB > WOpen "h:\VHM\VHM802\Data_csv\hs02_6.csv";
SUBC>   FType;
SUBC>     CSV;
SUBC>   DecSep;
SUBC>     Period;
SUBC>   Field;
SUBC>     Comma;
SUBC>   TDelimiter;
SUBC>     DoubleQuote.
Retrieving worksheet from file: 'h:\VHM\VHM802\Data_csv\hs02_6.csv'
Worksheet was saved on 12/01/2011

* Q1a: distribution of outcome -intvl-
codebook intvl
histogram intvl

MTB > GSummary 'intvl'.
Summary for intvl 

* Q1b: natural log transformed outcome
generate intvl_ln=ln(intvl)
histogram intvl_ln

MTB > Name C7 'intvl_ln'
MTB > Let 'intvl_ln' = ln('intvl')
MTB > GSummary  'intvl_ln'.
Summary for intvl_ln 


* Q2: simple linear regressions
regress intvl_ln p_rct
regress intvl_ln p_year
regress intvl_ln hdsize

MTB > Regress 'intvl_ln' 1 'p_rct';
SUBC>   Constant;
SUBC>   Brief 1.
Regression Analysis: intvl_ln versus p_rct 
MTB > Regress 'intvl_ln' 1  'p_year';
SUBC>   Constant;
SUBC>   Brief 1.
Regression Analysis: intvl_ln versus p_year 
MTB > Regress 'intvl_ln' 1  'hdsize';
SUBC>   Constant;
SUBC>   Brief 1.
Regression Analysis: intvl_ln versus hdsize 


* Q3: multiple linear regression
regress intvl_ln p_rct p_year hdsize

MTB > Regress 'intvl_ln' 3 'p_rct' 'p_year'  'hdsize';
SUBC>   Constant;
SUBC>   Brief 1.
Regression Analysis: intvl_ln versus p_rct, p_year, hdsize 

* added calculation of correlation between p_rct and hdsize
pwcorr p_rct hdsize

MTB > Correlation 'p_rct' 'hdsize'.
Correlations: p_rct, hdsize 

regress intvl_ln hdsize
estimates store smpl
regress intvl_ln p_rct p_year hdsize
estimates store mltpl
estimates table smpl mltpl

Note: It is possible in Minitab to store estimates in a column in the
worksheet, but any further processing needs to be done manually.


* Q4: predicion intervals for means and individuals (based only on hdsize)
regress intvl_ln hdsize
* doing the calculations yourself
predict pv, xb
predict pv_mean_se, stdp
scalar tstar=invttail(2985,.025) /* using DFE */
generate pv_mean_u=pv + tstar*pv_mean_se
generate pv_mean_l=pv - tstar*pv_mean_se
predict pv_ind_se, stdf
generate pv_ind_u=pv + tstar*pv_ind_se
generate pv_ind_l=pv - tstar*pv_ind_se
twoway (scatter intvl_ln hdsize, msize(vsmall)) (line pv hdsize) ///
  (line pv_mean_u hdsize) (line pv_mean_l hdsize) ///
  (line pv_ind_u hdsize) (line pv_ind_l hdsize)
* using the built in graphics capabilities (only simple linear regression)
twoway (scatter intvl_ln hdsize, sort msize(vsmall)) ///
  (lfitci intvl_ln hdsize, ciplot(rline)) ///
  (lfitci intvl_ln hdsize, stdf ciplot(rline))

MTB > Fitline 'intvl_ln' 'hdsize';
SUBC>   Confidence 95.0;
SUBC>   Ci;
SUBC>   Pi.
Regression Analysis: intvl_ln versus hdsize 
MTB > Name c8 "PFIT1" c9 "CLIM1" c10 "CLIM2" c11 "PLIM1" c12 "PLIM2"
MTB > Regress 'intvl_ln' 1 'hdsize';
SUBC>   Constant;
SUBC>   Predict 'hdsize';
SUBC>     PFits 'PFIT1';
SUBC>     CLimits 'CLIM1'-'CLIM2';
SUBC>     PLimits 'PLIM1'-'PLIM2';
SUBC>   Brief 1.
Regression Analysis: intvl_ln versus hdsize 


* Q5: R2
regress intvl_ln p_year hdsize
regress intvl_ln p_rct p_year hdsize

MTB > Regress 'intvl_ln' 2 'p_rct' 'hdsize';
SUBC>   Constant;
SUBC>   Brief 1.
Regression Analysis: intvl_ln versus p_rct, hdsize 
MTB > Regress 'intvl_ln' 3 'p_rct' 'p_year'  'hdsize';
SUBC>   Constant;
SUBC>   Brief 1.
Regression Analysis: intvl_ln versus p_rct, p_year, hdsize 


Exercise 2:
-----------
* open the btb_episodes data set
use btb_episodes.dta, clear

MTB > WOpen "h:\VHM\VHM802\Data_csv\hs02_6.csv";
SUBC>   FType;
SUBC>     CSV;
SUBC>   DecSep;
SUBC>     Period;
SUBC>   Field;
SUBC>     Comma;
SUBC>   TDelimiter;
SUBC>     DoubleQuote.
Retrieving worksheet from file: 'h:\VHM\VHM802\Data_csv\hs02_6.csv'
Worksheet was saved on 12/01/2011

* compute natural log transformed outcome
generate intvl_ln=ln(intvl)

MTB > Name C7 'intvl_ln'
MTB > Let 'intvl_ln' = ln('intvl')


* Q2: -p_year- as sole predictor
regress intvl_ln p_year	/* continuous predictor */
regress intvl_ln i.p_year /* categorical predictor */

MTB > Regress 'intvl_ln' 1 'p_year';
SUBC>   Constant;
SUBC>   Brief 1.
Regression Analysis: intvl_ln versus p_year

MTB > GReg 'intvl_ln' = 'p_year';
SUBC>   Categorical 'p_year';
SUBC>   Constant;
SUBC>   Confidence 95.0;
SUBC>   Coding 1;
SUBC>     References 'p_year' 1989;
SUBC>   TCoef;
SUBC>     CI;
SUBC>   TSummary;
SUBC>   TANOVA.
General Regression Analysis: intvl_ln versus p_year 

Coefficients

Term          Coef   SE Coef        T      P         95% CI
Constant   6.95797  0.062422  111.466  0.000  ( 6.83557,  7.08036)
p_year
  1990     0.02120  0.078812    0.269  0.788  (-0.13334,  0.17573)
  1991    -0.00877  0.081922   -0.107  0.915  (-0.16940,  0.15186)
  1992     0.12456  0.085360    1.459  0.145  (-0.04281,  0.29193)
  1993     0.27241  0.094199    2.892  0.004  ( 0.08771,  0.45711)
  1994     0.21709  0.106276    2.043  0.041  ( 0.00871,  0.42548)
  1995     0.01496  0.110415    0.136  0.892  (-0.20153,  0.23146)
  1996     0.12827  0.105811    1.212  0.226  (-0.07920,  0.33574)
  1997    -0.02352  0.117306   -0.200  0.841  (-0.25352,  0.20649)
  1998    -0.38491  0.118468   -3.249  0.001  (-0.61720, -0.15262)
  1999    -0.35333  0.104693   -3.375  0.001  (-0.55861, -0.14805)
  2000    -0.21752  0.113130   -1.923  0.055  (-0.43934,  0.00430)
  2001    -0.14647  0.134597   -1.088  0.277  (-0.41038,  0.11744)
  2002    -0.45370  0.130723   -3.471  0.001  (-0.71001, -0.19738)
  2003    -0.51631  0.133918   -3.855  0.000  (-0.77889, -0.25373)
  2004    -0.69396  0.150912   -4.598  0.000  (-0.98987, -0.39806)
  2005    -0.94842  0.169650   -5.590  0.000  (-1.28107, -0.61578)
  2006    -1.38073  0.169650   -8.139  0.000  (-1.71337, -1.04809)
  2007    -1.91050  0.607348   -3.146  0.002  (-3.10136, -0.71963)

Summary of Model

S = 1.04639      R-Sq = 7.39%        R-Sq(adj) = 6.83%
PRESS = 3296.20  R-Sq(pred) = 6.47%

Analysis of Variance

Source        DF   Seq SS   Adj SS   Adj MS        F  P
Regression    18   260.39   260.39  14.4659  13.2118  0
  p_year      18   260.39   260.39  14.4659  13.2118  0
Error       2981  3263.97  3263.97   1.0949
Total       2999  3524.35

Note: In order to get the model parametrized in the same way as in Stata
with the first category as the reference level, one needs to set the
"Type of coding for categorical predictors" to (1,0), in the Options
submenu. Otherwise the parameters will be determined from the
restriction that the sum of all parameters across categories equals 0.
This parametrization is also used in General Linear Model menu (with no
alternatives provided).

* improving interpretability of intercept
generate pyear_1989=p_year-1989
regress intvl_ln pyear_1989	

MTB > Let c8 = 'p_year'-1989
MTB > Name C8 'pyear_1989'
MTB > GReg 'intvl_ln' =  'pyear_1989';
SUBC>   Constant;
SUBC>   Confidence 95.0;
SUBC>   TCoef;
SUBC>     CI;
SUBC>   TSummary;
SUBC>   TANOVA.
General Regression Analysis: intvl_ln versus pyear_1989 

Coefficients

Term            Coef    SE Coef        T      P         95% CI
Constant     7.14404  0.0294498  242.584  0.000  ( 7.08630,  7.20179)
pyear_1989  -0.04784  0.0041802  -11.445  0.000  (-0.05604, -0.03965)

Summary of Model

S = 1.06130      R-Sq = 4.19%        R-Sq(adj) = 4.15%
PRESS = 3380.97  R-Sq(pred) = 4.07%

Analysis of Variance

Source           DF   Seq SS   Adj SS   Adj MS        F          P
Regression        1   147.55   147.55  147.547  130.996  0.0000000
  pyear_1989      1   147.55   147.55  147.547  130.996  0.0000000
Error          2998  3376.81  3376.81    1.126
  Lack-of-Fit    17   112.84   112.84    6.638    6.062  0.0000000
  Pure Error   2981  3263.97  3263.97    1.095
Total          2999  3524.35

* manual computation of lack of fit test
scalar F=(3376.80672-3263.96797)/(2998-2981)/1.09492384
display "lack-of-fit F-test:  F=" F "   P-value=" Ftail(2998-2981,2981,F)

The lack of fit test was already computed above (listed in the ANOVA
table). The P-value could also be computed manually in Minitab as

MTB > CDF 6.0621326;
SUBC>   F 17 2981.
Cumulative Distribution Function 

F distribution with 17 DF in numerator and 2981 DF in denominator

      x  P( X <= x )
6.06213            1


* Q3: linearity of -p_year- vs -intvl_ln- relationship
twoway (scatter intvl_ln p_year, msize(vsmall)) (lowess intvl_ln p_year) 
generate pyear_sq = p_year^2
regress intvl_ln p_year pyear_sq

MTB > Plot 'intvl_ln'*'p_year';
SUBC>   Symbol;
SUBC>   Lowess.
Scatterplot of intvl_ln vs p_year 
MTB > Name C9 'pyear_sq'
MTB > Let 'pyear_sq' = 'p_year'**2
MTB > GReg 'intvl_ln' = 'p_year'  'pyear_sq';
SUBC>   Constant;
SUBC>   Confidence 95.0;
SUBC>   TCoef;
SUBC>     VIF;
SUBC>   TSummary;
SUBC>   TANOVA.
General Regression Analysis: intvl_ln versus p_year, pyear_sq 


* Q4: evaluating collinearity (VIFs)
regress intvl_ln p_year pyear_sq
estat vif
generate pyear_ct=p_year-1999
replace pyear_sq=pyear_ct^2
regress intvl_ln pyear_ct pyear_sq
estat vif

MTB > Name C10 'pyear_ct'
MTB > Let 'pyear_ct' = 'p_year'-1999
MTB > Name C11 'pyear_ct_sq'
MTB > Let 'pyear_ct_sq' = 'pyear_ct'**2
MTB > GReg 'intvl_ln' =  'pyear_ct' 'pyear_ctsq';
SUBC>   Constant;
SUBC>   Confidence 95.0;
SUBC>   TCoef;
SUBC>     VIF;
SUBC>     CI;
SUBC>   TSummary;
SUBC>   TANOVA.
General Regression Analysis: intvl_ln versus pyear_ct, pyear_ctsq 

Note: see above for the VIF in the uncentered model.


* Q5: evaluating confounding
regress intvl_ln hdsize	/* model without year */
regress intvl_ln pyear_ct pyear_sq hdsize /* model with year */
display "percent change=" abs(-.0002863-(-.0010355))/abs(-.0010355)*100
* added assessment of association hdsize vs year (should really adjust for herd)
regress hdsize p_year

MTB > Regress 'intvl_ln' 1 'hdsize';
SUBC>   Constant;
SUBC>   Brief 1.
Regression Analysis: intvl_ln versus hdsize 
MTB > Regress 'intvl_ln' 3 'hdsize' 'pyear_ct' 'pyear_ct_sq';
SUBC>   Constant;
SUBC>   Brief 1.
Regression Analysis: intvl_ln versus hdsize, pyear_ct, pyear_ct_sq 
MTB > Regress 'hdsize' 1 'p_year';
SUBC>   Constant;
SUBC>   Brief 1.
Regression Analysis: hdsize versus p_year 


* Q6: interaction
egen pyear_c2=cut(p_year), at(0 1999 2999) icodes
tab pyear_c2 p_year /* just checking */
regress intvl_ln pyear_c2##c.p_rct

MTB > name c12 'pyear_c2'
MTB > Code (0:1998) 0 (1999:2999) 1 'p_year'  'pyear_c2'
MTB > GReg 'intvl_ln' = 'pyear_c2' 'p_rct' 'pyear_c2'* 'p_rct';
SUBC>   Constant;
SUBC>   Confidence 95.0;
SUBC>   TCoef;
SUBC>     CI;
SUBC>   TSummary;
SUBC>   TANOVA.
General Regression Analysis: intvl_ln versus pyear_c2, p_rct 


* Q7: model building 
regress intvl_ln hdsize pyear_c2##c.p_rct
margins  pyear_c2, at(hdsize=50 p_rct=(0(10)40)) expression(exp(predict(xb)))
marginsplot

MTB > Name c13 "newyear"
MTB > Set 'newyear'
DATA>   1( 0 : 1 / 1 )5
DATA>   End.
MTB > Name c14 'newrct'
MTB > Set 'newrct'
DATA>   2( 0 : 40 / 10 )1
DATA>   End.
MTB > Name c15 "newhds"
MTB > Set 'newhds'
DATA>   1( 50 : 50 / 1 )10
DATA>   End.
MTB > Name c16 "PFIT1" c17 "CLIM1" c18 "CLIM2" c19 "PLIM1" c20 "PLIM2"
MTB > GReg 'intvl_ln' = 'pyear_c2' 'p_rct' 'pyear_c2'* 'p_rct' hdsize;
SUBC>   Constant;
SUBC>   Confidence 95.0;
SUBC>   PContinuous 'newyear' 'newrct' 'newhds';
SUBC>   PFits 'PFIT1';
SUBC>   CLimit 'CLIM1'-'CLIM2';
SUBC>   PLimit 'PLIM1'-'PLIM2';
SUBC>   TCoef;
SUBC>     CI;
SUBC>   TSummary;
SUBC>   TANOVA.
General Regression Analysis: intvl_ln versus pyear_c2, p_rct, hdsize 

This produces columns with predicted values as well as lower and upper
bounds for confidence and prediction intervals for the sets of predictor
values specified in columns c13-c15. These can be plotted on transformed 
scale or with some further manipulation on original scale.


Exercise 3:
-----------
* open the btb_episodes data set
use btb_episodes_data.dta, clear
* compute extra variables needed
generate intvl_ln=ln(intvl)
generate pyear_ct=p_year-2000
generate pyear_sq=pyear_ct^2
sort herd p_year
gen id=_n

MTB > WOpen "h:\VHM\VHM802\Data_csv\hs02_6.csv";
SUBC>   FType;
SUBC>     CSV;
SUBC>   DecSep;
SUBC>     Period;
SUBC>   Field;
SUBC>     Comma;
SUBC>   TDelimiter;
SUBC>     DoubleQuote.
Retrieving worksheet from file: 'h:\VHM\VHM802\Data_csv\hs02_6.csv'
Worksheet was saved on 12/01/2011
MTB > Sort 'herd'-'p_year' 'herd'-'p_year';
SUBC>   By 'herd' 'p_year'.
MTB > Set c7
DATA>   1( 1 : 3000 / 1 )1
DATA>   End.
MTB > name c7 'id'
MTB > Name C8 'intvl_ln'
MTB > Let 'intvl_ln' = ln('intvl')
MTB > Name C9 'pyear_ct'
MTB > Let 'pyear_ct' = 'p_year'-2000
MTB > Name C10 'pyear_sq'
MTB > Let 'pyear_sq' = 'pyear_ct'**2

* run model
regress intvl_ln p_rct hdsize pyear_ct pyear_sq

MTB > Regress 'intvl_ln' 4 'p_rct' 'hdsize' 'pyear_ct' 'pyear_sq';
SUBC>   Constant;
SUBC>   Brief 2.
Regression Analysis: intvl_ln versus p_rct, hdsize, pyear_ct, pyear_sq 


* Q1: evaluate homoscedasticity
estat hettest
estat imtest
* residual plot using standardised residuals
predict stdres, rstandard
predict fit, xb
scatter stdres fit

MTB > Name c11 "SRES1"
MTB > Regress 'intvl_ln' 4 'p_rct' 'hdsize' 'pyear_ct' 'pyear_sq';
SUBC>   SResiduals 'SRES1';
SUBC>   GFourpack;
SUBC>   RType 2;
SUBC>   Constant;
SUBC>   Brief 1.
Regression Analysis: intvl_ln versus p_rct, hdsize, pyear_ct, pyear_sq 

Note: Minitab does not offer tests for heteroscedasticity; this is no
big deficit because such tests are often less useful than graphical tools
based on the residuals.


* Q2: evaluate normality
* first graphically
histogram stdres, normal
qnorm stdres
summarize stdres, detail
* then with a formal test
swilk stdres

MTB > GSummary 'SRES1'.
Summary for SRES1 

MTB > NormTest 'SRES1';
SUBC>   RJTest.
Probability Plot of SRES1 

Note: The normal plot was produced above. Normal plots in Minitab differ
from those in Stata by having the y-axis and x-axis switched. Also,
Minitab does not offer Shapiro-Wilks test for normality, but the Ryan-Joiner 
test is "similar". P-values for normality tests on residuals can never 
be trusted exactly anyway because they require independent observations,
and residuals are not independent.


* Q3: attempts to correct the overall fit
* starting with the original -intvl- variable
* use Box-Cox analysis to identify the optimal transformation
boxcox intvl p_rct hdsize pyear_ct pyear_sq
...

Note: We only illustrate here only the Box-Cox analysis in Minitab. The 
subsequent model checks are done in Minitab in the same way as before, 
just with a different outcome.

MTB > Name c12 "SRES2"
MTB > GReg 'intvl' = 'p_rct' hdsize 'pyear_ct' 'pyear_sq';
SUBC>   Constant;
SUBC>   Confidence 95.0;
SUBC>   Coding -1;
SUBC>   Boxcox;
SUBC>   GFourpack;
SUBC>   RType 2 ;
SUBC>   TEquation;
SUBC>   TCoef;
SUBC>     CI;
SUBC>   TSummary;
SUBC>   TANOVA;
SUBC>   SResiduals 'SRES2'.
General Regression Analysis: intvl versus p_rct, hdsize, pyear_ct, pyear_sq 

Box-Cox transformation of the response with rounded lambda = 0.164565
The 95% CI for lambda is (0.125, 0.195)

Regression Equation

intvl^0.164565  =  3.05155 + 0.00680397 p_rct - 0.000462158 hdsize - 0.0544861 pyear_ct -
                   0.00373416 pyear_sq

2987 cases used, 13 cases contain missing values

Coefficients

Term          Coef    SE Coef        T      P         95% CI
Constant   3.05155  0.0222785  136.973  0.000  ( 3.00786,  3.09523)
p_rct      0.00680  0.0017604    3.865  0.000  ( 0.00335,  0.01026)
hdsize    -0.00046  0.0002229   -2.073  0.038  (-0.00090, -0.00003)
pyear_ct  -0.05449  0.0041686  -13.071  0.000  (-0.06266, -0.04631)
pyear_sq  -0.00373  0.0004756   -7.851  0.000  (-0.00467, -0.00280)

Summary of Model

S = 0.527968     R-Sq = 7.28%        R-Sq(adj) = 7.16%
PRESS = 833.808  R-Sq(pred) = 6.99%

Analysis of Variance

Source           DF   Seq SS   Adj SS   Adj MS        F         P
Regression        4   65.260   65.260  16.3150   58.529  0.000000
  p_rct           1    1.192    4.164   4.1639   14.938  0.000113
  hdsize          1    4.103    1.198   1.1979    4.297  0.038260
  pyear_ct        1   42.782   47.623  47.6227  170.844  0.000000
  pyear_sq        1   17.183   17.183  17.1832   61.644  0.000000
Error          2982  831.234  831.234   0.2788
  Lack-of-Fit  2434  665.376  665.376   0.2734    0.903  0.939719
  Pure Error    548  165.858  165.858   0.3027
Total          2986  896.494
 
Residual Plots for intvl 

Note: The Box-Cox analysis results in a choice of the "preferable"
transformation, and a model is then fitted for the transformed variable.
In this case, the choice did not involve any rounding off (of lambda),
so the listing includes the strictly optimal transformation.


* Q4: diagnostics
* refit model
regress intvl_ln p_rct hdsize pyear_ct pyear_sq
* examine detailed aspects of fit for the original model
* generate necessary residuals and fit statistics
capture drop fit stdres
predict fit, xb
predict stdres, rstandard
predict delres, rstudent
predict cook, cooksd
predict lev, leverage
predict dfit, dfit

MTB > Name c13 "SRES3" c14 "TRES3" c15 "HI3" c16 "COOK3" c17 "DFIT3"
MTB > Regress 'intvl_ln' 4 'p_rct' 'hdsize' 'pyear_ct' 'pyear_sq';
SUBC>   SResiduals 'SRES3';
SUBC>   Tresiduals 'TRES3';
SUBC>   Hi 'HI3';
SUBC>   Cookd 'COOK3';
SUBC>   DFits 'DFIT3';
SUBC>   Constant;
SUBC>   Brief 1.
Regression Analysis: intvl_ln versus p_rct, hdsize, pyear_ct, pyear_sq 

Note: Minitab does not offer dfbeta statistics. The list of "extreme"
residuals (defined as abs(stdres)>=2) and leverages has been turned off
because for a large dataset, the list becomes very long and mostly a
nuisance.

* largest outlier(s)
summarize stdres, d
sort stdres
list id intvl_ln p_rct hdsize p_year fit stdres delres if abs(stdres)>2.7 & stdres~=., noobs

MTB > Describe 'SRES3' 'TRES3' 'HI3' 'COOK3' 'DFIT3';
SUBC>   Mean;
SUBC>   Median;
SUBC>   Minimum;
SUBC>   Maximum;
SUBC>   N;
SUBC>   NMissing.
Descriptive Statistics: SRES3, TRES3, HI3, COOK3, DFIT3 

Note: Minitab does not allow easy listing of observations subject to
certain conditions. One option is to subset the dataset suitably, e.g. as
follows:

MTB > Subset;
SUBC>   Where "abs('SRES3')>2.5  And abs('SRES3') <> '*'";
SUBC>   Name "Subset of hs02_6.csv";
SUBC>   NoMatrices;
SUBC>   NoConstants;
SUBC>   Include.
Subset worksheet Subset of hs02_6.csv created.

* outlier test based on deletion residuals
display 2*nobs*ttail(nobs-nparam-1, 2.967837) /* P=1, so NS */
display invttail(nobs-nparam-1,.025/nobs)

Note: These tests can be computed semi-manually in Minitab as well. To
illustrate, we compute the cut-off for significance of the test (using
DF=2987-6=2981 and p=0.025/2987=0.00000837)

MTB > InvCDF .00000837;
SUBC>   T 2981.
Inverse Cumulative Distribution Function 

* compute threshold values
display "leverage cutoff:  " 2*nparam/nobs
display "conservative leverage cutoff:  " 3*nparam/nobs
display "Cook's D cutoff:   " 4/nobs
display "DFITS cutoff:   " 2*sqrt(nparam/nobs)*(nobs>=120)+1*(nobs<120)
display "DFBETA cutoff:   " 2/sqrt(nobs)*(nobs>=120)+1*(nobs<120)

MTB > let k1=2*5/2987
MTB > name k1 'thres_lev'
MTB > let k2=3*5/2987
MTB > name k2 'thres2_lev'
MTB > let k3=4/2987
MTB > name k3 'thres_cookd'
MTB > let k4=2*sqrt(5/2987)*(2987>=120)+(2987<120)
MTB > Print 'thres_lev' 'thres2_lev' 'thres_cookd' 'thres_dfits'.
Data Display 

Note: One could also introduce constants 'nobs' and 'nparam' and use
those in the formulae.

* refit the model two ways
* first, leave out cooksd > 0.007 (n=5)
regress intvl_ln p_rct hdsize pyear_ct pyear_sq if cook<=0.007
* R2 goes up slightly, coef. don't change
* second, leave out herds with high values of p_rct
regress intvl_ln p_rct hdsize pyear_ct pyear_sq if p_rct<=50
* very little change in model

* ADVANCED MATERIAL - suggest using robust standard errors (Chapter 20)
* to counteract problems of heteroscedasticity and non-normality
regress intvl_ln p_rct hdsize pyear_ct pyear_sq, robust

Note: Minitab does not offer robust standard errors, but this is indeed
an advanced topic and such procedures should be used with utmost care.