BOOTSTRAP WITH LINEAR REGRESSION MODEL - Example with Longley data from SYSTAT 9 help system

Linear Models

This example involves the famous Longley (1967) regression data. These real data were collected by James Longley at the Bureau of Labor Statistics to test the limits of regression software. The predictor variables in the data set are highly collinear, and several coefficients of variation are extremely large. The input is:

USE LONGLEY
GLM
     MODEL TOTAL=CONSTANT+DEFLATOR..TIME
     SAVE BOOT / COEF
     ESTIMATE / SAMPLE=BOOT(2500,16)

OUTPUT TEXT1
USE LONGLEY
MODEL TOTAL=CONSTANT+DEFLATOR..TIME
ESTIMATE

USE BOOT
STATS
     STATS X(1..6)
OUTPUT *

BEGIN
     DEN X(1..6) / NORM
     DEN X(1..6)
END
Notice that we save the coefficients into the file BOOT. We request 2500 bootstrap samples of size 16 (the number of cases in the file). Then we fit the Longley data with a single regression to compare the result with our bootstrap. Finally, we use the bootstrap file and compute basic statistics on the bootstrapped regression coefficients. The OUTPUT command saves this part of the output to a file. We should not use it earlier in the program unless we want to save the output for all 2500 regressions. To view the bootstrap distributions, we create histograms of the coefficients.
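For readers without SYSTAT, the same procedure (resample cases with replacement, refit, keep the coefficients, then summarize them) can be sketched in Python. This is only an illustration under stated assumptions: the data below are synthetic and deliberately collinear, not the Longley values, and the names ols and bootstrap_coefs are hypothetical helpers, not SYSTAT functions.

```python
import random
import statistics


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X)b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        if abs(A[piv][c]) < 1e-12:
            return None  # resample is numerically singular; skip it
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def bootstrap_coefs(X, y, n_boot, rng):
    """Refit the model on n_boot samples of cases drawn with replacement."""
    n = len(y)
    fits = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        b = ols([X[i] for i in idx], [y[i] for i in idx])
        if b is not None:
            fits.append(b)
    return fits


# Synthetic stand-in data (assumption): 16 cases, two nearly collinear predictors.
rng = random.Random(0)
x1 = [i + rng.gauss(0, 0.5) for i in range(16)]
x2 = [v + rng.gauss(0, 0.1) for v in x1]
y = [1.0 + 2.0 * a - 1.0 * b + rng.gauss(0, 1.0) for a, b in zip(x1, x2)]
X = [[1.0, a, b] for a, b in zip(x1, x2)]

full_fit = ols(X, y)                       # single regression on all cases
coefs = bootstrap_coefs(X, y, 2500, rng)   # 2500 samples of size 16

for j, name in enumerate(["CONSTANT", "X1", "X2"]):
    col = [b[j] for b in coefs]
    print(name, full_fit[j], statistics.mean(col), statistics.stdev(col))
```

The standard deviation of each bootstrap column plays the role of the "Standard Dev" row in the SYSTAT statistics output.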

The resulting output is:

 
 Variables in the SYSTAT Rectangular file are:
  DEFLATOR     GNP          UNEMPLOY     ARMFORCE     POPULATN     TIME
  TOTAL
 
 Dep Var: TOTAL   N: 16   Multiple R: 0.998   Squared multiple R: 0.995
 
 Adjusted squared multiple R: 0.992   Standard error of estimate: 304.854
 
 Effect         Coefficient    Std Error     Std Coef Tolerance     t   P(2 Tail)
 
 CONSTANT      -3482258.635   890420.384        0.0        .      -3.911    0.004
 DEFLATOR            15.062       84.915        0.046     0.007    0.177    0.863
 GNP                 -0.036        0.033       -1.014     0.001   -1.070    0.313
 UNEMPLOY            -2.020        0.488       -0.538     0.030   -4.136    0.003
 ARMFORCE            -1.033        0.214       -0.205     0.279   -4.822    0.001
 POPULATN            -0.051        0.226       -0.101     0.003   -0.226    0.826
 TIME              1829.151      455.478        2.480     0.001    4.016    0.003
 
                              Analysis of Variance
 
 Source             Sum-of-Squares   df  Mean-Square     F-ratio       P
 
 Regression           1.84172E+08     6  3.06954E+07     330.285       0.000
 Residual              836424.056     9    92936.006
 ----------------------------------------------------------------------------------------------------------------------------------
 
 
 Durbin-Watson D Statistic     2.559
 First Order Autocorrelation  -0.348
 
 Variables in the SYSTAT Rectangular file are:
  CONSTANT     X(1..6)
 
                           X(1)        X(2)        X(3)        X(4)        X(5)        X(6)
   N of cases             2500        2500        2500        2500        2500        2499
   Minimum            -816.248      -0.846     -12.994      -8.864      -2.591   -5050.438
   Maximum            1312.052       0.496       7.330       2.617    3142.235   12645.703
   Mean                 20.648      -0.049      -2.214      -1.118       1.295    1980.382
   Standard Dev        128.301       0.064       0.903       0.480      62.845     980.870

Discussion

Standard Errors

The bootstrapped standard errors are all larger than the normal-theory standard errors. The most dramatic difference is for the POPULATN coefficient (62.845 versus 0.226). It is well known that multicollinearity leads to large standard errors for regression coefficients, but the bootstrap makes this even clearer.
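The comparison can be checked directly from the two printed tables. The short Python sketch below takes the "Std Error" column of the regression output and the "Standard Dev" row of the bootstrap statistics, and assumes that X(1) through X(6) correspond to the predictors in model order (DEFLATOR through TIME), as the POPULATN figure quoted above suggests.

```python
# Normal-theory standard errors, from the regression output.
normal_theory = {
    "DEFLATOR": 84.915, "GNP": 0.033, "UNEMPLOY": 0.488,
    "ARMFORCE": 0.214, "POPULATN": 0.226, "TIME": 455.478,
}
# Bootstrap standard deviations of X(1)..X(6), assumed to be in model order.
bootstrap = {
    "DEFLATOR": 128.301, "GNP": 0.064, "UNEMPLOY": 0.903,
    "ARMFORCE": 0.480, "POPULATN": 62.845, "TIME": 980.870,
}

# Ratio of bootstrap to normal-theory standard error for each coefficient.
ratios = {name: bootstrap[name] / normal_theory[name] for name in normal_theory}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<10}{ratio:10.2f}")
```

Every ratio exceeds 1, and the POPULATN ratio is far larger than the rest.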

Following is the plot of the results:

Discussion

Distributions

Normal curves have been superimposed on the histograms, showing that the coefficients are not normally distributed. We have run a relatively large number of samples (2500) to reveal these long-tailed distributions. Were these data to be analyzed formally, it would take a huge number of samples to get useful standard errors.

Beaton, Rubin, and Barone (1976) used a randomization technique to highlight this problem. They added a uniform random extra digit to Longley’s data so that their data sets rounded to Longley’s values, and found in a simulation that the variance of the simulated coefficient estimates was larger in many cases than that of the miscalculated solutions from the more poorly designed regression programs.
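The idea behind that kind of perturbation study can be sketched in Python as follows. This is not Beaton, Rubin, and Barone's procedure or data: the integer values are made up, the extra digit is approximated by a continuous uniform draw within half a unit of the last recorded digit, and perturb and fit2 are hypothetical helper names.

```python
import random
import statistics


def perturb(values, rng):
    """Add uniform noise in [-0.5, 0.5) so each value still rounds to the original."""
    return [v + (rng.random() - 0.5) for v in values]


def fit2(x1, x2, y):
    """Slopes of y on two predictors (with intercept), via centered 2x2 normal equations."""
    m1, m2, my = statistics.mean(x1), statistics.mean(x2), statistics.mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Made-up integer data (assumption): x2 differs from x1 by at most one unit,
# so the two predictors are nearly collinear.
offsets = [0, 1, -1, 0, 1, 1, -1, 0, 0, -1, 1, 0, -1, 1, 0, -1]
x1 = [100 + 3 * i for i in range(16)]
x2 = [a + d for a, d in zip(x1, offsets)]
rng = random.Random(1)
y = [round(2 * a - b + rng.gauss(0, 2)) for a, b in zip(x1, x2)]

base = fit2(x1, x2, y)  # slopes from the data as recorded

# Refit on many data sets that all round back to the recorded values.
sims = [fit2(perturb(x1, rng), perturb(x2, rng), perturb(y, rng))
        for _ in range(1000)]
sd_b1 = statistics.stdev([b[0] for b in sims])
sd_b2 = statistics.stdev([b[1] for b in sims])

print("recorded-data slopes:", base)
print("spread under perturbation:", sd_b1, sd_b2)
```

The spread of the refitted slopes shows how much the estimates move when the data change only below their recorded precision.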



Last modified 20 Apr 2000 - from SYSTAT 9 help system