This example involves the famous Longley (1967) regression data. These real data were collected by James Longley at the Bureau of Labor Statistics to test the limits of regression software. The predictor variables in the data set are highly collinear, and several coefficients of variation are extremely large. The input is:
USE LONGLEY
GLM
MODEL TOTAL=CONSTANT+DEFLATOR..TIME
SAVE BOOT / COEF
ESTIMATE / SAMPLE=BOOT(2500,16)
OUTPUT TEXT1
USE LONGLEY
MODEL TOTAL=CONSTANT+DEFLATOR..TIME
ESTIMATE
USE BOOT
STATS
STATS X(1..6)
OUTPUT *
BEGIN
DEN X(1..6) / NORM
DEN X(1..6)
END
Notice that we save the coefficients into the file BOOT. We request
2500 bootstrap samples of size 16 (the number of cases in the file). Then
we fit the Longley data with a single regression to compare the result
with our bootstrap. Finally, we use the bootstrap file and compute basic
statistics on the bootstrapped regression coefficients. The OUTPUT
command saves this part of the output to a file. We should not
use it earlier in the program unless we want to save the output for all
2500 regressions. To view the bootstrap distributions, we plot histograms
of the coefficients.
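The resampling idea described above can be sketched outside SYSTAT. The following is a minimal Python illustration, not SYSTAT's implementation: it assumes case resampling (drawing whole rows with replacement) and fits each resample by the normal equations. The data are synthetic and deliberately collinear, not the Longley values, and the helper names `solve`, `ols`, and `bootstrap_coefs` are hypothetical.

```python
import random
import statistics

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]  # ZeroDivisionError if exactly singular
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

def bootstrap_coefs(rows, y, n_boot, size, rng):
    """Refit the model on n_boot resamples of `size` cases drawn with replacement."""
    n = len(rows)
    out = []
    while len(out) < n_boot:
        idx = [rng.randrange(n) for _ in range(size)]
        try:
            out.append(ols([rows[i] for i in idx], [y[i] for i in idx]))
        except ZeroDivisionError:
            continue  # skip a resample whose design matrix is exactly singular
    return out

# Synthetic, nearly collinear data (an assumption for illustration only).
rng = random.Random(1)
x1 = [float(i) for i in range(16)]
x2 = [v + rng.gauss(0, 0.5) for v in x1]
y = [1.0 + 2.0 * a - 1.0 * b + rng.gauss(0, 1.0) for a, b in zip(x1, x2)]
rows = [[1.0, a, b] for a, b in zip(x1, x2)]  # leading 1.0 is the constant term

full_fit = ols(rows, y)                                  # single regression
boot = bootstrap_coefs(rows, y, n_boot=500, size=16, rng=rng)
boot_sd = [statistics.stdev(col) for col in zip(*boot)]  # bootstrap standard errors
```

Here `full_fit` plays the role of the single comparison regression and `boot_sd` the role of the standard deviations computed from the saved coefficient file. The sketch is independent of the SYSTAT run shown in this example.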
The resulting output is:
Variables in the SYSTAT Rectangular file are:
 DEFLATOR  GNP  UNEMPLOY  ARMFORCE  POPULATN  TIME  TOTAL

Dep Var: TOTAL   N: 16   Multiple R: 0.998   Squared multiple R: 0.995
Adjusted squared multiple R: 0.992   Standard error of estimate: 304.854
Effect       Coefficient    Std Error  Std Coef  Tolerance        t  P(2 Tail)
CONSTANT    -3482258.635   890420.384       0.0          .   -3.911      0.004
DEFLATOR          15.062       84.915     0.046      0.007    0.177      0.863
GNP               -0.036        0.033    -1.014      0.001   -1.070      0.313
UNEMPLOY          -2.020        0.488    -0.538      0.030   -4.136      0.003
ARMFORCE          -1.033        0.214    -0.205      0.279   -4.822      0.001
POPULATN          -0.051        0.226    -0.101      0.003   -0.226      0.826
TIME            1829.151      455.478     2.480      0.001    4.016      0.003
Analysis of Variance
Source       Sum-of-Squares   df  Mean-Square  F-ratio      P
Regression      1.84172E+08    6  3.06954E+07  330.285  0.000
Residual         836424.056    9    92936.006
----------------------------------------------------------------------------------------------------------------------------------
Durbin-Watson D Statistic 2.559
First Order Autocorrelation -0.348
Variables in the SYSTAT Rectangular file are:
CONSTANT X(1..6)
                  X(1)      X(2)      X(3)      X(4)      X(5)       X(6)
N of cases        2500      2500      2500      2500      2500       2499
Minimum       -816.248    -0.846   -12.994    -8.864    -2.591  -5050.438
Maximum       1312.052     0.496     7.330     2.617  3142.235  12645.703
Mean            20.648    -0.049    -2.214    -1.118     1.295   1980.382
Standard Dev   128.301     0.064     0.903     0.480    62.845    980.870
<Discussion
Standard Errors
The bootstrapped standard errors are all larger than the normal-theory standard errors. The most dramatic difference is for the POPULATN coefficient (62.845 versus 0.226). It is well known that multicollinearity leads to large standard errors for regression coefficients, but the bootstrap makes this even clearer.>
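The comparison can be made explicit by dividing each bootstrap standard deviation by the corresponding normal-theory standard error. The short Python sketch below uses only the figures printed in the output above; it assumes that X(1) through X(6) correspond to the predictors in model order (DEFLATOR through TIME), which is consistent with the POPULATN comparison in the text.

```python
# Normal-theory standard errors from the single regression output.
normal_se = {
    "DEFLATOR": 84.915, "GNP": 0.033, "UNEMPLOY": 0.488,
    "ARMFORCE": 0.214, "POPULATN": 0.226, "TIME": 455.478,
}
# Bootstrap standard deviations of X(1)..X(6), assumed to be in the same order.
boot_sd = {
    "DEFLATOR": 128.301, "GNP": 0.064, "UNEMPLOY": 0.903,
    "ARMFORCE": 0.480, "POPULATN": 62.845, "TIME": 980.870,
}

# Ratio of bootstrap to normal-theory standard error for each coefficient.
ratios = {name: boot_sd[name] / normal_se[name] for name in normal_se}
largest = max(ratios, key=ratios.get)

for name, ratio in ratios.items():
    print(f"{name:9s} {ratio:8.2f}")
```

Every ratio exceeds 1, and the POPULATN ratio is by far the largest, at roughly 278.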
Following is the plot of the results:
<Discussion
Distributions
Normal curves have been superimposed on the histograms, showing
that the coefficients are not normally distributed. We have run a relatively
large number of samples (2500) to reveal these long-tailed distributions.
Were these data to be analyzed formally, it would take a huge number of
samples to get useful standard errors.
Beaton, Rubin, and Barone (1976) used a randomization technique
to highlight this problem. They added a uniform random extra digit to Longley’s
data so that their data sets rounded to Longley’s values and found in a
simulation that the variance of the simulated coefficient estimates was
larger in many cases than the miscalculated solutions from the more poorly
designed regression programs.>
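One reading of the "uniform random extra digit" idea can be sketched as follows. This is an interpretation for illustration, not Beaton, Rubin, and Barone's actual procedure: each recorded value is assumed to be an integer, a random digit is appended one decimal place beyond it, and the result still rounds (half up) to the recorded value. The sample values and the function names `add_extra_digit` and `round_half_up` are hypothetical.

```python
import math
import random

def add_extra_digit(value, rng):
    """Append one uniform random digit so the result still rounds to value."""
    digit = rng.randrange(10)            # uniform on 0..9
    return value + (digit - 5) / 10.0    # offsets -0.5, -0.4, ..., 0.4

def round_half_up(x):
    """Round to the nearest integer, with halves rounded upward."""
    return math.floor(x + 0.5)

rng = random.Random(0)
recorded = [120.0, 4515.0, 98231.0]      # made-up values, not Longley's data

# Each perturbed data set is indistinguishable from the original after rounding.
perturbed = [add_extra_digit(v, rng) for v in recorded]
rounded_back = [round_half_up(v) for v in perturbed]
```

Refitting the regression on many such perturbed data sets would show how much the coefficients move under changes smaller than the precision of the recorded data.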