Soci709 (formerly 209)  Module 13 - THE BOOTSTRAP

Resources: ALSM5e pp. 458-464; ALSM4e pp. 429-434

1.  PRINCIPLE OF THE BOOTSTRAP

The bootstrap is a computer-intensive method developed by Bradley Efron and others to derive standard errors of estimate from the information in the sample, and thus to do statistical inference (hypothesis tests and confidence intervals), even in nonstandard estimation situations for which no analytical method is known.
The scope of the bootstrap is much broader than linear regression models; it has been used to do statistical inference with a wide variety of estimators for which no analytical standard errors are available.  Efron's standard illustration of the bootstrap is the calculation of a standard error of estimate for the correlation of average GPA with average LSAT scores of applicants for a sample of 15 U.S. law schools (out of a total population of 82).  The following exhibits are taken from a Scientific American article by Diaconis & Efron (1983; see also Efron & Tibshirani 1986, 1993).

The next exhibit illustrates the concept of sampling variation in estimating the correlation with n=15.

Exhibit:  Sampling variation of r with n=15 (Diaconis & Efron 1983 p. 122)
The original sample of 15 schools is shown.
Exhibit:  Scatterplot of average GPA against average LSAT score (Diaconis & Efron 1983 p. 118)
The principle of the bootstrap is to select a large number of samples of size n with replacement from the original sample.  The samples are called bootstrap samples.  Some cases are typically represented more than once in a bootstrap sample.
Exhibit:  Principle of the bootstrap (Diaconis & Efron 1983 p. 119)
The standard error of estimate of the parameter (here the correlation) is derived from the observed frequency distribution of the estimates in the bootstrap samples (the bootstrap distribution).
Exhibit:  Frequency distribution of r with 1,000 bootstrap samples (Diaconis & Efron 1983 p. 120)
The bootstrap distribution can be shown to be a highly accurate approximation of the true sampling distribution of the estimator.
Exhibit:  Comparison of bootstrap distribution of r with true distribution & with analytically derived distribution (Diaconis & Efron 1983 p. 123)
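In code, the principle amounts to drawing the n cases with replacement many times and recomputing the statistic each time.  The following is a minimal sketch in Python (not the SYSTAT or STATA programs used below); the 15 (LSAT, GPA) pairs are simulated stand-ins, not the actual law school values:

  import numpy as np

  rng = np.random.default_rng(12345)

  # Stand-in data: 15 (LSAT, GPA) pairs simulated to be positively correlated;
  # these are NOT the actual law school values from Diaconis & Efron (1983)
  lsat = rng.normal(600, 40, size=15)
  gpa = 1.0 + 0.004 * lsat + rng.normal(0, 0.08, size=15)

  r_obs = np.corrcoef(lsat, gpa)[0, 1]      # estimate from the original sample

  B = 1000                                  # number of bootstrap samples
  n = len(lsat)
  r_star = np.empty(B)
  for b in range(B):
      idx = rng.integers(0, n, size=n)      # draw n cases with replacement
      r_star[b] = np.corrcoef(lsat[idx], gpa[idx])[0, 1]

  # the bootstrap standard error is the standard deviation of the r*
  print(f"r = {r_obs:.3f}, bootstrap SE = {r_star.std(ddof=1):.3f}")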

2.  THE BOOTSTRAP IN REGRESSION ANALYSIS

1.  Fixed X versus Random X Sampling

In the regression context, bootstrap sampling is done in two different ways:

2.  Fixed X Sampling

Fixed X sampling can be used when the regression model is a good fit for the data and the error terms have constant variance, so that the residuals can be treated as interchangeable across cases.  The fixed X sampling procedure (sketched in code below) is:
  1. fit the original regression
  2. calculate the ei
  3. sample e1*, e2*, ..., en* from the ei with replacement and "reconstitute" observations as Yi* = ^Yi + ei* for each original ^Yi (i = 1, ..., n)
  4. regress Yi* on the original Xi (corresponding to that ^Yi), obtaining b*
  5. do 3. to 4. "many times" (see below)
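A minimal Python sketch of the fixed X procedure, with simulated stand-in data in place of an actual regression problem:

  import numpy as np

  rng = np.random.default_rng(1)

  # simulated stand-in data: intercept plus one predictor
  n = 50
  X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
  Y = X @ np.array([2.0, 0.5]) + rng.normal(0, 1, n)

  # steps 1.-2.: fit the original regression and compute the residuals e_i
  b = np.linalg.lstsq(X, Y, rcond=None)[0]
  Yhat = X @ b
  e = Y - Yhat

  # steps 3.-5.: resample residuals, reconstitute Y*, refit on the original X
  B = 1000
  b_star = np.empty((B, X.shape[1]))
  for i in range(B):
      e_star = rng.choice(e, size=n, replace=True)   # sample the e_i with replacement
      b_star[i] = np.linalg.lstsq(X, Yhat + e_star, rcond=None)[0]

  print("bootstrap SEs:", b_star.std(axis=0, ddof=1))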

3.  Random X Sampling

Random X sampling can be used when there is doubt about the adequacy of the regression model, when the error variance is not constant, or when the X values cannot be regarded as fixed.  The random X sampling procedure (also sketched in code below) is:
  1. fit the original regression, obtaining b
  2. sample n cases from the original sample with replacement, obtaining n sets (X*, Y*)
  3. regress Y* on X*, obtaining b*
  4. repeat 2. to 3. "many times" (see below)
The bootstrap distribution (the distribution of the estimates obtained from the bootstrap samples) for a regression coefficient looks like the following exhibit:
Exhibit:  Bootstrapping with random X sampling  (NKNW Table 10.8 & Figure 10.9 p. 433)
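A minimal Python sketch of the random X procedure, again with simulated stand-in data:

  import numpy as np

  rng = np.random.default_rng(2)

  # simulated stand-in data
  n = 50
  x = rng.uniform(0, 10, n)
  Y = 2.0 + 0.5 * x + rng.normal(0, 1, n)
  X = np.column_stack([np.ones(n), x])

  # step 1.: fit the original regression, obtaining b
  b = np.linalg.lstsq(X, Y, rcond=None)[0]

  # steps 2.-4.: resample whole cases (X*, Y*) with replacement and refit
  B = 1000
  b_star = np.empty((B, X.shape[1]))
  for i in range(B):
      idx = rng.integers(0, n, size=n)               # n cases drawn with replacement
      b_star[i] = np.linalg.lstsq(X[idx], Y[idx], rcond=None)[0]

  print("b:", b)
  print("bootstrap SEs:", b_star.std(axis=0, ddof=1))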

4.  How Many is "Many Times"?

Depending on the problem and the method of estimating s*{bk*}:
  1. to calculate s*{bk*} from the bootstrap distribution, from 50 to 200 bootstrap samples are sufficient
  2. to use the percentile method, 1,000 or more may be needed (because the confidence limits depend on the tails of the bootstrap distribution)
  3. in some situations it is possible to recalculate s*{bk*} as the number of bootstrap samples increases and to stop adding samples when s*{bk*} stabilizes (see the sketch after this list)
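A rough sketch of the stopping rule in point 3, in Python; boot_estimate() is a hypothetical stand-in for one resample-and-refit step:

  import numpy as np

  rng = np.random.default_rng(3)

  def boot_estimate():
      # hypothetical stand-in: in practice this would carry out one
      # resample-and-refit step and return the bootstrap estimate of b_k
      return rng.normal(0.5, 0.1)

  estimates = []
  prev_se = None
  batch = 50       # add bootstrap samples in batches of 50
  tol = 0.001      # stop when s*{bk*} changes by less than this

  while True:
      estimates.extend(boot_estimate() for _ in range(batch))
      se = np.std(estimates, ddof=1)
      if prev_se is not None and abs(se - prev_se) < tol:
          break
      prev_se = se

  print(f"stopped after {len(estimates)} samples, s* = {se:.4f}")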

3.  COMPUTING  s*{bk*}  FROM THE BOOTSTRAP DISTRIBUTION

1.  Normal Approximation (a.k.a. Naive Bootstrap)

Calculate s*{bk*} as the standard deviation of the bk*.  Then calculate a CI for bk assuming that the sampling distribution of bk is normal, as
CI{bk} = (bk(obs) - t(1-a/2, B-1)s*{bk*}, bk(obs) + t(1-a/2, B-1)s*{bk*})
where bk(obs) is the estimate of the regression coefficient from the original sample, B is the number of bootstrap samples, and t(1-a/2, B-1) is the 100(1-a/2) percentile of the Student t distribution with B-1 df.  (For reasons that are not yet entirely clear to me this method is sometimes called "naive".)
Another quantity associated with this approach is the estimated bias, calculated as the mean of the bootstrap estimates of bk minus bk(obs).
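A minimal Python sketch of the normal-approximation calculation, following the formula above; bk(obs) and the bootstrap estimates are simulated stand-ins:

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(4)

  b_obs = 0.52                                # made-up original-sample estimate of b_k
  b_star = rng.normal(0.50, 0.08, size=1000)  # stand-in bootstrap estimates of b_k

  B = len(b_star)                             # number of bootstrap samples
  se = b_star.std(ddof=1)                     # s*{bk*}: SD of the bootstrap estimates
  alpha = 0.05
  t_crit = stats.t.ppf(1 - alpha / 2, df=B - 1)

  ci = (b_obs - t_crit * se, b_obs + t_crit * se)
  bias = b_star.mean() - b_obs                # estimated bias
  print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f}), estimated bias = {bias:.4f}")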

2.  Percentile Method

The CI for bk is estimated by the percentile method as
(bk*(a/2), bk*(1-a/2))
where bk*(p) is the 100pth percentile of the empirical bootstrap distribution.  So for example the 95 percent CI is bound by the 2.5th and 97.5th percentiles of the bootstrap distribution.  The percentile method requires at least 500 bootstrap samples because the method uses the tails of the bootstrap distribution.
(ALSM5e and ALSM4e discuss a variant of the percentile method called the reflection method that I do not fully understand.)
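The percentile calculation itself is short; a Python sketch, again with stand-in bootstrap estimates:

  import numpy as np

  rng = np.random.default_rng(5)
  b_star = rng.normal(0.50, 0.08, size=1000)  # stand-in bootstrap estimates of b_k

  alpha = 0.05
  lo, hi = np.percentile(b_star, [100 * alpha / 2, 100 * (1 - alpha / 2)])
  print(f"95% percentile CI: ({lo:.3f}, {hi:.3f})")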

3.  Other Methods

STATA offers two additional methods for estimating the CI called the bias-corrected method and the bias-corrected and accelerated method.    Sounds irresistible.  See STATA documentation for command [R] bootstrap.
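STATA's own computations are documented in [R] bootstrap; the following Python sketch only illustrates the idea of the bias-corrected (BC) adjustment with stand-in bootstrap estimates, not STATA's implementation (the BCa method adds an acceleration constant on top of this):

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(6)
  b_obs = 0.52                                # made-up original-sample estimate
  b_star = rng.normal(0.50, 0.08, size=1000)  # stand-in bootstrap estimates

  alpha = 0.05
  # bias-correction constant: where the original estimate falls in the
  # bootstrap distribution, on the normal quantile scale
  z0 = stats.norm.ppf((b_star < b_obs).mean())
  z_lo = stats.norm.ppf(alpha / 2)
  z_hi = stats.norm.ppf(1 - alpha / 2)

  # adjusted percentiles replace the plain a/2 and 1-a/2 of the percentile method
  p_lo = stats.norm.cdf(2 * z0 + z_lo)
  p_hi = stats.norm.cdf(2 * z0 + z_hi)
  lo, hi = np.percentile(b_star, [100 * p_lo, 100 * p_hi])
  print(f"95% bias-corrected CI: ({lo:.3f}, {hi:.3f})")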

4.  EXAMPLES

1.  Spearman Rank Correlation for Efron's Law School Data (SYSTAT)

This example is from SYSTAT V7 New Statistics pp. 8-9.
This replicates Diaconis and Efron's (1983) analysis but for the Spearman rank correlation rather than the Pearson correlation; 1,000 bootstrap samples are produced.
Exhibit:  Scatterplot of GPA by LSAT -  original sample (n=15)
Exhibit:  Bootstrap analysis of Spearman rank correlation of GPA with LSAT (1,000 samples)
Exhibit:  Bootstrap analysis of Spearman rank correlation of GPA with LSAT- program only
Exhibit:  Frequency distribution of bootstrap estimates (n=1,000)

2.  Linear Regression Model with the Longley Data (SYSTAT)

This example is provided with the SYSTAT 9 help system.
Exhibit:  Bootstrap analysis of the Longley data

3.  Robust Estimation of Yule Model Using Bisquare Formula (SYSTAT)

(The following example uses syntax for saving the bootstrap estimates that was originally undocumented.)
Exhibit:  Bootstrap analysis of Yule's model with OLS and robust regression (bisquare 3.5)
Exhibit:  Same - program only (yuleboot.syc)

4.  Robust Estimation of the Graduation Rate Model (STATA)

Exhibit: Bootstrap analysis of robust regression of graduation rate model
Exhibit: Bootstrap distribution of b_pbla (histogram)
Exhibit: Bootstrap distribution of b_inc (histogram)
Exhibit: Bootstrap distribution of b_phis (histogram)
Exhibit: Bootstrap distribution of b_edexp (histogram)

 



Last modified 17 Apr 2006