Module 11 - COLLINEARITY

(Sometimes "multicollinearity".  Belsley, Kuh, and Welsch (1980) dislike "multi-" as redundant.)

1.  SYMPTOMS OF COLLINEARITY

The following symptoms may indicate a collinearity problem.  Note in the next exhibits how the correlations among the independent variables are not extreme, as seen in the splom, yet in the regression the coefficients are n.s. despite a high R^2.

2.  TOLERANCE (TOL) & VARIANCE INFLATION FACTOR (VIF)

The tolerance for variable Xk is
(TOL)_k = 1 - R_k^2    k = 1, 2, ..., p-1
where R_k^2 is the R-square when X_k is regressed on the other independent variables in the model (including a constant).
The variance inflation factor for variable Xk is the inverse of the tolerance
(VIF)_k = 1/(TOL)_k
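As a concrete illustration, TOL and VIF can be computed directly from the auxiliary regressions.  A minimal numpy sketch on hypothetical simulated data (not the body fat data), where x2 is nearly a copy of x1:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical data: x2 is nearly collinear with x1, x3 is independent
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def tol(X, k):
    """(TOL)_k = 1 - R_k^2, with R_k^2 from regressing X_k on the
    other independent variables (including a constant)."""
    y = X[:, k]
    A = np.column_stack([np.ones(len(y)), np.delete(X, k, axis=1)])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1 - r2

for k in range(X.shape[1]):
    print(f"X{k+1}: TOL = {tol(X, k):.4f}, VIF = {1 / tol(X, k):.1f}")
```

The near-duplicate pair (x1, x2) produces very small tolerances and large VIFs, while the independent x3 has tolerance near 1.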
Why the expression "variance inflation factor"?

The standardized regression model is a regression model in which all the variables (X and Y) are standardized into z-scores with mean 0 and standard deviation 1, and then divided by (n - 1)^(1/2) (see NKNW Section 7.5, pp. 277-284).
In the standardized regression model the normal equations X'Xb = X'Y become

r_XX b* = r_YX
where r_XX is the matrix of correlations among the X's, r_YX is the vector of correlations of Y with the X's, and b* is the vector of standardized regression coefficients.
One can show that (VIF)_k is the kth diagonal element of (r_XX)^-1, so that
s^2{b_k*} = (s*)^2 (VIF)_k = (s*)^2/(1 - R_k^2)
where (s*)^2 is the error variance of the standardized model.
Thus (VIF)_k measures how much the variance of the standardized regression coefficient b_k* is inflated by collinearity.
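The identity can be checked numerically: the diagonal of the inverse correlation matrix reproduces 1/(1 - R_k^2) from the auxiliary regressions.  A sketch with hypothetical simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

r_xx = np.corrcoef(X, rowvar=False)   # correlation matrix of the X's
vif = np.diag(np.linalg.inv(r_xx))    # (VIF)_k = kth diagonal element of (r_XX)^-1
print(vif)
```

The first two diagonal entries are large (the collinear pair) and the third is near 1, matching the VIFs obtained by regressing each X_k on the others.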

Using TOL or VIF for Diagnosis

As discussed earlier, a common rule of thumb is to take (VIF)_k > 10 as an indication that collinearity may be a problem.
For the body fat data the values of TOL and VIF are
Independent variable     TOL       VIF
TRICEPS               .001411   708.717
THIGH                 .001772   564.334
MIDARM                .00956    104.603

These values indicate high collinearity levels.
Similarly, the average VIF

(VIF). = ( Sum_{k=1}^{p-1} (VIF)_k ) / (p-1)
is an indicator of collinearity for the entire model.
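For the body fat VIFs tabulated above, the average works out as follows:

```python
# VIF values for TRICEPS, THIGH, MIDARM from the table above
vifs = [708.717, 564.334, 104.603]
vif_bar = sum(vifs) / len(vifs)
print(round(vif_bar, 3))   # -> 459.218
```

An average VIF of about 459, far above 1, again signals serious collinearity in this model.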

3.  (Optional) ADVANCED COLLINEARITY DIAGNOSTICS

Advanced collinearity diagnostics discussed by Belsley, Kuh, and Welsch (1980) are output by SYSTAT with the extended-output option (print = long).  The following exhibits show the advanced diagnostics for the body fat data and for the Longley data:

The following diagnostics are produced (see body fat exhibit):

These diagnostics are further explained in Belsley, Kuh, and Welsch (1980).  In my experience with a number of data sets these advanced diagnostics are sometimes inconclusive; a common pattern is a conclusion of high collinearity involving the constant term, or another variable, when in fact there is no collinearity problem with that variable.  In most cases TOL or VIF is a sufficient diagnostic of collinearity.

4.  REMEDIES FOR COLLINEARITY

The fundamental problem with collinearity is that the pattern of intercorrelations among the independent variables makes the X'X matrix nearly singular, which makes estimation of the regression coefficients imprecise and unstable.  Once collinearity has been diagnosed, a number of strategies may be considered, starting with the most obvious ones.  An example of an "easy" situation is the Longley data: dropping the n.s. variables in the original regression produces a non-collinear model.

5.  RIDGE REGRESSION

Collinearity inflates the variance of the coefficient estimates.  Ridge regression introduces a small bias in order to reduce s{b_k}.  The goal is an estimator that has a higher probability of being close to the true value of the coefficient.  The measure of "probability of being close to the true value" that combines the effects of bias and sampling variation is the mean squared error
E{(b_R - b)^2} = s^2{b_R} + (E{b_R} - b)^2
where b_R is the ridge estimator; the mean squared error is the sum of the variance of the estimate and the squared bias.
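The variance-plus-squared-bias decomposition can be verified with a small simulation.  Here a hypothetical shrinkage estimator (0.8 times an unbiased estimate, not an actual ridge fit) trades a little bias for a larger reduction in variance:

```python
import numpy as np

rng = np.random.default_rng(2)
b_true = 2.0
# unbiased estimates b ~ N(b_true, 1); hypothetical shrunken version b_R = 0.8 b
b_hat = b_true + rng.normal(size=100_000)
b_r = 0.8 * b_hat

mse = np.mean((b_r - b_true) ** 2)          # E{(b_R - b)^2}
var = np.var(b_r)                           # s^2{b_R}
bias_sq = (np.mean(b_r) - b_true) ** 2      # (E{b_R} - b)^2
print(mse, var + bias_sq)                   # the two agree
```

In this example the biased estimator's MSE (about 0.8) is below the unbiased estimator's variance (about 1), which is exactly the trade-off ridge regression aims for.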

Ridge regression is based on the standardized regression model with normal equations

r_XX b* = r_YX
where r_XX is the matrix of correlations among the X's, r_YX is the vector of correlations of Y with the X's, and b* is the vector of standardized regression coefficients.
The ridge estimator adds a "biasing constant" c >= 0 so that the normal equations become
(r_XX + cI) b_R = r_YX
where b_R contains the ridge coefficient estimates, so that
b_R = (r_XX + cI)^-1 r_YX
The strategy is to try several successive values of c, starting from zero, and choose the smallest value for which the coefficient estimates first become stable.  This is an informal judgment based on the graph of the b_R against c, called the ridge trace.
The following exhibits show an example of ridge estimation for the body fat data, together with a matrix procedure for ridge regression using SYSTAT's matrix module.  Once the value of c is chosen and the standardized coefficients b_R obtained, one can recover the unstandardized coefficient estimates using the formulas (NKNW 7.53a and 7.53b, p. 282)
b_k = (s_Y/s_k) b_k^R   (k = 1, ..., p-1)
b_0 = Y. - b_1 X.1 - ... - b_{p-1} X.,p-1
where s_Y and s_k are the sample standard deviations of Y and X_k, respectively.
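Putting the pieces together, a minimal numpy sketch (on hypothetical simulated data, not the SYSTAT matrix procedure) that computes the ridge trace and then back-transforms the coefficients for a chosen c:

```python
import numpy as np

rng = np.random.default_rng(3)
# hypothetical collinear data
n = 30
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 1.0 + x1 + 0.5 * x2 + rng.normal(size=n)

r_xx = np.corrcoef(X, rowvar=False)
r_yx = np.array([np.corrcoef(X[:, k], y)[0, 1] for k in range(X.shape[1])])

def ridge_std(c):
    """Standardized ridge coefficients b_R = (r_XX + cI)^-1 r_YX."""
    return np.linalg.solve(r_xx + c * np.eye(len(r_xx)), r_yx)

# ridge trace: watch the coefficients stabilize as c grows
for c in (0.0, 0.002, 0.01, 0.05, 0.1):
    print(c, ridge_std(c))

# back-transform for a chosen c (NKNW 7.53a and 7.53b)
c = 0.02
b_r = ridge_std(c)
s_y, s_k = y.std(ddof=1), X.std(axis=0, ddof=1)
b = (s_y / s_k) * b_r               # b_k = (s_Y/s_k) b_k^R
b0 = y.mean() - b @ X.mean(axis=0)  # b_0 = Y. - b_1 X.1 - b_2 X.2
print(b0, b)
```

At c = 0 the back-transformed coefficients reproduce ordinary least squares exactly; as c grows the standardized coefficients shrink toward zero, which is what the ridge trace displays.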



Last modified 6 Apr 2006