University of North Carolina
at Chapel Hill

SOCI 209/HPA 332 - LINEAR REGRESSION MODELS - Spring 2003
Professor François Nielsen

Assignment 4 - Released Tue 15 April (should have been Tue 8 April; my mistake!)
DUE Thu 24 April

PROBLEMS ON OUTLYING & INFLUENTIAL OBSERVATIONS, COLLINEARITY, HETEROSCEDASTICITY, & AUTOCORRELATED ERRORS

(Use a regression program of your choice to do problems requiring data analysis.)

From Neter, Kutner, Nachtsheim, and Wasserman (NKNW):

1.  9.3 p. 392 (just discard influential cases?)

2.  This problem uses the Yule data set.  It focuses on diagnostics and remedial measures for outliers and influential cases.

a.  Estimate the full model paup = constant + outratio + propold + pop for the 32 unions and save the regression diagnostics.
b.  Use the studentized deleted residuals (STUDENT in SYSTAT) to identify outliers in the Y dimension, using the Bonferroni procedure with an initial α = .01 level.  State the decision rule and conclusion.
c.  Identify any X-outlying (high-leverage) observation using the appropriate diagnostic and rule of thumb.
d.  Identify any influential observation by looking at an index plot of Cook's distance (COOK in SYSTAT) and calculating the corresponding percentiles of the appropriate F distribution for cases with high values of COOK; compare the percentiles with the cutoffs suggested in NKNW.
e.  Use the Hadi procedure for robust outlier detection.  (Make sure you specify the print = long option in SYSTAT to get the list of outliers.)  Are the results of the Hadi procedure consistent with those of the other diagnostics?  Why are these particular unions deviant?  (How would you find out more about the various neighborhoods of metropolitan London in the late 19th century?)  On what grounds could one justify removing these deviant cases?
f.  After selecting out the outliers identified by the Hadi procedure, estimate the following 3 models.  (In SYSTAT specify print = short if you don't want all the collinearity diagnostics.)   If appropriate, estimate a final (4th) trimmed model (if you do, disregard any new warnings concerning outliers).
paup = constant + outratio
paup = constant + outratio + propold
paup = constant + outratio + propold + pop
paup = ?
Present the regression results in a tabular form suitable for publication.
g.  Reestimate the full model with the 32 cases using robust regression (IRLS) with the bisquare weight function with tuning parameter 3.5. (See NKNW Figure 10.4 p. 419 for the shape of the bisquare weight function.)  Look at the example with the GRAD data on the web for using SYSTAT's nonlin module for robust regression; the commands for this problem will look like
>nonlin
>model paup = b0 + b1*outratio+b2*propold+b3*pop
>robust bisquare=3.5
>estimate
How do these estimates compare to OLS with the 32 cases and OLS with the outliers removed?
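
For readers who want to check the diagnostics in parts b. to d. by hand, here is a minimal Python sketch, not SYSTAT output. It assumes a simple regression with one predictor (so p = 2) and uses made-up data rather than the Yule data; the variable names are illustrative only. It computes the leverages, the studentized deleted residuals, and Cook's distance from the usual formulas based on the ordinary residuals, SSE, and the leverages.

```python
import math

# Hypothetical data (NOT the Yule data): a near-linear trend with one
# gross outlier in Y at the last case.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [2.5, 4.7, 8.2, 10.6, 14.1, 17.3, 19.8, 23.4, 25.9, 60.0]

n = len(xs)
p = 2  # number of regression parameters: intercept + one slope

# Ordinary least squares fit of the simple regression.
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse = sum(e * e for e in resid)
mse = sse / (n - p)

# Leverage h_i (diagonal of the hat matrix) for simple regression.
leverage = [1.0 / n + (x - x_bar) ** 2 / sxx for x in xs]

# Studentized deleted residual:
# t_i = e_i * sqrt((n - p - 1) / (SSE * (1 - h_i) - e_i^2))
t_deleted = [
    e * math.sqrt((n - p - 1) / (sse * (1.0 - h) - e * e))
    for e, h in zip(resid, leverage)
]

# Cook's distance: D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2
cooks_d = [
    (e * e / (p * mse)) * h / (1.0 - h) ** 2
    for e, h in zip(resid, leverage)
]

# Rule of thumb for X-outlying cases: h_i > 2p/n.
high_leverage = [i for i, h in enumerate(leverage) if h > 2.0 * p / n]

for i in range(n):
    print(f"case {i}: h={leverage[i]:.3f} t={t_deleted[i]:.2f} D={cooks_d[i]:.3f}")
print("cases above the 2p/n leverage cutoff:", high_leverage)
```

The Bonferroni decision rule compares each |t_i| with the t(1 - α/(2n); n - p - 1) critical value, which the sketch leaves to a table or a statistics package.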
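
As a rough illustration of what iteratively reweighted least squares with the bisquare weight function does in part g., here is a minimal Python sketch. It is not SYSTAT's implementation: it assumes a single predictor, made-up data, residuals scaled by the MAD estimate (median absolute deviation divided by 0.6745), and a fixed number of iterations. The function names are hypothetical.

```python
import statistics


def bisquare_weight(u, c=3.5):
    """Bisquare weight: (1 - (u/c)^2)^2 for |u| <= c, and 0 otherwise."""
    if abs(u) > c:
        return 0.0
    return (1.0 - (u / c) ** 2) ** 2


def weighted_line(xs, ys, ws):
    """Weighted least squares fit of y = b0 + b1*x; returns (b0, b1)."""
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def irls_bisquare(xs, ys, c=3.5, n_iter=20):
    """IRLS starting from OLS; returns (b0, b1, final weights)."""
    ws = [1.0] * len(xs)
    b0, b1 = weighted_line(xs, ys, ws)
    for _ in range(n_iter):
        resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
        med = statistics.median(resid)
        mad = statistics.median([abs(e - med) for e in resid]) / 0.6745
        if mad == 0.0:  # perfect fit for most cases; nothing left to rescale
            break
        ws = [bisquare_weight(e / mad, c) for e in resid]
        b0, b1 = weighted_line(xs, ys, ws)
    return b0, b1, ws


# Hypothetical data (NOT the Yule data): slope near 3, one gross outlier.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [2.5, 4.7, 8.2, 10.6, 14.1, 17.3, 19.8, 23.4, 25.9, 60.0]

ols_b0, ols_b1 = weighted_line(xs, ys, [1.0] * len(xs))
rob_b0, rob_b1, weights = irls_bisquare(xs, ys)
print("OLS slope:", round(ols_b1, 3))
print("robust slope:", round(rob_b1, 3))
print("final weight of the outlying case:", round(weights[-1], 3))
```

Comparing the final weights with the list of cases flagged by the other diagnostics is one way to see which cases the robust fit is discounting.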


3.   9.13 p. 395 (cosmetics sales; clues of collinearity)  To do part d. with SYSTAT you need to go to the CORR module and enter the command pearson x1 x2 x3

4.  9.14 p. 395 (cosmetics sales; interpreting VIF, advantage of experiment)  Note that VIF_k = 1/TOL_k, and conversely TOL_k = 1/VIF_k, where TOL_k = 1 - R_k^2 and R_k^2 is the coefficient of multiple determination when X_k is regressed on the other independent variables in the model.  SYSTAT outputs TOL instead of VIF.
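
The conversion between TOL and VIF can be checked with a few lines of Python. This is only a sketch under a simplifying assumption: with exactly two predictors, R_k^2 reduces to the squared Pearson correlation between them, so no multiple regression is needed. The data are made up and are not the cosmetics sales data.

```python
# Hypothetical values of two predictors (NOT the cosmetics sales data).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]

n = len(x1)
m1 = sum(x1) / n
m2 = sum(x2) / n
s11 = sum((a - m1) ** 2 for a in x1)
s22 = sum((b - m2) ** 2 for b in x2)
s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))

# With two predictors, R_k^2 from regressing one on the other is r^2.
r_squared = s12 ** 2 / (s11 * s22)

tol = 1.0 - r_squared  # what SYSTAT reports
vif = 1.0 / tol        # what NKNW discusses

print(f"R_k^2 = {r_squared:.2f}, TOL = {tol:.2f}, VIF = {vif:.3f}")
```

With three or more predictors, each R_k^2 has to come from a separate regression of X_k on the remaining predictors.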

5.  10.6 p. 445 (computer-assisted learning; handling heteroscedasticity)  This is a small but complete paradigm for handling heteroscedasticity.  Disregard the detailed instructions in NKNW.  Instead use the data set and do the following steps:

  1. run the regression in SYSTAT or STATA and plot the residuals against the fitted values (estimates); do you see any funny funnel pattern?
  2. run the regression of Y on X in STATA, and calculate the Breusch-Pagan aka Cook-Weisberg test of heteroscedasticity.  (Hint: in STATA just enter hettest after the regression).  Is the test significant?  What does this mean?
  3. in STATA rerun the regression 3 times using (1) the Huber-White robust standard errors (option robust), (2) the MacKinnon-White HC2 standard errors (option hc2), and (3) the MacKinnon HC3 standard errors (option hc3)
  4. construct a table (see exhibit at the end of Module 12 for an example) comparing the width of the 95% CI obtained using options robust, hc2, and hc3; comment.
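
To see where the three sets of standard errors in step 3 differ, here is a minimal Python sketch of the slope's heteroscedasticity-consistent standard error in a simple regression. It is not STATA output. It assumes one predictor, made-up data with a funnel shape, and the usual definitions: HC0 uses the squared residuals e_i^2, HC1 multiplies HC0 by n/(n - p) (to my knowledge this is what the robust option computes), HC2 uses e_i^2/(1 - h_i), and HC3 uses e_i^2/(1 - h_i)^2.

```python
import math

# Hypothetical data (NOT the NKNW data set): the spread of Y grows with X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.1, 4.6, 7.6, 7.8, 12.5, 11.2, 17.8, 13.8]

n = len(xs)
p = 2  # intercept + slope

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
leverage = [1.0 / n + (x - x_bar) ** 2 / sxx for x in xs]


def slope_se(case_weights):
    """Sandwich standard error of b1 for the given per-case variance terms."""
    var = sum(((x - x_bar) / sxx) ** 2 * w for x, w in zip(xs, case_weights))
    return math.sqrt(var)


se = {
    "ols": math.sqrt(sum(e * e for e in resid) / (n - p) / sxx),
    "hc0": slope_se([e * e for e in resid]),
    "hc2": slope_se([e * e / (1.0 - h) for e, h in zip(resid, leverage)]),
    "hc3": slope_se([e * e / (1.0 - h) ** 2 for e, h in zip(resid, leverage)]),
}
se["hc1"] = se["hc0"] * math.sqrt(n / (n - p))

for name in ("ols", "hc0", "hc1", "hc2", "hc3"):
    print(f"{name}: SE(b1) = {se[name]:.4f}")
```

Because every 95% CI uses the same t multiplier, the CI widths in the table are proportional to these standard errors.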

6.  12.13 p. 523 (advertising agency; detecting autocorrelation of errors)  In part b., since the cases are ordered over time and the observations are equally spaced, the plot of residuals against time is the same as an index plot (plot residual); note also that the SYSTAT output automatically provides the standard errors of estimate, the D-W test, and the estimated autocorrelation, so you do not have to calculate them yourself.
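
Although SYSTAT reports these quantities, the Durbin-Watson statistic is easy to reproduce as a check. The Python sketch below uses a short list of made-up residuals (not the advertising agency data) and computes D = sum of (e_t - e_{t-1})^2 divided by sum of e_t^2, together with the lag-1 estimate r = sum of e_{t-1} e_t divided by sum of e_{t-1}^2. SYSTAT may compute its reported autocorrelation slightly differently.

```python
# Hypothetical time-ordered residuals (NOT from the advertising agency data).
resid = [1.0, 2.0, 1.0, -1.0, -2.0, -1.0]

# Durbin-Watson statistic: squared successive differences over squared residuals.
numerator = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
denominator = sum(e * e for e in resid)
durbin_watson = numerator / denominator

# Lag-1 autocorrelation estimate: slope of e_t on e_{t-1} through the origin.
r = (
    sum(resid[t - 1] * resid[t] for t in range(1, len(resid)))
    / sum(resid[t - 1] ** 2 for t in range(1, len(resid)))
)

print(f"D = {durbin_watson:.4f}, r = {r:.4f}")
```

Values of D well below 2 point toward positive autocorrelation; the formal test compares D with the tabled lower and upper bounds.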

7.  12.14 p. 523 (advertising agency; Cochrane-Orcutt procedure) Omit parts f and g.
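
For orientation, one iteration of the Cochrane-Orcutt procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the solution to the problem: it uses one predictor, made-up data with smoothly drifting errors, and hypothetical function names. The steps are: fit OLS, estimate r from the residuals, transform to Y'_t = Y_t - r Y_{t-1} and X'_t = X_t - r X_{t-1}, refit, and convert the intercept back with b0 = b0'/(1 - r).

```python
def ols_line(xs, ys):
    """Ordinary least squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b1 * x_bar, b1


def cochrane_orcutt_step(xs, ys):
    """One Cochrane-Orcutt iteration; returns (r, b0, b1) on the original scale."""
    b0, b1 = ols_line(xs, ys)
    e = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    # Estimate the autocorrelation parameter from the OLS residuals.
    r = (
        sum(e[t - 1] * e[t] for t in range(1, len(e)))
        / sum(e[t - 1] ** 2 for t in range(1, len(e)))
    )
    # Transformed variables (the first observation is lost).
    xt = [xs[t] - r * xs[t - 1] for t in range(1, len(xs))]
    yt = [ys[t] - r * ys[t - 1] for t in range(1, len(ys))]
    b0_t, b1_t = ols_line(xt, yt)
    return r, b0_t / (1.0 - r), b1_t


# Hypothetical data (NOT the advertising agency data): Y = 1 + 2X plus
# errors that drift smoothly over time.
xs = [float(t) for t in range(1, 11)]
errors = [0.5, 0.6, 0.4, 0.1, -0.2, -0.5, -0.4, -0.1, 0.3, 0.5]
ys = [1.0 + 2.0 * x + u for x, u in zip(xs, errors)]

r, b0, b1 = cochrane_orcutt_step(xs, ys)
print(f"r = {r:.3f}, b0 = {b0:.3f}, b1 = {b1:.3f}")
```

In practice one would recompute the Durbin-Watson test on the transformed regression and iterate if autocorrelation remains.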



Last modified 14 Apr 2003