SOCI709 (formerly 209) - Module 3 - Diagnostics & Remedies in Simple
Regression
1. Residual Analysis & the Healthy Regression
1. Residual Plot of the Healthy Regression
Residual analysis is a set of diagnostic methods for investigating
the appropriateness of a regression model based on the residuals
ei = Yi - ^Yi
where ^Yi is the fitted value (aka predictor, aka estimate)
^Yi = b0 + b1Xi
The basic idea is that, if the regression model is appropriate, the residuals
ei should "reflect the properties ascribed to the model error
terms εi", such as independence,
constant variance across levels of X, and normal distribution.
The workhorse of residual analysis is the
residual plot, or plot
of the residuals ei against the fitted values
^Yi = b0 + b1Xi.
In a healthy regression the residuals should appear randomly arranged
in a horizontal band around the e = 0 line, like so
In reality, computer programs typically adapt the range of the data
to occupy the entire frame of the plot, so that the residuals appear as
a random cloud of points spread out over the entire vertical range of the
plot. The next exhibit shows a typical computer-generated residual
plot
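A residual plot of this kind can be produced in Stata along the following lines (a minimal sketch in current Stata syntax, assuming a hypothetical outcome y and predictor x):
* fit the regression and recover fitted values and residuals
regress y x
predict yhat
predict e, resid
* residual plot: residuals against fitted values, horizontal line at e = 0
scatter e yhat, yline(0)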
2. Potential Problems With the Simple Regression Model
Problems with the regression are departures from (or violations of)
the assumptions of the regression model. They are:
- regression function is not linear
- error terms do not have constant variance
- error terms are not independent
- model fits all but one or a few outlier observations
- error terms are not normally distributed
- one or several important predictor variables have been omitted from model
These problems also affect multiple regression models. Most of the
diagnostic tools used for simple linear regression are also used with multiple
regression.
This module examines these potential problems together with diagnostic
tools, which can be informal (such as graphs) or formal (such
as tests), and with remedies.
3. Using Informal (Graphic) Versus Formal Diagnostic Tests
Features of informal graphic diagnostics are
- they are visual and at least in part subjective
- they may be inconclusive
- they require judgment, and therefore
- they may require training and/or experience
- sometimes judgment may be helped by calibration (see normal probability plot, later)
Features of formal tests are
- they may give a straight yes/no answer about the presence of a problem
- they may not require experience or training
- they may be used in automatic fashion
- they may not reveal the substantive situation as well as informal tools, so that
- a researcher relying entirely on formal tools may miss an interesting aspect of the data (e.g., in Box-Cox estimation of the optimal transformation of the data)
Researchers can use available informal and formal diagnostics to develop
their own strategy to diagnose problems with their models.
2. Regression Function is Not Linear
1. Scatterplot or Residual Plot With LOWESS Robust Nonparametric
Regression Curve (Diagnostic, Informal)
Nonlinearity in the regression function may appear in the scatter plot
of Y versus X, or in the residual plot. The residual plot often magnifies
nonlinearity (compared to the scatter plot) and makes it more noticeable.
A nonlinear trend may be revealed by using a nonparametric regression
technique such as LOWESS. The LOWESS algorithm estimates a curve
that represents the main trend in the data, without assuming a specific
mathematical relationship between Y and X. (This is why it is called
nonparametric.)
The next two exhibits show (1) a scatterplot of female life expectancy
against doctors per million for 137 countries in 1975, and (2) a plot of
the residuals of the regression of female life expectancy on doctors
per million. Note how the nonlinearity is magnified by the residual
plot, as compared to the plot of Y against X.
LOWESS (aka LOESS) was invented by William S. Cleveland (1994:168-180,
1993:94-101).
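In Stata a LOWESS curve can be added with the lowess command (a sketch in current Stata syntax, using v195 and v181, the variable names of this example as they appear in the hands-on session below; older Stata used ksm):
* LOWESS curve for female life expectancy (v195) against doctors per million (v181)
lowess v195 v181
* the same trend is usually magnified in the residual plot
regress v195 v181
predict yhat
predict e, resid
lowess e yhat, yline(0)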
2. Linearity or Lack of Fit Test (Diagnostic, Formal, Limited Applicability)
There is a test of linearity of the regression function, called the lack
of fit test. This test requires repeat observations (called replications)
at one or more levels of X, so it cannot be performed with all data sets.
In essence it is a test of whether the means of Y for the groups of replicates
are significantly different from the fitted value ^Y on the regression
line, using a kind of F test. The test is explained in ALSM5e pp.
119-127; ALSM4e pp. 115-124.
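The test is not built into Stata as a single command, but it can be assembled by comparing the linear regression (reduced model) with a one-way ANOVA treating X as a factor (full model), whose residual sum of squares is the pure error. A hedged sketch, assuming hypothetical variables y and x with replicated X values:
* reduced model: linear regression of y on x
regress y x
scalar sser = e(rss)
scalar dfr = e(df_r)
* full model: x as a factor; residual SS is the pure error SS
anova y x
scalar ssef = e(rss)
scalar dff = e(df_r)
* lack-of-fit F statistic and its p-value
scalar Flof = ((sser - ssef)/(dfr - dff))/(ssef/dff)
display "F = " Flof "   p = " Ftail(dfr - dff, dff, Flof)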
3. Polynomial Regression (Remedy)
Some forms of non-linearity can be modeled by introducing higher powers
of X in the regression equation. Polynomial regression is a form
of multiple regression and is discussed in Module 6 - Polynomial Regression
& Interactions.
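As a quick illustration (a sketch with hypothetical y and x; polynomial regression proper is covered in Module 6):
* quadratic regression: add the square of X as a second predictor
gen x2 = x^2
regress y x x2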
4. Transforming Variables To Linearize the Relationship (Remedy)
1. Error Variance Appears Constant
When the error variance appears to be constant, only X need be
transformed to linearize the relationship. Some typical situations
are shown in the next exhibit.
The following exhibit shows an example of a non-linear regression function
that can be straightened by transforming X.
2. Error Variance Appears Not Constant
When the error variance does not appear constant it may be necessary
to transform Y or both X and Y. The next two exhibits show examples.
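In Stata such transformations amount to generating new variables before running the regression (a sketch with hypothetical y and x; the appropriate transformation depends on the pattern in the exhibits):
* error variance constant: transform X only, e.g. a logarithm
gen logx = ln(x)
regress y logx
* error variance not constant: transform Y (or both), e.g. a square root
gen sqrty = sqrt(y)
regress sqrty x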
3. Transformation to Simultaneously Linearize the Relationship and
Normalize the Distribution of Errors
See Box-Cox transformation under 6.2 below.
3. Error Terms Do Not Have Constant Variance (Heteroskedasticity)
1. Funnel Shape in Residual Plot (Diagnostic, Informal)
Terminology:
- homoskedasticity: σ² is constant over the entire range of X (as specified by OLS assumptions)
- heteroskedasticity: σ² is not constant over the entire range of X (departure from OLS assumptions)
The regression model assumes homoskedasticity. Heteroskedasticity
may be manifested by a funnel or megaphone pattern like the following prototype
A real example is
Sometimes the funnel pattern is reversed, with greater error variance corresponding
to smaller values of X.
Q - Why is the variance in depression score lower for higher incomes and
vice-versa?
However, when the residual is plotted against ^Y the megaphone pattern
fans out to the right. (Why?)
2. Plot of Absolute Residual |e| or Squared Residual e²
by ^Y (Diagnostic, Informal)
A pattern of heteroskedasticity can be detected informally by plotting
the absolute residual or the squared residual against ^Yi; adding
the linear fit or a LOWESS curve makes the trend in the data easier to see:
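A sketch of this diagnostic in Stata (hypothetical y and x):
* absolute residuals against fitted values, with a LOWESS curve for the trend
regress y x
predict yhat
predict e, resid
gen abse = abs(e)
lowess abse yhat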
3. Tests for Homoskedasticity (Constancy of σ²) (Diagnostic, Formal)
There are many tests of heteroskedasticity. Three formal tests of
constancy of the error variance will be discussed in Module 12 - Heteroskedasticity
& Weighted Least Squares:
- the Brown-Forsythe test (see ALSM5e pp. 116-118; not in ALSM4e)
- the Breusch-Pagan aka Cook-Weisberg test (see ALSM5e pp. 118-119; ALSM4e p. 115)
- the Goldfeld-Quandt test (Wilkinson, Blank, and Gruber 1996:274-277)
(The Breusch-Pagan aka Cook-Weisberg Test can be carried out in
STATA with the command hettest following the regress command.
An example will be shown in the hands-on session.)
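In outline (hypothetical y and x; in current Stata the same test is estat hettest):
* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
regress y x
hettest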
4. Variable Transformation To Equalize the Variance of Y (Remedy)
See Section 6 below on Tukey's ladder of powers and the Box-Cox transformation.
Transformations of Y to make the distribution of residuals normal often
have the effect of also equalizing the variance of Y over the entire range
of X.
5. Weighted Least Squares (Remedy)
In rare cases where a variable transformation that takes care of unequal
error variance cannot be found, one can use weighted least squares.
In weighted least squares observations are weighted in inverse proportion
to the variance of the corresponding error term, so that observations with
high variance are downweighted relative to observations with low variance.
Weighted least squares is discussed in Module 12 - Heteroskedasticity &
Weighted Least Squares.
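In Stata, weighted least squares can be run with analytic weights, which regress interprets as inversely proportional to the error variance of each observation. A hedged sketch, assuming a hypothetical variable varhat holding an estimate of the error variance for each case:
* weight each observation in inverse proportion to its error variance
gen w = 1/varhat
regress y x [aweight=w]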
4. Error Terms Are Not Independent
1. Residual e by Time or Other Sequence (Diagnostic, Informal)
In time series data where observations correspond to successive time points,
errors can be autocorrelated or serially correlated.
Such a pattern can be seen in a plot of residuals against the time order
of the observations, as in the following prototype
The following real example shows residuals for a regression of the
divorce rate on female labor force participation (U.S. 1920-1996).
Note the characteristic "machine gun" tracking pattern. The residual
is plotted against the index of a case (corresponding to a year of observation).
Q - What year is the peak divorce residual?
Remedial techniques for lack of independence are discussed in Module
14 - Autocorrelation in Time Series Data.
2. Durbin-Watson Test of Independence (Diagnostic, Formal)
See Module 14 - Autocorrelation in Time Series Data.
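In current Stata the statistic is available after declaring the time variable (a sketch with hypothetical y, x, and year; Stata 7 used the dwstat command):
* Durbin-Watson test for serial correlation of the errors
tsset year
regress y x
estat dwatson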
5. Model Fits All But One Or A Few Outlier Observations
1. Outliers in Scatterplot or Residual Plot (Diagnostic, Informal)
Outliers may often (but not always) be spotted in a scatterplot or residual
plot.
Outlying observations may affect the estimate of the regression parameters,
as shown in the following exhibit.
Outlying observations may sometimes be identified in a box plot or stem-and-leaf
display of the residuals (see below).
2. Tests for Outlier and Influential Cases (Diagnostic, Formal)
An extensive discussion of outliers and influential cases is provided in
Module 10 - Outlying & Influential Observations.
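One common formal screen, the studentized residual, is easy to obtain in Stata (a sketch with hypothetical y and x; SYSTAT prints a similar warning automatically, as in the session below):
* flag observations with large studentized residuals
regress y x
predict rstu, rstudent
list if abs(rstu) > 2 & rstu < .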
6. Error Terms Are Not Normally Distributed
Non-normality of errors covers a range of issues with different
degrees of severity. Kurtosis (fat tails) and skewness are more serious
than minor departures from normality. These features of the
distribution blend into the problem of outlying observations.
1. Box Plot, Stem-and-Leaf, and Other Displays of the Distribution
of e (Diagnostic, Informal)
Lack of normality (as well as some types of outlying observations) may
be diagnosed by looking at the distribution of the residuals using devices
such as a histogram, stem and leaf display, box plot, kernel density estimator,
etc. In the following examples, note how the histogram of the
distribution appears normal, while fat tails and outlying observations
are revealed in other displays.
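In current Stata syntax these displays are one-liners (a sketch; e holds the residuals from a previous regress and predict):
* histogram and kernel density with a normal curve overlaid
histogram e, normal
kdensity e, normal
* box plot and stem-and-leaf display
graph box e
stem e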
2. Normal Probability Plot of e (Diagnostic, Informal)
The normal probability plot is used to examine whether the residuals are
normally distributed. It is a plot of the residuals against their
expected values assuming normality. The expected values of the residuals
are obtained by first sorting the residuals in increasing order.
Then
EXPECTEDk = z((k - 0.5)/n)    for k = 1, ..., n
where k is the rank of the residual, such that k = 1 for the smallest value
of e and k = n for the largest, and z((k - 0.5)/n) is the 100((k - 0.5)/n)th
percentile of the standard normal distribution N(0,1).
When the residuals are normally distributed, the plot is approximately
a straight line.
Notes:
- the textbook uses an alternative formula for EXPECTEDk that yields similar results (ALSM5e pp. 110-112; ALSM4e pp. 106-108)
- especially for relatively small samples, the normal probability plot may not be conclusive concerning the normality of the distribution. A formal correlation test (explained later) can be used. Alternatively, Wilkinson, Blank, and Gruber (1996) suggest calibrating the normal probability plot by generating probability plots for 10 random samples of size n from a normal population, and judging visually whether the empirical plot "fits" within this collection.
- some computer implementations of the normal probability plot multiply z((k - 0.5)/n) by the square root of MSE to obtain expected residuals with the same scale as the original residuals (ALSM5e pp. 110-112; ALSM4e pp. 106-108).
The shape of the normal probability plot reveals aspects of the distribution
of e.
The following exhibits show a full example of calculation of the probability
plot based on the Yule data set.
3. Correlation Test for Normality (Diagnostic, Formal)
The correlation test for normality is based on the normal probability plot.
It is the correlation between the residuals ei and their expected
values under normality. The higher the correlation, the straighter
the normal probability plot, and the more likely that the residuals are
normally distributed. The value of the correlation coefficient can
be tested, given the sample size and the α level
chosen, by comparing it with the critical value in Table B.6 in ALSM5e/4e.
A coefficient larger than the critical value supports the conclusion that
the error terms are normally distributed (ALSM5e pp. 115-116; ALSM4e p.
111).
Example: For the Yule data the correlation between the residuals and the
values expected under the normality assumption is .981 with n = 32. Table
B.6 in ALSM5e/4e gives a critical value of .964 for n = 30 with α = .05.
Since r = .981 > .964 one concludes that the residuals are normally distributed.
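The expected values and the correlation can be computed directly in Stata (a sketch; e holds the residuals, invnormal() is invnorm() in older Stata, and the critical value must still be looked up in Table B.6):
* expected residuals under normality: z((k - 0.5)/n) for rank k
sort e
gen expres = invnormal((_n - 0.5)/_N)
* correlation test for normality; compare r with the Table B.6 critical value
corr e expres
* built-in quantile-normal plot for visual comparison
qnorm e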
4. Data Transformations Affecting the Distribution of a Variable
(Remedy)
This section relies extensively on Wilkinson, Blank, and Gruber (1996,
Chapter 22 pp. 695-713).
1. Standardizing a Variable - Does Not Affect Shape of Distribution
Common variable standardizations are the z-score and range standardization.
These transformations do not affect the shape of the distribution of the
variable (contrary to some popular beliefs).
The standardized value Z of Y is obtained by the formula
Z = (Y - Ȳ) / sY
where Ȳ and sY denote the sample mean and standard
deviation of Y, respectively.
The range-standardized value W of Y is obtained by the formula
W = (Y - YMIN) / (YMAX - YMIN)
where YMAX and YMIN denote the maximum and minimum
values of Y, respectively.
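Both standardizations are immediate in Stata (a sketch with a hypothetical variable y):
* z-score standardization
egen zy = std(y)
* range standardization from the sample minimum and maximum
summarize y
gen wy = (y - r(min))/(r(max) - r(min))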
2. Transforming a Variable to Look Normally Distributed
Transformations of Y can be used to remedy non-normality of the error term.
a. Radical normalization with the rankit transformation
The rankit transformation transforms any distribution into a normal one.
It is equivalent to "grading on the curve," and to finding the expected
values of the residuals in preparing a normal probability plot.
b. Tukey's ladder of powers
Tukey (1977) has proposed a ladder of powers spanning a range of
transformations to normalize the distribution of a variable in a data set.
c. Automatic choice of ladder of powers transformation with the Box-Cox
procedure
The family of power transformations in Tukey's ladder is of the form
Y' = Y^λ
where λ is the power parameter. Some of
the principal steps of the ladder are

Lambda (λ)   Transformation   Name
2            Y' = Y^2         square
1            Y' = Y           identity
.5           Y' = Y^(1/2)     square root
0            Y' = log(Y)      logarithm (any base)
-.5          Y' = 1/Y^(1/2)   inverse square root
-1           Y' = 1/Y         inverse

but λ can take any value in between.
In the simplest version the Box-Cox procedure estimates the parameter λ
by maximum likelihood so as to maximize the fit of the transformed data
Y' to a normal distribution. There is only one variable involved.
Example: Find the transformation of V181 that
best normalizes the distribution. STATA estimates lambda as
lambda     Std. Error   z (same as t*)   P{|Z|>z}   CI Low     CI Up     Sigma
.1362285   .0660046     2.06             .039       .0068619   .265595   3.136496
The program does chi-square tests of 3 hypotheses on the value of lambda:
H0: λ =   Chi-square   P-value
-1        300.88       .000
0         4.35         .037
1         140.35       .000
So the estimated λ (.1362285) is close to 0 (logarithm
transformation), but H0: λ = 0 is
rejected at the .05 level (although not at the .01 level).
d. Box-Cox procedure to optimize the linear relationship between
X and Y
The Box-Cox procedure can also be used to transform Y and X in such a way
as to maximize their relationship. The procedure can estimate an
exponent θ (theta) of X as well as the exponent λ (lambda) of Y.
The complete model is thus
Yi^λ = b0 + b1Xi^θ + εi
The estimates λ, θ, b0, b1 and σ²
can be found by maximizing the likelihood function shown in ALSM5e p. 135,
Equation 3.36; ALSM4e p. 135, Equation 3.35. There are now two variables
involved, Y and X. (The model generalizes to multiple regression,
with some limitations.)
In the following example v195 is Female Life Expectancy 1975 and v181
is Doctors per Million. The estimates are

Coefficient   Estimate   Std. Error   z (same as t*)   P{|Z|>z}   95% CI Low   95% CI Up
lambda        .1772316   .0705726     2.51             0.012      .0389119     .3155514
theta         1.422054   .3605058     3.94             0.000      .7154761     2.128633
V181          16.74134
Sigma =       31.67667
Note that the exponent theta for X is 1.422, but the 95% CI includes
1.0 so one cannot reject the hypothesis that the optimal theta is 1.0 (optimal
transformation is no transformation).
On the Box-Cox procedure see ALSM5e pp. 134-137.
e. Arcsine transformation for proportions and percents
Proportions and percentages often produce a truncated and skewed distribution
when the mean is not near .5. The arcsine transformation corrects
these problems. Use
let ynew = 2*asn(sqr(P))
where P is a proportion. (Transform percentages to proportions by
dividing by 100 before applying the transformation.)
f. Fisher's z transformation for correlation coefficients
Correlation coefficients often produce a truncated and skewed distribution
when the mean is not near 0. Fisher's z transformation (the inverse
hyperbolic tangent) normalizes such distributions. Use
let ynew = ath(R)
where R is a correlation between -1 and +1.
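The Stata equivalents of these two SYSTAT commands (a sketch; p and r are hypothetical variables holding a proportion and a correlation):
* arcsine transformation of a proportion
gen pnew = 2*asin(sqrt(p))
* Fisher's z, the inverse hyperbolic tangent, of a correlation
gen znew = atanh(r)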
7. One or Several Important Predictor Variables Have Been Omitted
From Model
Plot of Residual e by Omitted Predictor Variable Z (Diagnostic, Informal)
Z is a potential predictor variable that is not included in the equation.
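A sketch of this plot in Stata (hypothetical y, x, and candidate omitted variable z); a systematic trend in the plot suggests Z belongs in the model:
* residuals of the regression of y on x, plotted against the omitted candidate z
regress y x
predict e, resid
scatter e z, yline(0)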
8. Simple Regression Diagnostics & Remedies in Practice
1. SYSTAT Examples
>USE "Z:\mydocs\ys209\survey2b.syd"
[...deleted output...]
>regress
>model total=constant+income
>save resid/data
>estimate

Dep Var: TOTAL   N: 256   Multiple R: 0.210   Squared multiple R: 0.044
Adjusted squared multiple R: 0.040   Standard error of estimate: 8.728

Effect     Coefficient   Std Error   Std Coef   Tolerance   t        P(2 Tail)
CONSTANT   11.811        0.948       0.000      .           12.461   0.000
INCOME     -0.120        0.035       -0.210     1.000       -3.425   0.001

Analysis of Variance
Source       Sum-of-Squares   df    Mean-Square   F-ratio   P
Regression   893.479          1     893.479       11.729    0.001
Residual     19348.271        254   76.174

*** WARNING ***
Case 256 is an outlier (Studentized Residual = 4.598)

Durbin-Watson D Statistic      0.659
First Order Autocorrelation    0.631

Residuals and data have been saved.
>use resid
SYSTAT Rectangular file Z:\mydocs\s208\resid.SYD,
created Mon Jan 20, 2003 at 10:04:34, contains variables:
ESTIMATE  RESIDUAL  LEVERAGE  COOK  STUDENT  SEPRED
ID  SEX  AGE  MARITAL  EDUCATN  EMPLOY
[...deleted output...]
>let absres = abs(residual)
>plot absres*estimate/stick=out smooth=lowess
>den residual/stick=out kernel
>pplot residual/stick=out
>rem calculate expected residuals under normality explicitly
>rem to do correlation test
>sort residual
256 cases and 56 variables processed.
>let resrank = case
>list residual resrank/n=10
Case number   RESIDUAL   RESRANK
1             -11.212     1.000
2             -11.092     2.000
3             -10.973     3.000
4             -10.733     4.000
5             -10.733     5.000
6             -10.494     6.000
7             -10.015     7.000
8              -9.733     8.000
9              -9.571     9.000
10             -9.536    10.000
>let expres = zif((resrank - 0.5)/256)
>rem next plot should be same as pplot command
>plot expres*residual/stick=out
>corr
>pearson expres*residual

Pearson correlation matrix
            RESIDUAL
EXPRES      0.926
Number of observations: 256

>rem test for normality using correlation test and
>rem ALSM5e Table B.6
>plot residual*female/stick=out smooth=linear
>den income/stick=out kernel
>den income/stick=out kernel
>rem the kernel density of income is "normalized" by setting the power
>rem exponent interactively after double-clicking on the graph, as shown
>rem in following picture
>rem this yields the kernel density with power exponent = 0.2
>USE "Z:\mydocs\ys209\world209.syd"
[...deleted output...]
>plot v195*v181/stick=out smooth=lowess
>rem scatterplot with LOWESS curve shows strong non-linear
>rem pattern; relationship "straightened" by setting power
>rem exponent of V181 to 0.2 interactively after double-
>rem clicking on graph
>plot v195*v181/stick=out smooth=lowess
2. STATA Examples
(By Catherine Harnois and Cheol-Sung Lee, TAs in Spring 2003.)
We highly recommend the UCLA STATA module for diagnostics.
. use "C:\Stata\auto.dta",
clear
*1978 Automobile Data
. boxcox mpg, nolog
level(95)
Transform: (mpg^L-1)/L
L [95% Conf. Interval]
Log Likelihood
----------------------------------------------------
-0.3584 -1.1296 0.4078
-123.2664
Test: L == -1 chi2(1) =
2.68 Pr>chi2 = 0.1018
L == 0 chi2(1) =
0.82 Pr>chi2 = 0.3649
L == 1 chi2(1) =
12.25 Pr>chi2 = 0.0005
*nolog level(95) tells
Stata not to display the iterations log but to include
a 95% confidnence
interval. The interval rejects the hypothesis that the best
transformation
is no transformation.
* next create a new variable newmpg based on the optimal exponent found by the boxcox procedure
. gen newmpg = (mpg^(-0.3584) - 1)/(-0.3584)
* next compare distributions of mpg and newmpg
. graph mpg, bin(10) ylabel xlabel norm t1(raw data)
. graph newmpg, bin(10) ylabel xlabel norm t1(transformed data)
. kdensity mpg, normal border
. kdensity newmpg, normal border
. ksm mpg weight, lowess xlab ylab border
. ksm newmpg weight, lowess xlab ylab border
. stem mpg
Stem-and-leaf plot for mpg (Mileage (mpg))
1t | 22
1f | 44444455
1s | 66667777
1. | 88888888899999999
2* | 00011111
2t | 22222333
2f | 444455555
2s | 666
2. | 8889
3* | 001
3t |
3f | 455
3s |
3. |
4* | 1
. ladder mpg

Transformation      formula        Chi-sq(2)   P(Chi-sq)
------------------------------------------------------------------
cube                mpg^3            43.59       0.000
square              mpg^2            27.03       0.000
raw                 mpg              10.95       0.004
square-root         sqrt(mpg)         4.94       0.084
log                 log(mpg)          0.87       0.647
reciprocal root     1/sqrt(mpg)       0.20       0.905
reciprocal          1/mpg             2.36       0.307
reciprocal square   1/(mpg^2)        11.99       0.002
reciprocal cube     1/(mpg^3)        24.30       0.000
. pnorm mpg
. pnorm newmpg
. regress mpg weight

  Source |       SS       df       MS            Number of obs =      74
---------+------------------------------         F(  1,    72) =  134.62
   Model |  1591.9902      1  1591.9902          Prob > F      =  0.0000
Residual |  851.469256    72  11.8259619         R-squared     =  0.6515
---------+------------------------------         Adj R-squared =  0.6467
   Total |  2443.45946    73  33.4720474         Root MSE      =  3.4389

------------------------------------------------------------------------------
     mpg |      Coef.   Std. Err.        t     P>|t|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
  weight |  -.0060087   .0005179    -11.603   0.000      -.0070411   -.0049763
   _cons |   39.44028   1.614003     24.436   0.000       36.22283    42.65774
------------------------------------------------------------------------------
. fpredict yhat
. fpredict e, resid
. graph e yhat
. rvfplot, oneway twoway box yline(0) ylabel xlabel
. hettest

Cook-Weisberg test for heteroscedasticity using fitted values of mpg
     Ho: Constant variance
     chi2(1)     =   11.05
     Prob > chi2 =   0.0009

. ovtest

Ramsey RESET test using powers of the fitted values of mpg
     Ho: model has no omitted variables
     F(3, 69)    =   1.77
     Prob > F    =   0.1616
Appendix - Mathematical Functions
Mathematical and Statistical Functions Available in SYSTAT
Mathematical and Statistical Functions Available in STATA
Please consult the STATA manual.
Last modified 30 Jan 2006