Soci 709 (formerly 209) Module 7 - QUALITATIVE INDEPENDENT VARIABLES

1.  INDICATOR VARIABLES

An indicator (or binary) variable is a variable that can take only two values, 0 or 1.
Indicator variables are used to represent qualitative (or nominal or categorical) variables.
A qualitative (nominal or categorical) variable with k classes is represented in the regression model by k-1 indicators.
(Indicators are also called "dummy variables" but this usage is unfortunate as it incorrectly implies that indicators are "fake" variables in some sense, which they are not.)
Regression models with indicator variables are very important in part because they represent a bridge between regression analysis and a set of statistical techniques called analysis of variance (ANOVA) and analysis of covariance (ANCOVA) - see the second part of ALSM5e or ALSM4e.  All ANOVA or ANCOVA models can be represented as regression models with suitably defined indicators.

Examples:

Note on naming an indicator variable: consider naming an indicator variable after the category that has the value 1.  For example, a variable that is 1 for female and 0 for male can be named FEMALE; the meaning remains clear. If the variable is named SEX or GENDER, six months later you will have forgotten which category is 1 and which is 0.

Detailed example 1: Restaurants Study

In a study of restaurants, volume of sales (in dollars) is the dependent variable.  One independent variable (X1) is Number of Households in the area (a regular continuous variable). The other independent variable is restaurant location, a qualitative variable with 3 categories: Highway, Shopping Mall, or Street.  The indicators used are
X2 = 1 if Shopping Mall location, 0 otherwise
X3 = 1 if Street location, 0 otherwise
Highway location does not have an indicator associated with it.  Highway location is called the reference, baseline, or omitted category or class.  The reference class is the class for which every indicator is set to zero.  The restaurant data are set up as in the next exhibit. The regression model is
yi = b0 + b1Xi1 + b2Xi2 + b3Xi3 + ei
where Xi1 is the number of households in the area and Xi2 and Xi3 are the two location indicators.
The meaning of the coefficients is revealed by examining the regression function
E{Y}  = b0 + b1X1 + b2X2 + b3X3
There are three cases, depending on location
 
Values of Regression Function E{Y} = b0 + b1X1 + b2X2 + b3X3 for Different Restaurant Locations

Restaurant Location   Values of Indicators   Regression Function
Highway               X2 = 0, X3 = 0         E{Y} = b0 + b1X1 + b2(0) + b3(0) = b0 + b1X1
Shopping Mall         X2 = 1, X3 = 0         E{Y} = b0 + b1X1 + b2(1) + b3(0) = (b0 + b2) + b1X1
Street                X2 = 0, X3 = 1         E{Y} = b0 + b1X1 + b2(0) + b3(1) = (b0 + b3) + b1X1

The table shows that b2 and b3 represent the differences in intercept for Shopping Mall and Street location, respectively, relative to Highway (the omitted category).  This can be seen by plotting the regression function for the three locations.
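The coding of the location variable and the three resulting response functions can be sketched in a few lines of Python. This is only an illustration: the function names and the coefficient values are made up and are not estimates from the restaurant study.

```python
# Hypothetical helper: code a 3-class location variable as k-1 = 2 indicators,
# with Highway as the reference class (both indicators equal 0).
LOCATIONS = ("Highway", "Shopping Mall", "Street")

def location_indicators(location):
    """Return (X2, X3) for a restaurant location."""
    if location not in LOCATIONS:
        raise ValueError(f"unknown location: {location!r}")
    x2 = 1 if location == "Shopping Mall" else 0
    x3 = 1 if location == "Street" else 0
    return x2, x3

def expected_sales(b, households, location):
    """E{Y} = b0 + b1*X1 + b2*X2 + b3*X3 for a coefficient tuple b."""
    b0, b1, b2, b3 = b
    x2, x3 = location_indicators(location)
    return b0 + b1 * households + b2 * x2 + b3 * x3

# Made-up coefficients, for illustration only (not estimates from the study).
b = (100.0, 2.0, 30.0, -10.0)
for loc in LOCATIONS:
    # At X1 = 0 the function returns the intercept for that location.
    print(loc, location_indicators(loc), expected_sales(b, 0, loc))
```

With these made-up values the three intercepts are b0, b0 + b2 and b0 + b3, while the slope b1 is the same for every location.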

Detailed example 2: Depression Scores Study

Models of the depression score with the Afifi & Clark (1984) data set.

Number of Categories

A categorical variable with k classes (categories) is represented by k-1 indicators, with one category omitted.  Using k indicators together with the constant term would make the X'X matrix singular (the k indicators add up to the constant column of 1s), so b cannot be estimated.
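The singularity can be checked numerically on a tiny made-up data set. The sketch below (helper names are hypothetical) computes the determinant of X'X exactly, first with a constant plus k = 3 indicators and then with a constant plus k-1 = 2 indicators.

```python
from fractions import Fraction

def xtx(rows):
    """Return X'X for a design matrix given as a list of rows."""
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]

def determinant(matrix):
    """Exact determinant by Gaussian elimination on Fractions."""
    m = [[Fraction(v) for v in row] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # no usable pivot: the matrix is singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

# Six made-up cases, two per class of a 3-class variable.
classes = ["A", "A", "B", "B", "C", "C"]

# Constant plus k = 3 indicators: the indicators sum to the constant column.
x_full = [[1, int(c == "A"), int(c == "B"), int(c == "C")] for c in classes]
# Constant plus k - 1 = 2 indicators: class A is the reference class.
x_ref = [[1, int(c == "B"), int(c == "C")] for c in classes]

print(determinant(xtx(x_full)))  # zero, so X'X has no inverse
print(determinant(xtx(x_ref)))   # nonzero, so X'X can be inverted
```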

2.  INDICATOR MODELS WITH INTERACTIONS

The indicator model with interaction is discussed in the context of a substantive example.

Detailed example

(From Hamilton 2006: pp. 180--185.)

Units are states of the U.S.  Variables are

y (dependent variable) is csat (=mean composite SAT score by State)
x1 is percent = percent HS graduates taking SAT (a continuous variable)
x2 is reg2 = 1 for North East Region, 0 for other regions (an indicator variable)

The coefficient of a continuous independent variable x1 may be allowed to vary as a function of an indicator variable by using an interaction term.  The response function for the interaction model becomes

E{y} = b0 + b1x1 + b2x2 + b3x1x2
where y = mean composite state SAT score, x1 = percent HS graduates taking the SAT and x2 = North-East indicator (1 if state is in North-East, 0 otherwise).

Again to understand the meaning of the coefficients one must examine the regression (response) function for each category of the qualitative variable:
 

Values of Regression Function E{y} = b0 + b1x1 + b2x2 + b3x1x2 for Different States

Region           Values of x2 and x1x2   Regression function
not North-East   x2 = 0, x1x2 = 0        E{y} = b0 + b1x1 + b2(0) + b3(0) = b0 + b1x1
North-East       x2 = 1, x1x2 = x1       E{y} = b0 + b1x1 + b2(1) + b3x1(1) = (b0 + b2) + (b1 + b3)x1

STATA (V9) commands are:

use "Z:\mydocs\s209\hamiltonv9\states.dta", clear
describe region
tabulate region
tabulate region, gen(reg)
* gen(reg) creates 4 (0,1) indicators, one for each region
regress csat percent reg2
gen nexpct=reg2*percent
* nexpct is the reg2 by percent interaction term
regress csat percent reg2 nexpct
* the next graph is equivalent to a conditional effects plot
graph twoway lfit csat percent || scatter csat percent || , by(reg2)

The estimated model is (t-ratios in parentheses):

y = 1035.519 - 2.859x1 - 241.357x2 + 4.180(x1x2)    R2 = .843   n = 50
    (150.01)   (-14.06)  (-2.07)     (2.58)

The significant interaction effect suggests that percent (x1) affects csat negatively in most regions but positively in the North East.  (Is the effect of North East of the reinforcement or interference type?)  Thus in the interaction model both the intercept and the slope of the continuous variable x1 differ between the two types of state. This can be visualized with a conditional effects plot that compares the regression functions for the two types of states.
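The intercept and slope implied for each type of state follow directly from the estimated coefficients. The short Python sketch below does the arithmetic; the function names are hypothetical, and the coefficients are the estimates reported above.

```python
# Estimated coefficients of the interaction model reported above.
b0, b1, b2, b3 = 1035.519, -2.859, -241.357, 4.180

def region_line(x2):
    """Return (intercept, slope) of csat on percent for x2 = 0 or 1."""
    return b0 + b2 * x2, b1 + b3 * x2

def predicted_csat(percent, x2):
    """Predicted mean composite SAT score for a given percent and region type."""
    intercept, slope = region_line(x2)
    return intercept + slope * percent

for x2, label in ((0, "not North-East"), (1, "North-East")):
    intercept, slope = region_line(x2)
    print(f"{label}: intercept = {intercept:.3f}, slope = {slope:.3f}")
```

The North-East line has intercept b0 + b2 = 794.162 and slope b1 + b3 = 1.321, which is the positive slope referred to in the text.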

3.  COMPARISON OF TWO OR MORE REGRESSION FUNCTIONS

A very common research strategy is to study the similarities and differences between regression models for 2 or more populations.

Example:  a social scientist wants to compare a regression model of earnings as a function of education and experience (time in the labor force) for full-time year-round employed men and women, perhaps to detect any discriminatory treatment suffered by one of the groups.

Example:  a sociologist wants to compare the status attainment process in the 1960s and 1990s by estimating the same model of occupational prestige as a function of education and family background characteristics (F's Education, M's Education, F's Occupation, etc.) using subjects in the same age range in the 1960s and 1990s, perhaps to monitor any trend of increasing or decreasing social mobility between the two periods.

Example: a much more exciting project is a comparison of two production lines for making soap bars.  For each production line, the relation of interest is that between the amount of scrap for the day (the dependent variable) and the speed of the production line.  A symbolic scatter plot of Amount of Scrap against Line Speed suggests that the regression relation is not the same for the two production lines (next exhibit).

When it is reasonable to assume that the error term variances in the regression models for the different populations are equal, one can use indicator variables to test the equality of the different regression functions.  (When the variances are not equal, a suitable transformation of Y can equalize them approximately.  See NKNW p. 472 for an example of a formal test of the equality of the error variances for two populations.)  This is done by treating the different populations as classes of a predictor variable, defining indicator variables for the different populations, and estimating a single regression model containing appropriate interaction terms, similar to the interaction model above. For the Soap Production data, define an interaction model with regression function
E{Y} = b0 + b1X1 + b2X2 + b3X1X2
where Y = Amount of Scrap, X1 = Line Speed, and X2 is Line 1 (=1 if production line 1, 0 if production line 2).
The estimated regression model is (t-ratios in parentheses)
^Y = 7.57 + 1.322X1 + 90.39X2 - .1767X1X2
     (.36)   (14.27)    (3.19)    (-1.37)
so one concludes that the slope of the relationship between Amount of Scrap and Speed does not differ significantly across production lines (b3 is not significant, with t* = -1.37) but that the intercepts are significantly different, so that Amount of Scrap is overall higher in Line 1 than in Line 2 (b2 is significant, with t* = 3.19).
(One can formally test the equality of the regression lines with the joint test of H0: b2 = b3 = 0 ; see NKNW pp. 472-473.  We will study joint testing later.)
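To make the pooling step concrete, the Python sketch below stacks the cases from the two production lines into a single design matrix with columns 1, X1, X2 and X1X2, and then reads off the line implied for each production line by the estimates above. The line speeds and the function names are made up for illustration.

```python
def pooled_rows(speeds_line1, speeds_line2):
    """Return design rows (1, X1, X2, X1*X2) for the two lines stacked together."""
    rows = []
    for x2, speeds in ((1, speeds_line1), (0, speeds_line2)):
        for x1 in speeds:
            rows.append((1, x1, x2, x1 * x2))
    return rows

# Made-up line speeds, for illustration only.
rows = pooled_rows([100, 200], [150, 250])
for row in rows:
    print(row)

# Estimated coefficients reported above.
b = (7.57, 1.322, 90.39, -0.1767)

def fitted_line(x2):
    """Return (intercept, slope) implied by the pooled fit for X2 = 0 or 1."""
    return b[0] + b[2] * x2, b[1] + b[3] * x2

print("Line 1:", fitted_line(1))
print("Line 2:", fitted_line(0))
```

For production line 2 the indicator and the interaction column are both zero, so its cases contribute only to the estimates of b0 and b1.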

Example: comparing the return to education (in terms of income) for men and women, using the survey2.syd data set.

Q -- What kind of substantive research often uses models like the ones in the last exhibit?
 

4.  PIECEWISE LINEAR REGRESSION & DISCONTINUITIES

1.  Piecewise Regression -- Change of Slope Point Known

Indicator variables can be used to model situations in which the slope of the regression of Y on X differs for two ranges of X, as in the following exhibit.  Assuming that Xp (the point where the slope changes) is known, the piecewise linear relation is represented by a regression model with response function
E{Y} = b0 + b1X1 + b2(X1 - 500)X2
where Y = Unit Cost, X1 = Lot Size, X2 = 1 if X1>500, = 0 otherwise, and Xp = 500.
One can convince oneself that this function represents the piecewise linear relation by examining the response function separately for the range X1 <= 500 and the range X1 > 500, as shown in the next exhibit.  The model can be estimated from the data.  The estimated regression function is
^Y = 5.89545 - .00395X1 - .00389(X1 - 500)X2
Piecewise linear modelling can be easily extended to more than 2 pieces.  See NKNW p. 477.
When Xp is not known, it can be estimated along with the regression function (see the next subsection).  The following link shows a complete piecewise linear analysis using the Lerner data.
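The estimated piecewise function for the unit cost example can be checked with a small Python sketch (the names are hypothetical; the coefficients are the estimates reported above). It shows that the slope is b1 up to Xp = 500 and b1 + b2 beyond it, and that the two pieces meet at X1 = 500.

```python
# Estimated coefficients of the piecewise model reported above.
B0, B1, B2 = 5.89545, -0.00395, -0.00389
XP = 500  # known change-of-slope point

def unit_cost(lot_size):
    """Estimated E{Y} = b0 + b1*X1 + b2*(X1 - Xp)*X2, with X2 = 1 if X1 > Xp."""
    x2 = 1 if lot_size > XP else 0
    return B0 + B1 * lot_size + B2 * (lot_size - XP) * x2

slope_below = B1        # slope for X1 <= 500
slope_above = B1 + B2   # slope for X1 > 500

print(f"slope below Xp: {slope_below:.5f}")
print(f"slope above Xp: {slope_above:.5f}")
print(f"estimated unit cost at X1 = 500: {unit_cost(500):.5f}")
```

Because the second term is multiplied by (X1 - 500), it contributes nothing at X1 = 500, which is what keeps the function continuous at the change point.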

2.  Piecewise Regression -- Change of Slope Point Unknown

When xp is not known, xp and the regression function can be estimated simultaneously using nonlinear least-squares.
The following exhibit shows how to estimate a piecewise regression model when xp is not known, using STATA and the Lerner data.  Note that the estimate of xp is 49, which is a bit less than the value suggested by the lowess regression curve (about 60).  However, the 95% CI for xp is wide (28.9053 to 69.09471), so 60 is not an implausible estimate of the "elbow" of the relationship.
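For a fixed candidate value of xp the piecewise model is linear in its coefficients, so a simple alternative to a nonlinear least-squares routine is a grid search: fit ordinary least squares at each candidate xp and keep the candidate with the smallest error sum of squares. The Python sketch below illustrates this substitute technique on made-up, noise-free data; it is not the STATA procedure used in the exhibit, and all names and values are hypothetical.

```python
def solve(a, y):
    """Solve the linear system a * beta = y by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_sse(xs, ys, xp):
    """OLS of y on 1, x and max(x - xp, 0); return (SSE, coefficients)."""
    # max(x - xp, 0) equals (x - xp) * X2 with X2 = 1 if x > xp, 0 otherwise.
    rows = [(1.0, x, max(x - xp, 0.0)) for x in xs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    beta = solve(xtx, xty)
    sse = sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, y in zip(rows, ys))
    return sse, beta

# Made-up, noise-free data with a true change of slope at x = 40.
xs = [float(x) for x in range(0, 101, 5)]
ys = [10.0 + 0.5 * x + 1.5 * max(x - 40.0, 0.0) for x in xs]

# Try each candidate change point and keep the one with the smallest SSE.
candidates = range(10, 91, 5)
best_xp = min(candidates, key=lambda xp: fit_sse(xs, ys, xp)[0])
best_sse, best_beta = fit_sse(xs, ys, best_xp)
print("estimated change point:", best_xp)
print("coefficients:", [round(b, 4) for b in best_beta])
```

A grid search only finds xp to the resolution of the grid and does not by itself give a confidence interval for xp, which is one reason to prefer a nonlinear least-squares routine in practice.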

3.  Discontinuity in Regression Function

One can also use indicators to model discontinuous piecewise linear regression relations, as in the following exhibit.  See NKNW pp. 477-478 for details.

5.  INDICATORS VERSUS QUANTITATIVE VARIABLES

1.  Indicators Versus Allocated Codes

Qualitative variables with ordinal categories can often be represented either by allocated codes or by indicators.  For example, persons of Hispanic origin might be asked the question "How often do you use Spanish at home?" with response categories Frequently, Occasionally, Never.  The variable may be coded with allocated codes in variable X1, or with two indicators X2 and X3:
 
Alternative codings of frequency of Spanish use at home
(X1: allocated codes; X2 and X3: indicators)

Class          X1   X2: Frequent User   X3: Occasional User
Never          1    0                   0
Occasionally   2    0                   1
Frequently     3    1                   0
Using allocated codes X1 in the model with response function E{Y} = b0 + b1X1 constrains the differences in response function between adjacent classes to be the same, as can be seen by writing the response function for each category of use:
Never:  E{Y} = b0 + b1
Occasionally:  E{Y} = b0 + 2b1
Frequently:  E{Y} = b0 + 3b1
Therefore
E{Y|Frequently} - E{Y|Occasionally} = E{Y|Occasionally} - E{Y|Never} =  b1
(Say this in words.)  The assumption of constant differences in effect among contiguous classes may or may not be substantively plausible.

Using indicators X2 and X3 in the model with response function E{Y} = b0 + b2X2 + b3X3 allows differences in response functions among classes to be different:

E{Y|Frequently} - E{Y|Never} = b2
E{Y|Occasionally} - E{Y|Never} = b3
E{Y|Frequently} - E{Y|Occasionally} = (b2 - b3)
Thus in the indicators model the effects of the classes are not arbitrarily restricted.  On the other hand indicators use up more degrees of freedom (k-1 variables) than allocated codes (one variable).
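The contrast between the two codings can be illustrated with a short Python sketch. The coefficient values and function names are made up; the codes are those of the table of alternative codings.

```python
# (X1, X2, X3) for each class, as in the table of alternative codings.
CODES = {
    "Never": (1, 0, 0),
    "Occasionally": (2, 0, 1),
    "Frequently": (3, 1, 0),
}

def mean_allocated(b0, b1, use):
    """E{Y} = b0 + b1*X1 under the allocated-codes model."""
    x1, _, _ = CODES[use]
    return b0 + b1 * x1

def mean_indicators(b0, b2, b3, use):
    """E{Y} = b0 + b2*X2 + b3*X3 under the indicators model."""
    _, x2, x3 = CODES[use]
    return b0 + b2 * x2 + b3 * x3

# Made-up coefficients, for illustration only.
alloc = {u: mean_allocated(10, 4, u) for u in CODES}
indic = {u: mean_indicators(10, 9, 2, u) for u in CODES}

# Allocated codes: both adjacent differences equal b1 = 4.
print(alloc["Occasionally"] - alloc["Never"],
      alloc["Frequently"] - alloc["Occasionally"])
# Indicators: the adjacent differences are b3 = 2 and b2 - b3 = 7.
print(indic["Occasionally"] - indic["Never"],
      indic["Frequently"] - indic["Occasionally"])
```

No choice of b0 and b1 in the allocated-codes model can reproduce unequal adjacent differences such as 2 and 7; the indicators model can, at the cost of one extra coefficient.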

2.  Indicators Versus Quantitative Variable

For the same reasons, it may be useful to use a set of indicators rather than continuous values even when a variable is inherently quantitative.
Example: to study the relationship of earnings with age, age in years is represented by a set of indicators corresponding to 5-year categories 20-24, 25-29, 30-34, etc.  Using indicators may allow for a better "tracking" of the non-linear relationship between Y (earnings) and X (age).  The disadvantage of indicators is that more degrees of freedom are consumed (k-1 indicators versus 1 continuous variable), but this is not a problem with large data sets.
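Turning age in years into a set of 5-year indicators might look like the following Python sketch. The age range (20 to 64), the choice of 20-24 as the reference class, and the variable names are assumptions made for the illustration.

```python
def age_indicators(age, lower=20, upper=65, width=5):
    """Return (0,1) indicators for 5-year age classes.

    The first class (20-24 with the defaults) is the reference class and
    gets no indicator. The bounds are hypothetical choices for this sketch.
    """
    if not lower <= age < upper:
        raise ValueError(f"age {age} is outside [{lower}, {upper})")
    indicators = {}
    for start in range(lower + width, upper, width):
        name = f"AGE{start}_{start + width - 1}"
        indicators[name] = 1 if start <= age < start + width else 0
    return indicators

print(age_indicators(22))  # reference class: every indicator is 0
print(age_indicators(27))  # only AGE25_29 is 1
```

With these defaults there are 9 age classes and therefore 8 indicators, compared with a single coefficient when age enters as a continuous variable.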
Example: The variable EDUCATN in the Afifi & Clark (1984) data set can be viewed as an allocated code or as a quantitative variable.  One can create a set of indicators from the code corresponding to various levels of education.  The next exhibit compares models of the effect of education on income using years of education (as a quantitative variable), on one hand, and using a set of indicators, on the other.

3.  Alternative Coding for Indicators

There are alternatives to the coding of a qualitative variable with k-1 (0,1) indicators and a constant term.  See ALSM5e ???; ALSM4e pp. 481-482 and Section 16.11 (pp. 696-701) for discussion of the relationship between regression and ANOVA.

6.  QUASI-INDICATORS: DF ANALYSIS OF FAMILY DATA

This section is to be added later.



Last modified 27 Feb 2006