Soci 709 (formerly 209) Module 7 - QUALITATIVE
INDEPENDENT VARIABLES
1. INDICATOR VARIABLES
An indicator (or binary variable)
is a variable that can have only two values, 0 or 1.
Indicator variables are used to represent
qualitative
(or nominal or categorical) variables.
A qualitative (nominal or categorical) variable
with k classes is represented in the regression model by k-1
indicators.
(Indicators are also called "dummy variables"
but this usage is unfortunate as it incorrectly implies that indicators
are "fake" variables in some sense, which they are not.)
Regression models with indicator variables
are very important in part because they represent a bridge between regression
analysis and a set of statistical techniques called analysis of variance
(ANOVA) and analysis of covariance (ANCOVA) - see the second part
of ALSM5e or ALSM4e. All ANOVA or ANCOVA models can be represented
as regression models with suitably defined indicators.
Note on naming an indicator variable: consider naming an indicator variable after the category that has the value 1. For example, a variable that is 1 for female and 0 for male can be named FEMALE; the meaning remains clear. If the variable is named SEX or GENDER, you will not remember which category is 1 and which is 0 six months later.
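For instance, a minimal Stata sketch (assuming a hypothetical variable sex coded 1 for male and 2 for female):
* create the indicator and name it after the category coded 1
gen female = (sex == 2)
label variable female "1 = female, 0 = male"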
Detailed example 1: Restaurants Study
In a study of restaurants, volume of sales (in
dollars) is the dependent variable. One independent variable (x1)
is Number of Households in the area (a regular continuous variable). The
other independent variable is restaurant location, a qualitative variable
with 3 categories: Highway, Shopping Mall, or Street. The indicators
used are
X2 = 1 if Shopping Mall location, 0 otherwise
X3 = 1 if Street location, 0 otherwise
Highway location does not have an indicator associated
with it. Highway location is called the reference, baseline,
or omitted category or class. The reference class is the class
for which every indicator is set to zero. The restaurant data are
set up as in the next exhibit.
The regression model is
Yi = b0 + b1Xi1 + b2Xi2 + b3Xi3 + ei
where Xi1 is Number of Households and Xi2 and Xi3 are the two location indicators.
The meaning of the coefficients is revealed by examining the regression function
E{Y} = b0 + b1X1 + b2X2 + b3X3
There are three cases, depending on location.
Values of Regression Function E{Y} = b0 + b1X1 + b2X2 + b3X3 for Different Restaurant Locations
Restaurant Location | Values of Indicators              | Regression Function
Highway             | E{Y} = b0 + b1X1 + b2(0) + b3(0)  | E{Y} = b0 + b1X1
Shopping Mall       | E{Y} = b0 + b1X1 + b2(1) + b3(0)  | E{Y} = (b0 + b2) + b1X1
Street              | E{Y} = b0 + b1X1 + b2(0) + b3(1)  | E{Y} = (b0 + b3) + b1X1
The table shows that b2 and b3 represent the differences in intercept for Shopping Mall and Street locations, respectively, relative to Highway (the omitted category). This can be seen by plotting the regression function for the three locations.
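A minimal Stata sketch of how such a model might be estimated (variable names sales, households, and location are hypothetical, with location coded 1 = Highway, 2 = Shopping Mall, 3 = Street):
tabulate location, gen(loc)
* loc1, loc2, loc3 are (0,1) indicators for Highway, Shopping Mall, Street
regress sales households loc2 loc3
* loc1 (Highway) is left out and serves as the reference class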
Detailed example 2: Depression Scores Study
Models of the depression score with the Afifi
& Clark (1984) data set.
Number of Categories
A categorical variable with k classes (categories) is represented by k-1 indicators, with one category omitted. Using k indicators to represent a categorical variable with k classes would make the X'X matrix singular, so b cannot be estimated.
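A minimal sketch of what goes wrong (continuing the hypothetical restaurant variables above): when all three indicators are entered along with the constant, the columns of X are linearly dependent, and Stata will typically drop one of them automatically.
regress sales households loc1 loc2 loc3
* loc1 + loc2 + loc3 = 1 for every observation, so one indicator is
* collinear with the constant and is omitted from the fit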
2. INDICATOR MODELS WITH INTERACTIONS
The indicator model with interaction is discussed in the context of a
substantive example.
Detailed example
(From Hamilton 2006: pp. 180--185.)
Units are states of the U.S. Variables are
y (dependent variable) is csat (= mean composite SAT score by state)
x1 is percent = percent of HS graduates taking the SAT (a continuous variable)
x2 is reg2 = 1 for North East region, 0 for other regions (an indicator variable)
The coefficient of a continuous independent variable may be allowed to vary as a function of an indicator variable by using an interaction term. The response function for the interaction model becomes
E{y} = b0 + b1x1 + b2x2 + b3x1x2
where y = mean composite state SAT score, x1 = percent of HS graduates taking the SAT, and x2 = North-East indicator (1 if the state is in the North-East, 0 otherwise).
Again, to understand the meaning of the coefficients, one must examine the regression (response) function for each category of the qualitative variable:
Values of Regression Function E{y} = b0 + b1x1 + b2x2 + b3x1x2 for Different States
Region                | Values of x2 and x1x2              | Regression Function
not North-East (x2=0) | E{y} = b0 + b1x1 + b2(0) + b3(0)   | E{y} = b0 + b1x1
North-East (x2=1)     | E{y} = b0 + b1x1 + b2(1) + b3x1(1) | E{y} = (b0 + b2) + (b1 + b3)x1
STATA (V9) commands are:
use "Z:\mydocs\s209\hamiltonv9\states.dta", clear
describe region
tabulate region
tabulate region, gen(reg)
* gen(reg) creates 4 (0,1) indicators, one for each region
regress csat percent reg2
gen nexpct=reg2*percent
* nexpct is the reg2 by percent interaction term
regress csat percent reg2 nexpct
* the next graph is equivalent to a conditional effects plot
graph twoway lfit csat percent || scatter csat percent || , by(reg2)
The estimated model is (t-ratios in parentheses):
^y = 1035.519 - 2.859x1 - 241.357x2 + 4.180(x1x2)     R2 = .843   n = 50
     (150.01)   (-14.06)  (-2.07)     (2.58)
The significant interaction effect suggests that percent (x1) affects csat negatively in most regions but positively in the North East. (Is the effect of North East of the reinforcement or interference type?) Thus in the interaction model both the intercept and the slope of the continuous variable x1 differ between North-East and other states. This can be visualized with a conditional effects plot that compares the regression functions for the two groups of states.
3. COMPARISON OF TWO OR MORE REGRESSION FUNCTIONS
A very common research strategy is to study the
similarities and differences between regression models for 2 or more populations.
Example: a social scientist wants to compare a regression model of earnings as a function of education and experience (time in the labor force) for full-time year-round employed men and women, perhaps to detect any discriminatory treatment suffered by one of the groups.
Example: a sociologist wants to compare the status attainment process in the 1960s and 1990s by estimating the same model of occupational prestige as a function of education and family background characteristics (F's Education, M's Education, F's Occupation, etc.) using subjects in the same age range in the 1960s and 1990s, perhaps to monitor any trend of increasing or decreasing social mobility between the two periods.
Example: a much more exciting project is a
comparison of two production lines for making soap bars. For each
production line, the relation of interest is that between the amount of
scrap for the day (the dependent variable) and the speed of the production
line. A symbolic scatter plot of Amount of Scrap against Line Speed
suggests that the regression relation is not the same for the two production
lines (next exhibit).
When it is reasonable to assume that the error term variances in the regression models for the different populations are equal, one can use indicator variables to test the equality of the different regression functions. (When the variances are not equal, a suitable transformation of Y can equalize them approximately. See NKNW p. 472 for an example of a formal test of the equality of the error variances for two populations.) This is done by considering the different populations as classes of a predictor variable, defining indicator variables for the different populations, and estimating a single regression model containing appropriate interaction terms, similar to the interaction model of the previous section.
For the Soap Production data, define an interaction model with regression function
E{Y} = b0 + b1X1 + b2X2 + b3X1X2
where Y = Amount of Scrap, X1 = Line Speed, and X2 is the Line 1 indicator (= 1 if production line 1, 0 if production line 2).
The estimated regression model is (t-ratios in parentheses)
^Y = 7.57 + 1.322X1 + 90.39X2 - .1767X1X2
     (.36)  (14.27)   (3.19)    (-1.37)
One concludes that the slope of the relationship between Amount of Scrap and Speed does not differ across production lines (b3 is not significant, with t* = -1.37) but that the intercepts are significantly different, so that Amount of Scrap is overall higher in Line 1 than in Line 2 (b2 is significant, with t* = 3.19).
(One can formally test the equality of the regression lines with the joint test of H0: b2 = b3 = 0; see NKNW pp. 472-473. We will study joint testing later.)
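A minimal Stata sketch of this comparison and of the joint test (variable names scrap, speed, and line1 are hypothetical):
gen speedline1 = speed*line1
* speedline1 is the Line Speed by Line 1 interaction term
regress scrap speed line1 speedline1
* joint test that the two regression lines coincide (H0: b2 = b3 = 0)
test line1 speedline1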
Example: comparing the return to education
(in terms of income) for men and women, using the survey2.syd data set.
Q -- What kind of substantive research often uses
models like the ones in the last exhibit?
4. PIECEWISE LINEAR REGRESSION & DISCONTINUITIES
1. Piecewise Regression -- Change of Slope Point Known
Indicator variables can be used to model situations
in which the slope of the regression of Y on X differs for two ranges
of X, as in the following exhibit:
Assuming that Xp (the point where the slopes change) is known, the piecewise linear relation is represented by a regression model with response function
E{Y} = b0 + b1X1 + b2(X1 - 500)X2
where Y = Unit Cost, X1 = Lot Size, X2 = 1 if X1 > 500 and 0 otherwise, and Xp = 500.
One can convince oneself that this function represents the piecewise linear relation by examining the response function separately for the range X1 <= 500 and the range X1 > 500, as shown in the next exhibit.
The model can be estimated from the data. The estimated regression function is
^Y = 5.89545 - .00395X1 - .00389(X1 - 500)X2
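A minimal Stata sketch of this setup (variable names unitcost and lotsize are hypothetical; the change point Xp = 500 is taken as known):
gen x2 = (lotsize > 500)
* (lotsize - 500) enters the model only for lots above the change point
gen xdiff = (lotsize - 500)*x2
regress unitcost lotsize xdiff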
Piecewise linear modelling can be easily extended
to more than 2 pieces. See NKNW p. 477.
When Xp is not known, one may
- attempt to guess its position from the scatterplot
- use non-linear regression techniques to estimate Xp together with the other parameters of the model iteratively.
The following link shows a complete piecewise linear analysis using the Lerner data.
2. Piecewise Regression -- Change of Slope Point Unknown
When xp is not known, xp and the regression function can be estimated simultaneously using nonlinear least squares.
The following exhibit shows how to estimate a piecewise regression model when xp is not known, using STATA and the Lerner data. Note that the estimate of xp is 49, which is a bit less than the value suggested by the lowess regression curve (about 60), although the 95% CI for xp (28.9053 to 69.09471) is wide, so 60 is not an implausible estimate of the "elbow" of the relationship.
3. Discontinuity in Regression Function
One can also use indicators to model discontinuous
piecewise linear regression relations, as in the following exhibit.
See NKNW pp. 477-478 for details.
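One common formulation adds the indicator X2 itself to the piecewise model, E{Y} = b0 + b1X1 + b2(X1 - Xp)X2 + b3X2, so that b3 measures the size of the jump at Xp. A minimal Stata sketch, reusing the hypothetical variables x2 and xdiff defined above:
* the coefficient of x2 estimates the discontinuity at lotsize = 500
regress unitcost lotsize xdiff x2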
5. INDICATORS VERSUS QUANTITATIVE VARIABLES
1. Indicators Versus Allocated Codes
Qualitative variables with ordinal categories
can often be represented either by allocated codes or by indicators.
For example, persons of Hispanic origin might be asked the question "How
often do you use Spanish at home?" with response categories Frequently,
Occasionally, Never. The variable may be coded with allocated codes
in variable X1, or with two indicators X2 and X3:
Alternative codings of frequency of Spanish use at home (X1: allocated codes; X2 and X3: indicators)
Class        | X1 | X2: Frequent User | X3: Occasional User
Never        | 1  | 0                 | 0
Occasionally | 2  | 0                 | 1
Frequently   | 3  | 1                 | 0
Using allocated codes X1 in the model with response function
E{Y} = b0 + b1X1
constrains differences in the response function among classes to be the same, as can be seen by writing the response function for each category of use:
Never:        E{Y} = b0 + (1)b1
Occasionally: E{Y} = b0 + (2)b1
Frequently:   E{Y} = b0 + (3)b1
Therefore
E{Y|Frequently} - E{Y|Occasionally} = E{Y|Occasionally} - E{Y|Never} = b1
(Say this in words.) The assumption of constant differences in effect among contiguous classes may or may not be substantively plausible.
Using indicators X2 and X3 in the model with response function
E{Y} = b0 + b2X2 + b3X3
allows the differences in the response function among classes to differ:
E{Y|Frequently} - E{Y|Never} = b2
E{Y|Occasionally} - E{Y|Never} = b3
E{Y|Frequently} - E{Y|Occasionally} = b2 - b3
Thus in the indicator model the effects of the classes are not arbitrarily restricted. On the other hand, indicators use up more degrees of freedom (k-1 variables) than allocated codes (one variable).
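A minimal Stata sketch comparing the two codings (variable names are hypothetical; y is some response of interest, and x1, x2, x3 are coded as in the table above):
* allocated-code model: one coefficient, constant spacing between classes
regress y x1
* indicator model: two coefficients, unrestricted differences between classes
regress y x2 x3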
2. Indicators Versus Quantitative Variable
For the same reasons, it may be useful to use
a set of indicators rather than continuous values even when a variable
is inherently quantitative.
Example: to study the relationship of earnings
with age, age in years is represented by a set of indicators corresponding
to 5-year categories 20-24, 25-29, 30-34, etc. Using indicators
may allow for a better "tracking" of the non-linear relationship between
Y (earnings) and X (age). The disadvantage of indicators is that
more degrees of freedom are consumed (k-1 indicators versus 1
continuous variable), but this is not a problem with large data sets.
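A minimal Stata sketch of this strategy (variable names earnings and age are hypothetical):
* cut age into 5-year categories 20-24, 25-29, ..., 60-64
egen agecat = cut(age), at(20(5)65)
tabulate agecat, gen(aged)
* omit aged1 (ages 20-24) as the reference category
regress earnings aged2-aged9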
Example: The variable EDUCATN in the Afifi
& Clark (1984) data set can be viewed as an allocated code or as a
quantitative variable. One can create a set of indicators from the
code corresponding to various levels of education. The next exhibit
compares models of the effect of education on income using years of education
(as a quantitative variable), on one hand, and using a set of indicators,
on the other.
3. Alternative Coding for Indicators
Alternatives to the coding of a qualitative variable with k-1 (0,1) indicators and a constant term are
- k-1 indicators coded (0,1,-1) and a constant term. This is the same as (0,1) coding except that observations corresponding to the reference class are coded -1 for all the indicators. One can show that the intercept then represents an average intercept for the classes. This type of coding is common in ANOVA.
- k (0,1) indicators with no constant term. The coefficients of the indicators are then interpreted as class-specific intercepts.
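For the second alternative, a minimal Stata sketch (reusing the hypothetical restaurant variables sales, households, and loc1-loc3 from above):
* no constant term: each indicator coefficient is the class-specific
* intercept (expected sales when households = 0 in that location class)
regress sales households loc1 loc2 loc3, noconstant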
See ALSM5e ???; ALSM4e pp. 481-482 and Section
16.11 (pp. 696-701) for discussion of the relationship between regression
and ANOVA.
6. QUASI-INDICATORS: DF ANALYSIS OF FAMILY DATA
This section is to be added later.
Last modified 27 Feb 2006