University of North Carolina
at Chapel Hill

SOCI 209 - LINEAR REGRESSION MODELS - Spring 2006
Professor François Nielsen

Assignment 3 - Released Tue 21 Mar
DUE Tue 4 Apr

PROBLEMS ON GENERAL LINEAR TESTS, POLYNOMIAL REGRESSION, INTERACTION MODELS, & QUALITATIVE INDEPENDENT VARIABLES

Use a regression program of your choice to do problems requiring data analysis.
ALSM5e = Applied Linear Statistical Models 5e (2004) OR Applied Linear Regression Models 4e (2004) (new editions).
ALSM4e = Applied Linear Statistical Models 4e (1996) OR Applied Linear Regression Models 3e (1996) (old editions).
Some problems are in ALSM5e only.
 

General note:  When ALSM5e/ALSM4e use a phrase like "test whether variable Z can be dropped from the model", they mean "test the significance of the coefficient of variable Z" (since Z can be safely removed from the model if its coefficient is non-significant).

1.  (This was problem 7.27 p. 269 in a previous edition of the text; it is not in ALSM5e or ALSM4e) (fitting regression model with a known coefficient) (Hint: The answer is very short; you have to find a "trick".)
An analyst wanted to fit the regression model Yi = b0 + b1Xi1 + b2Xi2 + b3Xi3 + ei, i = 1,...,n by the method of least squares when it is known that b2 = 4.  How can the analyst obtain the desired fit using a multiple regression computer program?

For the next 4 problems you can use any of 3 approaches: (1) the general method of comparing full and reduced models as taught in class; (2)  the equivalent method of extra sums of squares (explained in ALSM5e Sections 7.1 to 7.3, pp. 256-268 [ALSM4e Sections 7.1 to 7.3, pp. 260-274] but not discussed in class); or (3) the test command in STATA or the hypothesis command of SYSTAT (explained in Module 8, Section 5), or equivalent commands in other statistical programs. Use file knnch06pr18 posted on the course site.
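
For readers who prefer to see approach (1) spelled out, the following is a minimal Python sketch of the general linear test statistic, F* = [(SSE(R) - SSE(F)) / (dfR - dfF)] / [SSE(F) / dfF], where the df are error degrees of freedom. The function name and the numbers in the usage line are made up for illustration and do not come from the course data; the error sums of squares would come from your own regression output.

```python
def general_linear_test(sse_reduced, df_reduced, sse_full, df_full):
    """Return F* for comparing a reduced model against a full model.

    sse_* are error sums of squares; df_* are error degrees of freedom.
    """
    if df_reduced <= df_full:
        raise ValueError("the reduced model must have more error df than the full model")
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator


# Hypothetical values, for illustration only (not from knnch06pr18).
f_star = general_linear_test(sse_reduced=120.0, df_reduced=18,
                             sse_full=100.0, df_full=16)
print(f_star)
```

F* is then compared with the critical value of the F distribution with (dfR - dfF, dfF) degrees of freedom at the chosen significance level.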

2.  ALSM5e 6.18 p. 251 [not in ALSM4e].  (Commercial properties.)  Part b, c, f only.  This is to set up the context for the next 3 problems.
A commercial real estate company evaluates vacancy rates, square footage, rental rates, and operating expenses for commercial properties in a large metropolitan area in order to provide clients with quantitative information upon which to make rental decisions.  The data below are taken from 81 suburban commercial properties that are the newest, best located, most attractive, and expensive for five specific geographic areas.  [The variables are] the age (X1), operating expenses and taxes (X2), vacancy rates (X3), total square footage (X4), and rental rates (Y).
b.  Obtain the scatter plot matrix and the correlation matrix.  Interpret these and state your principal findings.
c.  Fit regression model (6.5) for four predictor variables to the data.  State the estimated regression function.
f.  Can you conduct a formal test for lack of fit here?

3.  ALSM5e 7.7 p. 289 [not in ALSM4e].  (Commercial properties; test b3= 0.)  Hint: Do not follow instructions in the text; instead run the regression with X1, X2, X3 and X4 and test whether X3 can be safely dropped from the model using the appropriate t-test.

4.  ALSM5e 7.8 p. 290 [not in ALSM4e].  (Commercial properties; test b2 = 0 & b3 = 0)
Test whether both X2 and X3 can be dropped from the regression model given that X1 and X4 are retained; use alpha = .01.  State the alternatives, decision rule, and conclusion.  What is the P-value of the test?

5.  ALSM5e 7.10 p. 290 [not in ALSM4e].  (Commercial properties; test b1 = -.1 and b2 = .4)
Test whether beta1 = -.1 and beta2 = .4; use alpha = .01.  State the alternatives, full and reduced models, decision rule, and conclusion.
---------

6.  ALSM5e 7.32 p. 292 [ALSM4e 7.47 p. 324].  (Reduced models for various tests)  See Module 8, Section 4.

The next two problems are optional.  The figures for problems 7 and 8 are provided in the linked MS Word file soci209a3p7nd8sup.doc or PDF file soci209a3p7nd8sup.pdf.  You can use these figures in your assignment if you have difficulty accessing SYSTAT or producing the figures with STATA or other software.

7.  (Optional) ALSM5e 8.1 p. 335 [ALSM4e 7.28 p. 320].  (Plotting response surface and contour curves of quadratic model)  For this problem, also plot the response surface directly in 3-dimensional space and rotate the display to see the surface from various angles. You can use SYSTAT to plot the response surface in 3-D, and to represent it with contour curves.
To show the response surface in 3-D, from the command prompt in the interactive window enter the command

fplot y=140 + 4*x1*x1 - 2*x2*x2 + 5*x1*x2
To rotate the surface in space, double-click on the graph.  Then repeatedly click the rotation buttons on the upper-right of the graph window to rotate the surface around a vertical or horizontal axis until you find a perspective that most appeals to you. Then print the graph.
To represent the response surface with contour curves, go back to the interactive window and press F9 to copy the previous command line, then add a semi-colon and the option contour, like this
fplot y=140 + 4*x1*x1 - 2*x2*x2 + 5*x1*x2; contour
The contour plot cannot be rotated in space.
See also Module 6, Section 3 for examples.
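
If you are not using SYSTAT, the same response function can be evaluated on a grid and the grid passed to whatever 3-D or contour plotting routine your software provides. The sketch below uses only the Python standard library and stops at the grid; the function names and the range of x1 and x2 values are arbitrary choices for illustration, not part of the problem.

```python
def quadratic_surface(x1, x2):
    """Response function used in the fplot command above."""
    return 140 + 4 * x1 * x1 - 2 * x2 * x2 + 5 * x1 * x2


def surface_grid(f, x1_values, x2_values):
    """Return one row per x2 value, one column per x1 value."""
    return [[f(x1, x2) for x1 in x1_values] for x2 in x2_values]


# Arbitrary illustrative ranges (an assumption; choose your own).
x1_values = [-2, -1, 0, 1, 2]
x2_values = [-2, -1, 0, 1, 2]

grid = surface_grid(quadratic_surface, x1_values, x2_values)
for x2, row in zip(x2_values, grid):
    print("x2 =", x2, row)
```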

8.  (Optional) ALSM5e 8.9 p. 336 [ALSM4e 7.38 p. 323].  (Plotting response surface and contour curves of model with interaction)
Hint for Part a:  To do the conditional effects plot requested you can use SYSTAT to represent the response surface in 3-D by going to the interactive window and entering

fplot y=25+3*x1+4*x2+1.5*x1*x2; surface=ycut ymin=3 ymax=6
The surface=ycut option will represent the response surface with lines corresponding to fixed values of x2 (which SYSTAT identifies in 3-D graphs as the "y" dimension regardless of the name you gave to the corresponding variable).  Double-click on the graph and rotate the graph clockwise in the horizontal plane until the x2-axis is almost head-on, and the x1-axis is horizontal.  Then your conditional effects plot is given by the blue line closest to you (corresponding to x2=3) and the blue line farthest from you (corresponding to x2=6).  This is equivalent to the conditional effects plots illustrated in ALSM4e Figure 7.10 p. 310.  Describe the nature of the interaction effect, as requested.
Hint for  Part b:  You can use SYSTAT to represent the response surface with contour curves like those used to represent elevation in geographical maps (similar to ALSM4e Figure 7.12 (b) p. 313).  In the interactive window use the command
fplot y = 25 + 3*x1 + 4*x2 + 1.5*x1*x2; ymin=3 ymax=6 contour
Using both methods of representation (Part a and Part b) should give you a feel for the shape of the response surface of a typical interaction model.
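
Without SYSTAT, the two lines of the conditional effects plot in Part a can also be tabulated directly by fixing x2 at 3 and at 6 and letting x1 vary. This is a minimal sketch under the assumption that x1 runs over the values listed in the code (an arbitrary range chosen for illustration); the function names are hypothetical. Plot the resulting points with any software and interpret the plot yourself.

```python
def interaction_surface(x1, x2):
    """Response function used in the fplot commands above."""
    return 25 + 3 * x1 + 4 * x2 + 1.5 * x1 * x2


def conditional_line(f, x2_fixed, x1_values):
    """Return (x1, y) points along the surface with x2 held fixed."""
    return [(x1, f(x1, x2_fixed)) for x1 in x1_values]


# Arbitrary illustrative range for x1 (an assumption).
x1_values = [0, 1, 2, 3, 4, 5]

for x2_fixed in (3, 6):
    print("x2 =", x2_fixed, conditional_line(interaction_surface, x2_fixed, x1_values))
```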

9.  ALSM5e 8.12 p. 337 [ALSM4e 11.1 p. 490].  (Why X'X singular in model with indicators?)

10.  ALSM5e 8.16 p. 337 [ALSM4e 11.5 p. 490]. (Grade point average; one indicator.)  Part a, b, and c only.  Use file knnch08pr16 posted on course site.

11.  ALSM5e 8.20 p. 338 [ALSM4e 11.9 p. 491].  (Grade point average; interaction with indicator.)  Use file knnch08pr16 posted on course site.

12.  ALSM5e 8.23 p. 339 [ALSM4e 11.13 p. 492].  (Interpretation of seasonal indicators.)

13.  ALSM5e 8.24 p. 339 [not in ALSM4e].  (Corner lot location and assessed valuations.)
Assessed valuations.  A tax consultant studied the current relation between selling price and assessed valuation of one-family residential dwellings in a large tax district by obtaining data for a random sample of 16 "arm's length" sales transactions of one-family dwellings located on corner lots and for a random sample of 48 recent sales of one-family dwellings not located on corner lots.  In the data that follow, both selling price (Y) and assessed valuation (X1) are expressed in thousand dollars, whereas the location (X2) is coded 1 for corner lots.  (Beginning of data file: Xi1 76.4 Xi2 0 Yi 78.8 ....)  Assume the error variances in the two populations are equal and that regression model (8.49) is appropriate.
a.  Plot the sample data for the two populations as a symbolic scatter plot.  Does the regression relation appear to be the same for the two populations?
b.  Test for identity of the regression functions for dwellings on corner lots and dwellings in other locations; control the risk of Type I error at .05.  State the alternatives, decision rule, and conclusion.
c.  Plot the estimated regression functions for the two populations and describe the nature of the differences between them.

14.  ALSM5e 8.27 p. 339 [ALSM4e 11.19 p. 494].  (# of older siblings; quantitative variable vs. indicators.)

15.  ALSM5e 8.39 p. 341 [not in ALSM4e].  (CDI county data; model of # of active physicians.)  The raw CDI data are in the file APPENC02.DAT on the diskette that comes with the textbook; you can also use file knnappenc02 posted on the course site.  The data set is described in ALSM5e Appendix C.2.  You will need to create indicators for the regions on the basis of the region variable.  For Part b, instead of calculating a confidence interval, test directly the hypothesis that b3 = b4.
The number of active physicians (Y) is to be regressed against total population (X1), total personal income (X2), and geographic region (X3, X4, X5).
a.  Fit a first-order regression model.  Let X3 = 1 if NE and 0 otherwise, X4 = 1 if NC and 0 otherwise, and X5 = 1 if S and 0 otherwise (thus the omitted category is W).  Geographic Region is the last variable in the data set.
b.  Examine whether the effect for the northeastern region on number of active physicians differs from the effect for the northcentral region by constructing an appropriate 90 percent confidence interval.  Interpret your interval estimate.  (See note above about testing the hypothesis directly instead.)
c.  Test whether any geographic effects are present; use alpha=.10.  State the alternatives, decision rule, and conclusion.  What is the P-value of the test?
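
The indicator coding described in Part a can be sketched as follows in Python. The sketch assumes the region variable is coded 1 = NE, 2 = NC, 3 = S, 4 = W; check this against the description in ALSM5e Appendix C.2 before relying on it. The function and constant names are hypothetical.

```python
# Assumed coding of the region variable; verify against Appendix C.2.
REGION_CODES = {1: "NE", 2: "NC", 3: "S", 4: "W"}


def region_indicators(region):
    """Return (X3, X4, X5) for one county; W is the omitted category."""
    if region not in REGION_CODES:
        raise ValueError("unexpected region code: %r" % (region,))
    x3 = 1 if region == 1 else 0  # NE
    x4 = 1 if region == 2 else 0  # NC
    x5 = 1 if region == 3 else 0  # S
    return x3, x4, x5


for code in sorted(REGION_CODES):
    print(REGION_CODES[code], region_indicators(code))
```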

Note on Problem 15: this problem does not work as stated because X3 (NE region indicator) is collinear with other variables in the model. You have two options:

1. Do the problem after replacing X1 (total population) by its logarithm, i.e.
a. in STATA you would create a variable, say l10x1, as

generate l10x1=log10(x1)
b. in SYSTAT you would create a variable, say l10x1, as
let l10x1=l10(x1)
The regression called for in the problem with the transformed variable works as intended.

2. Alternatively, do ALSM5e 8.41 p. 342 [ALSM4e 11.27 p. 496] (using the SENIC data set, posted on the course site as knnappenc01.dta) instead of 8.39 p. 341. For Part c, do not use the Bonferroni procedure; instead test directly the hypothesis that the coefficients of all 3 regional indicators are the same.
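
For option 1, if you work in a program other than STATA or SYSTAT, the same base-10 logarithm transformation might look like this in Python. This is only a sketch: the function name is hypothetical and the population values are made up for illustration, not taken from the CDI data.

```python
import math


def log10_column(values):
    """Return the base-10 logarithm of each (strictly positive) value."""
    return [math.log10(v) for v in values]


# Made-up total population figures, for illustration only.
x1 = [1000, 250000, 8863164]
l10x1 = log10_column(x1)
print(l10x1)
```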



Last modified 3 Apr  2006