SOCI 209 - LINEAR REGRESSION MODELS
- Spring 2006
Professor François Nielsen
Assignment 3 - Released Tue 21 Mar
DUE Tue 4 Apr
ALSM5e = Applied Linear Statistical Models 5e (2004) OR Applied Linear Regression Models 4e (2004) (new editions). Some problems are in ALSM5e only.
ALSM4e = Applied Linear Statistical Models 4e (1996) OR Applied Linear Regression Models 3e (1996) (old editions).
General note: When ALSM5e/ALSM4e use a phrase like "test whether variable Z can be dropped from the model", they mean "test the significance of the coefficient of variable Z" (since Z can be safely removed from the model if its coefficient is non-significant).
1. (This was problem 7.27 p. 269 in a previous edition of the text; it is not in ALSM5e or ALSM4e.) (Fitting a regression model with a known coefficient.) (Hint: The answer is very short; you have to find a "trick".)
An analyst wanted to fit the regression model
    Yi = b0 + b1 Xi1 + b2 Xi2 + b3 Xi3 + ei,   i = 1,...,n
by the method of least squares when it is known that b2 = 4. How can the analyst obtain the desired fit using a multiple regression computer program?
For the next 4 problems you can use any of 3 approaches: (1) the general method of comparing full and reduced models as taught in class; (2) the equivalent method of extra sums of squares (explained in ALSM5e Sections 7.1 to 7.3, pp. 256-268 [ALSM4e Sections 7.1 to 7.3, pp. 260-274] but not discussed in class); or (3) the test command in STATA or the hypothesis command of SYSTAT (explained in Module 8, Section 5), or equivalent commands in other statistical programs. Use file knnch06pr18 posted on the course site.
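As an illustration of approach (1), the sketch below computes the full-versus-reduced F statistic on made-up data. It is only a minimal sketch: the data are simulated, not the commercial properties file, and the helper names fit_sse and general_linear_f are hypothetical. The statistic is F* = [(SSE(R) - SSE(F)) / (df_R - df_F)] / [SSE(F) / df_F], to be compared with the appropriate critical value of the F distribution.

```python
import random

def fit_sse(rows, y):
    """Least squares fit of y on the columns of rows; returns the error sum of squares.

    Each row must already contain a leading 1.0 for the intercept.
    """
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda k: abs(a[k][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for k in range(p):
            if k != c:
                factor = a[k][c] / a[c][c]
                a[k] = [u - factor * v for u, v in zip(a[k], a[c])]
    coef = [a[i][p] / a[i][i] for i in range(p)]
    return sum((yi - sum(b * x for b, x in zip(coef, r))) ** 2
               for r, yi in zip(rows, y))

def general_linear_f(sse_full, df_full, sse_reduced, df_reduced):
    """F* = [(SSE(R) - SSE(F)) / (df_R - df_F)] / [SSE(F) / df_F]."""
    return ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)

# Made-up data (an assumption for illustration): y depends on x1 only,
# while x2 and x3 are pure noise predictors.
random.seed(0)
n = 40
x = [[random.gauss(0, 1) for _ in range(3)] for _ in range(n)]
y = [1.0 + 2.0 * xi[0] + random.gauss(0, 1) for xi in x]

full = [[1.0, xi[0], xi[1], xi[2]] for xi in x]   # intercept, x1, x2, x3
reduced = [[1.0, xi[0]] for xi in x]              # intercept, x1 (x2 and x3 dropped)

sse_f, df_f = fit_sse(full, y), n - 4
sse_r, df_r = fit_sse(reduced, y), n - 2
f_star = general_linear_f(sse_f, df_f, sse_r, df_r)
print(f"F* = {f_star:.3f} on ({df_r - df_f}, {df_f}) degrees of freedom")
```

The same two functions apply to any pair of nested models: only the columns kept in the reduced design and the two degrees-of-freedom values change.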
2. ALSM5e 6.18 p. 251 [not in ALSM4e].
(Commercial properties.) Parts b, c, and f only. This is to set
up the context for the next 3 problems.
A commercial real estate company evaluates
vacancy rates, square footage, rental rates, and operating expenses for
commercial properties in a large metropolitan area in order to provide
clients with quantitative information upon which to make rental decisions.
The data below are taken from 81 suburban commercial properties that are
the newest, best located, most attractive, and expensive for five specific
geographic areas. [The variables are] the age (X1), operating
expenses and taxes (X2), vacancy rates (X3), total square footage (X4),
and rental rates (Y).
b. Obtain the scatter plot matrix and
the correlation matrix. Interpret these and state your principal
findings.
c. Fit regression model (6.5) for four
predictor variables to the data. State the estimated regression function.
f. Can you conduct a formal test for
lack of fit here?
3. ALSM5e 7.7 p. 289 [not in ALSM4e]. (Commercial properties; test b3 = 0.) Hint: Do not follow instructions in the text; instead run the regression with X1, X2, X3 and X4 and test whether X3 can be safely dropped from the model using the appropriate t-test.
4. ALSM5e 7.8 p. 290 [not in ALSM4e]. (Commercial properties; test b2 = 0 & b3 = 0)
Test whether both X2 and X3 can be dropped
from the regression model given that X1 and X4 are retained; use alpha=.01.
State the alternatives, decision rule, and conclusion. What is the
P-value of the test?
5. ALSM5e 7.10 p. 290 [not in ALSM4e]. (Commercial properties; test b1 = -.1 and b2 = .4)
Test whether beta1 = -.1 and beta2 = .4;
use alpha=.01. State the alternatives, full and reduced models, decision
rule, and conclusion.
---------
6. ALSM5e 7.32 p. 292 [ALSM4e 7.47 p. 324]. (Reduced models for various tests) See Module 8, Section 4.
The next two problems are optional. The figures for problems 7 and 8 are provided in the linked MS Word file soci209a3p7nd8sup.doc or PDF file soci209a3p7nd8sup.pdf. You can use these figures in your assignment if you have difficulty accessing SYSTAT or producing the figures with STATA or other software.
7. (Optional) ALSM5e 8.1 p. 335 [ALSM4e
7.28 p. 320]. (Plotting response surface and contour curves of quadratic
model) For this problem, also plot the response surface directly
in 3-dimensional space and rotate the display to see the surface from various
angles. You can use SYSTAT to plot the response surface in 3-D, and to
represent it with contour curves.
To show the response surface in 3-D, from the command prompt in the interactive window enter the command
    fplot y=140 + 4*x1*x1 - 2*x2*x2 + 5*x1*x2
To rotate the surface in space, double-click on the graph. Then repeatedly click the rotation buttons on the upper-right of the graph window to rotate the surface around a vertical or horizontal axis until you find a perspective that most appeals to you. Then print the graph.
To represent the surface with contour curves, enter the command
    fplot y=140 + 4*x1*x1 - 2*x2*x2 + 5*x1*x2; contour
The contour plot cannot be rotated in space.
8. (Optional) ALSM5e 8.9 p. 336 [ALSM4e
7.38 p. 323]. (Plotting response surface and contour curves of model
with interaction)
Hint for Part a: To do the conditional
effects plot requested you can use SYSTAT to represent the response surface
in 3-D by going to the interactive window and entering
    fplot y=25+3*x1+4*x2+1.5*x1*x2; surface=ycut ymin=3 ymax=6
The surface=ycut option will represent the response surface with lines corresponding to fixed values of x2 (which SYSTAT identifies in 3-D graphs as the "y" dimension regardless of the name you gave to the corresponding variable). Double-click on the graph and rotate the graph clockwise in the horizontal plane until the x2-axis is almost head-on, and the x1-axis is horizontal. Then your conditional effects plot is given by the blue line closest to you (corresponding to x2=3) and the blue line farthest from you (corresponding to x2=6). This is equivalent to the conditional effects plots illustrated in ALSM4e Figure 7.10 p. 310. Describe the nature of the interaction effect, as requested.
For Part b, the contour curves can be plotted with
    fplot y = 25 + 3*x1 + 4*x2 + 1.5*x1*x2; ymin=3 ymax=6 contour
Using both methods of representation (Part a and Part b) should give you a feel for the shape of the response surface of a typical interaction model.
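If neither SYSTAT nor STATA is at hand, the two lines of the conditional effects plot can also be tabulated numerically. The sketch below evaluates the response function from the fplot commands above along x2=3 and x2=6; the x1 grid and the function names are assumptions chosen only for illustration, and the interpretation of the two lines is left to you.

```python
def mean_response(x1, x2):
    """Response function from the hint: 25 + 3*x1 + 4*x2 + 1.5*x1*x2."""
    return 25 + 3 * x1 + 4 * x2 + 1.5 * x1 * x2

def conditional_line(x2, x1_values):
    """Points (x1, mean response) along the surface with x2 held fixed."""
    return [(x1, mean_response(x1, x2)) for x1 in x1_values]

# Hypothetical x1 grid; replace it with the range that is relevant for the problem.
x1_grid = [0, 1, 2, 3, 4, 5]

for x2 in (3, 6):
    print(f"x2 = {x2}")
    for x1, value in conditional_line(x2, x1_grid):
        print(f"  x1 = {x1}: mean response = {value}")
```

Plotting each list of points against x1 gives the same two lines that the surface=ycut display shows for x2=3 and x2=6.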
9. ALSM5e 8.12 p. 337 [ALSM4e 11.1 p. 490]. (Why X'X singular in model with indicators?)
10. ALSM5e 8.16 p. 337 [ALSM4e 11.5 p. 490]. (Grade point average; one indicator.) Parts a, b, and c only. Use file knnch08pr16 posted on course site.
11. ALSM5e 8.20 p. 338 [ALSM4e 11.9 p. 491]. (Grade point average; interaction with indicator.) Use file knnch08pr16 posted on course site.
12. ALSM5e 8.23 p. 339 [ALSM4e 11.13 p. 492]. (Interpretation of seasonal indicators.)
13. ALSM5e 8.24 p. 339 [not in ALSM4e].
(Corner lot location and assessed valuations.)
Assessed valuations. A tax consultant
studied the current relation between selling price and assessed valuation
of one-family residential dwellings in a large tax district by obtaining
data for a random sample of 16 "arm's length" sales transactions of one-family
dwellings located on corner lots and for a random sample of 48 recent sales
of one-family dwellings not located on corner lots. In the data that
follow, both selling price (Y) and assessed valuation (X1) are expressed
in thousand dollars, whereas the location (X2) is coded 1 for corner lots.
(Beginning of data file: Xi1 76.4 Xi2 0 Yi 78.8 ....) Assume the
error variances in the two populations are equal and that regression model
(8.49) is appropriate.
a. Plot the sample data for the two
populations as a symbolic scatter plot. Does the regression relation
appear to be the same for the two populations?
b. Test for identity of the regression
functions for dwellings on corner lots and dwellings in other locations;
control the risk of Type I error at .05. State the alternatives,
decision rule, and conclusion.
c. Plot the estimated regression functions
for the two populations and describe the nature of the differences between
them.
14. ALSM5e 8.27 p. 339 [ALSM4e 11.19 p. 494]. (# of older siblings; quantitative variable vs. indicators.)
15. ALSM5e 8.39 p. 341 [not in ALSM4e].
(CDI county data; model of # of active physicians.) The raw CDI data
are in the file APPENC02.DAT on the diskette that comes with the textbook;
you can also use file knnappenc02 posted on the course site.
The data set is described in ALSM5e Appendix C.2. You will need to
create indicators for the regions on the basis of the region variable.
For Part b, instead of calculating a confidence interval, test directly the hypothesis that b3 = b4.
The number of active physicians (Y) is to
be regressed against total population (X1), total personal income (X2),
and geographic region (X3, X4, X5).
a. Fit a first-order regression model.
Let X3 = 1 if NE and 0 otherwise, X4 = 1 if NC and 0 otherwise, and X5
= 1 if S and 0 otherwise (thus the omitted category is W). Geographic
Region is the last variable in the data set.
b. Examine whether the effect for the
northeastern region on number of active physicians differs from the effect
for the northcentral region by constructing an appropriate 90 percent confidence
interval. Interpret your interval estimate. (See note above
about testing the hypothesis directly instead.)
c. Test whether any geographic effects
are present; use alpha=.10. State the alternatives, decision rule,
and conclusion. What is the P-value of the test?
Note on Problem 15: this problem does not work as stated because X3 (NE region indicator) is collinear with other variables in the model. You have two options:
1. Do the problem after replacing X1 (total population) by its logarithm, i.e.
a. in STATA you would create a variable, say l10x1, as
    generate l10x1=log10(x1)
b. in SYSTAT you would create a variable, say l10x1, as
    let l10x1=l10(x1)
The regression called for in the problem with the transformed variable works as intended.
2. Alternatively, do ALSM5e 8.41 p. 342 [ALSM4e
11.27 p. 496] (using the SENIC data set, posted on the course site as knnappenc01.dta)
instead of 8.39 p. 341. For Part c, do not use the Bonferroni procedure;
instead test directly the hypothesis that the coefficients of all 3 regional
indicators are the same.
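For option 1 of the note on Problem 15, the data preparation (regional indicators plus the base-10 log of total population) can be sketched outside STATA or SYSTAT as follows. The numeric region coding used here (1=NE, 2=NC, 3=S, 4=W) is an assumption that should be checked against ALSM5e Appendix C.2, and the function names and sample rows are hypothetical.

```python
import math

# Assumed coding of the region variable; verify against Appendix C.2.
NE, NC, S, W = 1, 2, 3, 4

def region_indicators(region):
    """Return (X3, X4, X5): indicators for NE, NC and S; W is the omitted category."""
    return tuple(1 if region == code else 0 for code in (NE, NC, S))

def log10_population(total_population):
    """Base-10 log of total population, the analogue of l10x1 in the note."""
    return math.log10(total_population)

# Hypothetical rows (total population, region code), not taken from the CDI file.
sample = [(1_000_000, NE), (250_000, S), (80_000, W)]
for population, region in sample:
    x3, x4, x5 = region_indicators(region)
    print(f"l10x1 = {log10_population(population):.4f}, X3 = {x3}, X4 = {x4}, X5 = {x5}")
```

A county in the West gets zeros on all three indicators, which is what makes W the reference category in the regression.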