Chapter 15

Multiple Regression and Model Building

Copyright ©2018 McGraw-Hill Education. All rights reserved.

Chapter Outline

15.1 The Multiple Regression Model and the Least Squares Point Estimate

15.2 R² and Adjusted R²

15.3 Model Assumptions and the Standard Error

15.4 The Overall F Test

15.5 Testing the Significance of an Independent Variable

15.6 Confidence and Prediction Intervals

Chapter Outline Continued

15.7 The Sales Representative Case: Evaluating Employee Performance

15.8 Using Dummy Variables to Model Qualitative Independent Variables (Optional)

15.9 Using Squared and Interaction Variables (Optional)

15.10 Multicollinearity, Model Building and Model Validation (Optional)

15.11 Residual Analysis and Outlier Detection in Multiple Regression (Optional)

15.1 The Multiple Regression Model and the Least Squares Point Estimate

Simple linear regression used one independent variable to explain the dependent variable

Some relationships are too complex to be described using a single independent variable

Multiple regression uses two or more independent variables to describe the dependent variable

This allows multiple regression models to handle more complex situations

There is no limit to the number of independent variables a model can use

Multiple regression has only one dependent variable

LO15-1: Explain the multiple regression model and the related least squares point estimates.

The Multiple Regression Model

The linear regression model relating y to x1, x2,…, xk is y = β0 + β1x1 + β2x2 + … + βkxk + ε

µy = β0 + β1x1 + β2x2 + … + βkxk is the mean value of the dependent variable y when the values of the independent variables are x1, x2,…, xk

β0, β1, β2,…, βk are the unknown regression parameters relating the mean value of y to x1, x2,…, xk

ε is an error term that describes the effects on y of all factors other than the independent variables x1, x2,…, xk

The Least Squares Estimates and Point Estimation and Prediction

Estimation/prediction equation

ŷ = b0 + b1x1 + b2x2 + … + bkxk

is the point estimate of the mean value of the dependent variable when the values of the independent variables are x1, x2,…, xk

It is also the point prediction of an individual value of the dependent variable when the values of the independent variables are x1, x2,…, xk

b0, b1, b2,…, bk are the least squares point estimates of the parameters β0, β1, β2,…, βk

Here x1, x2,…, xk denote specified values of the independent (predictor) variables
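
As a minimal sketch of how the least squares point estimates b0, b1,…, bk can be obtained, the following plain-Python code solves the normal equations (X'X)b = X'y. The function names and the small data set are hypothetical. The data are generated exactly from y = 1 + 2x1 + 3x2, so the fit should recover values close to 1, 2 and 3. In practice statistical software performs this step.

```python
def solve_linear_system(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bk] for observations rows = [(x1, ..., xk), ...]."""
    X = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept b0
    p = len(X[0])
    XtX = [[sum(xi[a] * xi[b] for xi in X) for b in range(p)] for a in range(p)]
    Xty = [sum(xi[a] * yi for xi, yi in zip(X, y)) for a in range(p)]
    return solve_linear_system(XtX, Xty)


def predict(b, x):
    """Point estimate / point prediction: y_hat = b0 + b1*x1 + ... + bk*xk."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (no error term)
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [9, 8, 19, 18, 29]

b = fit_least_squares(rows, y)      # should be close to [1, 2, 3]
y_hat_new = predict(b, (2, 2))      # should be close to 11
print(b, y_hat_new)
```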

Example 15.1 The Tasty Sub Shop Case

Figure 15.4 (a)

15.2 R² and Adjusted R²

Total variation is given by the formula Σ(yi − ȳ)²

Explained variation is given by the formula Σ(ŷi − ȳ)²

Unexplained variation is given by the formula Σ(yi − ŷi)²

Total variation is the sum of explained and unexplained variation

LO15-2: Calculate and interpret the multiple and adjusted multiple coefficients of determination.

R² and Adjusted R² Continued

The multiple coefficient of determination is the ratio of explained variation to total variation

R² is the proportion of the total variation that is explained by the overall regression model

The multiple correlation coefficient R is the square root of R²
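
The three variations and R² can be sketched in a few lines of Python. The function name and data are hypothetical. The fitted values are assumed to come from a least squares fit with an intercept (here ŷ = 1.3 + 0.8x at x = 0, 1, 2, 3, a one-variable fit chosen only so the arithmetic is easy to check by hand). The identity total = explained + unexplained relies on that assumption.

```python
import math


def variation_summary(y, y_hat):
    """Return total, explained, unexplained variation, R^2 and R."""
    y_bar = sum(y) / len(y)
    total = sum((yi - y_bar) ** 2 for yi in y)
    explained = sum((fi - y_bar) ** 2 for fi in y_hat)
    unexplained = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    r2 = explained / total
    return total, explained, unexplained, r2, math.sqrt(r2)


y = [1.0, 3.0, 2.0, 4.0]            # hypothetical observed values
y_hat = [1.3, 2.1, 2.9, 3.7]        # assumed least squares fitted values

total, explained, unexplained, r2, r = variation_summary(y, y_hat)
print(total, explained, unexplained, r2, r)
```

For these numbers the total variation is 5, the explained variation 3.2 and the unexplained variation 1.8 (up to floating-point rounding), so R² is about 0.64 and R about 0.8.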

Multiple Correlation Coefficient R

The multiple correlation coefficient R is just the square root of R²

With simple linear regression, r would take on the sign of b1

There are multiple bi’s with multiple regression

For this reason, R is always positive

To interpret the direction of the relationship between the x’s and y, you must look to the sign of the appropriate bi coefficient

Adjusted R²

Adding an independent variable to a multiple regression model will raise R²

R² will rise slightly even if the new variable has no relationship to y

Adjusted R² corrects this tendency in R²

As a result, it gives a better estimate of the importance of the independent variables
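
A small sketch of the correction, assuming the usual definition adjusted R² = (R² − k/(n − 1)) × ((n − 1)/(n − (k + 1))), which the slide itself does not show. The function name and the R² values are hypothetical. They illustrate a case where a third variable nudges R² up but adjusted R² goes down.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    return (r2 - k / (n - 1)) * ((n - 1) / (n - (k + 1)))


# Hypothetical: adding a third variable raises R^2 only from 0.900 to 0.901
two_vars = adjusted_r2(0.900, n=20, k=2)
three_vars = adjusted_r2(0.901, n=20, k=3)
print(two_vars, three_vars)   # the second value is smaller than the first
```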

15.3 Model Assumptions and the Standard Error

The model is y = β0 + β1x1 + β2x2 + … + βkxk + ε

Assumptions are stated about the model error terms, the ε’s

Mean of Zero Assumption: The mean of the error terms is equal to 0

Constant Variance Assumption: The variance of the error terms, σ², is the same for every combination of values of x1, x2,…, xk

Normality Assumption: The error terms follow a normal distribution for every combination of values of x1, x2,…, xk

Independence Assumption: The values of the error terms are statistically independent of each other

LO15-3: Explain the assumptions behind multiple regression and calculate the standard error.

The Mean Square Error and the Standard Error

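Sum of squared errors: SSE = Σei² = Σ(yi − ŷi)²

Mean square error: s² = SSE / (n − (k + 1))

This is the point estimate of the residual variance σ²

Standard error: s = √(SSE / (n − (k + 1)))

This is the point estimate of the residual standard deviation σ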
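
A minimal sketch of these two quantities, assuming the usual definitions s² = SSE/(n − (k + 1)) and s = √s². The function names and the input values are hypothetical.

```python
import math


def mean_square_error(sse, n, k):
    """s^2: SSE divided by its n - (k + 1) degrees of freedom."""
    return sse / (n - (k + 1))


def standard_error(sse, n, k):
    """s: square root of the mean square error."""
    return math.sqrt(mean_square_error(sse, n, k))


# Hypothetical values: SSE = 1.8 from n = 4 observations and k = 1 variable
s2 = mean_square_error(sse=1.8, n=4, k=1)
s = standard_error(sse=1.8, n=4, k=1)
print(s2, s)
```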

15.4 The Overall F Test

To test

H0: β1 = β2 = … = βk = 0 versus

Ha: At least one of β1, β2,…, βk ≠ 0

Test statistic: F(model) = (Explained variation / k) / (Unexplained variation / (n − (k + 1)))

Reject H0 in favor of Ha if F(model) > Fα or

p-value < α
Fα is based on k numerator and n − (k + 1) denominator degrees of freedom
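The overall F statistic can be sketched as follows, assuming the usual form F(model) = (explained variation / k) / (unexplained variation / (n − (k + 1))). The function name and the variation values are hypothetical. The critical value 2.87 is F.05 for 4 and 20 degrees of freedom as read from an F table, since the Python standard library has no F distribution.

```python
def overall_f(explained, unexplained, n, k):
    """F(model) with k numerator and n - (k + 1) denominator df."""
    return (explained / k) / (unexplained / (n - (k + 1)))


# Hypothetical model: n = 25 observations, k = 4 independent variables
f_model = overall_f(explained=80.0, unexplained=20.0, n=25, k=4)

f_crit = 2.87                      # F_.05 with 4 and 20 df, from an F table
reject_h0 = f_model > f_crit       # True here: at least one beta_j appears nonzero
print(f_model, reject_h0)
```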
LO15-4: Test the significance of a multiple regression model by using an F test.
15.5 Testing the Significance of an Independent Variable
A variable in a multiple regression model is not likely to be useful unless there is a significant relationship between it and y
To test significance, we use the null hypothesis H0: βj = 0
Versus the alternative hypothesis
Ha: βj ≠ 0
LO15-5: Test the significance of a single independent variable.
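Testing the Significance of the Independent Variable xj
Test statistic: t = bj / sbj, where sbj is the standard error of the estimate bj
Reject H0: βj = 0 in favor of Ha: βj ≠ 0 if |t| > tα/2 or p-value < α
tα/2 is based on n − (k + 1) degrees of freedom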
Testing the Significance of an Independent Variable Continued
It is customary to test the significance of every independent variable
If we can reject H0: βj = 0 at α = 0.05, we have strong evidence the independent variable xj is significantly related to y
If we can reject H0: βj = 0 at α = 0.01, we have very strong evidence the independent variable xj is significantly related to y
The smaller the significance level at which H0 can be rejected, the stronger the evidence that xj is significantly related to y
A Confidence Interval for the Regression Parameter βj
If the regression assumptions hold, a 100(1 − α) percent confidence interval for βj is [bj ± tα/2 sbj], where sbj is the standard error of the estimate bj
tα/2 is based on n − (k + 1) degrees of freedom
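As a sketch of the t test and the interval for a single coefficient, assuming the usual test statistic t = bj / sbj (with sbj the standard error of bj) and the rule "reject H0 if |t| > tα/2". The function names and the values of bj and sbj are hypothetical. The critical value 2.101 is t.025 for 18 degrees of freedom, taken from a t table.

```python
def t_statistic(b_j, s_bj):
    """t = b_j / s_bj for testing H0: beta_j = 0."""
    return b_j / s_bj


def coefficient_interval(b_j, s_bj, t_crit):
    """[b_j - t * s_bj, b_j + t * s_bj]."""
    half_width = t_crit * s_bj
    return (b_j - half_width, b_j + half_width)


# Hypothetical estimate and standard error; n - (k + 1) = 18 df assumed
b_j, s_bj = 2.0, 0.5
t_crit = 2.101                      # t_.025 with 18 df, from a t table

t = t_statistic(b_j, s_bj)
significant = abs(t) > t_crit       # two-sided test at alpha = 0.05
low, high = coefficient_interval(b_j, s_bj, t_crit)
print(t, significant, (low, high))
```

The interval excludes 0 exactly when the two-sided test rejects H0 at the same α, which is why the two results agree here.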
15.6 Confidence and Prediction Intervals
The point on the regression line corresponding to particular values x1, x2,…, xk of the independent variables is ŷ = b0 + b1x1 + b2x2 + … + bkxk
It is unlikely that this value will equal the mean value of y for these x values
Therefore, we need to place bounds on how far away the predicted value might be
We can do this by calculating a confidence interval for the mean value of y and a prediction interval for an individual value of y
LO15-6: Find and interpret a confidence interval for a mean value and a prediction interval for an individual value.
Distance Value
Both the confidence interval for the mean value of y and the prediction interval for an individual value of y employ a quantity called the distance value
With simple regression, we were able to calculate the distance value fairly easily
However, for multiple regression, calculating the distance value requires matrix algebra
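The matrix calculation can be sketched as follows, assuming the standard form distance value = x0'(X'X)⁻¹x0, where x0 = (1, x1,…, xk) holds the specified values. The slide does not state this formula, and the function names and data are hypothetical. The check uses a single independent variable so the result can be compared with the simple regression formula 1/n + (x0 − x̄)²/SSxx.

```python
def solve_linear_system(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def distance_value(rows, x0):
    """x0' (X'X)^-1 x0 for observed rows and specified values x0."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    XtX = [[sum(xi[a] * xi[b] for xi in X) for b in range(p)] for a in range(p)]
    v = [1.0] + list(x0)
    w = solve_linear_system(XtX, v)          # w = (X'X)^-1 v, no explicit inverse
    return sum(vi * wi for vi, wi in zip(v, w))


# Hypothetical one-variable data: x = 0, 1, 2, 3 (x-bar = 1.5, SSxx = 5)
rows = [(0,), (1,), (2,), (3,)]
dv_center = distance_value(rows, (1.5,))     # 1/4 + 0/5
dv_edge = distance_value(rows, (3,))         # 1/4 + 2.25/5
print(dv_center, dv_edge)
```

The distance value grows as the specified values move away from the center of the observed data, which is what widens the intervals there.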
A Confidence Interval and a Prediction Interval
Using the distance value, and assuming the regression assumptions hold:
Confidence interval for the mean value of y: [ŷ ± tα/2 s √(distance value)]
Prediction interval for an individual value of y: [ŷ ± tα/2 s √(1 + distance value)]
tα/2 is based on n − (k + 1) degrees of freedom
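A short sketch of the two intervals, assuming the usual forms ŷ ± tα/2 · s · √(distance value) for the mean and ŷ ± tα/2 · s · √(1 + distance value) for an individual value. The function names and all input numbers are hypothetical, and the critical t value would normally come from a t table or software.

```python
import math


def mean_confidence_interval(y_hat, s, distance_value, t_crit):
    """Interval for the mean value of y at the specified x values."""
    half_width = t_crit * s * math.sqrt(distance_value)
    return (y_hat - half_width, y_hat + half_width)


def prediction_interval(y_hat, s, distance_value, t_crit):
    """Interval for an individual value of y at the specified x values."""
    half_width = t_crit * s * math.sqrt(1 + distance_value)
    return (y_hat - half_width, y_hat + half_width)


# Hypothetical inputs: y_hat = 10, s = 2, distance value = 0.25, t = 2.0
conf = mean_confidence_interval(10.0, 2.0, 0.25, 2.0)
pred = prediction_interval(10.0, 2.0, 0.25, 2.0)
print(conf, pred)
```

The prediction interval is always the wider of the two, because it also has to allow for the error term of the individual observation.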
15.7 The Sales Representative Case: Evaluating Employee Performance
y: Yearly sales of the company’s product
x1: Number of months the representative has been employed
x2: Sales of products in the sales territory
x3: Dollar advertising expenditure in the territory
x4: Weighted average of the company’s market share in the territory for the previous four years
x5: Change in the company’s market share in the territory over the previous four years
Partial Excel Output of a Regression Analysis of the Sales Territory Performance Data
Figure 15.10a
For Time = 85.42, MktPoten = 35,182.73, Adver = 7,281.65, MktShare = 9.64 and Change = 0.28: predicted Sales = 4,181.74, with 95% prediction interval [3,233.59, 5,129.89]
15.8 Using Dummy Variables to Model Qualitative Independent Variables (Optional)
So far, we have only looked at including quantitative data in a regression model
However, we may wish to include descriptive qualitative data as well
For example, we might want to include the gender of respondents
We can model the effects of different levels of a qualitative variable by using what are called dummy variables
Also known as indicator variables
LO15-7: Use dummy variables to model qualitative independent variables (Optional).
Constructing Dummy Variables
A dummy variable always has a value of either 0 or 1
For example, to model sales at two locations, we could code the first location as 0 and the second as 1
Operationally, it does not matter which is coded 0 and which is coded 1
What If We Have More Than Two Categories?
Consider having three categories, say A, B and C
We cannot code this using one dummy variable
Coding A = 0, B = 1 and C = 2 would be invalid
It assumes the difference between A and B is the same as the difference between B and C
We must use multiple dummy variables
Specifically, k categories require k − 1 dummy variables
What If We Have Three Categories?
For A, B, and C, we would need two dummy variables
x1 is 1 for A, zero otherwise
x2 is 1 for B, zero otherwise
If x1 and x2 are both zero, the category must be C
This is why a third dummy variable is not needed
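The coding scheme above can be sketched as a small helper. The function name is hypothetical, and treating the last listed category as the baseline (all zeros) is an assumption that matches the A, B, C example.

```python
def dummy_code(category, categories):
    """Return k - 1 dummy values; the last category is the all-zero baseline."""
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    return [1 if category == c else 0 for c in categories[:-1]]


levels = ["A", "B", "C"]
coded = {c: dummy_code(c, levels) for c in levels}
print(coded)   # {'A': [1, 0], 'B': [0, 1], 'C': [0, 0]}
```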
Interaction Models
So far, we have only considered dummy variables as stand-alone variables
The model so far is y = β0 + β1x + β2DM + ε
where DM is a dummy variable
However, we can also look at the interaction between the dummy variable and other variables
That model would take the form
y = β0 + β1x + β2DM + β3xDM + ε
With an interaction term, both the intercept and slope are shifted
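To see why both the intercept and the slope shift, substitute DM = 0 and DM = 1 into the interaction model: the line relating mean y to x has intercept β0 + β2·DM and slope β1 + β3·DM. A minimal sketch with hypothetical coefficient estimates (the function name and numbers are not from the slides):

```python
def group_line(b0, b1, b2, b3, dm):
    """Intercept and slope of the fitted line in x when the dummy equals dm."""
    return (b0 + b2 * dm, b1 + b3 * dm)


# Hypothetical estimates for y = b0 + b1*x + b2*DM + b3*x*DM
b0, b1, b2, b3 = 10.0, 2.0, 5.0, 0.5

line_dm0 = group_line(b0, b1, b2, b3, dm=0)   # (10.0, 2.0)
line_dm1 = group_line(b0, b1, b2, b3, dm=1)   # (15.0, 2.5)
print(line_dm0, line_dm1)
```

Without the interaction term (b3 = 0) only the intercept would differ between the two groups, and the lines would be parallel.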
15.9 Using Squared and Interaction Variables (Optional)
The quadratic regression model is:
y = β0 + β1x + β2x² + ε
where
β0 + β1x + β2x² is μy
β0, β1, and β2 are the regression parameters
ε is an error term
LO15-8: Use squared and interaction variables.
Using Interaction Variables
Regression models often contain interaction variables
Formed by multiplying two independent variables together
Consider a model where x3 and x4 interact and x3 also enters as a quadratic term
y = β0 + β1x4 + β2x3 + β3x3² + β4x4x3 + ε
15.10 Multicollinearity, Model Building, and Model Validation (Optional)
Multicollinearity: when “independent” variables are related to one another
Considered severe when the simple correlation between two independent variables exceeds 0.9
Even moderate multicollinearity can be a problem
Another measure is the variance inflation factor (VIF)
Multicollinearity is considered severe when VIF > 10
Multicollinearity is considered moderately strong when VIF > 5
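
A sketch of the rule of thumb, assuming the standard definition VIFj = 1/(1 − Rj²), where Rj² is the R² from regressing xj on the other independent variables. The slide does not give this definition, and the function names and Rj² values are hypothetical.

```python
def vif(r2_j):
    """Variance inflation factor from R_j^2 (x_j regressed on the other x's)."""
    return 1.0 / (1.0 - r2_j)


def classify_vif(value):
    """Apply the thresholds stated above: > 10 severe, > 5 moderately strong."""
    if value > 10:
        return "severe"
    if value > 5:
        return "moderately strong"
    return "not flagged"


# Hypothetical R_j^2 values for three independent variables
labels = [classify_vif(vif(r2)) for r2 in (0.95, 0.85, 0.50)]
print(labels)   # ['severe', 'moderately strong', 'not flagged']
```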

LO15-9: Describe multicollinearity and build and validate a multiple regression model (Optional).

Effect of Adding Independent Variable

Adding any independent variable will increase R²

Even adding an unimportant independent variable

Thus, R² cannot tell us that adding an independent variable is undesirable

A Better Criterion is the Standard Error

A better criterion is the size of the standard error s

If s increases when an independent variable is added, we should not add that variable

However, decreasing s alone is not enough

An independent variable should only be included if it reduces s enough to offset the higher t value and reduces the length of the desired prediction interval for y

C Statistic

Another quantity for comparing regression models is called the C (a.k.a. Cp) statistic

First, calculate the mean square error for the model containing all p potential independent variables (s²p)

Next, calculate SSE for a reduced model with k independent variables

Then C = SSE / s²p − [n − 2(k + 1)]
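
The two steps combine into the statistic as sketched below, assuming the usual form C = SSE / s²p − [n − 2(k + 1)]. The function name and the input values are hypothetical and were chosen so that C comes out equal to k + 1.

```python
def c_statistic(sse_reduced, s2_p, n, k):
    """C (Cp) for a reduced model with k independent variables.

    sse_reduced: SSE of the reduced model
    s2_p: mean square error of the model with all p potential variables
    """
    return sse_reduced / s2_p - (n - 2 * (k + 1))


# Hypothetical values: n = 25, reduced model with k = 3 variables
c = c_statistic(sse_reduced=42.0, s2_p=2.0, n=25, k=3)
print(c, 3 + 1)   # C = 4.0, equal to k + 1
```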

C Statistic Continued

We want the value of C to be small

Adding unimportant independent variables will raise the value of C

While we want C to be small, we also wish to find a model for which C roughly equals k + 1

A model with C substantially greater than k + 1 has substantial bias and is undesirable

If a model has a small value of C and C for this model is less than k + 1, then it is not biased and the model should be considered desirable

The Partial F Test: An F Test for a Portion of a Regression Model

To test

H0: All of the βj coefficients corresponding to the independent variables in the subset are zero

Ha: At least one of the βj coefficients is not equal to zero

Reject H0 in favor of Ha if:

F(partial) > Fα or

p-value < α
Fα is based on k − g numerator and n − (k + 1) denominator degrees of freedom, where k − g is the number of independent variables in the subset being tested
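A sketch of the partial F statistic, assuming the usual form F(partial) = [(SSE_reduced − SSE_complete)/(k − g)] / [SSE_complete/(n − (k + 1))], where the complete model has k independent variables and the reduced model (without the subset) has g. The slide does not show this formula, and the function name and SSE values are hypothetical. The critical value 3.49 is F.05 for 2 and 20 degrees of freedom from an F table.

```python
def partial_f(sse_reduced, sse_complete, n, k, g):
    """Partial F with k - g numerator and n - (k + 1) denominator df."""
    numerator = (sse_reduced - sse_complete) / (k - g)
    denominator = sse_complete / (n - (k + 1))
    return numerator / denominator


# Hypothetical: complete model k = 4, reduced model g = 2, n = 25
f_partial = partial_f(sse_reduced=60.0, sse_complete=40.0, n=25, k=4, g=2)

f_crit = 3.49                         # F_.05 with 2 and 20 df, from an F table
reject_h0 = f_partial > f_crit        # True: the subset appears to matter
print(f_partial, reject_h0)
```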
15.11 Residual Analysis and Outlier Detection in Multiple Regression (Optional)
For an observed value of yi, the residual is
ei = yi − ŷi = yi − (b0 + b1xi1 + … + bkxik)
If the assumptions hold, the residuals should look like a random sample from a normal distribution with mean 0 and variance σ²
Residual plots
Residuals versus each independent variable
Residuals versus predicted y’s
Residuals in time order (if the response is a time series)
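A minimal sketch of computing the residuals and pairing them with the predicted values, which is the data behind a residuals-versus-predicted plot. The function name and data are hypothetical, and the fitted values are assumed to come from a least squares fit with an intercept (so the residuals sum to zero).

```python
def residuals(y, y_hat):
    """e_i = y_i - y_hat_i for each observation."""
    return [yi - fi for yi, fi in zip(y, y_hat)]


y = [1.0, 3.0, 2.0, 4.0]             # hypothetical observed values
y_hat = [1.3, 2.1, 2.9, 3.7]         # assumed least squares fitted values

e = residuals(y, y_hat)
plot_points = list(zip(y_hat, e))    # (predicted, residual) pairs to plot
print(plot_points, sum(e))
```

A pattern in these points, such as a funnel shape or a curve, would suggest a violation of the constant variance assumption or a misspecified functional form.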
LO15-10: Use residual analysis and outlier detection to check the assumptions of multiple regression (Optional).
Figure 15.35
Outliers
Figure 15.37 c, d and e