# Regression and Correlation Notes

Regression and Correlation (Notes prepared by Dr. Tibrewala)

Regression analysis is the process of finding the best relationship between a dependent variable Y and independent variable X (or several independent variables X1, X2, …).

The process begins with identification of the dependent variable Y which we are interested in predicting or analyzing. We then define one or more independent variables which make business or technical sense. For example, sales of a durable product may be Y and interest rates, price, and advertising may be independent variables. After defining the variables, we collect historical data for both the dependent and independent variables.

The next step is to either use Excel (Data Analysis- Regression) or formulas to find the best relationship or regression equation. The best in this context means the equation which minimizes the sum of squared errors (error = actual value of Y – estimated value of Y) and where the sum of errors is zero. The formulas or Excel will always find the least squared (error) line or the regression line. However, this may not be very significant or useful. The process of evaluating (how good) this equation is called correlation analysis.

When we have only one independent variable X, then we will end up with a simple regression and when we have two or more independent variables we will end up with multiple regression. Furthermore, if we confine our search to linear models, then we will be doing either simple linear or multiple linear regressions.

A Simple Linear Regression equation will look as follows:

Ycap = a + b*X

Where “a” is the intercept and “b” is the slope or regression coefficient.

Slope measures the average rate of change in Y per unit change in X.

Note: If b is negative, then the line goes downward as X increases. If b = 0, the Y = a and the line is horizontal

If b is positive, then the line goes upward as X increases.

A Multiple regression equation will look as follows:

Ycap = a + b1*X1 + b2*X2 + b3*X3 + …….

Symbols or Terminology

n number of observations

k number of independent variables

X (X1, X2, ——) actual value of independent variable(s)

Y actual value of the dependent variable

Xbar average value of X

Ybar average value of Y

a intercept

b (b1, b2, —–) slopes or regression coefficients

Ycap estimated value of Y

Y – Ycap is the error in estimating Y

Se standard error of estimate or standard deviation of errors

Sb (Sb1, Sb2, ——) standard error of b(s)

R^2 coefficient of determination

R coefficient of correlation

T (T1, T2, ——) Calculated values T statistic(s)

F calculated value of F statistic

Simple Linear Regression Formulas

Xbar = (ΣX)/n Ybar = (ΣY)/n

b = [(ΣXY) – n (Xbar) (Ybar)] / [( ΣX^2) – n (Xbar)^2]

a = Ybar – b * Xbar

Ycap = a + b * X

Unexplainable variation = Σ (Y – Ycap)^2

Total Variation = Σ (Y – Ybar)^2

Explainable Variation = total – Unexplainable variation

R^2 = Explainable Variation/ Total variation

Se = Sqrt [Unexplainable Variation/ (n-k-1)]

Sb = Se / Sqrt [( ΣX^2) – n (Xbar)^2]

Calculated T = b/Sb

Calculated F = [Explainable variation/(k)] / [Unexplainable variation/(n-k-1)]

An alternative approach to finding these results is to use Data Analysis option within the Tools area of Excel. If you do not see Data Analysis option under Tools, then click on “Add In”, select Data Analysis Tool Pack. Once you click on Data Analysis, select Regression and identify range of dependent and independent variables. Select the location of output and click okay to get the results.

Procedure for performing Correlation Analysis

1. Look at R or R^2 and talk about it. If R^2 is close to 1 it is close to perfect model and if it is close to 0, then the model shows no linear relationship or it is statistically not significant.

2. Perform F test to see if the entire model is significant. Find critical value of F from the table using alpha = 0.05 (unless specified to be a different value), degrees of freedom for the numerator to be k and degrees of freedom for the denominator to be n-k-1. If the calculated value of F > critical F from the table or Excel formula, the model is significant. The higher the calculated F statistic the more significant the model is.

3. Perform T test for each independent variable to if it is significantly related to Y. Find the critical value of T from T table using alpha = 0.05 (unless specified otherwise) and degrees of freedom to be n-k-1. If the absolute value of the calculated T statistic for an independent variable > critical T value from the table or Excel formula, then the variable is statistically significant and thus related to Y (it passes T Test).

4. Ideally, we want all the variables to pass the T test and the whole model to pass the F test. Usually, it is a good idea to perform the regression again by eliminating the variable with the worst (meaning closest to 0) T statistic. If we have two or more models passing the F test with all variables passing T tests, we can select the best model to be the one with the lowest Se or the one which makes the most business sense.

5. Check the residuals (errors) for randomness. If residuals are not random, then further analysis must be done to investigate the reason.

Using the Model to Estimate

If you substitute the value of X (X1, X1, —-) into the regression equation to calculate Ycap the result will be a point estimate of Y. Note that the probability of this point estimate being correct is almost zero. We therefore will build a confidence interval for Y using the following formulas.

95 % Confidence Interval for Y = Ycap + or – E

E = Critical Value of T * Se

In words, this says that we are 95% confident that Y will be between Ycap – E and Ycap + E.

Performing Non-Linear regressions

Define X1 = X and X2 = X^2. Create another column X2 and perform multiple regression. The result will be a quadratic relationship. Ycap = a + b1 * X + b2 * X^2.

A Complete Example

You are a new V.P. of R&D and you are trying to convince the CEO that the R&D expenditures add to the bottom line immediately. You have following data.

Year 2000 2001 2002 2003 2004 2005

R&D (Millions \$) 2 3 5 4 11 5

Profit (Millions \$) 20 25 34 30 40 31

We will now build a simple linear regression model with Profit being the dependent variable Y and R&D being the independent variable X. Note that n =6, k =1.

 Obs X Y XY X^2 Ycap Y-Ycap (Y- Ycap)^2 Y – Ybar (Y-Ybar)^2 # RD PR Est Error Error^2 Dev Dev^2 1 2 20 40 4 24 -4 16 -10 100 2 3 25 75 9 26 -1 1 -5 25 3 5 34 170 25 30 4 16 4 16 4 4 30 120 16 28 2 4 0 0 5 11 40 440 121 42 -2 4 10 100 6 5 31 155 25 30 1 1 1 1 Sum 30 180 1000 200 0 42 0 242

Xbar = 30/6 = 5 and Ybar = 180/6 = 30

b = [1000 – 6 * 5 * 30] / [(200) – 6 * (5)^2] = 100/50 = 2.0

a = 30 – 2 * 5 = 20

Ycap = 20 + 2 * X Use this equation to fill up column 5 titled Ycap in the above table.

Unexplainable variation = 42

Total Variation = 242

Explainable Variation = 242- 42 = 200

R^2 = 200/242 = 0.826

Se = Sqrt [ 42/ (6-1-1) ] = sqrt (10.5) = 3.24

Sb = 3.24/ Sqrt(200 – 6* 5^2) = 3.24/7.07 = 0.458

Calculated T = 2/ 0.458 = 4.37

Calculated F = (200/1) / (42/ (6-1-1)) = 200 / 10.5 = 19.05

Analysis:

1. Coefficient of determination looks good.

2. Numerator DF for F Test = 1 and Denominator DF for F Test = 6 – 1 – 1 = 4. Therefore, critical value of F (alpha = 0.05) = 7.71. The Model passes F test since calc F > Table F.

3. DF for T Test = 6 – 1 – 1 = 4 therefore Critical T value is 2.776 at alpha = 0.05. The R&D expenditure passes T Test. It is significantly related to profits.

Our model is significant.

Using the model to forecast for next year

Suppose that the R&D Expenditure for the next year is budgeted to be \$10 million.

Ycap = 20 + 2 * X = 20 + 2 * (10) = 40 million. This is a point estimate of Y.

95% Confidence Interval for Y = Ycap ± E

E = Critical T value * Se = 2.776 * 3.24 = 9.0

95% Confidence Interval for Y = 40 ± 9 = [31, 49]

We are 95% confident that the profit will be between 31 and 49 million dollars.