Regression concepts
Basic
Concepts of Regression Analysis
Regression analysis is used to find the best relationship between a
dependent variable Y and one or more independent variables X1, X2, etc.
Let us first focus on simple linear regression. "Simple" means there is
only one independent variable X, and "linear" means we look only for the
best straight-line relationship between X and Y.
We are given a set of observations (X, Y), and our goal is to
find the best equation Ycap = a + b * X.
Here Ycap is the estimated value of Y, a is the intercept on the vertical axis,
and b is the slope of the line.
The best relationship is the one where the error between the actual Y and its estimated
value Ycap has the following properties:
1. The sum of errors is equal to 0: Sum(Y – Ycap) = 0.
2. The sum of squared errors is lower than for any other line:
Sum((Y – Ycap)^2) = minimum.
This sum of squared errors is also called the unexplainable or residual variation.
The line that satisfies these two properties is called the least squares line.
You can take any set of numbers and use the formulas below or Excel to find the best line.
The obvious question is whether this line is good enough: can we use it for projecting
trends or forecasting? To answer this question we perform correlation analysis.
The starting point of correlation analysis is the total variation of Y. Recall from
basic statistics that total variation = Sum((Y – Ybar)^2).
Dividing this total variation by n – 1 gives the sample variance, and the square root of
the variance gives the standard deviation.
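As a quick numerical check of these definitions, the following Python sketch computes the total variation, sample variance, and standard deviation for the nine sales values (in thousands) used in the worked example later in this sheet. The variable names are illustrative and are not part of the original workbook.

```python
import math

# Sales in thousands for the nine periods of the worked example.
y = [80, 90, 120, 90, 110, 120, 150, 90, 140]

n = len(y)
y_bar = sum(y) / n                                    # mean of Y (Ybar)
total_variation = sum((yi - y_bar) ** 2 for yi in y)  # Sum((Y - Ybar)^2)
variance = total_variation / (n - 1)                  # sample variance
std_dev = math.sqrt(variance)                         # standard deviation

print(y_bar, total_variation, variance, round(std_dev, 4))
```

The total variation of 4800 matches the (Y-Ybar)^2 column total in the worked example.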
In general, if we start out with a large amount of variation and, after doing the regression,
we end up with a very small amount of residual variation, we can claim that the model has
done a good job of explaining Y. Suppose our total variation was 200 and we are
left with a residual or unexplainable variation of 30. We have then explained 170 out of 200 units
of variation, so our coefficient of determination (R-squared) is 170/200 = 0.85. In other words,
the coefficient of determination tells us the proportion of variation we have succeeded in explaining
with the regression model. The most we can explain is all of it, in which case
R-squared = 1. If we do not explain anything, then R-squared = 0.
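The arithmetic in this example can be written as a one-line function. This is only a sketch of the calculation described above; the function name is an assumption chosen for illustration.

```python
def coefficient_of_determination(total_variation, residual_variation):
    """Proportion of the total variation explained by the regression."""
    explained = total_variation - residual_variation
    return explained / total_variation


# Example from the text: total variation 200, residual variation 30.
r_squared = coefficient_of_determination(200, 30)
print(r_squared)
```

With a residual of 0 the function returns 1 (everything explained), and with a residual equal to the total it returns 0 (nothing explained).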
When R-squared is 1, the value of R (the coefficient of correlation) is +1 or -1.
This is when we have a perfect model. When R-squared is 0, R is also 0 and we
say that X and Y have no linear correlation. Correlation measures the degree of linear relationship
between X and Y. If R is close to +1 or -1, we can conclude that there appears to be a
strong linear relationship between X and Y. If R is close to 0, there
does not appear to be a linear relationship between X and Y.
In simple linear regression R takes the same sign as the slope b.
When R is positive, it is called positive correlation, and we say that X and Y have a direct relationship.
When R is negative, it is called negative correlation, and we say that X and Y have an inverse relationship.
We will first learn how to find the regression line and R-squared using formulas.
Suppose the sales for the last 9 periods for a company were 80000, 90000, 120000,
90000, 110000, 120000, 150000, 90000, and 140000 units respectively. We want to forecast
sales for the tenth period using trend analysis (regression analysis).
Note that we have two variables: sales and time. We will make the data handling
easier by expressing sales in thousands and numbering the time periods 1, 2, 3, etc. Note that n = 9.
Time is the independent variable X and sales is the dependent variable Y.
# of Obs (n) = 9
X | Y | X.Y | X^2 | Ycap | Y – Ycap | (Y-Ycap)^2 | Y – Ybar | (Y-Ybar)^2
1 | 80 | 80 | 1 | 88 | -8 | 64 | -30 | 900
2 | 90 | 180 | 4 | 93.5 | -3.5 | 12.25 | -20 | 400
3 | 120 | 360 | 9 | 99 | 21 | 441 | 10 | 100
4 | 90 | 360 | 16 | 104.5 | -14.5 | 210.25 | -20 | 400
5 | 110 | 550 | 25 | 110 | 0 | 0 | 0 | 0
6 | 120 | 720 | 36 | 115.5 | 4.5 | 20.25 | 10 | 100
7 | 150 | 1050 | 49 | 121 | 29 | 841 | 40 | 1600
8 | 90 | 720 | 64 | 126.5 | -36.5 | 1332.25 | -20 | 400
9 | 140 | 1260 | 81 | 132 | 8 | 64 | 30 | 900
Totals: Sum(X) = 45 | Sum(Y) = 990 | Sum(X.Y) = 5280 | Sum(X^2) = 285 | Sum(Y – Ycap) = 0 | Sum((Y-Ycap)^2) = 2985 (Residual) | Sum(Y – Ybar) = 0 | Sum((Y-Ybar)^2) = 4800 (Total)
Xbar = 5 | Ybar = 110 | b = 5.5 | a = 82.5
Ycap = 82.5 + 5.5 X | R-Squared = 0.378125 | R = 0.6149186938
Formulas used above:
Xbar = Sum(X) / n
Ybar = Sum(Y) / n
b = (Sum(XY) – n * Xbar * Ybar) / (Sum(X^2) – n * (Xbar^2))
a = Ybar – b * Xbar
Ycap = a + b * X
Unexplainable or Residual Variation = Sum((Y – Ycap)^2)
Total Variation = Sum((Y – Ybar)^2)
R-Squared = (Total – Residual) / Total
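The formulas above translate directly into a few lines of Python. The sketch below applies them to the nine-period sales data from the table; the function names fit_line and r_squared are illustrative choices, not something defined in the workbook.

```python
# Worked example data: time period (X) and sales in thousands (Y).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [80, 90, 120, 90, 110, 120, 150, 90, 140]


def fit_line(x, y):
    """Return intercept a and slope b of the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b


def r_squared(x, y, a, b):
    """Return (Total - Residual) / Total for the line Ycap = a + b * X."""
    y_bar = sum(y) / len(y)
    residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    total = sum((yi - y_bar) ** 2 for yi in y)
    return (total - residual) / total


a, b = fit_line(x, y)
r2 = r_squared(x, y, a, b)
forecast_10 = a + b * 10  # trend forecast for the tenth period

print(a, b, round(r2, 6), forecast_10)
```

Run on this data, the sketch should reproduce a = 82.5, b = 5.5 and R-squared = 0.378125 from the table, and it gives a tenth-period forecast of 137.5 (thousand units).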
You can also solve this problem using Regression in Excel's Data Analysis package.
Go to TOOLS. If you do not see DATA ANALYSIS, choose ADD-INS,
select the data analysis add-in (Analysis ToolPak), and DATA ANALYSIS will be added to your Tools menu.
Within Data Analysis choose Regression. You can check Labels to show the variable names.
Time | Sales | ||||||||
1 | 80 | ||||||||
2 | 90 | ||||||||
3 | 120 | ||||||||
4 | 90 | ||||||||
5 | 110 | ||||||||
6 | 120 | ||||||||
7 | 150 | ||||||||
8 | 90 | ||||||||
9 | 140 | ||||||||
SUMMARY OUTPUT | |||||||||
Regression Statistics | |||||||||
Multiple R | 0.6149186938 | ||||||||
R Square | 0.378125 | ||||||||
Adjusted R Square | 0.2892857143 | ||||||||
Standard Error | 20.6501470074 | ||||||||
Observations | 9 | ||||||||
ANOVA | |||||||||
df | SS | MS | F | Significance F | |||||
Regression | 1 | 1815 | 1815 | 4.256281407 | 0.0780101022 | ||||
Residual | 7 | 2985 | 426.4285714286 | ||||||
Total | 8 | 4800 | |||||||
Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% | ||
Intercept | 82.5 | 15.0019839958 | 5.4992726311 | 0.0009072363 | 47.0259701991 | 117.9740298009 | 47.0259701991 | 117.9740298009 | |
Time | 5.5 | 2.6659225152 | 2.0630757153 | 0.0780101022 | -0.8039005226 | 11.8039005226 | -0.8039005226 | 11.8039005226 | |
RESIDUAL OUTPUT | |||||||||
Observation | Predicted Sales | Residuals | |||||||
1 | 88 | -8 | |||||||
2 | 93.5 | -3.5 | |||||||
3 | 99 | 21 | |||||||
4 | 104.5 | -14.5 | |||||||
5 | 110 | 0 | |||||||
6 | 115.5 | 4.5 | |||||||
7 | 121 | 29 | |||||||
8 | 126.5 | -36.5 | |||||||
9 | 132 | 8 |
[Figure: Time Line Fit Plot. Sales and Predicted Sales (vertical axis) plotted against Time (horizontal axis) for periods 1 to 9.]
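Several numbers in the Excel summary output can be reproduced from the two sums of squares alone. The following sketch, using only the Python standard library, recomputes Multiple R, Adjusted R Square, Standard Error, and F from the Total (4800) and Residual (2985) values. It does not compute Significance F or the P-values, which require the F and t distributions.

```python
import math

n = 9                # number of observations
k = 1                # number of independent variables (Time)
ss_total = 4800.0    # Sum((Y - Ybar)^2)
ss_residual = 2985.0 # Sum((Y - Ycap)^2)

ss_regression = ss_total - ss_residual   # explained variation
df_residual = n - k - 1                  # residual degrees of freedom
ms_residual = ss_residual / df_residual  # residual mean square

r_squared = ss_regression / ss_total
multiple_r = math.sqrt(r_squared)
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual
standard_error = math.sqrt(ms_residual)
f_stat = (ss_regression / k) / ms_residual

print(round(multiple_r, 6), round(adjusted_r_squared, 6),
      round(standard_error, 4), round(f_stat, 4))
```

Note that with a single independent variable, the F statistic equals the square of the t statistic for the slope.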
Formulas
Trend Analysis | |||||||||
n = 14 | Using Formulas
Y (# Mergers) | X (Year) | X.Y | X^2 | Ycap | Y-Ycap | (Y-Ycap)^2 | Y-Ybar | (Y-Ybar)^2
4 | 1 | 4 | 1 | 18.543 | -14.543 | 211.495 | -27.214 | 740.617 | |
17 | 2 | 34 | 4 | 20.492 | -3.492 | 12.196 | -14.214 | 202.046 | |
19 | 3 | 57 | 9 | 22.442 | -3.442 | 11.846 | -12.214 | 149.189 | |
45 | 4 | 180 | 16 | 24.391 | 20.609 | 424.722 | 13.786 | 190.046 | |
25 | 5 | 125 | 25 | 26.341 | -1.341 | 1.797 | -6.214 | 38.617 | |
37 | 6 | 222 | 36 | 28.290 | 8.710 | 75.862 | 5.786 | 33.474 | |
44 | 7 | 308 | 49 | 30.240 | 13.760 | 189.350 | 12.786 | 163.474 | |
35 | 8 | 280 | 64 | 32.189 | 2.811 | 7.902 | 3.786 | 14.332 | |
27 | 9 | 243 | 81 | 34.138 | -7.138 | 50.958 | -4.214 | 17.760 | |
31 | 10 | 310 | 100 | 36.088 | -5.088 | 25.887 | -0.214 | 0.046 | |
21 | 11 | 231 | 121 | 38.037 | -17.037 | 290.272 | -10.214 | 104.332 | |
38 | 12 | 456 | 144 | 39.987 | -1.987 | 3.947 | 6.786 | 46.046 | |
45 | 13 | 585 | 169 | 41.936 | 3.064 | 9.386 | 13.786 | 190.046 | |
49 | 14 | 686 | 196 | 43.886 | 5.114 | 26.156 | 17.786 | 316.332 | |
Total | 437 | 105 | 3721 | 1015 | 0.000 | 1341.776 | 0.000 | 2206.357 | |
AVERAGE | 31.2142857143 | 7.5 | |||||||
b= | 1.949450549 | Unexp Var = | 1341.776 | R-Square= | 0.391859188 | ||||
a= | 16.59340659 | Total Var = | 2206.357 | R = | 0.625986572 | ||||
Ycap = 16.59 + 1.95 X | Exp Var = | 864.581 | May or may not be a good model | ||||||
Standard Error = | 10.574245442 | Standard Err of b = | 0.7010656462 | ||||||
Calculated T Stat = | 2.7806961581 | Calculated F Stat = | 7.7322682773 | ||||||
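The standard error, the standard error of b, and the t and F statistics shown above can be recomputed from the mergers data. The sketch below is one way to do it and rests on two assumptions: it uses the deviation form of the slope formula, which is algebraically equivalent to the Sum(XY) form, and it uses the usual simple-regression definitions (residual degrees of freedom n - 2, t = b / SE(b), F = t^2). Variable names are illustrative.

```python
import math

years = list(range(1, 15))  # X: year 1..14
mergers = [4, 17, 19, 45, 25, 37, 44, 35, 27, 31, 21, 38, 45, 49]  # Y

n = len(years)
x_bar = sum(years) / n
y_bar = sum(mergers) / n

# Deviation sums: Sxx = Sum((X - Xbar)^2), Sxy = Sum((X - Xbar)(Y - Ybar)).
sxx = sum((x - x_bar) ** 2 for x in years)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, mergers))

b = sxy / sxx            # slope
a = y_bar - b * x_bar    # intercept

residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(years, mergers))
std_error = math.sqrt(residual_ss / (n - 2))  # standard error of the estimate
se_b = std_error / math.sqrt(sxx)             # standard error of the slope
t_stat = b / se_b
f_stat = t_stat ** 2                          # holds for one X variable

print(round(b, 4), round(a, 4), round(std_error, 4),
      round(se_b, 4), round(t_stat, 4), round(f_stat, 4))
```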
USING EXCEL BUILT-IN FUNCTION | |||||||||
SUMMARY OUTPUT | |||||||||
Regression Statistics | |||||||||
Multiple R | 0.625986572 | ||||||||
R Square | 0.391859188 | ||||||||
Adjusted R Square | 0.341180787 | ||||||||
Standard Error | 10.57424475 | ||||||||
Observations | 14 | ||||||||
ANOVA | |||||||||
df | SS | MS | F | Significance F | |||||
Regression | 1 | 864.5813187 | 864.5813187 | 7.732272141 | 0.016628916 | ||||
Residual | 12 | 1341.775824 | 111.814652 | ||||||
Total | 13 | 2206.357143 | |||||||
Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% | ||
Intercept | 16.59340659 | 5.969358487 | 2.779763793 | 0.016657688 | 3.587291958 | 29.59952123 | 3.587291958 | 29.59952123 | |
Year | 1.949450549 | 0.7010656 | 2.780696341 | 0.016628916 | 0.421959851 | 3.476941248 | 0.421959851 | 3.476941248 | |
RESIDUAL OUTPUT | |||||||||
Observation | Predicted # Mergers | Residuals | |||||||
1 | 18.54285714 | -14.54285714 | |||||||
2 | 20.49230769 | -3.492307692 | |||||||
3 | 22.44175824 | -3.441758242 | |||||||
4 | 24.39120879 | 20.60879121 | |||||||
5 | 26.34065934 | -1.340659341 | |||||||
6 | 28.29010989 | 8.70989011 | |||||||
7 | 30.23956044 | 13.76043956 | |||||||
8 | 32.18901099 | 2.810989011 | |||||||
9 | 34.13846154 | -7.138461538 | |||||||
10 | 36.08791209 | -5.087912088 | |||||||
11 | 38.03736264 | -17.03736264 | |||||||
12 | 39.98681319 | -1.986813187 | |||||||
13 | 41.93626374 | 3.063736264 | |||||||
14 | 43.88571429 | 5.114285714 |
Time Series
Example of Forecasting | |||||||||
Three-Period Simple Moving Average | Three-Period Weighted Moving Average (weights 0.2, 0.3, 0.5, oldest to most recent period)
Error | Error
Period | Sales Yt | Forecast Ft | Abs(Yt – Ft) | (Yt-Ft)^2 | APE | Forecast Ft | Abs(Yt – Ft) | (Yt-Ft)^2 | APE |
1 | 24 | ||||||||
2 | 25 | ||||||||
3 | 27 | ||||||||
4 | 24 | 25.333 | 1.333 | 1.778 | 5.556 | 25.800 | 1.800 | 3.240 | 7.500 |
5 | 20 | 25.333 | 5.333 | 28.444 | 26.667 | 25.100 | 5.100 | 26.010 | 25.500 |
6 | 21 | 23.667 | 2.667 | 7.111 | 12.698 | 22.600 | 1.600 | 2.560 | 7.619 |
7 | 23 | 21.667 | 1.333 | 1.778 | 5.797 | 21.300 | 1.700 | 2.890 | 7.391 |
8 | 28 | 21.333 | 6.667 | 44.444 | 23.810 | 21.800 | 6.200 | 38.440 | 22.143 |
9 | 24 | 24.000 | 0.000 | 0.000 | 0.000 | 25.100 | 1.100 | 1.210 | 4.583 |
10 | 26 | 25.000 | 1.000 | 1.000 | 3.846 | 25.000 | 1.000 | 1.000 | 3.846 |
11 | 24 | 26.000 | 2.000 | 4.000 | 8.333 | 25.800 | 1.800 | 3.240 | 7.500 |
12 | 29 | 24.667 | 4.333 | 18.778 | 14.943 | 24.600 | 4.400 | 19.360 | 15.172 |
Forecast (period 13) = 26.333 | MAD = 2.741 | MSSE = 11.926 | MAPE = 11.294 | Forecast (period 13) = 26.900 | MAD = 2.744 | MSSE = 10.883 | MAPE = 11.251
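The moving-average forecasts and error measures in the table above can be sketched in Python as follows. The helper names are hypothetical. The weights are given oldest to most recent, which is the ordering that reproduces the table's weighted forecasts, and MSSE is taken to be the mean of the squared errors.

```python
sales = [24, 25, 27, 24, 20, 21, 23, 28, 24, 26, 24, 29]


def moving_average_forecasts(series, weights):
    """Forecast each period from the previous len(weights) actual values.

    Weights are ordered oldest to most recent and should sum to 1.
    The last element is the forecast for the period after the series ends.
    """
    k = len(weights)
    return [sum(w * v for w, v in zip(weights, series[i - k:i]))
            for i in range(k, len(series) + 1)]


def error_measures(actuals, forecasts):
    """Return (MAD, mean squared error, MAPE in percent)."""
    pairs = list(zip(actuals, forecasts))
    mad = sum(abs(a - f) for a, f in pairs) / len(pairs)
    mse = sum((a - f) ** 2 for a, f in pairs) / len(pairs)
    mape = sum(100 * abs(a - f) / a for a, f in pairs) / len(pairs)
    return mad, mse, mape


simple = moving_average_forecasts(sales, [1 / 3, 1 / 3, 1 / 3])
weighted = moving_average_forecasts(sales, [0.2, 0.3, 0.5])

# Forecasts exist from period 4 onward; the last one is for period 13.
simple_stats = error_measures(sales[3:], simple[:-1])
weighted_stats = error_measures(sales[3:], weighted[:-1])

print(round(simple[-1], 3), [round(v, 3) for v in simple_stats])
print(round(weighted[-1], 3), [round(v, 3) for v in weighted_stats])
```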
Exponential Smoothing Method | Exponential Smoothing Method | ||||||||
Alpha = | 0.2 | Alpha = | 0.7 | ||||||
Error | Error | ||||||||
Period | Sales Yt | Forecast Ft | Abs(Yt – Ft) | (Yt-Ft)^2 | APE | Forecast Ft | Abs(Yt – Ft) | (Yt-Ft)^2 | APE |
1 | 24 | ||||||||
2 | 25 | 24.000 | 1.000 | 1.000 | 4.000 | 24.000 | 1.000 | 1.000 | 4.000 |
3 | 27 | 24.200 | 2.800 | 7.840 | 10.370 | 24.700 | 2.300 | 5.290 | 8.519 |
4 | 24 | 24.760 | 0.760 | 0.578 | 3.167 | 26.310 | 2.310 | 5.336 | 9.625 |
5 | 20 | 24.608 | 4.608 | 21.234 | 23.040 | 24.693 | 4.693 | 22.024 | 23.465 |
6 | 21 | 23.686 | 2.686 | 7.217 | 12.792 | 21.408 | 0.408 | 0.166 | 1.942 |
7 | 23 | 23.149 | 0.149 | 0.022 | 0.648 | 21.122 | 1.878 | 3.525 | 8.164 |
8 | 28 | 23.119 | 4.881 | 23.821 | 17.431 | 22.437 | 5.563 | 30.950 | 19.869 |
9 | 24 | 24.095 | 0.095 | 0.009 | 0.398 | 26.331 | 2.331 | 5.434 | 9.713 |
10 | 26 | 24.076 | 1.924 | 3.700 | 7.399 | 24.699 | 1.301 | 1.692 | 5.003 |
11 | 24 | 24.461 | 0.461 | 0.213 | 1.921 | 25.610 | 1.610 | 2.591 | 6.707 |
12 | 29 | 24.369 | 4.631 | 21.447 | 15.969 | 24.483 | 4.517 | 20.404 | 15.576 |
Forecast (period 13) = 25.295 | MAD = 2.181 | MSSE = 7.916 | MAPE = 8.831 | Forecast (period 13) = 27.645 | MAD = 2.537 | MSSE = 8.947 | MAPE = 10.235
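The exponential smoothing columns above follow the update F(t+1) = F(t) + Alpha * (Y(t) - F(t)), seeded with the first actual value as the period 2 forecast; this is the recurrence that reproduces the table's numbers. A minimal Python sketch, with an illustrative function name, is:

```python
sales = [24, 25, 27, 24, 20, 21, 23, 28, 24, 26, 24, 29]


def exponential_smoothing(series, alpha):
    """Return forecasts for periods 2 .. n+1.

    The period 2 forecast is seeded with the first actual value; each later
    forecast is the previous forecast plus alpha times the previous error.
    """
    forecasts = [series[0]]
    for actual in series[1:]:
        previous = forecasts[-1]
        forecasts.append(previous + alpha * (actual - previous))
    return forecasts


def mean_absolute_deviation(actuals, forecasts):
    pairs = list(zip(actuals, forecasts))
    return sum(abs(a - f) for a, f in pairs) / len(pairs)


f02 = exponential_smoothing(sales, 0.2)
f07 = exponential_smoothing(sales, 0.7)

# Compare forecasts for periods 2..12 with the actual sales in those periods.
mad02 = mean_absolute_deviation(sales[1:], f02[:-1])
mad07 = mean_absolute_deviation(sales[1:], f07[:-1])

print(round(f02[-1], 3), round(mad02, 3))
print(round(f07[-1], 3), round(mad07, 3))
```

On this data the smaller Alpha gives the lower MAD, which agrees with the table.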
Naïve Method Number 1 | Naïve Method Number 2 | ||||||||
Forecast Ft | Previous period's sales, Y(t-1) | Forecast Ft | Average of all sales = 24.5833333333
Error | Error | ||||||||
Period | Sales Yt | Forecast Ft | Abs(Yt – Ft) | (Yt-Ft)^2 | APE | Forecast Ft | Abs(Yt – Ft) | (Yt-Ft)^2 | APE |
1 | 24 | | | | | 24.583 | 0.583 | 0.340 | 2.431
2 | 25 | 24 | 1.000 | 1.000 | 4.000 | 24.583 | 0.417 | 0.174 | 1.667 |
3 | 27 | 25 | 2.000 | 4.000 | 7.407 | 24.583 | 2.417 | 5.840 | 8.951 |
4 | 24 | 27 | 3.000 | 9.000 | 12.500 | 24.583 | 0.583 | 0.340 | 2.431 |
5 | 20 | 24 | 4.000 | 16.000 | 20.000 | 24.583 | 4.583 | 21.007 | 22.917 |
6 | 21 | 20 | 1.000 | 1.000 | 4.762 | 24.583 | 3.583 | 12.840 | 17.063 |
7 | 23 | 21 | 2.000 | 4.000 | 8.696 | 24.583 | 1.583 | 2.507 | 6.884 |
8 | 28 | 23 | 5.000 | 25.000 | 17.857 | 24.583 | 3.417 | 11.674 | 12.202 |
9 | 24 | 28 | 4.000 | 16.000 | 16.667 | 24.583 | 0.583 | 0.340 | 2.431 |
10 | 26 | 24 | 2.000 | 4.000 | 7.692 | 24.583 | 1.417 | 2.007 | 5.449 |
11 | 24 | 26 | 2.000 | 4.000 | 8.333 | 24.583 | 0.583 | 0.340 | 2.431 |
12 | 29 | 24 | 5.000 | 25.000 | 17.241 | 24.583 | 4.417 | 19.507 | 15.230 |
Average sales = 24.5833333333 | Forecast (period 13) = 29 | MAD = 2.818 | MSSE = 9.909 | MAPE = 11.378 | Forecast (period 13) = 24.583 | MAD = 2.014 | MSSE = 6.410 | MAPE = 8.340
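Both naïve methods in the table are simple enough to express in a few lines. In this sketch, method 1 uses the previous period's sales as the forecast and method 2 uses the average of all twelve sales values for every period, which is how the forecast columns above behave. The helper name error_measures is an assumption for illustration.

```python
sales = [24, 25, 27, 24, 20, 21, 23, 28, 24, 26, 24, 29]


def error_measures(actuals, forecasts):
    """Return (MAD, mean squared error, MAPE in percent)."""
    pairs = list(zip(actuals, forecasts))
    mad = sum(abs(a - f) for a, f in pairs) / len(pairs)
    mse = sum((a - f) ** 2 for a, f in pairs) / len(pairs)
    mape = sum(100 * abs(a - f) / a for a, f in pairs) / len(pairs)
    return mad, mse, mape


# Method 1: forecast for period t is the actual value of period t - 1.
naive1_forecasts = sales[:-1]   # forecasts for periods 2..12
naive1_next = sales[-1]         # forecast for period 13
naive1_stats = error_measures(sales[1:], naive1_forecasts)

# Method 2: forecast for every period is the overall average.
average_sales = sum(sales) / len(sales)
naive2_forecasts = [average_sales] * len(sales)
naive2_stats = error_measures(sales, naive2_forecasts)

print(naive1_next, [round(v, 3) for v in naive1_stats])
print(round(average_sales, 3), [round(v, 3) for v in naive2_stats])
```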
Trend Analysis | |||||||||
Using Regression | |||||||||
X (Period) | Y (Sales Yt) | X^2 | X.Y | Forecast Ft | Abs(Yt – Ft) | (Yt-Ft)^2 | APE
1 | 24 | 1 | 24 | 23.333 | 0.667 | 0.444 | 2.778 | ||
2 | 25 | 4 | 50 | 23.561 | 1.439 | 2.072 | 5.758 | ||
3 | 27 | 9 | 81 | 23.788 | 3.212 | 10.318 | 11.897 | ||
4 | 24 | 16 | 96 | 24.015 | 0.015 | 0.000 | 0.063 | ||
5 | 20 | 25 | 100 | 24.242 | 4.242 | 17.998 | 21.212 | ||
6 | 21 | 36 | 126 | 24.470 | 3.470 | 12.039 | 16.522 | ||
7 | 23 | 49 | 161 | 24.697 | 1.697 | 2.880 | 7.378 | ||
8 | 28 | 64 | 224 | 24.924 | 3.076 | 9.460 | 10.985 | ||
9 | 24 | 81 | 216 | 25.152 | 1.152 | 1.326 | 4.798 | ||
10 | 26 | 100 | 260 | 25.379 | 0.621 | 0.386 | 2.389 | ||
11 | 24 | 121 | 264 | 25.606 | 1.606 | 2.579 | 6.692 | ||
12 | 29 | 144 | 348 | 25.833 | 3.167 | 10.028 | 10.920 | ||
Average X = 6.5 | Average Y = 24.5833333333 | Sum(X^2) = 650 | Sum(X.Y) = 1950 | Forecast (period 13) = 26.061 | MAD = 2.030 | MSSE = 5.794 | MAPE = 8.449
Slope b = | 0.2273 | ||||||||
Intercept a = | 23.1061 | ||||||||
Equation is Estimated Y = 23.1061 + 0.2273 X |
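The trend line above comes from the same least squares formulas used for regression, applied with the period number as X. The following sketch recomputes the slope, intercept, period 13 forecast, and error measures for the twelve sales values; variable names are illustrative.

```python
periods = list(range(1, 13))  # X: periods 1..12
sales = [24, 25, 27, 24, 20, 21, 23, 28, 24, 26, 24, 29]  # Y

n = len(periods)
x_bar = sum(periods) / n
y_bar = sum(sales) / n
sum_xy = sum(x * y for x, y in zip(periods, sales))
sum_x2 = sum(x ** 2 for x in periods)

# b = (Sum(XY) - n * Xbar * Ybar) / (Sum(X^2) - n * Xbar^2), a = Ybar - b * Xbar
b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
a = y_bar - b * x_bar

fitted = [a + b * x for x in periods]  # in-sample forecasts Ft
forecast_13 = a + b * 13               # forecast for the next period

errors = [y - f for y, f in zip(sales, fitted)]
mad = sum(abs(e) for e in errors) / n
mse = sum(e ** 2 for e in errors) / n
mape = sum(100 * abs(e) / y for e, y in zip(errors, sales)) / n

print(round(b, 4), round(a, 4), round(forecast_13, 3))
print(round(mad, 3), round(mse, 3), round(mape, 3))
```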