# Regression concepts

## Basic

Concepts of Regression Analysis

Regression analysis is used to find the best relationship between a dependent variable Y and one or more independent variables X1, X2, etc. Let us first focus on simple linear regression. "Simple" means there is only one independent variable X, and "linear" means we look only for the best linear relationship between X and Y. We are given a set of observations (X, Y), and our goal is to find the best equation

Ycap = a + b * X

where Ycap is the estimated value of Y, a is the intercept on the vertical axis, and b is the slope of the line.

The best relationship is the one where the error between the actual Y and its estimated value Ycap has the following properties:

1. The sum of errors is zero: Sum(Y - Ycap) = 0.
2. The sum of squared errors is lower than for any other line: Sum((Y - Ycap)^2) = minimum.

This sum of squared errors is also called the unexplainable or residual variation. The line that satisfies these two properties is called the least squares line. You can take any set of numbers and use the formulas below, or EXCEL, to find the best line.

The obvious question is whether this line is good enough: can we use it for projecting trends or forecasting? To answer this question we perform correlation analysis. The starting point of correlation analysis is the total variation of Y. Recall from basic statistics that

Total variation = Sum((Y - Ybar)^2)

Dividing this total variation by n - 1 gives the variance, and the square root of the variance gives the standard deviation.

In general, if we start with a large amount of variation and, after doing the regression, end up with a very small amount of residual variation, we claim that we have done a good job of explaining the variation and that our model is good. Suppose our total variation was 200 and we are left with a residual (unexplainable) variation of 30. We have then explained 170 out of 200 units of variation.
Our coefficient of determination would be 170/200, or 0.85. In other words, the coefficient of determination (R-squared) tells us the proportion of variation we have succeeded in explaining with the regression model. The most we can explain is all of it, in which case R-squared = 1. If we explain nothing, R-squared = 0.

When R-squared is 1, the value of R (the coefficient of correlation) is +1 or -1; this is when we have a perfect model. When R-squared is 0, R is also 0 and we say that X and Y have no correlation.

Correlation measures the degree of linear relationship between X and Y. If R is close to +1 or -1, we can conclude that there appears to be a strong linear relationship between X and Y. If R is close to 0, we say that there does not appear to be a linear relationship between X and Y. When R is positive, it is called positive correlation and we say that X and Y have a direct relationship. When R is negative, it is called negative correlation and we say that X and Y have an inverse relationship.

We will first learn how to find the regression line and R-squared using formulas. Suppose the sales for the last 9 periods for a company were 80000, 90000, 120000, 90000, 110000, 120000, 150000, 90000 and 140000 units respectively. We want to forecast for the tenth period using trend analysis (regression analysis). Note that we have two variables, sales and time. We make the data handling easier by expressing sales in thousands and numbering time 1, 2, 3, and so on. Time is the independent variable X, sales is the dependent variable Y, and n = 9.
| X | Y | X.Y | X^2 | Ycap | Y - Ycap | (Y-Ycap)^2 | Y - Ybar | (Y-Ybar)^2 |
|---|---|---|---|---|---|---|---|---|
| 1 | 80 | 80 | 1 | 88 | -8 | 64 | -30 | 900 |
| 2 | 90 | 180 | 4 | 93.5 | -3.5 | 12.25 | -20 | 400 |
| 3 | 120 | 360 | 9 | 99 | 21 | 441 | 10 | 100 |
| 4 | 90 | 360 | 16 | 104.5 | -14.5 | 210.25 | -20 | 400 |
| 5 | 110 | 550 | 25 | 110 | 0 | 0 | 0 | 0 |
| 6 | 120 | 720 | 36 | 115.5 | 4.5 | 20.25 | 10 | 100 |
| 7 | 150 | 1050 | 49 | 121 | 29 | 841 | 40 | 1600 |
| 8 | 90 | 720 | 64 | 126.5 | -36.5 | 1332.25 | -20 | 400 |
| 9 | 140 | 1260 | 81 | 132 | 8 | 64 | 30 | 900 |
| Totals: 45 | 990 | 5280 | 285 | | 0 | 2985 (Residual) | 0 | 4800 (Total) |

Results:

- # of Obs: n = 9
- Xbar = 5
- Ybar = 110
- b = 5.5
- a = 82.5
- Ycap = 82.5 + 5.5 X
- R-Squared = 0.378125
- R = 0.6149186938

Formulas used above:

- Ybar = Sum(Y) / n
- Xbar = Sum(X) / n
- b = (Sum(XY) - n * Xbar * Ybar) / (Sum(X^2) - n * (Xbar^2))
- a = Ybar - b * Xbar
- Ycap = a + b * X
- Unexplainable or Residual Variation = Sum((Y - Ycap)^2)
- Total Variation = Sum((Y - Ybar)^2)
- R-Squared = (Total - Residual) / Total

You can solve this problem using Regression in EXCEL's Data Analysis package. Go to TOOLS. If you do not have DATA ANALYSIS, choose ADD-INS and select DATA ANALYSIS; it will be added to your Tools menu. Within Data Analysis, choose Regression. You can check Labels to show the variable names.

Input data:

| Time | Sales |
|---|---|
| 1 | 80 |
| 2 | 90 |
| 3 | 120 |
| 4 | 90 |
| 5 | 110 |
| 6 | 120 |
| 7 | 150 |
| 8 | 90 |
| 9 | 140 |

SUMMARY OUTPUT

| Regression Statistics | |
|---|---|
| Multiple R | 0.6149186938 |
| R Square | 0.378125 |
| Adjusted R Square | 0.2892857143 |
| Standard Error | 20.6501470074 |
| Observations | 9 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 1815 | 1815 | 4.256281407 | 0.0780101022 |
| Residual | 7 | 2985 | 426.4285714286 | | |
| Total | 8 | 4800 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
|---|---|---|---|---|---|---|---|---|
| Intercept | 82.5 | 15.0019839958 | 5.4992726311 | 0.0009072363 | 47.0259701991 | 117.9740298009 | 47.0259701991 | 117.9740298009 |
| Time | 5.5 | 2.6659225152 | 2.0630757153 | 0.0780101022 | -0.8039005226 | 11.8039005226 | -0.8039005226 | 11.8039005226 |

RESIDUAL OUTPUT

| Observation | Predicted Sales | Residuals |
|---|---|---|
| 1 | 88 | -8 |
| 2 | 93.5 | -3.5 |
| 3 | 99 | 21 |
| 4 | 104.5 | -14.5 |
| 5 | 110 | 0 |
| 6 | 115.5 | 4.5 |
| 7 | 121 | 29 |
| 8 | 126.5 | -36.5 |
| 9 | 132 | 8 |
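The hand calculation above can be cross-checked with a short script. The following is only a sketch in plain Python that applies the formulas listed above to the nine sales figures (in thousands). The function and variable names (`fit_line`, `variations`, `forecast_10`) are illustrative choices and not part of the original worksheet. Taking the sign of R from the slope b is an added assumption that holds for simple regression.

```python
import math

# Sales in thousands of units for periods 1..9, taken from the worked example.
sales = [80, 90, 120, 90, 110, 120, 150, 90, 140]
time = list(range(1, len(sales) + 1))


def fit_line(x, y):
    """Return (a, b) for the least squares line Ycap = a + b * X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b


def variations(x, y, a, b):
    """Return (residual variation, total variation) for the fitted line."""
    y_bar = sum(y) / len(y)
    residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    total = sum((yi - y_bar) ** 2 for yi in y)
    return residual, total


a, b = fit_line(time, sales)
residual, total = variations(time, sales, a, b)
r_squared = (total - residual) / total
# Assumption: in simple regression R carries the sign of the slope b.
r = math.copysign(math.sqrt(r_squared), b)
# Illustrative name: trend projection for the tenth period.
forecast_10 = a + b * 10

print(f"Ycap = {a} + {b} X")
print(f"R-squared = {r_squared:.6f}, R = {r:.6f}")
print(f"Forecast for period 10 = {forecast_10} thousand units")
```

With the example data this reproduces a = 82.5, b = 5.5 and R-squared = 0.378125, and it projects 137.5 thousand units for the tenth period.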

Figure: Time Line Fit Plot (Sales and Predicted Sales plotted against Time).

## Formulas

Trend Analysis Using Formulas (n = 14)

| Y (# Mergers) | X (Year) | X.Y | X^2 | Ycap | Y-Ycap | (Y-Ycap)^2 | Y-Ybar | (Y-Ybar)^2 |
|---|---|---|---|---|---|---|---|---|
| 4 | 1 | 4 | 1 | 18.543 | -14.543 | 211.495 | -27.214 | 740.617 |
| 17 | 2 | 34 | 4 | 20.492 | -3.492 | 12.196 | -14.214 | 202.046 |
| 19 | 3 | 57 | 9 | 22.442 | -3.442 | 11.846 | -12.214 | 149.189 |
| 45 | 4 | 180 | 16 | 24.391 | 20.609 | 424.722 | 13.786 | 190.046 |
| 25 | 5 | 125 | 25 | 26.341 | -1.341 | 1.797 | -6.214 | 38.617 |
| 37 | 6 | 222 | 36 | 28.290 | 8.710 | 75.862 | 5.786 | 33.474 |
| 44 | 7 | 308 | 49 | 30.240 | 13.760 | 189.350 | 12.786 | 163.474 |
| 35 | 8 | 280 | 64 | 32.189 | 2.811 | 7.902 | 3.786 | 14.332 |
| 27 | 9 | 243 | 81 | 34.138 | -7.138 | 50.958 | -4.214 | 17.760 |
| 31 | 10 | 310 | 100 | 36.088 | -5.088 | 25.887 | -0.214 | 0.046 |
| 21 | 11 | 231 | 121 | 38.037 | -17.037 | 290.272 | -10.214 | 104.332 |
| 38 | 12 | 456 | 144 | 39.987 | -1.987 | 3.947 | 6.786 | 46.046 |
| 45 | 13 | 585 | 169 | 41.936 | 3.064 | 9.386 | 13.786 | 190.046 |
| 49 | 14 | 686 | 196 | 43.886 | 5.114 | 26.156 | 17.786 | 316.332 |
| Total: 437 | 105 | 3721 | 1015 | | 0.000 | 1341.776 | 0.000 | 2206.357 |
| Average: 31.2142857143 | 7.5 | | | | | | | |

Results:

- b = 1.949450549
- a = 16.59340659
- Ycap = 16.59 + 1.95 X
- Unexplained variation = 1341.776
- Total variation = 2206.357
- Explained variation = 864.581
- R-Square = 0.391859188
- R = 0.625986572
- Verdict: may or may not be a good model
- Standard Error = 10.574245442
- Standard Error of b = 0.7010656462
- Calculated t Stat = 2.7806961581
- Calculated F Stat = 7.7322682773

USING EXCEL BUILT-IN FUNCTION

SUMMARY OUTPUT

| Regression Statistics | |
|---|---|
| Multiple R | 0.625986572 |
| R Square | 0.391859188 |
| Adjusted R Square | 0.341180787 |
| Standard Error | 10.57424475 |
| Observations | 14 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 864.5813187 | 864.5813187 | 7.732272141 | 0.016628916 |
| Residual | 12 | 1341.775824 | 111.814652 | | |
| Total | 13 | 2206.357143 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
|---|---|---|---|---|---|---|---|---|
| Intercept | 16.59340659 | 5.969358487 | 2.779763793 | 0.016657688 | 3.587291958 | 29.59952123 | 3.587291958 | 29.59952123 |
| Year | 1.949450549 | 0.7010656 | 2.780696341 | 0.016628916 | 0.421959851 | 3.476941248 | 0.421959851 | 3.476941248 |

RESIDUAL OUTPUT

| Observation | Predicted # Mergers | Residuals |
|---|---|---|
| 1 | 18.54285714 | -14.54285714 |
| 2 | 20.49230769 | -3.492307692 |
| 3 | 22.44175824 | -3.441758242 |
| 4 | 24.39120879 | 20.60879121 |
| 5 | 26.34065934 | -1.340659341 |
| 6 | 28.29010989 | 8.70989011 |
| 7 | 30.23956044 | 13.76043956 |
| 8 | 32.18901099 | 2.810989011 |
| 9 | 34.13846154 | -7.138461538 |
| 10 | 36.08791209 | -5.087912088 |
| 11 | 38.03736264 | -17.03736264 |
| 12 | 39.98681319 | -1.986813187 |
| 13 | 41.93626374 | 3.063736264 |
| 14 | 43.88571429 | 5.114285714 |
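The worksheet reports a standard error, a standard error of b, and t and F statistics without showing how they were computed. The sketch below recomputes them for the mergers data using the usual textbook formulas for simple regression. Those formulas are an assumption here, since the source gives only the resulting values, and all variable names are illustrative.

```python
import math

# Number of mergers (Y) for years 1..14 (X), from the table above.
mergers = [4, 17, 19, 45, 25, 37, 44, 35, 27, 31, 21, 38, 45, 49]
years = list(range(1, len(mergers) + 1))

n = len(years)
x_bar = sum(years) / n
y_bar = sum(mergers) / n

# Least squares slope and intercept, using the formulas from the Basic section.
s_xx = sum(x * x for x in years) - n * x_bar ** 2
b = (sum(x * y for x, y in zip(years, mergers)) - n * x_bar * y_bar) / s_xx
a = y_bar - b * x_bar

unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(years, mergers))
total = sum((y - y_bar) ** 2 for y in mergers)
explained = total - unexplained

# Assumed textbook formulas (not spelled out in the worksheet):
# the residual variation has n - 2 degrees of freedom in simple regression.
std_error = math.sqrt(unexplained / (n - 2))
std_error_b = std_error / math.sqrt(s_xx)
t_stat = b / std_error_b
f_stat = explained / (unexplained / (n - 2))

print(f"b = {b:.6f}, a = {a:.6f}")
print(f"Standard error = {std_error:.4f}, standard error of b = {std_error_b:.4f}")
print(f"t = {t_stat:.4f}, F = {f_stat:.4f}")
```

Under these assumptions the results agree with the worksheet and the Excel output to about four decimal places. With a single independent variable, F is the square of the t statistic for the slope.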

## Time Series

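Example of Forecasting

In every table below, Yt is the actual sales figure, Ft is the forecast, and APE is the absolute percentage error, 100 * Abs(Yt - Ft) / Yt. MAD, MSSE and MAPE are the averages of the Abs(Yt - Ft), (Yt-Ft)^2 and APE columns respectively.

Three-Period Simple Moving Average and Three-Period Weighted Moving Average (weights 0.2, 0.3, 0.5, oldest to newest)

| Period | Sales Yt | Simple Ft | Abs(Yt - Ft) | (Yt-Ft)^2 | APE | Weighted Ft | Abs(Yt - Ft) | (Yt-Ft)^2 | APE |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 24 | | | | | | | | |
| 2 | 25 | | | | | | | | |
| 3 | 27 | | | | | | | | |
| 4 | 24 | 25.333 | 1.333 | 1.778 | 5.556 | 25.800 | 1.800 | 3.240 | 7.500 |
| 5 | 20 | 25.333 | 5.333 | 28.444 | 26.667 | 25.100 | 5.100 | 26.010 | 25.500 |
| 6 | 21 | 23.667 | 2.667 | 7.111 | 12.698 | 22.600 | 1.600 | 2.560 | 7.619 |
| 7 | 23 | 21.667 | 1.333 | 1.778 | 5.797 | 21.300 | 1.700 | 2.890 | 7.391 |
| 8 | 28 | 21.333 | 6.667 | 44.444 | 23.810 | 21.800 | 6.200 | 38.440 | 22.143 |
| 9 | 24 | 24.000 | 0.000 | 0.000 | 0.000 | 25.100 | 1.100 | 1.210 | 4.583 |
| 10 | 26 | 25.000 | 1.000 | 1.000 | 3.846 | 25.000 | 1.000 | 1.000 | 3.846 |
| 11 | 24 | 26.000 | 2.000 | 4.000 | 8.333 | 25.800 | 1.800 | 3.240 | 7.500 |
| 12 | 29 | 24.667 | 4.333 | 18.778 | 14.943 | 24.600 | 4.400 | 19.360 | 15.172 |

| Measure | Simple | Weighted |
|---|---|---|
| Forecast (period 13) | 26.333 | 26.900 |
| MAD | 2.741 | 2.744 |
| MSSE | 11.926 | 10.883 |
| MAPE | 11.294 | 11.251 |

Exponential Smoothing Method (Alpha = 0.2 and Alpha = 0.7)

| Period | Sales Yt | Ft (Alpha = 0.2) | Abs(Yt - Ft) | (Yt-Ft)^2 | APE | Ft (Alpha = 0.7) | Abs(Yt - Ft) | (Yt-Ft)^2 | APE |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 24 | | | | | | | | |
| 2 | 25 | 24.000 | 1.000 | 1.000 | 4.000 | 24.000 | 1.000 | 1.000 | 4.000 |
| 3 | 27 | 24.200 | 2.800 | 7.840 | 10.370 | 24.700 | 2.300 | 5.290 | 8.519 |
| 4 | 24 | 24.760 | 0.760 | 0.578 | 3.167 | 26.310 | 2.310 | 5.336 | 9.625 |
| 5 | 20 | 24.608 | 4.608 | 21.234 | 23.040 | 24.693 | 4.693 | 22.024 | 23.465 |
| 6 | 21 | 23.686 | 2.686 | 7.217 | 12.792 | 21.408 | 0.408 | 0.166 | 1.942 |
| 7 | 23 | 23.149 | 0.149 | 0.022 | 0.648 | 21.122 | 1.878 | 3.525 | 8.164 |
| 8 | 28 | 23.119 | 4.881 | 23.821 | 17.431 | 22.437 | 5.563 | 30.950 | 19.869 |
| 9 | 24 | 24.095 | 0.095 | 0.009 | 0.398 | 26.331 | 2.331 | 5.434 | 9.713 |
| 10 | 26 | 24.076 | 1.924 | 3.700 | 7.399 | 24.699 | 1.301 | 1.692 | 5.003 |
| 11 | 24 | 24.461 | 0.461 | 0.213 | 1.921 | 25.610 | 1.610 | 2.591 | 6.707 |
| 12 | 29 | 24.369 | 4.631 | 21.447 | 15.969 | 24.483 | 4.517 | 20.404 | 15.576 |

| Measure | Alpha = 0.2 | Alpha = 0.7 |
|---|---|---|
| Forecast (period 13) | 25.295 | 27.645 |
| MAD | 2.181 | 2.537 |
| MSSE | 7.916 | 8.947 |
| MAPE | 8.831 | 10.235 |

Naïve Method Number 1 and Naïve Method Number 2

In the table, Method 1 forecasts each period with the previous period's sales. Method 2 forecasts every period with the average of all twelve sales figures, 24.5833333333.

| Period | Sales Yt | Method 1 Ft | Abs(Yt - Ft) | (Yt-Ft)^2 | APE | Method 2 Ft | Abs(Yt - Ft) | (Yt-Ft)^2 | APE |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 24 | | | | | 24.583 | 0.583 | 0.340 | 2.431 |
| 2 | 25 | 24 | 1.000 | 1.000 | 4.000 | 24.583 | 0.417 | 0.174 | 1.667 |
| 3 | 27 | 25 | 2.000 | 4.000 | 7.407 | 24.583 | 2.417 | 5.840 | 8.951 |
| 4 | 24 | 27 | 3.000 | 9.000 | 12.500 | 24.583 | 0.583 | 0.340 | 2.431 |
| 5 | 20 | 24 | 4.000 | 16.000 | 20.000 | 24.583 | 4.583 | 21.007 | 22.917 |
| 6 | 21 | 20 | 1.000 | 1.000 | 4.762 | 24.583 | 3.583 | 12.840 | 17.063 |
| 7 | 23 | 21 | 2.000 | 4.000 | 8.696 | 24.583 | 1.583 | 2.507 | 6.884 |
| 8 | 28 | 23 | 5.000 | 25.000 | 17.857 | 24.583 | 3.417 | 11.674 | 12.202 |
| 9 | 24 | 28 | 4.000 | 16.000 | 16.667 | 24.583 | 0.583 | 0.340 | 2.431 |
| 10 | 26 | 24 | 2.000 | 4.000 | 7.692 | 24.583 | 1.417 | 2.007 | 5.449 |
| 11 | 24 | 26 | 2.000 | 4.000 | 8.333 | 24.583 | 0.583 | 0.340 | 2.431 |
| 12 | 29 | 24 | 5.000 | 25.000 | 17.241 | 24.583 | 4.417 | 19.507 | 15.230 |

| Measure | Method 1 | Method 2 |
|---|---|---|
| Forecast (period 13) | 29 | 24.583 |
| MAD | 2.818 | 2.014 |
| MSSE | 9.909 | 6.410 |
| MAPE | 11.378 | 8.340 |

Trend Analysis Using Regression

| Period X | Sales Y (Yt) | X^2 | X.Y | Forecast Ft | Abs(Yt - Ft) | (Yt-Ft)^2 | APE |
|---|---|---|---|---|---|---|---|
| 1 | 24 | 1 | 24 | 23.333 | 0.667 | 0.444 | 2.778 |
| 2 | 25 | 4 | 50 | 23.561 | 1.439 | 2.072 | 5.758 |
| 3 | 27 | 9 | 81 | 23.788 | 3.212 | 10.318 | 11.897 |
| 4 | 24 | 16 | 96 | 24.015 | 0.015 | 0.000 | 0.063 |
| 5 | 20 | 25 | 100 | 24.242 | 4.242 | 17.998 | 21.212 |
| 6 | 21 | 36 | 126 | 24.470 | 3.470 | 12.039 | 16.522 |
| 7 | 23 | 49 | 161 | 24.697 | 1.697 | 2.880 | 7.378 |
| 8 | 28 | 64 | 224 | 24.924 | 3.076 | 9.460 | 10.985 |
| 9 | 24 | 81 | 216 | 25.152 | 1.152 | 1.326 | 4.798 |
| 10 | 26 | 100 | 260 | 25.379 | 0.621 | 0.386 | 2.389 |
| 11 | 24 | 121 | 264 | 25.606 | 1.606 | 2.579 | 6.692 |
| 12 | 29 | 144 | 348 | 25.833 | 3.167 | 10.028 | 10.920 |
| Average: 6.5 | Average: 24.5833333333 | Sum: 650 | Sum: 1950 | | | | |

| Measure | Trend |
|---|---|
| Forecast (period 13) | 26.061 |
| MAD | 2.030 |
| MSSE | 5.794 |
| MAPE | 8.449 |

Slope b = 0.2273

Intercept a = 23.1061

Equation is Estimated Y = 23.1061 + 0.2273 X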
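The moving average, exponential smoothing and first naïve forecasts above can be reproduced with a few small functions. This is a sketch in plain Python, not the original spreadsheet logic. The function names are illustrative. It also assumes that the weights 0.2, 0.3, 0.5 run from the oldest to the newest period, and that the first smoothed forecast equals the first observation; both choices match the tabulated values.

```python
# Sales for periods 1..12, from the forecasting example.
sales = [24, 25, 27, 24, 20, 21, 23, 28, 24, 26, 24, 29]


def moving_average_forecasts(y, weights):
    """One-step-ahead weighted moving average; weights run oldest to newest.

    Entry i is the forecast for period i + 1 (None while no forecast exists);
    the last entry is the forecast for the period after the data ends.
    """
    k = len(weights)
    forecasts = [None] * k
    for t in range(k, len(y) + 1):
        window = y[t - k:t]
        forecasts.append(sum(w * v for w, v in zip(weights, window)))
    return forecasts


def exponential_smoothing_forecasts(y, alpha):
    """Ft+1 = Ft + alpha * (Yt - Ft), started with F2 = Y1 (assumed start)."""
    forecasts = [None, y[0]]
    for t in range(1, len(y)):
        previous = forecasts[t]
        forecasts.append(previous + alpha * (y[t] - previous))
    return forecasts


def error_measures(y, forecasts):
    """Return (MAD, MSSE, MAPE) over the periods that have a forecast."""
    pairs = [(actual, f) for actual, f in zip(y, forecasts) if f is not None]
    n = len(pairs)
    mad = sum(abs(actual - f) for actual, f in pairs) / n
    msse = sum((actual - f) ** 2 for actual, f in pairs) / n
    mape = sum(100 * abs(actual - f) / actual for actual, f in pairs) / n
    return mad, msse, mape


sma = moving_average_forecasts(sales, [1 / 3, 1 / 3, 1 / 3])
wma = moving_average_forecasts(sales, [0.2, 0.3, 0.5])
es_02 = exponential_smoothing_forecasts(sales, 0.2)
es_07 = exponential_smoothing_forecasts(sales, 0.7)
# Naive method 1 is a moving average over a single period.
naive_1 = moving_average_forecasts(sales, [1.0])

for name, f in [("Simple MA", sma), ("Weighted MA", wma),
                ("Exp. smoothing 0.2", es_02), ("Exp. smoothing 0.7", es_07),
                ("Naive 1", naive_1)]:
    mad, msse, mape = error_measures(sales, f)
    print(f"{name}: forecast {f[-1]:.3f}, MAD {mad:.3f}, "
          f"MSSE {msse:.3f}, MAPE {mape:.3f}")
```

Comparing MAD, MSSE and MAPE across the methods on the same data is how the tables above are meant to be read: the method with the smaller error measures has tracked the past sales more closely.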