Summary
Linear regression is generally used to model the linear relationship between a continuous dependent variable and one or more independent (predictor) variables. When there is only one predictor variable, the model is called “Simple Linear Regression”; when there is more than one, it is referred to as “Multiple Linear Regression”. The general form of the linear regression model is given below, where b0 is the Y intercept, e is the error term, b1 is the coefficient (slope) for independent factor x1, b2 is the coefficient (slope) for independent factor x2, and so on.
y = b0 + b1* x1 + b2 *x2 + e
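As a minimal sketch of evaluating the general form above (the coefficient and input values here are purely illustrative, not from any fitted model):

```python
# Evaluate y = b0 + b1*x1 + b2*x2 + ... (error term e omitted for a point prediction).
def predict(b0, coefficients, xs):
    """Return b0 + b1*x1 + b2*x2 + ... for parallel lists of slopes and inputs."""
    return b0 + sum(b * x for b, x in zip(coefficients, xs))

# Illustrative values: intercept 2.0, slopes 0.5 and -1.5, inputs 4.0 and 2.0.
y_hat = predict(2.0, [0.5, -1.5], [4.0, 2.0])  # 2.0 + 0.5*4.0 - 1.5*2.0 = 1.0
```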
Applications of Linear Regression
Linear regression can be used for establishing the relationship (coefficient estimation) between independent and dependent variables, testing hypotheses, and predicting the dependent variable for a given combination of independent variables. Examples-
1. Predicting a customer’s income based on the customer’s zip code, spend pattern, number of loans, payment patterns, etc.
2. Predicting sales for an e-commerce firm.
3. Predicting the unemployment rate of a country.
4. What-if analysis based on the model’s established relationship between the dependent and independent factors.
Underlying Algorithm and Assumptions
The underlying algorithm is called Ordinary Least Squares (OLS). The algorithm fits a straight line through the points in a way that minimizes the sum of squared distances of the points from the line. In other words, it minimizes the sum of squared errors in the predictions. Some of the key assumptions for linear regression models are-
Linear Relationship- The predictor variables (Xs) and the dependent variable have a linear relationship. This can be easily verified by plotting each X against Y in a scatter plot. If a linear relationship doesn’t exist, either the variables need to be transformed or some other technique should be used.
No Heteroscedasticity- The error between the predicted and actual values should be randomly distributed across all values of the independent factors. This can be easily verified by plotting the error (residual) terms against each X. If there is no pattern, there is homoscedasticity; otherwise there is heteroscedasticity (lack of constant variance), which needs to be fixed before finalizing the model.
No or Little Multicollinearity- The independent factors should not be correlated with each other. If some are collinear, they need to be excluded from the final model to give stability to the model and its estimated coefficients.
Normality- Residuals (errors), i.e. predicted minus actual values, should be normally distributed with a mean of zero and constant standard deviation, and should not be correlated with the independent factors.
Independence- Observations are independent of each other; the value of Y for one observation should not be correlated with the value of Y for the previous observation.
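The OLS idea above can be sketched for the simple (one-predictor) case, where the intercept and slope that minimize the sum of squared errors have a well-known closed form. The data points below are illustrative:

```python
# Closed-form OLS fit for simple linear regression:
# minimizes sum((y_i - (b0 + b1*x_i))^2) over b0 and b1.
def ols_fit(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x); the intercept then makes the
    # fitted line pass through the point (mean_x, mean_y).
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Illustrative points lying exactly on y = 1 + 2x, so OLS recovers b0=1, b1=2.
b0, b1 = ols_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```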
Tools to Build Linear Regression
Excel- The “Data Analysis” tool pack in Excel has a tool for building multiple linear regression models.
R- The functions “lm” and “glm” are frequently used for building linear regression models.
SAS- PROC REG in SAS achieves the same objective.
Key Metrics and Interpretation
Several metrics are generated in the multiple linear regression output. The key ones are-
R^2 (R square)- This tells the percentage of variance in the dependent variable that can be explained by the model and its independent variables: R^2 = Explained Variation / Total Variation. The range for this metric is 0 to 1, or 0% to 100%. An R^2 of 0% means the model explains none of the variation in the dependent variable, while 100% signifies a perfect model that explains all of it. R^2 should be as close to 1 as possible for a good model.
F Statistic and Related ‘p’ (Significance) Value- The F test measures the lift of the model with predictor variables over a model with only an intercept. The ‘p’ value gives the significance of rejecting the null hypothesis that all model coefficients for the predictor variables are zero. The F statistic should be as high as possible and the associated p value as low as possible for a good model. For example, a p value of 0.002 means we are 1 - 0.002, or 99.8%, confident that some coefficients of the independent variables are non-zero in the model; in other words, some independent variables have good explanatory power for the dependent variable.
Coefficient Estimates- Coefficient estimates are the multiplicative terms for each independent factor used to derive the regression equation. In the example below, we are modelling the sales of an eCommerce company. The equation can be derived as-
Sales = 624.69704 + 0.18184*Marketing Budget - 0.556408*Price

Coefficients:
Estimate    Std. Error    t value    Pr(>|t|)
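The R^2 metric described above can be computed directly from a model’s predictions as 1 minus the ratio of residual variation to total variation. A minimal sketch with illustrative numbers (not taken from the sales model above):

```python
# R^2 = 1 - (residual sum of squares / total sum of squares),
# i.e. the share of total variation in y that the model explains.
def r_squared(actual, predicted):
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

# Illustrative actuals and near-matching predictions from some fitted model.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.8, 5.1, 7.2, 8.9]
r2 = r_squared(actual, predicted)  # close to 1: the model explains most variation
```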