In statistics, Regression analysis is a method which is used to determine relation between continuous quantitative variables. In this method, there are independent variables that are used to predict the value of a dependent variable. In case there is a single independent variable, then it is a case of simple regression, and if number of independent variables is more than one, then it is a case of multiple regression. In this article we will learn Linear Regression in Excel.
In linear regression, relationship between dependent and independent variables is modeled by fitting a linear equation to the observed data. The term ‘Linear’ is used since we use a line to fit our data. The dependent variable used in regression analysis is also known as response variable or predicted variable, and the independent variables are also known as predictors or explanatory variables.
A linear regression line has an equation which is of the kind: Y= a + bX;
X is the independent variable,
Y is the dependent variable,
b is slope of the line,
a is y-intercept (i.e. value of y when x=0).
Linear regression uses the least-squares method by minimizing sum of squares of deviation of data points from the line and thereby calculates the best fit line for observed data.
Let us say we have a dataset in which we have characteristics of some individuals like their age, bio-mass index (BMI), and the amount spent by them on medical expenses in a month. Now looking at the individuals’ characteristics like age and BMI, we wish to find how these variables affect the individual’s medical expenses, and thus carry out regression analysis and thereby estimate or predict the average medical expenses for some particular individuals.
Let us first consider the case of a simple regression where we have only one independent variable. So let us see how only age affects medical expenses.
Below is the dataset:
In this example, the linear regression equation will take the following form:
Amount on medical expenses= b*age + a
To perform linear regression analysis in Excel, there is more than one method:
- Scatter chart with a trendline
- Regression tool with Analysis ToolPak Add-In
- Linear Regression formula
Scatter chart method:
- Select the two columns (x and y), including headers
Now in the ‘Insert’ tab, expand the ‘Scatter Chart’ dropdown and select ‘Scatter’ thumbnail (first option)
On selecting this, a scatter plot will appear on which we would draw a regression line. To do this, right click on any data point on the plot and select ‘Add Trendline’:
Now we will have a ‘Format Trendline’ pane on the right. Select ‘Linear Trendline’ and ‘Display Equation on Chart’ in this:
We can improvise and edit the chart as desired, like we can add axes titles, change the scale, colour and line type.
NOTE: In this kind of regression plot, dependent variable should always be on y-axis, and independent variable on the x-axis. If anyhow the graph is plotted in reverse order, then either we should switch the axes in the chart, or swap the columns in the dataset.
Analysis ToolPak Add-In method:
Analysis ToolPak is not enabled by default and we need to enable it manually. To do so, we follow the below steps:
- Click on ‘File’ -> ‘Options’ and select ‘Add-Ins’
- In the ‘Manage’ dropdown, select ‘Excel Add-Ins’ and then click on ‘Go’
- Checkmark ‘Analysis ToolPak’ and click on ‘OK’
This will add ‘Data Analysis’ tools to the ‘Data’ tab.To Add Data Analysis Toolpak as per your excel version,you can take reference from Microsoft’s Website’s too.
Now in order to run the regression analysis, following steps are used:
- Select ‘Data Analysis’ from the ‘Data’ tab:
- Click on ‘Regression’ -> ‘OK’
This will open a regression dialog box. In this dialog box, select the following:
- Select the Input Y range and Input X range (medical expenses and age, respectively). More columns of independent variables can be selected in case of multiple linear regression (like if we wish to see the impact of BMI as well on medical expenses, then we can also select BMI along with age)
- Checkmark the ‘Labels’ box to include headers
- Choose the desired ‘output’ option
- Select the ‘residuals’ checkbox and click on ‘OK’
This will create our regression analysis output in a new worksheet, stating the Regression Statistics, ANOVA, residuals and the coefficients.
Regression Statistics: This tells us how well the regression equation fits the data:
Multiple R: This is the correlation coefficient which is used to measure the strength of linear relationship between two variables. The value of Multiple R lies between -1 to 1, and the relationship strength of variables is depicted by its absolute value with a value close to +1 or -1 indicating a stronger linear relationship, a value equal to 1 indicating a perfect linear relationship and zero value indicating no relationship between the variables.
R Square: Thisis the Coefficient of Determination which is used as an indicator of goodness of fit. It lies between 0 and 1, with a value close to 1 indicating that the model is a good fit. In this case, 0.57=57% of y-values are explained by the x-values.
Adjusted R Square: Thisis R Square adjusted for number of predictors when we have more than one predictor in case of multiple linear regression.
Standard Error: This is used toexamine the precision of regression analysis conducted.
Observations: This is used todepict the number of model observations.
ANOVA: This part is used to tell the level of variability within the regression model.
In case of simple linear regression, this is generally not used. However the ‘Significance F values’ tell us how reliable our results are, with a value greater than 0.05 suggesting to choose another predictor.
Coefficients: This is the most important part of regression analysis which is used to build the regression equation.
So, the regression equation in this case would be: y= 16.891 x – 355.32
We can see that this is the same equation which we got by method1 (scatter chart).
Now let’s say we wish to predict the average medical expenses when age is 72:
So y= 16.891 * 72 -355.32 = 860.832
So using this technique, we can predict values of y for any other values of x.
Residuals: Difference between actual and predicted values is indicated by this.
The last method: Linear Regression formula is not so commonly used to carry out regression analysis and requires the use of statistical functions like slope (), intercept (), correl () etc.
- Choose the independent and dependent variable wisely, else the analysis would be wrong.
If you want to learn more click here…
If you wants to learn How to Create Histogram in Microsoft Power BI click here…