
Hi everyone. In this article of our machine learning training series using Python from EC Analytics, we'll look at the R-squared value and the adjusted R-squared value. We use both to assess how well our linear regression models fit the data.
Let's start with R-squared. R-squared is used to determine the goodness of fit in regression analysis. It is a statistical measure of the proportion of the variance in the dependent variable that the model explains, that is, how much of the dependent variable's variation the independent variables are able to account for.
R-squared equation:
R-squared = 1 - (sum of squares of residuals / total sum of squares)
Now let's understand what the sum of squares of residuals is.
In the screenshot below we have a scatter plot with a predicted line (the best-fit line), which we created using a regression model. The difference between the actual value and the predicted value on the best-fit line is the residual, or error.
For SSR (sum of squares of residuals), we square each residual and add up the results.
For example:
Predicted values are 10, 12, 8, 15, 13
Actual values are 9, 11, 8, 20, 17
Then the sum of squares of residuals is the sum of (actual - predicted)^2 over all points:
(9-10)^2 + (11-12)^2 + (8-8)^2 + (20-15)^2 + (17-13)^2 = 1 + 1 + 0 + 25 + 16 = 43
For SST (total sum of squares), we treat the average of the actual values as the predicted line: we take the difference between each actual value and that average, square it, and add up the results.
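As a quick check of the arithmetic, here is a minimal sketch that computes SSR, SST, and R-squared for the five example values above. The variable names are only illustrative, and the predicted values are taken as given rather than produced by a fitted model.

```python
# Example values from the text
predicted = [10, 12, 8, 15, 13]
actual = [9, 11, 8, 20, 17]

# The "average line" used by SST is the mean of the actual values
mean_actual = sum(actual) / len(actual)

# SSR: squared differences between actual and predicted values, summed
ssr = sum((a - p) ** 2 for a, p in zip(actual, predicted))

# SST: squared differences between actual values and their mean, summed
sst = sum((a - mean_actual) ** 2 for a in actual)

r_squared = 1 - ssr / sst
print(ssr, sst, round(r_squared, 4))
```

With these numbers SSR is 43 and SST is 110, so R-squared is 1 - 43/110, roughly 0.61.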
How to interpret the R-squared value in regression:
For a least-squares regression with an intercept, R-squared measured on the data used to fit the model is always between 0 and 100%:
- 0% indicates that the model explains none of the variability of the response data (the dependent variable) around its mean.
- 100% indicates that the model explains all of the variability of the response data (the dependent variable) around its mean.
Now let's understand adjusted R-squared, which we also use to check how well our regression model fits. When our model has multiple independent variables, we use adjusted R-squared instead of R-squared for model evaluation.
This is because adding another independent variable to the model, whether or not it is correlated with the dependent variable, never lowers R-squared and usually raises it: the extra slope coefficient gives the model more freedom to fit the data it was trained on.
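To see this effect, here is a small sketch on made-up data. It fits a least-squares model once with a single useful predictor and once with an extra predictor that is pure noise. The data, the seed, and the helper names `solve` and `fit_r_squared` are all assumptions for illustration, and the tiny hand-written solver is there only to keep the example free of dependencies. In practice you would use scikit-learn or statsmodels.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_r_squared(columns, y):
    """Fit least squares with an intercept on the given predictor columns; return R-squared."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ssr / sst

rng = random.Random(0)  # fixed seed so the run is repeatable
x1 = [rng.uniform(0, 10) for _ in range(30)]
noise = [rng.uniform(0, 10) for _ in range(30)]  # unrelated to y by construction
y = [3 + 2 * xi + rng.gauss(0, 2) for xi in x1]

r2_one = fit_r_squared([x1], y)
r2_two = fit_r_squared([x1, noise], y)
print(r2_one, r2_two)
```

The second R-squared is never smaller than the first, even though the added column carries no information about y.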
Adjusted R-squared corrects for this by penalizing every independent variable added to the model. It goes up only when a new variable improves the fit by more than would be expected by chance, so variables that are not related to the dependent variable tend to lower it rather than inflate it. That is why, in multiple regression, we use adjusted R-squared rather than R-squared to check the model.
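The standard formula for adjusted R-squared is 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p is the number of independent variables. The sketch below applies it to hypothetical numbers. The function name and the inputs are assumptions for illustration, not values from the dataset used later.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n_samples - 1) / dof

# Hypothetical case: the same R-squared of 0.80 from 50 observations,
# reached once with 2 predictors and once with 10 predictors
few = adjusted_r_squared(0.80, n_samples=50, n_predictors=2)
many = adjusted_r_squared(0.80, n_samples=50, n_predictors=10)
print(round(few, 4), round(many, 4))
```

For the same R-squared, the model that needed more predictors gets the lower adjusted value, which is the penalty described above.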
How to calculate R-squared and adjusted R-squared in Python using scikit-learn and the statsmodels library.
The model we are using here is a multiple regression model. In this example, Sales is the dependent variable, and Marketing Expenses (Direct, Tele, Email) and Region are the independent variables (predictors).
    import pandas as pd

    # Load the data (raw string so the backslashes in the Windows path are kept literally)
    df = pd.read_excel(r'D:\Project\IT Tools Training\Pyhton\Machine Learning\MLR.xlsx')
    df.head()

    # Independent variables: every column except the last
    X = df.iloc[:, :-1].values
    # Dependent variable: the fifth column (index 4)
    y = df.iloc[:, 4].values
Here we encode the Region variable, because it is a categorical variable, using scikit-learn's preprocessing module.

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # One-hot encode only the Region column (index 3). Calling OneHotEncoder on the
    # whole array would also encode the numeric expense columns, so the encoder is
    # wrapped in a ColumnTransformer and the other columns pass through unchanged.
    # The encoded dummy columns come first in the output.
    ct = ColumnTransformer([('region', OneHotEncoder(), [3])],
                           remainder='passthrough', sparse_threshold=0)
    X = ct.fit_transform(X).astype(float)

    # Dummy variable trap: drop the first dummy column
    X = X[:, 1:]

Splitting the data into training and test sets.

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Fitting a scikit-learn regression model on the training data.

    from sklearn.linear_model import LinearRegression
    reg = LinearRegression()
    reg.fit(X_train, y_train)
    # Predicting values for the test set
    y_pred = reg.predict(X_test)
    print(y_test)
    print(y_pred)

    # Check the column names
    print(df.columns.values)
    # Compare this with the formula interface of the statsmodels library
    import statsmodels.formula.api as sm

    model = sm.ols(formula='Sales ~ Direct+Tele+Region', data=df)
    fitted = model.fit()
    print(fitted.rsquared, fitted.rsquared_adj)
In the code above, the sm.ols function uses Direct, Tele, and Region as independent variables. You can add or remove independent variables in the formula and compare the resulting R-squared and adjusted R-squared values to find the best combination.