# Basic system and data packages
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from collections import OrderedDict
import itertools
import statsmodels.formula.api as smf
import statsmodels.api as sm
import sys
import time

# Plotting packages
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator
from pandas.plotting import scatter_matrix
import seaborn as sns

# Modeling packages
from sklearn.linear_model import LinearRegression, LogisticRegression, RidgeCV, LassoCV, Ridge, Lasso, LassoLarsIC, LassoLarsCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import RFE
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score, train_test_split, cross_validate
from sklearn.metrics import mean_absolute_error, median_absolute_error, mean_squared_error, classification_report, confusion_matrix, roc_auc_score, plot_roc_curve, make_scorer, recall_score
from mlxtend.feature_selection import SequentialFeatureSelector as SFS  # also available in recent versions of sklearn (0.24+)
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
Credit for Problem 1 solution shown here: Kaitlyn Koehler, Spring 2021
This subsection revisits Problem 5 in Problem Set 1 (so that you have a proper background for this)
Import the training data into the notebook:
In [159]:
# Importing training data
data = pd.read_csv('data/Massachusetts_Census_Data_estimation.csv')
a. Data relevant to explaining average vehicle ownership per household:
I started by creating a matrix plot of the median and mean variables in the data set. These are the variables that might be relevant to explaining the average vehicle ownership per household (the response variable) without any normalization. After some visual analysis, I chose the variables that appear to have the strongest correlation with the response variable:
Median household income - positive correlation
Mean travel time to work - weaker positive correlation
Average household size - positive correlation
There might also be variables that are relevant if normalized by the total population of each observation. I brainstormed possible relevant variables to normalize:
race (ex. percentage white)
age (ex. percentage 25 yrs and older)
Income (ex. percentage making over $50,000 a year)
Some variables might have a linear relationship if transformed. One example would be a log transformation of the total population. Total population could be relevant to the response variable because in places with large populations (such as cities), people are less likely to own cars. I would expect a negative correlation between total population and average vehicles owned.
In [163]:
# Matrix plot of median/mean variables & average vehicle ownership per household
data_subset = data.loc[:, ['Average Vehicles per Household', 'Median household income ($)', 'Median Age',
                           'Mean Travel Time to Work (min)', 'Average household size']]
print('Matrix Scatter Plot of Median/Mean Variables & Average Vehicle Ownership:')
scatter_matrix(data_subset, figsize=(15, 15));  # scatter matrix plot
Matrix Scatter Plot of Median/Mean Variables & Average Vehicle Ownership:
b. Specification of explanatory variable:
In the code below I normalized some of the variables by population and log-transformed the total population. Then, I created a matrix plot of these modified variables and chose the ones with the strongest visual correlation with the response variable. After this analysis, and considering the variables analyzed above, I chose the following predictor variables for my model:
Mean travel time to work
Median household income
Average household size
Percent of population that is white
Percent of households making over $50,000 a year
Total population log normalized
I then analyzed these variables for goodness of fit by running an OLS regression of the response variable on each predictor variable individually (see the second code cell below). The statistics from the OLS regressions show good p-values for all of the variables analyzed. The R^2 values, however, were fairly low across the board; the variable with the best R^2 is the percent of households making over $50,000 a year, with an R^2 of 0.457.
In [164]:
# Normalize/transform data and plot
# Race: percent of population that is white
total = data.loc[:, 'Total']  # race data taken for entire population
pct_white = (data.loc[:, 'White'] / total) * 100
data["percent white"] = pct_white

# Age: percent over 25 years old - taken for entire population
pop_25 = data.loc[:, '25 to 34 years':'75 years and over']
pct_25 = (pop_25.sum(axis=1) / total) * 100
data["percent over 25"] = pct_25

# Income: percent of households making over $50,000 a year
tot_house = data.loc[:, 'Total households']  # income data taken for total households reported
income_50 = data.loc[:, '50,000 to 59,999':'200,000 and up']
pct_50 = (income_50.sum(axis=1) / tot_house) * 100
data["percent over 50,000"] = pct_50

# Total population: log10 transform to make linear
total_log = np.log10(total)
data["log(total)"] = total_log

fig, axs = plt.subplots(2, 2)
fig.suptitle('Scatter Plots of Normalized and Transformed Data vs. Average Vehicle Ownership')
fig.set_figheight(8)
fig.set_figwidth(8)
data.plot.scatter(x='percent white', y='Average Vehicles per Household', ax=axs[0, 0]);
data.plot.scatter(x='percent over 25', y='Average Vehicles per Household', ax=axs[0, 1]);
data.plot.scatter(x='percent over 50,000', y='Average Vehicles per Household', ax=axs[1, 0]);
data.plot.scatter(x='log(total)', y='Average Vehicles per Household', ax=axs[1, 1]);

data_train = data  # used later in code
In [165]:
# Statistical analysis: OLS regression for each predictor variable
eqn1 = 'Q("Average Vehicles per Household") ~ Q("Mean Travel Time to Work (min)")'
eqn2 = 'Q("Average Vehicles per Household") ~ Q("Median household income ($)")'
eqn3 = 'Q("Average Vehicles per Household") ~ Q("Average household size")'
eqn4 = 'Q("Average Vehicles per Household") ~ Q("percent white")'
eqn5 = 'Q("Average Vehicles per Household") ~ Q("percent over 50,000")'
eqn6 = 'Q("Average Vehicles per Household") ~ Q("log(total)")'

results1 = smf.ols(eqn1, data).fit()
results2 = smf.ols(eqn2, data).fit()
results3 = smf.ols(eqn3, data).fit()
results4 = smf.ols(eqn4, data).fit()
results5 = smf.ols(eqn5, data).fit()
results6 = smf.ols(eqn6, data).fit()

print(results1.summary())
print(results2.summary())
print(results3.summary())
print(results4.summary())
print(results5.summary())
print(results6.summary())
OLS Regression Results
===============================================================================================
Dep. Variable: Q("Average Vehicles per Household") R-squared: 0.179
Model: OLS Adj. R-squared: 0.177
Method: Least Squares F-statistic: 69.66
Date: Mon, 07 Mar 2022 Prob (F-statistic): 2.16e-15
Time: 16:06:51 Log-Likelihood: 10.531
No. Observations: 321 AIC: -17.06
Df Residuals: 319 BIC: -9.520
Df Model: 1
Covariance Type: nonrobust
=======================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------------
Intercept 1.2856 0.073 17.533 0.000 1.141 1.430
Q("Mean Travel Time to Work (min)") 0.0215 0.003 8.346 0.000 0.016 0.027
==============================================================================
Omnibus: 36.564 Durbin-Watson: 2.137
Prob(Omnibus): 0.000 Jarque-Bera (JB): 52.984
Skew: -0.747 Prob(JB): 3.12e-12
Kurtosis: 4.316 Cond. No. 159.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
===============================================================================================
Dep. Variable: Q("Average Vehicles per Household") R-squared: 0.306
Model: OLS Adj. R-squared: 0.303
Method: Least Squares F-statistic: 140.4
Date: Mon, 07 Mar 2022 Prob (F-statistic): 4.43e-27
Time: 16:06:51 Log-Likelihood: 37.354
No. Observations: 321 AIC: -70.71
Df Residuals: 319 BIC: -63.17
Df Model: 1
Covariance Type: nonrobust
====================================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------------
Intercept 1.4155 0.042 33.998 0.000 1.334 1.497
Q("Median household income ($)") 6.194e-06 5.23e-07 11.847 0.000 5.17e-06 7.22e-06
==============================================================================
Omnibus: 20.356 Durbin-Watson: 1.984
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.020
Skew: -0.490 Prob(JB): 1.36e-06
Kurtosis: 4.029 Cond. No. 2.75e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.75e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS Regression Results
===============================================================================================
Dep. Variable: Q("Average Vehicles per Household") R-squared: 0.334
Model: OLS Adj. R-squared: 0.332
Method: Least Squares F-statistic: 159.7
Date: Mon, 07 Mar 2022 Prob (F-statistic): 5.86e-30
Time: 16:06:51 Log-Likelihood: 43.979
No. Observations: 321 AIC: -83.96
Df Residuals: 319 BIC: -76.41
Df Model: 1
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 0.2741 0.128 2.138 0.033 0.022 0.526
Q("Average household size") 0.6419 0.051 12.638 0.000 0.542 0.742
==============================================================================
Omnibus: 75.501 Durbin-Watson: 2.050
Prob(Omnibus): 0.000 Jarque-Bera (JB): 188.644
Skew: -1.117 Prob(JB): 1.09e-41
Kurtosis: 6.019 Cond. No. 31.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
===============================================================================================
Dep. Variable: Q("Average Vehicles per Household") R-squared: 0.394
Model: OLS Adj. R-squared: 0.392
Method: Least Squares F-statistic: 207.4
Date: Mon, 07 Mar 2022 Prob (F-statistic): 1.45e-36
Time: 16:06:51 Log-Likelihood: 59.205
No. Observations: 321 AIC: -114.4
Df Residuals: 319 BIC: -106.9
Df Model: 1
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept 0.5348 0.095 5.652 0.000 0.349 0.721
Q("percent white") 0.0151 0.001 14.400 0.000 0.013 0.017
==============================================================================
Omnibus: 14.182 Durbin-Watson: 2.054
Prob(Omnibus): 0.001 Jarque-Bera (JB): 18.489
Skew: -0.368 Prob(JB): 9.66e-05
Kurtosis: 3.917 Cond. No. 756.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
===============================================================================================
Dep. Variable: Q("Average Vehicles per Household") R-squared: 0.457
Model: OLS Adj. R-squared: 0.456
Method: Least Squares F-statistic: 269.0
Date: Mon, 07 Mar 2022 Prob (F-statistic): 2.89e-44
Time: 16:06:51 Log-Likelihood: 76.972
No. Observations: 321 AIC: -149.9
Df Residuals: 319 BIC: -142.4
Df Model: 1
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
Intercept 0.8398 0.065 12.963 0.000 0.712 0.967
Q("percent over 50,000") 0.0159 0.001 16.400 0.000 0.014 0.018
==============================================================================
Omnibus: 31.076 Durbin-Watson: 1.939
Prob(Omnibus): 0.000 Jarque-Bera (JB): 52.588
Skew: -0.590 Prob(JB): 3.81e-12
Kurtosis: 4.594 Cond. No. 405.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
===============================================================================================
Dep. Variable: Q("Average Vehicles per Household") R-squared: 0.219
Model: OLS Adj. R-squared: 0.216
Method: Least Squares F-statistic: 89.23
Date: Mon, 07 Mar 2022 Prob (F-statistic): 7.77e-19
Time: 16:06:51 Log-Likelihood: 18.416
No. Observations: 321 AIC: -32.83
Df Residuals: 319 BIC: -25.29
Df Model: 1
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 2.7080 0.088 30.850 0.000 2.535 2.881
Q("log(total)") -0.2092 0.022 -9.446 0.000 -0.253 -0.166
==============================================================================
Omnibus: 41.173 Durbin-Watson: 2.046
Prob(Omnibus): 0.000 Jarque-Bera (JB): 76.405
Skew: -0.718 Prob(JB): 2.56e-17
Kurtosis: 4.910 Cond. No. 28.9
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
c. Analyze the explanatory variables to discard potential collinearity problems
I analyzed the 6 chosen predictor variables for collinearity, using a scatter matrix for visual analysis and a correlation matrix for statistical analysis. One pair of variables showed an obvious collinearity problem: median household income and percent of households making over $50,000 a year. The scatter plot of these two variables looks almost perfectly linear, and the correlation coefficient is nearly 1.
I decided to eliminate median household income because it appears to be the less significant predictor based on the statistical analysis above.
Here is a list of the final 5 predictor variables that I will use in my models:
Mean travel time to work
Average household size
Percent of population that is white
Percent of households making over $50,000 a year
Total population log normalized
In [166]:
scatter_matrix(data.loc[:, ['Mean Travel Time to Work (min)', 'Median household income ($)', 'Average household size',
                            'percent white', 'percent over 50,000', 'log(total)']], figsize=(15, 15));
data.loc[:, ['Mean Travel Time to Work (min)', 'Median household income ($)', 'Average household size',
             'percent white', 'percent over 50,000', 'log(total)']].corr()
# The correlation between "percent over 50,000" and median household income is almost one.
# Remove one of these; let's eliminate median household income.
# Defining the training data predictor variables (X) and response variable (y)
X = data.loc[:, ['Mean Travel Time to Work (min)', 'Average household size', 'percent white',
                 'log(total)', 'percent over 50,000']]   # predictor variables
y = data.loc[:, 'Average Vehicles per Household']        # response variable
# dataframe = pd.concat([X, y], axis=1)

# Set up DataFrame for model statistics:
modelStats = pd.DataFrame(index=['model1', 'model2', 'model3', 'model4'],
                          columns=["model", "R-squared", "adj R-squared", "AIC", "BIC", "Cp"])
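As a complementary numeric check on the collinearity analysis above, variance inflation factors (VIFs) can be computed for the final predictor matrix; a minimal sketch using statsmodels (VIFs well above roughly 5-10 would flag a problematic predictor):

# Variance inflation factors for the final predictor set (a sketch)
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_vif = sm.add_constant(X)
vif = pd.Series([variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
                index=X_vif.columns)
print(vif.drop('const'))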
Model 1: OLS Linear Regression with Every Predictor Variable
The following code shows an OLS linear regression including every predictor variable. The OLS regression results summary is shown below. The R^2 and adjusted R^2 are high (about 0.84). Also, the p-values for the predictor variables are essentially zero, except for mean travel time to work, which has a p-value of 0.146 (still acceptable). Overall, I am fairly happy with this model.
In [170]:
# Model 1: OLS linear regression with every predictor variable
model1 = sm.OLS(y, sm.add_constant(X)).fit()
print(model1.summary())

n = len(y)
p = X.shape[1]
RSS = model1.ssr
sigma2 = RSS / (n - 1)  # equation 8 from PS-1 Hints
# I had to manually calculate AIC, BIC, and Cp for the OLS regression because
# statsmodels doesn't seem to use the equations we reviewed in class (from ISLR)
AIC = (1 / (n * sigma2)) * (RSS + (2 * p * sigma2))
BIC = (1 / (n * sigma2)) * (RSS + (np.log(n) * p * sigma2))
Cp = (1 / n) * (RSS + 2 * p * sigma2)
modelStats.loc['model1'] = {"model": 'OLS', "R-squared": model1.rsquared, "adj R-squared": model1.rsquared_adj,
                            "AIC": AIC, "BIC": BIC, "Cp": Cp}
OLS Regression Results
==========================================================================================
Dep. Variable: Average Vehicles per Household R-squared: 0.843
Model: OLS Adj. R-squared: 0.840
Method: Least Squares F-statistic: 337.1
Date: Mon, 07 Mar 2022 Prob (F-statistic): 4.17e-124
Time: 16:07:48 Log-Likelihood: 275.53
No. Observations: 321 AIC: -539.1
Df Residuals: 315 BIC: -516.4
Df Model: 5
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const 0.0532 0.111 0.479 0.633 -0.165 0.272
Mean Travel Time to Work (min) -0.0020 0.001 -1.457 0.146 -0.005 0.001
Average household size 0.5683 0.036 15.941 0.000 0.498 0.638
percent white 0.0080 0.001 10.991 0.000 0.007 0.009
log(total) -0.1693 0.013 -13.197 0.000 -0.195 -0.144
percent over 50,000 0.0062 0.001 8.270 0.000 0.005 0.008
==============================================================================
Omnibus: 96.002 Durbin-Watson: 2.049
Prob(Omnibus): 0.000 Jarque-Bera (JB): 663.176
Skew: -1.036 Prob(JB): 9.84e-145
Kurtosis: 9.730 Cond. No. 2.25e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.25e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
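Since the same manual AIC/BIC/Cp formulas are reused for every model below, a small helper function (a sketch using the same ISLR-style formulas as the cell above) could keep the later cells shorter:

def islr_criteria(y_true, y_pred, p):
    """Return (AIC, BIC, Cp) using the ISLR-style formulas with sigma^2 = RSS/(n-1)."""
    n = len(y_true)
    RSS = np.sum((y_true - y_pred) ** 2)
    sigma2 = RSS / (n - 1)
    AIC = (1 / (n * sigma2)) * (RSS + 2 * p * sigma2)
    BIC = (1 / (n * sigma2)) * (RSS + np.log(n) * p * sigma2)
    Cp = (1 / n) * (RSS + 2 * p * sigma2)
    return AIC, BIC, Cp

# Example: reproduce the Model 1 statistics
# AIC1, BIC1, Cp1 = islr_criteria(y, model1.predict(sm.add_constant(X)), X.shape[1])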
Model 2: Forward Stepwise Selection
The code below implements forward stepwise selection. This procedure starts with a base model $M_0$ and adds the single predictor variable that yields the best one-variable model $M_1$ (judged within each step by RSS, equivalently R^2). It then continues adding one predictor variable at a time to the previous best model to find the best $M_k$ at each step. Finally, the overall best model is chosen from the $M_k$ models using the adjusted R^2.
The forward stepwise selection here chose $M_5$ as the overall best model. This is an OLS linear regression model that contains all 5 predictor variables, so the result is identical to the regular OLS linear regression above (Model 1). This was unexpected, because I thought this method would choose an $M_4$ model without the "Mean Travel Time to Work" variable, since that variable seems to be the least statistically significant. It appears, however, that the variable is significant enough to improve the overall model.
In [171]:
# Model 2: Forward stepwise selection - same result as OLS

# Functions
def processSubset(feature_set):
    # Fit model on feature_set and calculate RSS
    x = X[list(feature_set)]
    model = sm.OLS(y, sm.add_constant(x))
    regr = model.fit()
    RSS = regr.ssr
    adjR2 = regr.rsquared_adj
    return {"model": regr, "RSS": RSS, "adj_R2": adjR2}

def forward(predictors):
    # Pull out predictors we still need to process
    remaining_predictors = [p for p in X.columns if p not in predictors]
    tic = time.time()
    results = []
    for p in remaining_predictors:
        results.append(processSubset(predictors + [p]))
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the lowest RSS (equivalently, the highest R^2 for a fixed number of predictors)
    best_model = models.loc[models['RSS'].argmin()]
    toc = time.time()
    # print("Processed ", models.shape[0], "models on", len(predictors)+1, "predictors in", (toc-tic), "seconds.")
    # Return the best model, along with some other useful information about the model
    return best_model

# Code
models_fwd = pd.DataFrame(columns=["RSS", "adj_R2", "model"])
tic = time.time()
predictors = []
for i in range(1, len(X.columns) + 1):
    models_fwd.loc[i] = forward(predictors)
    variables = models_fwd.loc[i]["model"].model.exog_names
    predictors = variables[1:]
toc = time.time()
print("Total elapsed time:", (toc - tic), "seconds.")
print('Best model from each iteration:')
print(models_fwd)

index = models_fwd['adj_R2'].argmax() + 1
best_model_M = models_fwd.loc[index]  # best of the M_k models
print('The overall best model is the model with ' + str(index) + ' predictor variables because it has the highest adjusted R^2')
model2 = best_model_M['model']
print(model2.summary())

n = len(y)
p = X.shape[1]
RSS = model2.ssr
sigma2 = RSS / (n - 1)  # equation 8 from PS-1 Hints
# AIC, BIC, and Cp calculated manually as before (ISLR formulas)
AIC = (1 / (n * sigma2)) * (RSS + (2 * p * sigma2))
BIC = (1 / (n * sigma2)) * (RSS + (np.log(n) * p * sigma2))
Cp = (1 / n) * (RSS + 2 * p * sigma2)
modelStats.loc['model2'] = {"model": 'Forward Stepwise', "R-squared": model2.rsquared, "adj R-squared": model2.rsquared_adj,
                            "AIC": AIC, "BIC": BIC, "Cp": Cp}
Total elapsed time: 0.04576921463012695 seconds.
Best model from each iteration:
RSS adj_R2 model
1 11.634684 0.455747 <statsmodels.regression.linear_model.Regressio...
2 7.090896 0.667255 <statsmodels.regression.linear_model.Regressio...
3 4.674002 0.779978 <statsmodels.regression.linear_model.Regressio...
4 3.399352 0.839474 <statsmodels.regression.linear_model.Regressio...
5 3.376606 0.840042 <statsmodels.regression.linear_model.Regressio...
The overall best model is the model with 5 predictor variables because it has the highest adjusted R^2
OLS Regression Results
==========================================================================================
Dep. Variable: Average Vehicles per Household R-squared: 0.843
Model: OLS Adj. R-squared: 0.840
Method: Least Squares F-statistic: 337.1
Date: Mon, 07 Mar 2022 Prob (F-statistic): 4.17e-124
Time: 16:07:51 Log-Likelihood: 275.53
No. Observations: 321 AIC: -539.1
Df Residuals: 315 BIC: -516.4
Df Model: 5
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const 0.0532 0.111 0.479 0.633 -0.165 0.272
percent over 50,000 0.0062 0.001 8.270 0.000 0.005 0.008
log(total) -0.1693 0.013 -13.197 0.000 -0.195 -0.144
Average household size 0.5683 0.036 15.941 0.000 0.498 0.638
percent white 0.0080 0.001 10.991 0.000 0.007 0.009
Mean Travel Time to Work (min) -0.0020 0.001 -1.457 0.146 -0.005 0.001
==============================================================================
Omnibus: 96.002 Durbin-Watson: 2.049
Prob(Omnibus): 0.000 Jarque-Bera (JB): 663.176
Skew: -1.036 Prob(JB): 9.84e-145
Kurtosis: 9.730 Cond. No. 2.25e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.25e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
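As an aside, the same forward search can also be run with mlxtend's SequentialFeatureSelector, which is already imported above as SFS; a minimal sketch (here the subset size is chosen by 10-fold cross-validated R^2 rather than adjusted R^2):

# Forward selection via mlxtend (a sketch)
sfs = SFS(LinearRegression(), k_features=(1, 5), forward=True, floating=False,
          scoring='r2', cv=10)
sfs = sfs.fit(X, y)
print('Selected features:', sfs.k_feature_names_)
print('Cross-validated R^2 of the selected subset: %.3f' % sfs.k_score_)
plot_sfs(sfs.get_metric_dict());  # performance vs. subset size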
Model 3: Ridge Regression Method
The ridge regression method adds a penalty term to the RSS, and the penalized criterion is minimized to determine the coefficients. This shrinks the coefficients of irrelevant predictors and is known as a shrinkage method. Ridge regression can shrink coefficients, but it cannot eliminate variables entirely.
The code below applies ridge regression to my model predictors. The resulting coefficients for each variable and a bar chart representing feature importance are shown below. The alpha (or $\lambda$) chosen by the ridge method and the R^2 value are also reported. The R^2 value of 0.84 is good. The coefficient that was shrunk the most was "Mean Travel Time to Work", as expected.
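Concretely, the ridge estimate minimizes the RSS plus an $\ell_2$ penalty on the coefficients:

$$\hat{\beta}^{R} = \arg\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = \mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2,$$

so a larger $\lambda$ pulls the coefficients toward (but not exactly to) zero.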
In [176]:
# Model 3: Ridge regression method
model3 = RidgeCV(normalize=True, alphas=np.logspace(-6, 6), cv=10).fit(X, y)  # note that normalize is deprecated (warnings silenced)
print("Best alpha using built-in Ridge: %f" % model3.alpha_)
print("Best score (R^2) using built-in Ridge: %f" % model3.score(X, y))  # score is R^2

coef3 = pd.Series(model3.coef_, index=X.columns)
coef3.plot(kind="barh")
plt.title("Feature importance using Ridge Model");
print('Coefficients:')
print(coef3)
print('y-intercept (B0): ' + str(model3.intercept_))

R2 = model3.score(X, y)
n = len(y)
p = X.shape[1]
R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
RSS = np.sum((model3.predict(X) - y) ** 2)
sigma2 = RSS / (n - 1)
AIC = (1 / (n * sigma2)) * (RSS + (2 * p * sigma2))
BIC = (1 / (n * sigma2)) * (RSS + (np.log(n) * p * sigma2))
Cp = (1 / n) * (RSS + 2 * p * sigma2)
modelStats.loc['model3'] = {"model": 'Ridge', "R-squared": R2, "adj R-squared": R2_adj,
                            "AIC": AIC, "BIC": BIC, "Cp": Cp}
Best alpha using built-in Ridge: 0.014563
Best score (R^2) using built-in Ridge: 0.842393
Coefficients:
Mean Travel Time to Work (min) -0.001586
Average household size 0.552673
percent white 0.007963
log(total) -0.166076
percent over 50,000 0.006247
dtype: float64
y-intercept (B0): 0.07171316706715425
Model 4: Lasso Method
The lasso method also penalizes the coefficients of irrelevant predictors by shrinking them. But, unlike the ridge method, the lasso can shrink coefficients all the way to zero.
The lasso method is applied below. The resulting coefficients for each variable and a bar chart representing feature importance are shown below. The alpha (or $\lambda$) chosen by the lasso and the R^2 value are also reported. The R^2 value of 0.84 is good, and the results are very similar to the ridge method. The coefficient that was shrunk the most was "Mean Travel Time to Work", as expected, though I am surprised that it wasn't eliminated entirely (coefficient = 0).
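For comparison with the ridge criterion above, the lasso replaces the $\ell_2$ penalty with an $\ell_1$ penalty:

$$\hat{\beta}^{L} = \arg\min_{\beta}\ \mathrm{RSS} + \lambda\sum_{j=1}^{p}|\beta_j|,$$

and it is this absolute-value penalty that allows coefficients to be set exactly to zero.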
In [177]:
# Model 4: Lasso method
model4 = LassoCV(normalize=True, cv=10).fit(X, y)
# n_alphas: number of alphas tested, default 100
# Lasso default tolerance (tol) for the optimization of alpha = 1e-4
print("Best alpha using built-in Lasso: %f" % model4.alpha_)  # lambda
print("Best score (R^2) using built-in Lasso: %f" % model4.score(X, y))  # score returns the coefficient of determination R^2 of the prediction

coef4 = pd.Series(model4.coef_, index=X.columns)
print("Lasso picked " + str(np.sum(coef4 != 0)) + " variables and eliminated the other " + str(np.sum(coef4 == 0)) + " variables.")
coef4.plot(kind="barh")
plt.title("Feature importance using Lasso Model");
# print("Number of iterations run by the coordinate descent solver to reach optimal alpha: " + str(model4.n_iter_))
print('Coefficients:')
print(coef4)

R2 = model4.score(X, y)
n = len(y)
p = X.shape[1]
R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
RSS = np.sum((model4.predict(X) - y) ** 2)
sigma2 = RSS / (n - 1)
AIC = (1 / (n * sigma2)) * (RSS + (2 * p * sigma2))
BIC = (1 / (n * sigma2)) * (RSS + (np.log(n) * p * sigma2))
Cp = (1 / n) * (RSS + 2 * p * sigma2)
modelStats.loc['model4'] = {"model": 'Lasso', "R-squared": R2, "adj R-squared": R2_adj,
                            "AIC": AIC, "BIC": BIC, "Cp": Cp}
Best alpha using built-in Lasso: 0.000010
Best score (R^2) using built-in Lasso: 0.842538
Lasso picked 5 variables and eliminated the other 0 variables.
Coefficients:
Mean Travel Time to Work (min) -0.001876
Average household size 0.566677
percent white 0.008032
log(total) -0.168877
percent over 50,000 0.006169
dtype: float64
Model Comparison
Below are the printed model statistics for each model; the adjusted R^2 and Cp are plotted as bar plots (see the sketch below). The 4 models all have very similar statistics. This surprised me, because I thought the ridge and lasso methods would provide better-fitting models due to the penalty term. The best-performing model on the training data appears to be the OLS model (which is the same as the forward stepwise model), by a very slim margin.
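A minimal sketch of the comparison cell described here, assuming the modelStats DataFrame populated in the model cells above (values are cast to float because they were stored via dict assignment):

print(modelStats)

fig, axs = plt.subplots(1, 2, figsize=(12, 4))
modelStats['adj R-squared'].astype(float).plot(kind='bar', ax=axs[0], title='Adjusted R^2 by model')
modelStats['Cp'].astype(float).plot(kind='bar', ax=axs[1], title='Cp by model')
plt.tight_layout()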
a. Using all the models you estimated, forecast the average number of vehicles per household in
the 30 validation towns.
In the code below, the test data is loaded and the 4 models from above are used to predict the response (average number of vehicles per household). The predicted response from each model is plotted against the actual response. The performance of all 4 models on the test data is very similar, which is expected given their similar performance on the training data. In these plots, points lying on or near the red line indicate observations where the model predicted the response very well.
The test MSE for each model is also calculated (see the second code block below). The test MSE is very low (close to zero) for each model, which indicates good performance. Based on this analysis, the best-fitting model is still the OLS model.
$M_B'$ = Model 1 (OLS)
In [23]:
# Importing the test data
data_test = pd.read_csv('data/Massachusetts_Census_Data_forecast.csv')

# Normalize/transform relevant data
# Race: percentage white
total = data_test.loc[:, 'Total']  # race data taken for entire population
pct_white = (data_test.loc[:, 'White'] / total) * 100
data_test["percent white"] = pct_white
# Income: percent of households making over $50,000 a year
tot_house = data_test.loc[:, 'Total households']  # income data taken for total households reported
income_50 = data_test.loc[:, '50,000 to 59,999':'200,000 and up']
pct_50 = (income_50.sum(axis=1) / tot_house) * 100
data_test["percent over 50,000"] = pct_50
# Total population: log10 transform to make linear
total_norm = np.log10(total)
data_test["log(total)"] = total_norm

# Define predictor matrices and response
X_test = np.array(data_test.loc[:, ['Mean Travel Time to Work (min)', 'Average household size', 'percent white',
                                    'log(total)', 'percent over 50,000']])
X_test2 = np.array(data_test.loc[:, ['percent over 50,000', 'log(total)', 'Average household size',
                                     'percent white', 'Mean Travel Time to Work (min)']])  # column order used by model 2
y_test = np.array(data_test.loc[:, 'Average Vehicles per Household'])  # response y

# Predict the response variable (using each model)
y_predict1 = model1.predict(sm.add_constant(X_test))
y_predict2 = model2.predict(sm.add_constant(X_test2))
y_predict3 = model3.predict(X_test)
y_predict4 = model4.predict(X_test)

# Scatter plots of predicted vs. actual response for each model
fig, ax = plt.subplots(2, 2)
fig.suptitle('Predicted vs. Actual, Average Vehicles per Household')
fig.set_figheight(10)
fig.set_figwidth(10)
ax[0, 0].scatter(y_test, y_predict1)
ax[0, 0].plot([0.75, 2.5], [0.75, 2.5], 'r-')
ax[0, 0].set_title('OLS Model')
ax[0, 1].scatter(y_test, y_predict2)
ax[0, 1].plot([0.75, 2.5], [0.75, 2.5], 'r-')
ax[0, 1].set_title('Forward Stepwise Model')
ax[1, 0].scatter(y_test, y_predict3)
ax[1, 0].plot([0.75, 2.5], [0.75, 2.5], 'r-')
ax[1, 0].set_title('Ridge Model')
ax[1, 1].scatter(y_test, y_predict4)
ax[1, 1].plot([0.75, 2.5], [0.75, 2.5], 'r-')
ax[1, 1].set_title('Lasso Model')
for axs in ax.flat:
    axs.set(xlabel='Actual', ylabel='Predicted', xlim=(0.75, 2.5), ylim=(0.75, 2.5))
# Hide x labels and tick labels for top plots and y ticks for right plots.
for axs in ax.flat:
    axs.label_outer()
In [24]:
# Calculate test MSE
testStat = pd.DataFrame(index=['OLS', 'Forward Stepwise', 'Ridge', 'Lasso'])
n = len(y_test)
testStat.at['OLS', 'MSE'] = (1 / n) * np.sum((y_test - y_predict1) ** 2)
testStat.at['Forward Stepwise', 'MSE'] = (1 / n) * np.sum((y_test - y_predict2) ** 2)
testStat.at['Ridge', 'MSE'] = (1 / n) * np.sum((y_test - y_predict3) ** 2)
testStat.at['Lasso', 'MSE'] = (1 / n) * np.sum((y_test - y_predict4) ** 2)
print('Test MSE:')
print(testStat)  # MSE closer to zero is better
testStat.plot(kind='barh');
plt.title('Test MSE')
b. Provide confidence intervals of the predicted values using $M_B'$, and check if the actual average
numbers of vehicles per household belong to the intervals.
The 95% confidence intervals for the predicted mean response of the test data are shown below, and the actual responses falling inside and outside the intervals are tabulated. I was surprised that the actual responses for 26 out of the 30 observations fell outside the 95% confidence interval.
In [25]:
# Get the 95% confidence intervals of the mean response
predictions = model1.get_prediction(sm.add_constant(X_test))
frame = predictions.summary_frame()  # alpha = 0.05 default (95% confidence intervals)

CI_analysis = pd.DataFrame(index=data_test['Town'])
CI_analysis['Lower CI'] = np.array(frame.mean_ci_lower)
CI_analysis['Upper CI'] = np.array(frame.mean_ci_upper)
CI_analysis['Actual'] = np.array(y_test)

# Indices of actual responses outside and inside the 95% confidence interval
logic_index = np.where((y_test > np.array(frame.mean_ci_upper)) | (y_test < np.array(frame.mean_ci_lower)))
logic_index_inside = np.where(~((y_test > np.array(frame.mean_ci_upper)) | (y_test < np.array(frame.mean_ci_lower))))

print('Number of predictions outside 95% confidence interval: ' + str(np.shape(logic_index)[1]) + ' out of ' + str(len(y_test)))
print('\n Response outside:')
print(CI_analysis.iloc[logic_index])
print('\n Response inside:')
print(CI_analysis.iloc[logic_index_inside])
Number of predictions outside 95% confidence interval: 26 out of 30
Response outside:
Lower CI Upper CI Actual
Town
Amesbury 1.814286 1.854146 1.75
Athol 1.714972 1.779373 1.71
Brookfield 1.949572 1.980366 1.89
Brookline 1.467665 1.542770 1.15
Chelsea 1.269407 1.436420 0.97
Chicopee 1.386749 1.450724 1.51
Clinton 1.599495 1.637100 1.70
Dracut 1.890487 1.926226 1.98
Franklin 1.997743 2.042569 1.92
Goshen 2.105999 2.150089 2.09
Hancock 2.084697 2.151735 2.04
Hatfield 1.750153 1.797531 1.86
Leicester 1.958413 1.985562 1.93
Littleton 2.027160 2.056983 1.94
Mendon 2.240405 2.286718 2.20
Monroe 1.739902 1.872017 1.45
Natick 1.738544 1.787616 1.72
New Ashford 2.107606 2.184581 2.06
Norton 1.961977 1.997849 1.91
Orleans 1.577839 1.640571 1.81
Randolph 1.412498 1.541990 1.67
Rockland 1.845494 1.881047 1.74
Salisbury 1.812089 1.846105 1.74
Sandwich 1.955333 1.995692 2.04
Scituate 1.938802 1.993888 1.88
Sudbury 2.221161 2.281459 2.11
Response inside:
Lower CI Upper CI Actual
Town
Charlton 2.098619 2.137195 2.10
Norwood 1.659567 1.705287 1.66
Pittsfield 1.417806 1.491689 1.45
Tyringham 2.055983 2.138958 2.08
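This result is less surprising than it first appears: the intervals above are for the mean response, so they only reflect uncertainty in the fitted regression surface, not the town-to-town variability of individual observations. A quick check using the observation-level (prediction) intervals from the same summary_frame, which also include the residual variance and are therefore much wider:

# Sketch: how many actual responses fall outside the 95% prediction interval
outside_pi = np.where((y_test > np.array(frame.obs_ci_upper)) | (y_test < np.array(frame.obs_ci_lower)))
print('Number of actual responses outside 95% prediction interval: '
      + str(np.shape(outside_pi)[1]) + ' out of ' + str(len(y_test)))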
c. Plot and analyze the residuals for all 351 towns, indicating if the assumptions of normality
and homoscedasticity hold. If not, discuss how these issues may be addressed.
The code below predicts the response for all 351 towns using $M_B'$ (Model 1: OLS). The residuals are calculated and plotted to analyze normality and homoscedasticity.
The first plot below shows the residuals vs. the fitted (predicted) values as a scatter plot. The fitted red line is almost horizontal, indicating that the linear fit is a good approximation. There also appears to be roughly constant variance (scatter) in the error term, which indicates homoscedasticity.
The second and third plots below show the residuals as a histogram and a normal probability plot. These two plots show that the residuals are approximately normally distributed, supporting the normality assumption of the model. The R^2 for the probability plot is 0.94, which indicates a very good fit to the normal line.
In [26]:
# Data for all 351 towns
data_all = pd.concat([data_train, data_test], ignore_index=True)
data_all = data_all.sort_values('Town')

# Normalize/transform data
# Race: percentage white
total = data_all.loc[:, 'Total']  # race data taken for entire population
pct_white = (data_all.loc[:, 'White'] / total) * 100
data_all["percent white"] = pct_white
# Income: percent of households making over $50,000 a year
tot_house = data_all.loc[:, 'Total households']  # income data taken for total households reported
income_50 = data_all.loc[:, '50,000 to 59,999':'200,000 and up']
pct_50 = (income_50.sum(axis=1) / tot_house) * 100
data_all["percent over 50,000"] = pct_50
# Total population: log10 transform to make linear
total_norm = np.log10(total)
data_all["log(total)"] = total_norm

X_all = np.array(data_all.loc[:, ['Mean Travel Time to Work (min)', 'Average household size', 'percent white',
                                  'log(total)', 'percent over 50,000']])
y_all = np.array(data_all.loc[:, 'Average Vehicles per Household'])

# Model predicted values
y_hat_all = model1.predict(sm.add_constant(X_all))
# Model residuals
residuals_all = y_all - y_hat_all

# Residuals vs. fitted values
plot1 = plt.figure()
plot1.axes[0] = sns.residplot(y_hat_all, y_all, lowess=True, scatter_kws={'alpha': 0.5},
                              line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8});
plot1.axes[0].set_title('Residuals vs Fitted')
plot1.axes[0].set_xlabel('Fitted values')
plot1.axes[0].set_ylabel('Residuals');
# Linearity: there won't be any apparent patterns in the scatterplot and the red line should be roughly horizontal.
# This looks good: the linear fit is a good approximation.
# Also appears to have homoscedasticity (constant variance of the error term).

# Histogram of residuals
plot2 = plt.figure()
plot2 = plt.hist(residuals_all)
plt.title('Histogram of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Count')

# Normal probability plot of residuals
import scipy as sp       # scipy was presumably imported earlier in the notebook; added here for completeness
import scipy.stats       # makes sp.stats.probplot available
fig, ax = plt.subplots()
_, (__, ___, r) = sp.stats.probplot(residuals_all, plot=ax, fit=True)
print('Probability Plot R^2: ' + str(r ** 2))
# A normal probability plot lets us assess visually how the errors depart from normality:
# the good fit (r^2 = 0.94) indicates that normality is a reasonable approximation.
Probability Plot R^2: 0.9417173315254236
/Users/katiekoehler/miniconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
warnings.warn(
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})

# Make data.
y_1 = 8
y_2 = 5
la = 5
beta_1 = np.arange(-5, 7, 0.1)
beta_2 = np.arange(-5, 7, 0.1)
beta_1, beta_2 = np.meshgrid(beta_1, beta_2)
rss_ridge = (y_1 - beta_1)**2 + (y_2 - beta_2)**2 + la * (beta_1**2 + beta_2**2)

# Plot the surface.
surf = ax.plot_surface(beta_1, beta_2, rss_ridge, cmap=cm.coolwarm, linewidth=0, antialiased=False)

# Customize the z axis.
ax.set_zlabel(r"$RSS^{R}(\beta)$")
# ax.set_zlim(-1.01, 1.01)
# ax.zaxis.set_major_locator(LinearLocator(10))
# ax.zaxis.set_major_formatter('{x:3.0f}')  # format appearance of z-axis ticks

# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=5, orientation="vertical", pad=0.2)
plt.title('Ridge RSS function')
plt.xlabel(r"$\beta_1$")
plt.ylabel(r"$\beta_2$")
plt.show()
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})

# Make data.
y_1 = 8
y_2 = 5
la = 5
beta_1 = np.arange(-4, 15, 0.1)
beta_2 = np.arange(-4, 15, 0.1)
beta_1, beta_2 = np.meshgrid(beta_1, beta_2)
rss_ridge = (y_1 - beta_1)**2 + (y_2 - beta_2)**2 + la * (np.abs(beta_1) + np.abs(beta_2))

# Plot the surface.
surf = ax.plot_surface(beta_1, beta_2, rss_ridge, cmap=cm.coolwarm, linewidth=0, antialiased=False)

# Customize the z axis.
ax.set_zlabel(r"$RSS^{L}(\beta)$")
# ax.set_zlim(0, 450)
# ax.zaxis.set_major_locator(LinearLocator(10))
# ax.zaxis.set_major_formatter('{x:3.0f}')  # format appearance of z-axis ticks

# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=5, orientation="vertical", pad=0.2)
plt.title('Lasso RSS function; lambda = 5')
plt.xlabel(r"$\beta_1$")
plt.ylabel(r"$\beta_2$")
plt.show()
Notice the subtle but visible differences between the ridge and lasso (sharp corners) objective functions in the plot. Let's generate a plot with a more extreme regularization (lambda):
In [7]:
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})

# Make data.
y_1 = 8
y_2 = 5
la = 10
beta_1 = np.arange(-4, 15, 0.1)
beta_2 = np.arange(-4, 15, 0.1)
beta_1, beta_2 = np.meshgrid(beta_1, beta_2)
rss_ridge = (y_1 - beta_1)**2 + (y_2 - beta_2)**2 + la * (np.abs(beta_1) + np.abs(beta_2))

# Plot the surface.
surf = ax.plot_surface(beta_1, beta_2, rss_ridge, cmap=cm.coolwarm, linewidth=0, antialiased=False)

# Customize the z axis.
ax.set_zlabel(r"$RSS^{L}(\beta)$")
# ax.set_zlim(0, 450)
# ax.zaxis.set_major_locator(LinearLocator(10))
# ax.zaxis.set_major_formatter('{x:3.0f}')  # format appearance of z-axis ticks

# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=5, orientation="vertical", pad=0.2)
plt.title('Lasso RSS function; lambda = 10')
plt.xlabel(r"$\beta_1$")
plt.ylabel(r"$\beta_2$")
plt.show()
We see that with greater regularization, the less relevant parameter $\beta_2$ is now 0. For greater clarity, we may observe the wireframe plot:
In [8]:
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})

# Make data.
y_1 = 8
y_2 = 5
la = 10
beta_1 = np.arange(-4, 15, 0.1)
beta_2 = np.arange(-4, 15, 0.1)
beta_1, beta_2 = np.meshgrid(beta_1, beta_2)
rss_ridge = (y_1 - beta_1)**2 + (y_2 - beta_2)**2 + la * (np.abs(beta_1) + np.abs(beta_2))

# Plot the wireframe.
surf = ax.plot_wireframe(beta_1, beta_2, rss_ridge, rstride=10, cstride=10)  # cmap=cm.coolwarm, linewidth=0, antialiased=False

# Customize the z axis.
ax.set_zlabel(r"$RSS^{L}(\beta)$")
plt.title('Lasso RSS function; lambda = 10')
plt.xlabel(r"$\beta_1$")
plt.ylabel(r"$\beta_2$")
plt.show()
Data: Monthly average temperatures for Boston, MA, from 1978 to 2019 (original dataset from NOAA was daily average temperature)
Lagged variables in Time Series Regression
Lag variables enable us to include feedback from previous time periods as variables in a regression model. For instance, in predicting an annual response, we might want to include observations not just from the year in question, but also from previous years. To do this, we can augment a dataset with time lags and explore if any correlations exist between the lag variables and the response in question.
Goal
Predict the monthly average temperature in Boston (response variable: TAVG).
# Plot the data set (df: the monthly average temperature DataFrame loaded earlier in the notebook)
fig = plt.figure(figsize=(16, 4))
fig.suptitle('Monthly Average temperatures in Boston, MA from 1978 to 2019', fontsize=20)
actual, = plt.plot(df.index, df['TAVG'], 'go-', label='Monthly Average Temperature')
plt.xlabel('Date', fontsize=18)
plt.ylabel('Temperature', fontsize=18)
plt.legend(handles=[actual], fontsize=18)
# plt.savefig('images/avetemp.png', bbox_inches='tight')
plt.show()
Now, we calculate the lag variables:
In [11]:
# Make a copy
df_lagged = df.copy()
# Add time-lagged columns to the data set
for i in range(1, 13, 1):
    df_lagged['TAVG_LAG_' + str(i)] = df_lagged['TAVG'].shift(i)
In [12]:
# Drop the NaN rows (the first 12 rows have missing lag values)
for i in range(0, 12, 1):
    df_lagged = df_lagged.drop(df_lagged.index[0])
print(df_lagged.head())
fig, axs = plt.subplots(3, 4, figsize=(14, 8))
# Scatter plot of TAVG against each of the 12 lag variables
for i, ax in enumerate(axs.flat, start=1):
    ax.scatter(df_lagged['TAVG_LAG_' + str(i)], df_lagged.TAVG, alpha=.4)
    ax.set_title('TAVG vs TAVG_LAG_' + str(i))
    ax.set(xlabel='Monthly ave. temp. (F)', ylabel='Monthly ave. temp. (F)')
# Hide x labels and tick labels for top plots and y ticks for right plots.
for ax in axs.flat:
    ax.label_outer()
# fig.savefig('images/lagscatter.png', bbox_inches='tight')
# Carve out the test and the training data sets
split_index = round(len(df_lagged) * 0.8)
split_date = df_lagged.index[split_index]
df_train = df_lagged.loc[df_lagged.index <= split_date].copy()
df_test = df_lagged.loc[df_lagged.index > split_date].copy()
where $t$ is the time in months and $y$ is the monthly average temperature. According to this model, the monthly average temperature is expected to increase by 0.32 F for each 1 F increase in the temperature of the previous month, decrease by 0.22 F for a unit increase in the temperature 6 months earlier, and so on. Given the annual cycle of temperatures, the signs of the coefficients are reasonable.
We compute the RMSE on the test set:
In [32]:
y_pred = fss_lm.predict(X_test_fss)
In [33]:
np.sqrt(np.mean((y_test.to_list() - y_pred)**2))
Out[33]:
3.3322665747515114
This is about 3.3 F; that is, the model's predictions are typically off by roughly 3 degrees Fahrenheit, which is quite good.
In [34]:
# Plot the actual versus predicted values of TAVG on the test data set
fig = plt.figure(figsize=(12, 6))
fig.suptitle('Model performance (forward subset selection)', fontsize=20)
predicted, = plt.plot(X_test.index, y_pred, 'go-', label='Predicted', alpha=.7)
actual, = plt.plot(X_test.index, y_test, 'ro-', label='Observed', alpha=.7)
plt.legend(handles=[predicted, actual], fontsize=18)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Monthly average temperature (F)', fontsize=18)
plt.show()
In [45]:
# Predictions from the forward-stepwise selected model on the training and test sets
y_pred_train_fss = fss_lm.predict(X_train_fss)
y_pred_fss = fss_lm.predict(X_test_fss)
mae_train = mean_absolute_error(y_train, y_pred_train_fss)
string_score = f'MAE on training set: {mae_train:.2f} deg. F'
mae_test = mean_absolute_error(y_test, y_pred_fss)
string_score += f'\nMAE on test set: {mae_test:.2f} deg. F'
fig, ax = plt.subplots(figsize=(5, 5))
plt.scatter(y_test, y_pred_fss)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c="red")
plt.text(25, 70, string_score)
plt.title('Forward-stepwise selected model performance')
plt.ylabel('Predicted')
plt.xlabel('Observed')
plt.xlim([20, 80])
_ = plt.ylim([20, 80])
ridge_coefs.plot(kind='barh')
plt.axvline(x=0, color='.5')
plt.title('Ridge model coefficients (scaled)')
Out[43]:
Text(0.5, 1.0, 'Ridge model coefficients (scaled)')
In [44]:
y_pred_train_ridge = ridge_model.predict(X_train_scaled)
mae_train = mean_absolute_error(y_train, y_pred_train_ridge)
string_score = f'MAE on training set: {mae_train:.2f} deg. F'
mae_test = mean_absolute_error(y_test, y_pred_ridge)
string_score += f'\nMAE on test set: {mae_test:.2f} deg. F'
fig, ax = plt.subplots(figsize=(5, 5))
plt.scatter(y_test, y_pred_ridge)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c="red")
plt.text(25, 70, string_score)
plt.title('Ridge model performance')
plt.ylabel('Predicted')
plt.xlabel('Observed')
plt.xlim([20, 80])
_ = plt.ylim([20, 80])
This is a reasonably good fit to the data. We can write the equation (based on the scaled data) as:
If we would like to interpret the coefficients directly (e.g. how much the temperature would change for a unit change in any of the monthly lags), we can convert them by dividing each coefficient by the standard deviation of the corresponding lag variable.
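A sketch of that conversion, assuming (as in the cells above) that ridge_model was fit on the scaled training data; the names X_train (the unscaled lag variables) and features (the list of lag column names) are assumptions here:

# Convert scaled coefficients back to original units (deg. F of TAVG per deg. F of each lag)
unscaled = pd.Series(ridge_model.coef_ / X_train[features].std().values, index=features)
print(unscaled)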
lasso_coefs = pd.DataFrame(lasso_model.coef_, columns=['Coefficients'], index=features)
lasso_coefs = lasso_coefs.sort_values(by='Coefficients')
lasso_coefs.plot(kind='barh')
plt.axvline(x=0, color='.5')
_ = plt.title('Lasso model coefficients (scaled)')
In [52]:
y_pred_train_lasso = lasso_model.predict(X_train_scaled)
y_pred_lasso = lasso_model.predict(X_test_scaled)
mae_train = mean_absolute_error(y_train, y_pred_train_lasso)
string_score = f'MAE on training set: {mae_train:.2f} deg. F'
mae_test = mean_absolute_error(y_test, y_pred_lasso)
string_score += f'\nMAE on test set: {mae_test:.2f} deg. F'
fig, ax = plt.subplots(figsize=(5, 5))
plt.scatter(y_test, y_pred_lasso)  # plot the lasso predictions
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c="red")
plt.text(25, 70, string_score)
plt.title('Lasso model performance')
plt.ylabel('Predicted')
plt.xlabel('Observed')
plt.xlim([20, 80])
_ = plt.ylim([20, 80])
We see that in this case, the model estimated based on the variables determined by forward stepwise selection (FSS) performed the best. The Lasso and Ridge models could be parameterized for potentially better performance by using other criteria (e.g. AIC/BIC) to select the optimal regularization (alpha).
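One such alternative is choosing the lasso regularization by an information criterion with LassoLarsIC (already imported above) instead of cross-validation. A sketch, assuming the scaled training and test matrices used for the ridge/lasso models above:

# Select the lasso alpha by BIC (a sketch)
lasso_bic = LassoLarsIC(criterion='bic').fit(X_train_scaled, y_train)
print('alpha selected by BIC:', lasso_bic.alpha_)
print('Test MAE: %.2f deg. F' % mean_absolute_error(y_test, lasso_bic.predict(X_test_scaled)))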
Play around with the test_size parameter in train_test_split. Given the small number of positive observations, a much smaller test split (e.g. 1%) will result in a much better fit to the data (albeit leaning toward an overfit).
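For example (a sketch; X and y here are placeholders for the Default predictors and labels used in this section):

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01,
                                                    stratify=y, random_state=0)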
At the decision boundary, the probability of default equals 0.5, so the log-odds $\ln(\frac{p}{1-p})$ equal zero. Setting the linear predictor of the logistic model equal to zero therefore defines the boundary line:
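$$\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\mathrm{balance} + \beta_2\,\mathrm{income} = 0 \quad\Longrightarrow\quad \mathrm{income} = -\frac{\beta_0 + \beta_1\,\mathrm{balance}}{\beta_2},$$

which is the line plotted in the next cell.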
sns.scatterplot(x=ddf.balance, y=ddf.income, hue=ddf.default, alpha=.9, marker='3')
x = np.linspace(0, np.max(ddf.balance), 100)
y = -(1 / logistic.coef_[0, 1]) * (logistic.intercept_[0] + logistic.coef_[0, 0] * x)
plt.plot(x, y, color='r', ls='--')
plt.ylim(0, 75000)
plt.title('Logistic regression boundary at p = 0.5')
Out[70]:
Text(0.5, 1.0, 'Logistic regression boundary at p = 0.5')
The decision boundary indicates that this is a poorly-fitting model. We can clearly see that no "Yes" lies to the "right" side of the boundary (indicating zero recall).
y_pred = logistic.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logistic.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
Accuracy of logistic regression classifier on test set: 0.97
precision recall f1-score support
No 0.97 1.00 0.98 967
Yes 0.00 0.00 0.00 33
accuracy 0.97 1000
macro avg 0.48 0.50 0.49 1000
weighted avg 0.94 0.97 0.95 1000
/usr/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
The error rate is 0.03, while the precision and recall are both 0 with respect to the positive class: the model does not correctly predict a single positive observation in the test set (indeed, as the warning above indicates, it predicts no positives at all).
To increase the classifier's sensitivity, I would lower the probability threshold for assigning an observation to the positive class. We show the new decision boundary below. The model is still a poor fit, so the improvement is marginal (the sensitivity merely increases from 0 to 0.03).
In [86]:
sns.scatterplot(x=ddf.balance, y=ddf.income, hue=ddf.default, alpha=.7)
x = np.linspace(0, np.max(ddf.balance), 100)
y = (1 / logistic.coef_[0, 1]) * (np.log(0.3 / 0.7) - logistic.intercept_[0] - logistic.coef_[0, 0] * x)
plt.plot(x, y, color='r', ls='--')
plt.title('Logistic regression boundary at p = 0.3')
Out[86]:
Text(0.5, 1.0, 'Logistic regression boundary at p = 0.3')
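A sketch of actually classifying at the lower cutoff (rather than only drawing the boundary), assuming the fitted logistic model and the test split used above:

# Threshold the predicted probabilities at 0.3; the 'Yes' column index is read from classes_
proba_yes = logistic.predict_proba(X_test)[:, list(logistic.classes_).index('Yes')]
y_pred_03 = np.where(proba_yes >= 0.3, 'Yes', 'No')
print(classification_report(y_test, y_pred_03))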
/usr/lib/python3.10/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_roc_curve is deprecated; Function :func:`plot_roc_curve` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: :meth:`sklearn.metric.RocCurveDisplay.from_predictions` or :meth:`sklearn.metric.RocCurveDisplay.from_estimator`.
warnings.warn(msg, category=FutureWarning)
Out[94]:
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7fabe3344820>
If default were independent of both balance and income, we would expect the ROC curve to be a straight 45-degree line, as any prediction would be a random guess between 'Yes' and 'No'. This line is also known as the "line of no discrimination" (with an AUC of 0.50). A decent classifier should have an AUC well above this threshold.
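Note that plot_roc_curve (imported at the top) is deprecated in recent scikit-learn releases; an equivalent sketch using the replacement RocCurveDisplay API, assuming the fitted logistic model and test split:

from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(logistic, X_test, y_test)
plt.plot([0, 1], [0, 1], ls='--', color='grey')  # the 45-degree line of no discrimination
plt.title('ROC curve, logistic regression')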
lda = LinearDiscriminantAnalysis()
lda.fit(X_train.iloc[:, :], y_train)  # using only balance and income
Out[99]:
LinearDiscriminantAnalysis()
In [102]:
y_pred_lda = lda.predict(X_test.iloc[:, :])
print('Accuracy of LDA classifier on test set: {:.2f}'.format(lda.score(X_test.iloc[:, :], y_test)))
print(classification_report(y_test, y_pred_lda))
con_mat_lda = confusion_matrix(y_test, y_pred_lda)
plot_confusion_matrix(con_mat_lda, ['No', 'Yes'])  # plotting helper presumably defined earlier in the notebook  #, normalize=False
Accuracy of LDA classifier on test set: 0.98
precision recall f1-score support
No 0.98 1.00 0.99 967
Yes 0.80 0.36 0.50 33
accuracy 0.98 1000
macro avg 0.89 0.68 0.74 1000
weighted avg 0.97 0.98 0.97 1000
Out[102]:
Confusion matrix (rows: actual, columns: predicted):
             Predicted No   Predicted Yes
Actual No         964              3
Actual Yes         21             12
We see that the basic LDA classifier has a precision of 0.80 for the positive class (a significant improvement over the logistic model) but still a relatively low recall of 0.36. We plot the boundary (a bit involved):
In [154]:
N = 300
X = np.linspace(0, 2500, N)
Y = np.linspace(0, 70000, N)
X, Y = np.meshgrid(X, Y)
# Compute the predicted class for each value on the grid
zz = np.array([lda.predict(pd.DataFrame(np.array([xx, yy]).reshape(1, -1), columns=['balance', 'income']))
               for xx, yy in zip(np.ravel(X), np.ravel(Y))])
# zz = np.array([lda.predict(np.array([xx, yy]).reshape(1, -1)) for xx, yy in zip(np.ravel(X), np.ravel(Y))])
In [155]:
zz2 = np.unique(zz, return_inverse=True)[1]  # convert zz to numeric labels
In [156]:
g = sns.FacetGrid(ddf, hue="default", height=5)  # , palette='colorblind')
g.map(plt.scatter, "balance", "income", marker='3', alpha=.7)
g.add_legend()
my_ax = g.ax
Z = zz2.reshape(X.shape)
# Plot the boundary contour
my_ax.contour(X, Y, Z, alpha=1, colors='black', linestyles='dashed')
my_ax.set_xlabel('Balance')
my_ax.set_ylabel('Income')
my_ax.set_title('LDA and boundary')
plt.tight_layout()
plt.savefig('lda.png', dpi=200)
The boundary indicates that the LDA provides a better-fitting model.
To implement CV efficiently across these models, we use the nifty cross_validate function:
In [139]:
knn = KNeighborsClassifier(n_neighbors=5)
yes_recall = make_scorer(recall_score, pos_label='Yes')  # make_scorer lets us change the default positive label (which is 1)
knn_scores = cross_validate(knn, X_train.iloc[:, :], y_train, cv=5, scoring=yes_recall)
Based on the cross-validation results, QDA results in the highest sensitivity (recall). Note that we have done a three-way partitioning (train/validate/test). We have reserved an untouched test set for real-world performance of the selected model.