Linear Regression
This notebook covers linear regression fundamentals and implementation.
Notebook Contents
- Simple Linear Regression: Basic concepts and implementation
- Multiple Linear Regression: Working with multiple features
- Assumptions: Understanding linear regression assumptions
- Model Evaluation: R-squared, residual analysis
- Implementation: From scratch and using scikit-learn
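As a preview of the from-scratch implementation, simple linear regression with one feature has a closed-form least-squares solution. A minimal sketch on synthetic data (the variable names and synthetic coefficients here are illustrative, not from the notebook):

```python
import numpy as np

# Synthetic data: y = 3x + 2 plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=100)

# Closed-form OLS for a single feature:
#   slope     = cov(x, y) / var(x)
#   intercept = mean(y) - slope * mean(x)
slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
intercept = y.mean() - slope * x.mean()

print(slope, intercept)  # should recover values close to 3 and 2
```

The multiple-feature case generalizes this to the normal equations, which is what scikit-learn's `LinearRegression` solves internally.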
Notebook Preview
In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import sklearn as sk
import sklearn.linear_model as slm
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # scikit-learn >= 0.18
# from sklearn.cross_validation import train_test_split  # scikit-learn < 0.18 (module removed in 0.20)
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
#import os
#print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.
In [13]:
data = pd.read_csv('housing.csv')
In [14]:
data.head()
Out[14]:
|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
In [15]:
df = data.drop('ocean_proximity', axis = 1)
In [16]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB
In [17]:
df = df.dropna(axis=0)  # drop the rows with missing observations
df.info()  # check that no nulls remain
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20433 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20433 non-null  float64
 1   latitude            20433 non-null  float64
 2   housing_median_age  20433 non-null  float64
 3   total_rooms         20433 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20433 non-null  float64
 6   households          20433 non-null  float64
 7   median_income       20433 non-null  float64
 8   median_house_value  20433 non-null  float64
dtypes: float64(9)
memory usage: 1.6 MB
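Note that `dropna` discards the 207 rows where `total_bedrooms` is missing. An alternative the notebook does not use is imputation, e.g. filling missing values with the column median. A minimal sketch on toy data (the values below are made up for illustration):

```python
import pandas as pd

# Toy frame with one missing total_bedrooms value
df2 = pd.DataFrame({
    "total_bedrooms": [129.0, None, 190.0, 235.0],
    "households": [126.0, 1138.0, 177.0, 219.0],
})

# Fill missing values with the column median instead of dropping the row
median = df2["total_bedrooms"].median()
df2["total_bedrooms"] = df2["total_bedrooms"].fillna(median)

print(df2["total_bedrooms"].isna().sum())  # 0
```

Median imputation keeps all rows at the cost of slightly distorting the distribution; for 207 out of 20,640 rows either choice has a small effect.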
In [18]:
X = df.drop('median_house_value', axis = 1)
y = df['median_house_value']
In [19]:
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20433 entries, 0 to 20639
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20433 non-null  float64
 1   latitude            20433 non-null  float64
 2   housing_median_age  20433 non-null  float64
 3   total_rooms         20433 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20433 non-null  float64
 6   households          20433 non-null  float64
 7   median_income       20433 non-null  float64
dtypes: float64(8)
memory usage: 1.4 MB
In [20]:
print(y.shape)
(20433,)
In [21]:
plt.scatter(X['total_bedrooms'], y)
plt.xlabel('total_bedrooms')
plt.ylabel('House Price')
Out[21]:
Text(0, 0.5, 'House Price')
In [14]:
X.head()
In [22]:
color = dict(boxes='DarkGreen', whiskers='DarkOrange', medians='DarkBlue', caps='Gray')
X['total_bedrooms'].plot.box(color=color)
Out[22]:
<AxesSubplot:>
In [23]:
LR = slm.LinearRegression()
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.25)
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)
(15324, 8)
(15324,)
(5109, 8)
(5109,)
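Because `train_test_split` shuffles the rows randomly, the exact split (and therefore the scores reported below) varies from run to run. Passing `random_state` makes the split reproducible, as this sketch on toy arrays shows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same seed -> identical split every run
Xa, _, ya, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
Xb, _, yb, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

print(np.array_equal(Xa, Xb))  # True
```

When comparing models, fixing `random_state` ensures every model is evaluated on the same held-out rows.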
In [24]:
LR.fit(X_train, Y_train)
Out[24]:
LinearRegression()
In [25]:
predict = LR.predict(X_test)
In [37]:
LR.coef_  # cf. ESL 3.2; ISLR 3; PML 8.1
Out[37]:
array([-4.32129322e+04, -4.28463981e+04, 1.16190991e+03, -7.58836039e+00,
1.07263411e+02, -3.70135744e+01, 4.75593132e+01, 3.99922627e+04])
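The raw coefficient array is hard to read on its own; pairing each coefficient with its feature name makes the output interpretable. A sketch using the coefficient values shown above (rounded; your fitted values will differ because the split is random):

```python
import pandas as pd

feature_names = ["longitude", "latitude", "housing_median_age", "total_rooms",
                 "total_bedrooms", "population", "households", "median_income"]
coefs = [-4.32e4, -4.28e4, 1.16e3, -7.59, 1.07e2, -3.70e1, 4.76e1, 4.00e4]

# Label each coefficient and sort to see the strongest effects
coef_table = pd.Series(coefs, index=feature_names).sort_values()
print(coef_table)
```

Keep in mind the features are on very different scales (degrees, counts, income in tens of thousands), so coefficient magnitudes are not directly comparable unless the features are standardized first.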
In [26]:
print('Predicted Value :',predict[0])
print('Actual Value :',Y_test.values[0])
Predicted Value : 413499.99734686874
Actual Value : 430200.0
In [32]:
LR.score(X_test, Y_test) #R^2
Out[32]:
0.6351168854932498
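`LR.score` returns the coefficient of determination, R² = 1 − SS_res / SS_tot: the fraction of the variance in `y` that the model explains. It can be computed by hand, as this sketch on toy values shows:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print(r2)  # 0.9625
```

An R² of about 0.64, as above, means roughly 64% of the variance in median house value is explained by the eight features; a constant predictor (always the mean) would score 0.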
In [30]:
gr = pd.DataFrame({'Predicted': predict, 'Actual': Y_test}).reset_index(drop=True)
plt.scatter(gr.Predicted, gr.Actual)
#plt.legend(['Actual','Predicted'])
#gr.plot.bar();
Out[30]:
<matplotlib.collections.PathCollection at 0x7efe8c85d0a0>
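On a predicted-vs-actual scatter like the one above, adding a 45° reference line (y = x) makes over- and under-prediction easy to see: points above the line are under-predicted, points below are over-predicted. A sketch on made-up values (uses the non-interactive Agg backend so it runs in a script):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; writes to file instead of a window
import matplotlib.pyplot as plt
import numpy as np

pred = np.array([1.0, 2.0, 3.0, 4.0])
actual = np.array([1.1, 1.8, 3.3, 3.9])

fig, ax = plt.subplots()
ax.scatter(pred, actual)

# y = x reference line spanning the data range
lims = [min(pred.min(), actual.min()), max(pred.max(), actual.max())]
ax.plot(lims, lims, linestyle="--")
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
fig.savefig("pred_vs_actual.png")
```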
In [31]:
# For R-style diagnostic plots, see:
# https://robert-alvarez.github.io/2018-06-04-diagnostic_plots/