Linear Regression
This project explores how to predict home sale prices from the properties of each house. I will determine which features of a house affect the final sale price and how effectively we can predict that price.
Learning Objectives
By the end of this notebook, the reader should be able to perform linear regression techniques in Python. This includes:
- Importing and formatting data
- Training the LinearRegression model from the sklearn.linear_model library
- Working with qualitative and quantitative data, and effectively handling categorical data
- Analyzing and determining proper handling of redundant and/or inconsistent data features
- Creating a correlation heatmap with the seaborn and matplotlib libraries
Read Data
The pandas library is an open-source data analysis tool for Python that provides DataFrame objects and clean file parsing. Here we split the Ames housing data into training and testing data. The dataset contains 82 columns, known as the features of the data. Here are a few:
- Lot Area: Lot size in square feet.
- Overall Qual: Rates the overall material and finish of the house.
- Overall Cond: Rates the overall condition of the house.
- Year Built: Original construction date.
- Low Qual Fin SF: Low quality finished square feet (all floors).
- Full Bath: Full bathrooms above grade.
- Fireplaces: Number of fireplaces.
and so on.
import pandas as pd

# Read the tab-delimited Ames housing file into a DataFrame
data = pd.read_csv("Data/AmesHousing.txt", delimiter='\t')

# Use the first 1460 rows for training and the rest for testing
train = data[0:1460]
test = data[1460:]
target = 'SalePrice'
print(train.info())
The train data will be used to fit the linear regression model, while the test data will be used to evaluate how accurate that model is.
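As a quick sanity check on the split (this print is not part of the original output), we can look at the shape of each subset:

print(train.shape)
print(test.shape)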
Use linear regression to model the data
In this case, we will use simple linear regression to evaluate the relationship between two variables: living area ("Gr Liv Area") and price ("SalePrice"). LinearRegression.fit() is a convenient method that fits a linear function to the input data, so you don't need to worry about the calculation yourself. We can also use mean_squared_error to measure how far the model's predictions fall from the actual sale prices.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit a simple linear regression of SalePrice on Gr Liv Area
lr = LinearRegression()
lr.fit(train[['Gr Liv Area']], train['SalePrice'])

# Predict on both subsets and compute the root mean squared error (RMSE)
train_predictions = lr.predict(train[['Gr Liv Area']])
test_predictions = lr.predict(test[['Gr Liv Area']])
train_mse = mean_squared_error(train_predictions, train['SalePrice'])
test_mse = mean_squared_error(test_predictions, test['SalePrice'])
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)
print(lr.coef_)
print(train_rmse)
print(test_rmse)
Here we use the lr.coef_ attribute (an attribute, not a method) to get the coefficient of the linear function, which is about 116.87. The root mean squared error (RMSE) is about 56034 for the train data and about 57088 for the test data.
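Written out, with the intercept left symbolic since only the coefficient is printed above, the fitted model has the form: $$ Y = a_0 + 116.87 \cdot (\text{Gr Liv Area}) $$ where $a_0$ is the intercept stored in the lr.intercept_ attribute.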
Now, let's make the result more visible by plotting. The following figure shows the regression line fit to the data in train.
import matplotlib.pyplot as plt

# Scatter the training data and overlay the fitted regression line
plt.scatter(train[['Gr Liv Area']], train[['SalePrice']], color='black')
plt.xlabel('Gr Liv Area in Train', fontsize=18)
plt.ylabel('SalePrice', fontsize=18)
plt.plot(train[['Gr Liv Area']], train_predictions, color='blue', linewidth=3)
plt.show()
Now let's apply the model to the test data set to see how precisely it predicts the values.
import matplotlib.pyplot as plt

# Scatter the test data and overlay the same fitted line
plt.scatter(test[['Gr Liv Area']], test[['SalePrice']], color='black')
plt.xlabel('Gr Liv Area in Test', fontsize=18)
plt.ylabel('SalePrice', fontsize=18)
plt.plot(test[['Gr Liv Area']], test_predictions, color='blue', linewidth=3)
plt.show()
Since the data points cluster closely around the regression line, we can conclude that the model gives a reasonable prediction of SalePrice from living area.
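To put a number on that visual impression (this check is not part of the original notebook), we could also compute the coefficient of determination, $R^2$, which LinearRegression exposes through its score() method:

# R^2: the fraction of SalePrice variance explained by Gr Liv Area (illustrative check)
print(lr.score(train[['Gr Liv Area']], train['SalePrice']))
print(lr.score(test[['Gr Liv Area']], test['SalePrice']))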
Use Multiple Regression to model the data
In the real world, multiple regression is a more useful technique, since in most cases we need to evaluate more than one correlation. We will still predict SalePrice, but with one more variable: overall condition (Overall Cond). In this case the model is a linear equation in two variables of the form: $$ Y = a_0 + coef_{Cond} \cdot (\text{Overall Cond}) + coef_{Area} \cdot (\text{Gr Liv Area}) $$
- $a_0$ is the intercept: the predicted value when both "Overall Cond" and "Gr Liv Area" are zero
- $coef_{Cond}$ is the coefficient of Overall Cond
- $coef_{Area}$ is the coefficient of Gr Liv Area
from sklearn.metrics import mean_squared_error

# Refit the model with two features: Overall Cond and Gr Liv Area
cols = ['Overall Cond', 'Gr Liv Area']
lr.fit(train[cols], train['SalePrice'])
train_predictions = lr.predict(train[cols])
test_predictions = lr.predict(test[cols])
train_rmse_2 = np.sqrt(mean_squared_error(train_predictions, train['SalePrice']))
test_rmse_2 = np.sqrt(mean_squared_error(test_predictions, test['SalePrice']))
print(lr.coef_)
print(lr.intercept_)
print(train_rmse_2)
print(test_rmse_2)
The resulting linear model is: $$ Y = 7858.7 - 409.6 \cdot (\text{Overall Cond}) + 116.7 \cdot (\text{Gr Liv Area}) $$
However, it is harder to give a geometric interpretation, since the model is now a plane (and, with more features, a higher-dimensional hyperplane) that cannot easily be plotted.
Handling data types with missing values/non-numeric values
In the machine learning workflow, once we’ve selected the model we want to use, selecting the appropriate features for that model is the next important step. In the following code snippets, I will explore how to use correlation between features and the target column, correlation between features, and variance of features to select features.
I will specifically focus on selecting from feature columns that don’t have any missing values or don’t need to be transformed to be useful (e.g. columns like Year Built and Year Remod/Add).
# Keep only numeric columns and drop identifier and date-like columns
numerical_train = train.select_dtypes(include=['int64', 'float'])
numerical_train = numerical_train.drop(['PID', 'Year Built', 'Year Remod/Add', 'Garage Yr Blt', 'Mo Sold', 'Yr Sold'], axis=1)

# Keep the columns that have no missing values in the training set
null_series = numerical_train.isnull().sum()
full_cols_series = null_series[null_series == 0]
print(full_cols_series)

# Sort these candidate features by the strength of their correlation with SalePrice
train_subset = train[full_cols_series.index]
corrmat = train_subset.corr()
sorted_corrs = corrmat['SalePrice'].abs().sort_values()
print(sorted_corrs)
Correlation Matrix Heatmap
We now have a decent list of candidate features to use in our model, sorted by how strongly they're correlated with the SalePrice column. For now, I will keep only the features that have a correlation of 0.3 or higher. This cutoff is a bit arbitrary and, in general, it's a good idea to experiment with it: for example, train and test models on the columns selected by different cutoffs and see where the model stops improving (a sketch of that experiment follows).
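As an illustration of that experiment (this loop was not run in the original notebook; it assumes sorted_corrs, train, test, LinearRegression, mean_squared_error, and np from the cells above), the cutoff sweep could look like this:

# Compare test RMSE for several candidate correlation cutoffs (illustrative sketch)
for cutoff in [0.2, 0.3, 0.4, 0.5]:
    candidate_cols = sorted_corrs[sorted_corrs > cutoff].index.drop('SalePrice')
    lr_cut = LinearRegression()
    lr_cut.fit(train[candidate_cols], train['SalePrice'])
    clean = test[list(candidate_cols) + ['SalePrice']].dropna()
    preds = lr_cut.predict(clean[candidate_cols])
    print(cutoff, np.sqrt(mean_squared_error(clean['SalePrice'], preds)))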
The next thing we need to look for is potential collinearity between some of these feature columns. Collinearity is when two feature columns are highly correlated and risk duplicating information. If we have two features that convey the same information using two different measures or metrics, we need to choose just one, or predictive accuracy can suffer.
While we can check for collinearity between 2 columns using the correlation matrix, we run the risk of information overload. We can instead generate a correlation matrix heatmap using Seaborn to visually compare the correlations and look for problematic pairwise feature correlations. Because we’re looking for outlier values in the heatmap, this visual representation is easier.
import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap of pairwise correlations among the strongly correlated features
plt.figure(figsize=(10, 6))
strong_corrs = sorted_corrs[sorted_corrs > 0.3]
corrmat = train_subset[strong_corrs.index].corr()
ax = sns.heatmap(corrmat)
plt.show()
# Drop features that are collinear with stronger predictors
final_corr_cols = strong_corrs.drop(['Garage Cars', 'TotRms AbvGrd'])
features = final_corr_cols.drop(['SalePrice']).index
target = 'SalePrice'

# The test set still contains missing values in some of these columns, so drop those rows
clean_test = test[final_corr_cols.index].dropna()

lr = LinearRegression()
lr.fit(train[features], train['SalePrice'])
train_predictions = lr.predict(train[features])
test_predictions = lr.predict(clean_test[features])
train_mse = mean_squared_error(train_predictions, train[target])
test_mse = mean_squared_error(test_predictions, clean_test[target])
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)
print(train_rmse)
print(test_rmse)
Removing low variance features
The last technique I will explore is removing features with low variance. When the values in a feature column have low variance, they don’t meaningfully contribute to the model’s predictive capability. On the extreme end, let’s imagine a column with a variance of 0. This would mean that all of the values in that column were exactly the same. This means that the column isn’t informative and isn’t going to help the model make better predictions.
To make apples-to-apples comparisons between columns, we first rescale all of the columns so their values lie between 0 and 1 (here, by dividing each column by its maximum). Then we can set a cutoff value for variance and remove features whose variance falls below it.
# Rescale the selected features to the 0-1 range and inspect their variances
unit_train = train[features]/(train[features].max())
sorted_vars = unit_train.var().sort_values()
print(sorted_vars)

# Drop the lowest-variance feature and refit the model
features = features.drop(['Open Porch SF'])
clean_test = test[final_corr_cols.index].dropna()
lr = LinearRegression()
lr.fit(train[features], train['SalePrice'])
train_predictions = lr.predict(train[features])
test_predictions = lr.predict(clean_test[features])
train_mse = mean_squared_error(train_predictions, train[target])
test_mse = mean_squared_error(test_predictions, clean_test[target])
train_rmse_2 = np.sqrt(train_mse)
test_rmse_2 = np.sqrt(test_mse)
print(lr.intercept_)
print(lr.coef_)
print(train_rmse_2)
print(test_rmse_2)
The final model is a linear function of seven features: $$ Y = -112765 + 37.9 \cdot (\text{Wood Deck SF}) + 7087 \cdot (\text{Fireplaces}) - 2222 \cdot (\text{Full Bath}) + 43 \cdot (\text{1st Flr SF}) + 65 \cdot (\text{Garage Area}) + 39 \cdot (\text{Gr Liv Area}) + 24553 \cdot (\text{Overall Qual}) $$
Feature transformation
To understand how linear regression works, I have so far stuck to using features from the training dataset that contained no missing values and were already in a convenient numeric representation. In this section, we'll explore how to transform some of the remaining features so we can use them in our model. Broadly, the process of transforming and creating new features is known as feature engineering.
# Re-create the train/test split from the untouched data
train = data[0:1460]
test = data[1460:]

# Keep only the training columns with no missing values
train_null_counts = train.isnull().sum()
df_no_mv = train[train_null_counts[train_null_counts == 0].index]

# How many unique values does each text (object) column have?
text_cols = df_no_mv.select_dtypes(include=['object']).columns
for col in text_cols:
    print(col + ":", len(train[col].unique()))

# Convert the text columns to pandas' categorical data type
for col in text_cols:
    train[col] = train[col].astype('category')
train['Utilities'].cat.codes.value_counts()
Dummy Coding
When we convert a column to the categorical data type, pandas assigns a number from 0 to n-1 (where n is the number of unique values in the column) to each value. The drawback of this approach is that it violates one of the assumptions of linear regression: that the features are linearly related to the target column. For a categorical feature, there is no actual numerical meaning to the codes that pandas assigned to that column. An increase in the Utilities column from 1 to 2 has no quantitative relationship with the target column; the categorical codes exist only for uniqueness and exclusivity (the category associated with 0 is different from the one associated with 1).
The common solution is to use a technique called dummy coding, which replaces a categorical column with one binary (0/1) column per category.
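As a small illustration (using a toy Series with made-up values rather than a column from the dataset), pd.get_dummies expands a column into one binary indicator column per category:

# Toy example of dummy coding; the values here are made up for illustration
toy = pd.Series(['AllPub', 'NoSewr', 'AllPub', 'NoSeWa'])
print(pd.get_dummies(toy))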
# Dummy-code every text column, then drop the original column
dummy_cols = pd.DataFrame()
for col in text_cols:
    col_dummies = pd.get_dummies(train[col])
    train = pd.concat([train, col_dummies], axis=1)
    del train[col]

# Example of creating a new feature: years between construction and remodeling
train['years_until_remod'] = train['Year Remod/Add'] - train['Year Built']
Missing Values
Now I will focus on handling columns with missing values. When values are missing in a column, there are two main approaches we can take (a short sketch comparing them follows this list):
- Remove rows containing missing values for specific columns
  - Pro: Rows containing missing values are removed, leaving only clean data for modeling.
  - Con: Entire observations from the training set are removed, which can reduce overall prediction accuracy.
- Impute (or replace) missing values using a descriptive statistic from the column
  - Pro: Missing values are replaced with potentially similar estimates, preserving the rest of the observation in the model.
  - Con: Depending on the approach, we may be adding noisy data for the model to learn from.
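As a minimal sketch of the two approaches (illustrative only, using Lot Frontage as an example of a column with missing values):

# Approach 1: drop rows that are missing a value in a specific column
dropped = train.dropna(subset=['Lot Frontage'])
# Approach 2: impute the missing values with a descriptive statistic (here, the mean)
imputed = train['Lot Frontage'].fillna(train['Lot Frontage'].mean())
print(len(train), len(dropped), imputed.isnull().sum())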
Given that we only have 1460 training examples (with ~80 potentially useful features), we don’t want to remove any of these rows from the dataset. Let’s instead focus on imputation techniques.
We'll focus on columns that contain at least 1 missing value but fewer than 584 missing values (40% of the number of rows in the training set). There's no strict threshold, and many people instead use a 50% cutoff (if half the values in a column are missing, it's automatically dropped). Having some domain knowledge can help with determining an acceptable cutoff value.
# Columns with at least one but fewer than 584 missing values
df_missing_values = train[train_null_counts[(train_null_counts > 0) & (train_null_counts < 584)].index]
print(df_missing_values.isnull().sum())
print(df_missing_values.dtypes)
Imputing missing values
It looks like about half of the columns in df_missing_values are string columns (object data type), while the other half are float64 columns. For numerical columns with missing values, a common strategy is to compute the mean, median, or mode of each column and replace all missing values in that column with that value.
# Impute missing values in the float columns with each column's mean
float_cols = df_missing_values.select_dtypes(include=['float'])
float_cols = float_cols.fillna(float_cols.mean())
print(float_cols.isnull().sum())
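The string (object) columns could be handled in a similar way; as a sketch that is not part of the original analysis, one option is to fill them with each column's mode (most frequent value):

# Sketch: fill missing values in the object columns with each column's most frequent value
object_cols = df_missing_values.select_dtypes(include=['object'])
object_cols = object_cols.fillna(object_cols.mode().iloc[0])
print(object_cols.isnull().sum())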
Conclusion
This notebook walks through how to perform linear regression in machine learning by analyzing a real example, the Ames housing data. In this case, doing linear regression means not only working out the correlations among the variables, but also eliminating variables with either insignificant influence or missing values.