Chapter 2: End-to-End Machine Learning Project
Notes
The Process
- Big Picture (Problem Statement)
- Get Data
- Exploratory Data Analysis
- Data Preparation
- Model selection and training
- Fine-tune the model
- Production
- Monitor and Maintain
Frame the Problem
The Task
- Build a model to predict housing prices in California given California census data. Specifically, predict the median housing price in any district, given all other metrics.
Additional Considerations
- Determine how exactly the result of your model is going to be used
- In this instance it will be fed into another machine learning model downstream
- Current process is a manual one which is costly and time consuming
- Typical error rate of the experts is 15%
- Questions to ask yourself:
- Is it supervised, unsupervised, or Reinforcement Learning? Supervised (because we have labels: the existing median housing prices)
- Is it a classification, regression or something else? It’s a regression task, we’re predicting a number
- Should you use batch learning or online learning? Depends on the volume of data, but probably batch learning.
RMSE (Root Mean Squared Error)
Measures the standard deviation of the errors the system makes in its predictions. Recall the standard deviation is:
\[\sigma = \sqrt{\frac{\sum_{i}{(\bar{X} - X_i)^2}}{N}}\]Analogously RMSE is:
\[RMSE = \sqrt{\frac{\sum_{i}{(y_i - f(X_i))^2}}{N}}\]where $f$ is our model. There is also Mean Absolute Error (MAE). RMSE is more sensitive to outliers than MAE because squaring the errors inflates large differences much more than small ones.
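As a quick numeric illustration (toy error values, not from the housing data), a single large error dominates RMSE much more than MAE:
import numpy as np

# Toy prediction errors: three small ones and one large outlier
errors = np.array([1.0, 2.0, 2.0, 100.0])

mae = np.mean(np.abs(errors))         # 26.25
rmse = np.sqrt(np.mean(errors ** 2))  # ~50.0 -- the outlier dominates
print(mae, rmse)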
Get the Data
import tarfile
import tempfile
import urllib.request
import os
import pandas as pd
housing_url = (
    "https://raw.githubusercontent.com/ageron/"
    + "handson-ml/master/datasets/housing/housing.tgz"
)
FIGSIZE = (16, 12)

def read_tar(url):
    r = urllib.request.urlopen(url)
    with tempfile.TemporaryDirectory() as d:
        with tarfile.open(fileobj=r, mode="r:gz") as tf:
            tf.extractall(path=d)
            name = tf.getnames()[0]
        df = pd.read_csv(os.path.join(d, name))
    return df
df = read_tar(housing_url)
df
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND |
20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
20639 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | INLAND |
20640 rows × 10 columns
# Show histogram of the features
%matplotlib inline
import matplotlib.pyplot as plt
df.hist(bins=100, figsize=(16, 12))
df.describe()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|---|
count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
Things to note
- Several columns seem to be capped at a max value (e.g. housing_median_age)
- median_income isn't in dollars
- Many distributions are right-skewed (hump on the left, called tail-heavy)
Things mentioned by Geron
- median_income isn't in dollars
- housing_median_age and median_house_value are capped
  - The latter might be problematic because it is our target variable. To remedy this he suggests:
    - Collect correct labels for those districts
    - Remove those districts from the training/test set
- Different scales across features
- Tail-heavy distributions
Create a Test Set
- Most of the time you’re going to be fine with randomly sampling/splitting into train/test
- Geron suggests stratified sampling based on median income (5 income bins)
- We’ll try both with a 20% test size
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
Let’s see how similar these are:
ax = train_set.hist(bins=5, figsize=(16, 12), density=True, alpha=0.8)
test_set.hist(bins=5, figsize=(16, 12), ax=ax, density=True, alpha=0.8)
That looks pretty good to me. But we’ll also do the stratified method:
# Sample to same count in each bin (5 bins)
# We can come back and try this at the end to see if the performance improves
import numpy as np
import pandas as pd
strat_values = df["median_income"]
bins = 5
x = np.linspace(0, len(strat_values), bins + 1)
xp = np.arange(len(strat_values))
fp = np.sort(strat_values)
bin_ends = np.interp(x, xp, fp)
# Make sure we include the bin ends and end up with 5 bins in the end
bin_ends[0] -= 0.001
bin_ends[-1] += 0.001
strat = np.digitize(strat_values, bins=bin_ends, right=True)
print(bin_ends)
print(pd.value_counts(strat))
df["income_cat"] = strat
strat_train_set, strat_test_set = train_test_split(
    df, test_size=0.2, random_state=42, stratify=strat
)
ax = strat_train_set.hist(bins=5, figsize=(16, 16), density=True, alpha=0.8)
strat_test_set.hist(
    bins=5, figsize=(16, 16), ax=ax.flatten()[:-2], density=True, alpha=0.8
)
[ 0.4989 2.3523 3.1406 3.9673 5.1098 15.0011]
2 4131
1 4130
4 4128
5 4127
3 4124
dtype: int64
strat_values = df["median_income"]
bins = 5
strat = np.ceil(strat_values / 1.5)
strat = strat.where(strat < 5, 5.0)
df["income_cat"] = strat
print(pd.value_counts(strat) / len(strat))
strat_train_set, strat_test_set = train_test_split(
    df, test_size=0.2, random_state=42, stratify=strat
)
ax = strat_train_set.hist(bins=5, figsize=(16, 16), density=True, alpha=0.8)
strat_test_set.hist(
    bins=5, figsize=(16, 16), ax=ax.flatten()[:-2], density=True, alpha=0.8
)
3.0 0.350581
2.0 0.318847
4.0 0.176308
5.0 0.114438
1.0 0.039826
Name: median_income, dtype: float64
I feel like this doesn’t matter at all…
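For what it's worth, the same five income categories can be built more tersely with pd.cut; a small sketch (the bin edges are chosen to match the ceil(income / 1.5) buckets above, capped at category 5):
# Same five income categories via pd.cut
income_cat = pd.cut(
    df["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)
print(income_cat.value_counts() / len(income_cat))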
# Drop the income_cat column
cols = [i for i in df.columns if i != "income_cat"]
df = df.loc[:, cols]
strat_train_set = strat_train_set.loc[:, cols]
strat_test_set = strat_test_set.loc[:, cols]
# Only work with train set from here on out
df = strat_train_set.copy()
Visualize the Data to Gain Insights
- Visualize geographically based on target variable
- Correlations
- Combining features
import seaborn as sns
plt.figure(figsize=(12, 12))
sns.scatterplot(
    x="longitude",
    y="latitude",
    data=df,
    s=df["population"] / 50,
    hue=df["median_house_value"],
    alpha=0.3,
    palette="seismic",
)
plt.title("Geographical Population/House Value Plot")
# Correlations
corr = df.corr()
corr["median_house_value"].sort_values(ascending=False)
# Scatter
from pandas.plotting import scatter_matrix
scatter_matrix(
    df[["median_house_value", "median_income", "total_rooms", "housing_median_age"]],
    figsize=(16, 12),
)
# Combining features
df["population_per_household"] = df["population"] / df["households"]
df["bedrooms_per_room"] = df["total_bedrooms"] / df["total_rooms"]
df["rooms_per_household"] = df["total_rooms"] / df["households"]
df.corr()["median_house_value"].sort_values(ascending=False)
# Not sure why rooms_per_household was 0.05 less than Geron...
median_house_value 1.000000
median_income 0.687160
rooms_per_household 0.146285
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population_per_household -0.021985
population -0.026920
longitude -0.047432
latitude -0.142724
bedrooms_per_room -0.259984
Name: median_house_value, dtype: float64
Prepare the Data for Machine Learning Algorithms
- Data Cleaning
  - Handle missing data in total_bedrooms
    - Option 1: Remove the column entirely (kind of a lousy option considering only a few districts are missing it and we just created a combo feature based on it)
    - Option 2: Remove those districts (we'd have to remove them from the test set as well)
    - Option 3: Fill in a value (mean, median, etc.)
  - We'll just go with option 3, using the median as Geron does
  - He makes a good point that we should fit the imputer on all numerical variables, because future data might have missing values in other columns
target = "median_house_value"
x = df[[col for col in df.columns if col != target]].copy()
y = df[[target]].copy()
# Impute the median
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
imputer.fit(x.drop(columns="ocean_proximity"))
print(list(imputer.statistics_.round(2)))
x_num = imputer.transform(x.drop(columns="ocean_proximity"))
print(x_num)
[-118.51, 34.26, 29.0, 2119.5, 433.0, 1164.0, 408.0, 3.54, 2.82, 0.2, 5.23]
[[-121.89 37.29 38. ... 2.09439528
0.22385204 4.62536873]
[-121.93 37.05 14. ... 2.7079646
0.15905744 6.00884956]
[-117.2 32.77 31. ... 2.02597403
0.24129098 4.22510823]
...
[-116.4 34.09 9. ... 2.74248366
0.17960865 6.34640523]
[-118.01 33.82 31. ... 3.80898876
0.19387755 5.50561798]
[-122.45 37.77 52. ... 1.98591549
0.22035541 4.84350548]]
Scikit-Learn Design
- Consistency: all objects share a consistent and simple interface
  - Estimators:
    - Any object that can estimate some parameters based on a dataset
    - Estimation is performed by calling fit
    - Hyperparameters are set at instantiation
  - Transformers:
    - Estimators which can also transform a dataset
    - Transformation is performed by calling transform with the dataset as the arg; it returns the transformed dataset
    - Some transformers have an optimized fit_transform method which runs both steps
  - Predictors:
    - Estimators which can make predictions on a dataset
    - Prediction is performed by calling predict with the new dataset as the arg
    - They also have a score method used to evaluate the quality of predictions
- Inspection:
  - Hyperparameters of estimators are available in public instance variables
  - Learned parameters of estimators are also available; their variable names end with an underscore
- Nonproliferation of classes:
  - Datasets are NumPy or SciPy arrays or sparse matrices
  - Hyperparameters are plain Python datatypes
- Composition:
  - Existing building blocks are reused as much as possible
- Sensible defaults:
  - Estimators have sensible defaults for their hyperparameters
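A minimal sketch of these conventions on toy data (the arrays here are made up for illustration):
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

X_toy = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
y_toy = np.array([1.0, 2.0, 3.0])

# Estimator + transformer: hyperparameters at instantiation, fit() learns, transform() applies
imputer = SimpleImputer(strategy="median")
imputer.fit(X_toy)
print(imputer.statistics_)           # learned parameters end with an underscore
X_clean = imputer.transform(X_toy)   # or imputer.fit_transform(X_toy) in one step

# Predictor: fit() then predict(); score() returns R^2 for regressors
model = LinearRegression()
model.fit(X_clean, y_toy)
print(model.predict(X_clean))
print(model.score(X_clean, y_toy))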
Handling Text and Categorical Attributes
- Machine learning algorithms need to work with numbers so we encode textual data as numerical input
- A label encoder will map labels into integers
- But, most ML algorithms will assume that numbers closer together are more similar
- Therefore we can use a one hot encoding to create binary labels for each category
- LabelBinarizer is the combination of these two steps
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
x_cat = encoder.fit_transform(x[["ocean_proximity"]])
x_cat
# OneHotEncoder functionality has improved so we use that later on in favor of LabelBinarizer
array([[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 1],
...,
[0, 1, 0, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 0, 1, 0]])
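Since the later pipeline uses OneHotEncoder instead, here is roughly what the equivalent call looks like (a sketch; OneHotEncoder returns a sparse matrix by default):
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
x_cat_ohe = ohe.fit_transform(x[["ocean_proximity"]])  # sparse matrix, shape (n_samples, 5)
print(ohe.categories_)          # learned categories, one array per input column
print(x_cat_ohe.toarray()[:3])  # densify a few rows for display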
I’m going to avoid the custom transformer for now.
Feature Scaling
- With few exceptions, ML algorithms do not perform well when the input numerical attributes have very different scales
- Min-max scaling (a.k.a. normalization): values are rescaled to the range 0 to 1
  - Use MinMaxScaler
- Standardization: zero mean and unit variance
  - Much less affected by outliers
  - Use StandardScaler
- Only fit the scalers on the training set
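A toy illustration of the difference (made-up column with one outlier):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

col = np.array([[1.0], [2.0], [3.0], [100.0]])
print(MinMaxScaler().fit_transform(col).ravel())    # squashed into [0, 1]; the outlier dominates the range
print(StandardScaler().fit_transform(col).ravel())  # zero mean, unit variance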
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
cat_cols = ["ocean_proximity"]
num_cols = [col for col in x.columns if col not in cat_cols]
num_pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("std_scaler", StandardScaler()),
    ]
)
pipeline = ColumnTransformer(
    [
        ("num", num_pipeline, num_cols),
        ("cat", OneHotEncoder(), cat_cols),
    ]
)
x_final = pipeline.fit_transform(x)
print(x_final)
print(x_final.shape)
# In the copy of the book I have the shape is (16513, 17),
# but in the updated version
# online: https://github.com/ageron/handson-ml/blob
# /master/02_end_to_end_machine_learning_project.ipynb
# it is (16512, 16)
[[-1.15604281 0.77194962 0.74333089 ... 0. 0.
0. ]
[-1.17602483 0.6596948 -1.1653172 ... 0. 0.
0. ]
[ 1.18684903 -1.34218285 0.18664186 ... 0. 0.
1. ]
...
[ 1.58648943 -0.72478134 -1.56295222 ... 0. 0.
0. ]
[ 0.78221312 -0.85106801 0.18664186 ... 0. 0.
0. ]
[-1.43579109 0.99645926 1.85670895 ... 0. 1.
0. ]]
(16512, 16)
A Note
It’s good to reference the notebooks here because Geron updated them with new ideas and changes that have been made in newer scikit-learn versions! Examples of this above are ColumnTransformer and the change in behavior of OneHotEncoder.
Select and Train a Model
At last! You framed the problem, you got the data and explored it, you sampled a training set and a test set, and you wrote transformation pipelines to clean up and prepare your data for Machine Learning algorithms automatically. You are now ready to select and train a Machine Learning model.
That was straight from Geron, the excitement is palpable :)
Training and Evaluating on the Training Set
from sklearn.linear_model import LinearRegression
y_final = y.copy().values
lin_reg = LinearRegression()
lin_reg.fit(x_final, y_final)
# Some predictions
x_5 = x_final[:5, :]
y_5 = y_final[:5]
print(list(lin_reg.predict(x_5)[:, 0].round(2)))
print(list(y_5[:, 0]))
[209375.74, 315154.78, 210238.28, 55902.62, 183416.69]
[286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
from sklearn.metrics import mean_squared_error
lin_preds = lin_reg.predict(x_final)
lin_mse = mean_squared_error(y_final, lin_preds)
lin_rmse = np.sqrt(lin_mse)
lin_rmse # Better than Geron :)
68161.22644433199
# Let's plot this
plt.figure(figsize=FIGSIZE)
plt.scatter(lin_preds[:, 0], y_final[:, 0])
plt.plot(np.arange(max(y_final[:, 0])), np.arange(max(y_final[:, 0])), c="r", lw=4)
plt.axis("equal")
plt.xlabel("Predictions")
plt.ylabel("Labels")
plt.title("Median House Price: Predictions vs. Labels")
Text(0.5, 1.0, 'Median House Price: Predictions vs. Labels')
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(x_final, y_final)
tree_preds = tree_reg.predict(x_final)
tree_mse = mean_squared_error(y_final, tree_preds)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
0.0
Underfitting and Overfitting
- Clearly the LinearRegression model is underfitting: its training RMSE of ~$68k is large relative to typical district values (interquartile range roughly $120k–$265k)
- The DecisionTreeRegressor model is overfitting: a training RMSE of 0.0 means it has essentially memorized the training data, so we evaluate it on held-out folds next
K-fold Cross Validation
- Split the training data into k folds
- Iterate k times, each time training on k-1 folds and validating on the remaining fold
- This yields k evaluation scores (rough sketch of the mechanics below)
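Roughly what cross_val_score does under the hood (a sketch using KFold and clone; the real helper below also handles the scoring conventions for us):
from sklearn.base import clone
from sklearn.model_selection import KFold

def manual_cv_rmse(model, X, y, k=10):
    rmses = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=42).split(X):
        m = clone(model)  # fresh, unfitted copy so folds don't share state
        m.fit(X[train_idx], y[train_idx])
        preds = m.predict(X[val_idx])
        rmses.append(np.sqrt(mean_squared_error(y[val_idx], preds)))
    return np.mean(rmses), np.std(rmses)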
from sklearn.model_selection import cross_val_score
def cross_val_model(m, x_m, y_m, cv=10):
    scores = cross_val_score(m, x_m, y_m, scoring="neg_mean_squared_error", cv=cv)
    rmse_scores = np.sqrt(-scores)
    print(rmse_scores, np.mean(rmse_scores), np.std(rmse_scores))
cross_val_model(tree_reg, x_final, y_final)
[70968.72056379 67216.68718226 70857.65880397 69194.86108641
69756.29757786 74386.85421573 69949.56290335 69745.34537599
75022.85006194 70755.92128417] 70785.47590554852 2215.1990085744
cross_val_model(lin_reg, x_final, y_final)
[66060.65470195 66764.30726969 67721.72734022 74719.28193624
68058.11572078 70909.35812986 64171.66459204 68075.65317717
71024.84033989 67300.24394751] 68480.58471553595 2845.5843092650853
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(x_final, y_final[:, 0])
forest_preds = forest_reg.predict(x_final)
forest_mse = mean_squared_error(forest_preds, y_final)
forest_rmse = np.sqrt(forest_mse)
print(forest_rmse)
cross_val_model(forest_reg, x_final, y_final[:, 0])
18681.372911866638
[49635.15372436 47754.83871792 49368.25902706 51887.71850715
49747.11331684 53513.21033152 49044.38099493 47851.45135021
52535.51927089 50181.50476447] 50151.91500053619 1824.9254115323
plt.figure(figsize=FIGSIZE)
plt.scatter(forest_preds, y_final[:, 0])
plt.plot(np.arange(max(y_final[:, 0])), np.arange(max(y_final[:, 0])), c="r", lw=4)
plt.axis("equal")
plt.xlabel("Predictions")
plt.ylabel("Labels")
plt.title("Median House Price (forest_reg): Predictions vs. Labels")
Text(0.5, 1.0, 'Median House Price (forest_reg): Predictions vs. Labels')
Fine-Tune Your Model
- Hyperparameter tuning via GridSearchCV
from sklearn.model_selection import GridSearchCV
param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(
    forest_reg,
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)
grid_search.fit(x_final, y_final[:, 0])
GridSearchCV(cv=5, estimator=RandomForestRegressor(),
param_grid=[{'max_features': [2, 4, 6, 8],
'n_estimators': [3, 10, 30]},
{'bootstrap': [False], 'max_features': [2, 3, 4],
'n_estimators': [3, 10]}],
return_train_score=True, scoring='neg_mean_squared_error')
print(grid_search.best_params_)
# Since n_estimators came out at the top of its searched range, we probably want to re-run with higher values...
print(grid_search.best_estimator_)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(np.sqrt(-mean_score), params)
{'max_features': 6, 'n_estimators': 30}
RandomForestRegressor(max_features=6, n_estimators=30)
64241.643733474426 {'max_features': 2, 'n_estimators': 3}
55493.95031384657 {'max_features': 2, 'n_estimators': 10}
52611.35103475831 {'max_features': 2, 'n_estimators': 30}
59612.070906720124 {'max_features': 4, 'n_estimators': 3}
53217.142164320154 {'max_features': 4, 'n_estimators': 10}
50774.31443657333 {'max_features': 4, 'n_estimators': 30}
58496.50322845977 {'max_features': 6, 'n_estimators': 3}
51673.17455491991 {'max_features': 6, 'n_estimators': 10}
49905.43427800321 {'max_features': 6, 'n_estimators': 30}
58415.20435335512 {'max_features': 8, 'n_estimators': 3}
51769.879435332965 {'max_features': 8, 'n_estimators': 10}
50108.24515443716 {'max_features': 8, 'n_estimators': 30}
63641.17748807948 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54078.8506451545 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
60653.00976167665 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52864.94701183964 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57904.55694831166 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51990.81114906108 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
# 49744.32698468949 is better than the 50063.56307010515 that we got earlier
pd.DataFrame(grid_search.cv_results_)
mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_features | param_n_estimators | param_bootstrap | params | split0_test_score | split1_test_score | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.074970 | 0.003440 | 0.004058 | 0.000327 | 2 | 3 | NaN | {'max_features': 2, 'n_estimators': 3} | -3.725274e+09 | -4.519071e+09 | ... | -4.126989e+09 | 2.782979e+08 | 18 | -1.174094e+09 | -1.137123e+09 | -1.135578e+09 | -1.161493e+09 | -1.173606e+09 | -1.156379e+09 | 1.697186e+07 |
1 | 0.253067 | 0.006757 | 0.013049 | 0.000900 | 2 | 10 | NaN | {'max_features': 2, 'n_estimators': 10} | -2.902669e+09 | -3.157306e+09 | ... | -3.079579e+09 | 1.487250e+08 | 11 | -5.985522e+08 | -5.762183e+08 | -5.598208e+08 | -5.842322e+08 | -5.538692e+08 | -5.745385e+08 | 1.623130e+07 |
2 | 0.766241 | 0.061197 | 0.041019 | 0.012172 | 2 | 30 | NaN | {'max_features': 2, 'n_estimators': 30} | -2.522423e+09 | -2.879564e+09 | ... | -2.767954e+09 | 1.561066e+08 | 7 | -4.315321e+08 | -4.233639e+08 | -4.226476e+08 | -4.443902e+08 | -4.293808e+08 | -4.302629e+08 | 7.842951e+06 |
3 | 0.117724 | 0.002772 | 0.004679 | 0.000370 | 4 | 3 | NaN | {'max_features': 4, 'n_estimators': 3} | -3.327938e+09 | -3.817583e+09 | ... | -3.553599e+09 | 2.235334e+08 | 15 | -9.922077e+08 | -9.869361e+08 | -9.914129e+08 | -9.018709e+08 | -9.423203e+08 | -9.629496e+08 | 3.577073e+07 |
4 | 0.396204 | 0.010106 | 0.012332 | 0.000672 | 4 | 10 | NaN | {'max_features': 4, 'n_estimators': 10} | -2.740772e+09 | -2.832407e+09 | ... | -2.832064e+09 | 1.248181e+08 | 9 | -5.299414e+08 | -5.012337e+08 | -5.068510e+08 | -5.150394e+08 | -5.214479e+08 | -5.149027e+08 | 1.020482e+07 |
5 | 1.183900 | 0.032037 | 0.034720 | 0.003109 | 4 | 30 | NaN | {'max_features': 4, 'n_estimators': 30} | -2.458769e+09 | -2.659611e+09 | ... | -2.578031e+09 | 1.205170e+08 | 3 | -3.960356e+08 | -3.967138e+08 | -3.976908e+08 | -3.966141e+08 | -3.885729e+08 | -3.951254e+08 | 3.319175e+06 |
6 | 0.163407 | 0.006830 | 0.004493 | 0.000744 | 6 | 3 | NaN | {'max_features': 6, 'n_estimators': 3} | -3.348268e+09 | -3.590983e+09 | ... | -3.421841e+09 | 1.263433e+08 | 14 | -9.210685e+08 | -9.293810e+08 | -9.024612e+08 | -9.630794e+08 | -8.755641e+08 | -9.183108e+08 | 2.902711e+07 |
7 | 0.540489 | 0.024100 | 0.012174 | 0.000949 | 6 | 10 | NaN | {'max_features': 6, 'n_estimators': 10} | -2.539299e+09 | -2.724873e+09 | ... | -2.670117e+09 | 1.431570e+08 | 4 | -5.289777e+08 | -4.838298e+08 | -4.794369e+08 | -4.998217e+08 | -5.077449e+08 | -4.999622e+08 | 1.779906e+07 |
8 | 1.557723 | 0.008208 | 0.032147 | 0.001470 | 6 | 30 | NaN | {'max_features': 6, 'n_estimators': 30} | -2.311978e+09 | -2.530306e+09 | ... | -2.490552e+09 | 1.364425e+08 | 1 | -3.738081e+08 | -3.723503e+08 | -3.776112e+08 | -3.963427e+08 | -3.891545e+08 | -3.818534e+08 | 9.341090e+06 |
9 | 0.199586 | 0.008718 | 0.003961 | 0.000096 | 8 | 3 | NaN | {'max_features': 8, 'n_estimators': 3} | -3.293286e+09 | -3.533084e+09 | ... | -3.412336e+09 | 1.883880e+08 | 13 | -9.155409e+08 | -8.965800e+08 | -8.831597e+08 | -8.852660e+08 | -9.669854e+08 | -9.095064e+08 | 3.094863e+07 |
10 | 0.661567 | 0.013049 | 0.011522 | 0.000856 | 8 | 10 | NaN | {'max_features': 8, 'n_estimators': 10} | -2.565200e+09 | -2.760625e+09 | ... | -2.680120e+09 | 1.436614e+08 | 5 | -5.009905e+08 | -4.870524e+08 | -4.837687e+08 | -5.273396e+08 | -5.073409e+08 | -5.012984e+08 | 1.565241e+07 |
11 | 2.099403 | 0.083338 | 0.035778 | 0.004660 | 8 | 30 | NaN | {'max_features': 8, 'n_estimators': 30} | -2.336182e+09 | -2.611561e+09 | ... | -2.510836e+09 | 1.374942e+08 | 2 | -3.807474e+08 | -3.895396e+08 | -3.731072e+08 | -3.865464e+08 | -3.940694e+08 | -3.848020e+08 | 7.274341e+06 |
12 | 0.110337 | 0.005226 | 0.005313 | 0.000367 | 2 | 3 | False | {'bootstrap': False, 'max_features': 2, 'n_est... | -3.676333e+09 | -4.174187e+09 | ... | -4.050199e+09 | 2.068671e+08 | 17 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
13 | 0.417751 | 0.083123 | 0.014098 | 0.001016 | 2 | 10 | False | {'bootstrap': False, 'max_features': 2, 'n_est... | -2.679464e+09 | -2.985912e+09 | ... | -2.924522e+09 | 1.304188e+08 | 10 | -0.000000e+00 | -9.463245e+02 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -1.892649e+02 | 3.785298e+02 |
14 | 0.148054 | 0.008795 | 0.005098 | 0.000627 | 3 | 3 | False | {'bootstrap': False, 'max_features': 3, 'n_est... | -3.413029e+09 | -3.862638e+09 | ... | -3.678788e+09 | 1.461245e+08 | 16 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
15 | 0.484264 | 0.008298 | 0.013995 | 0.000664 | 3 | 10 | False | {'bootstrap': False, 'max_features': 3, 'n_est... | -2.743328e+09 | -2.828006e+09 | ... | -2.794703e+09 | 5.917227e+07 | 8 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
16 | 0.178736 | 0.007328 | 0.005338 | 0.000965 | 4 | 3 | False | {'bootstrap': False, 'max_features': 4, 'n_est... | -3.318034e+09 | -3.381403e+09 | ... | -3.352938e+09 | 8.572118e+07 | 12 | -1.962887e+02 | -0.000000e+00 | -0.000000e+00 | -1.051392e+04 | -0.000000e+00 | -2.142042e+03 | 4.186630e+03 |
17 | 0.584409 | 0.007127 | 0.013667 | 0.000872 | 4 | 10 | False | {'bootstrap': False, 'max_features': 4, 'n_est... | -2.593077e+09 | -2.810679e+09 | ... | -2.703044e+09 | 1.018621e+08 | 6 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
18 rows × 23 columns
Other Methods to Fine Tune
- Randomized Search:
  - Samples a fixed number of hyperparameter combinations, so it can explore many more values per hyperparameter
  - Gives more control over the computing budget (via the number of iterations); see the sketch below
- Ensemble Methods:
  - Combining the models which perform the best
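A sketch of what the randomized variant could look like here (the parameter ranges are assumptions, not tuned values):
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": randint(10, 200),   # assumed range
    "max_features": randint(2, 9),      # assumed range
}
rnd_search = RandomizedSearchCV(
    RandomForestRegressor(),
    param_distributions=param_dist,
    n_iter=10,                          # computing budget: number of sampled settings
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=42,
)
# rnd_search.fit(x_final, y_final[:, 0])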
Feature Importance
attributes = list(df.columns) + list(encoder.classes_)
attributes.remove("median_house_value")
attributes.remove("ocean_proximity")
importances = grid_search.best_estimator_.feature_importances_
sorted(zip(importances, attributes), reverse=True)
[(0.3317845401920315, 'median_income'),
(0.14391320344674868, 'INLAND'),
(0.10526089823364354, 'population_per_household'),
(0.08263855622539133, 'bedrooms_per_room'),
(0.08109436950269967, 'longitude'),
(0.06119936528237925, 'latitude'),
(0.05437513667126127, 'rooms_per_household'),
(0.04269180191935387, 'housing_median_age'),
(0.018543650605563098, 'population'),
(0.017855965561009164, 'total_rooms'),
(0.01747459825864214, 'total_bedrooms'),
(0.016371631697584668, 'households'),
(0.015137593949840484, '<1H OCEAN'),
(0.006837130390816489, 'NEAR OCEAN'),
(0.004801246718794319, 'NEAR BAY'),
(2.0311344240502856e-05, 'ISLAND')]
Evaluate Your Model on the Test Set
final_model = grid_search.best_estimator_
print(final_model)
X_test = strat_test_set.drop(columns="median_house_value")
# Little bit of an oversight here: the combined features were created outside the pipeline, so we have to recreate them for the test set by hand
X_test["population_per_household"] = X_test["population"] / X_test["households"]
X_test["bedrooms_per_room"] = X_test["total_bedrooms"] / X_test["total_rooms"]
X_test["rooms_per_household"] = X_test["total_rooms"] / X_test["households"]
y_test = strat_test_set["median_house_value"].copy().values
X_test_prep = pipeline.transform(X_test)
final_preds = final_model.predict(X_test_prep)
final_mse = mean_squared_error(y_test, final_preds)
final_rmse = np.sqrt(final_mse)
print(final_rmse)
RandomForestRegressor(max_features=6, n_estimators=30)
48308.099325390474
Launch, Monitor, and Maintain Your System
- Plug into production data
- Write tests
- Write monitoring code to check the live performance at regular intervals and trigger alerts when it drops
- Models tend to “rot” over time
- Evaluating performance requires sampling the system’s predictions and evaluating them
- In this case we need to have the human evaluation plugged into the system
- Monitor system input data
- Automate the training process and train on fresh data
- For online learning, save snapshots of the system's state at regular intervals (a minimal persistence sketch below)
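A minimal persistence sketch with joblib (filenames are hypothetical; the same idea covers periodic snapshots for online learning):
import joblib

# Save the fitted preprocessing pipeline and the final model
joblib.dump(pipeline, "housing_pipeline.pkl")
joblib.dump(final_model, "housing_model.pkl")

# Later, in production:
# pipeline = joblib.load("housing_pipeline.pkl")
# final_model = joblib.load("housing_model.pkl")
# preds = final_model.predict(pipeline.transform(new_districts))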