Chapter 3: Classification

Notes

MNIST

The MNIST dataset is a set of 70,000 small images of handwritten digits, each with a label
This is probably the most used dataset in machine learning

from sklearn.datasets import fetch_openml
import numpy as np
# MNIST changed to https://www.openml.org/d/554
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
# Do this to follow along with Geron
def sort_by_target(mnist):
    reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
    reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
    mnist.data[:60000] = mnist.data[reorder_train]
    mnist.target[:60000] = mnist.target[reorder_train]
    mnist.data[60000:] = mnist.data[reorder_test + 60000]
    mnist.target[60000:] = mnist.target[reorder_test + 60000]
mnist.target = mnist.target.astype(np.int8)
sort_by_target(mnist)

X, y = mnist["data"], mnist["target"]
print(X.shape, y.shape)

(70000, 784) (70000,)

%matplotlib inline
import matplotlib.pyplot as plt

dim = 28
some_num = 36_000
example = X[some_num]
plt.imshow(example.reshape((dim, dim)), cmap="binary")
plt.axis("off");
y[some_num]


Split into train / test!

X_train, y_train, X_test, y_test = X[:60000], y[:60000], X[60000:], y[60000:]

# Shuffling
shuf_order = np.random.permutation(len(y_train))
X_train, y_train = X_train[shuf_order, :], y_train[shuf_order]

Training a Binary Classifier

y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier()
sgd_clf.fit(X_train, y_train_5)

SGDClassifier()

sgd_clf.predict([example])
# This is sometimes False: results vary between runs because no random_state is set

array([ True])

Performance Measures

Measuring Accuracy Using Cross-Validation

from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

array([0.96835, 0.96025, 0.9659 ])

from sklearn.base import BaseEstimator

class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        return self
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)
    
n5c = Never5Classifier()
n5c.fit(X_train, y_train_5)
cross_val_score(n5c, X_train, y_train_5, cv=3, scoring="accuracy")

array([0.9102 , 0.9103 , 0.90845])

Misleading Performance Conclusions

Results for our Stochastic Gradient Descent Classifier are ~96%!
These results only look good: always guessing False is already ~91% accurate, because only about 10% of the images are 5s
Thus, accuracy is generally not the preferred performance measure for classifiers, especially when the data is skewed
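
To make the point concrete, here is a minimal sketch on synthetic labels (not MNIST; the 10% positive rate is an assumption chosen to mimic the share of 5s). A classifier that never predicts the positive class gets an accuracy equal to the share of negatives:

```python
import numpy as np

# Synthetic, skewed labels: 100 positives out of 1,000 (assumed ratio).
y_true = np.zeros(1000, dtype=bool)
y_true[:100] = True

# A "classifier" that never predicts the positive class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
print(accuracy)  # 0.9, the share of negatives
```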

Confusion Matrix

The general idea is to count the number of times class A was classified as class B

from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
# This is a pretty useful function: it runs k-fold CV and records the predictions
# made on each held-out fold (by a model trained on the other folds),
# then stacks them so every training sample gets one out-of-fold prediction
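
A rough sketch of the out-of-fold idea in plain NumPy may help. This is not scikit-learn's implementation (which, for classifiers, uses stratified folds by default); the helper `out_of_fold_predictions` and the toy threshold model are hypothetical names made up for illustration:

```python
import numpy as np

def out_of_fold_predictions(X, y, fit, predict, n_folds=3):
    """Predict every sample with a model that never saw it during training."""
    indices = np.arange(len(X))
    y_pred = np.empty(len(X), dtype=y.dtype)
    # Contiguous folds for simplicity (an assumption of this sketch).
    for test_idx in np.array_split(indices, n_folds):
        train_idx = np.setdiff1d(indices, test_idx)
        model = fit(X[train_idx], y[train_idx])
        y_pred[test_idx] = predict(model, X[test_idx])
    return y_pred

# Hypothetical toy model: a threshold halfway between the two class means.
def fit_threshold(X, y):
    return (X[y].mean() + X[~y].mean()) / 2

def predict_threshold(threshold, X):
    return X > threshold

X_toy = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.7])
y_toy = X_toy > 0.5
y_oof = out_of_fold_predictions(X_toy, y_toy, fit_threshold, predict_threshold)
print(y_oof)
```

Each entry of `y_oof` comes from a model trained without that sample, which is why these predictions give a fairer picture than predicting on the data the model was fit on.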

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# Requires scikit-learn >= 1.0 (from_predictions was added in 1.0)
cmd = ConfusionMatrixDisplay.from_predictions(y_train_5, y_train_pred)
cm = cmd.confusion_matrix


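Geron’s book has a useful illustration of how to read this matrix: each row is an actual class and each column is a predicted class, so the binary case reads [[TN, FP], [FN, TP]]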
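
The four cells can also be counted by hand. The labels below are made up for illustration, and the cell ordering follows what `confusion_matrix` returns for binary labels:

```python
import numpy as np

# Made-up labels: 3 actual positives, 5 actual negatives.
y_true = np.array([True, True, True, False, False, False, False, False])
y_pred = np.array([True, True, False, True, False, False, False, False])

tn = np.sum(~y_true & ~y_pred)  # actual negative, predicted negative
fp = np.sum(~y_true & y_pred)   # actual negative, predicted positive
fn = np.sum(y_true & ~y_pred)   # actual positive, predicted negative
tp = np.sum(y_true & y_pred)    # actual positive, predicted positive

cm_toy = np.array([[tn, fp], [fn, tp]])
print(cm_toy)
```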

Precision, Recall, and F1 Score

Precision

$ precision = \frac {TP}{TP + FP} $

When your classifier claims to predict a 5, it is correct precision % of the time.

Recall

$ recall = \frac {TP}{TP + FN} $

It only detects recall % of the 5’s.

$F_1$ Score

$ F_1 = \frac {2}{\frac{1}{precision} + \frac{1}{recall}} = \frac {TP}{TP + \frac{FN + FP}{2}} $

Harmonic mean of recall and precision. Both need to be high to get a high $F_1$ Score.
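
A quick worked example, using hypothetical counts (not taken from the MNIST run), shows that the harmonic-mean form and the count form of $F_1$ agree:

```python
# Hypothetical counts chosen for illustration.
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)  # 80 / 100
recall = tp / (tp + fn)     # 80 / 120

f1_harmonic = 2 / (1 / precision + 1 / recall)
f1_counts = tp / (tp + (fn + fp) / 2)

print(round(precision, 4), round(recall, 4), round(f1_harmonic, 4))
```

Note that the $F_1$ score sits closer to the lower of the two values than the arithmetic mean would.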

Precision/Recall Scenarios

Scenario 1: Safe video classifier, if video is safe –> 1, else 0. We want high precision because we never want the model to think that a video is safe when in reality it is violent. We don’t mind blocking a decent number of safe videos.
Scenario 2: Detect shoplifters on security footage, if shoplifting –> 1, else 0. We want high recall because we never want the model to think that a person is not shoplifting when in reality they are. We don’t mind occasionally flagging people who aren’t actually shoplifting.
Scenario 3: Betting model, if you should bet –> 1, else 0. We want high precision because we never want the model to predict a win when the bet is actually a loss, even at the price of missing out on a few good opportunities.
Scenario 4: Will jumping off this structure kill me, if yes –> 1, else 0. We want high recall because we never want to predict that it won’t kill me when in reality it would. Sorry for the grim example…
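
Which side of the trade-off you end up on is usually controlled by the decision threshold applied to the classifier's scores. The sketch below uses made-up scores and labels, and the helper `precision_recall_at` is a hypothetical name; it only illustrates that raising the threshold tends to increase precision and reduce recall:

```python
import numpy as np

# Made-up decision scores and true labels.
scores = np.array([-3.0, -1.0, 0.0, 1.0, 2.0, 4.0])
labels = np.array([False, False, True, False, True, True])

def precision_recall_at(threshold):
    """Precision and recall when predicting positive for scores above threshold."""
    predicted = scores > threshold
    tp = np.sum(predicted & labels)
    fp = np.sum(predicted & ~labels)
    fn = np.sum(~predicted & labels)
    return float(tp / (tp + fp)), float(tp / (tp + fn))

for threshold in (-2.0, 0.5, 1.5):
    print(threshold, precision_recall_at(threshold))
```

A low threshold suits the high-recall scenarios, a high threshold the high-precision ones.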

from sklearn.metrics import precision_score, recall_score, f1_score

print(precision_score(y_train_5, y_train_pred)) # TP / (TP + FP)
print(recall_score(y_train_5, y_train_pred))
print(f1_score(y_train_5, y_train_pred))

0.7934046345811052
0.8210662239439218
0.8069984588885868

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

from sklearn.metrics import PrecisionRecallDisplay

# This is precision vs recall
disp = PrecisionRecallDisplay(precision=precisions, recall=recalls)
disp.plot()

# Geron's plot
def plot_precision_recall_vs_threshold(p, r, t):
    plt.plot(t, p[:-1], "b--", label="Precision")
    plt.plot(t, r[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
plt.figure()
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)


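# My curve looks quite different from the one in the book...
# Lowest threshold whose precision exceeds 90%. precisions has one more entry
# than thresholds, so the final index is dropped before indexing.
thresh = min(thresholds[np.argwhere(precisions > 0.9)[:-1, 0]])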
y_train_pred_90 = (y_scores > thresh)
print(precision_score(y_train_5, y_train_pred_90))
print(recall_score(y_train_5, y_train_pred_90))

0.9001540041067762
0.646928610957388

The ROC Curve

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, lw=2, label=label)
    plt.plot([0, 1], [0, 1], "k--")
    plt.axis([0,1,0,1])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")

plot_roc_curve(fpr, tpr)
plt.show()


# ROC Area Under the Curve Score
from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)

0.9653886554659126
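
One way to read this number: the ROC AUC equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative (ties counting as half). A brute-force sketch on made-up scores, where `pairwise_auc` is a hypothetical helper and not how scikit-learn computes the value:

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t]
    neg = [s for s, t in zip(scores, y_true) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```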

Rule of Thumb

Use the PR curve when the positive class is rare or when you care more about the false positives than the false negatives; otherwise use the ROC curve. Our classifier (which isn’t very good) looks good on the ROC curve, but that’s only because just about 10% of the images are 5s (few positives).
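
A little arithmetic shows why the two curves can disagree on skewed data. The counts below are hypothetical (1,000 samples with 10% positives) and are not taken from the run above:

```python
# Hypothetical confusion-matrix counts for a skewed dataset.
tp, fn = 80, 20    # 100 actual positives
fp, tn = 90, 810   # 900 actual negatives

tpr = tp / (tp + fn)        # recall, the y-axis of the ROC curve
fpr = fp / (fp + tn)        # the x-axis of the ROC curve
precision = tp / (tp + fp)  # the y-axis of the PR curve

print(tpr, fpr, round(precision, 3))
```

The ROC point (FPR 0.1, TPR 0.8) looks strong because the many true negatives keep the false positive rate small, while the precision reveals that fewer than half of the positive predictions are correct.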

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier()
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")

y_scores_forest = y_probas_forest[:, 1]
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)

plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, label="Random Forest")
plt.legend(loc="lower right")


roc_auc_score(y_train_5, y_scores_forest) # Much better!!

0.9984537812756192

Multiclass Classification

Some algorithms can handle multiple classes directly:
- Random Forest Classifiers
- Naive Bayes Classifiers
Others are strictly binary classifiers:
- Support Vector Machine Classifiers
- Linear Classifiers
These can be converted to multiclass classifiers using one of two strategies
One-versus-all strategy (OvA)
- Create 10 binary classifiers (one per digit) and pick the class whose classifier outputs the highest score
One-versus-one strategy (OvO)
- Create classifiers to predict 0s vs 1s, 0s vs 2s, … 0s vs 9s, 1s vs 2s, 1s vs 3s, …
- Need N * (N - 1) / 2 classifiers
Most of the time OvA is preferred
Scikit-learn automatically runs OvA when you use a strictly binary classifier for a multiclass task (except for SVMs, where it uses OvO)
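
As a quick check of the OvO count, the pairs can be enumerated directly. This only counts the classifiers that would be needed; it does not train anything:

```python
from itertools import combinations

classes = range(10)  # the ten digits
ovo_pairs = list(combinations(classes, 2))

print(len(ovo_pairs))  # 45, i.e. 10 * 9 / 2
print(ovo_pairs[:3])   # [(0, 1), (0, 2), (0, 3)]
```

OvA, by comparison, needs only one classifier per class, so 10 here.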

sgd_clf.fit(X_train, y_train)

sgd_clf.predict([example])

array([5], dtype=int8)

some_num_scores = sgd_clf.decision_function([example])
some_num_scores

array([[ -9769.65818671, -24825.69130821, -10776.58906031,
         -1405.53715487, -19212.67326423,   2556.83203592,
        -20584.60080404, -18743.62574742,  -8281.89554637,
        -12370.00869108]])
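
The predicted class is the one with the highest decision score. A small check, copying the scores printed above into an array (here the column index equals the digit because the classes are 0 through 9; in general the index should be mapped through the classifier's `classes_` attribute):

```python
import numpy as np

# Scores copied from the output above, one column per class.
scores = np.array([[ -9769.65818671, -24825.69130821, -10776.58906031,
                     -1405.53715487, -19212.67326423,   2556.83203592,
                    -20584.60080404, -18743.62574742,  -8281.89554637,
                    -12370.00869108]])

predicted_class = int(np.argmax(scores, axis=1)[0])
print(predicted_class)  # 5
```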

forest_clf.fit(X_train, y_train)
forest_clf.predict([example])

array([5], dtype=int8)

forest_clf.predict_proba([example])

array([[0.04, 0.  , 0.  , 0.05, 0.  , 0.87, 0.03, 0.  , 0.  , 0.01]])

cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")

array([0.88685, 0.89025, 0.8789 ])

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")

/Users/riley/PycharmProjects/ML/venv/lib/python3.8/site-packages/sklearn/linear_model/_stochastic_gradient.py:696: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
  warnings.warn(

array([0.9016 , 0.90595, 0.90815])
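
The scaling step standardizes each pixel column to zero mean and unit variance. A minimal NumPy sketch of that transformation on made-up data (the helper `standardize` is hypothetical and not scikit-learn's implementation; leaving constant columns unscaled is meant to mirror how `StandardScaler` avoids dividing by zero):

```python
import numpy as np

def standardize(X):
    """Column-wise (x - mean) / std, leaving constant columns unscaled."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns (e.g. always-blank pixels)
    return (X - mean) / std

# Made-up data: the first column is constant, like an always-blank pixel.
X_toy = np.array([[0.0, 10.0, 5.0],
                  [0.0, 20.0, 5.0],
                  [0.0, 30.0, 7.0],
                  [0.0, 40.0, 7.0]])
X_std = standardize(X_toy)
print(X_std)
```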

Error Analysis

# Look at the confusion matrix
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx

/Users/riley/PycharmProjects/ML/venv/lib/python3.8/site-packages/sklearn/linear_model/_stochastic_gradient.py:696: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
  warnings.warn(

array([[5606,    0,   14,    7,    9,   48,   36,    5,  197,    1],
       [   0, 6434,   43,   20,    3,   45,    4,   10,  173,   10],
       [  30,   27, 5250,   87,   80,   29,   71,   41,  335,    8],
       [  24,   22,  110, 5260,    1,  213,   30,   44,  359,   68],
       [  10,   17,   40,    8, 5265,   11,   34,   22,  272,  163],
       [  27,   17,   30,  159,   53, 4520,   78,   19,  456,   62],
       [  30,   18,   49,    1,   36,  102, 5557,    9,  116,    0],
       [  20,   14,   50,   22,   47,   13,    6, 5725,  145,  223],
       [  19,   65,   46,   97,    3,  132,   28,    8, 5399,   54],
       [  23,   19,   30,   58,  123,   37,    1,  167,  296, 5195]])

plt.matshow(conf_mx, cmap="gray")


row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap="gray")
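
The brightest cell of that plot can also be found numerically. The sketch below copies the confusion matrix printed earlier and looks up the largest normalized off-diagonal entry:

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted).
conf_mx = np.array([
    [5606,    0,   14,    7,    9,   48,   36,    5,  197,    1],
    [   0, 6434,   43,   20,    3,   45,    4,   10,  173,   10],
    [  30,   27, 5250,   87,   80,   29,   71,   41,  335,    8],
    [  24,   22,  110, 5260,    1,  213,   30,   44,  359,   68],
    [  10,   17,   40,    8, 5265,   11,   34,   22,  272,  163],
    [  27,   17,   30,  159,   53, 4520,   78,   19,  456,   62],
    [  30,   18,   49,    1,   36,  102, 5557,    9,  116,    0],
    [  20,   14,   50,   22,   47,   13,    6, 5725,  145,  223],
    [  19,   65,   46,   97,    3,  132,   28,    8, 5399,   54],
    [  23,   19,   30,   58,  123,   37,    1,  167,  296, 5195],
])

# Error rate per actual class, with correct predictions zeroed out.
norm = conf_mx / conf_mx.sum(axis=1, keepdims=True)
np.fill_diagonal(norm, 0)

actual, predicted = np.unravel_index(np.argmax(norm), norm.shape)
print(actual, predicted)  # 5 8
```

In this run the most common mistake, relative to class size, is a 5 being classified as an 8, and column 8 holds the largest error in most rows.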


Multilabel Classification

A classification system that outputs multiple binary labels is a multilabel classification system
Picture with Alice and Charlie should output [1, 0, 1]
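
As a tiny illustration of that output format, a set of recognized people can be encoded as one binary label per person. The name "Bob" for the middle slot and the helper `encode_faces` are assumptions made for this sketch:

```python
# One binary label per known person; "Bob" is an assumed third name.
PEOPLE = ["Alice", "Bob", "Charlie"]

def encode_faces(present):
    """Return one 0/1 label per person in PEOPLE."""
    return [int(name in present) for name in PEOPLE]

print(encode_faces({"Alice", "Charlie"}))  # [1, 0, 1]
```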

from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

KNeighborsClassifier()

knn_clf.predict([example])

array([[False,  True]])

# Typo in book here y_train --> y_multilabel
# Takes a long time
# y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3, n_jobs=-1)
# f1_score(y_multilabel, y_train_knn_pred, average="macro")

Multioutput Classification

A generalization of multilabel classification where each label can be multiclass, i.e. take more than two possible values. In the noise removal example below, each of the 784 pixels is a label whose value ranges from 0 to 255
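
A toy-scale sketch of the idea, assuming a 1-nearest-neighbor rule written in plain NumPy (the 4-pixel "images" and the helper `denoise_1nn` are made up for illustration): the model outputs a whole vector of pixel values, one multiclass label per pixel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two tiny 4-pixel "images" stand in for clean digits (made-up data).
clean = np.array([[0, 0, 255, 255],
                  [255, 255, 0, 0]])
noisy = clean + rng.integers(0, 100, size=clean.shape)

def denoise_1nn(x, X_noisy, Y_clean):
    """Return the clean target of the nearest noisy training sample."""
    distances = np.linalg.norm(X_noisy - x, axis=1)
    return Y_clean[np.argmin(distances)]

# A fresh noisy copy of the second image.
query = clean[1] + rng.integers(0, 100, size=4)
restored = denoise_1nn(query, noisy, clean)
print(restored)
```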

# Noise removal classifier
import numpy.random as rnd
train_noise = rnd.randint(0, 100, (len(X_train), 784))
test_noise = rnd.randint(0, 100, (len(X_test), 784))
X_train_mod = X_train + train_noise
X_test_mod = X_test + test_noise
y_train_mod = X_train
y_test_mod = X_test

plt.imshow(X_train_mod[some_num].reshape((28, 28)), cmap="binary")
plt.show()
plt.imshow(y_train_mod[some_num].reshape((28, 28)), cmap="binary")


knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[5000]])
plt.imshow(X_test_mod[5000].reshape((28, 28)), cmap="binary")
plt.show()
plt.imshow(clean_digit.reshape((28, 28)), cmap="binary")


That turned out pretty nice. Let’s go a step further and use the cleaned images as input to a model

new_in = []
for i in range(0, len(X_train_mod), 1000):
    print(i)
    new_in.append(knn_clf.predict(X_train_mod[i:i+1000, :]))

new_inn = np.vstack(new_in)
new_inn.shape

(60000, 784)

cross_val_score(sgd_clf, new_inn, y_train, cv=3, scoring="accuracy", verbose=4, n_jobs=4)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   3 out of   3 | elapsed:  1.4min remaining:    0.0s
[Parallel(n_jobs=4)]: Done   3 out of   3 | elapsed:  1.4min finished

array([0.9067 , 0.9008 , 0.90305])

# Eh, ok, so it was about the same
