Chapter 4: Training Models

Exercises


Exercise 1

If you have a training set with millions of features, you can use any flavor of Gradient Descent: Batch (provided the training set fits in memory), Stochastic, or Mini-batch. The Normal Equation and the SVD approach are poor choices here because their cost grows very quickly with the number of features.

Exercise 2

All forms of Gradient Descent suffer when features have very different scales: they converge more slowly because the descent takes a circuitous route to the minimum, as Figure 4-7 illustrates. I like to picture the cost contours as an elongated ellipse: Gradient Descent heads toward the semi-major axis first, and once it reaches it, it travels along the axis toward the minimum. To fix this, standardize or min-max scale the features before training, as in the sketch below.
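
A minimal sketch of the fix (the toy data, pipeline, and SGDRegressor hyperparameters below are just illustrative choices, not from the book's code):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
import numpy as np

# Hypothetical toy data with two features on wildly different scales
rng = np.random.default_rng(42)
X = np.hstack([rng.normal(0, 1, (100, 1)), rng.normal(0, 1000, (100, 1))])
y = X[:, 0] + 0.001 * X[:, 1] + rng.normal(0, 0.1, 100)

# StandardScaler gives every feature zero mean and unit variance, so the
# cost contours become roughly circular and Gradient Descent stops zigzagging
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
model.fit(X, y)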

Exercise 3

No, the log loss cost function for Logistic Regression is convex, so there's no need to worry about Gradient Descent getting stuck in a local minimum.
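
For reference, the log loss being minimized is (writing $\hat{p}^{(i)} = \sigma(\boldsymbol{\theta}^T \mathbf{x}^{(i)})$ for the predicted probability):

$$J(\boldsymbol{\theta}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{p}^{(i)}\right) \right]$$

It is convex in $\boldsymbol{\theta}$, so with a reasonable learning rate and enough iterations Gradient Descent will reach the global minimum.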

Exercise 4

Not all Gradient Descent algorithms lead to the exact same model, even if you let them run for a long time. In practice the models end up close, but Batch Gradient Descent converges smoothly to a specific minimum, whereas Stochastic Gradient Descent and Mini-batch Gradient Descent keep "bouncing" around the global minimum. If you gradually lower the learning rate, though, their solutions get closer and closer to the one Batch Gradient Descent finds.
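
What "gradually lower the learning rate" can look like in code, a minimal sketch (the schedule and the constants t0, t1 are illustrative choices, similar in spirit to the one in the chapter):

t0, t1 = 5, 50  # schedule hyperparameters (illustrative values)

def learning_schedule(t):
    # The learning rate shrinks as training progresses, so the "bouncing"
    # around the global minimum gets smaller and smaller
    return t0 / (t + t1)

print([round(learning_schedule(t), 4) for t in (0, 10, 100, 1000)])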

Exercise 5

If the validation error goes up at every epoch, you are likely overfitting. There are various ways to prevent overfitting, including: use a less complex model, apply Ridge, Lasso, or Elastic Net regularization, or increase the size of your training set. Geron notes that the learning rate could also simply be too high; that would definitely be the case if the training error is going up as well.

Exercise 6

Typically, to implement early stopping you would use some patience factor, so I don't think it's a good idea to stop immediately when the validation error goes up. For example, the validation error could be decreasing consistently, increase for one epoch, and then decrease again for the next 10 epochs. The patience factor says something like: "if _ epochs go by without a decrease in the validation error, then stop (and revert to the model with the minimum validation error)". A rough sketch of this loop follows.
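
Here is a runnable sketch of that patience logic (the data is synthetic and the SGDRegressor hyperparameters are arbitrary; max_iter=1 with warm_start=True is just a convenient way to train one epoch per .fit() call):

import copy
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical synthetic data, just to make the sketch runnable
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)
X_tr, X_va, y_tr, y_va = X[:150], X[150:], y[:150], y[150:]

# max_iter=1 + warm_start=True means each .fit() call runs one more epoch
sgd = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                   learning_rate="constant", eta0=0.005)

patience = 10
best_val_error, epochs_without_improvement, best_model = float("inf"), 0, None
for epoch in range(1000):
    sgd.fit(X_tr, y_tr)
    val_error = mean_squared_error(y_va, sgd.predict(X_va))
    if val_error < best_val_error:
        best_val_error = val_error
        best_model = copy.deepcopy(sgd)   # remember the best model seen so far
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # patience exhausted: stop and fall back to best_model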

Exercise 7

The Normal Equation is fast when the number of features is low: its cost is roughly linear in the number of samples but grows quickly with the number of features, and it computes the exact solution directly rather than converging toward it. Batch Gradient Descent is slow for a large number of samples but largely unaffected by the number of features, and it converges to the exact solution. Stochastic Gradient Descent and Mini-batch Gradient Descent are both fast, but they require the learning rate to be gradually decreased so that they actually converge.
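
For completeness, the exact solution the Normal Equation computes is $\hat{\boldsymbol{\theta}} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}$, which is only a couple of lines of NumPy (toy data, purely illustrative):

import numpy as np

# Toy data: y = 3 + 2*x1 - 1*x2 + 0.5*x3 + small noise (hypothetical)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 3.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

X_b = np.hstack([np.ones((len(X), 1)), X])          # add the bias column
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y  # Normal Equation
print(theta_hat)  # should be close to [3.0, 2.0, -1.0, 0.5]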

Exercise 8

If there is a large gap between the training error and the validation error in Polynomial Regression, the model is overfitting. Three ways to address it: increase the size of the dataset, apply regularization, or use a less complex model. Nailed this answer : )
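
One way to actually see that gap is to plot learning curves; here's a quick sketch with scikit-learn's learning_curve on made-up quadratic data deliberately overfit with a degree-10 polynomial (all of the parameter choices are arbitrary):

import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical quadratic data fitted with a degree-10 polynomial (overfits)
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(scale=1.0, size=200)

model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())
sizes, train_scores, valid_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    scoring="neg_mean_squared_error")

# A persistent gap (training error much lower than validation error)
# is the signature of overfitting
print(-train_scores.mean(axis=1))
print(-valid_scores.mean(axis=1))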

Exercise 9

If the training error and validation error are almost equal and fairly high, the model is underfitting (high bias). You should reduce the regularization hyperparameter $\alpha$.

Exercise 10

  • Ridge Regression instead of Linear Regression?
    • You want to prevent overfitting (high variance) in your model.
  • Lasso instead of Ridge Regression?
    • You want to completely eliminate the impact of the least important features instead of just penalizing them.
  • Elastic Net instead of Lasso?
    • You want to reduce the complexity of your model and only keep the most important features, but want to avoid Lasso's erratic behavior (when the number of features exceeds the number of training instances, or when several features are strongly correlated).
    • Good tip from Geron: if you want Lasso's behavior without these downsides, just use Elastic Net with an l1_ratio close to 1 (see the sketch below).
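
In scikit-learn all three are drop-in replacements for LinearRegression; a quick sketch (the alpha and l1_ratio values here are arbitrary):

from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)                       # l2 penalty: shrinks all weights
lasso = Lasso(alpha=0.1)                       # l1 penalty: can zero weights out entirely
elastic = ElasticNet(alpha=0.1, l1_ratio=0.9)  # mostly l1, with a little l2 to tame
                                               # Lasso's erratic behavior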

Exercise 11

If you want to classify pictures as outdoor/indoor and daytime/nighttime, these are two independent binary labels, so you should implement two Logistic Regression classifiers instead of one Softmax Regression classifier. Softmax Regression is multi-class, not multi-output.
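
A tiny sketch of that setup (the feature matrix and both label vectors below are hypothetical stand-ins):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: 100 "pictures" described by 20 features each,
# with two independent binary labels per picture
rng = np.random.default_rng(3)
X_pictures = rng.normal(size=(100, 20))
y_outdoor = rng.integers(0, 2, size=100)
y_daytime = rng.integers(0, 2, size=100)

# One binary classifier per label: a multi-output setup, not multi-class
outdoor_clf = LogisticRegression().fit(X_pictures, y_outdoor)
daytime_clf = LogisticRegression().fit(X_pictures, y_daytime)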

Exercise 12

Implement Batch Gradient Descent with early stopping for Softmax Regression

# This sounds difficult, but let's give it a shot
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
X, y = iris["data"], iris["target"]

# Stack features and target into one array so they get shuffled together
arr = np.hstack((X, y[:, np.newaxis]))

# Shuffle, then split 70% / 20% / 10% into train / validation / test
np.random.seed(5)
p = np.random.permutation(len(arr))
arr = arr[p]
s = (np.array([0.7, 0.2, 0.1]) * len(arr)).astype(int)
s = np.cumsum(s)
print(s)
X_train, y_train = arr[0:s[0], :-1], arr[0:s[0], -1][:, np.newaxis]
print(X_train.shape, y_train.shape)
X_val, y_val = arr[s[0]:s[1], :-1], arr[s[0]:s[1], -1][:, np.newaxis]
print(X_val.shape, y_val.shape)
X_test, y_test = arr[s[1]:s[2], :-1], arr[s[1]:s[2], -1][:, np.newaxis]
print(X_test.shape, y_test.shape)
[105 135 150]
(105, 4) (105, 1)
(30, 4) (30, 1)
(15, 4) (15, 1)
def softmax_eval(X, y, theta):
    # Non-vectorized softmax cross-entropy loss:
    # J = -(1/m) * sum_i sum_k y_k^(i) * log(p_k^(i))
    y = y[:, 0]
    m = X.shape[0]
    K = len(set(y))
    l = 0
    for k in range(K):
        y_tmp = (y == k).astype(int)  # one-vs-rest target for class k
        for i in range(m):
            s_k = theta[:, k].T @ X[i]  # softmax score for class k
            # normalize over the scores of all classes (note: j, not k,
            # to avoid shadowing the outer loop variable)
            p_k = np.exp(s_k) / sum(np.exp(theta[:, j].T @ X[i]) for j in range(K))
            l += y_tmp[i] * np.log(p_k)
    l = (-1 / m) * l
    return l
        

def train(X, y, iters=1000, lr=0.01):
    # Non-vectorized Batch Gradient Descent for Softmax Regression
    y = y[:, 0]
    m = X.shape[0]
    K = len(set(y))
    theta = np.ones((X.shape[1], K))  # one column of parameters per class
    eta = lr
    for iteration in range(iters):
        for k in range(K):
            y_tmp = (y == k).astype(int)  # one-vs-rest target for class k
            grad_k = 0
            for i in range(m):
                s_k = theta[:, k].T.dot(X[i])
                # again use j in the inner sum to avoid shadowing k
                p_k = np.exp(s_k) / sum(np.exp(theta[:, j].T @ X[i]) for j in range(K))
                grad_k += (p_k - y_tmp[i]) * X[i]
            grad_k = (1 / m) * grad_k
            theta[:, k] = theta[:, k] - eta * grad_k  # gradient step for class k
        if (iteration + 1) % 100 == 0:
            l = softmax_eval(X, y.reshape((-1, 1)), theta)
            print("Loss: ", round(l, 2))
    return theta

train(X_train, y_train)
Loss:  0.73
Loss:  0.6
Loss:  0.54
Loss:  0.5
Loss:  0.47
Loss:  0.44
Loss:  0.42
Loss:  0.41
Loss:  0.39
Loss:  0.38
array([[ 1.3326466 ,  1.17699356,  0.49879875],
       [ 1.84638559,  0.75029511,  0.40886567],
       [-0.19686822,  1.2119394 ,  1.98789651],
       [ 0.44451838,  0.84531752,  1.71079481]])

This was my first attempt… looking at Geron's solution, there's still a little to do here. The main difference is that he vectorized everything. He also added the bias term, which I forgot to do.
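
For reference, the cost function and per-class gradient that the vectorized version implements are the standard softmax cross-entropy ones:

$$J(\boldsymbol{\Theta}) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log\left(\hat{p}_k^{(i)}\right), \qquad \nabla_{\boldsymbol{\theta}^{(k)}} J = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{p}_k^{(i)} - y_k^{(i)}\right) \mathbf{x}^{(i)}$$

With the bias column folded into $\mathbf{X}$ and the targets one-hot encoded into $\mathbf{Y}$, the per-class gradients stack into the single matrix expression $\frac{1}{m}\mathbf{X}^T(\hat{\mathbf{P}} - \mathbf{Y})$, which is exactly the grad line in train_vectorized below.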

def one_hot(y):
    # Convert a column vector of class indices into a one-hot matrix
    K = len(set(y[:, 0]))
    _, j = np.indices((len(y), K))   # j[i, k] == k for every row i
    return (j == y).astype(int)      # broadcast y (m, 1) against j (m, K)
def cross_entropy_loss(X, y, theta):
    # Vectorized cross-entropy: X already has the bias column, y is one-hot
    m = X.shape[0]
    s = X.dot(theta)                                          # scores, shape (m, K)
    p = np.exp(s) / np.sum(np.exp(s), axis=1)[:, np.newaxis]  # softmax probabilities
    return (-1 / m) * np.sum(y * np.log(p))

def accuracy(X, y, theta):
    # Fraction of samples whose highest-probability class matches the one-hot label
    s = X.dot(theta)
    p = np.exp(s) / np.sum(np.exp(s), axis=1)[:, np.newaxis]
    pmax = np.argmax(p, axis=1)            # predicted class index per sample
    preds = y[range(y.shape[0]), pmax]     # 1 if the prediction is correct, else 0
    return sum(preds) / len(preds)
    


def train_vectorized(X, y, X_val, y_val, iters=1000001, lr=0.01, patience=5):
    # Add bias column to X and X_val
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    X_val = np.hstack((np.ones((X_val.shape[0], 1)), X_val))

    # One-hot encode the targets
    y = one_hot(y)
    y_val = one_hot(y_val)

    m = X.shape[0]
    K = y.shape[1]

    # Initial guess for theta
    theta = np.ones((X.shape[1], K))
    eta = lr
    best_val_loss = np.inf
    best_theta = theta.copy()
    patience_cnt = 0

    for iteration in range(iters):
        # s_k = theta_k^T \cdot x for every sample and class at once
        s = X.dot(theta)
        p = np.exp(s) / np.sum(np.exp(s), axis=1)[:, np.newaxis]  # softmax
        grad = (1 / m) * X.T.dot(p - y)                           # batch gradient
        theta -= eta * grad

        val_loss = cross_entropy_loss(X_val, y_val, theta)
        if ((iteration + 1) % 1000) == 0:
            print(iteration + 1, val_loss)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_theta = theta.copy()   # copy! theta keeps being updated in place
            patience_cnt = 0
        else:
            patience_cnt += 1
            if patience_cnt >= patience:
                print("Early Stopping")
                return best_theta, best_val_loss
    print("Warning no Early Stopping")
    return best_theta, best_val_loss
            
        
theta, _ = train_vectorized(X_train, y_train, X_val, y_val)

1000 0.3342596544930866
2000 0.25429893944101895
3000 0.21449143442540694
4000 0.19125970341767579
5000 0.17637229776556052
6000 0.16619324074143058
7000 0.15889600701967801
8000 0.1534742080223439
9000 0.1493327048861036
10000 0.14609930897339848
11000 0.1435304671922347
12000 0.14146072082146635
13000 0.13977407415470203
14000 0.1383870025727842
15000 0.13723796172201708
16000 0.136280686463588
17000 0.13547978213770803
18000 0.1348077477661006
19000 0.1342429194633856
20000 0.13376802026256793
21000 0.1333691186414038
22000 0.13303486811330995
23000 0.13275594366356386
24000 0.1325246183485923
25000 0.13233444121623
26000 0.13217998948842136
27000 0.1320566758700738
28000 0.13196059726122655
29000 0.13188841490471628
30000 0.1318372586424963
31000 0.13180464983522938
32000 0.1317884388562891
Early Stopping
# Add bias to X
X_test = np.hstack((np.ones((X_test.shape[0], 1)),X_test))

# One hot y
y_test = one_hot(y_test)
accuracy(X_test, y_test, theta)
1.0
s = X_test.dot(theta)
p = np.exp(s) / np.sum(np.exp(s), axis=1)[:, np.newaxis]
for i in range(len(p)):
    print(np.round(p[i], 3), y_test[i])
[0.    0.124 0.876] [0 0 1]
[0.001 0.983 0.015] [0 1 0]
[0.98 0.02 0.  ] [1 0 0]
[0.004 0.994 0.002] [0 1 0]
[0.992 0.008 0.   ] [1 0 0]
[0.996 0.004 0.   ] [1 0 0]
[0.003 0.986 0.011] [0 1 0]
[0.    0.004 0.996] [0 0 1]
[0.994 0.006 0.   ] [1 0 0]
[0.    0.038 0.962] [0 0 1]
[0.98 0.02 0.  ] [1 0 0]
[0.    0.948 0.051] [0 1 0]
[0.    0.003 0.997] [0 0 1]
[0. 0. 1.] [0 0 1]
[0.003 0.979 0.018] [0 1 0]

I had an error in my code where the variable m in cross_entropy_loss was being taken from the global scope (since it wasn't defined inside the function). As a result, the validation loss was constantly decreasing, which was pretty annoying: how are you supposed to implement early stopping if the validation loss never goes up? Anyway, I'm glad I got this working eventually :). One thing I learned is how to convert a sum into vectorized code. Here's a quick demo of that.

X_train_bias = np.hstack((np.ones((X_train.shape[0], 1)), X_train))
print(
    "X shape:",
    X_train_bias.shape,
    "\ny shape:",
    y_train.shape,
    "\n" "theta shape:",
    theta.shape,
)

# Number of features in the training set should be equal to the number of fitted parameters in theta
assert X_train_bias.shape[1] == theta.shape[0]

# Number of classes should be equal to the number of columns in theta
assert len(set(y_train[:, 0])) == theta.shape[1]


def longway(X, y, theta):
    # Gradient written exactly as the summation: for every (j, k) entry,
    # sum the residual times the j-th feature over all m samples
    dtheta = np.zeros(theta.shape)
    m = len(X)
    J = theta.shape[0]   # number of parameters per class (including bias)
    K = theta.shape[1]   # number of classes (columns of theta)
    for k in range(K):
        for j in range(J):
            s = 0
            for i in range(m):
                s += (theta[:, k].dot(X[i]) - y[i, 0]) * X[i, j]
            dtheta[j, k] = (2 / m) * s
    return dtheta


def shortway(X, y, theta):
    # Same gradient as longway, but the triple loop collapses into a single
    # matrix product: (2/m) * X^T (X theta - y)
    m = len(X)
    return (2 / m) * X.T.dot(X.dot(theta) - y)
X shape: (105, 5) 
y shape: (105, 1) 
theta shape: (5, 3)
# These are equivalent methods
# Obviously the vectorized one is easier to code and understand
# although it takes some thought to write it out

np.array_equal(shortway(X_train_bias, y_train, theta).round(4), 
               longway(X_train_bias, y_train, theta).round(4))
True
# And the shortway is much faster!!
%timeit longway(X_train_bias, y_train, theta)
%timeit shortway(X_train_bias, y_train, theta)
2.69 ms ± 59.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
26.5 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)