Chapter 1: The Machine Learning Landscape
Exercises
- Machine learning is the process of enabling machines (i.e. computers) to solve problems without being explicitly programmed. This is done by providing data and letting the machine “decide” the optimal solution to the problem (via some loss function).
- It shines when solutions to a problem exist but are too complicated to program explicitly, when no known algorithmic solution exists, when you expect incoming data to change and want a solution robust enough to adapt, and when you want to gain insight into the problem by inspecting what the model learned (i.e. dissecting the black box).
- When each training instance comes with the result (i.e. label) that you expect the model to predict.
- Classification and regression (i.e. numerical prediction).
- Anomaly detection, dimensionality reduction, association rule learning, clustering.
- Reinforcement learning
- Clustering algorithm
- Spam detection is a supervised learning problem.
- When a model’s parameters are updated incrementally as new data arrives.
- When you use online learning because data can’t fit into memory.
- Instance-based model
- A hyperparameter belongs to the learning algorithm: it is set before training and stays constant throughout, whereas the model parameters are what training adjusts.
- They search for a function (e.g. a linear or non-linear combination of the input features) that gives accurate predictions. They tune the model parameters by minimizing a cost function (or, equivalently, maximizing a utility function). To make predictions you feed new data into the trained model and it returns a prediction.
- Overfitting, lousy data, underfitting, and not having a sufficient amount of data.
- Overfitting. Fixes: gather more data, impose regularization, or provide a more representative sample of training data.
- A test set gives an estimate of how accurate the model will be in production, since the model has never “seen” those samples during training.
- The validation set allows you to tweak hyperparameters of the model while still maintaining the purity of the test set.
- The model may not generalize well when put in production since hyperparameters were fitted/chosen according to the test set. You’d probably see worse performance in production.
- I don’t think cross validation was very well explained. You make a test/train split, and instead of splitting the training data again into a fixed validation/train split, you pass over the training data multiple times, each time holding out a different (non-overlapping) portion for validation and training on the rest. You then average the validation loss over all the iterations. In this way every sample in the train set is used for both training and validation, while you still reap the benefit of having a validation set (hyperparameter tuning) and the test set stays untouched.
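
To make the last few answers concrete, here is a minimal sketch of k-fold cross validation used to pick a hyperparameter, written with only the Python standard library. Everything in it is an assumption made for illustration and not something from the book: the synthetic data (`y = 2x + 1` plus noise), the one-feature ridge model, the candidate `alphas`, and the helper names `fit_ridge_1d`, `k_fold_indices` and `cross_val_mse`. Here `alpha` is the hyperparameter (fixed during each fit), while `w` and `b` are the model parameters found by minimizing the cost function.

```python
import random


def fit_ridge_1d(xs, ys, alpha):
    """Fit y ~ w*x + b by minimizing MSE + alpha * w**2 (closed form).

    alpha is a hyperparameter; w and b are the model parameters.
    alpha = 0 reduces to ordinary least squares.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / (sxx + alpha * n)
    b = mean_y - w * mean_x
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the line w*x + b on the given samples."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k non-overlapping validation folds."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_val_mse(xs, ys, alpha, k=5):
    """Average validation MSE over k folds of the training data."""
    scores = []
    for val_idx in k_fold_indices(len(xs), k):
        held_out = set(val_idx)
        fit_x = [x for i, x in enumerate(xs) if i not in held_out]
        fit_y = [y for i, y in enumerate(ys) if i not in held_out]
        val_x = [xs[i] for i in val_idx]
        val_y = [ys[i] for i in val_idx]
        w, b = fit_ridge_1d(fit_x, fit_y, alpha)
        scores.append(mse(val_x, val_y, w, b))
    return sum(scores) / k


# Synthetic data (assumption for the example): y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

# Test/train split: the test set is set aside and only used once, at the end.
train_x, test_x = xs[:80], xs[80:]
train_y, test_y = ys[:80], ys[80:]

# Choose the hyperparameter using cross validation on the training data only.
alphas = [0.0, 0.1, 1.0, 10.0]
cv_scores = {a: cross_val_mse(train_x, train_y, a, k=5) for a in alphas}
best_alpha = min(cv_scores, key=cv_scores.get)

# Refit on the whole training set, then estimate generalization error once.
final_w, final_b = fit_ridge_1d(train_x, train_y, best_alpha)
test_mse = mse(test_x, test_y, final_w, final_b)
print("best alpha:", best_alpha)
print("test MSE:", test_mse)
```

Because `alpha` is selected from the cross-validation scores alone, the test set keeps its “purity” and `test_mse` remains an honest estimate. Choosing `alpha` by looking at `test_mse` instead would be the mistake described a few answers above.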