Chapter 1 : The Machine Learning Landscape

Notes


Most of this we already know…

What is machine learning?

  • Machine learning is the process of enabling a machine to solve a particular problem without being explicitly programmed.

Why use machine learning?

  • Existing solutions to a problem are vast and tedious to program
  • There is no known solution or the solution is sufficiently complicated to be solved programmatically/ analytically
  • You expect data to change. Machine learning models are capable of adapting to new data.

What are the different types of machine learning?

  • Supervised Learning - features of data go in along with labelled output, training occurs, and the model predicts the output give new input data
  • Unsupervised learning - samples are not labelled, think clustering algorithms etc.
  • Semisupervised learning - A few of the samples are labelled (e.g. unlabelled samples could be classified as belonging to the same cluster as the labelled sample)
  • Reinforcement learning - An agent observes an environment and learns to perform actions which optimize for rewards over penalties

Methods for learning

  • Batch learning - An entire dataset is trained on, the model is launched into production, and it simply makes predictions on the new incoming data. The model doesn’t change/learn anymore, unless it is pulled offline and trained again on a new set of data. Becomes difficult if you have a lot of data or data is varying often.
  • Online learning - Learning occurs sequentially by consuming mini-batches of data. Drawbacks of this include bad data (where the model adjusts to quickly and performs poorly) or a learning rate which is too slow (model doesn’t adjust quick enough to new data)

Different ways the machine learning models generalize

  • Instance-based learning - Memorize learning instances (i.e. samples), determine how similar a new instance is to existing data, act accordingly
  • Model-based learning - You decide an equation, learn the free parameters of that equation, and then use the equation to make predictions on new data. Think linear regression

Challenges of machine learning

  • Insufficient data - If you don’t have enough data your model won’t generalize well
  • Nonrepresentative training data - Obviously if you don’t give it data similar to what you want it to predict, it won’t stand a chance.
  • Sampling bias - The data you collect is biased. That would be like me feeding in only data from only west coast teams to a sports algorithm because those are teams I care about the most (for instance).
  • Lousy data - Errors/inconsistencies in training data.
  • Irrelevant features - You feed in data that does not impact the result. Feature engineering is the practice of discerning which features/data to use.
  • Overfitting - Model doesn’t generalize to new data because it “memorizes” a perfect/near-perfect model for training data
  • Underfitting

Train/Test Split

  • Leave behind some of the training data to test as input into your model. This will give you insight into how your model generalizes. A common split is 80/20, train/test. If you are trying to determine the best hyperparameters it is useful to also make a 3rd partition of the data known as the validation set. That way you are not selecting hyperparameters based on the test set.