Machine Learning Basics¶
- "Generalization": Perform well on data it hasn't seen before
- Linear regression measures the error on the training data, not on the test data
- Only works if training and test data sets have the same distribution
- The "i.i.d. assumptions"
- Assume that all examples are independent, and training and test set are identically distributed
- A single data-generating distribution produces all training and test samples
- Optimizing for the training data will therefore also optimize for the test data
- Overfitting: Reach a low training error, but the test error is much larger
- i.e. no generalization, only "memorized" the training data
- Caused by giving the model too much capacity
- Capacity: Flexibility the system has
- e.g. for linear regression, increase capacity by allowing higher-order polynomials (see the sketch after this list)
- Sanity check for systems: if you can't get your system to overfit (by giving it lots of capacity), it doesn't learn properly
- Even with a perfect model, a reasonably complex system always has some residual error
- e.g. digital communication: noise makes it impossible to always decide correctly (between 1 and 0)
- "No Free Lunch" theorem: no machine learning algorithm is universally better than any other
- But assumes data of all possible distributions
- In ML, we always concentrate on a limited range of data
- i.e. there's no "general" good ML algorithm
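A minimal numpy sketch of the capacity idea (the synthetic sine data and the polynomial degrees are made up for illustration): fitting polynomials of increasing degree to a handful of noisy points drives the training error towards zero, while the test error grows once the capacity is too high.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic data set: a noisy sine wave, split into train and test.
x_train = np.sort(rng.uniform(0, 1, 10))
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + 0.1 * rng.normal(size=100)

for degree in (1, 3, 9):
    # Higher polynomial degree = more capacity (more free coefficients).
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

With degree 9 and only 10 training points the fit passes through every point (training error near zero) but generalizes poorly.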
Regularization¶
- $L^2$ regularization adds $\lambda \lVert w \rVert_2^2$ to the cost: all weights squared and summed up (see the sketch after this list)
- i.e. small weights are preferable for low cost
- If we pick $\lambda$ to be large, the weights will be almost zero to achieve minimal cost (underfitting)
- Also called a "penalty term"
- The goal is to reduce the generalization error but not the training error
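A minimal sketch of the penalty term, assuming a ridge-regression setting (the helper `ridge_fit` and the synthetic data are illustrative): the cost is the squared training error plus $\lambda \lVert w \rVert_2^2$, and larger $\lambda$ shrinks the weights towards zero.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2 (closed-form ridge solution)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

for lam in (0.0, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    # Larger lambda -> smaller weight norm -> eventually underfitting.
    print(f"lambda = {lam:6.1f}   ||w|| = {np.linalg.norm(w):.3f}")
```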
Hyperparameters¶
- $\lambda$ is a "hyperparameter"
- parameters that are not learned by the scheme
- to tune these parameters, we set aside part of the training data as "validation data" (see the sketch after this list)
- After we've used a test set once, it is "tainted" and should theoretically not be used anymore
- in practice it is hard to have enough data for this
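A sketch of tuning $\lambda$ on a validation split (again with an illustrative closed-form ridge fit; the split sizes and candidate values are arbitrary): each candidate is fit on the training part, compared by validation error, and the test set is evaluated only once at the very end.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: minimize ||X w - y||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 8))
y = X @ rng.normal(size=8) + 0.5 * rng.normal(size=120)

# Split: 60% train, 20% validation (for tuning lambda), 20% test (touched once).
X_tr, X_val, X_te = X[:72], X[72:96], X[96:]
y_tr, y_val, y_te = y[:72], y[72:96], y[96:]

# Pick the hyperparameter with the lowest validation error.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(candidates,
               key=lambda lam: mse(X_val, y_val, ridge_fit(X_tr, y_tr, lam)))

# Only now evaluate on the test set, and only once.
w = ridge_fit(X_tr, y_tr, best_lam)
print(f"best lambda: {best_lam}, test MSE: {mse(X_te, y_te, w):.4f}")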
Estimators, Bias, Variance¶
- Estimator ("Schätzer"): a function of the data used to estimate an underlying quantity
- Bias: difference between the expected value of the estimator (the "guess") and the true value
- Estimator is unbiased: It is on average right ("Erwartungstreu")
- Variance of the estimator: how much the estimate varies between different data samples
- "SE": standard error, the standard deviation of the estimator, $\operatorname{SE}(\hat{\theta}) = \sqrt{\operatorname{Var}(\hat{\theta})}$
- The variance of the sample mean is $\operatorname{Var}(\hat{\mu}) = \sigma^2 / n$, i.e. the standard error is $\sigma / \sqrt{n}$ (see the sketch after this list)
- i.e. lots of data reduces the variance of the estimator
- Consistency: In the limit, the estimator "guesses the right value"
- Variance goes towards zero
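A small numpy experiment illustrating the standard error (the sample sizes and distribution parameters are arbitrary): the spread of the sample mean across repeated experiments matches $\sigma/\sqrt{n}$ and shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma = 2.0, 3.0

for n in (10, 100, 10_000):
    # Repeat the experiment 1000 times to measure the estimator's spread.
    sample_means = rng.normal(true_mean, sigma, size=(1000, n)).mean(axis=1)
    print(f"n = {n:6d}   empirical SE = {sample_means.std():.4f}"
          f"   sigma/sqrt(n) = {sigma / np.sqrt(n):.4f}")
```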
Maximum Likelihood Estimation¶
- The model parameters $\theta$ are free variables
- e.g. the weights of a neural network
- Find $\theta$ such that the likelihood that the model produces the observed ("correct") values is maximized (see the sketch after this list)
- Usually, the log of the likelihood is maximized (products over examples become sums, which is numerically more stable)
- KL divergence: measures how similar two distributions are
- Goal: minimize the KL divergence between the data distribution and the model distribution, which is equivalent to maximizing the likelihood
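A small sketch of maximum likelihood estimation for the mean of a Gaussian with known variance (the grid search is just the simplest way to show the idea; the data is synthetic): minimizing the negative log-likelihood recovers the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=500)  # observations with unknown mean

def neg_log_likelihood(mu, x, sigma=1.0):
    """Negative log-likelihood of i.i.d. Gaussian data with known sigma."""
    return np.sum(0.5 * ((x - mu) / sigma) ** 2 + 0.5 * np.log(2 * np.pi * sigma ** 2))

# Evaluate the NLL on a grid of candidate means and take the minimizer.
grid = np.linspace(0.0, 3.0, 601)
nll = np.array([neg_log_likelihood(mu, data) for mu in grid])
mu_mle = grid[np.argmin(nll)]

print(f"MLE of the mean: {mu_mle:.3f}   sample mean: {data.mean():.3f}")
```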
Todo
Chapters 5.7 and 5.8 (supervised vs unsupervised learning)
Stochastic Gradient Descent¶
- Gradient descent using only a small sample of the data at each step
- Negative log-likelihood is the cost function
- Minimizing this cost = maximizing likelihood
- Computing the gradient over all of the millions of data points would take forever
- Use a small subset of the data as an estimate of the gradient (see the sketch after this list)
- This subset is called a "minibatch"
- Has to be chosen truly randomly from the whole data set
- learning rate: step size for gradient descent
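A minimal minibatch SGD sketch for linear regression (batch size, learning rate and the synthetic data are arbitrary illustration values): every step draws a random minibatch, computes the gradient of the squared-error cost on it, and takes a small step in the negative gradient direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data set.
X = rng.normal(size=(10_000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=10_000)

w = np.zeros(3)          # parameters to be learned
learning_rate = 0.1      # step size for gradient descent
batch_size = 32

for step in range(2_000):
    # Minibatch: indices drawn uniformly at random from the whole data set.
    idx = rng.integers(0, len(X), size=batch_size)
    X_b, y_b = X[idx], y[idx]
    # Gradient of the mean squared error on the minibatch only.
    grad = (2.0 / batch_size) * X_b.T @ (X_b @ w - y_b)
    w -= learning_rate * grad

print("learned weights:", np.round(w, 2))  # should be close to [2, -1, 0.5]
```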
Building a ML algorithm¶
- Only becomes useful once you have a lot of data