I am still in catch-up mode, but fortunately these seem to be the last lectures.

**Anomaly Detection**

- Problem Motivation – anomaly here means much the same as outlier, I think, or in certain contexts an error or defect.
- Gaussian Distribution – a.k.a. the normal distribution.
- Algorithm – the algorithm fits a Gaussian to each feature, multiplies the per-feature densities to get a probability p(x) for an example, and flags the example as an outlier when p(x) falls below a chosen threshold ε. (Apparently the Gaussian assumption is OK. I am not so sure myself. I believe that there are other/better ways to find outliers.)
- Developing and Evaluating an Anomaly Detection System
- Anomaly Detection vs. Supervised Learning
- Choosing What Features to Use
- Multivariate Gaussian Distribution
- Anomaly Detection using the Multivariate Gaussian Distribution
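
The density-threshold idea from the Algorithm lecture can be sketched in a few lines of plain Python. This is only a rough illustration under the assumption of independent Gaussian features. The function names (`fit_gaussian`, `density`, `find_anomalies`), the toy data and the value of `epsilon` are all made up here; in practice the threshold would be tuned on a labeled validation set.

```python
import math

def fit_gaussian(rows):
    """Estimate the mean and variance of each feature (dividing by m)."""
    m, n = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / m for j in range(n)]
    var = [sum((r[j] - mu[j]) ** 2 for r in rows) / m for j in range(n)]
    return mu, var

def density(x, mu, var):
    """p(x) as the product of one univariate Gaussian density per feature."""
    p = 1.0
    for xj, mj, vj in zip(x, mu, var):
        p *= math.exp(-((xj - mj) ** 2) / (2 * vj)) / math.sqrt(2 * math.pi * vj)
    return p

def find_anomalies(rows, mu, var, epsilon):
    """Return the rows whose density falls below the threshold epsilon."""
    return [x for x in rows if density(x, mu, var) < epsilon]

# Hypothetical toy data: two features measured on "normal" examples.
train = [[10.0, 5.0], [10.2, 5.1], [9.8, 4.9], [10.1, 5.0], [9.9, 5.0]]
mu, var = fit_gaussian(train)

candidates = [[10.0, 5.0], [14.0, 2.0]]
epsilon = 1e-3  # assumed threshold, not a recommended value
anomalies = find_anomalies(candidates, mu, var, epsilon)
print(anomalies)
```

The second candidate lies far from the training data in both features, so its density is effectively zero and it ends up in `anomalies`, while the first one sits right at the mean.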

**Recommender Systems**

- Problem Formulation – this lecture answers the age-old question of how to recommend movies.
- Content Based Recommendations – based on the content of the movies. We should have values for the degree of romance or action (features) in the motion picture. For example, “Capitalism: A Love Story” is clearly a romantic movie. So it will have the value 5 out of 5 for romance.
- Collaborative Filtering – in this scheme the content feature values do not have to be given up front. We learn them from the ratings, together with the user preferences.
- Collaborative Filtering Algorithm – first we initialize the movie features and the user parameters to small random values. Then we minimize the cost function over both at the same time, for example with gradient descent. Finally we use the result to predict the missing ratings (recommend).
- Vectorization: Low Rank Matrix Factorization
- Implementational Detail: Mean Normalization
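
The three steps of the collaborative filtering algorithm, plus mean normalization, might look roughly like this. It is a minimal sketch in plain Python rather than the vectorized version from the lectures. The ratings table is invented, `None` marks a missing rating, and the helper names and hyperparameters (`n_features`, `lam`, `alpha`, `iters`) are assumptions chosen for this toy example.

```python
import random

def mean_normalize(Y):
    """Subtract each movie's mean rating, computed over rated entries only."""
    means, Ynorm = [], []
    for row in Y:
        rated = [y for y in row if y is not None]
        mu = sum(rated) / len(rated) if rated else 0.0
        means.append(mu)
        Ynorm.append([None if y is None else y - mu for y in row])
    return Ynorm, means

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def cost(Y, X, Theta, lam):
    """Half the squared error over rated entries plus L2 regularization."""
    err = sum((dot(X[i], Theta[j]) - y) ** 2
              for i, row in enumerate(Y) for j, y in enumerate(row)
              if y is not None)
    reg = sum(v * v for row in X + Theta for v in row)
    return 0.5 * err + 0.5 * lam * reg

def train(Y, n_features=2, lam=0.1, alpha=0.02, iters=1000, seed=0):
    rng = random.Random(seed)
    small = lambda: [rng.uniform(-0.1, 0.1) for _ in range(n_features)]
    X = [small() for _ in Y]         # step 1: one feature vector per movie
    Theta = [small() for _ in Y[0]]  # ... and one preference vector per user
    history = [cost(Y, X, Theta, lam)]
    for _ in range(iters):           # step 2: gradient descent on both
        gX = [[lam * v for v in row] for row in X]
        gT = [[lam * v for v in row] for row in Theta]
        for i, row in enumerate(Y):
            for j, y in enumerate(row):
                if y is None:
                    continue         # missing ratings do not contribute
                e = dot(X[i], Theta[j]) - y
                for k in range(n_features):
                    gX[i][k] += e * Theta[j][k]
                    gT[j][k] += e * X[i][k]
        # Update movies and users simultaneously from the same gradients.
        X = [[v - alpha * g for v, g in zip(r, gr)] for r, gr in zip(X, gX)]
        Theta = [[v - alpha * g for v, g in zip(r, gr)] for r, gr in zip(Theta, gT)]
        history.append(cost(Y, X, Theta, lam))
    return X, Theta, history

def predict(X, Theta, means, movie, user):
    """Step 3: predicted rating, with the movie's mean added back."""
    return dot(X[movie], Theta[user]) + means[movie]

# Invented ratings: rows are movies, columns are users, None = not rated.
Y = [[5, 5, 0, 0],
     [5, None, None, 0],
     [None, 4, 0, None],
     [0, 0, 5, 4],
     [0, 0, 5, None]]
Ynorm, means = mean_normalize(Y)
X, Theta, history = train(Ynorm)
print(predict(X, Theta, means, 1, 1), predict(X, Theta, means, 1, 2))
```

Thanks to mean normalization, a user who has rated nothing would be predicted each movie's average rating instead of zero.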

**Large Scale Machine Learning**

- Learning With Large Datasets – having lots of training examples is problematic, because every single step of (batch) gradient descent requires us to sum over all of them.
- Stochastic Gradient Descent – perform a random/drunk walk downhill, updating the parameters after each training example instead of after evaluating all of them.
- Mini-Batch Gradient Descent – update the parameters using small batches of training examples (the batch size depends on how well your setup vectorizes or parallelizes) to make progress faster.
- Stochastic Gradient Descent Convergence – a single step of stochastic gradient descent does not always go downhill, so it’s best to check convergence by plotting the cost averaged over the most recent examples (say, the last 1000).
- Online Learning – online here means on the fly: the model learns from a continuous stream of examples, using each one once and then discarding it.
- Map Reduce and Data Parallelism
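
To make the difference between the gradient descent variants concrete, here is a small sketch that fits a straight line with mini-batches. The function name `minibatch_sgd`, the noiseless toy data and the hyperparameters are all assumptions for illustration. Setting `batch_size=1` gives plain stochastic gradient descent, and `batch_size=len(xs)` gives batch gradient descent.

```python
import random

def minibatch_sgd(xs, ys, batch_size=1, alpha=0.1, epochs=200, seed=0):
    """Fit y ~ w*x + b by gradient descent on shuffled mini-batches."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    avg_costs = []  # one averaged cost per pass, handy for a convergence plot
    for _ in range(epochs):
        rng.shuffle(idx)  # visit the examples in random order
        epoch_cost = 0.0
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            gw = gb = 0.0
            for i in batch:
                err = w * xs[i] + b - ys[i]
                epoch_cost += 0.5 * err * err  # cost measured before the update
                gw += err * xs[i]
                gb += err
            # One parameter update per mini-batch, not per full pass.
            w -= alpha * gw / len(batch)
            b -= alpha * gb / len(batch)
        avg_costs.append(epoch_cost / len(xs))
    return w, b, avg_costs

# Toy data lying exactly on the line y = 3x + 1.
xs = [i / 10 for i in range(20)]
ys = [3 * x + 1 for x in xs]
w, b, costs = minibatch_sgd(xs, ys, batch_size=4)
print(w, b, costs[0], costs[-1])
```

Plotting `costs` is the kind of convergence check the lecture recommends. Because the toy data is noiseless, the fitted `w` and `b` end up very close to 3 and 1; on real data the parameters would keep wandering around the minimum.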

**Application Example: Photo OCR**

- Problem Description and Pipeline – the problem is to recognize text in photos.
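- Sliding Windows – a small fixed-size rectangular window is moved across the photo in steps, and each patch it covers is checked for text.
- Getting Lots of Data and Artificial Data – this is similar to bootstrapping, I guess. If you don’t have enough data, you can synthesize new examples or add distortions and noise to the ones you have.
- Ceiling Analysis: What Part of the Pipeline to Work on Next – this analysis replaces the output of each pipeline stage in turn with the correct answer and measures how much the overall accuracy improves. That helps you focus on the stage where the effort pays off the most.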
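
The sliding window scan is easy to sketch. The helper names `sliding_windows` and `crop`, the image size and the step are hypothetical, and the classifier that would decide whether a patch contains text is left out.

```python
def sliding_windows(width, height, win_w, win_h, step):
    """Yield (left, top, right, bottom) boxes that fit entirely in the image."""
    for top in range(0, height - win_h + 1, step):
        for left in range(0, width - win_w + 1, step):
            yield (left, top, left + win_w, top + win_h)

def crop(image, box):
    """Cut the patch covered by box out of an image stored as a list of rows."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]

# A made-up 8x6 "photo" with a checkerboard pattern as pixel values.
image = [[(x + y) % 2 for x in range(8)] for y in range(6)]

boxes = list(sliding_windows(8, 6, win_w=4, win_h=4, step=2))
patches = [crop(image, box) for box in boxes]
print(len(boxes), boxes[0], boxes[-1])
```

A smaller step gives more patches and a finer scan at a higher computational cost. Each patch would then be handed to a classifier trained on fixed-size examples.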