The topics for week 3 of Machine Learning on Coursera are logistic regression and regularization. Logistic regression is a classification algorithm: it learns to categorize data items from an already classified training set. For instance, we can classify a day as freezing, rainy, sunny or cloudy from a set of precipitation, temperature and cloud-coverage values. Or we can classify stocks as cheap or as good investments based on Sharpe ratios, analyst ratings, average returns, daily volatility and so on.
Logistic regression uses a special function, the sigmoid, to come up with a value between 0 and 1 that indicates the probability of an item belonging to a certain class. You probably know that given a set of data points, you can fit a curve to them very tightly, giving you an (almost) perfect fit. This may sound great, but in practice it often turns out that a new data point doesn’t match the “perfect” fit at all. This phenomenon is called overfitting. Regularization tries to avoid overfitting: it keeps the higher-order (polynomial) terms small by adding a penalty for them to the cost function.
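The hypothesis itself is just the sigmoid applied to a weighted sum of the features. A minimal sketch in Python (the weights below are made up purely for illustration):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """Hypothesis h(x) = sigmoid(theta . x): probability that x belongs to class 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

# Hypothetical weights for a "freezing day" classifier; x = [bias, temperature in C].
theta = [0.0, -1.5]  # negative weight: the colder it is, the higher the probability
print(predict_proba(theta, [1.0, -10.0]))  # well below zero: close to 1
print(predict_proba(theta, [1.0, 20.0]))   # warm day: close to 0
```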
The video lectures in the logistic regression module discuss the simplest case of binary classification with only two classes:
- Hypothesis Representation
- Decision Boundary – this can be visualized in two-dimensional space as a line or curve separating the regions of each class. For instance, if we are classifying freezing temperatures on the Celsius scale, the line goes through the origin and separates negative from positive temperatures. (This doesn’t work if you are using exotic temperature scales such as Fahrenheit, where freezing sits at 32 degrees.)
- Cost function – the squared-error cost from linear regression would be non-convex for logistic regression, so a log-based cost is used instead.
- Simplified Cost Function and Gradient Descent – gradient descent applied to the logistic regression cost; the update rule looks just like the one for linear regression, only the hypothesis differs.
- Advanced Optimization – skip this module if you are not interested in Octave. A number of algorithms are mentioned, without much detail. These algorithms are implemented in the scipy.optimize module, so if you are using Python, just use those functions. By the way, scikit-learn has an implementation of logistic regression (see the LogisticRegression class).
- Multiclass Classification: One-vs-all – binary classification can be extended to multiple classes by applying the techniques we learned repeatedly. The key is to realize that we can reduce multiclass classification to binary classification by pretending we are dealing with only two classes, for example rainy weather versus everything else. Then we repeat the procedure for cloudy weather versus the rest, sunny weather versus the rest, and so on.
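The one-vs-all scheme can be sketched in plain Python: train one binary classifier per class and predict the class whose classifier reports the highest probability. The toy weather data, feature scaling and learning rate below are assumptions for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_binary(X, y, lr=0.1, iters=2000):
    """Fit binary logistic regression by gradient descent; X[i][0] is the bias feature."""
    theta = [0.0] * len(X[0])
    m = len(X)
    for _ in range(iters):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(t * x for t, x in zip(theta, xi))) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj / m
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

def one_vs_all(X, labels, classes):
    """Train one classifier per class: the class itself versus everything else."""
    return {c: train_binary(X, [1.0 if l == c else 0.0 for l in labels])
            for c in classes}

def predict(models, x):
    """Pick the class whose classifier outputs the highest probability."""
    return max(models,
               key=lambda c: sigmoid(sum(t * xi for t, xi in zip(models[c], x))))

# Toy data: feature vector is [bias, temperature / 10] (scaled for gradient descent).
temps = [-10, -5, 10, 15, 30, 35]
labels = ["freezing", "freezing", "mild", "mild", "hot", "hot"]
X = [[1.0, t / 10.0] for t in temps]
models = one_vs_all(X, labels, ["freezing", "mild", "hot"])
print(predict(models, [1.0, -0.8]))  # a -8 C day
print(predict(models, [1.0, 3.3]))   # a 33 C day
```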
The regularization module gives a bit of intuition about overfitting and how to avoid it:
- The Problem of Overfitting
- Cost function – the cost function gets an extra penalty term that suppresses large parameter values for the higher-order terms
- Regularized Linear Regression – shows how the gradient descent update and the normal equation change when the penalty term is added
- Regularized Logistic Regression
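Putting the two modules together: the regularized cost is the ordinary log loss plus an L2 penalty on the parameters (conventionally excluding the bias term). A sketch, with the data points and λ made up for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_cost(theta, X, y, lam):
    """Log loss plus an L2 penalty on theta[1:]; theta[0] (the bias) is not penalized."""
    m = len(X)
    loss = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * x for t, x in zip(theta, xi)))
        loss += -yi * math.log(h) - (1 - yi) * math.log(1 - h)
    penalty = (lam / (2 * m)) * sum(t * t for t in theta[1:])
    return loss / m + penalty

X = [[1.0, -1.0], [1.0, 2.0]]  # each row is [bias, feature]
y = [1.0, 0.0]
print(regularized_cost([0.0, 0.0], X, y, lam=1.0))   # log(2): every prediction is 0.5
print(regularized_cost([0.0, -5.0], X, y, lam=0.0))  # unregularized cost
print(regularized_cost([0.0, -5.0], X, y, lam=1.0))  # same fit, but large theta is penalized
```

Increasing λ raises the cost of large parameters, which pushes the optimizer toward smoother, less overfit hypotheses.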
It shouldn’t be too hard to classify stocks as cheap, low volatility or anything else that can be deemed desirable. Once you have manually classified the stocks, you can use statistics on the returns or other financial data as features. Logistic regression would then be able to tell whether other stocks fall into the “buy” or “sell” category. Seems like an idea for a (long-term) hedge fund.