Flipping coins with logistic regression

This entry is part 23 of 23 in the series Numpy Strategies

Under the Random Walk theory, financial assets perform random walks, not unlike Malocchio in his usual drunken state, but probably without the loud cursing and swearing. The theory basically ignores irrational investors, insider trading, fraud, major political events and natural disasters. It’s more of a belief than anything else. You either believe that markets are random or you don’t. Trying to predict whether the market will go up the next day/week/month is like predicting a coin flip.
So if you believe the theory, then a period of lower lows, higher highs, increasing volume or other “trends” is just coincidental, like a long series of consecutive “heads” or “tails”.
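To get a feel for how long such coincidental streaks can be, here is a minimal sketch that flips a simulated fair coin and measures the longest run of identical outcomes. The helper longest_run, the seed and the 250 flips (roughly one trading year) are assumptions made for illustration; none of them comes from the code in this post.

```python
import random


def longest_run(flips):
    """Return the length of the longest run of identical consecutive values."""
    best = current = 0
    previous = object()  # sentinel that compares unequal to every flip
    for flip in flips:
        current = current + 1 if flip == previous else 1
        best = max(best, current)
        previous = flip
    return best


rng = random.Random(42)  # fixed seed, an arbitrary choice
flips = [rng.choice("HT") for _ in range(250)]  # about one trading year of days
print(longest_run(flips))
```

Changing the seed gives a different streak length each time, yet no run in this series has any cause other than chance.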

We can treat an up, down or flat day (or other period), judged by the close price, as a class in the machine learning sense. The features could then be the returns for this period. I decided to use the returns of the OHLC prices and of the Volume as features. It should be easy to come up with other features, for example the return between Close(today) and Open(yesterday).

The output below consists of three parts. The 2-d array holds the probability of each class for the next day. We get only two classes, so apparently no flat days were found. The second number is the score (mean accuracy) over the whole data set. The last number is the predicted class, which is the class with the highest probability in the 2-d array. So we are to expect an up day. With a score of about 0.53 the result seems a bit better than a coin flip, so it seems plausible that we can improve on it.

 [[ 0.49673447 0.50326553]] 0.52828438949 [ 1.]
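As a small illustration of how the predicted class follows from the probabilities, the sketch below picks the class with the highest probability. The class order (-1 for down, 1 for up) is an assumption based on scikit-learn sorting class labels; the probabilities are copied from the output above.

```python
import numpy as np

# Assumed class order: -1 (down day) first, then 1 (up day).
classes = np.array([-1.0, 1.0])

# Probabilities copied from the output above, one row per predicted day.
proba = np.array([[0.49673447, 0.50326553]])

# The predicted class is the one with the highest probability in each row.
predicted = classes[np.argmax(proba, axis=1)]
print(predicted)
```

Note how close the two probabilities are: the "up" prediction wins by well under one percentage point.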

To improve the result you have at least two options, in my opinion. Option 1: add or remove features. Option 2: combine the result with other (weak) learners. The usual disclaimer applies here. Any losses/hardships resulting from this content are none of my concern. However, anything positive should be shared in a fair way.
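Option 2 could look roughly like the sketch below, which combines logistic regression with two other simple learners through soft voting (averaging the predicted class probabilities). This is only one possible way of combining learners. The choice of GaussianNB and a depth-1 decision tree is an assumption made for illustration, and the data is synthetic noise standing in for the scaled returns, so the numbers it prints mean nothing.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled OHLC and Volume returns: pure noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
y = np.where(rng.random(500) < 0.5, -1.0, 1.0)  # random down/up labels

ensemble = VotingClassifier(
    estimators=[
        ("logistic", LogisticRegression()),
        ("bayes", GaussianNB()),
        ("stump", DecisionTreeClassifier(max_depth=1)),
    ],
    voting="soft",  # average the class probabilities of the three learners
)

# Fit on all rows but the last, then predict the held-back row.
ensemble.fit(X[:-1], y[:-1])
proba = ensemble.predict_proba(X[-1:])
pred = ensemble.predict(X[-1:])
print(proba, pred)
```

Soft voting needs every learner to provide predict_proba; with voting="hard" the ensemble would take a majority vote on the predicted classes instead.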

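    import numpy as np
    from sklearn import preprocessing as pp
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold

    # rets, nan_to_mean and df are not defined in this post. They are assumed
    # to exist already: a returns function, a NaN-to-mean replacement and a
    # DataFrame with Open, High, Low, Close and Volume columns.

    def prepare(a):
        a = np.nan_to_num(a)
        a = rets(a)
        return pp.scale(a)  # standardize to zero mean and unit variance

    def classify_logistically(x, y):
        lg = LogisticRegression()
        kf = KFold(n_splits=10)
        a = x[:-1]  # all rows except the last, which is kept for prediction

        # Each iteration refits the model, so only the last fold's fit is kept.
        for train, test in kf.split(a):
            lg.fit(a[train], y[train])

        last = x[-1:]  # 2-d slice, as current scikit-learn expects
        return (lg.predict_proba(last), lg.score(a, y), lg.predict(last))

    o = nan_to_mean(df['Open'].values)
    retso = prepare(o)

    h = nan_to_mean(df['High'].values)
    retsh = prepare(h)

    l = nan_to_mean(df['Low'].values)
    retsl = prepare(l)

    c = nan_to_mean(df['Close'].values)
    retsc = prepare(c)

    v = nan_to_mean(df['Volume'].values)
    retsv = prepare(v)

    X = np.vstack((retso, retsh, retsl, retsc, retsv))
    X = X.T  # one row per day, one column per feature

    # Class of the following day. Note: retsc is already standardized by
    # prepare(), so this is the sign relative to the mean return.
    y = np.sign(retsc[1:])
    P, score, pred = classify_logistically(X, y)
    print(P, score, pred)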
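The class labels and the extra feature mentioned in the text (the return between Close(today) and Open(yesterday)) can be sketched on a handful of made-up prices. The prices and the variable names below are assumptions for illustration only. The returns are simple returns, which may differ from what the rets helper computes.

```python
import numpy as np

# Made-up prices, for illustration only.
open_ = np.array([10.0, 10.2, 10.1, 10.4, 10.4])
close = np.array([10.1, 10.0, 10.4, 10.4, 10.6])

# Simple close-to-close returns: one value per day, starting at the second day.
close_rets = np.diff(close) / close[:-1]

# Class per day: 1 for an up day, -1 for a down day, 0 for a flat day.
labels = np.sign(close_rets)

# Extra feature: return from yesterday's open to today's close.
open_to_close = close[1:] / open_[:-1] - 1

print(labels)         # down, up, flat, up
print(open_to_close)
```

Here the third return is exactly zero, so a flat class does show up as long as the sign is taken on the raw returns.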