Social News in 1000 Steps – Step 10

This entry is part 10 of 14 in the series Social News

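Continuing from the previous step, I want to classify the Bing search results. I do this with scikit-learn's TfidfVectorizer and LogisticRegression classes: the title and description of each result are turned into TF-IDF features, and a logistic regression model predicts whether the result was flagged 'Use'. The results seem surprisingly good. The code is as follows: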

import core
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import confusion_matrix
import numpy as np
db = core.connect()
bing_searches = db['bing_searches']
x = []
y = []
for row in bing_searches.all():
    x.append("{0}\n{1}".format(row['title'], row['description']))
    y.append(row['flag'] == 'Use')
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')
# Default split assumed here; any further arguments to this call are not shown.
train_x, test_x, train_y, test_y = train_test_split(x, y)
train_features = vectorizer.fit_transform(train_x)
test_features = vectorizer.transform(test_x)
lr = LogisticRegression().fit(train_features, train_y)
preds = lr.predict(test_features)
print('Test Accuracy', accuracy_score(test_y, preds))
print('Test Kappa', cohen_kappa_score(test_y, preds))
print('Confusion Matrix\n', confusion_matrix(test_y, preds))
feature_names = np.array(vectorizer.get_feature_names_out())
bottom10 = np.argsort(lr.coef_[0])[:10]
print("Possible negative keywords {}".format(feature_names[bottom10]))
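
Since the script reports both accuracy and Cohen's kappa, it may help to see why the two can disagree when most results carry the same flag. The following is a minimal pure-Python sketch of the kappa calculation, not scikit-learn's implementation. The function name cohen_kappa and the toy labels are made up for illustration and are not taken from the Bing data.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Agreement between two labelings, corrected for chance agreement."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    # Observed agreement: this is the same quantity as plain accuracy.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Agreement expected by chance, from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: 2 of 10 results are flagged 'Use' (True).
toy_true = [True, True] + [False] * 8
always_false = [False] * 10            # a model that never predicts 'Use'
one_hit = [True, False] + [False] * 8  # a model that finds one of the two

print("kappa, always False:", cohen_kappa(toy_true, always_false))
print("kappa, one hit:", cohen_kappa(toy_true, one_hit))
```

In this toy case the model that always predicts False is right 8 times out of 10, yet its kappa is 0 because that is exactly the agreement expected by chance. This is why I print kappa next to accuracy: a high accuracy alone does not say much if one flag dominates.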
By the author of NumPy Beginner's Guide, NumPy Cookbook and Instant Pygame.