Social News in 1000 Steps – Step 4

This entry is part 4 of 14 in the series Social News

I decided to make the filtering of common words stricter, which should in theory leave us with more “technical” terms. The additional filtering consists of:

  • Excluding words found in the Reuters, Brown, names and words corpora of NLTK.
  • Excluding words with a low unigram TF-IDF score.
  • Excluding ngrams containing digits (for example ‘S3’ or ‘PyLearn2’).
  • Excluding ngrams with duplicate words (for example ‘Hadoop Hadoop’). Such repetition is probably a spamming technique.
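
The last two checks in the list are simple enough to sketch in plain Python. The helpers below, `contains_digit` and `repeats_word`, are hypothetical stand-ins written for illustration; they are not the dautil implementations, which may behave differently (for instance with respect to case or tokenization).

```python
def contains_digit(term):
    """Return True if any character in the term is a digit."""
    return any(ch.isdigit() for ch in term)


def repeats_word(term):
    """Return True if a word occurs more than once in the ngram.

    Assumption: words are separated by whitespace and compared
    case-insensitively.
    """
    words = term.lower().split()
    return len(words) != len(set(words))


# Example terms taken from the list above, plus one that should survive.
candidates = ['S3', 'PyLearn2', 'Hadoop Hadoop', 'machine learning']
kept = [t for t in candidates
        if not contains_digit(t) and not repeats_word(t)]
print(kept)  # ['machine learning']
```

Note that a digit filter this blunt also throws away legitimate technical names such as ‘S3’, which is the trade-off accepted in this step.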

The code for this step is as follows (also on GitHub):

import datetime
import dautil as dl
import pandas as pd


def get_terms(alist, sw):
    # Score the ngrams in alist, ignoring the stop words in sw,
    # and keep only the selected terms.
    df = dl.nlp.calc_tfidf(alist, sw)

    return dl.nlp.select_terms(df)


corpus = dl.nlp.WebCorpus('sonar_corpus')
texts = corpus.get_texts()

# Start with the common unigrams and extend the stop words with
# every unigram that is not selected as uncommon by TF-IDF.
sw = dl.nlp.common_unigrams()
unigrams_tfidf = dl.nlp.calc_tfidf(texts, ngram_range=None)
all_unigrams = set(unigrams_tfidf['term'].values.tolist())
uncommon = dl.nlp.select_terms(unigrams_tfidf)
sw = sw.union(all_unigrams - uncommon)

text_terms = get_terms(texts, sw)
title_terms = get_terms(corpus.get_titles(), sw)

# Keep terms found in both texts and titles, minus author names.
terms = text_terms.intersection(title_terms) - corpus.get_authors()

fname = 'keywords.csv'
old = set(pd.read_csv(fname)['Term'].values.tolist())

# Append only new terms without digits or duplicate words.
with open(fname, 'a') as csv_file:
    for t in terms:
        if t not in old and not dl.nlp.has_digits(t) \
                and not dl.nlp.has_duplicates(t):
            ts = datetime.datetime.now().isoformat()
            csv_file.write(ts + ',' + t + ',Use\n')
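
To make the stop word extension in the listing above more concrete, here is a minimal standard-library sketch of the same idea: score every unigram with TF-IDF and treat the low scorers as additional stop words. The scoring formula, the `threshold` parameter and the function names `tfidf_scores` and `split_by_score` are assumptions chosen for illustration; how dautil actually scores and selects terms may well differ.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return the highest TF-IDF score each unigram reaches in any document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each word.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    scores = {}
    for words in tokenized:
        for word, count in Counter(words).items():
            tf = count / len(words)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] = max(scores.get(word, 0.0), tf * idf)
    return scores


def split_by_score(scores, threshold):
    """Split the scored words into (uncommon, common) sets."""
    uncommon = {w for w, s in scores.items() if s > threshold}
    return uncommon, set(scores) - uncommon


docs = ['the cluster runs hadoop',
        'the model uses theano',
        'the team likes the cluster']
uncommon, common = split_by_score(tfidf_scores(docs), threshold=0.0)

# Mirrors sw.union(all_unigrams - uncommon) in the listing above.
stop_words = {'a', 'an'}.union(common)
print(sorted(stop_words))  # ['a', 'an', 'the']
```

In this toy corpus ‘the’ occurs in every document, so its inverse document frequency is zero and it ends up among the stop words, while rarer words such as ‘hadoop’ are kept as candidates.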
By the author of NumPy Beginner's Guide, NumPy Cookbook and Instant Pygame.