Social News in 1000 Steps – Step 11

This entry is part 11 of 14 in the series Social News

Code is, of course, very valuable if you are into that sort of thing. High-ranking code n-grams may be useful for research. To find code, I used a regex that matches several Python keywords and operators:

import re
import dautil as dl
import pandas as pd
 
 
def compile_pattern():
    patterns = ['lambda ', 'yield ', '==', 'def ', 'import ', 'class ']
    return re.compile("|".join(patterns))
 
 
def clean_line(line):
    tokens = line.split()
    filtered = [t for t in tokens if not t.isdigit()]
 
    return " ".join(filtered)
 
if __name__ == "__main__":
    pattern = compile_pattern()
    corpus = dl.nlp.WebCorpus('sonar_corpus')
    selected = []
 
    for text in corpus.get_texts():
        for line in text.split("\n"):
            if pattern.search(line):
                selected.append(clean_line(line))
 
    df = dl.nlp.calc_tfidf(selected, ngram_range=(2, 4))
    flag_df = pd.DataFrame(columns=['Flag'])
    pd.concat([df, flag_df]).to_csv('code_keywords.csv')
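
The filtering step can be tried without the corpus or dautil. The following is only a minimal sketch using the standard library: the sample text is made up, `select_code_lines` and `word_ngrams` are hypothetical helper names, and the n-grams are counted with a plain `Counter` instead of being weighted with TF-IDF, so it does not reproduce what `dl.nlp.calc_tfidf` computes.

```python
import re
from collections import Counter

# Same alternatives as in the listing above; re.escape guards against
# regex metacharacters if the list is ever extended.
KEYWORDS = ['lambda ', 'yield ', '==', 'def ', 'import ', 'class ']
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS))


def clean_line(line):
    """Drop purely numeric tokens and normalize whitespace."""
    return " ".join(t for t in line.split() if not t.isdigit())


def select_code_lines(text):
    """Return cleaned lines that contain at least one keyword match."""
    return [clean_line(line) for line in text.split("\n")
            if PATTERN.search(line)]


def word_ngrams(line, n_min=2, n_max=4):
    """Yield word n-grams of a line for n_min <= n <= n_max."""
    tokens = line.split()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


# Hypothetical sample text, not taken from the corpus.
sample = "\n".join([
    "Here is how I did it:",
    "12 import numpy as np",
    "if x == 3 :",
    "No code on this line",
])

selected = select_code_lines(sample)
# Raw n-gram counts (2 to 4 words), standing in for the TF-IDF ranking.
counts = Counter(g for line in selected for g in word_ngrams(line))
```

Because the pattern is a plain substring match, ordinary prose containing "import " or "class " followed by a space is selected as well, so the resulting n-grams still need a manual look.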
By the author of NumPy Beginner's Guide, NumPy Cookbook and Instant Pygame.