Intention analysis using topic models

February 14, 2015 • hack

Topic models are great for categorizing WHAT a text is about. It is pretty easy as well: get an off-the-shelf LDA, train it on your corpus and you are set to go. But there are even more insights you can get out of your texts. Modifying your corpus in a certain way (mostly removing everything but verb phrases) allows you to gain a deeper understanding of WHY a certain text was written.

I tested it on Stack Overflow (SO) questions within the scope of a bigger media mining project.

Setup

I used Python and nltk for this project. To play around with the code posted below, make sure to have Python installed and use pip or easy_install to install the following packages: nltk, gensim.

Furthermore, if you want to use the SO questions as well, download them from https://archive.org/details/stackexchange.
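
The dump ships as XML; Posts.xml contains one row element per post, with PostTypeId 1 marking questions and the HTML body stored in the Body attribute. The following is only a rough sketch of how to turn it into the one-question-per-line file used below (the output file name so_questions.txt is a placeholder):

import xml.etree.ElementTree as ET

# Stream through the (very large) Posts.xml instead of loading it into memory at once.
with open("so_questions.txt", "w") as out:
    for _, row in ET.iterparse("Posts.xml", events=("end",)):
        # PostTypeId "1" marks questions, "2" marks answers.
        if row.tag == "row" and row.get("PostTypeId") == "1":
            # Write the raw HTML body on a single line.
            body = row.get("Body", "").replace("\r", " ").replace("\n", " ")
            out.write(body + "\n")
        row.clear()  # free the processed element to keep memory usage low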

Creation of a corpus and dictionary

Before we can start to train a topic model, we need a dictionary of all the tokens that are part of our corpus (in our case the corpus consists of all SO questions).

from gensim import corpora

class SOQuestionCorpus(corpora.TextCorpus):
    def __init__(self, question_file, tokenizer):
        # The stack-overflow questions are stored in a file, one question per line
        self.question_file = question_file

        # A tokenizer is a function that takes a text as input (possibly multiple sentences) and returns all 
        # contained tokens (a token is the unit we are going to train the LDA on; it can be either a single 
        # word, a word's stem or a word phrase) as an array of strings.
        self.tokenizer = tokenizer

        # The `TextCorpus` class is going to create a dictionary of all tokens of all documents we got. The 
        # tokens for every document are provided by the `get_texts` function. Passing any non-None value as 
        # `input` makes `TextCorpus` build the dictionary from `get_texts` right away.
        super(SOQuestionCorpus, self).__init__(input=True)

        # Filter out rare tokens (appearing in fewer than 3 documents) and very frequent ones (appearing in 
        # more than 20% of all documents); the latter also removes common stop words like 'the' or 'is'
        self.dictionary.filter_extremes(no_below=3, no_above=0.2)

    # Stack-overflow questions contain a lot of stuff we don't want to be included in our topic model, like 
    # code snippets or other markup.
    @staticmethod
    def pre_process(body):
        return remove_tags(remove_code(body))

    # Provides an array of arrays of all the tokens for all documents.
    # Example:
    #   Let the documents be 
    #     `["Hello world. I am doc1.", "Nice code! I like it."]` 
    #   The function will then yield one token array per document (the exact tokens depend on the tokenizer),
    #     e.g. `[["hello", "world", "i", "am", "doc1"], ["nice", "code", "i", "like", "it"]]`
    def get_texts(self):
        with open(self.question_file) as questions:
            for question in questions:
                yield list(self.tokenizer(SOQuestionCorpus.pre_process(question)))
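
The helper functions remove_code and remove_tags used in pre_process are not part of the listing above; they need to be defined before the corpus is built. A minimal sketch based on two regexes, assuming the question bodies are the raw HTML from the dump, could look like this:

import re

# Strip <code>...</code> and <pre>...</pre> blocks including their content.
def remove_code(body):
    return re.sub(r"<(pre|code)\b[^>]*>.*?</\1>", " ", body, flags=re.DOTALL)

# Strip all remaining HTML tags but keep their text content.
def remove_tags(body):
    return re.sub(r"<[^>]+>", " ", body)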

Creating the corpus will take some time since all the documents need to be processed. You should call self.dictionary.save('mydictionary.dict') after creation to store the dictionary for later use.
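
For example, once the corpus (created further below) exists:

from gensim import corpora

corpus.dictionary.save('mydictionary.dict')                # persist after the first run
dictionary = corpora.Dictionary.load('mydictionary.dict')  # reload it in later runs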

As you might have noticed, we did not define the tokenizer function yet.

A standard topic model

For a standard topic model it is sufficient to use utils.simple_preprocess as a tokenizer. It will lowercase the input and use a regex to split the text into single words:

from gensim import utils
tokenizer = utils.simple_preprocess
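
For example (simple_preprocess also drops very short and very long tokens, which is why the "I" disappears):

from gensim import utils

print(utils.simple_preprocess("How do I parse JSON in Python?"))
# ['how', 'do', 'parse', 'json', 'in', 'python']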

We can now use this tokenizer and our corpus to train a model:

from gensim import models

corpus = SOQuestionCorpus("so_questions.txt", tokenizer)  # path to the file with one question per line

model = models.LdaMulticore(corpus=corpus, iterations=50, chunksize=5000, num_topics=100,
                            id2word=corpus.dictionary, eval_every=3, workers=5)
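
Once trained, the model can infer the topic distribution of an unseen question. A minimal sketch (the question text is made up and the topic ids depend on your trained model):

new_question = "How can I send an HTTP request with custom headers in Python?"

# Convert the question into the bag-of-words representation of our dictionary and query the model.
bow = corpus.dictionary.doc2bow(tokenizer(new_question))
print(model[bow])  # e.g. [(50, 0.83), (73, 0.05), ...] - (topic id, probability) pairs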

The model was trained to fit 100 topics to the corpus and, as sketched above, can now be used to predict topics for new questions. Here are some examples of the resulting topics of the LDA-trained topic model:

#         Contained words sorted by frequency
50 request, response, requests, header, server, with, http, headers, proxy, get
62 ffmpeg, enable, gpu, cuda, sensitive, retina, with, sdl, for, opencl
73 feature, features, for, training, classification, dataset, lat, naive, predict, race


As you can see, #50 is mostly about HTTP communication, #62 is about GPU computing and #73 seems to be about machine learning. But there are no topics that reveal why the questioner is asking the question.

Verb-phrase tokenization of documents

Let’s fiddle around with the tokenization step to find a way to extract topics that correspond to the intentions of questioners. Nouns and noun phrases (NP) often contain information about the WHAT, and as you can see in the examples above, the LDA mostly focuses on them. In contrast to nouns, verb phrases (VP) often contain information about the intentions of a questioner.

So let’s try to get rid of the noun phrases and train the LDA on the remaining phrases. To do so, we obviously need to figure out which parts of a sentence correspond to noun phrases. Nltk provides us with a neat base class for that task, ChunkParserI. We can build upon it and implement a simple bigram chunker (which can later easily be replaced by a more sophisticated model).

from nltk import BigramTagger, ChunkParserI
from nltk.chunk.util import conlltags2tree, tree2conlltags


# Underlying chunker we are going to train
class BigramChunker(ChunkParserI):
    def __init__(self, train_sentences):
        train_data = [[(t, c) for w, t, c in tree2conlltags(sent)]
                      for sent in train_sentences]
        # Create a bigram tagger and use the supplied training data to create the model
        self.tagger = BigramTagger(train_data)

    # Extracts chunks from a sentence using the conll tree format. The incoming sentence needs to be 
    # tokenized and annotated with POS tags
    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunk_tags = [chunk_tag for (pos, chunk_tag) in tagged_pos_tags]
        conll_tags = [(word, pos, chunk_tag) for ((word, pos), chunk_tag) in zip(sentence, chunk_tags)]
        return conlltags2tree(conll_tags)

As a training data set for the chunker, the conll2000 data can be used (a quick sanity check of the chunker on that data is sketched below). The chunker can then be plugged into a pipeline that first tokenizes, POS tags and chunks the sentences. Afterwards we can throw out the chunks we do not want to use.
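
Such a sanity check might look like this (the conll2000 corpus needs to be downloaded via nltk.download first; the evaluation on the held-out split is optional):

from nltk.corpus import conll2000

train_sents = conll2000.chunked_sents('train.txt', chunk_types=['VP', 'NP'])
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['VP', 'NP'])

print(train_sents[0])  # a parse tree whose subtrees mark the NP and VP chunks

chunker = BigramChunker(train_sents)
print(chunker.evaluate(test_sents))  # precision, recall and F-measure on the held-out split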

from nltk import sent_tokenize, word_tokenize, pos_tag, BigramTagger, ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import conll2000


# This is an implementation of a text chunker. It tries to split sentences into their
# phrases. Several steps are involved: sentence splitting, word tokenization and
# POS tagging. After that a trained chunker will group the tokens into phrases. 
class TextChunker:

    def __init__(self):
        # loading the training data for the chunker
        train_sents = conll2000.chunked_sents('train.txt', chunk_types=['VP', 'NP'])
        self.chunker = BigramChunker(train_sents)

    # Given a document of sentences, calculate the contained chunks for each sentence
    def chunk_text(self, rawtext):
        if self.chunker is None:
            raise Exception("Text chunker needs to be trained before it can be used.")
        sentences = sent_tokenize(rawtext.lower())  # NLTK default sentence segmenter
        tokenized = [word_tokenize(sent) for sent in sentences]  # NLTK word tokenizer
        postagged = [pos_tag(tokens) for tokens in tokenized]  # NLTK POS tagger
        for tagged in postagged:
            for chunk in self._extract_chunks(self.chunker.parse(tagged), exclude=["NP", ".", ":", "(", ")"]):
                if len(chunk) >= 2:
                    yield chunk

    def _token_of(self, tree):
        return tree[0]

    def _tag_of(self, tree):
        return tree[1]

    # The chunker will produce a parse tree. We need to analyse the parse tree and
    # extract and combine the tags we want.
    def _extract_chunks(self, tree, exclude):
        def traverse(tree):
            try:
                # Let's check if we are at a leaf node containing a token
                tree.label()
            except AttributeError:
                # We want to exclude all POS tags in `exclude` and furthermore we want to ignore special characters.
                # The POS tag of a special character is equal to the character. The only other token for which this is 
                # true is `to` so we need to make sure to exclude everything but `to`.
                if self._tag_of(tree) in exclude \
                        or self._token_of(tree) in exclude \
                        or (self._token_of(tree) != "to" and self._token_of(tree) == self._tag_of(tree)):
                    return []
                else:
                    # return the token of the node
                    return [self._token_of(tree)]
            else:
                node = tree.label()
                if node in exclude:
                    return []
                else:
                    return [word for child in tree for word in traverse(child)]

        for child in tree:
            traversed = traverse(child)
            if len(traversed) > 0:
                # chunks get connected again using whitespaces
                yield " ".join(traversed)
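
A quick way to see what the chunk-based tokenizer produces is to run it on a made-up question (the exact chunks depend on the trained bigram chunker and the NLTK taggers, so your output may differ):

# Requires the nltk data packages for sentence splitting, POS tagging and conll2000 (see nltk.download()).
chunker = TextChunker()
for chunk in chunker.chunk_text("Why does my request fail? I am trying to send custom headers."):
    print(chunk)
# prints the extracted (mostly verb-phrase) chunks, e.g. 'am trying to send'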

Instead of the default tokenizer utils.simple_preprocess, we can now use this chunk-based tokenizer to split our sentences and train our LDA model. The result is a model that mainly relies on verb phrases.

chunker = TextChunker()
corpus = SOQuestionCorpus("so_questions.txt", chunker.chunk_text)  # same question file as before

model = models.LdaMulticore(corpus=corpus, iterations=50, chunksize=5000, num_topics=100,
                            id2word=corpus.dictionary, eval_every=3, workers=5)

As a result, there will be topics that look somewhat like this:

#         Contained words sorted by frequency
12 why, here, not, does, am getting, get, do not understand, is not working, works
32 compiled, using, by, get, is driving, compiling, invalid, signing, compile, is configured
43 can i do, using, want to show, want, not, only, now, can i achieve, here, am using


The topics can be improved using stemming and a more sophisticated chunker.
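
A stemmed variant of the tokenizer, for example, is easy to add on top of the chunker. A sketch (SnowballStemmer is just one possible choice; chunker refers to the TextChunker instance from above):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Stem every token of a chunk before joining the chunk back together.
def stemmed_chunk_text(rawtext):
    for chunk in chunker.chunk_text(rawtext):
        yield " ".join(stemmer.stem(token) for token in chunk.split())

# corpus = SOQuestionCorpus("so_questions.txt", stemmed_chunk_text)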

Summary

In this post I tried to explain how to train a simple text chunker using the Python library nltk in combination with gensim. Filtering out tokens before training a topic model, or going even further and combining several tokens into a single one, results in topics with a different meaning.

Let me know if you try this or have tried a similar approach. I would love to hear about your insights and results.

References

The idea is inspired by the work of Miltiadis Allamanis and Charles Sutton, published in "Why, When, and What: Analyzing Stack Overflow Questions by Topic, Type, and Code" (MSR 2013).

by Tom Bocklisch

