Machine learning tips: How to quickly generate a bag of words

Natural language processing is all about text, but machine learning algorithms don't work on the text itself. They need vectors of numbers, for example a bag of words. You can easily build one with a dictionary and some for loops, but there is an easier way.

NLP is about numbers

For those who haven't done anything with natural language processing, it may sound weird that you need vectors of numbers to classify a piece of text. The reason is that machine learning is a mathematical concept, not a linguistic one.

You could of course parse the sentence by applying rules. But that is tedious work and tends to be far more error-prone than doing the same thing with a machine learning algorithm.

When you apply machine learning to text you're not really looking at words, you're trying to understand a pattern in the text. In order to do that, you need to express properties of the text as numbers: how often the words appear in a sentence, their position within that sentence, or other properties that may prove useful for recognizing patterns.

There are a couple of ways to turn sentences into a representation suitable for machine learning:

Bag of Words (BoW)
Global Vectors for Word Representation (GloVe)
Word2Vec

Bag of Words uses word counts to create a kind of signature of the input text to classify. The other two methods are word embeddings: they map each word to a dense vector learned from the words it tends to appear near, so that related words end up close together.

Turn text into a bag of words

The bag of words algorithm uses word counts to represent the input text for your machine learning algorithm. It works like this:

Create a bucket for each unique word you want represented (the vocabulary). Next go over the text and put a token in the right buckets for the words you encounter.

You can build this with plain Python, but as I mentioned before, it is not the most efficient method.
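For comparison, here is a minimal sketch of the bucketing idea in plain Python, using a dictionary and a few for loops. The helper names tokenize, build_vocabulary and to_bag_of_words, the whitespace tokenization, and the sample sentences are all assumptions made for illustration.

```python
def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def build_vocabulary(texts):
    # Assign a bucket (an index) to every unique word, in order of first appearance.
    vocabulary = {}
    for text in texts:
        for word in tokenize(text):
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def to_bag_of_words(text, vocabulary):
    # Start with empty buckets, then drop a token in the bucket of each known word.
    counts = [0] * len(vocabulary)
    for word in tokenize(text):
        if word in vocabulary:
            counts[vocabulary[word]] += 1
    return counts


vocabulary = build_vocabulary(["the cat sat on the mat", "the dog sat"])
vector = to_bag_of_words("the dog and the cat", vocabulary)
print(vocabulary)
print(vector)
```

Words that have no bucket, such as "and" in this sketch, are simply skipped.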

As with many things in IT, the problem has been solved before. The Python package scikit-learn contains several tools for machine learning, among them a set of so-called vectorizers.

from sklearn.feature_extraction.text import CountVectorizer

input_vectorizer = CountVectorizer(input='content')
input_vectorizer.fit([..., ...])

The sample above shows the CountVectorizer class. This class uses the bucketing principle: when you call fit, it builds the vocabulary by assigning a bucket to every unique word in the texts you pass in.

After you've trained the vectorizer, you can transform texts:

vector_output = input_vectorizer.transform([...])

The transform method accepts a list of strings you want converted into vectors. The output is a sparse matrix of word counts, with one row per input string and one column per word in the vocabulary.
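To make the fit and transform steps concrete, here is a small runnable sketch. The sentences are made up for this example, and the vectorizer is used with its default settings apart from the input argument shown earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy training sentences (made up for this example).
training_texts = ["the cat sat on the mat", "the dog sat"]

input_vectorizer = CountVectorizer(input='content')
input_vectorizer.fit(training_texts)

# vocabulary_ maps every known word to its column index.
vocabulary = input_vectorizer.vocabulary_

# transform returns a SciPy sparse matrix; toarray() makes it dense for printing.
vector_output = input_vectorizer.transform(["the dog and the cat"])
dense_output = vector_output.toarray()

# Print the words in column order, followed by the counts per column.
print(sorted(vocabulary, key=vocabulary.get))
print(dense_output)
```

Words that the vectorizer did not see during fit ("and" in this sketch) have no column and are dropped from the output.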

Watch out for the input setting

The CountVectorizer can be initialized with three values for the input argument: filename, file and content. This has an effect on how the fit and transform calls work. For example, when you use filename as the input argument, you are required to pass filenames to both fit and transform. I've found that this limits the usability somewhat.
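The limitation can be reproduced with a small sketch. The temporary file and its contents are made up for this example, and catching OSError assumes that the vectorizer tries to open the raw sentence as if it were a path.

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical training file, created only for this example.
    train_path = os.path.join(tmp_dir, 'train.txt')
    with open(train_path, 'w') as train_file:
        train_file.write("the cat sat on the mat\n")

    filename_vectorizer = CountVectorizer(input='filename')
    filename_vectorizer.fit([train_path])

    # transform also expects paths, so passing the training file works...
    file_counts = filename_vectorizer.transform([train_path]).toarray()

    # ...but a raw sentence is treated as a path to open, which fails.
    try:
        filename_vectorizer.transform(["the cat sat"])
        raw_text_accepted = True
    except OSError:
        raw_text_accepted = False

print(file_counts)
print(raw_text_accepted)
```

With this setting, every sentence a user types would first have to be saved to disk before it could be transformed.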

The filename input setting is useful for training, but not for actually using the vectorizer on new text. So I worked around the problem by initializing the vectorizer with the input setting content and feeding the fit method with the generator function below, which yields the lines of the training files one at a time.

def input_documents(filenames):
    # Yield every line of every file as a separate document.
    for filename in filenames:
        with open(filename, 'r') as input_file:
            for line in input_file:
                yield line

This results in the following training logic:

from sklearn.feature_extraction.text import CountVectorizer

input_vectorizer = CountVectorizer(input='content')
input_vectorizer.fit(input_documents(['data/file1.txt', 'data/file2.txt']))

This frees you from having to write a single user sentence to a file before you can transform it.
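Put together, the whole flow can be sketched end to end as follows. The temporary files stand in for data/file1.txt and data/file2.txt, and their contents are made up for this example.

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer


def input_documents(filenames):
    # Same idea as the generator above: every line is one document.
    for filename in filenames:
        with open(filename, 'r') as input_file:
            for line in input_file:
                yield line


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-ins for the training files.
    paths = []
    for name, text in [('file1.txt', "the cat sat\non the mat\n"),
                       ('file2.txt', "the dog sat\n")]:
        path = os.path.join(tmp_dir, name)
        with open(path, 'w') as handle:
            handle.write(text)
        paths.append(path)

    input_vectorizer = CountVectorizer(input='content')
    input_vectorizer.fit(input_documents(paths))

# No files needed anymore: a user sentence can be passed in directly.
user_vector = input_vectorizer.transform(["the dog sat on the cat"]).toarray()
print(user_vector)
```

One difference to keep in mind: with this generator every line counts as a separate document, whereas the filename setting treats each file as one document. That matters once you use options that depend on document counts.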

Final thoughts

The vectorizers make life much easier. Don't forget to check out the scikit-learn documentation to discover more!

Cheers!