Natural Language Processing: N-grams

There are many applications of natural language processing, including document classification, speech recognition, and translation services. There are also many less commonly considered applications, including telephony, home automation, robotics, speech-to-text reporting, video games, transcription, and compression algorithms. The performance of speech recognition is typically evaluated in terms of both speed and accuracy.

Naturally, NLP is a very rich and deep field of research, with a large variety of techniques to extract information from language. In subsequent posts we'll have a look at other, more complicated examples, but for now we'll discuss one of the simplest NLP algorithms, known as the n-gram model.


The basic use of N-grams is sequence prediction. This is applicable to situations such as predictive text (used on mobile phones or Google search below) or even live data-compression, where encoding needs to be done on-the-fly.

Figure: Google search autocomplete suggestions for "machine learning".

The basis of the N-gram model is the chain rule of probability. Recall that if the probability of \(A\) occurring is \(P(A)\) and the probability of \(B\) occurring is \(P(B)\), then the joint probability of both \(A\) and \(B\) occurring, \(P(A\cap B)\), can be written:

$$P(A\cap B) = P(A|B)\, P(B)$$

This can be generalized to:

$$P(x_1 \cap x_2 \cap \cdots \cap x_n) = \prod_{i=1}^{n} P(x_i | x_1\cap x_2\cap\cdots\cap x_{i-1})$$

As an example, the joint probability of the words in the sentence, \(P(\text{"the quick brown fox"})\), can be written out via the chain rule as

$$P(\text{"the quick brown fox"}) = P(\text{the}) P(\text{quick} | \text{the}) P(\text{brown}|\text{the quick}) P(\text{fox}|\text{the quick brown})$$

Ideally we would compute all conditional probabilities in a frequentist manner: simply take a training set and count the number of instances where "the quick brown" is followed by the word "fox", divided by the total number of times "the quick brown" occurs. Unfortunately, computing the conditional probabilities in this manner is simply not feasible due to the sheer explosion of possible word combinations.

Instead, we use an approximation known as the N-gram model. Expressed formulaically,

$$P(x_1\cap x_2 \cap \cdots \cap x_n) \approx \prod_{i=1}^{n} P(x_i | x_{i-(N-1)}\cdots x_{i-1})$$

where we compute the probability of the next word based on only a subset of the previous words - typically the N-1 immediately preceding words, with N determined by the complexity-vs-accuracy trade-off.
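For instance, under a bigram (N=2) model each word is conditioned only on the single word immediately before it, so the earlier chain-rule example collapses to:

$$P(\text{"the quick brown fox"}) \approx P(\text{the})\, P(\text{quick}|\text{the})\, P(\text{brown}|\text{quick})\, P(\text{fox}|\text{brown})$$

Each factor now requires counting only pairs of adjacent words, which is vastly more tractable than counting full histories.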

Predictive Text Example:

Let's construct a simple N=2 (bigram) example to see how this works in practice. For bigrams, we can construct a dict of dicts in Python (for general N-grams this would probably generalise to a multiway tree of some sort) as follows:

Python snippet: building the bigram dictionary.
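The original code snippet is not reproduced here, but a minimal sketch of the dict-of-dicts idea might look like the following (the function name `build_bigram_counts` and the toy training phrase are my own):

```python
from collections import defaultdict

def build_bigram_counts(text):
    """Count bigram occurrences as a dict of dicts:
    counts[w1][w2] = number of times w2 immediately follows w1."""
    words = text.lower().split()
    counts = defaultdict(lambda: defaultdict(int))
    for w1, w2 in zip(words, words[1:]):
        counts[w1][w2] += 1
    return counts

counts = build_bigram_counts("one fish two fish red fish blue fish")
# dict(counts["fish"]) -> {'two': 1, 'red': 1, 'blue': 1}
```

The outer key is the current word; the inner dict records how often each word follows it, which is exactly the raw material for the conditional probabilities above.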

Let's see what this does to our training data set from Dr. Seuss.

Applying the algorithm to an input phrase.

Now that we have an input_dataset to compare against, we can suggest the next word (using a similarly constructed find_unigram_count function) based on the conditional probabilities:

Continuing the workflow: suggesting the predicted next word.
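Again, the original snippet is missing, but a self-contained sketch of the suggestion step might look like this. The helper `find_unigram_count` mirrors the function referenced above; `build_bigram_counts`, `suggest`, and the Seuss-style phrase are my own illustration:

```python
from collections import defaultdict

def build_bigram_counts(text):
    """counts[w1][w2] = number of times w2 immediately follows w1."""
    counts = defaultdict(lambda: defaultdict(int))
    words = text.lower().split()
    for w1, w2 in zip(words, words[1:]):
        counts[w1][w2] += 1
    return counts

def find_unigram_count(counts, word):
    # Total times `word` appears as the first element of any bigram.
    return sum(counts[word].values())

def suggest(counts, word):
    """Return the most probable next word and its conditional probability."""
    total = find_unigram_count(counts, word)
    if total == 0:
        return None, 0.0
    nxt = max(counts[word], key=counts[word].get)
    return nxt, counts[word][nxt] / total

counts = build_bigram_counts(
    "one fish two fish red fish blue fish black fish blue fish"
)
word, prob = suggest(counts, "fish")
# word -> 'blue', prob -> 0.4 ("blue" follows "fish" 2 times out of 5)
```

Each suggestion is just the maximum-likelihood estimate \(P(w_2|w_1) = \text{count}(w_1 w_2)/\text{count}(w_1)\) computed from the stored counts.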


And that's it! With a large enough dataset, this will make a reasonable prediction of the next word using the previous one. Last I heard, Google was using about a 5-gram model with a trillion-word training set!

Final Thoughts

We have discussed the N-gram algorithm used in word prediction. Our discussion of bigrams extends in principle to higher N, until you run into computational bounds. In practice, you don't require explicit computation of the unigram counts (the (N-1)-gram counts in the general case); all that is required is to keep a running track of the probabilities of the N-grams. Another simplification is to store log probabilities instead of the probabilities themselves. This is done for two reasons: the probabilities are typically very small, so multiplying them runs into numerical underflow issues, and the multiplication of probabilities is converted into a sum of their logs, which is computationally much faster.
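The underflow point is easy to demonstrate with a toy calculation (the probability value and sequence length here are arbitrary illustrations):

```python
import math

# 100 word probabilities of 1e-5 each: the true product is 1e-500,
# far below the smallest representable double (~1e-308).
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p
# product has underflowed to exactly 0.0

# Summing logs keeps the same information at a comfortable magnitude.
log_sum = sum(math.log(p) for p in probs)
# log_sum is about -1151.3, i.e. 100 * log(1e-5)
```

Comparing candidate sentences by their summed log probabilities gives the same ranking as comparing products, without ever underflowing.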

We have covered one of the simplest NLP algorithms. In subsequent posts, we will consider more complicated examples and their applications.