Document classification is currently one of the most important branches of Natural Language Processing (NLP). The general idea is to automatically classify documents into categories using machine learning algorithms.
The applications are almost endless: we can classify patient records, movie reviews, webpages, and emails (spam vs. not spam), and in principle scrape the entire internet for information.
Bag of Words Representation
To translate raw text into features that we can classify, we can use a bag of words representation.
This is a simple yet very powerful method. From the training data, we count the number of times each word appears. We can apply some form of smoothing so that words that appear in the test set but not in the training set are still represented (and have some non-zero probability).
Further, the word counts can later be replaced by counts of N-grams, or weighted by an importance metric such as term frequency-inverse document frequency (TF-IDF).
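As a rough illustration of counting and smoothing, here is a minimal sketch using only the standard library. The toy documents, the whitespace tokenizer, and the helper name smoothed_prob are assumptions made for this example, and add-one (Laplace) smoothing is just one common choice of smoothing.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()

# Toy documents, made up for illustration only.
train_docs = ["the cough persisted", "the cough was chronic"]
test_doc = "the cough was severe"

# Count how often each word appears in the training data.
counts = Counter(word for doc in train_docs for word in tokenize(doc))

# The vocabulary covers training words plus any new words seen at test time.
vocab = set(counts) | set(tokenize(test_doc))

def smoothed_prob(word, alpha=1.0):
    """Add-alpha (Laplace) estimate of the probability of a word."""
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * len(vocab))

print(counts["cough"])           # raw count from the training documents
print(smoothed_prob("severe"))   # non-zero although "severe" was never seen in training
```

Without smoothing, "severe" would get probability zero, because it never occurs in the training documents.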
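Both extensions are available in sklearn's feature extraction module. The sketch below uses two made-up documents; the parameter choices (ngram_range=(1, 2) and the default TF-IDF settings) are assumptions for illustration, not settings used in the example that follows.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents, made up for illustration only.
docs = [
    "whooping cough in infants",
    "chronic cough in adults",
]

# Count unigrams and bigrams together.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_counts = bigram_vectorizer.fit_transform(docs)
print(sorted(bigram_vectorizer.vocabulary_))

# TF-IDF down-weights terms that occur in many documents.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)
vocab = tfidf_vectorizer.vocabulary_
first_row = tfidf.toarray()[0]

# "whooping" occurs in one document, "cough" in both, so "whooping" gets more weight.
print(first_row[vocab["whooping"]], first_row[vocab["cough"]])
```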
For simplicity let's keep it as a bag of words, where we effectively take each word as independent (1-gram). Now that we have a set of features for a training set, let's make this concrete using an example.
Classification of Abstracts
The NIH has a very nice website, from which you can pull information for academic papers:
From here, we can download medical journal abstracts on all types of research in disease and epidemiology.
For this example, we set up a three-class classification problem, with classes: whooping cough, chronic cough, and AIDS. The goal is to classify abstracts that we haven't seen before into these three categories.
Let's first load the data and put it into a pandas data frame for ease of use. The data now looks like this:
We also do the usual random splitting of the data into training and testing sets for validation.
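A minimal sketch of the loading and splitting steps might look as follows. The abstracts here are short made-up stand-ins rather than real data, and the column names abstract and label, the 25% test fraction, and the stratified split are all assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up snippets standing in for real abstracts (illustration only).
toy = {
    "whooping cough": [
        "pertussis outbreak in infants",
        "bordetella vaccine trial",
        "paroxysmal cough in newborns",
        "pertussis booster coverage",
    ],
    "chronic cough": [
        "persistent cough and asthma",
        "reflux related chronic cough",
        "cough lasting eight weeks",
        "airway inflammation in adults",
    ],
    "AIDS": [
        "hiv antiretroviral therapy",
        "cd4 count and viral load",
        "opportunistic infection in hiv",
        "hiv transmission prevention",
    ],
}
rows = [(text, label) for label, texts in toy.items() for text in texts]
df = pd.DataFrame(rows, columns=["abstract", "label"])

# Random split; stratify keeps the class proportions equal in both parts.
train_df, test_df = train_test_split(
    df, test_size=0.25, random_state=0, stratify=df["label"]
)
print(len(train_df), len(test_df))
```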
We now use sklearn's feature extraction for the bag of words representation:
Here train_data_features is now simply the set of bag-of-words features. After converting it to a NumPy array, it is a 2-D array with one row per abstract and one column per vocabulary word, where each entry is the count of that word in that abstract.
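The feature extraction step can be sketched with sklearn's CountVectorizer. The abstracts below are made-up stand-ins, the default vectorizer settings are an assumption, and test_data_features is a name introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up snippets standing in for real abstracts (illustration only).
train_abstracts = [
    "pertussis vaccine trial in infants",
    "chronic cough and asthma in adults",
    "hiv antiretroviral therapy outcomes",
]
test_abstracts = ["novel pertussis strain"]

vectorizer = CountVectorizer()

# Learn the vocabulary from the training text and count the words.
train_data_features = vectorizer.fit_transform(train_abstracts)  # sparse matrix
train_data_features = train_data_features.toarray()              # dense NumPy array

# Reuse the same vocabulary for the test text; unseen words are dropped.
test_data_features = vectorizer.transform(test_abstracts).toarray()

print(train_data_features.shape)  # (number of abstracts, vocabulary size)
print(sorted(vectorizer.vocabulary_))
```

Note that the vectorizer is fitted on the training text only, so test-set words outside the learned vocabulary contribute nothing to the features.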
Now that we have these features, we can construct a random forest classifier and feed in the training data:
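A minimal sketch of this step, assuming 100 trees and a fixed random seed (both arbitrary choices for illustration) and a handful of made-up training snippets:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Made-up snippets standing in for real abstracts (illustration only).
train_abstracts = [
    "pertussis vaccine trial in infants",
    "bordetella pertussis outbreak",
    "chronic cough and asthma in adults",
    "reflux related chronic cough",
    "hiv antiretroviral therapy outcomes",
    "cd4 count and hiv viral load",
]
train_labels = [
    "whooping cough", "whooping cough",
    "chronic cough", "chronic cough",
    "AIDS", "AIDS",
]

# Bag-of-words features, one row per abstract.
vectorizer = CountVectorizer()
train_data_features = vectorizer.fit_transform(train_abstracts).toarray()

# Fit the forest on the word counts.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(train_data_features, train_labels)

print(forest.classes_)
```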
Predicting on a test set and voila, we are done:
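The prediction and evaluation step might be sketched as below, summarising the result as a confusion matrix. All snippets are made-up stand-ins, so the numbers this produces say nothing about the real data; the class order passed to labels is also an assumption.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix

classes = ["whooping cough", "chronic cough", "AIDS"]

# Made-up snippets standing in for real abstracts (illustration only).
train_abstracts = [
    "pertussis vaccine trial in infants",
    "bordetella pertussis outbreak",
    "chronic cough and asthma in adults",
    "reflux related chronic cough",
    "hiv antiretroviral therapy outcomes",
    "cd4 count and hiv viral load",
]
train_labels = [classes[0], classes[0], classes[1], classes[1], classes[2], classes[2]]
test_abstracts = [
    "pertussis in newborn infants",
    "asthma and chronic cough",
    "hiv therapy and viral load",
]
test_labels = classes

vectorizer = CountVectorizer()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(vectorizer.fit_transform(train_abstracts), train_labels)

# Transform the test text with the training vocabulary, then predict.
predicted = forest.predict(vectorizer.transform(test_abstracts))

# Rows are true classes, columns are predicted classes, in the order of `classes`.
cm = confusion_matrix(test_labels, predicted, labels=classes)
print(cm)

# Per-class recall: the diagonal divided by the number of true examples per class.
print(cm.diagonal() / cm.sum(axis=1))
```

Off-diagonal entries show which classes get mistaken for each other.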
Unsurprisingly, AIDS is the easiest class to separate, while there is more confusion between the whooping cough and chronic cough abstracts.
A simple representation such as the bag of words already shows great promise in document classification problems, reaching 90% accuracy for the AIDS class in our example. That said, this kind of analysis is best seen as a simple, quick, and neat baseline on which to build a document classifier.
We can (and should) extend this with more sophisticated features to improve both precision and recall, and, importantly, take into account the complications that smoothing introduces depending on corpus size.