[Rui La, SFL Intern]
Automated text summarization through machine learning can be an extremely valuable tool for increasing efficiency in both everyday life and professional work, provided the important information in a document can be extracted and accurately summarized.
The typical person consumes a wide variety of information every day (news articles, social media posts, work emails, study material, and more) and spends a large amount of time and energy reading and digesting it. Since the first paper on text summarization was published by Luhn in 1958, various techniques have been developed to extract the important content from a text document. Text summarization is a blossoming area of natural language processing (NLP) research, especially in recent years with the application of neural networks in conjunction with cloud computing. In this post, several text summarization methods will be analyzed and explained.
Types of Summarization
Single vs. Multiple document
Single document summarization generates an abstract or outline from a single document (e.g., a news article).
Multiple document summarization describes an event or theme common to several documents.
Extractive vs. Abstractive
The extractive method selects the parts of the document that are more important than others by some metric and joins them together to form a summary.
The abstractive method extracts information from the original document and rephrases it in new sentences. This is more analogous to how the human brain summarizes.
Most current document summarization systems are based on the extractive method since the abstractive methodology is still primarily in the research phase. In this post, we will mainly cover extractive summarization.
The main stages for the text summarization problem discussed in this blog post are as follows:
Important content selection
Training of sentence importance classification model
Stitching together the important sentences in the order that they appear in the original document.
In this project, we will use the 20 newsgroups dataset for investigation. The dataset contains approximately 18,000 newsgroup posts on 20 topics. First, a single document is broken into sentence segments. The goal is to set up a standard for determining which sentences are important. Here, the following three rules are used:
The first sentence in a document usually summarizes the whole document. Thus, we automatically consider the first sentence important.
Normalized sentence Term Frequency-Inverse Document Frequency (Tf-Idf) score
Tf-Idf reflects how important a word is to a document in a collection or corpus. The normalized sentence Tf-Idf score is the total Tf-Idf score of a sentence divided by the length of the sentence.
Special Named Entities rate
Named entities are persons, locations, organizations, time expressions, etc. Sentences that contain these special named entities may carry more useful information than others. A word recognized by named entity recognition (NER) is commonly tagged as a noun by a part-of-speech (POS) tagger. Therefore, the rate of special named entities in a sentence can serve as a measure of its importance.
Figure: heat map of POS tags vs. named entity labels. Most non-“O” entities fall under the POS tags “NN”, “NNP”, “NNPS”, “CD”, and “JJ”.
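As a concrete illustration, the normalized sentence Tf-Idf score described above can be sketched in pure Python. This is a simplified version with hypothetical function and variable names; a real pipeline would typically use a library such as scikit-learn.

```python
import math
from collections import Counter

def normalized_sentence_tfidf(sentence_tokens, doc_freq, n_docs):
    """Total Tf-Idf of a sentence's words divided by sentence length.

    sentence_tokens: list of word tokens in the sentence
    doc_freq: dict mapping word -> number of documents containing it
    n_docs: total number of documents in the corpus
    """
    tf = Counter(sentence_tokens)
    total = 0.0
    for word, count in tf.items():
        # smoothed inverse document frequency
        idf = math.log(n_docs / (1 + doc_freq.get(word, 0)))
        # term frequency weighted by idf
        total += (count / len(sentence_tokens)) * idf
    # normalize by sentence length so long sentences are not favored
    return total / len(sentence_tokens)
```

Dividing by the sentence length keeps the score comparable across sentences of different lengths, which matters because raw Tf-Idf totals grow with sentence length.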
Once the normalized Tf-Idf scores and special named entity rates are calculated, sentences can be placed into binary classification groups. Important sentences to be included in the abstract are labeled “1”. Sentences that are not important are labeled “0” and will be excluded from the abstract.
For the classification task, algorithms such as Naive Bayes, Decision Trees, Hidden Markov Models, Conditional Random Fields, Recurrent Neural Networks, and Long Short-Term Memory models can be used, and can even be ensembled together (ensembling different models often results in better class predictions).
The following table summarizes the properties we extracted from the dataset.
The thresholds for the “norm_sent_tfidf” and “key_POS_rate” features were set at 0.45 and 0.5, respectively, and are tunable model parameters. The figure below illustrates the number of important vs. unimportant sentences in the corpus. There are 38,641 important sentences out of a total of 128,179 sentences, translating to an importance rate of approximately 30%.
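The labeling rule can be sketched as follows. The function name is hypothetical, and the assumption that a sentence clearing either threshold counts as important (a logical OR) is ours; the post does not specify exactly how the two features are combined.

```python
def label_sentence(index, norm_sent_tfidf, key_pos_rate,
                   tfidf_threshold=0.45, pos_threshold=0.5):
    """Return 1 (important) or 0 (unimportant) for a sentence.

    Rule 1: the first sentence of a document is always important.
    Rules 2-3: a sentence clearing either feature threshold is
    important (assumed OR combination).
    """
    if index == 0:
        return 1
    if norm_sent_tfidf >= tfidf_threshold or key_pos_rate >= pos_threshold:
        return 1
    return 0
```

Raising either threshold labels fewer sentences as important, which is one way the length of the final summary can be controlled.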
To check whether our classification method is reasonable, we can use a t-SNE projection to reduce the dimension of the Tf-Idf embeddings and visualize the classification result. From the figure below we can see a clear distinction between the two colors: the yellow dots representing important sentences occupy the upper-left corner, and the blue dots representing unimportant sentences occupy the rest of the plot.
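The projection step can be sketched with scikit-learn's t-SNE implementation. Here random data stands in for the real Tf-Idf embeddings, and the perplexity value is illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

# stand-in for a dense Tf-Idf embedding matrix: 50 sentences x 20 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))

# project to 2-D for plotting; perplexity must be below the sample count
projection = TSNE(n_components=2, perplexity=5, init="random",
                  random_state=0).fit_transform(X)
print(projection.shape)  # one 2-D point per sentence
```

The resulting two columns can then be scattered with matplotlib, coloring each point by its importance label to look for the kind of separation described above.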
The distribution of sentence length and number of special named entities in important/unimportant sentences can be used to confirm the reasonability of the classifier.
A violin plot is an alternative to a traditional box plot. The inner markings show percentiles, while the width of the “violin” shows the distribution of values. The following violin plots show how sentence length and the number of special named entities relate to sentence importance. From the plots, long sentences seem to be less important, while the number of special named entities seems irrelevant to sentence importance.
Two different ways to extract features from the dataset will be discussed: Tf-Idf and word2vec.
Tf-Idf. Two Tf-Idf vectorizers were created: a word vectorizer and a character vectorizer. These vectorizers convert a collection of raw documents into a matrix of Tf-Idf features. The character vectorizer tokenizes a sentence into character n-grams of length 2 to 6.
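A sketch of the two vectorizers using scikit-learn; the sample sentences and parameter choices are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["The quick brown fox jumps over the lazy dog.",
             "A watched pot never boils."]

# word-level vectorizer: one feature per word
word_vectorizer = TfidfVectorizer(analyzer="word")
word_matrix = word_vectorizer.fit_transform(sentences)

# character-level vectorizer: features are character n-grams of length 2-6
char_vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 6))
char_matrix = char_vectorizer.fit_transform(sentences)

# both produce one row per sentence; the character vocabulary is much larger
print(word_matrix.shape, char_matrix.shape)
```

Character n-grams are more robust to typos and rare word forms, which is one reason to combine them with the word-level features.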
Word2vec. Google's word2vec model maps the one-hot encodings of words to dense, lower-dimensional vectors. Word2vec is a deep-learning inspired method that focuses on the meaning of words: it attempts to capture semantic relationships among them.
After extracting features from the dataset, different machine learning or deep learning algorithms can be used to classify the sentences. In this analysis, the Light Gradient Boosting Machine (LightGBM) algorithm is used to fit a model to the training data and predict sentence importance on the testing dataset, which was initially “held out” (separated) from the training dataset. The confusion matrix plot below shows the results on the testing dataset. The prediction accuracy is approximately 91%.
Once the model was trained and tuned, it was deployed on a short document to check whether it could successfully summarize the document. The following is a paragraph from the dataset. There are 8 sentences in the document, and the model returns that the first, second, and fourth sentences are important. These sentences were therefore extracted to form the abstract of the document.
In this model, all tokens are converted to 500-dimensional vectors. Several tests were performed to check whether the model is reasonable. The built-in similarity function returns the cosine similarity between the vectors representing two words. The higher the score, the more similar the two words.
The model can also identify the word that doesn’t belong to a certain group, such as car in a group of fruits, as in the example below.
Another model capability is ranking the words most similar to a target word.
Feature extraction using word2vec
Once the tokens were vectorized into 500-dimensional vectors, the average feature vector of a sentence was created by summing the vectors of all the words in the sentence and dividing the total by the sentence length. Creating an average feature vector represents the sentence in vector space with a single vector, so that most machine learning algorithms can handle the input. Note that we do lose some information by collapsing the sentence into a single vector; to get around this concern, we could instead vectorize the document by “padding” the sentences to the same length so that machine learning algorithms can ingest the full sequence.
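The averaging step can be sketched in pure Python. The `word_vectors` dict is a hypothetical stand-in for a trained word2vec vocabulary.

```python
def average_feature_vector(sentence_tokens, word_vectors, dim=500):
    """Sum the vectors of all words in a sentence, then divide the
    total by the sentence length, as described above.

    word_vectors: dict mapping word -> list of floats (its embedding);
    out-of-vocabulary words contribute nothing to the sum.
    """
    total = [0.0] * dim
    for word in sentence_tokens:
        vec = word_vectors.get(word)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    n = len(sentence_tokens)
    return [t / n for t in total] if n else total
```

In practice this would be done with NumPy over the gensim `model.wv` lookup, but the arithmetic is the same.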
With the simple averaged sentences, the confusion matrix of prediction results on the testing dataset is shown below. The prediction accuracy score is 88%.
Similar to the Tf-Idf method, a document can be passed into the word2vec model to test if the model can successfully extract important sentences.
Predict the sentence importance:
Summary and Future Work
In this post, a rule-based method was used to create proxy labels for important and unimportant sentences. Then, two different methods, Tf-Idf and word2vec, were used to create features from the dataset and build a supervised machine learning classification model. Various performance metrics were applied to compare their robustness and predictive power. Finally, the model was applied to a document to test its functionality.
The sentences are then stitched together, in order, to create a final summary. The parameters for importance in the binary classification algorithm can be tuned to create a longer or shorter summary as desired.
A common modification to extractive summarization uses the notion of similar sentences. In principle, the step after removing unimportant sentences is to ensure that the extracted sentences are uncorrelated with each other. A simple cosine similarity score between the word2vec vectors of the important sentences allows such a comparison. Indeed, similarity scores can be used, without the need to find “important” sentences, to create maximally covering summaries.
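A minimal sketch of this redundancy filter, in pure Python; the function names and the similarity cutoff are illustrative assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def deduplicate(sentence_vectors, threshold=0.9):
    """Return the indices of sentences to keep, dropping any sentence
    too similar to one already kept (threshold is an assumed cutoff)."""
    kept = []
    for i, vec in enumerate(sentence_vectors):
        if all(cosine_similarity(vec, sentence_vectors[j]) < threshold
               for j in kept):
            kept.append(i)
    return kept
```

Applied after the importance classifier, this greedy pass keeps the first of any pair of near-duplicate important sentences, shortening the summary without losing coverage.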
Future work would include exploring the similarity between sentences and using subjects, titles, and headers to create a more robust importance metric. Finally, deep learning algorithms (e.g., an RNN or LSTM) could also be implemented to either extract important sentences or create a final end-to-end abstractive summary.