Text summarization is a relatively novel field in machine learning. The goal is to automatically condense unstructured text articles into summaries containing the most important information.
Instead of a human having to read entire documents, we can use a computer to condense the most important information into something more manageable. The use cases for such algorithms are potentially limitless, from automatically creating summaries of books to condensing messages from millions of customers so that their sentiment can be analyzed quickly.
In this blog, we will consider the broad facets of text summarization. These include:
- raw text extraction/summarization methods
- sentiment analysis
- named entity recognition
In all these cases, we use machine learning techniques to significantly reduce the amount of information that a human needs to consider directly.
In principle, there are several problems that we face before we can mine the text for information:
- text is often unstructured, with pertinent information dotted around
- text may be poorly formatted, with HTML markup and non-ASCII encodings, and it can contain spelling or grammatical errors
- information may be given in different languages
- finally, responses may contain not only text and numbers, but also emojis, pictures, links, and non-ASCII characters
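To make the cleaning problems above concrete, here is a minimal sketch that uses only the Python standard library. The function name `clean_text` and the regular expressions are assumptions made for illustration, not a prescribed pipeline. Folding everything to ASCII also discards information, so this only suits text that is mostly English.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")        # naive HTML tag matcher (illustrative only)
URL_RE = re.compile(r"https?://\S+")   # bare links
WS_RE = re.compile(r"\s+")             # runs of whitespace


def clean_text(raw: str) -> str:
    """Return a lowercased, ASCII-only copy of `raw` without tags or links."""
    text = TAG_RE.sub(" ", raw)                  # strip HTML tags
    text = html.unescape(text)                   # "&amp;" becomes "&"
    text = URL_RE.sub(" ", text)                 # drop links
    text = unicodedata.normalize("NFKD", text)   # split accents from base letters
    # Dropping non-ASCII bytes removes the detached accents and any emojis.
    text = text.encode("ascii", "ignore").decode("ascii")
    return WS_RE.sub(" ", text).strip().lower()
```

A real pipeline would more likely use a proper HTML parser and keep non-ASCII characters for multilingual data. The point here is only that each problem in the list maps to a small, separate cleaning step.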
Data cleaning problems like these are common to all fields of machine learning. Despite them, NLP methods can significantly facilitate the processing of such text. Below we consider several approaches to text analysis, along with NLP methods and algorithms for the task.
Automatic Text Extraction and Summarization
The first method of text summarization can be thought of as keyword/keyphrase extraction. We can reduce millions of sentences to a few hundred, or to a tunable number of sentences that trades informativeness against length, judged by some quality metric. In this case we try to create a representative subset of the text that carries the information of the entire set.
There are two approaches to automatic summarization: extraction and abstraction. In the former, we extract relevant existing words, phrases or sentences from the original text. The latter builds a more semantic summary using NLP techniques.
The basic algorithms range from something as simple as a frequency count in a word cloud to methods that create a coherent and readable summary of a text.
In later blog entries we will discuss some of these approaches in more detail.
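As a rough illustration of the extraction approach, the sketch below scores each sentence by the average frequency of its words and keeps the top few. The stop-word list, the sentence splitter and the scoring rule are all simplifying assumptions. It is one of the simplest frequency-count heuristics, not a reference implementation of any particular algorithm.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "was", "we", "but", "on", "for", "with", "that", "this"}


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def summarize(text: str, n_sentences: int = 2) -> list[str]:
    """Pick the n highest-scoring sentences, returned in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for w in tokenize(text) if w not in STOP_WORDS)

    def score(sentence: str) -> float:
        words = [w for w in tokenize(sentence) if w not in STOP_WORDS]
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])   # restore document order
    return [sentences[i] for i in keep]
```

The `n_sentences` parameter is the tunable knob mentioned above: a larger value gives a more informative but longer summary.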
Sentiment Analysis
The next summarization approach we will discuss is concerned with the “emotional” component of text, be it positive or negative. This is usually formulated as a sentiment classification problem.
Training data typically takes the form of product or service reviews in which the text is accompanied by a rating score assigned by the reviewer, e.g. 0-10, or simply by positive and negative classes derived from the ratings.
To convert the text into something that a machine learning algorithm can ingest, we can use one or several techniques to extract information in the form of features.
With these features, it is straightforward to use a standard classification algorithm, such as a neural network, to do the final classification. Again, we will come back to algorithms in future posts.
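The sketch below shows the overall shape of such a pipeline: ratings are mapped to classes, text is turned into bag-of-words features, and a classifier is trained on the result. To keep it self-contained, it swaps in a small multinomial Naive Bayes classifier rather than the neural network mentioned above. The threshold of 5, the toy reviews and all names are assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def rating_to_label(rating: int, threshold: int = 5) -> str:
    """Map a 0-10 rating to a class; the threshold is an assumed cut-off."""
    return "positive" if rating > threshold else "negative"


def features(text: str) -> list[str]:
    """Bag-of-words features: lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)   # label -> word -> count
        self.label_counts = Counter(labels)       # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(features(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)          # class prior
            for w in features(text):
                if w in self.vocab:                        # ignore unseen words
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented for illustration): (review text, 0-10 rating).
reviews = [
    ("Great product, works perfectly", 9),
    ("Excellent service and great support", 10),
    ("Terrible quality, broke after a week", 1),
    ("Awful support, terrible experience", 2),
]
model = NaiveBayes().fit([t for t, _ in reviews],
                         [rating_to_label(r) for _, r in reviews])
```

Calling `model.predict("great service")` then returns a class for text that came without a rating. With four training reviews this is only a demonstration of the mechanics, not a usable classifier.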
With a trained algorithm, we can then predict sentiment on any text that comes without such labels. This can be call logs (after audio transcription using time-series techniques discussed previously), help-center chat logs, customer reviews, Twitter feeds, etc.
From a business perspective, the next steps can then be taken. Examples include:
- consider all negative reviews and draw conclusions on how to improve the quality of the product or service provided
- filter important concepts so you can pursue the most promising opportunities in product development
- identify changing trends and preferences over time
- determine market segmentation based on varying trends within your clientele
- catch important issues like product defects or service errors at an early stage
Such methods are typically highly accurate, and their output is often in line with the level of agreement found between human readers.
Named Entity Recognition
The last approach that we will discuss in this article is named entity recognition. In this approach, the problem is reduced to looking for meaningful objects in the text. Objects may, for example, be any of the following:
- brand etc
Analysis of the results may help you to understand:
- what product or model is most often mentioned
- dates for appointments and time-lines of delays and complaints
- whether some reviews relate to particular offices or locations, if you have several
There are several methods for named entity extraction. The idea is to identify the terms that are most important to the text.
The simplest method uses a dictionary of words. This can be helpful for known quantities such as days of the week, months, and perhaps even city and country names. More complicated methods include the following:
- Supervised: feature engineering using word length and shape (e.g. a word containing an “x” is more likely to be a drug name), combined with sequence models such as Conditional Random Fields (CRFs), Maximum Entropy Markov Models (MEMMs), Hidden Markov Models (HMMs), etc.
- Semi-supervised: starting from a small seed of known named entities, iteratively bootstrap a regular expression
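The dictionary method and the hand-crafted features used by supervised methods can be sketched as follows. The gazetteer contents, the entity type names and the exact feature set are hypothetical. The sequence model that would consume these features (a CRF, for instance) is left out.

```python
import re

# Hypothetical gazetteer; a real one would be far larger.
GAZETTEER = {
    "monday": "DAY", "tuesday": "DAY", "friday": "DAY",
    "january": "MONTH", "june": "MONTH",
    "london": "CITY", "paris": "CITY",
}


def dictionary_entities(text: str) -> list[tuple[str, str]]:
    """Return (token, entity_type) pairs for tokens found in the gazetteer."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [(t, GAZETTEER[t.lower()]) for t in tokens if t.lower() in GAZETTEER]


def word_shape(token: str) -> str:
    """Collapse a token to its shape: X = upper, x = lower, d = digit."""
    shape = []
    for ch in token:
        if ch.isupper():
            code = "X"
        elif ch.islower():
            code = "x"
        elif ch.isdigit():
            code = "d"
        else:
            code = ch                          # keep punctuation as-is
        if not shape or shape[-1] != code:     # merge repeated codes
            shape.append(code)
    return "".join(shape)


def token_features(token: str) -> dict:
    """Hand-crafted features that a supervised sequence model could consume."""
    return {
        "length": len(token),
        "shape": word_shape(token),
        "has_x": "x" in token.lower(),
        "is_capitalized": token[:1].isupper(),
    }
```

A dictionary lookup is fast and precise for closed sets, but it cannot handle unseen names or ambiguous words. That is why the feature-based and bootstrapped methods are worth the extra complexity.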
We have given a high-level introduction to how text summarization can greatly facilitate and expedite the processing of text.
We discussed the three main approaches to text summarization (automatic summarization, sentiment analysis and named entity extraction) that can be used to process books, reviews, or any other text document. This will significantly reduce the time a human needs to understand all the text-based information out there, be it web pages, customer reviews, or entire novels!
There are plenty of advanced tools, algorithms, and libraries already out there that can do this. In future blogs, we will also go over some text summarization algorithms in detail to illustrate a real-world example.