Information Extraction from Text

Business Challenge

Over the last several years, members of the healthcare industry have faced the challenge of digitizing their physical records. The transition has become a necessity because consumers expect information to be consistent and available. Companies benefit greatly from access to a database of organized information such as drug dosage, frequency, and recurrent demographics. Manually extracting information from text normally requires many costly human hours. Documents can be written in over ten languages, which adds a significant translation cost for the company. Real-world data is messy and unstructured, and often contains missing values or inconsistencies. All of these aspects pose problems for a human reader, but they are considerably more manageable for a machine.


SFL's Approach

SFL-Scientific’s goal is twofold: to save the company money by automatically storing information in a database, and to provide a customizable platform for future text extraction. The most commonly extracted classes of information are drug names, dosage frequency, drug type, and container size, in over ten European languages. The text is first tokenized, meaning it is broken up into words, symbols, phrases, or other meaningful groupings referred to as tokens. Those tokens are passed into a machine learning algorithm that generates features based on capitalization, parts of speech, vowel groupings, and so on. This feature-based model was augmented by ensembling it with several sequence labeling and classification methods, such as hidden Markov models, conditional random fields, n-gram models, and maximum entropy classifiers. Together, these pattern recognition methods allowed SFL to extract only the pertinent information despite the varying structure of the documents. Indeed, out of several pages of text, only a handful of tokens are typically important in these particular documents. The result is a model that can parse text and upload the extracted information into a database. SFL-Scientific has chosen to develop a web API for easy access to this information.
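The tokenization and feature generation steps described above can be illustrated with a minimal Python sketch. It is not SFL-Scientific's implementation: the token pattern, the feature names, and the sample sentence are all hypothetical, chosen only to show how capitalization, vowel groupings, and neighboring tokens might be turned into per-token features. Part-of-speech features are omitted because they would require an external tagger.

```python
import re

# Hypothetical token pattern (an assumption, not SFL's actual rule):
# decimal numbers, runs of letters (accented letters included), or single symbols.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)?|[^\W\d_]+|[^\w\s]|_")


def tokenize(text):
    """Break text into word, number, and symbol tokens."""
    return TOKEN_RE.findall(text)


def vowel_groups(token):
    """Count runs of consecutive vowels (simplified: unaccented a, e, i, o, u, y)."""
    return len(re.findall(r"[aeiouy]+", token.lower()))


def token_features(tokens, i):
    """Build an illustrative feature dictionary for the token at position i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        "is_numeric": tok[0].isdigit(),
        "vowel_groups": vowel_groups(tok),
        "suffix3": tok[-3:].lower(),
        # Neighboring tokens give a sequence model some context.
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }


def featurize(text):
    """Tokenize text and return the tokens with one feature dict per token."""
    tokens = tokenize(text)
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    # Made-up sample sentence, used only for illustration.
    tokens, feats = featurize("Ibuprofen 200 mg, twice daily.")
    for tok, f in zip(tokens, feats):
        print(tok, f)
```

Feature dictionaries of this shape are a common input format for sequence labelers such as conditional random fields, which would then assign each token a label such as drug name, dosage, or "not relevant".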


Business Value

The benefits of digitizing pharmaceutical records extend to both the company and the consumer. With this product, pharmaceutical companies spend significantly less time finding and storing information. The software automates the process of manually reading through texts, saving thousands of human hours. A common application is the extraction of dates, companies, people, and amounts from books, pamphlets, and more. Because the product is customizable, the key information in essentially any document can be extracted.