
Text Extraction from Legal Documents Using Natural Language Processing

Business Challenge

LinkSquares approached SFL Scientific about devising a solution to algorithmically extract key terms from contracts and legal documents for their Smart Values contract analysis cloud.

For legal and finance departments, time is extremely valuable, and the need for automation spans every industry and company size. If the time needed to review one contract is cut to seconds, the aggregate savings across initial legal review, change orders, and final record keeping have a substantial business impact. Time no longer spent manually sifting through the legalese, verbosity, standard phrases, and insertions in contracts can go toward more critical work and negotiations.

LinkSquares provides a contract analysis and reporting service intended to relieve businesses from the burden of manually storing and reviewing contracts line-by-line.

SFL Scientific Solution

SFL’s task was to provide LinkSquares with an algorithm to perform two main operations:

  1. Extract key terms from a legal document.
  2. Classify tokenized text into pre-defined categories.

The algorithm that SFL delivered was deployed on AWS, which ran the code on demand: whenever a legal document was uploaded, SFL's code launched automatically. AWS streamlined the deployment to production and made the asset accessible to every LinkSquares employee who needed it. In addition, Amazon S3 was used for daily backups.

The text extraction algorithm consisted of three main steps:

  1. Feature Engineering
  2. Modeling — stacking ensemble
  3. Post-processing

Feature Engineering

First, the algorithm tokenizes the raw text of the legal documents using a regular expression tokenizer: the text is split into tokens (roughly, words), and each token is stored as an independent observation. Hundreds of features are then created from the tokenized text in three ways: rule-based features, token-based features, and sequence-level classes used as features.
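The tokenization step can be sketched as follows. This is a minimal illustration, not the delivered implementation: the pattern in `TOKEN_PATTERN` and the `tokenize` helper are assumptions, since the case study does not give the actual regular expression.

```python
import re

# Assumed pattern (not from the case study): words with internal
# apostrophes or hyphens, numbers with ./,// separators, or a single
# punctuation character.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+(?:[.,/]\d+)*|[^\w\s]")


def tokenize(text):
    """Return (token, start_offset) pairs, one per observation."""
    return [(m.group(), m.start()) for m in TOKEN_PATTERN.finditer(text)]


tokens = tokenize(
    "This Agreement is governed by the laws of New York, effective 01/15/2020."
)
for token, offset in tokens:
    print(offset, token)
```

Keeping the character offset next to each token is a design choice made here so that later steps can map a predicted class back to its position in the contract.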

Rule-based features are created by matching the text against a “fuzzy dictionary” of predefined, known smart term values. Each hard-coded rule maps to a known class. For example, North American states are hard-coded to the “Governing Law” class. The classes determined by this set of hard-coded rules are saved as features.
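One way such a fuzzy dictionary lookup might look is sketched below, using approximate string matching from Python's standard `difflib` module. The dictionary entries, the "Payment Terms" class, the `"O"` (no class) label, and the 0.85 similarity cutoff are all hypothetical; the case study does not describe the actual matching method.

```python
from difflib import get_close_matches

# Hypothetical dictionary mapping known values to their class.
KNOWN_VALUES = {
    "delaware": "Governing Law",
    "massachusetts": "Governing Law",
    "new york": "Governing Law",
    "net 30": "Payment Terms",
}


def rule_based_class(phrase, cutoff=0.85):
    """Return the class of the closest known value, or "O" if none is close enough."""
    matches = get_close_matches(phrase.lower(), list(KNOWN_VALUES), n=1, cutoff=cutoff)
    return KNOWN_VALUES[matches[0]] if matches else "O"


# A misspelled state name still maps to its class; an unrelated word does not.
print(rule_based_class("Massachusets"))
print(rule_based_class("Supplier"))
```

The returned class would then be stored as one more feature column for the token or phrase, alongside the other features.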

Token-based features are generated on a per-token basis, based on knowledge of the predefined, known smart terms. They fall into three general categories: token-level (Is this word a noun?), sentence-level (Is this token the first token in a sentence?), and document-level (Is this token found in the X section?). Other examples include “Is the token capitalized?”, “How many letter ‘A’s are there in the token?”, “How long is the token?”, and “Is the next word a known smart term?”.
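A few of the token-based features named above can be sketched as a single function. The function name, the feature names, and the sentence-boundary rule are assumptions made for illustration; part-of-speech and document-section features are omitted because they would need a tagger and a section parser that the case study does not specify.

```python
def token_features(tokens, i, known_terms):
    """Build a feature dict for tokens[i] from the token and its neighbors."""
    tok = tokens[i]
    prev_tok = tokens[i - 1] if i > 0 else None
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else None
    return {
        "is_capitalized": tok[:1].isupper(),
        "length": len(tok),
        "count_a": tok.lower().count("a"),  # counts both "a" and "A"
        "is_digit": tok.isdigit(),
        # Assumed rule: a sentence starts at the first token or after . ! ?
        "is_sentence_start": prev_tok is None or prev_tok in {".", "!", "?"},
        "next_is_known_term": next_tok is not None and next_tok.lower() in known_terms,
    }


tokens = ["Governed", "by", "Delaware", "law", "."]
rows = [token_features(tokens, i, {"delaware"}) for i in range(len(tokens))]
print(rows[1])
```

Each dict becomes one row of the feature table, so every token is described by the same set of columns.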

Sequence-level features are generated by predicting the classes of each token using several sequence-level machine learning models: Conditional Random Field, Hidden Markov Model, N-gram model, and a neural network. Just as the rule-based classes are saved as features, the classes determined from these sequence-level machine learning models are also saved as features.
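To illustrate how a sequence model's predictions become features, here is a deliberately small stand-in: a bigram tagger that predicts a token's class from the previous token and the token itself, backing off to the token alone. It is a simplified example of the N-gram idea only, not the CRF, HMM, or neural network used in the project, and the class name `BigramTagger`, the training data, and the `"O"` label are hypothetical.

```python
from collections import Counter, defaultdict


class BigramTagger:
    """Predict a class from (previous token, token), backing off to the token alone."""

    def __init__(self):
        self.bigram = defaultdict(Counter)
        self.unigram = defaultdict(Counter)

    def fit(self, sequences):
        # sequences: list of documents, each a list of (token, label) pairs
        for seq in sequences:
            prev = "<s>"
            for tok, label in seq:
                tok = tok.lower()
                self.bigram[(prev, tok)][label] += 1
                self.unigram[tok][label] += 1
                prev = tok
        return self

    def predict(self, tokens):
        labels, prev = [], "<s>"
        for tok in tokens:
            tok = tok.lower()
            counts = self.bigram.get((prev, tok)) or self.unigram.get(tok)
            # Unseen tokens fall back to the "no class" label.
            labels.append(counts.most_common(1)[0][0] if counts else "O")
            prev = tok
        return labels


train = [
    [("laws", "O"), ("of", "O"), ("Delaware", "Governing Law")],
    [("State", "O"), ("of", "O"), ("Texas", "Governing Law")],
    [("Delaware", "O"), ("Avenue", "O")],
]
tagger = BigramTagger().fit(train)

tokens = ["laws", "of", "Delaware"]
# The predicted class is stored as one more feature per token.
features = [{"token": t, "bigram_pred": p} for t, p in zip(tokens, tagger.predict(tokens))]
print(features)
```

Note how context changes the prediction: "Delaware" after "of" is tagged as governing law, while "Delaware" at the start of "Delaware Avenue" is not.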


Modeling: Since no single model or rule can guarantee that a token belongs to a specific class, a stacking ensemble was used to better predict each token's class. XGBoost (a gradient-boosted decision tree model) serves as the meta-classifier: it takes the class predictions from the hard-coded rules and sequence-level models as features, together with the hundreds of token-based features, and is trained against manually tagged data. XGBoost assigns a probability to each class prediction, so the final class is determined by a probability threshold, chosen on a holdout set to maximize the F-measure.
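The threshold selection can be sketched in plain Python. The sketch assumes a single class treated as a binary problem, takes F-measure to mean F1, and uses made-up holdout labels and probabilities; in practice the probabilities would come from the trained meta-classifier. The helper names `f1_score` and `best_threshold` are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(y_true, probs):
    """Return (threshold, f1) maximizing F1 over the distinct predicted probabilities."""
    best = (1.0, 0.0)
    for thr in sorted(set(probs)):
        f1 = f1_score(y_true, [p >= thr for p in probs])
        if f1 > best[1]:
            best = (thr, f1)
    return best


# Made-up holdout set: true labels and predicted probabilities for one class.
y_holdout = [1, 0, 1, 1, 0, 0]
p_holdout = [0.9, 0.6, 0.7, 0.4, 0.3, 0.1]

threshold, f1 = best_threshold(y_holdout, p_holdout)
print(threshold, round(f1, 3))
```

Only the observed probabilities are tried as candidate thresholds, because F1 can change only when the cut moves past one of them.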

Post-processing: Once the XGBoost meta-classifier had predicted the class of each token, the predictions were cleaned. Contiguous tokens are concatenated to produce a more homogeneous output. For example, a date is formatted as Month/Day/Year rather than left as a separate token for each value.
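The clean-up described above might look like the following sketch, which joins adjacent tokens that share a predicted class and then normalizes a date span. The function names, the "Date" class label, the `"O"` label, and the accepted input date formats are assumptions; the month-name formats also assume an English locale.

```python
from datetime import datetime


def merge_spans(tokens, labels, outside="O"):
    """Join runs of adjacent tokens that share the same non-outside label."""
    spans, prev = [], outside
    for tok, label in zip(tokens, labels):
        if label != outside:
            if label == prev:
                spans[-1][1].append(tok)  # extend the current run
            else:
                spans.append((label, [tok]))  # start a new run
        prev = label
    return [(label, " ".join(toks)) for label, toks in spans]


def normalize_date(text):
    """Format a date span as MM/DD/YYYY; return the input unchanged if unparseable."""
    cleaned = text.replace(" ,", ",")
    for fmt in ("%B %d, %Y", "%b %d, %Y", "%m/%d/%Y"):
        try:
            return datetime.strptime(cleaned, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return text


tokens = ["effective", "January", "15", ",", "2020", "in", "New", "York"]
labels = ["O", "Date", "Date", "Date", "Date", "O", "Governing Law", "Governing Law"]

spans = merge_spans(tokens, labels)
cleaned = [(lab, normalize_date(txt) if lab == "Date" else txt) for lab, txt in spans]
print(cleaned)
```

Returning unparseable dates unchanged keeps the original text available for review instead of silently dropping it.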


The algorithm developed by SFL produced significantly better classification results than all previous models, with F-measure as the scoring metric.

LinkSquares was satisfied with the resulting algorithm and deployed the solution into production with the help of AWS. It is estimated that over 100,000 legal documents have been analyzed, saving tens of thousands of hours, and enabling businesses to be more streamlined, transparent, and quantitative about their legal contracts.

"For over a year, we've developed a strong relationship with SFL Scientific and leveraged their skills to develop and deploy machine learning in our systems."

— Eric Alexander, CTO, LinkSquares

Tools & Technologies

Python, NLP, Hidden Markov Model, Conditional Random Field, Neural Networks, XGBoost, AWS, IAM, S3.


LinkSquares Inc.


Tech, Legal


Extract terms to automate contract analysis.


Used NLP techniques in conjunction with supervised learning algorithms to extract key terms from contracts for LinkSquares' Smart Values contract analysis cloud.


"AWS enables LinkSquares to innovate faster: Computing resources are easily provisioned, integrated, with compliant and secure software deployment a breeze."

— LinkSquares on AWS

SFL Scientific is an AWS consulting partner for data science, big data, and artificial intelligence development.

About Us

SFL Scientific is a data science consulting firm offering custom development and solutions that help companies enable, operate, and innovate using machine learning and predictive analytics. We accelerate the adoption of AI and deep learning and apply domain knowledge and industry expertise to solve complex, novel, and R&D-driven business problems with data-driven systems.