Accelerating the adoption & effectiveness of AI and machine learning.

We help our clients understand their current state, comparing it to the desired business outcome and feasible development. To assess an organization's data and analytics XX

Establishing where an organization sits, which use cases it can solve today, and what improvements and novel products it could create in the future requires an understanding of:

Data Strategy

Data Engineering


Machine Learning Development

Deployment & Integration

Data science and analytics enable business outcomes by bridging the gap between commercial management and technical expertise. To be successful, organizations must adopt a methodology that keeps a data-driven focus on business priorities and defers technical questions of hardware, accuracy, and speed until the business case and objectives are understood. Involving upstream and downstream users alongside data scientists aligns objectives and produces a coherent approach faster.

Building a Data Strategy & Technology Roadmap

A good data strategy begins with identifying business-driven needs and use cases that are technology agnostic. Orienting solutions to outcomes allows leaders, consultants, and data scientists to reliably determine the lift generated by adopting machine learning systems. A sound data strategy encompasses all aspects of data, engineering, algorithm development, IT, and all associated stakeholders that are XX as a result of innovation and development.

Identify gaps and develop a comprehensive plan for establishing best practices and solutions with machine learning, technology, and use-case driven development. This requires an actionable, feasible, and practical roadmap with clear timelines for the pilot and R&D phases, the MVP, and integration into business processes.

Data Strategy is Dynamic

Data strategies and roadmaps need not be limited to specific use cases; they can be a competitive advantage for your entire organization. Technology is highly dynamic, allowing for automation and optimization previously reserved for closed-environment systems. Prioritizing investments that generate immediate business value and prototyping solutions that will become a competitive advantage are critical to maintaining flexibility in a changing landscape.

The algorithm that SFL delivered was then deployed through AWS, which ran the code on demand: whenever a legal document was uploaded, SFL's code was launched automatically. With AWS, the entire process of deploying to production was streamlined and made easily accessible to all LinkSquares employees who needed access to this asset. In addition, AWS S3 was used for daily backups.
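The on-demand trigger described above could be sketched as an event handler. The source does not name the AWS service involved, so this sketch assumes an AWS Lambda function subscribed to S3 upload notifications; `extract_terms`, the bucket name, and the object key are hypothetical placeholders, not the deployed code.

```python
from urllib.parse import unquote_plus


def extract_terms(bucket, key):
    """Hypothetical stand-in for the extraction pipeline; returns a stub result."""
    return {"bucket": bucket, "key": key, "status": "queued"}


def handler(event, context):
    """Assumed Lambda entry point, invoked once per S3 upload notification."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        results.append(extract_terms(bucket, key))
    return results


# Illustrative event with made-up bucket and key names.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "contracts"},
                "object": {"key": "uploads/master+agreement.pdf"}}}
    ]
}
print(handler(sample_event, None))
```

Keeping the handler thin and delegating to a single function means the same pipeline can be run locally against a file path, which makes it easier to test outside AWS.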

The text extraction algorithm consisted of three main steps:

  1. Feature Engineering
  2. Modeling — stacking ensemble
  3. Post-processing

Feature Engineering

First, the algorithm tokenizes the raw text of the legal documents using a regular expression tokenizer, which simply means that each word of the text is parsed and stored as an independent observation. Then, hundreds of features are created from the tokenized text in three ways: rule-based features, token-based features, and sequence-level classes as features.
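A regular expression tokenizer of this kind can be sketched in a few lines. The pattern below is an assumption chosen for illustration (words with internal hyphens or apostrophes, plus single punctuation marks); the source does not give the actual expression.

```python
import re

# Hypothetical pattern: word characters (allowing internal - or '), or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Split raw text into tokens, storing each one as its own observation."""
    return [
        {"index": i, "token": m.group(), "start": m.start(), "end": m.end()}
        for i, m in enumerate(TOKEN_PATTERN.finditer(text))
    ]


sample = "This Agreement is governed by the laws of New York."
tokens = tokenize(sample)
print([t["token"] for t in tokens])
```

Storing the character offsets alongside each token makes it possible to map a predicted class back to its position in the original document.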

Rule-based features are created by matching a “fuzzy dictionary” to a set of predefined, known smart term values. Hard-coded rules should translate to a known class. For example, North American states are hard-coded to be the “Governing Law” class. The classes determined from this hard-coded set of rules are saved as features.
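One way to approximate a fuzzy dictionary lookup is with approximate string matching from Python's standard library. This is a minimal sketch, assuming a small illustrative dictionary, a similarity cutoff of 0.85, and "O" as the label for "no class"; none of these values come from the source.

```python
import difflib

# Hypothetical dictionary of known values and their classes.
KNOWN_VALUES = {
    "new york": "Governing Law",
    "delaware": "Governing Law",
    "california": "Governing Law",
    "massachusetts": "Governing Law",
}


def rule_based_class(phrase, cutoff=0.85):
    """Return the class of the closest dictionary entry, or "O" if nothing is close enough."""
    matches = difflib.get_close_matches(
        phrase.lower(), list(KNOWN_VALUES), n=1, cutoff=cutoff
    )
    return KNOWN_VALUES[matches[0]] if matches else "O"


for phrase in ["Delaware", "Deleware", "Massachusets", "payment"]:
    print(phrase, "->", rule_based_class(phrase))
```

The fuzzy match tolerates misspellings and OCR errors, and the returned class can be stored as one more feature column per token.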

Token-based features are generated on a per-token basis using knowledge of the predefined, known smart terms. The features fall into three general categories: token-level (Is this word a noun?), sentence-level (Is this token the first token in a sentence?), and document-level (Is this token found in the X section?). Other examples of token-based features include “Is the token capitalized?”, “How many letter ’A’s are there in the token?”, “How long is the token?”, and “Is the next word a known smart term?”.
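A few of the token-based features listed above can be computed with plain string operations. The sketch below is illustrative only: the feature names, the case-insensitive letter count, and the sentence-boundary rule are assumptions, and part-of-speech and document-section features are omitted because they need a tagger and a document parser.

```python
SENTENCE_END = {".", "!", "?"}


def token_features(tokens, i, smart_terms):
    """Build a feature dictionary for the token at position i."""
    token = tokens[i]
    next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "is_capitalized": token[:1].isupper(),
        "count_letter_a": token.lower().count("a"),
        "length": len(token),
        "is_first_in_sentence": i == 0 or tokens[i - 1] in SENTENCE_END,
        "next_is_smart_term": next_token.lower() in smart_terms,
    }


tokens = ["Governing", "law", ":", "Delaware", ".", "Payment", "is", "due", "."]
smart_terms = {"delaware"}  # hypothetical set of known smart term values
rows = [token_features(tokens, i, smart_terms) for i in range(len(tokens))]
print(rows[3])
```

Each dictionary becomes one row of the feature matrix, with one row per token.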

Sequence-level features are generated by predicting the classes of each token using several sequence-level machine learning models: Conditional Random Field, Hidden Markov Model, N-gram model, and a neural network. Just as the rule-based classes are saved as features, the classes determined from these sequence-level machine learning models are also saved as features.
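The idea of saving a sequence model's predictions as features can be illustrated with a deliberately simple stand-in. The sketch below is not a Conditional Random Field or Hidden Markov Model; it is a toy bigram tagger that predicts a class from the previous and current token, backing off to the current token alone. The training sequences and class names are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_tagger(tagged_sequences):
    """Count classes per (previous token, token) pair, with a single-token back-off."""
    bigram_counts = defaultdict(Counter)
    unigram_counts = defaultdict(Counter)
    for sequence in tagged_sequences:
        previous = "<s>"
        for token, label in sequence:
            key = token.lower()
            bigram_counts[(previous, key)][label] += 1
            unigram_counts[key][label] += 1
            previous = key
    return bigram_counts, unigram_counts


def predict_classes(model, tokens, default="O"):
    """Predict one class per token, using the most frequent class seen in training."""
    bigram_counts, unigram_counts = model
    predictions = []
    previous = "<s>"
    for token in tokens:
        key = token.lower()
        counts = bigram_counts.get((previous, key)) or unigram_counts.get(key)
        predictions.append(counts.most_common(1)[0][0] if counts else default)
        previous = key
    return predictions


training = [
    [("laws", "O"), ("of", "O"), ("Delaware", "Governing Law")],
    [("State", "O"), ("of", "O"), ("New", "Governing Law"), ("York", "Governing Law")],
]
model = train_bigram_tagger(training)
tokens = ["laws", "of", "New", "York"]
predicted = predict_classes(model, tokens)

# The predicted class is stored as a feature, next to any other features of the token.
features = [{"token": t, "ngram_class": c} for t, c in zip(tokens, predicted)]
print(features)
```

In a stacked setup, each sequence model would contribute its own column of predicted classes in the same way.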


Modeling

Since no single model or rule can guarantee that a token belongs to a specific class, a model stacking ensemble technique was implemented to better predict the class of a token. XGBoost (a gradient boosted decision tree-based model) was implemented as a meta-classifier, which uses the class predictions from the hard-coded rules and sequence-level models as features. The XGBoost meta-classifier also uses the hundreds of token-based features as predictors and is trained against the manually tagged data. XGBoost assigns a probability to each class prediction, so a probability threshold was chosen on a holdout set to maximize F-measure, and that threshold determines the final class.
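The threshold selection step can be sketched independently of the meta-classifier. The example below assumes a binary, one-class-versus-rest view, F1 as the F-measure, and made-up holdout labels and probabilities; it simply tries each observed probability as a cutoff and keeps the one with the highest score.

```python
def f_measure(y_true, y_pred, beta=1.0):
    """Compute the F-beta score from binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(y_true, probabilities):
    """Return the cutoff (and its score) that maximizes F-measure on the holdout set."""
    candidates = sorted(set(probabilities))
    scored = [
        (f_measure(y_true, [p >= cut for p in probabilities]), cut)
        for cut in candidates
    ]
    best_score, best_cut = max(scored, key=lambda pair: pair[0])
    return best_cut, best_score


# Illustrative holdout data, not results from the project.
holdout_labels = [1, 0, 1, 1, 0, 0, 1, 0]
holdout_probs = [0.92, 0.40, 0.75, 0.55, 0.60, 0.10, 0.81, 0.30]
cut, score = best_threshold(holdout_labels, holdout_probs)
print(cut, round(score, 3))
```

Choosing the cutoff on a holdout set rather than the training data avoids tuning the threshold to predictions the model has already memorized.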

Post-processing

Once the classes for each token were predicted by the XGBoost meta-classifier, the predictions were cleaned. Contiguous tokens of the same class are concatenated, which produces a more homogeneous output. For example, dates are formatted as Month/Day/Year as opposed to being left as a separate token for each value.
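The cleaning step can be sketched as two small functions: one that merges adjacent tokens sharing a class, and one that reformats a merged date. The class names, the "O" label for unclassified tokens, and the assumed input date format are all illustrative.

```python
from datetime import datetime
from itertools import groupby


def merge_spans(tokens, classes, outside="O"):
    """Concatenate runs of adjacent tokens that share a predicted class."""
    spans = []
    for label, group in groupby(zip(tokens, classes), key=lambda pair: pair[1]):
        if label != outside:
            spans.append((label, " ".join(token for token, _ in group)))
    return spans


def normalize_date(text, input_format="%B %d %Y"):
    """Reformat a merged date as Month/Day/Year; return the text unchanged if it does not parse."""
    try:
        return datetime.strptime(text, input_format).strftime("%m/%d/%Y")
    except ValueError:
        return text


tokens = ["effective", "March", "1", "2018", "under", "New", "York", "law"]
classes = ["O", "Effective Date", "Effective Date", "Effective Date",
           "O", "Governing Law", "Governing Law", "O"]
spans = merge_spans(tokens, classes)
cleaned = [
    (label, normalize_date(text) if label == "Effective Date" else text)
    for label, text in spans
]
print(cleaned)
```

Falling back to the original text when a date does not parse keeps unusual formats visible for review instead of silently dropping them.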


The algorithm developed by SFL produced significantly better classification results than all previous models using F-measure as the scoring metric.

LinkSquares was satisfied with the resulting algorithm and deployed the solution into production with the help of AWS. It is estimated that over 100,000 legal documents have been analyzed, saving tens of thousands of hours, and enabling businesses to be more streamlined, transparent, and quantitative about their legal contracts.

"For over a year, we've developed a strong relationship with SFL Scientific and leveraged their skills to develop and deploy machine learning in our systems."

— Eric Alexander, CTO, LinkSquares

Tools & Technologies

Python, NLP, Hidden Markov Model, Conditional Random Field, Neural Networks, XGBoost, AWS, IAM, S3.


LinkSquares Inc.


Tech, Legal


Extract terms to automate contract analysis.


Used NLP techniques in conjunction with supervised learning algorithms to extract key terms from contracts for LinkSquares' Smart Values contract analysis cloud.

"AWS enables LinkSquares to innovate faster: Computing resources are easily provisioned, integrated, with compliant and secure software deployment a breeze."

— LinkSquares on AWS

SFL Scientific is an AWS consulting partner for data science, big data, and artificial intelligence development, and a preferred deep learning partner of NVIDIA.

About Us

SFL Scientific is a data science consulting firm offering custom development and solutions, helping companies enable, operate, and innovate using machine learning and predictive analytics. We accelerate the adoption of AI and deep learning and apply domain knowledge and industry expertise to solve complex and novel R&D and business problems with data-driven systems.