Natural Language Processing: Relation Extraction I


Relation extraction is an interesting problem in natural language processing. The idea is to link two entities that appear in unstructured text, such as a company and its owner, or a person and the position they hold at a company. An example would be to extract the relation between Bill Gates and Microsoft from the following text:

"<PERSON> Bill Gates </PERSON>, the founder of <ORG> Microsoft </ORG>, hosted a party last night".

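As a rough illustration, the tagged sentence above can be turned into candidate entity pairs with a regular expression. This is only a sketch: it assumes an upstream tagger has already inserted inline tags in exactly this format, and the names tagged_entities and candidate_pairs are made up for this example.

```python
import re

# Matches inline tags such as "<PERSON> Bill Gates </PERSON>".
# Assumption: tags are upper-case, well formed, and not nested.
TAG_PATTERN = re.compile(r"<(?P<label>[A-Z]+)>\s*(?P<text>.*?)\s*</(?P=label)>")

def tagged_entities(sentence):
    """Return (label, text) tuples in order of appearance."""
    return [(m.group("label"), m.group("text"))
            for m in TAG_PATTERN.finditer(sentence)]

def candidate_pairs(sentence, first="PERSON", second="ORG"):
    """Pair every `first` entity with every `second` entity in the sentence."""
    entities = tagged_entities(sentence)
    return [(a, b)
            for label_a, a in entities if label_a == first
            for label_b, b in entities if label_b == second]

sentence = ("<PERSON> Bill Gates </PERSON>, the founder of "
            "<ORG> Microsoft </ORG>, hosted a party last night")

print(tagged_entities(sentence))
# [('PERSON', 'Bill Gates'), ('ORG', 'Microsoft')]
print(candidate_pairs(sentence))
# [('Bill Gates', 'Microsoft')]
```

Each candidate pair is what a relation extractor then has to accept or reject.
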
Surprisingly, the methods typically used on large projects are remarkably simple: they extract features for each candidate pair and treat relation extraction as a binary classification problem. More complicated supervised methods based on kernels, which bypass the need to generate explicit features, perform better but are not as well suited to larger-domain problems. We shall discuss these methods in a later post.

Supervised Feature Extraction

In most cases, we begin by identifying which words are the entities we wish to extract.

First, entity-based features can be developed. These include:

  • Gazetteers (i.e. dictionaries of organization names, etc.)
  • Features based on:
    • Parts of speech tags
    • Regular expressions
    • Word length
    • Word shape
    • Substring
    • Capitalized letters
  • More complicated features such as:
    • Bigrams
    • Sequence modeling, especially for the words between the two entities.
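
Several of the features listed above can be sketched in a few lines of Python. Everything here is illustrative: the gazetteer is a toy set, the function names and feature keys are invented for this post, and part-of-speech tags are left out because they need an external tagger.

```python
import re

# Hypothetical toy gazetteer; a real one would be a large curated dictionary.
ORG_GAZETTEER = {"microsoft", "apple", "ibm"}

def word_shape(token):
    """Map characters to classes, e.g. 'Gates' -> 'Xxxxx', 'B2B' -> 'XdX'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)

def entity_features(entity):
    """Gazetteer, shape, length, capitalization and substring features."""
    tokens = entity.split()
    return {
        "in_org_gazetteer": entity.lower() in ORG_GAZETTEER,
        "shape": " ".join(word_shape(t) for t in tokens),
        "length": len(entity),
        "capitalized": all(t[:1].isupper() for t in tokens),
        "prefix3": entity[:3].lower(),
    }

def add_features(prefix, feats, out):
    """Booleans and strings become indicator features; numbers stay numeric."""
    for name, value in feats.items():
        if isinstance(value, (bool, str)):
            out[f"{prefix}_{name}={value}"] = 1.0
        else:
            out[f"{prefix}_{name}"] = float(value)

def pair_features(tokens, e1_span, e2_span):
    """Features for one candidate pair; spans are (start, end), end exclusive."""
    e1 = " ".join(tokens[e1_span[0]:e1_span[1]])
    e2 = " ".join(tokens[e2_span[0]:e2_span[1]])
    between = [t.lower() for t in tokens[e1_span[1]:e2_span[0]]]
    feats = {}
    add_features("e1", entity_features(e1), feats)
    add_features("e2", entity_features(e2), feats)
    for word in between:                      # words between the entities
        feats[f"between_word={word}"] = 1.0
    for a, b in zip(between, between[1:]):    # bigrams between the entities
        feats[f"between_bigram={a}_{b}"] = 1.0
    feats["num_between"] = float(len(between))
    return feats

tokens = "Bill Gates , the founder of Microsoft , hosted a party".split()
feats = pair_features(tokens, (0, 2), (6, 7))
print(sorted(feats))
```

The words and bigrams between the two entities ("the founder of") are usually the most informative part of the feature set for a relation like this one.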

Now that we have generated features, we can classify every candidate pair of entities using typical supervised machine learning algorithms.
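
To make the classification step concrete, here is a minimal sketch that uses a perceptron as a stand-in for whichever supervised learner you prefer. The perceptron is my choice for brevity, not a recommendation, and the four training examples are hand-written, hypothetical feature dictionaries rather than real data.

```python
from collections import defaultdict

def train_perceptron(examples, epochs=10):
    """examples: list of (feature_dict, label) pairs with label in {+1, -1}."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            score = sum(weights[f] * v for f, v in feats.items())
            if label * score <= 0:  # wrong side of (or on) the boundary
                for f, v in feats.items():
                    weights[f] += label * v
    return weights

def predict(weights, feats):
    """Return +1 if the pair is predicted to hold the relation, else -1."""
    score = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return 1 if score > 0 else -1

# Hypothetical training set: +1 means "person founded/leads organization".
training = [
    ({"between_word=founder": 1.0, "between_word=of": 1.0, "bias": 1.0}, 1),
    ({"between_word=ceo": 1.0, "between_word=of": 1.0, "bias": 1.0}, 1),
    ({"between_word=visited": 1.0, "bias": 1.0}, -1),
    ({"between_word=criticized": 1.0, "bias": 1.0}, -1),
]

weights = train_perceptron(training)
print(predict(weights, {"between_word=founder": 1.0,
                        "between_word=of": 1.0, "bias": 1.0}))  # 1
print(predict(weights, {"between_word=visited": 1.0, "bias": 1.0}))  # -1
```

In practice the feature dictionaries would come from a feature extractor run over every candidate pair, and the learner would be swapped for something sturdier such as logistic regression or an SVM.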

Semi-Supervised Seeding

We can also proceed in a semi-supervised manner: if we already know certain relationships, say Bill Gates and Microsoft, we can learn the rule:

"<PERSON> X </PERSON>, the founder of <ORG> Y </ORG>, hosted a party last night".

With a large dataset, we can apply this exact sentence pattern to find new Person-Organization relations. Say this yields new entities X' and Y'. We can then look for other sentences that mention X' and Y' to learn different sentence structures, and keep bootstrapping until no new patterns or pairs turn up.
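
The bootstrapping loop can be sketched as follows. Two simplifications are mine, not part of the method as described: a pattern is just the text between the two tagged entities rather than the whole sentence, and the five-sentence corpus is made up for the example.

```python
import re

# One PERSON followed by one ORG, capturing the text in between.
PAIR = re.compile(r"<PERSON>\s*(.*?)\s*</PERSON>(.*?)<ORG>\s*(.*?)\s*</ORG>")

def bootstrap(corpus, seeds, max_rounds=10):
    """Alternate between learning patterns from known pairs and
    harvesting new pairs with those patterns, until nothing changes."""
    found = [(m.group(1), m.group(2).strip(), m.group(3))
             for sentence in corpus for m in PAIR.finditer(sentence)]
    pairs, patterns = set(seeds), set()
    for _ in range(max_rounds):
        # Learn a pattern from every sentence that mentions a known pair.
        new_patterns = {mid for person, mid, org in found
                        if (person, org) in pairs}
        active = patterns | new_patterns
        # Apply all patterns to harvest pairs.
        new_pairs = {(person, org) for person, mid, org in found
                     if mid in active}
        if new_patterns <= patterns and new_pairs <= pairs:
            break  # fixed point: no new patterns and no new pairs
        patterns |= new_patterns
        pairs |= new_pairs
    return pairs, patterns

corpus = [
    "<PERSON> Bill Gates </PERSON>, the founder of <ORG> Microsoft </ORG>, hosted a party last night",
    "<PERSON> Steve Jobs </PERSON>, the founder of <ORG> Apple </ORG>, gave a keynote",
    "<PERSON> Steve Jobs </PERSON> started <ORG> Apple </ORG> in a garage",
    "<PERSON> Larry Page </PERSON> started <ORG> Google </ORG> in 1998",
    "<PERSON> Jane Doe </PERSON> criticized <ORG> Acme </ORG> yesterday",
]

pairs, patterns = bootstrap(corpus, {("Bill Gates", "Microsoft")})
print(sorted(pairs))
# [('Bill Gates', 'Microsoft'), ('Larry Page', 'Google'), ('Steve Jobs', 'Apple')]
print(sorted(patterns))
# [', the founder of', 'started']
```

Starting from a single seed, the loop learns ", the founder of", uses it to find Steve Jobs and Apple, learns "started" from that pair, and then finds Larry Page and Google. A looser pattern such as "started" is also how errors creep in, so real systems usually score patterns before trusting them.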

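As with a lot of these rule-generation methods, the results will typically give high precision but low recall: an exact pattern rarely matches the wrong pair, but it also misses most other ways of phrasing the same relation.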

Final Thoughts

Both simple supervised and semi-supervised methods play a large role in today's relation extraction (RE) systems. These methods are geared towards automatically discovering relations in large bodies of text, including web pages.

In a later post, we will discuss more domain specific algorithms that make use of kernels.