lexnlp.extract: Extracting structured data from unstructured text

The lexnlp.extract module contains methods for extracting structured data from unstructured text. Supported data types cover a wide range of facts relevant to contract and document analysis, including dates, amounts, proper noun types, and conditional statements.

This module is organized by two-character ISO 639-1 language code, with one submodule per language. Currently, the following languages are stable:
  • English: lexnlp.extract.en

Extraction methods follow a simple get_X naming pattern, where X is the data type to extract, as demonstrated below:

>>> import lexnlp.extract.en.amounts
>>> text = "There are ten cows in the 2 acre pasture."
>>> print(list(lexnlp.extract.en.amounts.get_amounts(text)))
[10, 2.0]
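
The example above wraps the call in list(), which suggests that a get_X method hands back its results lazily. The following is a minimal sketch of that convention in plain Python, not LexNLP's implementation: get_percents and PERCENT_RE are hypothetical names introduced only for illustration, and the single regular expression is an assumption about what a simple pattern might look like.

```python
import re
from typing import Iterator

# Hypothetical pattern (not from LexNLP): an integer or decimal number
# followed by "%" or the word "percent".
PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:%|percent\b)", re.IGNORECASE)


def get_percents(text: str) -> Iterator[float]:
    """Yield each percentage found in text as a float (illustrative only)."""
    for match in PERCENT_RE.finditer(text):
        yield float(match.group(1))


text = "The rate rises from 5% to 7.5 percent after year two."
percents = list(get_percents(text))  # materialize the lazy results
print(percents)  # [5.0, 7.5]
```

Writing the extractor as a generator lets a caller stop early on a long document, while list() remains available when every match is needed at once.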

Pattern-based extraction methods

The full list of supported pattern-based structured data types is below:

NLP-based extraction methods

In addition to pattern-based structured data types, the lexnlp.extract module also supports NLP methods based on classifiers that operate on part-of-speech-tagged text. These classifiers are based on NLTK and, optionally, Stanford NLP libraries. The available methods are listed below:

  • named entity extraction with NLTK maximum entropy classifier
  • named entity extraction with NLTK and regular expressions
  • named entity extraction with Stanford Named Entity Recognition (NER) models
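
To give a rough feel for how regular expressions can contribute to entity extraction, here is a deliberately simplified sketch. It swaps the trained tagger and classifier for a pure regular-expression heuristic (runs of two or more capitalized words), so it is a stand-in for the idea rather than the approach LexNLP, NLTK, or Stanford NER actually take. The names get_entity_candidates and CANDIDATE_RE are hypothetical.

```python
import re
from typing import Iterator

# Hypothetical heuristic (not from LexNLP): two or more consecutive
# capitalized words are treated as a named-entity candidate.
CANDIDATE_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def get_entity_candidates(text: str) -> Iterator[str]:
    """Yield capitalized multi-word spans as entity candidates (illustrative only)."""
    for match in CANDIDATE_RE.finditer(text):
        yield match.group(0)


text = "This agreement is between Acme Widgets and Jane Doe of New York."
candidates = list(get_entity_candidates(text))
print(candidates)  # ['Acme Widgets', 'Jane Doe', 'New York']
```

A heuristic like this cannot tell a company from a person or a place, and it misses single-word names. Labeling candidates by type is the kind of decision the classifier-based methods listed above are meant to make.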