lexnlp.extract: Extracting structured data from unstructured text

The lexnlp.extract module provides methods for extracting structured data from unstructured text. Supported data types cover a wide range of facts relevant to contract and document analysis, including dates, amounts, proper noun types, and conditional statements.

This module is organized by ISO 639-1 two-character language codes. Currently, the following languages are stable:
  • English: lexnlp.extract.en
  • German: lexnlp.extract.de
  • Spanish: lexnlp.extract.es

Extraction methods follow a simple get_X pattern as demonstrated below:

>>> import lexnlp.extract.en.amounts
>>> text = "There are ten cows in the 2 acre pasture."
>>> print(list(lexnlp.extract.en.amounts.get_amounts(text)))
[10, 2.0]
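
The language-code layout and the get_X naming convention can be combined to resolve an extractor at runtime. The following is a minimal sketch, not part of LexNLP: extractor_path and load_extractor are hypothetical helper names, and the loader assumes that the module for a data type exposes a function named get_<data type>, as get_amounts does in the example above. That assumption may not hold for every module.

```python
import importlib
from typing import Callable

# Locales listed as stable in this section.
STABLE_LANGUAGES = ("en", "de", "es")


def extractor_path(language: str, data_type: str) -> str:
    """Build the dotted module path, e.g. 'lexnlp.extract.en.amounts'."""
    code = language.lower()
    if code not in STABLE_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    return f"lexnlp.extract.{code}.{data_type}"


def load_extractor(language: str, data_type: str) -> Callable:
    """Import the module and return its get_<data_type> function.

    Assumption: the function name mirrors the module name. This requires
    LexNLP to be installed and fails with ImportError or AttributeError
    if the module or function does not exist.
    """
    module = importlib.import_module(extractor_path(language, data_type))
    return getattr(module, f"get_{data_type}")


print(extractor_path("en", "amounts"))  # lexnlp.extract.en.amounts
```

With LexNLP installed, load_extractor("en", "amounts") would return the same get_amounts function used in the example above.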

Pattern-based extraction methods

The full list of supported pattern-based structured data types is below:
  • “EN” locale:
  • “DE” locale:
    • amounts, e.g., “1 tausend” or “eine halbe Million Dollar”
    • citations, e.g., “BGBl. I S. 434”
    • copyrights, e.g., “siemens.com globale Website Siemens © 1996 – 2019”
    • court citations, e.g., “BStBl I 2003, 240”
    • courts, e.g., “Amtsgerichte”
    • dates, e.g., “vom 29. März 2017”
    • definitions
    • durations, e.g., “14. Lebensjahr” or “fünfundzwanzig Jahren”
    • geographic and geopolitical entities, e.g., “Albanien”
    • percents, e.g., “15 Volumenprozent”
  • “ES” locale:
    • copyrights, e.g., “Website BBC Mundo © 1996 – 2019”
    • courts, e.g., “Tribunal Superior de Justicia de Madrid”
    • dates, e.g., “15 de febrero” or “1ºde enero de 1999”
    • definitions, e.g., “"El ser humano": una anatomía moderna humana”
    • regulations, e.g., “Comisión Nacional Bancaria y de Valores”
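
To illustrate what "pattern-based" means for the data types listed above, here is a deliberately simplified sketch of a regular-expression extractor written in the get_X generator style. It is not LexNLP's implementation: the name get_percents_sketch and the pattern are assumptions made for illustration, and the pattern handles only digits followed by "%" or the English word "percent".

```python
import re
from typing import Iterator, Tuple

# Hypothetical pattern: a decimal number followed by "%" or the word "percent".
PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(%|percent\b)", re.IGNORECASE)


def get_percents_sketch(text: str) -> Iterator[Tuple[str, float]]:
    """Yield (unit, value) pairs for each percent expression found in text."""
    for match in PERCENT_RE.finditer(text):
        number, unit = match.groups()
        yield unit.lower(), float(number)


text = "Interest accrues at 5.5% per year, capped at 12 percent."
print(list(get_percents_sketch(text)))
# [('%', 5.5), ('percent', 12.0)]
```

Yielding results from a generator lets a caller stop early or wrap the call in list(), as the get_amounts example does.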

NLP-based extraction methods

In addition to pattern-based structured data types, the lexnlp.extract module also supports NLP methods based on tagged part-of-speech classifiers. These classifiers are built on NLTK and, optionally, the Stanford NLP libraries. The available modules are listed below:

  • named entity extraction with NLTK maximum entropy classifier
  • named entity extraction with NLTK and regular expressions
  • named entity extraction with Stanford Named Entity Recognition (NER) models

These modules can extract data types such as:
  • addresses, e.g., “1999 Mount Read Blvd, Rochester, NY, USA, 14615”
  • companies, e.g., “Lexpredict LLC”
  • persons, e.g., “John Doe”