`lexnlp.extract`: Extracting structured data from unstructured text

The lexnlp.extract module contains methods that allow for the extraction of structured data from unstructured textual sources. Supported data types include a wide range of facts relevant to contract or document analysis, including dates, amounts, proper noun types, and conditional statements.

This module is organized by ISO 639-1 two-letter language codes, with one subpackage per language. Currently, the following language is stable:

English: lexnlp.extract.en

Extraction methods follow a simple get_X pattern as demonstrated below:

>>> import lexnlp.extract.en.amounts
>>> text = "There are ten cows in the 2 acre pasture."
>>> print(list(lexnlp.extract.en.amounts.get_amounts(text)))
[10, 2.0]
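
The naming convention above can be sketched as a small lookup helper. This is an illustration only: `extractor_path` and `load_extractor` are hypothetical names, not part of LexNLP, and the sketch assumes that each data type lives in a module whose name matches its `get_X` suffix, as it does for amounts in the example above.

```python
import importlib
from typing import Callable, Tuple


def extractor_path(data_type: str, language: str = "en") -> Tuple[str, str]:
    """Build the (module name, function name) pair implied by the get_X pattern.

    Assumption: the module name and the function suffix are identical,
    e.g. "amounts" -> lexnlp.extract.en.amounts.get_amounts.
    """
    module_name = f"lexnlp.extract.{language}.{data_type}"
    function_name = f"get_{data_type}"
    return module_name, function_name


def load_extractor(data_type: str, language: str = "en") -> Callable:
    """Import the module and return its get_X function (requires LexNLP installed)."""
    module_name, function_name = extractor_path(data_type, language)
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


print(extractor_path("amounts"))
# ('lexnlp.extract.en.amounts', 'get_amounts')
```

With LexNLP installed, `load_extractor("amounts")(text)` should be equivalent to calling `lexnlp.extract.en.amounts.get_amounts(text)` directly. Data types whose module name differs from the function suffix would need an explicit mapping instead.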

Pattern-based extraction methods

The full list of supported pattern-based structured data types is below:

amounts, e.g., “ten pounds” or “5.8 megawatts”
citations, e.g., “10 U.S. 100” or “1998 S. Ct. 1”
conditions, e.g., “subject to …” or “unless and until …”
constraints, e.g., “no more than”
courts, e.g., “Supreme Court of New York”
dates, e.g., “June 1, 2017” or “2018-01-01”
definitions, e.g., “Term shall mean …”
distances, e.g., “fifteen miles”
durations, e.g., “ten years” or “thirty days”
geographic and geopolitical entities, e.g., “New York” or “Norway”
money and currency usages, e.g., “$5” or “10 Euro”
percents and rates, e.g., “10%” or “50 bps”
PII, e.g., “212-212-2121” or “999-999-9999”
ratios, e.g., “3:1” or “four to three”
regulations, e.g., “32 CFR 170”
trademarks, e.g., “MyApp (TM)”
URLs, e.g., “http://acme.com/”
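
To make the idea of pattern-based extraction concrete, here is a minimal sketch of a `get_X`-style generator built on a regular expression. It is not LexNLP's implementation: the function name `get_percents_sketch` and the pattern are assumptions chosen for illustration, and the pattern covers only digits followed by a percent sign (not “50 bps” or spelled-out numbers).

```python
import re
from typing import Iterator

# Hypothetical pattern, not LexNLP's own: an integer or decimal number,
# optional whitespace, then a literal percent sign.
PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def get_percents_sketch(text: str) -> Iterator[float]:
    """Yield each percentage found in the text as a float, in order of appearance."""
    for match in PERCENT_RE.finditer(text):
        yield float(match.group(1))


text = "Interest accrues at 10% per year, capped at 12.5 %."
print(list(get_percents_sketch(text)))
# [10.0, 12.5]
```

Like the `get_amounts` example earlier, the function is a generator, so callers wrap it in `list(...)` when they want all results at once.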

NLP-based extraction methods

In addition to pattern-based structured data types, the lexnlp.extract module also supports NLP methods based on tagged part-of-speech classifiers. These classifiers are based on NLTK and, optionally, Stanford NLP libraries. The list of these modules is below:

named entity extraction with NLTK maximum entropy classifier

named entity extraction with NLTK and regular expressions

named entity extraction with Stanford Named Entity Recognition (NER) models
