lexnlp.extract: Extracting structured data from unstructured text
The lexnlp.extract module contains methods that allow for the extraction
of structured data from unstructured textual sources. Supported data types include a
wide range of facts relevant to contract or document analysis, including dates, amounts,
proper noun types, and conditional statements.
This module is structured along ISO 2-character language codes. Currently, the following languages are stable:
- English: lexnlp.extract.en
Extraction methods follow a simple get_X pattern as demonstrated below:
>>> import lexnlp.extract.en.amounts
>>> text = "There are ten cows in the 2 acre pasture."
>>> print(list(lexnlp.extract.en.amounts.get_amounts(text)))
[10, 2.0]
Pattern-based extraction methods
The full list of supported pattern-based structured data types is below:
- amounts, e.g., “ten pounds” or “5.8 megawatts”
- citations, e.g., “10 U.S. 100” or “1998 S. Ct. 1”
- conditions, e.g., “subject to …” or “unless and until …”
- constraints, e.g., “no more than”
- copyright, e.g., “(C) Copyright 2000 Acme”
- courts, e.g., “Supreme Court of New York”
- dates, e.g., “June 1, 2017” or “2018-01-01”
- definitions, e.g., “Term shall mean …”
- distances, e.g., “fifteen miles”
- durations, e.g., “ten years” or “thirty days”
- geographic and geopolitical entities, e.g., “New York” or “Norway”
- money and currency usages, e.g., “$5” or “10 Euro”
- percents and rates, e.g., “10%” or “50 bps”
- PII, e.g., “212-212-2121” or “999-999-9999”
- ratios, e.g., “3:1” or “four to three”
- regulations, e.g., “32 CFR 170”
- trademarks, e.g., “MyApp (TM)”
- URLs, e.g., “http://acme.com/”
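To make the idea of a pattern-based method concrete, the following is a minimal sketch of a generator written in the same get_X style, using only the Python standard library. It is not LexNLP's implementation: the name get_percents_sketch, the regular expression, and the tuple layout are all assumptions made for illustration, and the pattern covers only simple numeric forms such as “10%” or “50 bps”.

```python
import re
from typing import Iterator, Tuple

# Hypothetical pattern for illustration only; not LexNLP's actual regex.
# Matches a number followed by "%", "percent", or "bps".
PERCENT_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(%|percent\b|bps\b)", re.IGNORECASE
)

# Divisors that convert a value in each unit to a plain fraction.
UNIT_DIVISOR = {"%": 100, "percent": 100, "bps": 10000}


def get_percents_sketch(text: str) -> Iterator[Tuple[str, float, float]]:
    """Yield (unit, value, fraction) for each percent-like match in text.

    Assumed helper, named after the get_X convention; it is not part of
    the lexnlp package.
    """
    for match in PERCENT_PATTERN.finditer(text):
        value = float(match.group(1))
        unit = match.group(2).lower()
        yield unit, value, value / UNIT_DIVISOR[unit]


if __name__ == "__main__":
    for item in get_percents_sketch("The rate rises 10% plus 50 bps."):
        print(item)
```

Like the get_amounts example earlier, the sketch returns a generator, so callers typically wrap it in list() when they want all matches at once.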
NLP-based extraction methods
In addition to pattern-based structured data types, the lexnlp.extract module also supports NLP methods based on tagged part-of-speech classifiers. These classifiers are based on NLTK and, optionally, Stanford NLP libraries. The list of these modules is below:
- named entity extraction with NLTK maximum entropy classifier
- named entity extraction with NLTK and regular expressions
- named entity extraction with Stanford Named Entity Recognition (NER) models