lexnlp.extract: Extracting structured data from unstructured text
The lexnlp.extract module contains methods for extracting structured data
from unstructured text. Supported data types include a wide range of facts
relevant to contract or document analysis, including dates, amounts,
proper noun types, and conditional statements.
This module is structured along ISO 2-character language codes. Currently, the following languages are stable:
English: lexnlp.extract.en
German: lexnlp.extract.de
Spanish: lexnlp.extract.es
Extraction methods follow a simple get_X pattern as demonstrated below:
>>> import lexnlp.extract.en.amounts
>>> text = "There are ten cows in the 2 acre pasture."
>>> print(list(lexnlp.extract.en.amounts.get_amounts(text)))
[10, 2.0]
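Because each locale lives in its own subpackage, the language code can be chosen at runtime. The following is a minimal sketch, assuming that a data type X lives in a submodule named lexnlp.extract.<lang>.<X> and exposes a get_<X> function, as amounts does above. The helpers extractor_path and load_extractor are hypothetical and not part of LexNLP, and not every data type is available for every locale.

```python
import importlib

# Locales listed as stable in this document.
STABLE_LANGUAGES = {"en", "de", "es"}


def extractor_path(lang: str, data_type: str) -> str:
    """Build the dotted module path for a locale and data type (hypothetical helper)."""
    lang = lang.lower()
    if lang not in STABLE_LANGUAGES:
        raise ValueError(f"unsupported language code: {lang!r}")
    return f"lexnlp.extract.{lang}.{data_type}"


def load_extractor(lang: str, data_type: str):
    """Import the module and return its get_<data_type> function.

    Assumes LexNLP is installed and that the module follows the get_X
    naming pattern; raises ImportError or AttributeError otherwise.
    """
    module = importlib.import_module(extractor_path(lang, data_type))
    return getattr(module, f"get_{data_type}")


print(extractor_path("en", "amounts"))  # lexnlp.extract.en.amounts
```

With LexNLP installed, load_extractor("en", "amounts") would be expected to return the same get_amounts function used in the example above.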
Pattern-based extraction methods
The full list of supported pattern-based structured data types is below:
“EN” locale:
acts, e.g., “section 1 of the Advancing Hope Act, 1986”
amounts, e.g., “ten pounds” or “5.8 megawatts”
citations, e.g., “10 U.S. 100” or “1998 S. Ct. 1”
companies, e.g., “Lexpredict LLC”
conditions, e.g., “subject to …” or “unless and until …”
constraints, e.g., “no more than” or “
copyright, e.g., “(C) Copyright 2000 Acme”
courts, e.g., “Supreme Court of New York”
CUSIP, e.g., “392690QT3”
dates, e.g., “June 1, 2017” or “2018-01-01”
definitions, e.g., “Term shall mean …”
distances, e.g., “fifteen miles”
durations, e.g., “ten years” or “thirty days”
geographic and geopolitical entities, e.g., “New York” or “Norway”
money and currency usages, e.g., “$5” or “10 Euro”
percents and rates, e.g., “10%” or “50 bps”
PII, e.g., “212-212-2121” or “999-999-9999”
ratios, e.g., “3:1” or “four to three”
regulations, e.g., “32 CFR 170”
trademarks, e.g., “MyApp (TM)”
URLs, e.g., “http://acme.com/”
“DE” locale:
amounts, e.g., “1 tausend” or “eine halbe Million Dollar”
citations, e.g., “BGBl. I S. 434”
copyrights, e.g., “siemens.com globale Website Siemens © 1996 – 2019”
court citations, e.g., “BStBl I 2003, 240”
courts, e.g., “Amtsgerichte”
dates, e.g., “vom 29. März 2017”
definitions
durations, e.g., “14. Lebensjahr” or “fünfundzwanzig Jahren”
geographic and geopolitical entities, e.g., “Albanien”
percents, e.g., “15 Volumenprozent”
“ES” locale:
copyrights, e.g., “Website BBC Mundo © 1996 – 2019”
courts, e.g., “Tribunal Superior de Justicia de Madrid”
dates, e.g., “15 de febrero” or “1ºde enero de 1999”
definitions, e.g., “”El ser humano”: una anatomía moderna humana”
regulations, e.g., “Comisión Nacional Bancaria y de Valores”
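As a rough illustration of what "pattern-based" means for the data types listed above, the sketch below implements a toy percent extractor with a single regular expression and the same generator-style get_X shape. It is not LexNLP's implementation: the name get_percents_sketch, the pattern, and the (unit, value) output format are assumptions made for this example, and a real extractor would also have to handle written-out numbers, locale-specific formats, and many more units.

```python
import re
from typing import Iterator, Tuple

# Simplified pattern: an integer or decimal number followed by "%" or "bps".
PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(%|bps)")


def get_percents_sketch(text: str) -> Iterator[Tuple[str, float]]:
    """Yield a (unit, value) pair for each match (illustrative only)."""
    for match in PERCENT_RE.finditer(text):
        yield match.group(2), float(match.group(1))


text = "Interest accrues at 10% plus 50 bps."
print(list(get_percents_sketch(text)))
# [('%', 10.0), ('bps', 50.0)]
```

Returning a generator keeps the calling convention close to the get_amounts example: callers wrap the result in list() when they need every match at once.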
NLP-based extraction methods
In addition to pattern-based structured data types, the lexnlp.extract module also supports NLP methods based on tagged part-of-speech classifiers. These classifiers are based on NLTK and, optionally, Stanford NLP libraries. The list of these modules is below:
named entity extraction with NLTK maximum entropy classifier
named entity extraction with NLTK and regular expressions
named entity extraction with Stanford Named Entity Recognition (NER) models
These modules can extract data types such as:
addresses, e.g., “1999 Mount Read Blvd, Rochester, NY, USA, 14615”
companies, e.g., “Lexpredict LLC”
persons, e.g., “John Doe”
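To make the idea of extraction over part-of-speech tags concrete, here is a minimal sketch that groups consecutive proper-noun tokens into candidate entity names. It is a simplification and not how the LexNLP modules are implemented: the (token, tag) pairs are hard-coded Penn Treebank tags standing in for the output of a real tagger, the helper proper_noun_chunks is hypothetical, and deciding whether a candidate is a person or a company would be left to a classifier or to additional patterns.

```python
from itertools import groupby

# Hard-coded (token, tag) pairs stand in for a real tagger's output.
TAGGED = [
    ("John", "NNP"), ("Doe", "NNP"), ("signed", "VBD"), ("the", "DT"),
    ("lease", "NN"), ("with", "IN"), ("Lexpredict", "NNP"), ("LLC", "NNP"),
    (".", "."),
]


def proper_noun_chunks(tagged):
    """Group consecutive NNP/NNPS tokens into candidate entity names."""
    chunks = []
    is_proper_noun = lambda pair: pair[1] in ("NNP", "NNPS")
    for is_proper, group in groupby(tagged, key=is_proper_noun):
        if is_proper:
            chunks.append(" ".join(token for token, _ in group))
    return chunks


print(proper_noun_chunks(TAGGED))
# ['John Doe', 'Lexpredict LLC']
```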