lexnlp.extract: Extracting structured data from unstructured text
The lexnlp.extract module contains methods that extract structured data from unstructured textual sources. Supported data types include a wide range of facts relevant to contract or document analysis, including dates, amounts, proper noun types, and conditional statements.
This module is organized by two-character ISO 639-1 language codes. Currently, the following languages are stable:
- English: lexnlp.extract.en
- German: lexnlp.extract.de
- Spanish: lexnlp.extract.es
Extraction methods follow a simple get_X pattern as demonstrated below:
>>> import lexnlp.extract.en.amounts
>>> text = "There are ten cows in the 2 acre pasture."
>>> print(list(lexnlp.extract.en.amounts.get_amounts(text)))
[10, 2.0]
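The naming convention above can also be applied programmatically. The following is a minimal sketch, assuming that a given data type lives in a module named lexnlp.extract.<language>.<data_type> and exposes a function named get_<data_type>, as amounts does. The helpers resolve_extractor and extract are hypothetical names introduced here for illustration; they are not part of LexNLP, and some extractors may require additional arguments or follow a different name.

```python
import importlib


def resolve_extractor(language, data_type):
    """Build the module path and function name implied by the get_X pattern.

    Hypothetical helper: it only assembles names and does not check
    that the module or function actually exists.
    """
    module_path = f"lexnlp.extract.{language}.{data_type}"
    function_name = f"get_{data_type}"
    return module_path, function_name


def extract(language, data_type, text):
    """Import the resolved module and call its get_X function on text.

    Requires LexNLP to be installed; raises ImportError or
    AttributeError if the assumed names do not exist.
    """
    module_path, function_name = resolve_extractor(language, data_type)
    module = importlib.import_module(module_path)
    extractor = getattr(module, function_name)
    # Extractors are wrapped in list() here, mirroring the example above.
    return list(extractor(text))


# Name resolution alone works without LexNLP installed.
print(resolve_extractor("en", "amounts"))
```

With LexNLP installed, extract("en", "amounts", text) would be equivalent to the list(lexnlp.extract.en.amounts.get_amounts(text)) call shown above.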
Pattern-based extraction methods
The full list of supported pattern-based structured data types is below:
- “EN” locale:
- acts, e.g., “section 1 of the Advancing Hope Act, 1986”
- amounts, e.g., “ten pounds” or “5.8 megawatts”
- citations, e.g., “10 U.S. 100” or “1998 S. Ct. 1”
- companies, e.g., “Lexpredict LLC”
- conditions, e.g., “subject to …” or “unless and until …”
- constraints, e.g., “no more than” or “
- copyright, e.g., “(C) Copyright 2000 Acme”
- courts, e.g., “Supreme Court of New York”
- CUSIP, e.g., “392690QT3”
- dates, e.g., “June 1, 2017” or “2018-01-01”
- definitions, e.g., “Term shall mean …”
- distances, e.g., “fifteen miles”
- durations, e.g., “ten years” or “thirty days”
- geographic and geopolitical entities, e.g., “New York” or “Norway”
- money and currency usages, e.g., “$5” or “10 Euro”
- percents and rates, e.g., “10%” or “50 bps”
- PII, e.g., “212-212-2121” or “999-999-9999”
- ratios, e.g., “3:1” or “four to three”
- regulations, e.g., “32 CFR 170”
- trademarks, e.g., “MyApp (TM)”
- URLs, e.g., “http://acme.com/”
- “DE” locale:
- amounts, e.g., “1 tausend” or “eine halbe Million Dollar”
- citations, e.g., “BGBl. I S. 434”
- copyrights, e.g., “siemens.com globale Website Siemens © 1996 – 2019”
- court citations, e.g., “BStBl I 2003, 240”
- courts, e.g., “Amtsgerichte”
- dates, e.g., “vom 29. März 2017”
- definitions
- durations, e.g., “14. Lebensjahr” or “fünfundzwanzig Jahren”
- geographic and geopolitical entities, e.g., “Albanien”
- percents, e.g., “15 Volumenprozent”
- “ES” locale:
- copyrights, e.g., “Website BBC Mundo © 1996 – 2019”
- courts, e.g., “Tribunal Superior de Justicia de Madrid”
- dates, e.g., “15 de febrero” or “1ºde enero de 1999”
- definitions, e.g., “"El ser humano": una anatomía moderna humana”
- regulations, e.g., “Comisión Nacional Bancaria y de Valores”
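To make the idea of pattern-based extraction concrete, here is a minimal sketch of a regular-expression extractor written in the same get_X generator style. It is not LexNLP's implementation: the function get_percents_sketch and the pattern PERCENT_PATTERN are hypothetical and cover only simple numeric forms such as “10%” or “50 bps”.

```python
import re

# Hypothetical pattern: a number followed by "%", "percent", or "bps".
# Real extractors handle far more variation (written numbers, ranges, etc.).
PERCENT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(%|percent|bps)", re.IGNORECASE)


def get_percents_sketch(text):
    """Yield (unit, value) tuples for each simple percent or rate in text."""
    for match in PERCENT_PATTERN.finditer(text):
        value = float(match.group(1))
        unit = match.group(2).lower()
        yield unit, value


text = "Interest of 10% plus 50 bps."
print(list(get_percents_sketch(text)))
```

Because the function is a generator, callers wrap it in list() when they need all results at once, as in the amounts example earlier on this page.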
NLP-based extraction methods
In addition to pattern-based structured data types, the lexnlp.extract module also supports NLP methods based on tagged part-of-speech classifiers. These classifiers are based on NLTK and, optionally, Stanford NLP libraries. The list of these modules is below:
- named entity extraction with NLTK maximum entropy classifier
- named entity extraction with NLTK and regular expressions
- named entity extraction with Stanford Named Entity Recognition (NER) models
These modules can extract data types such as:
- addresses, e.g., “1999 Mount Read Blvd, Rochester, NY, USA, 14615”
- companies, e.g., “Lexpredict LLC”
- persons, e.g., “John Doe”
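The sketch below illustrates, in simplified form, how part-of-speech tags can drive entity extraction: runs of consecutive proper-noun tokens are grouped into candidate names. It is an assumption-laden toy, not LexNLP's method. The function proper_noun_chunks is hypothetical, and the tags are written by hand using Penn Treebank labels; in practice they would come from a tagger such as the one NLTK provides, and a classifier would then decide whether each candidate is a person, a company, or something else.

```python
from itertools import groupby

PROPER_NOUN_TAGS = ("NNP", "NNPS")  # Penn Treebank proper-noun tags


def proper_noun_chunks(tagged_tokens):
    """Join each run of consecutive proper-noun tokens into one candidate name."""
    chunks = []
    runs = groupby(tagged_tokens, key=lambda pair: pair[1] in PROPER_NOUN_TAGS)
    for is_proper, group in runs:
        if is_proper:
            chunks.append(" ".join(token for token, _tag in group))
    return chunks


# Hand-tagged input standing in for the output of a part-of-speech tagger.
tagged = [
    ("John", "NNP"), ("Doe", "NNP"), ("works", "VBZ"), ("at", "IN"),
    ("Lexpredict", "NNP"), ("LLC", "NNP"), (".", "."),
]
print(proper_noun_chunks(tagged))
```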