lexnlp.extract.en package

Submodules

lexnlp.extract.en.acts module

lexnlp.extract.en.acts.get_act_list(*args, **kwargs) → List[Dict[str, str]]
lexnlp.extract.en.acts.get_acts(text: str) → Generator[Dict[str, Any], None, None]
lexnlp.extract.en.acts.get_acts_annotations(text: str) → Generator[lexnlp.extract.common.annotations.act_annotation.ActAnnotation, None, None]
lexnlp.extract.en.acts.get_acts_annotations_list(text: str) → List[lexnlp.extract.common.annotations.act_annotation.ActAnnotation]
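
A minimal usage sketch (the sample sentence and the exact dict keys are illustrative, not a contract):

    from lexnlp.extract.en.acts import get_act_list

    text = "This grant is made under section 8 of the Housing Act of 1937."
    for act in get_act_list(text):
        # each act is a dict of extracted fields, e.g. the act name, section and year
        print(act)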

lexnlp.extract.en.amounts module

Amount extraction for English.

This module implements basic amount extraction functionality in English.

This module supports converting:
  • numbers with comma delimiter: “25,000.00”, “123,456,000”
  • written numbers: “Seven Hundred Eighty”
  • mixed written numbers: “5 million” or “2.55 BILLION”
  • written ordinal numbers: “twenty-fifth”
  • fractions (non-written): “1/33”, “25/100”; where 1 < numerator < 99; 1 < denominator < 999; the fraction No/100 will be treated as 00/100
  • written numbers and fractions: “twenty one AND 5/100”
  • written fractions: “one-third”, “three tenths”, “ten ninety-ninths”, “twenty AND one-hundredths”, “2 hundred and one-thousandth”; where 1 < numerator < 99 and 2 < denominator < 99 and numerator < denominator; or 1 < numerator < 99 and denominator == 100, i.e. 1/99 - 99/100; or 1 < numerator < 99 and denominator == 1000, i.e. 1/1000 - 99/1000
  • floats starting with “.” (dot): “.5 million”
  • “dozen”: “twenty-two DOZEN”
  • “half”: “Six and a HALF Billion”, “two and a half”
  • “quarter”: “five and one-quarter”, “5 and one-quarter”, “three-quarters”
  • multiple numbers: “$25,400, 1 million people and 3.5 tons”

Avoids (skips): “5.3.1.”, “1/1/2010”

lexnlp.extract.en.amounts.get_amount_annotations(text: str, extended_sources: bool = True, float_digits: int = 4) → Generator[lexnlp.extract.common.annotations.amount_annotation.AmountAnnotation, None, None]

Find possible amount references in the text.
Parameters:
  • text – text to search
  • extended_sources – return data around the amount itself
  • float_digits – round floats to N digits; don’t round if None
Returns:Generates the amounts found

lexnlp.extract.en.amounts.get_amounts(text: str, return_sources: bool = False, extended_sources: bool = True, float_digits: int = 4) → Generator[Union[decimal.Decimal, Tuple[decimal.Decimal, str]], None, None]

Find possible amount references in the text.
Parameters:
  • text – text to search
  • return_sources – return the amount AND its source text
  • extended_sources – return data around the amount itself
  • float_digits – round floats to N digits; don’t round if None
Returns:Generates the amounts found

lexnlp.extract.en.amounts.get_np(text) → Generator[Tuple[str, str], None, None]
lexnlp.extract.en.amounts.quantize_by_float_digit(amount: decimal.Decimal, float_digits: int) → decimal.Decimal
lexnlp.extract.en.amounts.text2num(s: str, search_fraction: bool = True) → Optional[decimal.Decimal]

Convert a written amount into a Decimal.
Parameters:
  • s – written number
  • search_fraction – extract fraction
Returns:Decimal or None
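
A short usage sketch (the outputs shown in comments are indicative only):

    from lexnlp.extract.en.amounts import get_amounts, text2num

    text = "The fee is twenty-five thousand and no/100 dollars, plus 2.5 million in escrow."
    # with return_sources=True each item is a (Decimal, source_text) tuple
    for amount in get_amounts(text, return_sources=True):
        print(amount)

    print(text2num("five million"))  # expected: Decimal('5000000')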

lexnlp.extract.en.citations module

Citation extraction for English.

This module implements citation extraction functionality in English.

Todo:
  • Improved unit tests and case coverage
lexnlp.extract.en.citations.get_citation_annotations(text: str) → Generator[lexnlp.extract.common.annotations.citation_annotation.CitationAnnotation, None, None]

Get citations.
Parameters:
  • text
Returns:Generates citation annotations carrying (volume, reporter, reporter_full_name, page, page2, court, year[, source text])

lexnlp.extract.en.citations.get_citations(text: str, return_source=False, as_dict=False) → Generator

Get citations.
Parameters:
  • text
  • return_source
  • as_dict
Returns:tuple or dict (volume, reporter, reporter_full_name, page, page2, court, year[, source text])
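
A minimal usage sketch (the sample citation and the printed keys are illustrative):

    from lexnlp.extract.en.citations import get_citations

    text = "As held in 123 F.2d 456 (1941), the claim fails."
    for citation in get_citations(text, as_dict=True):
        # expected keys include volume, reporter, page and year
        print(citation)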

lexnlp.extract.en.conditions module

Condition extraction for English.

This module implements basic condition extraction functionality in English.

Todo:
  • Improved unit tests and case coverage
lexnlp.extract.en.conditions.create_condition_pattern(condition_pattern_template, condition_phrases)

Create a condition pattern.
Parameters:
  • condition_pattern_template
  • condition_phrases

lexnlp.extract.en.conditions.get_condition_annotations(text: str, strict=True) → Generator[lexnlp.extract.common.annotations.condition_annotation.ConditionAnnotation, None, None]

Find possible conditions in natural language.
Parameters:
  • text
  • strict

lexnlp.extract.en.conditions.get_conditions(text, strict=True) → Generator
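
A minimal usage sketch (the sample sentence is illustrative):

    from lexnlp.extract.en.conditions import get_conditions

    text = "This agreement shall terminate if the borrower defaults on any payment."
    for condition in get_conditions(text):
        # each item describes a matched condition trigger (here: "if") and its context
        print(condition)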

lexnlp.extract.en.constraints module

Constraint extraction for English.

This module implements basic constraint extraction functionality in English.

Todo:
  • Improved unit tests and case coverage
lexnlp.extract.en.constraints.create_constraint_pattern(constraint_pattern_template, constraint_phrases)

Create a constraint pattern.
Parameters:
  • constraint_pattern_template
  • constraint_phrases

lexnlp.extract.en.constraints.get_constraint_annotations(text: str, strict=False) → Generator[lexnlp.extract.common.annotations.constraint_annotation.ConstraintAnnotation, None, None]

Find possible constraints in natural language.
Parameters:
  • text
  • strict

lexnlp.extract.en.constraints.get_constraints(text: str, strict=False) → Generator

Find possible constraints in natural language.
Parameters:
  • text
  • strict
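
A minimal usage sketch (the sample sentence is illustrative):

    from lexnlp.extract.en.constraints import get_constraints

    text = "The borrower shall maintain a net worth of at least $1,000,000."
    for constraint in get_constraints(text):
        # each item describes a matched constraint (here: "at least") and its context
        print(constraint)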

lexnlp.extract.en.copyright module

Copyright extraction for English using NLTK and NLTK pre-trained maximum entropy classifier.

This module implements basic Copyright extraction functionality in English relying on the pre-trained NLTK functionality, including POS tagger and NE (fuzzy) chunkers.

Todo: -

class lexnlp.extract.en.copyright.CopyrightEnParser

Bases: lexnlp.extract.common.copyrights.copyright_en_style_parser.CopyrightEnStyleParser

classmethod extract_phrases_with_coords(sentence: str) → List[Tuple[str, int, int]]
class lexnlp.extract.en.copyright.CopyrightNPExtractor(grammar=None)

Bases: lexnlp.extract.en.utils.NPExtractor

allowed_pos = ['IN', 'CC', 'NN']
allowed_sym = ['&', 'and', 'of', '©']
static strip_np(np)

lexnlp.extract.en.courts module

Court extraction for English.

This module implements extraction functionality for courts in English, including formal names, abbreviations, and aliases.

Todo:
  • Add utilities for loading court data
lexnlp.extract.en.courts.get_court_annotations(text: str, language: str = None) → Generator[lexnlp.extract.common.annotations.court_annotation.CourtAnnotation, None, None]
lexnlp.extract.en.courts.get_courts(text: str, court_config_list: List[lexnlp.extract.en.dict_entities.DictionaryEntry], priority: bool = False, text_languages: List[str] = None, simplified_normalization: bool = False) → Generator[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias], Any, Any]

Searches for courts from the provided config list and yields tuples of (court_config, court_alias). A court config is (court_id, court_name, [list of aliases]); an alias is (alias_text, language, is_abbrev, alias_id).

This method uses the general searching routines for dictionary entities from the dict_entities.py module. Methods of the dict_entities module can be used for conveniently creating the config: entity_config(), entity_alias(), add_aliases_to_entity().

Parameters:
  • text
  • court_config_list – list of all possible known courts in the form of tuples: (id, name, [(alias, lang, is_abbrev), …])
  • priority – if two courts are found with totally equal matching aliases, use the one with the lowest id
  • text_languages – language(s) of the source text. If a language is specified then only aliases of this language will be searched for. For example, this allows ignoring “Island”, a German-language alias of Iceland, in English texts.
  • simplified_normalization – don’t use NLTK for just “normalizing” the text
Returns:Generates tuples: (court entity, court alias)
lexnlp.extract.en.courts.setup_en_parser()
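
A hedged sketch with a one-entry config; the DictionaryEntry / DictionaryEntryAlias arguments follow the constructor signatures listed in the dict_entities module below, and the printed attributes are assumed to mirror those constructor parameters:

    from lexnlp.extract.en.courts import get_courts
    from lexnlp.extract.en.dict_entities import DictionaryEntry, DictionaryEntryAlias

    courts = [
        DictionaryEntry(id=1, name="Supreme Court of the United States",
                        aliases=[DictionaryEntryAlias(alias="SCOTUS", is_abbreviation=True)]),
    ]
    text = "The petition was filed with the Supreme Court of the United States."
    for entity, alias in get_courts(text, court_config_list=courts):
        print(entity.name, alias.alias)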

lexnlp.extract.en.cusip module

CUSIP extraction for English.

This module implements CUSIP extraction functionality in English.

Todo:
  • Improved unit tests and case coverage
lexnlp.extract.en.cusip.get_cusip(text: str) → Generator[Dict[str, Any], None, None]
lexnlp.extract.en.cusip.get_cusip_annotations(text: str) → Generator[lexnlp.extract.common.annotations.cusip_annotation.CusipAnnotation, None, None]

INFO: https://www.cusip.com/pdf/CUSIP_Intro_03.14.11.pdf

lexnlp.extract.en.cusip.get_cusip_list(text)
lexnlp.extract.en.cusip.is_cusip_valid(code, return_checksum=False)
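
A minimal usage sketch (037833100 is a widely published sample CUSIP):

    from lexnlp.extract.en.cusip import get_cusip_list, is_cusip_valid

    text = "The notes are identified by CUSIP No. 037833100."
    print(get_cusip_list(text))
    print(is_cusip_valid("037833100"))  # expected: True (the check digit matches)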

lexnlp.extract.en.date_model module

Date extraction for English.

This module implements date extraction functionality in English.

lexnlp.extract.en.date_model.get_date_features(text: str, start_index: int, end_index: int, include_bigrams: bool = True, window: int = 5, characters=None, norm: bool = True) → Dict[str, int]

Get features to use for classification of a date as a false positive.
Parameters:
  • text – raw text around the potential date
  • start_index – date start index
  • end_index – date end index
  • include_bigrams – whether to include bigram/bicharacter features
  • window – window around the match
  • characters – characters to use for feature generation, e.g., digits only, alpha only
  • norm – whether to norm, i.e., transform to proportions

lexnlp.extract.en.dates module

Date extraction for English.

This module implements date extraction functionality in English.

class lexnlp.extract.en.dates.DateFeaturesDataframeBuilder

Bases: object

classmethod build_feature_df(dic: Dict[str, float]) → pandas.core.frame.DataFrame
feature_df_by_key_count = {}
class lexnlp.extract.en.dates.FeatureTemplate(df: pandas.core.frame.DataFrame = None, keys: List[str] = None)

Bases: object

lexnlp.extract.en.dates.build_date_model(input_examples, output_file, verbose=True)

Build an sklearn model for classifying date strings as potential false positives.
Parameters:
  • input_examples
  • output_file
  • verbose

lexnlp.extract.en.dates.check_date_parts_are_in_date(date: datetime.datetime, date_props: Dict[str, List[Any]]) → bool

Checks that when we transformed a “possible date” into a date, we found a place for each “token” from the initial phrase (e.g. “13.2 may”).
Parameters:
  • date
  • date_props – e.g. {‘time’: [], ‘hours’: [] … ‘digits’: [‘13’, ‘2’] …}
Returns:True if the date is OK

lexnlp.extract.en.dates.get_date_annotations(text: str, strict=False, base_date=None, threshold=0.5) → Generator[lexnlp.extract.common.annotations.date_annotation.DateAnnotation, None, None]

Find dates after cleaning false positives.
Parameters:
  • text – raw text to search
  • strict – whether to return only complete or strict matches
  • base_date – base date to use for implied or partial matches
  • threshold – probability threshold to use for the false positive classifier

lexnlp.extract.en.dates.get_date_features(text, start_index, end_index, include_bigrams=True, window=5, characters=None, norm=True)

Get features to use for classification of a date as a false positive.
Parameters:
  • text – raw text around the potential date
  • start_index – date start index
  • end_index – date end index
  • include_bigrams – whether to include bigram/bicharacter features
  • window – window around the match
  • characters – characters to use for feature generation, e.g., digits only, alpha only
  • norm – whether to norm, i.e., transform to proportions

lexnlp.extract.en.dates.get_dates(text: str, strict=False, base_date=None, return_source=False, threshold=0.5) → Generator

Find dates after cleaning false positives.
Parameters:
  • text – raw text to search
  • strict – whether to return only complete or strict matches
  • base_date – base date to use for implied or partial matches
  • return_source – whether to return the raw text around the date
  • threshold – probability threshold to use for the false positive classifier
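
A minimal usage sketch (the output in the comment is indicative only):

    from lexnlp.extract.en.dates import get_dates

    text = "This lease commences on June 1, 2017 and expires on 2027-05-31."
    print(list(get_dates(text)))
    # expected something like [datetime.date(2017, 6, 1), datetime.date(2027, 5, 31)]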

lexnlp.extract.en.dates.get_dates_list(text, **kwargs) → List
lexnlp.extract.en.dates.get_month_by_name()
lexnlp.extract.en.dates.get_raw_date_list(text, strict=False, base_date=None, return_source=False) → List
lexnlp.extract.en.dates.get_raw_dates(text, strict=False, base_date=None, return_source=False) → Generator

Find “raw” or potential date matches prior to false positive classification.
Parameters:
  • text – raw text to search
  • strict – whether to return only complete or strict matches
  • base_date – base date to use for implied or partial matches
  • return_source – whether to return the raw text around the date

lexnlp.extract.en.dates.train_default_model(save=True)

Train the default model.

lexnlp.extract.en.definition_parsing_methods module

Definition extraction for English.

This module implements basic definition extraction functionality in English.

Todo:
  • Improved unit tests and case coverage
class lexnlp.extract.en.definition_parsing_methods.DefinitionCaught(name: str, text: str, coords: Tuple[int, int])

Bases: object

Each definition is stored in this class with its name, full text and “coords” within the whole document

coords
does_consume_target(target) → int
Parameters:target – a definition that is, probably, “consumed” by the current one
Returns:1 if self consumes the target, -1 if the target consumes self, otherwise 0
name
text
lexnlp.extract.en.definition_parsing_methods.does_term_are_service_words(term_pos: List[Tuple[str, str, int, int]]) → bool

Does term consist of service words only?

lexnlp.extract.en.definition_parsing_methods.filter_definitions_for_self_repeating(definitions: List[lexnlp.extract.en.definition_parsing_methods.DefinitionCaught]) → List[lexnlp.extract.en.definition_parsing_methods.DefinitionCaught]
Parameters:definitions
Returns:unique definitions only, with “overlapped” definitions excluded
lexnlp.extract.en.definition_parsing_methods.get_definition_list_in_sentence(sentence_coords: Tuple[int, int, str], decode_unicode=True) → List[lexnlp.extract.en.definition_parsing_methods.DefinitionCaught]

Find possible definitions in natural language in a single sentence.
Parameters:
  • sentence_coords – sentence, sentence start, sentence end
  • decode_unicode

lexnlp.extract.en.definition_parsing_methods.get_quotes_count_in_string(text: str) → int
Parameters:text – text to count quotes within
Returns:the count of quotes within the passed text
lexnlp.extract.en.definition_parsing_methods.join_collection(collection)
lexnlp.extract.en.definition_parsing_methods.regex_matches_to_word_coords(pattern: Pattern[str], text: str, phrase_start: int = 0) → List[Tuple[str, int, int]]
Parameters:
  • pattern – pattern for searching for matches within the text
  • text – text to search for matches
  • phrase_start – a value to be added to start / end
Returns:

tuples of (match_text, start, end) out of the regex (pattern) matches in text

lexnlp.extract.en.definition_parsing_methods.split_definitions_inside_term(term: str, src_with_coords: Tuple[int, int, str], term_start: int, term_end: int) → List[Tuple[str, int, int]]

The whole phrase can be considered a definition (“MSRB”, “we”, “us” or “our”), but in fact the phrase can be a collection of definitions. Here we split the definition phrase into a list of definitions.

The source string could be pre-processed, which is why we search for each sub-phrase’s coordinates (PhrasePositionFinder).
Parameters:
  • term – a definition or, possibly, a set of definitions (“MSRB”, “we”, “us” or “our”)
  • src_with_coords – a sentence (probably) containing the term, plus its coords
  • term_start – “term” start coordinate within the source sentence
  • term_end – “term” end coordinate within the source sentence
Returns:[(definition, def_start, def_end), …]

lexnlp.extract.en.definition_parsing_methods.trim_defined_term(term: str, start: int, end: int) → Tuple[str, int, int, bool]

Removes a pair of quotes / brackets framing the text, replaces runs of multiple spaces with single spaces, and replaces line breaks with spaces.
Parameters:
  • term – a phrase that may contain excess framing symbols
  • start – the original term’s start position; may be changed
  • end – the original term’s end position; may be changed
Returns:updated term, start, end and a flag indicating that the whole phrase was inside quotes

lexnlp.extract.en.definitions module

lexnlp.extract.en.definitions.get_definition_annotations(text: str, decode_unicode=True, locator_type: lexnlp.extract.common.annotation_locator_type.AnnotationLocatorType = <AnnotationLocatorType.RegexpBased: 1>) → Generator[lexnlp.extract.common.annotations.definition_annotation.DefinitionAnnotation, None, None]
lexnlp.extract.en.definitions.get_definition_objects_list(text, decode_unicode=True) → List[lexnlp.extract.en.definition_parsing_methods.DefinitionCaught]
Parameters:
  • text – text to search for definitions
  • decode_unicode
Returns:

a list of found definitions - objects of class DefinitionCaught

lexnlp.extract.en.definitions.get_definitions(text: str, return_sources=False, decode_unicode=True, return_coords=False, locator_type: lexnlp.extract.common.annotation_locator_type.AnnotationLocatorType = <AnnotationLocatorType.RegexpBased: 1>) → Generator

Find possible definitions in natural language in text. The text will be split into sentences first.
Parameters:
  • text – the input text
  • return_sources – return a tuple with the extracted term and the source sentence
  • decode_unicode
  • return_coords – return an (x, y) tuple in each record, where x is the definition text’s start and y is the definition text’s end
  • locator_type – use the default (regexp-based) or the ML-based locator
Returns:Generator[name] or Generator[name, text] or Generator[name, text, coords]

lexnlp.extract.en.definitions.get_definitions_explicit(text, decode_unicode=True, locator_type: lexnlp.extract.common.annotation_locator_type.AnnotationLocatorType = <AnnotationLocatorType.RegexpBased: 1>) → Generator
lexnlp.extract.en.definitions.get_definitions_in_sentence(sentence: str, return_sources=False, decode_unicode=True) → Generator
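
A minimal usage sketch (the output in the comment is indicative only):

    from lexnlp.extract.en.definitions import get_definitions

    text = '"Borrower" shall mean ACME Corp., a Delaware corporation.'
    print(list(get_definitions(text)))  # expected: ['Borrower']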

lexnlp.extract.en.dict_entities module

Universal extraction of entities for which we have full dictionaries of possible names and aliases from English text.

Example: Courts - we have the full dictionary of known courts with their names and aliases and are able to search the text for each possible court.

Geo entities - we have the full set of known geo entities and can search any text for their occurrences.

Search methods of this module require lists of possible entities with their ids, names and sets of aliases in different languages. To allow using these methods in Celery and especially for allowing building these configuration lists once and using them in multiple Celery tasks it is required to allow their easy and fast serialization. By default Celery uses JSON serialization starting from v. 4 and does not allow serializing objects of custom classes out of the box. So we will have to use either dicts or tuples to avoid requiring special configuration for Celery. Tuples are faster.

To avoid typos in development and to take advantage of type hints in IDEs, this module provides a few methods for operating on the tuples which represent entities and aliases. They accept named parameter lists and return tuples.

class lexnlp.extract.en.dict_entities.AliasBanList(aliases: Optional[List[str]] = None, abbreviations: Optional[List[str]] = None)

Bases: object

class lexnlp.extract.en.dict_entities.AliasBanRecord(alias: str = '', lang: Optional[str] = '', is_abbrev: bool = False)

Bases: object

class lexnlp.extract.en.dict_entities.DictionaryEntity(entity: Any, coords: Tuple[int, int])

Bases: object

class lexnlp.extract.en.dict_entities.DictionaryEntry(id: int = 0, name: str = '', priority: int = 0, name_is_alias: bool = True, aliases: Optional[List[lexnlp.extract.en.dict_entities.DictionaryEntryAlias]] = None)

Bases: object

class lexnlp.extract.en.dict_entities.DictionaryEntryAlias(alias: str = '', language: str = '', is_abbreviation: bool = False, alias_id: Optional[int] = None, normalized_alias: str = '')

Bases: object

classmethod entity_alias(alias: str, language: str = None, is_abbreviation: bool = False, alias_id: int = None) → lexnlp.extract.en.dict_entities.DictionaryEntryAlias
class lexnlp.extract.en.dict_entities.SearchResultPosition(entity: lexnlp.extract.en.dict_entities.DictionaryEntry, alias: lexnlp.extract.en.dict_entities.DictionaryEntryAlias, start: int, end: int, source_text: str = '')

Bases: object

Represents a position in the normalized source text at which one or more entities have been detected. One or more entities having equal aliases can be detected at a single position in the text.

add_entity(entity: lexnlp.extract.en.dict_entities.DictionaryEntry, alias: lexnlp.extract.en.dict_entities.DictionaryEntryAlias) → lexnlp.extract.en.dict_entities.SearchResultPosition
alias_text
end
entities_dict
get_entities_aliases() → List[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias]]
overlaps(other: lexnlp.extract.en.dict_entities.SearchResultPosition) → bool
source_text
start
lexnlp.extract.en.dict_entities.alias_is_banlisted(alias_ban_list: Optional[Dict[str, lexnlp.extract.en.dict_entities.AliasBanList]], norm_alias: str, alias_lang: str, is_abbrev: bool) → bool
lexnlp.extract.en.dict_entities.conflicts_take_first_by_id(conflicting_entities_aliases: List[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias]]) → List[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias]]

Default conflict resolving function for dropping all entities detected at the same position except the one having the smallest id. To be used in the find_dict_entities() method.

lexnlp.extract.en.dict_entities.conflicts_top_by_priority(conflicting_entities_aliases: List[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias]]) → List[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias]]

Default conflict resolving function for dropping all entities detected at the same position except the one having the greatest priority. To be used in the find_dict_entities() method.

lexnlp.extract.en.dict_entities.find_dict_entities(text: str, all_possible_entities: List[lexnlp.extract.en.dict_entities.DictionaryEntry], text_languages: Union[List[str], Tuple[str], Set[str]] = None, conflict_resolving_func: Callable[[List[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias]]], List[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias]]] = None, use_stemmer: bool = False, remove_time_am_pm: bool = True, min_alias_len: int = None, prepared_alias_ban_list: Optional[Dict[str, lexnlp.extract.en.dict_entities.AliasBanList]] = None, simplified_normalization: bool = False) → Generator[lexnlp.extract.en.dict_entities.DictionaryEntity, None, None]

Find all entities defined in the ‘all_possible_entities’ list that appear in the source text. This method takes care of leaving only the longest matching search result when multiple entities have aliases one of which is a substring of another. It takes care of the language of the text and of the aliases: if a language is specified both for the text and for an alias, the alias is used only if the languages match. It may detect multiple possibly matching entities at a single position in the text, because there can be entities having the same aliases in the same language; to resolve such conflicts a special resolving function can be specified. It also takes care of time AM/PM components which can appear in the aliases of some entities: it tries to detect minutes/seconds/milliseconds before AM/PM and ignores the match in such cases.

Algorithm of this method:
1. Normalize the source text (we need lowercase and non-lowercase versions for abbrev searches).
2. Create a shared search context - a map of position -> (alias text + list of matching entities).
3. For each possible entity, search using the shared context:
   3.1. For each alias of the entity:
   3.1.1. Iteratively search for all occurrences of the alias, taking into account its language and abbrev status. For each found occurrence, check whether another alias and entity have already been found at this position and keep only the one having the longest alias (“Something” vs “Something Bigger”). If a different entity has already been found at this position with a totally equal alias in the same language, then store them both for this position in the text.
4. Now we have a map filled with: position -> (alias text + list of entities having this alias). After sorting the items of this dict by position we can get rid of overlapping longer and shorter aliases where one is a substring of another (“Bankr. E.D.N.Y.” vs “E.D.N.Y.”).
5. For each position, check whether it overlaps with the next one [position; position + len(alias)]. If it overlaps, keep the longest alias and drop the shorter one.

The main complexity of this algorithm comes from the requirement to detect the longest match for each piece of text, while a longer match can start at an earlier position than a shorter match and there can be multiple aliases of different entities matching the same piece of text.

Another algorithm for this function could be based on the idea that an or-kind regexp returns the longest matching group. We could form regexps containing the possible aliases and apply them to the source text: r’alias1|alias2|longer alias2|…’

TODO Compare to other algorithms for time and memory complexity

Parameters:
  • text
  • all_possible_entities – list of dict or list of DictionaryEntry - all possible entities to search for
  • min_alias_len – minimal length of alias/name to search for; can be used to ignore too-short aliases like “M.” while searching
  • prepared_alias_ban_list – prepared ban list of aliases to exclude from the search; can be used to ignore concrete aliases. Should be: dict of language -> tuple (list of normalized non-abbreviations, list of normalized abbreviations)
  • text_languages – if set, then only aliases of these languages will be searched for
  • conflict_resolving_func – a function for resolving conflicts when there are multiple entities detected at the same position in the source text and their detected aliases are of the same length. The function takes a list of conflicting entities and should return a list of one or more entities which should be returned.
  • use_stemmer – use the stemmer instead of the tokenizer. The stemmer converts words to their simple form (singular number, etc.). The stemmer works better for searching for “tables”, “developers”, …; the tokenizer fits “United States”, “Mississippi”, …
  • remove_time_am_pm – remove from the final results AM/PM abbreviations which look like the end part of time strings - 11:45 am, 10:00 pm
  • simplified_normalization – don’t use NLTK for text “normalization”
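
A hedged sketch of a minimal dictionary search; the one-entry config is illustrative, and attribute access on the yielded DictionaryEntity follows its constructor parameters (entity, coords):

    from lexnlp.extract.en.dict_entities import (
        DictionaryEntry, DictionaryEntryAlias, find_dict_entities)

    entries = [
        DictionaryEntry(id=1, name="Iceland",
                        aliases=[DictionaryEntryAlias(alias="Island", language="de")]),
    ]
    # "Island" is a German-language alias, so it is skipped for English text
    for found in find_dict_entities("They flew to Iceland via Island.", entries,
                                    text_languages=["en"]):
        print(found.entity, found.coords)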

lexnlp.extract.en.dict_entities.normalize_text(text: str, spaces_on_start_end: bool = True, spaces_after_dots: bool = True, lowercase: bool = True, use_stemmer: bool = False, simple_tokenization: bool = False) → str

Normalizes text for substring search operations - extracts tokens, joins them back with spaces, adds missing spaces after dots for abbreviations, etc. The overall aim of this method is to weaken the substring matching conditions by normalizing both the text and the substring being searched in the same way, removing insignificant differences between them (case, punctuation, …).
Parameters:
  • text
  • spaces_on_start_end
  • spaces_after_dots
  • lowercase
  • simple_tokenization – don’t use NLTK, just split the text by space characters
  • use_stemmer – use the stemmer instead of the tokenizer. When using the stemmer, all words are converted to singular number (or to their most plain form) before matching. When using the tokenizer, the words are compared as is. The tokenizer should be enough for searching for entities which exist in a single instance in the real world - geo entities, courts, …. The stemmer is required for searching for common objects - table, pen, developer, …
Returns:“normalized” string

lexnlp.extract.en.dict_entities.normalize_text_with_map(text: str, spaces_on_start_end: bool = True, spaces_after_dots: bool = True, lowercase: bool = True, use_stemmer: bool = False, simple_tokenization: bool = False) → Tuple[str, List[int]]

Almost like normalize_text, but also returns a source-to-result character index map: map[i] = I, where i is a character’s coordinate within the source text and I is the same character’s coordinate within the resulting text.
lexnlp.extract.en.dict_entities.prepare_alias_banlist_dict(alias_banlist: List[lexnlp.extract.en.dict_entities.AliasBanRecord], use_stemmer: bool = False) → Optional[Dict[str, lexnlp.extract.en.dict_entities.AliasBanList]]

Prepare an alias ban list for passing to the find_dict_entities() function.
Parameters:
  • alias_banlist – non-normalized form of the banlist: [(alias, lang, is_abbrev), …]
  • use_stemmer – use the stemmer for alias normalization; otherwise the tokenizer only

lexnlp.extract.en.dict_entities.reverse_src_to_dest_map(conv_map: List[int], normalized_text_len=0) → List[int]

Reverses a map produced by normalize_text_with_map(), so that character indexes in the normalized text can be mapped back to indexes in the source text. The original docstring illustrates this with an aligned example: the source string “One one Bankr. E.D.N.C. two two two.” is normalized to “One one Bankr . E . D . N . C . two two two .”, the forward map sends each source index to its normalized index, and the reversed map sends each normalized index back to its source index.

lexnlp.extract.en.distances module

Distance extraction for English.

This module implements basic distance extraction functionality in English.

lexnlp.extract.en.distances.get_distance_annotations(text: str, float_digits: int = 4) → Generator[lexnlp.extract.common.annotations.distance_annotation.DistanceAnnotation, None, None]
lexnlp.extract.en.distances.get_distances(text: str, return_sources: bool = False, float_digits: int = 4) → Generator[Union[Tuple[decimal.Decimal, str], Tuple[decimal.Decimal, str, str]], None, None]
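
A minimal usage sketch (the output in the comment follows the (amount, unit) tuple shape in the signature above):

    from lexnlp.extract.en.distances import get_distances

    text = "The site is located about 10 miles from the nearest airport."
    print(list(get_distances(text)))
    # expected something like [(Decimal('10'), 'mile')]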

lexnlp.extract.en.durations module

This module implements duration extraction functionality in English.

class lexnlp.extract.en.durations.EnDurationParser

Bases: lexnlp.extract.common.durations.durations_parser.DurationParser

DURATION_MAP = {'anniversaries': Fraction(365, 1), 'anniversary': Fraction(365, 1), 'annum': Fraction(365, 1), 'day': Fraction(1, 1), 'hour': Fraction(1, 24), 'minute': Fraction(1, 1440), 'month': Fraction(30, 1), 'quarter': Fraction(365, 4), 'second': Fraction(1, 86400), 'week': Fraction(7, 1), 'year': Fraction(365, 1)}
DURATION_PTN = '\n ((\n(?:(?:(?:(?:(?:[\\.\\d][\\d\\.,]*\\s*|\\W|^)\n(?:(?:seventeenths|seventeenth|thirteenths|fourteenths|eighteenths|nineteenths|seventieths|thirteenth|fourteenth|eighteenth|nineteenth|seventieth|fifteenths|sixteenths|twentieths|thirtieths|eightieths|ninetieths|seventeen|fifteenth|sixteenth|twentieth|thirtieth|eightieth|ninetieth|elevenths|fortieths|fiftieths|sixtieths|thirteen|fourteen|eighteen|nineteen|eleventh|fortieth|fiftieth|sixtieth|sevenths|twelfths|fifteen|sixteen|seventy|seventh|twelfth|fourths|eighths|eleven|twelve|twenty|thirty|eighty|ninety|zeroth|second|fourth|eighth|thirds|fifths|sixths|ninths|tenths|three|seven|eight|forty|fifty|sixty|first|third|fifth|sixth|ninth|tenth|zero|four|five|nine|one|two|six|ten|thousandths|thousandth|thousand|trillion|million|billion|trill|bil|mm|k|m|b\n|hundred(?:th(?:s)?)?|dozen|and|a\\s+half|quarters?)[\\s-]*)+)\n(?:(?:no|\\d{1,2})/100)?)|(?<=\\W|^)(?:[\\.\\d][\\d\\.,/]*))(?:\\W|$))(?:\\s{0,2}[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒]+)*)\n (?:\\s*(?:calendar|business|actual))?[\\s-]*\n (second|minute|hour|day|week|month|quarter|year|annum|anniversary|anniversaries)s?)(?:\\W|$)\n '
DURATION_PTN_RE = regex.Regex('\n ((\n(?:(?:(?:(?:(?:[\\.\\d][\\d\\.,]*\\s*|\\W|^)\n(?:(?:seventeenths|seventeenth|thirteenths|fourteenths|eighteenths|nineteenths|seventieths|thirteenth|fourteenth|eighteenth|nineteenth|seventieth|fifteenths|sixteenths|twentieths|thirtieths|eightieths|ninetieths|seventeen|fifteenth|sixteenth|twentieth|thirtieth|eightieth|ninetieth|elevenths|fortieths|fiftieths|sixtieths|thirteen|fourteen|eighteen|nineteen|eleventh|fortieth|fiftieth|sixtieth|sevenths|twelfths|fifteen|sixteen|seventy|seventh|twelfth|fourths|eighths|eleven|twelve|twenty|thirty|eighty|ninety|zeroth|second|fourth|eighth|thirds|fifths|sixths|ninths|tenths|three|seven|eight|forty|fifty|sixty|first|third|fifth|sixth|ninth|tenth|zero|four|five|nine|one|two|six|ten|thousandths|thousandth|thousand|trillion|million|billion|trill|bil|mm|k|m|b\n|hundred(?:th(?:s)?)?|dozen|and|a\\s+half|quarters?)[\\s-]*)+)\n(?:(?:no|\\d{1,2})/100)?)|(?<=\\W|^)(?:[\\.\\d][\\d\\.,/]*))(?:\\W|$))(?:\\s{0,2}[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒]+)*)\n (?:\\s*(?:calendar|business|actual))?[\\s-]*\n (second|minute|hour|day|week|month|quarter|year|annum|anniversary|anniversaries)s?)(?:\\W|$)\n ', flags=regex.S | regex.I | regex.M | regex.X | regex.V0)
INNER_CONJUNCTIONS = ['and', 'plus']
INNER_PUNCTUATION = regex.Regex('[\\s\\,]', flags=regex.V0)
classmethod get_all_annotations(text: str, float_digits: int = 4) → List[lexnlp.extract.common.annotations.duration_annotation.DurationAnnotation]
lexnlp.extract.en.durations.get_duration_annotations(text: str, float_digits=4) → Generator[lexnlp.extract.common.annotations.duration_annotation.DurationAnnotation, None, None]
lexnlp.extract.en.durations.get_duration_annotations_list(text: str, float_digits=4) → List[lexnlp.extract.common.annotations.duration_annotation.DurationAnnotation]
lexnlp.extract.en.durations.get_durations(text: str, return_sources=False, float_digits=4) → Generator
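
A minimal usage sketch; the tuple shape in the comment is an assumption based on the DURATION_MAP day lengths above:

    from lexnlp.extract.en.durations import get_durations

    text = "The initial term is five years, with a sixty-day cure period."
    print(list(get_durations(text)))
    # expected tuples of roughly (unit, amount, length in days), e.g. ('year', 5, 1825)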

lexnlp.extract.en.en_language_tokens module

class lexnlp.extract.en.en_language_tokens.EnLanguageTokens

Bases: object

abbreviations = {'A.D.', 'A.V.', 'Abbrev.', 'Abd.', 'Aberd.', 'Aberdeensh.', 'Abol.', 'Aborig.', 'Abp.', 'Abr.', 'Abridg.', 'Abridgem.', 'Absol.', 'Abst.', 'Abstr.', 'Acad.', 'Acc.', 'Accept.', 'Accomm.', 'Accompl.', 'Accs.', 'Acct.', 'Accts.', 'Achievem.', 'Add.', 'Addit.', 'Addr.', 'Adm.', 'Admin.', 'Admir.', 'Admon.', 'Admonit.', 'Adv.', 'Advancem.', 'Advert.', 'Advoc.', 'Advt.', 'Advts.', 'Aerodynam.', 'Aeronaut.', 'Aff.', 'Affect.', 'Afr.', 'Agric.', 'Alch.', 'Alg.', 'Alleg.', 'Allit.', 'Alm.', 'Alph.', 'Amer.', 'Anal.', 'Analyt.', 'Anat.', 'Anc.', 'Anecd.', 'Ang.', 'Angl.', 'Anglo-Ind.', 'Anim.', 'Ann.', 'Anniv.', 'Annot.', 'Anon.', 'Answ.', 'Ant.', 'Anthrop.', 'Anthropol.', 'Antiq.', 'Apoc.', 'Apol.', 'App.', 'Appl.', 'Applic.', 'Apr.', 'Arab.', 'Arb.', 'Arch.', 'Archaeol.', 'Archipel.', 'Archit.', 'Argt.', 'Arith.', 'Arithm.', 'Arrangem.', 'Artic.', 'Artific.', 'Artill.', 'Ashm.', 'Assemb.', 'Assoc.', 'Assoc. Football', 'Assyriol.', 'Astr.', 'Astrol.', 'Astron.', 'Astronaut.', 'Att.', 'Attrib.', 'Aug.', 'Austral.', 'Auth.', 'Autobiog.', 'Autobiogr.', 'Ayrsh.', 'B.C.', 'BNC', 'Bacteriol.', 'Bedford.', 'Bedfordsh.', 'Bel & Dr.', 'Belg.', 'Berks.', 'Berksh.', 'Berw.', 'Berwicksh.', 'Bibliogr.', 'Biochem.', 'Biog.', 'Biogr.', 'Biol.', 'Bk.', 'Bks.', 'Bord.', 'Bot.', 'Bp.', 'Braz.', 'Brit.', 'Bucks.', 'Build.', 'Bull.', 'Bur.', 'Cal.', 'Calc.', 'Calend.', 'Calif.', 'Calligr.', 'Camb.', 'Cambr.', 'Campanol.', 'Canad.', 'Canterb.', 'Capt.', 'Cartogr.', 'Catal.', 'Catech.', 'Cath.', 'Cent.', 'Ceram.', 'Cert.', 'Certif.', 'Ch.', 'Ch. Hist.', 'Chamb.', 'Char.', 'Charac.', 'Chas.', 'Chem.', 'Chem. Engin.', 'Chesh.', 'Chr.', 'Chron.', 'Chronol.', 'Chrons.', 'Cinematogr.', 'Circ.', 'Civ. Law', 'Civil Engin.', 'Cl.', 'Class.', 'Class. Antiq.', 'Classif.', 'Climatol.', 'Clin.', 'Col.', 'Coll.', 'Collect.', 'Colloq.', 'Coloss.', 'Com.', 'Comb.', 'Combs.', 'Comm.', 'Comm. Law', 'Commandm.', 'Commend.', 'Commerc.', 'Commiss.', 'Commonw.', 'Communic.', 'Comp.', 'Comp. Anat.', 'Compan.', 'Compar.', 'Compend.', 'Compl.', 'Compos.', 'Conc.', 'Conch.', 'Concl.', 'Conf.', 'Confid.', 'Confl.', 'Confut.', 'Congr.', 'Congreg.', 'Congress.', 'Conn.', 'Consc.', 'Consecr.', 'Consid.', 'Consol.', 'Constit.', 'Constit. Hist.', 'Constr.', 'Contemp.', 'Contempl.', 'Contend.', 'Content.', 'Contin.', 'Contradict.', 'Contrib.', 'Controv.', 'Conv.', 'Convent.', 'Conversat.', 'Convoc.', 'Cor.', 'Cornw.', 'Coron.', 'Corr.', 'Corresp.', 'Counc.', 'Courtsh.', 'Craniol.', 'Craniom.', 'Crim.', 'Crim. Law', 'Crit.', 'Crt.', 'Crts.', 'Cryptogr.', 'Crystallogr.', 'Ct.', 'Cumb.', 'Cumberld.', 'Cumbld.', 'Cycl.', 'Cytol.', 'D.C.', 'Dan.', 'Dau.', 'Deb.', 'Dec.', 'Declar.', 'Ded.', 'Def.', 'Deliv.', 'Demonstr.', 'Dep.', 'Depred.', 'Depredat.', 'Dept.', 'Derbysh.', 'Descr.', 'Deut.', 'Devel.', 'Devonsh.', 'Dial.', 'Dict.', 'Diffic.', 'Direct.', 'Dis.', 'Disc.', 'Discipl.', 'Discov.', 'Discrim.', 'Discuss.', 'Diss.', 'Dist.', 'Distemp.', 'Distill.', 'Distrib.', 'Div.', 'Divers.', 'Dk.', 'Doc.', 'Doctr.', 'Domest.', 'Durh.', 'E. Afr.', 'E. Angl.', 'E. Anglian', 'E. Ind.', 'E.D.D.', 'E.E.T.S.', 'East Ind.', 'Eccl.', 'Eccl. Hist.', 'Eccl. Law', 'Eccles.', 'Ecclus.', 'Ecol.', 'Econ.', 'Ed.', 'Edin.', 'Edinb.', 'Educ.', 'Edw.', 'Egypt.', 'Egyptol.', 'Electr.', 'Electr. Engin.', 'Electro-magn.', 'Electro-physiol.', 'Elem.', 'Eliz.', 'Elizab.', 'Emb.', 'Embryol.', 'Encycl.', 'Encycl. Brit.', 'Encycl. 
Metrop.', 'Eng.', 'Engin.', 'Englishw.', 'Enq.', 'Ent.', 'Enthus.', 'Entom.', 'Entomol.', 'Enzymol.', 'Ep.', 'Eph.', 'Ephes.', 'Epil.', 'Episc.', 'Epist.', 'Epit.', 'Equip.', 'Esd.', 'Ess.', 'Essent.', 'Establ.', 'Esth.', 'Ethnol.', 'Etymol.', 'Eval.', 'Evang.', 'Even.', 'Evid.', 'Evol.', 'Ex. doc.', 'Exalt.', 'Exam.', 'Exch.', 'Exec.', 'Exerc.', 'Exhib.', 'Exod.', 'Exped.', 'Exper.', 'Explan.', 'Explic.', 'Explor.', 'Expos.', 'Ezek.', 'Fab.', 'Fam.', 'Farew.', 'Feb.', 'Ff.', 'Fifesh.', 'Footpr.', 'Forfarsh.', 'Fortif.', 'Fortn.', 'Found.', 'Fr.', 'Fragm.', 'Fratern.', 'Friendsh.', 'Fund.', 'Furnit.', 'Gal.', 'Gard.', 'Gastron.', 'Gaz.', 'Gd.', 'Gen.', 'Geo.', 'Geog.', 'Geogr.', 'Geol.', 'Geom.', 'Geomorphol.', 'Ger.', 'Glac.', 'Glasg.', 'Glos.', 'Gloss.', 'Glouc.', 'Gloucestersh.', 'Gosp.', 'Gov.', 'Govt.', 'Gr.', 'Gram.', 'Gramm. Analysis', 'Gt.', 'Gynaecol.', 'Hab.', 'Haematol.', 'Hag.', 'Hampsh.', 'Handbk.', 'Hants.', 'Heb.', 'Hebr.', 'Hen.', 'Her.', 'Herb.', 'Heref.', 'Hereford.', 'Herefordsh.', 'Hertfordsh.', 'Hierogl.', 'Hist.', 'Histol.', 'Hom.', 'Horol.', 'Hort.', 'Hos.', 'Hosp.', 'Househ.', 'Housek.', 'Husb.', 'Hydraul.', 'Hydrol.', 'Ichth.', 'Icthyol.', 'Ideol.', 'Idol.', 'Illustr.', 'Imag.', 'Immunol.', 'Impr.', 'Inaug.', 'Inc.', 'Inclos.', 'Ind.', 'Industr.', 'Industr. Rel.', 'Infl.', 'Innoc.', 'Inorg.', 'Inq.', 'Inst.', 'Instr.', 'Intell.', 'Intellect.', 'Interc.', 'Interl.', 'Internat.', 'Interpr.', 'Intro.', 'Introd.', 'Inv.', 'Invent.', 'Invert. Zool.', 'Invertebr.', 'Investig.', 'Investm.', 'Invoc.', 'Ir.', 'Irel.', 'Isa.', 'Ital.', 'Jahrb.', 'Jam.', 'Jan.', 'Jap.', 'Jas.', 'Jer.', 'Josh.', 'Jrnl.', 'Jrnls.', 'Jud.', 'Judg.', 'Jul.', 'Jun.', 'Jurisd.', 'Jurisdict.', 'Jurispr.', 'Justif.', 'Justific.', 'Kent.', 'Kgs.', 'Kingd.', 'King’s Bench Div.', 'Knowl.', 'Kpr.', 'LXX', 'Lab.', 'Lam.', 'Lament', 'Lament.', 'Lanc.', 'Lancash.', 'Lancs.', 'Lang.', 'Langs.', 'Lat.', 'Ld.', 'Lds.', 'Lect.', 'Leechd.', 'Leg.', 'Leicest.', 'Leicester.', 'Leicestersh.', 'Leics.', 'Let.', 'Lett.', 'Lev.', 'Lex.', 'Libr.', 'Limnol.', 'Lincolnsh.', 'Lincs.', 'Ling.', 'Linn.', 'Lit.', 'Lithogr.', 'Lithol.', 'Liturg.', 'Lond.', 'MS.', 'MSS.', 'Macc.', 'Mach.', 'Mag.', 'Magn.', 'Mal.', 'Man.', 'Managem.', 'Manch.', 'Manip.', 'Manuf.', 'Mar.', 'Mass.', 'Math.', 'Matt.', 'Meas.', 'Measurem.', 'Mech.', 'Med.', 'Medit.', 'Mem.', 'Merc.', 'Merch.', 'Metall.', 'Metallif.', 'Metallogr.', 'Metamorph.', 'Metaph.', 'Meteorol.', 'Meth.', 'Metrop.', 'Mex.', 'Mic.', 'Mich.', 'Microbiol.', 'Microsc.', 'Mil.', 'Milit.', 'Min.', 'Mineral.', 'Misc.', 'Miscell.', 'Mod.', 'Monum.', 'Morphol.', 'Mt.', 'Mtg.', 'Mts.', 'Munic.', 'Munif.', 'Munim.', 'Mus.', 'Myst.', 'Myth.', 'Mythol.', 'N. Afr.', 'N. Amer.', 'N. Carolina', 'N. Dakota', 'N. Ir.', 'N. Irel.', 'N.E.', 'N.E.D.', 'N.S. Wales', 'N.S.W.', 'N.T.', 'N.W.', 'N.Y.', 'N.Z.', 'Nah.', 'Narr.', 'Narrat.', 'Nat.', 'Nat. Hist.', 'Nat. Philos.', 'Nat. Sci.', 'Naut.', 'Nav.', 'Navig.', 'Neh.', 'Neighb.', 'Nerv.', 'Neurol.', 'Neurosurg.', 'New Hampsh.', 'Newc.', 'Newspr.', 'No.', 'Non-conf.', 'Nonconf.', 'Norf.', 'Northamptonsh.', 'Northants.', 'Northumb.', 'Northumbld.', 'Northumbr.', 'Norw.', 'Norweg.', 'Notts.', 'Nov.', 'Nucl.', 'Num.', 'Numism.', 'O.E.D.', 'O.T.', 'OE', 'Obad.', 'Obed.', 'Obj.', 'Obs.', 'Observ.', 'Obstet.', 'Obstetr.', 'Obstetr. Med.', 'Occas.', 'Occup.', 'Occurr.', 'Oceanogr.', 'Oct.', 'Off.', 'Offic.', 'Okla.', 'Ont.', 'Ophthalm.', 'Ophthalmol.', 'Oppress.', 'Opt.', 'Orac.', 'Ord.', 'Org.', 'Org. Chem.', 'Organ. 
Chem.', 'Orig.', 'Orkn.', 'Ornith.', 'Ornithol.', 'Orthogr.', 'Outl.', 'Oxf.', 'Oxfordsh.', 'Oxon.', 'P. R.', 'Pa.', 'Palaeobot.', 'Palaeogr.', 'Palaeont.', 'Palaeontol.', 'Paraphr.', 'Parasitol.', 'Parl.', 'Parnass.', 'Path.', 'Pathol.', 'Peculat.', 'Penins.', 'Perf.', 'Periodontol.', 'Pers.', 'Persec.', 'Perthsh.', 'Pet.', 'Petrogr.', 'Petrol.', 'Pharm.', 'Pharmaceut.', 'Pharmacol.', 'Phil.', 'Philad.', 'Philem.', 'Philipp.', 'Philol.', 'Philos.', 'Phoen.', 'Phonol.', 'Photog.', 'Photogr.', 'Phrenol.', 'Phys.', 'Physical Chem.', 'Physical Geogr.', 'Physiogr.', 'Physiol.', 'Pict.', 'Poet.', 'Pol.', 'Pol. Econ.', 'Polit.', 'Polytechn.', 'Pop.', 'Porc.', 'Port.', 'Posth.', 'Postm.', 'Pott.', 'Pract.', 'Predict.', 'Pref.', 'Preh.', 'Prehist.', 'Prerog.', 'Pres.', 'Presb.', 'Preserv.', 'Prim.', 'Princ.', 'Print.', 'Probab.', 'Probl.', 'Proc.', 'Prod.', 'Prol.', 'Pronunc.', 'Prop.', 'Pros.', 'Prov.', 'Provid.', 'Provinc.', 'Provis.', 'Ps.', 'Psych.', 'Psychoanal.', 'Psychoanalyt.', 'Psychol.', 'Psychopathol.', 'Pt.', 'Publ.', 'Purg.', 'Q. Eliz.', 'Qld.', 'Quantum Mech.', 'Queen’s Bench Div.', 'R.A.F.', 'R.C.', 'R.C. Church', 'R.N.', 'Radiol.', 'Reas.', 'Reb.', 'Rebell.', 'Rec.', 'Reclam.', 'Recoll.', 'Redempt.', 'Ref.', 'Refl.', 'Refus.', 'Refut.', 'Reg.', 'Regic.', 'Regist.', 'Regr.', 'Rel.', 'Relig.', 'Reminisc.', 'Remonstr.', 'Renfrewsh.', 'Rep.', 'Reprod.', 'Rept.', 'Repub.', 'Res.', 'Resid.', 'Ret.', 'Retrosp.', 'Rev.', 'Revol.', 'Rhet.', 'Rhode Isl.', 'Rich.', 'Rom.', 'Rom. Antiq.', 'Ross-sh.', 'Roxb.', 'Roy.', 'Rudim.', 'Russ.', 'S. Afr.', 'S. Carolina', 'S. Dakota', 'S.E.', 'S.T.S.', 'S.W.', 'SS.', 'Sam.', 'Sask.', 'Sat.', 'Sax.', 'Sc.', 'Scand.', 'Sch.', 'Sci.', 'Scot.', 'Scotl.', 'Script.', 'Sculpt.', 'Seismol.', 'Sel.', 'Sel. comm.', 'Select.', 'Sept.', 'Ser.', 'Serm.', 'Sess.', 'Settlem.', 'Sev.', 'Shakes.', 'Shaks.', 'Sheph.', 'Shetl.', 'Shropsh.', 'Soc.', 'Sociol.', 'Som.', 'Song Sol.', 'Song of Sol.', 'Sonn.', 'Span.', 'Spec.', 'Specif.', 'Specim.', 'Spectrosc.', 'St.', 'Staff.', 'Stafford.', 'Staffordsh.', 'Staffs.', 'Stand.', 'Stat.', 'Statist.', 'Stock Exch.', 'Stratigr.', 'Struct.', 'Stud.', 'Subj.', 'Subscr.', 'Subscript.', 'Suff.', 'Suppl.', 'Supplic.', 'Suppress.', 'Surg.', 'Surv.', 'Sus.', 'Symmetr.', 'Symp.', 'Syst.', 'Taxon.', 'Techn.', 'Technol.', 'Tel.', 'Telecomm.', 'Telegr.', 'Teleph.', 'Teratol.', 'Terminol.', 'Terrestr.', 'Test.', 'Textbk.', 'Theat.', 'Theatr.', 'Theol.', 'Theoret.', 'Thermonucl.', 'Thes.', 'Thess.', 'Tim.', 'Tit.', 'Topogr.', 'Trad.', 'Trag.', 'Trans.', 'Transl.', 'Transubstant.', 'Trav.', 'Treas.', 'Treat.', 'Treatm.', 'Trib.', 'Trig.', 'Trigonom.', 'Trop.', 'Troub.', 'Troubl.', 'Typog.', 'Typogr.', 'U.K.', 'U.S.', 'U.S.A.F.', 'U.S.S.R.', 'Univ.', 'Unnat.', 'Unoffic.', 'Urin.', 'Utilit.', 'Va.', 'Vac.', 'Valedict.', 'Veg.', 'Veg. Phys.', 'Veg. Physiol.', 'Venet.', 'Vertebr.', 'Vet.', 'Vet. Med.', 'Vet. Path.', 'Vet. Sci.', 'Vet. Surg.', 'Vic.', 'Vict.', 'Vind.', 'Vindic.', 'Virg.', 'Virol.', 'Voc.', 'Vocab.', 'Vol.', 'Vols.', 'Voy.', 'Vulg.', 'W. Afr.', 'W. Ind.', 'W. Indies', 'W. Va.', 'Warwicksh.', 'Wd.', 'Westm.', 'Westmld.', 'Westmorld.', 'Westmrld.', 'Will.', 'Wilts.', 'Wiltsh.', 'Wis.', 'Wisd.', 'Wk.', 'Wkly.', 'Wks.', 'Wonderf.', 'Worc.', 'Worcestersh.', 'Worcs.', 'Writ.', 'Yearbk.', 'Yng.', 'Yorks.', 'Yorksh.', 'Yr.', 'Yrs.', 'Zech.', 'Zeitschr.', 'Zeph.', 'Zoogeogr.', 'Zool.', 'abbrev.', 'abl.', 'abs.', 'absol.', 'abstr.', 'acc.', 'accus.', 'act.', 'ad.', 'adj.', 'adj. 
phr.', 'adjs.', 'adv.', 'advb.', 'advs.', 'agst.', 'alt.', 'aphet.', 'app.', 'appos.', 'arch.', 'art.', 'attrib.', 'bef.', 'betw.', 'cent.', 'cf.', 'cl.', 'cogn. w.', 'collect.', 'colloq.', 'comb. form', 'comp.', 'compar.', 'compl.', 'conc.', 'concr.', 'conj.', 'cons.', 'const.', 'contempt.', 'contr.', 'corresp.', 'cpd.', 'dat.', 'def.', 'dem.', 'deriv.', 'derog.', 'dial.', 'dim.', 'dyslog.', 'e. midl.', 'eOE', 'east.', 'ed.', 'ellipt.', 'emph.', 'erron.', 'esp.', 'etym.', 'etymol.', 'euphem.', 'exc.', 'fam.', 'famil.', 'fem.', 'fig.', 'fl.', 'freq.', 'fut.', 'gen.', 'gerund.', 'hist.', 'imit.', 'imp.', 'imperf.', 'impers.', 'impf.', 'improp.', 'inc.', 'ind.', 'indef.', 'indic.', 'indir.', 'infin.', 'infl.', 'instr.', 'int.', 'interj.', 'interrog.', 'intr.', 'intrans.', 'iron.', 'irreg.', 'joc.', 'lOE', 'lit.', 'll.', 'masc.', 'med.', 'metaphor.', 'metr. gr.', 'midl.', 'mispr.', 'mod.', 'n.e.', 'n.w.', 'no.', 'nom.', 'nonce-wd.', 'north.', 'nr.', 'ns.', 'obj.', 'obl.', 'obs.', 'occas.', 'opp.', 'orig.', 'p.', 'pa.', 'pa. pple.', 'pa. t.', 'pass.', 'perf.', 'perh.', 'pers.', 'personif.', 'pf.', 'phonet.', 'phr.', 'pl.', 'plur.', 'poet.', 'pop.', 'poss.', 'ppl.', 'ppl. a.', 'ppl. adj.', 'ppl. adjs.', 'pple.', 'pples.', 'pr.', 'pr. pple.', 'prec.', 'pred.', 'predic.', 'pref.', 'prep.', 'pres.', 'pres. pple.', 'priv.', 'prob.', 'pron.', 'pronunc.', 'prop.', 'propr.', 'prov.', 'pseudo-Sc.', 'pseudo-arch.', 'pseudo-dial.', 'q.v.', 'quot.', 'quots.', 'redupl.', 'refash.', 'refl.', 'reg.', 'rel.', 'repr.', 'rhet.', 's.e.', 's.v.', 's.w.', 'sc.', 'sing.', 'south.', 'sp.', 'spec.', 'str.', 'subj.', 'subjunct.', 'subord.', 'subord. cl.', 'subseq.', 'subst.', 'suff.', 'superl.', 'syll.', 'techn.', 'tr.', 'trans.', 'transf.', 'transl.', 'ult.', 'unkn.', 'unstr.', 'usu.', 'v.r.', 'v.rr.', 'var.', 'varr.', 'vars.', 'vb.', 'vbl.', 'vbl. ns.', 'vbl.n.', 'vbs.', 'viz.', 'vulg.', 'wd.', 'west.', 'wk.'}
articles = ['a', 'the', 'an']
conjunctions = ['for', 'and', 'nor', 'but', 'or', 'yet', 'so']
static init()
pronouns = {'I', 'all', 'another', 'any', 'anybody', 'anyone', 'anything', 'both', 'each', 'each other', 'either', 'enough', 'everybody', 'everyone', 'everything', 'few', 'he', 'her', 'hers', 'herself', 'him', 'himself', 'his', 'i', 'it', 'itself', 'little', 'many', 'me', 'mine', 'more', 'most', 'much', 'myself', 'neither', 'no one', 'nobody', 'none', 'nothing', 'one', 'one another', 'other', 'others', 'ours', 'ourselves', 'several', 'she', 'some', 'somebody', 'someone', 'something', 'such', 'that', 'theirs', 'them', 'themselves', 'these', 'they', 'this', 'those', 'us', 'we', 'what', 'whatever', 'which', 'whichever', 'who', 'whoever', 'whom', 'whomever', 'whose', 'you', 'yours', 'yourself'}

lexnlp.extract.en.geoentities module

Geo Entity extraction for English.

This module implements extraction functionality for geo entities in English, including formal names, abbreviations, and aliases.

lexnlp.extract.en.geoentities.get_geoentities(text: str, geo_config_list: List[lexnlp.extract.en.dict_entities.DictionaryEntry], priority: bool = False, priority_by_id: bool = False, text_languages: List[str] = None, min_alias_len: int = 2, prepared_alias_ban_list: Union[None, Dict[str, Tuple[List[str], List[str]]]] = None, simplified_normalization: bool = False) → Generator[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias], Any, Any]

Searches for geo entities from the provided config list and yields pairs of (entity, alias). An entity is (entity_id, name, [list of aliases]); an alias is (alias_text, lang, is_abbrev, alias_id).

This method uses the general searching routines for dictionary entities from the dict_entities.py module. Methods of the dict_entities module can be used for conveniently creating the config: entity_config(), entity_alias(), add_aliases_to_entity().

Parameters:
  • text
  • geo_config_list – list of all possible known geo entities in the form of tuples (id, name, [(alias, lang, is_abbrev, alias_id), …])
  • priority – if two entities are found with totally equal matching aliases, use the one with the greatest priority field
  • priority_by_id – if two entities are found with totally equal matching aliases, use the one with the lowest id
  • text_languages – language(s) of the source text. If a language is specified then only aliases of this language will be searched for. For example, this allows ignoring “Island”, a German-language alias of Iceland, in English texts.
  • min_alias_len – minimal length of geo entity aliases to search for
  • prepared_alias_ban_list – list of aliases to exclude from searching in the form: dict of lang -> (list of normalized non-abbreviation aliases, list of normalized abbreviation aliases). Use dict_entities.prepare_alias_banlist_dict() for preparing this dict.
  • simplified_normalization – don’t use NLTK for “normalizing” text
Returns:Generates tuples: (entity, alias)

lexnlp.extract.en.geoentities.get_geoentity_annotations(text: str, geo_config_list: List[lexnlp.extract.en.dict_entities.DictionaryEntry], priority: bool = False, priority_by_id: bool = False, text_languages: List[str] = None, min_alias_len: int = 2, prepared_alias_ban_list: Union[None, Dict[str, Tuple[List[str], List[str]]]] = None, simplified_normalization: bool = False) → Generator[lexnlp.extract.common.annotations.geo_annotation.GeoAnnotation, None, None]

See get_geoentities

lexnlp.extract.en.geoentities.load_entities_dict_by_path(entities_fn: str, aliases_fn: str)
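
A hedged sketch mirroring the courts example; the two-entry config and the printed attributes are illustrative:

    from lexnlp.extract.en.dict_entities import DictionaryEntry, DictionaryEntryAlias
    from lexnlp.extract.en.geoentities import get_geoentities

    geo_config = [
        DictionaryEntry(id=1, name="Iceland",
                        aliases=[DictionaryEntryAlias(alias="Island", language="de")]),
        DictionaryEntry(id=2, name="United States",
                        aliases=[DictionaryEntryAlias(alias="U.S.", is_abbreviation=True)]),
    ]
    text = "The company has subsidiaries in Iceland and the U.S."
    for entity, alias in get_geoentities(text, geo_config):
        print(entity.name, alias.alias)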

lexnlp.extract.en.introductory_words_detector module

class lexnlp.extract.en.introductory_words_detector.IntroductoryWordsDetector

Bases: object

INTRODUCTORY_POS = [[('RB', {'also', 'so'}), ('VBN', {'known', 'called', 'named'})], [('RB', {'also', 'so'}), ('JJ', {'known', 'called', 'named'})], [('VBN', {'known', 'called', 'named'})]]
INTRO_ADVERBS = {'also', 'so'}
INTRO_VERBS = {'called', 'known', 'named'}
MAX_INTRO_LEN = 2
PUNCTUATION_POS = {'\t', '!', '"', '$', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', ':', ';', '?', '@', '\\', ']', '^', '``', '{', '}['}
static remove_term_introduction(term: str, term_pos: List[Tuple[str, str, int, int]]) → str

Removes the introduction from a term: so called “champerty” => “champerty”
Parameters:
  • term – source phrase
  • term_pos – the source phrase’s tokens with part-of-speech tags

lexnlp.extract.en.money module

Money extraction for English.

This module implements basic money extraction functionality in English.

Todo:
  • Improved unit tests and case coverage
lexnlp.extract.en.money.get_money(text: str, return_sources: bool = False, float_digits: int = 4) → Generator
lexnlp.extract.en.money.get_money_annotations(text: str, float_digits: int = 4) → Generator[lexnlp.extract.common.annotations.money_annotation.MoneyAnnotation, None, None]
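
A minimal usage sketch; the (amount, currency) tuple shape in the comment is indicative only:

    from lexnlp.extract.en.money import get_money

    text = "The purchase price is $25,000,000, plus EUR 1.5 million in escrow."
    print(list(get_money(text)))
    # expected tuples of roughly (amount, currency), e.g. (Decimal('25000000'), 'USD')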

lexnlp.extract.en.percents module

Percent extraction for English.

This module implements percent extraction functionality in English.

lexnlp.extract.en.percents.get_percent_annotations(text: str, float_digits: int = 4) → Generator[lexnlp.extract.common.annotations.percent_annotation.PercentAnnotation, None, None]

Get percent usages within text.

lexnlp.extract.en.percents.get_percents(text: str, return_sources: bool = False, float_digits: int = 4) → Generator[Union[Tuple[str, decimal.Decimal, decimal.Decimal], Tuple[str, decimal.Decimal, decimal.Decimal, str]], None, None]

Get percent usages within text.
Parameters:
  • text
  • return_sources
  • float_digits
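
A minimal usage sketch following the (unit, amount, fraction) tuple shape in the signature above (the concrete values in the comment are indicative only):

    from lexnlp.extract.en.percents import get_percents

    text = "Interest accrues at a rate of 5.5% per annum."
    print(list(get_percents(text)))
    # expected something like [('%', Decimal('5.5'), Decimal('0.055'))]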

lexnlp.extract.en.pii module

PII extraction for English.

This module implements PII extraction functionality in English.

Todo:
lexnlp.extract.en.pii.get_pii(text: str, return_sources=False) → Generator

Find possible PII references in the text.
Parameters:
  • text
  • return_sources

lexnlp.extract.en.pii.get_pii_annotations(text: str) → Generator[lexnlp.extract.common.annotations.text_annotation.TextAnnotation, None, None]

Find possible PII references in the text.

lexnlp.extract.en.pii.get_ssn_annotations(text: str) → Generator[lexnlp.extract.common.annotations.ssn_annotation.SsnAnnotation, None, None]
lexnlp.extract.en.pii.get_ssns(text, return_sources=False) → Generator

Find possible SSN references in the text.

lexnlp.extract.en.pii.get_us_phone_annotations(text: str) → Generator[lexnlp.extract.common.annotations.phone_annotation.PhoneAnnotation, None, None]

Find possible telephone numbers in the text.

lexnlp.extract.en.pii.get_us_phones(text: str, return_sources=False) → Generator

Find possible telephone numbers in the text.
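
A minimal usage sketch (the sample values are fake; the printed pair shape is indicative only):

    from lexnlp.extract.en.pii import get_pii

    text = "Contact John at (212) 212-2121. SSN on file: 123-45-6789."
    for pii in get_pii(text):
        # expected (type, value) pairs such as ('us_phone', ...) and ('ssn', ...)
        print(pii)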

lexnlp.extract.en.ratios module

Ratio extraction for English.

This module implements ratio extraction functionality in English.

Todo:
  • Improved unit tests and case coverage
lexnlp.extract.en.ratios.get_ratio_annotations(text: str, float_digits: int = 4) → Generator[lexnlp.extract.common.annotations.ratio_annotation.RatioAnnotation, None, None]
lexnlp.extract.en.ratios.get_ratios(text: str, return_sources: bool = False, float_digits: int = 4) → Generator[Union[Tuple[decimal.Decimal, decimal.Decimal, decimal.Decimal], Tuple[decimal.Decimal, decimal.Decimal, decimal.Decimal, str]], None, None]
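
A minimal usage sketch following the (left, right, ratio value) triple in the signature above:

    from lexnlp.extract.en.ratios import get_ratios

    text = "The debt-to-equity ratio shall not exceed 3:1."
    print(list(get_ratios(text)))
    # expected something like [(Decimal('3'), Decimal('1'), Decimal('3'))]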

lexnlp.extract.en.regulations module

Regulation extraction for English.

This module implements regulation extraction functionality in English.

Todo:
  • Improved unit tests and case coverage
lexnlp.extract.en.regulations.get_regulation_annotations(text: str) → Generator[lexnlp.extract.common.annotations.regulation_annotation.RegulationAnnotation, None, None]

Get regulations.
Parameters:
  • text
Returns:Generates regulation annotations

lexnlp.extract.en.regulations.get_regulations(text, return_source=False, as_dict=False) → Generator

Get regulations.
Parameters:
  • text
  • return_source
  • as_dict
Returns:tuple or dict with the regulation type and code[, source text]
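
A minimal usage sketch (the output in the comment is indicative only):

    from lexnlp.extract.en.regulations import get_regulations

    text = "Hazardous waste generators are regulated under 40 CFR 262."
    print(list(get_regulations(text)))
    # expected something like [('Code of Federal Regulations', '40 CFR 262')]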

lexnlp.extract.en.trademarks module

Trademark extraction for English using NLTK and NLTK pre-trained maximum entropy classifier.

This module implements basic Trademark extraction functionality in English relying on the pre-trained NLTK functionality, including POS tagger and NE (fuzzy) chunkers.

Todo: -

lexnlp.extract.en.trademarks.get_trademark_annotations(text: str) → Generator[lexnlp.extract.common.annotations.trademark_annotation.TrademarkAnnotation, None, None]

Find trademarks in text.

lexnlp.extract.en.trademarks.get_trademarks(text: str) → Generator[str, None, None]

Find trademarks in text.
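
A minimal usage sketch (the sample marks and the output format are illustrative):

    from lexnlp.extract.en.trademarks import get_trademarks

    text = "Products are sold under the ACME® and ROADRUNNER™ marks."
    print(list(get_trademarks(text)))  # expected: the marked names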

lexnlp.extract.en.urls module

Urls extraction for English using NLTK and NLTK pre-trained maximum entropy classifier.

This module implements basic urls extraction functionality in English relying on the pre-trained NLTK functionality, including POS tagger and NE (fuzzy) chunkers.

Todo: -

lexnlp.extract.en.urls.get_url_annotations(text: str) → Generator[lexnlp.extract.common.annotations.url_annotation.UrlAnnotation, None, None]

Find urls in text.

lexnlp.extract.en.urls.get_urls(text: str) → Generator[str, None, None]

Find urls in text.
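
A minimal usage sketch:

    from lexnlp.extract.en.urls import get_urls

    text = "The full terms are posted at https://example.com/terms."
    print(list(get_urls(text)))  # expected: ['https://example.com/terms']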

lexnlp.extract.en.utils module

Extraction utilities for English.

class lexnlp.extract.en.utils.NPExtractor(grammar=None)

Bases: object

cleanup_leaves(leaves)
exception_pos = ['IN', 'CC']
exception_sym = ['&', 'and', 'of']
get_np(text: str) → Generator[str, None, None]
get_np_with_coords(text: str) → List[Tuple[str, int, int]]
get_tokenizer()
join(np_items)
replace(text, back=False)
replacements = [[('(\\w)&(\\w)', '\\1-=AND=-\\2'), ('-=AND=-', '&')]]
sep(n, current_pos, last_pos)
static strip_np(np)
sym_with_space = ['(', '&']
sym_without_space = ['!', '"', '#', '$', '%', "'", ')', '*', '+', ',', '-', '.', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', "'s"]
lexnlp.extract.en.utils.strip_unicode_punctuation(text, valid_punctuation=None)

This method strips all unicode punctuation that is not accepted.
Parameters:
  • text – text to strip
  • valid_punctuation – valid punctuation to accept
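
A hedged sketch of the helper (assumes curly quotes fall outside the default accepted punctuation):

    from lexnlp.extract.en.utils import strip_unicode_punctuation

    # curly quotes are unicode punctuation and should be stripped by default
    print(strip_unicode_punctuation('“Tenant”'))  # expected: 'Tenant'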

Module contents