lexnlp.extract.en package

Subpackages

Submodules

lexnlp.extract.en.acts module

lexnlp.extract.en.acts.get_act_list(*args, **kwargs) → List[Dict[str, str]]
lexnlp.extract.en.acts.get_acts(text: str) → Generator[Dict[str, Any], None, None]
lexnlp.extract.en.acts.get_acts_annotations(text: str) → Generator[lexnlp.extract.common.annotations.act_annotation.ActAnnotation, None, None]
lexnlp.extract.en.acts.get_acts_annotations_list(text: str) → List[lexnlp.extract.common.annotations.act_annotation.ActAnnotation]

lexnlp.extract.en.amounts module

Amount extraction for English.

This module implements basic amount extraction functionality in English.

This module supports converting:
  • numbers with comma delimiter: “25,000.00”, “123,456,000”
  • written numbers: “Seven Hundred Eighty”
  • mixed written numbers: “5 million”, “2.55 BILLION”
  • written ordinal numbers: “twenty-fifth”
  • fractions (non-written): “1/33”, “25/100”, where 1 < numerator < 99 and 1 < denominator < 999; the fraction No/100 will be treated as 00/100
  • written numbers and fractions: “twenty one AND 5/100”
  • written fractions: “one-third”, “three tenths”, “ten ninety-ninths”, “twenty AND one-hundredths”, “2 hundred and one-thousandth”, where 1 < numerator < 99 and 2 < denominator < 99 and numerator < denominator; or 1 < numerator < 99 and denominator == 100, i.e. 1/99 - 99/100; or 1 < numerator < 99 and denominator == 1000, i.e. 1/1000 - 99/1000
  • floats starting with “.” (dot): “.5 million”
  • “dozen”: “twenty-two DOZEN”
  • “half”: “Six and a HALF Billion”, “two and a half”
  • “quarter”: “five and one-quarter”, “5 and one-quarter”, “three-quarters”
  • multiple numbers: “$25,400, 1 million people and 3.5 tons”

Avoided (skipped): “5.3.1.”, “1/1/2010”
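As a rough illustration of the digit-based cases above, here is a minimal sketch. It is not lexnlp's actual implementation (which also handles written numbers, ordinals and fractions, and explicitly skips section numbers and dates); the helper name and regex are hypothetical.

```python
import re

# Hypothetical helper covering only digit-based amounts with an optional
# written multiplier: "25,400", "2.55 BILLION", ".5 million".
MULTIPLIERS = {'thousand': 10 ** 3, 'million': 10 ** 6,
               'billion': 10 ** 9, 'trillion': 10 ** 12}
AMOUNT_RE = re.compile(
    r'(?P<num>\d[\d,]*(?:\.\d+)?|\.\d+)\s*'
    r'(?P<mult>thousand|million|billion|trillion)?',
    re.IGNORECASE)

def get_simple_amounts(text):
    for m in AMOUNT_RE.finditer(text):
        value = float(m.group('num').replace(',', ''))
        if m.group('mult'):
            value *= MULTIPLIERS[m.group('mult').lower()]
        yield value
```

A sketch like this would still pick up “5.3.1.” as 5.3; the real module's patterns avoid section numbers and dates.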

lexnlp.extract.en.amounts.get_amount_annotations(text: str, extended_sources=True, float_digits=4) → Generator[lexnlp.extract.common.annotations.amount_annotation.AmountAnnotation, None, None]

Find possible amount references in the text.
Parameters:
  • text – text to search
  • extended_sources – also return the data around the amount itself
  • float_digits – round floats to N digits; don’t round if None
Returns:generator of amount annotations

lexnlp.extract.en.amounts.get_amounts(text: str, return_sources=False, extended_sources=True, float_digits=4) → Generator[float, None, None]

Find possible amount references in the text.
Parameters:
  • text – text to search
  • return_sources – return the amount AND its source text
  • extended_sources – also return the data around the amount itself
  • float_digits – round floats to N digits; don’t round if None
Returns:generator of amounts

lexnlp.extract.en.amounts.get_np(text) → Generator
lexnlp.extract.en.amounts.text2num(s, search_fraction=True)

Convert a written amount into an integer/float.
Parameters:
  • s – written number
  • search_fraction – also extract a fraction part
Returns:integer/float
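The classic accumulator approach behind written-number conversion can be sketched as follows. This is illustrative only and not lexnlp's text2num, which additionally handles ordinals, fractions, “dozen”, “half” and more; the function name here is hypothetical.

```python
# Word values for units and tens.
WORD_VALUES = {w: i for i, w in enumerate(
    ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight',
     'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen',
     'sixteen', 'seventeen', 'eighteen', 'nineteen'])}
WORD_VALUES.update({'twenty': 20, 'thirty': 30, 'forty': 40, 'fifty': 50,
                    'sixty': 60, 'seventy': 70, 'eighty': 80, 'ninety': 90})

def simple_text2num(s: str) -> int:
    # Accumulate a running "current" group; flush it on scale words.
    total, current = 0, 0
    for word in s.lower().replace('-', ' ').split():
        if word == 'and':
            continue
        if word == 'hundred':
            current *= 100
        elif word in ('thousand', 'million', 'billion'):
            total += current * {'thousand': 10 ** 3, 'million': 10 ** 6,
                                'billion': 10 ** 9}[word]
            current = 0
        else:
            current += WORD_VALUES[word]
    return total + current
```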

lexnlp.extract.en.citations module

Citation extraction for English.

This module implements citation extraction functionality in English.

Todo:
  • Improved unit tests and case coverage
lexnlp.extract.en.citations.get_citation_annotations(text: str) → Generator[lexnlp.extract.common.annotations.citation_annotation.CitationAnnotation, None, None]

Get citations.
Parameters:
  • text – text to search
Returns:generator of citation annotations; each carries (volume, reporter, reporter_full_name, page, page2, court, year[, source text])

lexnlp.extract.en.citations.get_citations(text: str, return_source=False, as_dict=False) → Generator

Get citations.
Parameters:
  • text – text to search
  • return_source – also return the source text
  • as_dict – return dicts instead of tuples
Returns:tuple or dict (volume, reporter, reporter_full_name, page, page2, court, year[, source text])
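The “volume reporter page (year)” shape of a case citation can be matched with a small regex. This is a hypothetical, minimal pattern for illustration; the real module relies on a database of reporter names and abbreviations rather than a single expression.

```python
import re

# Minimal sketch: volume, a capitalized reporter abbreviation, page,
# and an optional parenthesized year.
CITATION_RE = re.compile(
    r'(?P<volume>\d+)\s+(?P<reporter>[A-Z][\w.]*(?:\s[A-Z][\w.]*)*)'
    r'\s+(?P<page>\d+)(?:\s*\((?P<year>\d{4})\))?')
```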

lexnlp.extract.en.conditions module

Condition extraction for English.

This module implements basic condition extraction functionality in English.

Todo:
  • Improved unit tests and case coverage
lexnlp.extract.en.conditions.create_condition_pattern(condition_pattern_template, condition_phrases)

Create a condition pattern.
Parameters:
  • condition_pattern_template – template with a placeholder for the condition phrases
  • condition_phrases – phrases to substitute into the template
Returns:compiled condition pattern

lexnlp.extract.en.conditions.get_condition_annotations(text: str, strict=True) → Generator[lexnlp.extract.common.annotations.condition_annotation.ConditionAnnotation, None, None]

Find possible conditions in natural language.
Parameters:
  • text – text to search
  • strict
Returns:generator of condition annotations

lexnlp.extract.en.conditions.get_conditions(text, strict=True) → Generator
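The template-plus-phrases pattern construction used by create_condition_pattern can be sketched as below. The template and phrase list here are hypothetical placeholders; lexnlp ships a much larger phrase inventory and a more elaborate template.

```python
import re

# Hypothetical phrase list and template.
CONDITION_PHRASES = ['subject to', 'conditioned on', 'provided that']
PATTERN_TEMPLATE = r'(?P<condition>{phrases})'

def create_condition_pattern(template: str, phrases):
    # Escape each phrase and join them into a single alternation.
    alternation = '|'.join(re.escape(p) for p in phrases)
    return re.compile(template.format(phrases=alternation), re.IGNORECASE)
```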

lexnlp.extract.en.constraints module

Constraint extraction for English.

This module implements basic constraint extraction functionality in English.

Todo:
  • Improved unit tests and case coverage
lexnlp.extract.en.constraints.create_constraint_pattern(constraint_pattern_template, constraint_phrases)

Create a constraint pattern.
Parameters:
  • constraint_pattern_template – template with a placeholder for the constraint phrases
  • constraint_phrases – phrases to substitute into the template
Returns:compiled constraint pattern

lexnlp.extract.en.constraints.get_constraint_annotations(text: str, strict=False) → Generator[lexnlp.extract.common.annotations.constraint_annotation.ConstraintAnnotation, None, None]

Find possible constraints in natural language.
Parameters:
  • text – text to search
  • strict
Returns:generator of constraint annotations

lexnlp.extract.en.constraints.get_constraints(text: str, strict=False) → Generator

Find possible constraints in natural language.
Parameters:
  • text – text to search
  • strict
Returns:generator of constraints

lexnlp.extract.en.copyright module

Copyright extraction for English using NLTK and NLTK pre-trained maximum entropy classifier.

This module implements basic Copyright extraction functionality in English relying on the pre-trained NLTK functionality, including POS tagger and NE (fuzzy) chunkers.


class lexnlp.extract.en.copyright.CopyrightEnParser

Bases: lexnlp.extract.common.copyrights.copyright_en_style_parser.CopyrightEnStyleParser

classmethod extract_phrases_with_coords(sentence: str) → List[Tuple[str, int, int]]
class lexnlp.extract.en.copyright.CopyrightNPExtractor(grammar=None)

Bases: lexnlp.extract.en.utils.NPExtractor

allowed_pos = ['IN', 'CC', 'NN']
allowed_sym = ['&', 'and', 'of', '©']
static strip_np(np)
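A copyright notice of the form “© year holder” can be sketched with a single regex. This pattern is hypothetical and deliberately naive; the real parser combines regexes with NLTK POS tagging and the noun-phrase extractor above to delimit the holder properly.

```python
import re

# Hypothetical minimal notice pattern: a copyright marker, a year,
# and a capitalized holder phrase (greedy, so trailing text may leak in).
COPYRIGHT_RE = re.compile(
    r'(?:©|\(c\)|copyright)\s+(?P<year>\d{4})\s+(?P<holder>[A-Z][\w&.,\' ]+)',
    re.IGNORECASE)
```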

lexnlp.extract.en.courts module

Court extraction for English.

This module implements extraction functionality for courts in English, including formal names, abbreviations, and aliases.

Todo:
  • Add utilities for loading court data
lexnlp.extract.en.courts.get_court_annotations(text: str, language: str = None) → Generator[lexnlp.extract.common.annotations.court_annotation.CourtAnnotation, None, None]
lexnlp.extract.en.courts.get_courts()

Searches for courts from the provided config list and yields tuples of (court_config, court_alias).
Court config is: (court_id, court_name, [list of aliases]).
Alias is: (alias_text, language, is_abbrev, alias_id).

This method uses the general searching routines for dictionary entities from the dict_entities.py module. Methods of the dict_entities module (entity_config(), entity_alias(), add_aliases_to_entity()) can be used to conveniently create the config.
Parameters:
  • text – text to search
  • court_config_list – list of all possible known courts in the form of tuples: (id, name, [(alias, lang, is_abbrev), …])
  • return_source
  • priority – if two courts are found with totally equal matching aliases, use the one with the lowest id
  • text_languages – language(s) of the source text. If a language is specified, then only aliases of this language will be searched for. For example, this allows ignoring “Island”, a German-language alias of Iceland, in English texts.
Returns:Generates tuples: (court entity, court alias)
lexnlp.extract.en.courts.setup_en_parser()

lexnlp.extract.en.cusip module

CUSIP extraction for English.

This module implements CUSIP extraction functionality in English.

Todo:
  • Improved unit tests and case coverage
lexnlp.extract.en.cusip.get_cusip(text: str) → Generator[Dict[str, Any], None, None]
lexnlp.extract.en.cusip.get_cusip_annotations(text: str) → Generator[lexnlp.extract.common.annotations.cusip_annotation.CusipAnnotation, None, None]

INFO: https://www.cusip.com/pdf/CUSIP_Intro_03.14.11.pdf

lexnlp.extract.en.cusip.get_cusip_list(text)
lexnlp.extract.en.cusip.is_cusip_valid(code, return_checksum=False)
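The CUSIP check digit follows the standard modulus-10 “double-add-double” scheme described in the CUSIP introduction linked above. A self-contained sketch (the function names mirror is_cusip_valid but this is not lexnlp's code):

```python
def cusip_checksum(code: str) -> int:
    # First 8 characters: digits keep their value, letters map to 10..35,
    # and the specials '*', '@', '#' map to 36..38.
    total = 0
    for i, ch in enumerate(code[:8]):
        if ch.isdigit():
            v = int(ch)
        elif ch.isalpha():
            v = ord(ch.upper()) - ord('A') + 10
        else:
            v = {'*': 36, '@': 37, '#': 38}[ch]
        if i % 2 == 1:          # every second character is doubled
            v *= 2
        total += v // 10 + v % 10
    return (10 - total % 10) % 10

def check_cusip(code: str) -> bool:
    # Valid 9-character CUSIP: computed check digit matches the last digit.
    return len(code) == 9 and code[8].isdigit() and cusip_checksum(code) == int(code[8])
```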

lexnlp.extract.en.date_model module

lexnlp.extract.en.dates module

lexnlp.extract.en.definition_parsing_methods module

Definition extraction for English.

This module implements basic definition extraction functionality in English.

Todo:
  • Improved unit tests and case coverage
class lexnlp.extract.en.definition_parsing_methods.DefinitionCaught(name: str, text: str, coords: Tuple[int, int])

Bases: object

Each definition is stored in this class with its name, full text and “coords” within the whole document.

coords
does_consume_target(target) → int
Parameters:target – a definition that is, probably, “consumed” by the current one
Returns:1 if self consumes the target, -1 if the target consumes self, otherwise 0
name
text
lexnlp.extract.en.definition_parsing_methods.does_term_are_service_words(term_pos: List[Tuple[str, str, int, int]]) → bool

Does term consist of service words only?

lexnlp.extract.en.definition_parsing_methods.filter_definitions_for_self_repeating(definitions: List[lexnlp.extract.en.definition_parsing_methods.DefinitionCaught]) → List[lexnlp.extract.en.definition_parsing_methods.DefinitionCaught]
Parameters:definitions – definitions to filter
Returns:the definitions with “overlapped” entries excluded; only unique definitions remain
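The overlap-filtering idea can be sketched as: sort candidates by span length, longest first, and keep a candidate only if it does not overlap an already-kept one. The function name and the (name, (start, end)) shape are illustrative, not the DefinitionCaught API.

```python
def filter_overlapping(definitions):
    # definitions: list of (name, (start, end)) tuples.
    kept = []
    for name, (s, e) in sorted(definitions,
                               key=lambda d: d[1][1] - d[1][0], reverse=True):
        # Keep the span only if it does not overlap any longer, already-kept span.
        if not any(s < ke and ks < e for _, (ks, ke) in kept):
            kept.append((name, (s, e)))
    return kept
```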
lexnlp.extract.en.definition_parsing_methods.get_definition_list_in_sentence(sentence_coords: Tuple[int, int, str], decode_unicode=True) → List[lexnlp.extract.en.definition_parsing_methods.DefinitionCaught]

Find possible definitions in natural language in a single sentence.
Parameters:
  • sentence_coords – sentence start, end and the sentence itself
  • decode_unicode
Returns:list of definitions found

lexnlp.extract.en.definition_parsing_methods.get_quotes_count_in_string(text: str) → int
Parameters:text – text to count quotes within
Returns:the count of quotes within the passed text
lexnlp.extract.en.definition_parsing_methods.join_collection(collection)
lexnlp.extract.en.definition_parsing_methods.regex_matches_to_word_coords(pattern: Pattern[str], text: str, phrase_start: int = 0) → List[Tuple[str, int, int]]
Parameters:
  • pattern – pattern for searching for matches within the text
  • text – text to search for matches
  • phrase_start – a value to be added to start / end
Returns:

tuples of (match_text, start, end) out of the regex (pattern) matches in text

lexnlp.extract.en.definition_parsing_methods.split_definitions_inside_term(term: str, src_with_coords: Tuple[int, int, str], term_start: int, term_end: int) → List[Tuple[str, int, int]]

The whole phrase can be considered a definition (“MSRB”, “we”, “us” or “our”), but in fact the phrase can be a collection of definitions. Here we split the definition phrase into a list of definitions.

The source string could be pre-processed, which is why we search for each sub-phrase’s coordinates (PhrasePositionFinder).
Parameters:
  • term – a definition or, possibly, a set of definitions (“MSRB”, “we”, “us” or “our”)
  • src_with_coords – a sentence (probably) containing the term, plus its coords
  • term_start – “term” start coordinate within the source sentence
  • term_end – “term” end coordinate within the source sentence
Returns:[(definition, def_start, def_end), …]

lexnlp.extract.en.definition_parsing_methods.trim_defined_term(term: str, start: int, end: int) → Tuple[str, int, int, bool]

Remove a pair of quotes / brackets framing the text, replace runs of spaces with single spaces and replace line breaks with spaces.
Parameters:
  • term – a phrase that may contain excess framing symbols
  • start – original term’s start position, may be changed
  • end – original term’s end position, may be changed
Returns:updated term, start, end and a flag indicating that the whole phrase was inside quotes
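A simplified sketch of that trimming logic, assuming only a few bracket/quote pairs (the real function handles more pairs and edge cases; the name here is shortened to mark it as illustrative):

```python
import re

def trim_term(term: str, start: int, end: int):
    pairs = {'"': '"', '“': '”', '(': ')', '[': ']'}
    quoted = False
    # Strip one framing pair and shift the coordinates inward.
    if len(term) >= 2 and term[0] in pairs and term.endswith(pairs[term[0]]):
        quoted = term[0] in ('"', '“')
        term, start, end = term[1:-1], start + 1, end - 1
    # Replace line breaks with spaces and collapse space runs.
    term = re.sub(r'\s+', ' ', term).strip()
    return term, start, end, quoted
```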

lexnlp.extract.en.definitions module

lexnlp.extract.en.definitions.get_definition_annotations(text: str, decode_unicode=True, locator_type: lexnlp.extract.common.annotation_locator_type.AnnotationLocatorType = <AnnotationLocatorType.RegexpBased: 1>) → Generator[lexnlp.extract.common.annotations.definition_annotation.DefinitionAnnotation, None, None]
lexnlp.extract.en.definitions.get_definition_objects_list(text, decode_unicode=True) → List[lexnlp.extract.en.definition_parsing_methods.DefinitionCaught]
Parameters:
  • text – text to search for definitions
  • decode_unicode
Returns:

a list of found definitions - objects of class DefinitionCaught

lexnlp.extract.en.definitions.get_definitions(text: str, return_sources=False, decode_unicode=True, return_coords=False, locator_type: lexnlp.extract.common.annotation_locator_type.AnnotationLocatorType = <AnnotationLocatorType.RegexpBased: 1>) → Generator

Find possible definitions in natural language in text. The text is split into sentences first.
Parameters:
  • text – the input text
  • return_sources – return a tuple of the extracted term and the source sentence
  • decode_unicode
  • return_coords – return an (x, y) tuple in each record, where x is the definition’s text start and y is the definition’s text end
  • locator_type – use the default (regexp-based) or the ML-based locator
Returns:Generator[name] or Generator[name, text] or Generator[name, text, coords]

lexnlp.extract.en.definitions.get_definitions_explicit(text, decode_unicode=True, locator_type: lexnlp.extract.common.annotation_locator_type.AnnotationLocatorType = <AnnotationLocatorType.RegexpBased: 1>) → Generator
lexnlp.extract.en.definitions.get_definitions_in_sentence(sentence: str, return_sources=False, decode_unicode=True) → Generator
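The core regexp-based idea is a trigger pattern anchored on a quoted term. The single pattern below is a hypothetical reduction; the real parser uses a large set of templates (“shall mean”, “as defined in”, trailing parenthesized terms, …).

```python
import re

# Hypothetical single trigger: a quoted term followed by "means"/"shall mean".
DEFINITION_RE = re.compile(
    r'[“"](?P<term>[^”"]+)[”"]\s+(?:shall\s+)?means?\b', re.IGNORECASE)

def find_defined_terms(sentence: str):
    return [m.group('term') for m in DEFINITION_RE.finditer(sentence)]
```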

lexnlp.extract.en.dict_entities module

Universal extraction of entities for which we have full dictionaries of possible names and aliases from English text.

Example: Courts - we have the full dictionary of known courts with their names and aliases and are able to search the text for each possible court.

Geo entities - we have the full set of known geo entities and can search any text for their occurrences.

Search methods of this module require lists of possible entities with their ids, names and sets of aliases in different languages. To allow using these methods in Celery, and especially to allow building these configuration lists once and reusing them across multiple Celery tasks, they must be easy and fast to serialize. Celery uses JSON serialization by default starting from version 4 and does not allow serializing objects of custom classes out of the box. So we have to use either dicts or tuples to avoid requiring special configuration for Celery. Tuples are faster.

To avoid typos in development and to utilize type hints in the IDE, this module provides a few methods for operating on the tuples which represent entities and aliases. They accept named parameter lists and return tuples.
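The tuple shapes described above can be sketched as follows. The helper names and signatures here are simplified stand-ins for the documented entity_config()/entity_alias() functions, not their exact API.

```python
def make_alias(alias, language=None, is_abbreviation=False, alias_id=None):
    # Alias tuple: (alias_text, lang, is_abbreviation, alias_id)
    return (alias, language, is_abbreviation, alias_id)

def make_entity(entity_id, name, aliases=(), name_is_alias=True):
    # Entity tuple: (entity_id, name, [alias tuples])
    alias_tuples = [a if isinstance(a, tuple) else make_alias(a) for a in aliases]
    if name_is_alias:
        alias_tuples.append(make_alias(name))
    return (entity_id, name, alias_tuples)

mississippi = make_entity(1, 'Mississippi', aliases=[make_alias('MS', 'en', True)])
```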

class lexnlp.extract.en.dict_entities.DictionaryEntity(entity: Any, coords: Tuple[int, int])

Bases: object

class lexnlp.extract.en.dict_entities.SearchResultPosition

Bases: object

Represents a position in the normalized source text at which one or more entities have been detected. One or more entities sharing equal aliases can be detected at a single position in the text.

add_entity()
alias_text
end
entities_dict
get_entities_aliases()
overlaps(other: lexnlp.extract.en.dict_entities.SearchResultPosition)
start
lexnlp.extract.en.dict_entities.add_alias_to_entity()

Add an alias to an entity. Entities are in the form of tuples: (entity_id, name, [(alias_text, lang, is_abbrev, alias_id), …]). This method exists for more comfortable development, to ensure type safety and avoid accessing properties of entities by their indexes.
Parameters:
  • entity
  • alias
  • language
  • is_abbreviation
  • alias_id – alias id, or None if identifying is not supported

lexnlp.extract.en.dict_entities.add_aliases_to_entity(entity: Tuple[int, str, int, List[Tuple[str, str, bool]]], aliases_csv: str, language: str = None, is_abbreviation: bool = None, alias_id: int = None, csv_separator: str = ';')

Add aliases to an entity. Entities are in the form of tuples: (entity_id, name, [(alias_text, lang, is_abbrev, alias_id), …]). This method can be used if there is a separator-delimited list of aliases stored somewhere and they all share the same language and is_abbreviation value. This method exists for more comfortable development, to ensure type safety and avoid accessing properties of entities by their indexes.
Parameters:
  • entity
  • aliases_csv
  • language
  • is_abbreviation
  • alias_id
  • csv_separator

lexnlp.extract.en.dict_entities.alias_is_blacklisted(alias_black_list: Union[None, Dict[str, Tuple[List[str], List[str]]]], norm_alias: str, alias_lang: str, is_abbrev: bool) → bool
lexnlp.extract.en.dict_entities.conflicts_take_first_by_id()

Default conflict-resolving function: of all entities detected at the same position, drop all except the one having the smallest id. To be used in the find_dict_entities() method.
Parameters:conflicting_entities_aliases – list of (entity, alias) pairs

lexnlp.extract.en.dict_entities.conflicts_top_by_priority()

Default conflict-resolving function: of all entities detected at the same position, drop all except the one having the highest priority. To be used in the find_dict_entities() method.
Parameters:conflicting_entities_aliases – list of (entity, alias) pairs

lexnlp.extract.en.dict_entities.entity_alias(alias: str, language: str = None, is_abbreviation: bool = False, alias_id: int = None) → Tuple[str, str, bool, int, str]

Create an entity alias tuple. This method exists for ensuring type safety of alias components in the IDE.
Parameters:
  • alias – alias text: ‘Mississippi’, ‘MS’, ‘CAN’, …
  • language – language: en, de, fr, …
  • is_abbreviation – whether this alias represents an abbreviation. Abbreviations have different search rules.
  • alias_id – alias id; None if there is no id
Returns:A tuple representing the alias in the format (alias_text, lang, is_abbreviation, alias_id)

lexnlp.extract.en.dict_entities.entity_config()

Create an entity configuration for a possible entity with its id, name and aliases to search.
Parameters:
  • entity_id – unique identifier of the entity
  • name – human-readable name for displaying in the UI. Searches are made not for the name but for the possible aliases, each having its assigned language. The name may or may not be added to the list of search aliases.
  • priority – optional int priority value for the entity. Can be used for sorting; entities with a higher priority are selected first.
  • aliases – list of aliases to search for. Each alias can be either a string or an (alias, language, is_abbreviation, alias_id) tuple. For a string, a tuple with default values is created. The entity_alias() function can be used to create the alias tuple, ensuring the type safety of its components in the IDE.
  • name_is_alias – if True, add the entity name to the list of aliases with undefined language
Returns:A tuple representing the entity in the format (entity_id, name, [(alias, lang, is_abbrev, alias_id), …])

lexnlp.extract.en.dict_entities.find_dict_entities()

Find all entities defined in the ‘all_possible_entities’ list that appear in the source text. This method takes care of leaving only the longest matching search result when multiple entities have aliases, one being a substring of another. It takes the language of the text and aliases into account: if a language is specified both for the text and for the alias, the alias is used only if they match. It may detect multiple possibly matching entities at a position in the text, because there can be entities having the same aliases in the same language; to resolve such conflicts a special resolving function can be specified. It also takes care of time AM/PM components which can appear in the aliases of some entities: it tries to detect minutes/seconds/milliseconds before AM/PM and ignores them in such cases.

Algorithm of this method:
  1. Normalize the source text (we need lowercase and non-lowercase versions for abbreviation searches).
  2. Create a shared search context: a map of position -> (alias text + list of matching entities).
  3. For each possible entity, search using the shared context:
     3.1. For each alias of the entity:
          3.1.1. Iteratively search for all occurrences of the alias, taking its language and abbreviation status into account. For each found occurrence, check if another alias and entity have already been found at this position and leave only the one having the longest alias (“Something” vs “Something Bigger”). If a different entity has already been found at this position with a totally equal alias in the same language, then store both for this position in the text.
  4. Now we have a map filled with: position -> (alias text + list of entities having this alias). After sorting the items of this map by position we can get rid of overlapping longer and shorter aliases, one being a substring of another (“Bankr. E.D.N.Y.” vs “E.D.N.Y.”).
  5. For each position, check if it overlaps with the next one [position; position + len(alias)]. If they overlap, keep the longest alias and drop the shorter.

The main complexity of this algorithm is caused by the requirement to detect the longest match for each piece of text, while the longer match can start at an earlier position than the shorter match and there can be multiple aliases of different entities matching the same piece of text.

Another algorithm for this function could be based on the idea that an alternation (“or”) regexp returns the longest matching group. We could form regexps containing the possible aliases and apply them to the source text: r’alias1|alias2|longer alias2|…’

TODO Compare to other algorithms for time and memory complexity

Parameters:
  • text
  • all_possible_entities – list of dict or list of DictEntity - all possible entities to search for
  • min_alias_len – minimal length of alias/name to search for. Can be used to ignore aliases which are too short, like “M.”, while searching.
  • prepared_alias_black_list – prepared black list of aliases to exclude from the search. Can be used to ignore concrete aliases. Should be: dict of language -> tuple (list of normalized non-abbreviations, list of normalized abbreviations)
  • text_languages – if set, then only aliases of these languages will be searched for
  • conflict_resolving_func – a function for resolving conflicts when multiple entities are detected at the same position in the source text and their detected aliases are of the same length. The function takes a list of conflicting entities and should return a list of one or more entities to be returned.
  • use_stemmer – use the stemmer instead of the tokenizer. The stemmer converts words to their simple form (singular number, etc.) and works better for searches like “tables”, “developers”, … The tokenizer fits searches like “United States”, “Mississippi”, …
  • remove_time_am_pm – remove from the final results AM/PM abbreviations which look like the end of time strings: 11:45 am, 10:00 pm
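Step 5 of the algorithm above, dropping shorter aliases that overlap a longer one, can be sketched as a greedy sweep over position-sorted matches. The (start, alias_text, entity) shape and function name are illustrative.

```python
def drop_overlaps(matches):
    # matches: list of (start, alias_text, entity).
    # Sort by position; at equal positions, prefer the longest alias.
    ordered = sorted(matches, key=lambda m: (m[0], -len(m[1])))
    kept, last_end = [], -1
    for start, alias, entity in ordered:
        if start >= last_end:
            kept.append((start, alias, entity))
            last_end = start + len(alias)
        # Otherwise the match starts inside the previous, longer alias: drop it.
    return kept
```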

lexnlp.extract.en.dict_entities.get_alias_id(alias: Tuple[str, str, bool, int]) → int

Get the alias id from an alias tuple. This method is just for more comfortable development, to avoid accessing properties of aliases by their indexes.
Parameters:alias
Returns:alias id

lexnlp.extract.en.dict_entities.get_alias_text(alias: Tuple[str, str, bool, int]) → str

Get alias text from alias tuple. This method is just for more comfortable development - to avoid accessing properties of aliases by their indexes. :param alias: :return:

lexnlp.extract.en.dict_entities.get_entity_aliases()

Get aliases of the entity. This method is just for more comfortable development - to avoid accessing properties of entities by their indexes. :param entity: :return:

lexnlp.extract.en.dict_entities.get_entity_id()

Get id of the entity. This method is just for more comfortable development - to avoid accessing properties of entities by their indexes. :param entity: :return:

lexnlp.extract.en.dict_entities.get_entity_name()

Get name of the entity. This method is just for more comfortable development - to avoid accessing properties of entities by their indexes. :param entity: :return:

lexnlp.extract.en.dict_entities.get_entity_priority()

Get priority of the entity. This method is just for more comfortable development - to avoid accessing properties of entities by their indexes. :param entity: :return:

lexnlp.extract.en.dict_entities.normalize_text(text: str, spaces_on_start_end: bool = True, spaces_after_dots: bool = True, lowercase: bool = True, use_stemmer: bool = False) → str

Normalizes text for substring search operations: extracts tokens, joins them back with spaces, adds missing spaces after dots in abbreviations, etc. The overall aim of this method is to weaken the substring matching conditions by normalizing both the text and the substring being searched in the same way, removing superficial differences between them (case, punctuation, …).
Parameters:
  • text
  • spaces_on_start_end
  • spaces_after_dots
  • lowercase
  • use_stemmer – use the stemmer instead of the tokenizer. When using the stemmer, all words are converted to singular number (or to their plainest form) before matching. When using the tokenizer, the words are compared as is. The tokenizer should be enough for searching for entities which exist in a single instance in the real world: geo entities, courts, … The stemmer is required for searching for common objects: table, pen, developer, …
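The normalization steps described above can be sketched without the stemmer path. The function name is illustrative and this is not lexnlp's tokenizer-based implementation.

```python
import re

def simple_normalize(text: str, lowercase: bool = True) -> str:
    text = re.sub(r'\.(?=\S)', '. ', text)   # add missing spaces after dots
    text = ' '.join(text.split())            # tokenize and rejoin with single spaces
    if lowercase:
        text = text.lower()
    return f' {text} '                       # spaces on start/end for word-boundary search
```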

lexnlp.extract.en.dict_entities.prepare_alias_blacklist_dict(alias_blacklist: List[Tuple[str, str, bool]], use_stemmer: bool = False) → Union[None, Dict[str, Tuple[List[str], List[str]]]]

Prepare an alias black list for passing to the find_dict_entities() function.
Parameters:
  • alias_blacklist – non-normalized form of the blacklist: [(alias, lang, is_abbrev), …]
  • use_stemmer – use the stemmer for alias normalization; otherwise the tokenizer only

lexnlp.extract.en.distances module

Distance extraction for English.

This module implements basic distance extraction functionality in English.

lexnlp.extract.en.distances.get_distance_annotations(text: str, float_digits=4) → Generator[lexnlp.extract.common.annotations.distance_annotation.DistanceAnnotation, None, None]
lexnlp.extract.en.distances.get_distances(text: str, return_sources=False, float_digits=4) → Generator

lexnlp.extract.en.durations module

Duration extraction for English.

This module implements duration extraction functionality in English.

class lexnlp.extract.en.durations.EnDurationParser

Bases: lexnlp.extract.common.durations.durations_parser.DurationParser

DURATION_MAP = {'anniversaries': 365, 'anniversary': 365, 'annum': 365, 'day': 1, 'hour': 0.041666666666666664, 'minute': 0.0006944444444444445, 'month': 30, 'quarter': 91.25, 'second': 1.1574074074074073e-05, 'week': 7, 'year': 365}
DURATION_PTN = '\n ((\n(?:(?:(?:(?:(?:[\\.\\d][\\d\\.,]*\\s*|\\W|^)\n(?:(?:seventeenths|seventeenth|thirteenths|fourteenths|eighteenths|nineteenths|seventieths|thirteenth|fourteenth|eighteenth|nineteenth|seventieth|fifteenths|sixteenths|twentieths|thirtieths|eightieths|ninetieths|seventeen|fifteenth|sixteenth|twentieth|thirtieth|eightieth|ninetieth|elevenths|fortieths|fiftieths|sixtieths|thirteen|fourteen|eighteen|nineteen|eleventh|fortieth|fiftieth|sixtieth|sevenths|twelfths|fifteen|sixteen|seventy|seventh|twelfth|fourths|eighths|eleven|twelve|twenty|thirty|eighty|ninety|zeroth|second|fourth|eighth|thirds|fifths|sixths|ninths|tenths|three|seven|eight|forty|fifty|sixty|first|third|fifth|sixth|ninth|tenth|zero|four|five|nine|one|two|six|ten|thousandths|thousandth|thousand|trillion|million|billion|trill|bil|mm|k|m|b\n|hundred(?:th(?:s)?)?|dozen|and|a\\s+half|quarters?)[\\s-]*)+)\n(?:(?:no|\\d{1,2})/100)?)|(?<=\\W|^)(?:[\\.\\d][\\d\\.,/]*))(?:\\W|$))(?:\\s{0,2}[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒]+)*)\n (?:\\s*(?:calendar|business|actual))?[\\s-]*\n (second|minute|hour|day|week|month|quarter|year|annum|anniversary|anniversaries)s?)(?:\\W|$)\n '
DURATION_PTN_RE = regex.Regex('\n ((\n(?:(?:(?:(?:(?:[\\.\\d][\\d\\.,]*\\s*|\\W|^)\n(?:(?:seventeenths|seventeenth|thirteenths|fourteenths|eighteenths|nineteenths|seventieths|thirteenth|fourteenth|eighteenth|nineteenth|seventieth|fifteenths|sixteenths|twentieths|thirtieths|eightieths|ninetieths|seventeen|fifteenth|sixteenth|twentieth|thirtieth|eightieth|ninetieth|elevenths|fortieths|fiftieths|sixtieths|thirteen|fourteen|eighteen|nineteen|eleventh|fortieth|fiftieth|sixtieth|sevenths|twelfths|fifteen|sixteen|seventy|seventh|twelfth|fourths|eighths|eleven|twelve|twenty|thirty|eighty|ninety|zeroth|second|fourth|eighth|thirds|fifths|sixths|ninths|tenths|three|seven|eight|forty|fifty|sixty|first|third|fifth|sixth|ninth|tenth|zero|four|five|nine|one|two|six|ten|thousandths|thousandth|thousand|trillion|million|billion|trill|bil|mm|k|m|b\n|hundred(?:th(?:s)?)?|dozen|and|a\\s+half|quarters?)[\\s-]*)+)\n(?:(?:no|\\d{1,2})/100)?)|(?<=\\W|^)(?:[\\.\\d][\\d\\.,/]*))(?:\\W|$))(?:\\s{0,2}[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒]+)*)\n (?:\\s*(?:calendar|business|actual))?[\\s-]*\n (second|minute|hour|day|week|month|quarter|year|annum|anniversary|anniversaries)s?)(?:\\W|$)\n ', flags=regex.S | regex.I | regex.M | regex.X | regex.V0)
INNER_CONJUNCTIONS = ['and', 'plus']
INNER_PUNCTUATION = regex.Regex('[\\s\\,]', flags=regex.V0)
classmethod get_all_annotations(text: str, float_digits=4) → List[lexnlp.extract.common.annotations.duration_annotation.DurationAnnotation]
lexnlp.extract.en.durations.get_duration_annotations(text: str, float_digits=4) → Generator[lexnlp.extract.common.annotations.duration_annotation.DurationAnnotation, None, None]
lexnlp.extract.en.durations.get_duration_annotations_list(text: str, float_digits=4) → List[lexnlp.extract.common.annotations.duration_annotation.DurationAnnotation]
lexnlp.extract.en.durations.get_durations(text: str, return_sources=False, float_digits=4) → Generator
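The DURATION_MAP shown above expresses each unit in days, which makes normalizing a parsed (amount, unit) pair a single multiplication. A small sketch, assuming the unit string may carry a plural “s” (the helper name is hypothetical):

```python
# Units expressed in days, matching the scale of DURATION_MAP above.
DURATION_DAYS = {'second': 1 / 86400, 'minute': 1 / 1440, 'hour': 1 / 24,
                 'day': 1, 'week': 7, 'month': 30, 'quarter': 91.25,
                 'year': 365, 'annum': 365}

def duration_in_days(amount: float, unit: str) -> float:
    # Strip a plural "s" ("weeks" -> "week") and scale to days.
    return amount * DURATION_DAYS[unit.lower().rstrip('s')]
```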

lexnlp.extract.en.en_language_tokens module

class lexnlp.extract.en.en_language_tokens.EnLanguageTokens

Bases: object

abbreviations = {'A.D.', 'A.V.', 'Abbrev.', 'Abd.', 'Aberd.', 'Aberdeensh.', 'Abol.', 'Aborig.', 'Abp.', 'Abr.', 'Abridg.', 'Abridgem.', 'Absol.', 'Abst.', 'Abstr.', 'Acad.', 'Acc.', 'Accept.', 'Accomm.', 'Accompl.', 'Accs.', 'Acct.', 'Accts.', 'Achievem.', 'Add.', 'Addit.', 'Addr.', 'Adm.', 'Admin.', 'Admir.', 'Admon.', 'Admonit.', 'Adv.', 'Advancem.', 'Advert.', 'Advoc.', 'Advt.', 'Advts.', 'Aerodynam.', 'Aeronaut.', 'Aff.', 'Affect.', 'Afr.', 'Agric.', 'Alch.', 'Alg.', 'Alleg.', 'Allit.', 'Alm.', 'Alph.', 'Amer.', 'Anal.', 'Analyt.', 'Anat.', 'Anc.', 'Anecd.', 'Ang.', 'Angl.', 'Anglo-Ind.', 'Anim.', 'Ann.', 'Anniv.', 'Annot.', 'Anon.', 'Answ.', 'Ant.', 'Anthrop.', 'Anthropol.', 'Antiq.', 'Apoc.', 'Apol.', 'App.', 'Appl.', 'Applic.', 'Apr.', 'Arab.', 'Arb.', 'Arch.', 'Archaeol.', 'Archipel.', 'Archit.', 'Argt.', 'Arith.', 'Arithm.', 'Arrangem.', 'Artic.', 'Artific.', 'Artill.', 'Ashm.', 'Assemb.', 'Assoc.', 'Assoc. Football', 'Assyriol.', 'Astr.', 'Astrol.', 'Astron.', 'Astronaut.', 'Att.', 'Attrib.', 'Aug.', 'Austral.', 'Auth.', 'Autobiog.', 'Autobiogr.', 'Ayrsh.', 'B.C.', 'BNC', 'Bacteriol.', 'Bedford.', 'Bedfordsh.', 'Bel & Dr.', 'Belg.', 'Berks.', 'Berksh.', 'Berw.', 'Berwicksh.', 'Bibliogr.', 'Biochem.', 'Biog.', 'Biogr.', 'Biol.', 'Bk.', 'Bks.', 'Bord.', 'Bot.', 'Bp.', 'Braz.', 'Brit.', 'Bucks.', 'Build.', 'Bull.', 'Bur.', 'Cal.', 'Calc.', 'Calend.', 'Calif.', 'Calligr.', 'Camb.', 'Cambr.', 'Campanol.', 'Canad.', 'Canterb.', 'Capt.', 'Cartogr.', 'Catal.', 'Catech.', 'Cath.', 'Cent.', 'Ceram.', 'Cert.', 'Certif.', 'Ch.', 'Ch. Hist.', 'Chamb.', 'Char.', 'Charac.', 'Chas.', 'Chem.', 'Chem. Engin.', 'Chesh.', 'Chr.', 'Chron.', 'Chronol.', 'Chrons.', 'Cinematogr.', 'Circ.', 'Civ. Law', 'Civil Engin.', 'Cl.', 'Class.', 'Class. Antiq.', 'Classif.', 'Climatol.', 'Clin.', 'Col.', 'Coll.', 'Collect.', 'Colloq.', 'Coloss.', 'Com.', 'Comb.', 'Combs.', 'Comm.', 'Comm. Law', 'Commandm.', 'Commend.', 'Commerc.', 'Commiss.', 'Commonw.', 'Communic.', 'Comp.', 'Comp. 
Anat.', 'Compan.', 'Compar.', 'Compend.', 'Compl.', 'Compos.', 'Conc.', 'Conch.', 'Concl.', 'Conf.', 'Confid.', 'Confl.', 'Confut.', 'Congr.', 'Congreg.', 'Congress.', 'Conn.', 'Consc.', 'Consecr.', 'Consid.', 'Consol.', 'Constit.', 'Constit. Hist.', 'Constr.', 'Contemp.', 'Contempl.', 'Contend.', 'Content.', 'Contin.', 'Contradict.', 'Contrib.', 'Controv.', 'Conv.', 'Convent.', 'Conversat.', 'Convoc.', 'Cor.', 'Cornw.', 'Coron.', 'Corr.', 'Corresp.', 'Counc.', 'Courtsh.', 'Craniol.', 'Craniom.', 'Crim.', 'Crim. Law', 'Crit.', 'Crt.', 'Crts.', 'Cryptogr.', 'Crystallogr.', 'Ct.', 'Cumb.', 'Cumberld.', 'Cumbld.', 'Cycl.', 'Cytol.', 'D.C.', 'Dan.', 'Dau.', 'Deb.', 'Dec.', 'Declar.', 'Ded.', 'Def.', 'Deliv.', 'Demonstr.', 'Dep.', 'Depred.', 'Depredat.', 'Dept.', 'Derbysh.', 'Descr.', 'Deut.', 'Devel.', 'Devonsh.', 'Dial.', 'Dict.', 'Diffic.', 'Direct.', 'Dis.', 'Disc.', 'Discipl.', 'Discov.', 'Discrim.', 'Discuss.', 'Diss.', 'Dist.', 'Distemp.', 'Distill.', 'Distrib.', 'Div.', 'Divers.', 'Dk.', 'Doc.', 'Doctr.', 'Domest.', 'Durh.', 'E. Afr.', 'E. Angl.', 'E. Anglian', 'E. Ind.', 'E.D.D.', 'E.E.T.S.', 'East Ind.', 'Eccl.', 'Eccl. Hist.', 'Eccl. Law', 'Eccles.', 'Ecclus.', 'Ecol.', 'Econ.', 'Ed.', 'Edin.', 'Edinb.', 'Educ.', 'Edw.', 'Egypt.', 'Egyptol.', 'Electr.', 'Electr. Engin.', 'Electro-magn.', 'Electro-physiol.', 'Elem.', 'Eliz.', 'Elizab.', 'Emb.', 'Embryol.', 'Encycl.', 'Encycl. Brit.', 'Encycl. Metrop.', 'Eng.', 'Engin.', 'Englishw.', 'Enq.', 'Ent.', 'Enthus.', 'Entom.', 'Entomol.', 'Enzymol.', 'Ep.', 'Eph.', 'Ephes.', 'Epil.', 'Episc.', 'Epist.', 'Epit.', 'Equip.', 'Esd.', 'Ess.', 'Essent.', 'Establ.', 'Esth.', 'Ethnol.', 'Etymol.', 'Eval.', 'Evang.', 'Even.', 'Evid.', 'Evol.', 'Ex. 
doc.', 'Exalt.', 'Exam.', 'Exch.', 'Exec.', 'Exerc.', 'Exhib.', 'Exod.', 'Exped.', 'Exper.', 'Explan.', 'Explic.', 'Explor.', 'Expos.', 'Ezek.', 'Fab.', 'Fam.', 'Farew.', 'Feb.', 'Ff.', 'Fifesh.', 'Footpr.', 'Forfarsh.', 'Fortif.', 'Fortn.', 'Found.', 'Fr.', 'Fragm.', 'Fratern.', 'Friendsh.', 'Fund.', 'Furnit.', 'Gal.', 'Gard.', 'Gastron.', 'Gaz.', 'Gd.', 'Gen.', 'Geo.', 'Geog.', 'Geogr.', 'Geol.', 'Geom.', 'Geomorphol.', 'Ger.', 'Glac.', 'Glasg.', 'Glos.', 'Gloss.', 'Glouc.', 'Gloucestersh.', 'Gosp.', 'Gov.', 'Govt.', 'Gr.', 'Gram.', 'Gramm. Analysis', 'Gt.', 'Gynaecol.', 'Hab.', 'Haematol.', 'Hag.', 'Hampsh.', 'Handbk.', 'Hants.', 'Heb.', 'Hebr.', 'Hen.', 'Her.', 'Herb.', 'Heref.', 'Hereford.', 'Herefordsh.', 'Hertfordsh.', 'Hierogl.', 'Hist.', 'Histol.', 'Hom.', 'Horol.', 'Hort.', 'Hos.', 'Hosp.', 'Househ.', 'Housek.', 'Husb.', 'Hydraul.', 'Hydrol.', 'Ichth.', 'Icthyol.', 'Ideol.', 'Idol.', 'Illustr.', 'Imag.', 'Immunol.', 'Impr.', 'Inaug.', 'Inc.', 'Inclos.', 'Ind.', 'Industr.', 'Industr. Rel.', 'Infl.', 'Innoc.', 'Inorg.', 'Inq.', 'Inst.', 'Instr.', 'Intell.', 'Intellect.', 'Interc.', 'Interl.', 'Internat.', 'Interpr.', 'Intro.', 'Introd.', 'Inv.', 'Invent.', 'Invert. 
Zool.', 'Invertebr.', 'Investig.', 'Investm.', 'Invoc.', 'Ir.', 'Irel.', 'Isa.', 'Ital.', 'Jahrb.', 'Jam.', 'Jan.', 'Jap.', 'Jas.', 'Jer.', 'Josh.', 'Jrnl.', 'Jrnls.', 'Jud.', 'Judg.', 'Jul.', 'Jun.', 'Jurisd.', 'Jurisdict.', 'Jurispr.', 'Justif.', 'Justific.', 'Kent.', 'Kgs.', 'Kingd.', 'King’s Bench Div.', 'Knowl.', 'Kpr.', 'LXX', 'Lab.', 'Lam.', 'Lament', 'Lament.', 'Lanc.', 'Lancash.', 'Lancs.', 'Lang.', 'Langs.', 'Lat.', 'Ld.', 'Lds.', 'Lect.', 'Leechd.', 'Leg.', 'Leicest.', 'Leicester.', 'Leicestersh.', 'Leics.', 'Let.', 'Lett.', 'Lev.', 'Lex.', 'Libr.', 'Limnol.', 'Lincolnsh.', 'Lincs.', 'Ling.', 'Linn.', 'Lit.', 'Lithogr.', 'Lithol.', 'Liturg.', 'Lond.', 'MS.', 'MSS.', 'Macc.', 'Mach.', 'Mag.', 'Magn.', 'Mal.', 'Man.', 'Managem.', 'Manch.', 'Manip.', 'Manuf.', 'Mar.', 'Mass.', 'Math.', 'Matt.', 'Meas.', 'Measurem.', 'Mech.', 'Med.', 'Medit.', 'Mem.', 'Merc.', 'Merch.', 'Metall.', 'Metallif.', 'Metallogr.', 'Metamorph.', 'Metaph.', 'Meteorol.', 'Meth.', 'Metrop.', 'Mex.', 'Mic.', 'Mich.', 'Microbiol.', 'Microsc.', 'Mil.', 'Milit.', 'Min.', 'Mineral.', 'Misc.', 'Miscell.', 'Mod.', 'Monum.', 'Morphol.', 'Mt.', 'Mtg.', 'Mts.', 'Munic.', 'Munif.', 'Munim.', 'Mus.', 'Myst.', 'Myth.', 'Mythol.', 'N. Afr.', 'N. Amer.', 'N. Carolina', 'N. Dakota', 'N. Ir.', 'N. Irel.', 'N.E.', 'N.E.D.', 'N.S. Wales', 'N.S.W.', 'N.T.', 'N.W.', 'N.Y.', 'N.Z.', 'Nah.', 'Narr.', 'Narrat.', 'Nat.', 'Nat. Hist.', 'Nat. Philos.', 'Nat. Sci.', 'Naut.', 'Nav.', 'Navig.', 'Neh.', 'Neighb.', 'Nerv.', 'Neurol.', 'Neurosurg.', 'New Hampsh.', 'Newc.', 'Newspr.', 'No.', 'Non-conf.', 'Nonconf.', 'Norf.', 'Northamptonsh.', 'Northants.', 'Northumb.', 'Northumbld.', 'Northumbr.', 'Norw.', 'Norweg.', 'Notts.', 'Nov.', 'Nucl.', 'Num.', 'Numism.', 'O.E.D.', 'O.T.', 'OE', 'Obad.', 'Obed.', 'Obj.', 'Obs.', 'Observ.', 'Obstet.', 'Obstetr.', 'Obstetr. 
Med.', 'Occas.', 'Occup.', 'Occurr.', 'Oceanogr.', 'Oct.', 'Off.', 'Offic.', 'Okla.', 'Ont.', 'Ophthalm.', 'Ophthalmol.', 'Oppress.', 'Opt.', 'Orac.', 'Ord.', 'Org.', 'Org. Chem.', 'Organ. Chem.', 'Orig.', 'Orkn.', 'Ornith.', 'Ornithol.', 'Orthogr.', 'Outl.', 'Oxf.', 'Oxfordsh.', 'Oxon.', 'P. R.', 'Pa.', 'Palaeobot.', 'Palaeogr.', 'Palaeont.', 'Palaeontol.', 'Paraphr.', 'Parasitol.', 'Parl.', 'Parnass.', 'Path.', 'Pathol.', 'Peculat.', 'Penins.', 'Perf.', 'Periodontol.', 'Pers.', 'Persec.', 'Perthsh.', 'Pet.', 'Petrogr.', 'Petrol.', 'Pharm.', 'Pharmaceut.', 'Pharmacol.', 'Phil.', 'Philad.', 'Philem.', 'Philipp.', 'Philol.', 'Philos.', 'Phoen.', 'Phonol.', 'Photog.', 'Photogr.', 'Phrenol.', 'Phys.', 'Physical Chem.', 'Physical Geogr.', 'Physiogr.', 'Physiol.', 'Pict.', 'Poet.', 'Pol.', 'Pol. Econ.', 'Polit.', 'Polytechn.', 'Pop.', 'Porc.', 'Port.', 'Posth.', 'Postm.', 'Pott.', 'Pract.', 'Predict.', 'Pref.', 'Preh.', 'Prehist.', 'Prerog.', 'Pres.', 'Presb.', 'Preserv.', 'Prim.', 'Princ.', 'Print.', 'Probab.', 'Probl.', 'Proc.', 'Prod.', 'Prol.', 'Pronunc.', 'Prop.', 'Pros.', 'Prov.', 'Provid.', 'Provinc.', 'Provis.', 'Ps.', 'Psych.', 'Psychoanal.', 'Psychoanalyt.', 'Psychol.', 'Psychopathol.', 'Pt.', 'Publ.', 'Purg.', 'Q. Eliz.', 'Qld.', 'Quantum Mech.', 'Queen’s Bench Div.', 'R.A.F.', 'R.C.', 'R.C. Church', 'R.N.', 'Radiol.', 'Reas.', 'Reb.', 'Rebell.', 'Rec.', 'Reclam.', 'Recoll.', 'Redempt.', 'Ref.', 'Refl.', 'Refus.', 'Refut.', 'Reg.', 'Regic.', 'Regist.', 'Regr.', 'Rel.', 'Relig.', 'Reminisc.', 'Remonstr.', 'Renfrewsh.', 'Rep.', 'Reprod.', 'Rept.', 'Repub.', 'Res.', 'Resid.', 'Ret.', 'Retrosp.', 'Rev.', 'Revol.', 'Rhet.', 'Rhode Isl.', 'Rich.', 'Rom.', 'Rom. Antiq.', 'Ross-sh.', 'Roxb.', 'Roy.', 'Rudim.', 'Russ.', 'S. Afr.', 'S. Carolina', 'S. Dakota', 'S.E.', 'S.T.S.', 'S.W.', 'SS.', 'Sam.', 'Sask.', 'Sat.', 'Sax.', 'Sc.', 'Scand.', 'Sch.', 'Sci.', 'Scot.', 'Scotl.', 'Script.', 'Sculpt.', 'Seismol.', 'Sel.', 'Sel. 
comm.', 'Select.', 'Sept.', 'Ser.', 'Serm.', 'Sess.', 'Settlem.', 'Sev.', 'Shakes.', 'Shaks.', 'Sheph.', 'Shetl.', 'Shropsh.', 'Soc.', 'Sociol.', 'Som.', 'Song Sol.', 'Song of Sol.', 'Sonn.', 'Span.', 'Spec.', 'Specif.', 'Specim.', 'Spectrosc.', 'St.', 'Staff.', 'Stafford.', 'Staffordsh.', 'Staffs.', 'Stand.', 'Stat.', 'Statist.', 'Stock Exch.', 'Stratigr.', 'Struct.', 'Stud.', 'Subj.', 'Subscr.', 'Subscript.', 'Suff.', 'Suppl.', 'Supplic.', 'Suppress.', 'Surg.', 'Surv.', 'Sus.', 'Symmetr.', 'Symp.', 'Syst.', 'Taxon.', 'Techn.', 'Technol.', 'Tel.', 'Telecomm.', 'Telegr.', 'Teleph.', 'Teratol.', 'Terminol.', 'Terrestr.', 'Test.', 'Textbk.', 'Theat.', 'Theatr.', 'Theol.', 'Theoret.', 'Thermonucl.', 'Thes.', 'Thess.', 'Tim.', 'Tit.', 'Topogr.', 'Trad.', 'Trag.', 'Trans.', 'Transl.', 'Transubstant.', 'Trav.', 'Treas.', 'Treat.', 'Treatm.', 'Trib.', 'Trig.', 'Trigonom.', 'Trop.', 'Troub.', 'Troubl.', 'Typog.', 'Typogr.', 'U.K.', 'U.S.', 'U.S.A.F.', 'U.S.S.R.', 'Univ.', 'Unnat.', 'Unoffic.', 'Urin.', 'Utilit.', 'Va.', 'Vac.', 'Valedict.', 'Veg.', 'Veg. Phys.', 'Veg. Physiol.', 'Venet.', 'Vertebr.', 'Vet.', 'Vet. Med.', 'Vet. Path.', 'Vet. Sci.', 'Vet. Surg.', 'Vic.', 'Vict.', 'Vind.', 'Vindic.', 'Virg.', 'Virol.', 'Voc.', 'Vocab.', 'Vol.', 'Vols.', 'Voy.', 'Vulg.', 'W. Afr.', 'W. Ind.', 'W. Indies', 'W. Va.', 'Warwicksh.', 'Wd.', 'Westm.', 'Westmld.', 'Westmorld.', 'Westmrld.', 'Will.', 'Wilts.', 'Wiltsh.', 'Wis.', 'Wisd.', 'Wk.', 'Wkly.', 'Wks.', 'Wonderf.', 'Worc.', 'Worcestersh.', 'Worcs.', 'Writ.', 'Yearbk.', 'Yng.', 'Yorks.', 'Yorksh.', 'Yr.', 'Yrs.', 'Zech.', 'Zeitschr.', 'Zeph.', 'Zoogeogr.', 'Zool.', 'abbrev.', 'abl.', 'abs.', 'absol.', 'abstr.', 'acc.', 'accus.', 'act.', 'ad.', 'adj.', 'adj. phr.', 'adjs.', 'adv.', 'advb.', 'advs.', 'agst.', 'alt.', 'aphet.', 'app.', 'appos.', 'arch.', 'art.', 'attrib.', 'bef.', 'betw.', 'cent.', 'cf.', 'cl.', 'cogn. w.', 'collect.', 'colloq.', 'comb. 
form', 'comp.', 'compar.', 'compl.', 'conc.', 'concr.', 'conj.', 'cons.', 'const.', 'contempt.', 'contr.', 'corresp.', 'cpd.', 'dat.', 'def.', 'dem.', 'deriv.', 'derog.', 'dial.', 'dim.', 'dyslog.', 'e. midl.', 'eOE', 'east.', 'ed.', 'ellipt.', 'emph.', 'erron.', 'esp.', 'etym.', 'etymol.', 'euphem.', 'exc.', 'fam.', 'famil.', 'fem.', 'fig.', 'fl.', 'freq.', 'fut.', 'gen.', 'gerund.', 'hist.', 'imit.', 'imp.', 'imperf.', 'impers.', 'impf.', 'improp.', 'inc.', 'ind.', 'indef.', 'indic.', 'indir.', 'infin.', 'infl.', 'instr.', 'int.', 'interj.', 'interrog.', 'intr.', 'intrans.', 'iron.', 'irreg.', 'joc.', 'lOE', 'lit.', 'll.', 'masc.', 'med.', 'metaphor.', 'metr. gr.', 'midl.', 'mispr.', 'mod.', 'n.e.', 'n.w.', 'no.', 'nom.', 'nonce-wd.', 'north.', 'nr.', 'ns.', 'obj.', 'obl.', 'obs.', 'occas.', 'opp.', 'orig.', 'p.', 'pa.', 'pa. pple.', 'pa. t.', 'pass.', 'perf.', 'perh.', 'pers.', 'personif.', 'pf.', 'phonet.', 'phr.', 'pl.', 'plur.', 'poet.', 'pop.', 'poss.', 'ppl.', 'ppl. a.', 'ppl. adj.', 'ppl. adjs.', 'pple.', 'pples.', 'pr.', 'pr. pple.', 'prec.', 'pred.', 'predic.', 'pref.', 'prep.', 'pres.', 'pres. pple.', 'priv.', 'prob.', 'pron.', 'pronunc.', 'prop.', 'propr.', 'prov.', 'pseudo-Sc.', 'pseudo-arch.', 'pseudo-dial.', 'q.v.', 'quot.', 'quots.', 'redupl.', 'refash.', 'refl.', 'reg.', 'rel.', 'repr.', 'rhet.', 's.e.', 's.v.', 's.w.', 'sc.', 'sing.', 'south.', 'sp.', 'spec.', 'str.', 'subj.', 'subjunct.', 'subord.', 'subord. cl.', 'subseq.', 'subst.', 'suff.', 'superl.', 'syll.', 'techn.', 'tr.', 'trans.', 'transf.', 'transl.', 'ult.', 'unkn.', 'unstr.', 'usu.', 'v.r.', 'v.rr.', 'var.', 'varr.', 'vars.', 'vb.', 'vbl.', 'vbl. ns.', 'vbl.n.', 'vbs.', 'viz.', 'vulg.', 'wd.', 'west.', 'wk.'}
articles = ['a', 'the', 'an']
conjunctions = ['for', 'and', 'nor', 'but', 'or', 'yet', 'so']
static init()
pronouns = {'I', 'all', 'another', 'any', 'anybody', 'anyone', 'anything', 'both', 'each', 'each other', 'either', 'enough', 'everybody', 'everyone', 'everything', 'few', 'he', 'her', 'hers', 'herself', 'him', 'himself', 'his', 'i', 'it', 'itself', 'little', 'many', 'me', 'mine', 'more', 'most', 'much', 'myself', 'neither', 'no one', 'nobody', 'none', 'nothing', 'one', 'one another', 'other', 'others', 'ours', 'ourselves', 'several', 'she', 'some', 'somebody', 'someone', 'something', 'such', 'that', 'theirs', 'them', 'themselves', 'these', 'they', 'this', 'those', 'us', 'we', 'what', 'whatever', 'which', 'whichever', 'who', 'whoever', 'whom', 'whomever', 'whose', 'you', 'yours', 'yourself'}
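
The attributes above are plain Python sets and lists of English function words. A minimal illustration of how word lists like these can be used to filter tokens, using small illustrative subsets rather than the full sets:

```python
# Illustrative subsets of the word lists above (not the full sets).
ARTICLES = {'a', 'an', 'the'}
CONJUNCTIONS = {'for', 'and', 'nor', 'but', 'or', 'yet', 'so'}
PRONOUNS = {'he', 'she', 'it', 'they', 'we', 'you', 'i'}

def content_tokens(tokens):
    """Drop articles, conjunctions and pronouns, keeping content words."""
    stop = ARTICLES | CONJUNCTIONS | PRONOUNS
    return [t for t in tokens if t.lower() not in stop]

print(content_tokens(['The', 'court', 'and', 'the', 'jury']))
# ['court', 'jury']
```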

lexnlp.extract.en.geoentities module

Geo Entity extraction for English.

This module implements extraction functionality for geo entities in English, including formal names, abbreviations, and aliases.

lexnlp.extract.en.geoentities.get_geoentities(text: str, geo_config_list: List[Tuple[int, str, List[Tuple[str, str, bool, int]]]], priority: bool = False, priority_by_id: bool = False, text_languages: List[str] = None, min_alias_len: int = 2, prepared_alias_black_list: Union[None, Dict[str, Tuple[List[str], List[str]]]] = None) → Generator

Searches the text for geo entities from the provided config list and yields pairs of (entity, alias). Entity is: (entity_id, name, [list of aliases]). Alias is: (alias_text, lang, is_abbrev, alias_id).

This method uses the general dictionary-entity search routines from the dict_entities.py module. Methods of dict_entities can be used to conveniently build the config: entity_config(), entity_alias(), add_aliases_to_entity().

Parameters:
  • text – Source text to search.
  • geo_config_list – List of all possible known geo entities in the form of tuples (id, name, [(alias, lang, is_abbrev, alias_id), …]).
  • priority – If two entities are found with exactly equal matching aliases, use the one with the greater priority field.
  • priority_by_id – If two entities are found with exactly equal matching aliases, use the one with the lowest id.
  • text_languages – Language(s) of the source text. If specified, only aliases in these languages are searched for. For example, this allows ignoring “Island”, the German-language alias of Iceland, in English texts.
  • min_alias_len – Minimal length of geo entity aliases to search for.
  • prepared_alias_black_list – Aliases to exclude from the search, in the form: dict of lang -> (list of normalized non-abbreviation aliases, list of normalized abbreviation aliases). Use dict_entities.prepare_alias_blacklist_dict() to prepare this dict.
Returns:

Generates tuples: (entity, alias)

lexnlp.extract.en.geoentities.get_geoentity_annotations(text: str, geo_config_list: List[Tuple[int, str, List[Tuple[str, str, bool, int]]]], priority: bool = False, priority_by_id: bool = False, text_languages: List[str] = None, min_alias_len: int = 2, prepared_alias_black_list: Union[None, Dict[str, Tuple[List[str], List[str]]]] = None) → Generator[[lexnlp.extract.common.annotations.geo_annotation.GeoAnnotation, None], None]

See get_geoentities

lexnlp.extract.en.geoentities.load_entities_dict_by_path(entities_fn: str, aliases_fn: str)
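
The dictionary search described for get_geoentities can be sketched in pure Python. This is an illustrative simplification, not the dict_entities.py implementation; only the config and alias tuple shapes follow the parameter description above:

```python
import re

def find_geoentities(text, geo_config_list, text_languages=None, min_alias_len=2):
    """Yield (entity, alias) pairs for every alias found in the text.

    geo_config_list: [(entity_id, name, [(alias, lang, is_abbrev, alias_id), ...]), ...]
    """
    for entity in geo_config_list:
        _entity_id, _name, aliases = entity
        for alias in aliases:
            alias_text, lang, is_abbrev, _alias_id = alias
            if len(alias_text) < min_alias_len:
                continue
            if text_languages and lang not in text_languages:
                continue
            # Abbreviations match case-sensitively, full names case-insensitively.
            flags = 0 if is_abbrev else re.IGNORECASE
            if re.search(r'\b' + re.escape(alias_text) + r'\b', text, flags):
                yield entity, alias

config = [(1, 'Iceland', [('Iceland', 'en', False, 10),
                          ('Island', 'de', False, 11)])]
matches = list(find_geoentities('We flew to Iceland.', config, text_languages=['en']))
```

With text_languages=['en'], the German alias “Island” is never searched for, which is exactly the Iceland/Island case mentioned above.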

lexnlp.extract.en.introductory_words_detector module

class lexnlp.extract.en.introductory_words_detector.IntroductoryWordsDetector

Bases: object

INTRODUCTORY_POS = [[('RB', {'so', 'also'}), ('VBN', {'called', 'named', 'known'})], [('RB', {'so', 'also'}), ('JJ', {'called', 'named', 'known'})], [('VBN', {'called', 'named', 'known'})]]
INTRO_ADVERBS = {'also', 'so'}
INTRO_VERBS = {'called', 'known', 'named'}
MAX_INTRO_LEN = 2
PUNCTUATION_POS = {'\t', '!', '"', '$', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', ':', ';', '?', '@', '[', '\\', ']', '^', '``', '{', '}'}
static remove_term_introduction(term: str, term_pos: List[Tuple[str, str, int, int]]) → str

so called “champerty” => “champerty” :param term: source phrase :param term_pos: source phrase POS tags :return: term with the introductory words removed
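
The class constants above suggest the following simplified sketch of remove_term_introduction. The (token, POS tag, start, end) tuple shape for term_pos is an assumption based on the signature; this is illustrative, not the lexnlp implementation:

```python
INTRO_ADVERBS = {'also', 'so'}
INTRO_VERBS = {'called', 'known', 'named'}
MAX_INTRO_LEN = 2

def remove_term_introduction(term, term_pos):
    """Strip a leading introduction like 'so called' from a term.

    term_pos: (token, POS tag, start, end) tuples (assumed shape).
    """
    intro_end = 0
    for i, (token, _pos, _start, end) in enumerate(term_pos[:MAX_INTRO_LEN + 1]):
        low = token.lower()
        if low in INTRO_ADVERBS and i == 0:
            intro_end = end          # "so", "also" may open the introduction
        elif low in INTRO_VERBS:
            intro_end = end          # "called", "known", "named" close it
            break
        else:
            break
    return term[intro_end:].lstrip() if intro_end else term

pos = [('so', 'RB', 0, 2), ('called', 'VBN', 3, 9), ('champerty', 'NN', 10, 19)]
print(remove_term_introduction('so called champerty', pos))
# champerty
```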

lexnlp.extract.en.money module

Money extraction for English.

This module implements basic money extraction functionality in English.

Todo:
  • Improved unit tests and case coverage
lexnlp.extract.en.money.get_money(text: str, return_sources=False, float_digits=4) → Generator
lexnlp.extract.en.money.get_money_annotations(text: str, float_digits=4) → Generator[[lexnlp.extract.common.annotations.money_annotation.MoneyAnnotation, None], None]
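
As an illustration of the shape of data these functions yield, here is a minimal regex-based sketch. The symbol map, pattern, and rounding below are simplified assumptions, not the lexnlp implementation (which also handles written amounts and many more currencies):

```python
import re

# Illustrative currency-symbol map; the real module supports more currencies.
CURRENCIES = {'$': 'USD', '€': 'EUR', '£': 'GBP'}

MONEY_RE = re.compile(r'([$€£])\s*([\d,]+(?:\.\d+)?)')

def get_money(text, float_digits=4):
    """Yield (amount, currency) pairs; a simplified sketch of the API above."""
    for symbol, number in MONEY_RE.findall(text):
        amount = round(float(number.replace(',', '')), float_digits)
        yield amount, CURRENCIES[symbol]

print(list(get_money('The fee is $25,400.50 plus €3.1415926 tax.')))
# [(25400.5, 'USD'), (3.1416, 'EUR')]
```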

lexnlp.extract.en.percents module

Percent extraction for English.

This module implements percent extraction functionality in English.

Todo:

lexnlp.extract.en.percents.get_percent_annotations(text: str, float_digits=4) → Generator[[lexnlp.extract.common.annotations.percent_annotation.PercentAnnotation, None], None]

Get percent usages within text.

lexnlp.extract.en.percents.get_percents(text: str, return_sources=False, float_digits=4) → Generator

Get percent usages within the text. :param text: source text :param return_sources: return the source text matched for each percent :param float_digits: round floats to N digits, don’t round if None :return: generator of percents
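
A minimal sketch of percent extraction; the pattern, the (unit, amount, fraction) output shape, and the unit-fraction table are illustrative assumptions, not the lexnlp implementation:

```python
import re

PERCENT_RE = re.compile(r'(\d+(?:\.\d+)?)\s*(%|percent|basis points?)', re.IGNORECASE)

# Fraction of one unit, e.g. 1% == 0.01, 1 basis point == 0.0001.
UNIT_FRACTIONS = {'%': 0.01, 'percent': 0.01,
                  'basis point': 0.0001, 'basis points': 0.0001}

def get_percents(text, float_digits=4):
    """Yield (unit, amount, fraction) triples; a simplified sketch."""
    for number, unit in PERCENT_RE.findall(text):
        unit = unit.lower()
        amount = float(number)
        yield unit, amount, round(amount * UNIT_FRACTIONS[unit], float_digits)

print(list(get_percents('a margin of 2.5% and 30 basis points')))
# [('%', 2.5, 0.025), ('basis points', 30.0, 0.003)]
```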

lexnlp.extract.en.pii module

PII extraction for English.

This module implements PII extraction functionality in English.

Todo:
lexnlp.extract.en.pii.get_pii(text: str, return_sources=False) → Generator

Find possible PII references in the text. :param text: source text :param return_sources: return the source text matched for each PII reference :return: generator of PII matches

lexnlp.extract.en.pii.get_pii_annotations(text: str) → Generator[[lexnlp.extract.common.annotations.text_annotation.TextAnnotation, None], None]

Find possible PII references in the text.

lexnlp.extract.en.pii.get_ssn_annotations(text: str) → Generator[[lexnlp.extract.common.annotations.ssn_annotation.SsnAnnotation, None], None]
lexnlp.extract.en.pii.get_ssns(text, return_sources=False) → Generator

Find possible SSN references in the text.

lexnlp.extract.en.pii.get_us_phone_annotations(text: str) → Generator[[lexnlp.extract.common.annotations.phone_annotation.PhoneAnnotation, None], None]

Find possible telephone numbers in the text.

lexnlp.extract.en.pii.get_us_phones(text: str, return_sources=False) → Generator

Find possible telephone numbers in the text.
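
The SSN and US phone extractors can be illustrated with simple regular expressions. The patterns and the phone-number normalization below are illustrative assumptions, not the lexnlp patterns:

```python
import re

SSN_RE = re.compile(r'\b(\d{3})-(\d{2})-(\d{4})\b')
US_PHONE_RE = re.compile(r'\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})\b')

def get_ssns(text):
    """Yield SSN-shaped strings found in the text."""
    for match in SSN_RE.finditer(text):
        yield match.group(0)

def get_us_phones(text):
    """Yield US phone numbers normalized to (XXX) XXX-XXXX form."""
    for area, exchange, line in US_PHONE_RE.findall(text):
        yield '({}) {}-{}'.format(area, exchange, line)

text = 'SSN 123-45-6789, call (212) 555-0123.'
print(list(get_ssns(text)), list(get_us_phones(text)))
```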

lexnlp.extract.en.ratios module

Ratio extraction for English.

This module implements ratio extraction functionality in English.

Todo:
  • Improved unit tests and case coverage
lexnlp.extract.en.ratios.get_ratio_annotations(text: str, float_digits=4) → Generator[[lexnlp.extract.common.annotations.ratio_annotation.RatioAnnotation, None], None]
lexnlp.extract.en.ratios.get_ratios(text: str, return_sources=False, float_digits=4) → Generator
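
A minimal sketch of ratio extraction for forms like “3:1” and “3 to 1”; the pattern and the (left, right, ratio) output shape are illustrative assumptions, not the lexnlp implementation:

```python
import re

RATIO_RE = re.compile(r'(\d+(?:\.\d+)?)\s*(?::|to)\s*(\d+(?:\.\d+)?)', re.IGNORECASE)

def get_ratios(text, float_digits=4):
    """Yield (left, right, ratio) triples, e.g. '3:1' -> (3.0, 1.0, 3.0)."""
    for left, right in RATIO_RE.findall(text):
        left, right = float(left), float(right)
        if right == 0:
            continue  # skip degenerate ratios
        yield left, right, round(left / right, float_digits)

print(list(get_ratios('a debt-to-equity ratio of 3 to 2')))
# [(3.0, 2.0, 1.5)]
```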

lexnlp.extract.en.regulations module

Regulation extraction for English.

This module implements regulation extraction functionality in English.

Todo:
  • Improved unit tests and case coverage
lexnlp.extract.en.regulations.get_regulation_annotations(text: str) → Generator[[lexnlp.extract.common.annotations.regulation_annotation.RegulationAnnotation, None], None]

Get regulation annotations. :param text: source text :return: generator of RegulationAnnotation

lexnlp.extract.en.regulations.get_regulations(text, return_source=False, as_dict=False) → Generator

Get regulations. :param text: source text :param return_source: return the source text matched for each regulation :param as_dict: return dicts instead of tuples :return: tuple or dict (regulation type, regulation code[, source text])
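
A simplified sketch of yielding (regulation type, citation) tuples for two common US citation forms; the patterns below are illustrative assumptions, not lexnlp’s:

```python
import re

# Illustrative patterns for two common US regulation citation forms.
REGULATION_RES = [
    (re.compile(r'\b\d+\s+CFR\s+[\d.]+\b'), 'Code of Federal Regulations'),
    (re.compile(r'\b\d+\s+U\.?S\.?C\.?\s+§?\s*[\d.]+\b'), 'United States Code'),
]

def get_regulations(text):
    """Yield (regulation type, citation text) tuples; a simplified sketch."""
    for pattern, reg_type in REGULATION_RES:
        for match in pattern.finditer(text):
            yield reg_type, match.group(0)

print(list(get_regulations('as required by 31 CFR 103.121')))
# [('Code of Federal Regulations', '31 CFR 103.121')]
```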

lexnlp.extract.en.trademarks module

Trademark extraction for English using NLTK and its pre-trained maximum entropy classifier.

This module implements basic trademark extraction functionality in English, relying on pre-trained NLTK functionality, including the POS tagger and (fuzzy) NE chunkers.

Todo: -

lexnlp.extract.en.trademarks.get_trademark_annotations(text: str) → Generator[[lexnlp.extract.common.annotations.trademark_annotation.TrademarkAnnotation, None], None]

Find trademarks in text.

lexnlp.extract.en.trademarks.get_trademarks(text: str) → Generator[[str, None], None]

Find trademarks in text.
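
As an illustration of the simplest case this module covers, terms explicitly flagged with ™, ®, (TM) or (R) can be captured with one regex. This sketch is illustrative only; the module itself relies on NLTK rather than this pattern:

```python
import re

# Capitalized name(s) immediately followed by a trademark mark.
TRADEMARK_RE = re.compile(r'([A-Z][\w-]*(?:\s+[A-Z][\w-]*)*)\s*(?:™|®|\(TM\)|\(R\))')

def get_trademarks(text):
    """Yield explicitly marked trademark names; a simplified sketch."""
    for match in TRADEMARK_RE.finditer(text):
        yield match.group(1)

print(list(get_trademarks('Powered by Acme Widgets™ and Gadget(R) parts.')))
# ['Acme Widgets', 'Gadget']
```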

lexnlp.extract.en.urls module

URL extraction for English.

This module implements basic URL extraction functionality in English.

Todo: -

lexnlp.extract.en.urls.get_url_annotations(text: str) → Generator[[lexnlp.extract.common.annotations.url_annotation.UrlAnnotation, None], None]

Find URLs in the text.

lexnlp.extract.en.urls.get_urls(text: str) → Generator[[str, None], None]

Find URLs in the text.
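
A minimal sketch of URL extraction with a single regex; the pattern and trailing-punctuation cleanup are illustrative assumptions, not the lexnlp implementation:

```python
import re

URL_RE = re.compile(r'\bhttps?://[^\s<>"\']+', re.IGNORECASE)

def get_urls(text):
    """Yield URLs found in the text; a simplified sketch of the API above."""
    for match in URL_RE.finditer(text):
        yield match.group(0).rstrip('.,;)')  # drop trailing sentence punctuation

print(list(get_urls('See https://lexpredict.com/docs, then log in.')))
# ['https://lexpredict.com/docs']
```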

lexnlp.extract.en.utils module

Extraction utilities for English.

class lexnlp.extract.en.utils.NPExtractor(grammar=None)

Bases: object

cleanup_leaves(leaves)
exception_pos = ['IN', 'CC']
exception_sym = ['&', 'and', 'of']
get_np(text: str) → Generator[[str, None], None]
get_np_with_coords(text: str) → List[Tuple[str, int, int]]
get_tokenizer()
join(np_items)
sep(n, current_pos, last_pos)
static strip_np(np)
sym_with_space = ['(', '&']
sym_without_space = ['!', '"', '#', '$', '%', "'", ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', "'s"]
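
The sym_with_space / sym_without_space lists above control how NP tokens are re-joined into a phrase. A minimal sketch of that joining logic, using illustrative subsets of the lists (not NPExtractor.join itself):

```python
# Illustrative subsets of the symbol lists above.
SYM_WITH_SPACE = {'(', '&'}
SYM_WITHOUT_SPACE = {',', '.', ')', "'s", ':', ';'}

def join_tokens(tokens):
    """Re-join tokens, spacing them the way the symbol lists suggest."""
    out = ''
    for token in tokens:
        if token in SYM_WITHOUT_SPACE:
            out += token          # attach directly: "Inc." not "Inc ."
        elif token in SYM_WITH_SPACE or not out:
            out = (out + ' ' + token).strip()
        else:
            out += ' ' + token
    return out

print(join_tokens(['Acme', 'Corp', ',', 'Inc', '.']))
# Acme Corp, Inc.
```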
lexnlp.extract.en.utils.strip_unicode_punctuation(text, valid_punctuation=None)

This method strips all unicode punctuation that is not whitelisted. :param text: text to strip :param valid_punctuation: valid punctuation to whitelist :return: text with all non-whitelisted punctuation removed
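
A sketch of this behavior using unicodedata, where Unicode general-category codes starting with “P” mark punctuation. This is an illustrative re-implementation of the described semantics, not necessarily the lexnlp code:

```python
import unicodedata

def strip_unicode_punctuation(text, valid_punctuation=None):
    """Strip all Unicode punctuation that is not whitelisted."""
    valid = set(valid_punctuation or ())
    return ''.join(
        ch for ch in text
        if ch in valid or not unicodedata.category(ch).startswith('P')
    )

print(strip_unicode_punctuation('“smart” quotes, plain-text', valid_punctuation='-'))
# smart quotes plain-text
```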

Module contents