lexnlp.extract.en package
Subpackages
- lexnlp.extract.en.addresses package
- lexnlp.extract.en.contracts package
- lexnlp.extract.en.entities package
- lexnlp.extract.en.preprocessing package
- lexnlp.extract.en.tests package
- Submodules
- lexnlp.extract.en.tests.test_acts module
- lexnlp.extract.en.tests.test_amounts module
- lexnlp.extract.en.tests.test_amounts_plain module
- lexnlp.extract.en.tests.test_citations module
- lexnlp.extract.en.tests.test_citations_plain module
- lexnlp.extract.en.tests.test_conditions module
- lexnlp.extract.en.tests.test_conditions_plain module
- lexnlp.extract.en.tests.test_constraints module
- lexnlp.extract.en.tests.test_constraints_plain module
- lexnlp.extract.en.tests.test_copyright module
- lexnlp.extract.en.tests.test_copyright_plain module
- lexnlp.extract.en.tests.test_courts module
- lexnlp.extract.en.tests.test_cusip module
- lexnlp.extract.en.tests.test_dates module
- lexnlp.extract.en.tests.test_dates_plain module
- lexnlp.extract.en.tests.test_definitions module
- lexnlp.extract.en.tests.test_definitions_template module
- lexnlp.extract.en.tests.test_dict_entities module
- lexnlp.extract.en.tests.test_distance module
- lexnlp.extract.en.tests.test_distances_plain module
- lexnlp.extract.en.tests.test_durations module
- lexnlp.extract.en.tests.test_durations_plain module
- lexnlp.extract.en.tests.test_geoentities module
- lexnlp.extract.en.tests.test_geoentities_plain module
- lexnlp.extract.en.tests.test_introductory_words_detector module
- lexnlp.extract.en.tests.test_money module
- lexnlp.extract.en.tests.test_money_plain module
- lexnlp.extract.en.tests.test_parsing_speed module
- lexnlp.extract.en.tests.test_percent_plain module
- lexnlp.extract.en.tests.test_percents module
- lexnlp.extract.en.tests.test_phone_plain module
- lexnlp.extract.en.tests.test_pii module
- lexnlp.extract.en.tests.test_ratios module
- lexnlp.extract.en.tests.test_ratios_plain module
- lexnlp.extract.en.tests.test_regulations module
- lexnlp.extract.en.tests.test_regulations_plain module
- lexnlp.extract.en.tests.test_span_tokenizer module
- lexnlp.extract.en.tests.test_ssn_plain module
- lexnlp.extract.en.tests.test_trademarks module
- lexnlp.extract.en.tests.test_trademarks_plain module
- lexnlp.extract.en.tests.test_urls module
- lexnlp.extract.en.tests.test_urls_plain module
- Module contents
Submodules
lexnlp.extract.en.acts module
- lexnlp.extract.en.acts.get_act_list(*args, **kwargs) → List[Dict[str, str]]
- lexnlp.extract.en.acts.get_acts(text: str) → Generator[Dict[str, Any], None, None]
- lexnlp.extract.en.acts.get_acts_annotations(text: str) → Generator[lexnlp.extract.common.annotations.act_annotation.ActAnnotation, None, None]
- lexnlp.extract.en.acts.get_acts_annotations_list(text: str) → List[lexnlp.extract.common.annotations.act_annotation.ActAnnotation]
lexnlp.extract.en.amounts module
Amount extraction for English.
This module implements basic amount extraction functionality in English.
This module supports converting:
- numbers with comma delimiters: “25,000.00”, “123,456,000”
- written numbers: “Seven Hundred Eighty”
- mixed written numbers: “5 million” or “2.55 BILLION”
- written ordinal numbers: “twenty-fifth”
- non-written fractions: “1/33”, “25/100”, where 1 < numerator < 99 and 1 < denominator < 999; the fraction “No/100” will be treated as “00/100”
- written numbers and fractions: “twenty one AND 5/100”
- written fractions: “one-third”, “three tenths”, “ten ninety-ninths”, “twenty AND one-hundredths”, “2 hundred and one-thousandth”; where 1 < numerator < 99 and 2 < denominator < 99 and numerator < denominator; or 1 < numerator < 99 and denominator == 100, i.e. 1/99 - 99/100; or 1 < numerator < 99 and denominator == 1000, i.e. 1/1000 - 99/1000
- floats starting with “.” (dot): “.5 million”
- “dozen”: “twenty-two DOZEN”
- “half”: “Six and a HALF Billion”, “two and a half”
- “quarter”: “five and one-quarter”, “5 and one-quarter”, “three-quarters”
- multiple numbers: “$25,400, 1 million people and 3.5 tons”
Avoided (skipped) patterns: “5.3.1.”, “1/1/2010”
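The comma-delimiter and magnitude-suffix cases above can be sketched with a small standalone parser. This is a hypothetical illustration (`get_simple_amounts`, `MAGNITUDES` and `AMOUNT_RE` are made up here), not the module’s implementation, which also handles written numbers, ordinals and fractions:

```python
import re
from decimal import Decimal

# Magnitude words the sketch understands (a small subset for illustration).
MAGNITUDES = {"hundred": 100, "thousand": 1000, "million": 10**6, "billion": 10**9}

AMOUNT_RE = re.compile(
    r"(?<![\d.])(\d{1,3}(?:,\d{3})*(?:\.\d+)?|\.\d+)"  # 25,000.00 / 123,456,000 / .5
    r"(?:\s+(hundred|thousand|million|billion))?",      # optional written magnitude
    re.IGNORECASE,
)

def get_simple_amounts(text):
    """Yield each amount found in text as a Decimal."""
    for number, magnitude in AMOUNT_RE.findall(text):
        value = Decimal(number.replace(",", ""))
        if magnitude:
            value *= MAGNITUDES[magnitude.lower()]
        yield value
```

For example, “$25,000.00 and 2.55 BILLION or .5 million” would yield 25000, 2550000000 and 500000.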
- lexnlp.extract.en.amounts.get_amount_annotations(text: str, extended_sources: bool = True, float_digits: int = 4) → Generator[lexnlp.extract.common.annotations.amount_annotation.AmountAnnotation, None, None]
  Find possible amount references in the text.
  :param text: text
  :param extended_sources: return data around the amount itself
  :param float_digits: round floats to N digits; don’t round if None
  :return: generator of amounts
- lexnlp.extract.en.amounts.get_amounts(text: str, return_sources: bool = False, extended_sources: bool = True, float_digits: int = 4) → Generator[Union[decimal.Decimal, Tuple[decimal.Decimal, str]], None, None]
  Find possible amount references in the text.
  :param text: text
  :param return_sources: return the amount AND its source text
  :param extended_sources: return data around the amount itself
  :param float_digits: round floats to N digits; don’t round if None
  :return: generator of amounts
- lexnlp.extract.en.amounts.get_np(text) → Generator[Tuple[str, str], None, None]
- lexnlp.extract.en.amounts.quantize_by_float_digit(amount: decimal.Decimal, float_digits: int) → decimal.Decimal
- lexnlp.extract.en.amounts.text2num(s: str, search_fraction: bool = True) → Optional[decimal.Decimal]
  Convert a written amount into a Decimal.
  :param s: written number
  :param search_fraction: also extract a fraction part
  :return: Decimal or None
lexnlp.extract.en.citations module
Citation extraction for English.
This module implements citation extraction functionality in English.
Todo:
- Improved unit tests and case coverage
- lexnlp.extract.en.citations.get_citation_annotations(text: str) → Generator[lexnlp.extract.common.annotations.citation_annotation.CitationAnnotation, None, None]
  Get citations.
  :param text:
  :return: citation annotations (volume, reporter, reporter_full_name, page, page2, court, year[, source text])
- lexnlp.extract.en.citations.get_citations(text: str, return_source=False, as_dict=False) → Generator
  Get citations.
  :param text:
  :param return_source:
  :param as_dict:
  :return: tuple or dict (volume, reporter, reporter_full_name, page, page2, court, year[, source text])
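The “volume reporter page (court year)” shape of a reporter citation can be sketched with a toy regex. This is a hypothetical illustration (`CITATION_RE` and `get_simple_citations` are made up here); the actual module resolves `reporter_full_name`, `page2`, etc. from a reporter database, which this sketch does not attempt:

```python
import re

# Match e.g. "347 U.S. 483 (1954)" or "123 F.3d 456 (9th Cir. 1997)".
CITATION_RE = re.compile(
    r"(?P<volume>\d+)\s+"
    r"(?P<reporter>[A-Z][A-Za-z0-9.]*)\s+"
    r"(?P<page>\d+)"
    r"(?:\s+\((?P<court>[^)]*?)\s*(?P<year>\d{4})\))?"
)

def get_simple_citations(text):
    """Yield (volume, reporter, page, court, year) tuples."""
    for m in CITATION_RE.finditer(text):
        court = (m["court"] or "").strip() or None
        year = int(m["year"]) if m["year"] else None
        yield int(m["volume"]), m["reporter"], int(m["page"]), court, year
```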
lexnlp.extract.en.conditions module
Condition extraction for English.
This module implements basic condition extraction functionality in English.
Todo:
- Improved unit tests and case coverage
- lexnlp.extract.en.conditions.create_condition_pattern(condition_pattern_template, condition_phrases)
  Create the condition pattern.
  :param condition_pattern_template:
  :param condition_phrases:
  :return:
- lexnlp.extract.en.conditions.get_condition_annotations(text: str, strict=True) → Generator[lexnlp.extract.common.annotations.condition_annotation.ConditionAnnotation, None, None]
  Find possible conditions in natural language.
  :param text:
  :param strict:
  :return:
- lexnlp.extract.en.conditions.get_conditions(text, strict=True) → Generator
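Building a pattern from a list of condition phrases, as create_condition_pattern does from its template and phrase list, can be sketched as a single alternation regex. A hypothetical standalone sketch (the real template and phrase list live in the module; the names below are made up):

```python
import re

def create_simple_condition_pattern(phrases):
    """Join trigger phrases into one case-insensitive alternation pattern.
    Longer phrases are listed first so they win over their prefixes."""
    alternation = "|".join(re.escape(p) for p in sorted(phrases, key=len, reverse=True))
    return re.compile(r"\b(?:%s)\b" % alternation, re.IGNORECASE)

def get_simple_conditions(text, pattern):
    """Yield (phrase, start, end) for each condition trigger found."""
    for m in pattern.finditer(text):
        yield m.group(0).lower(), m.start(), m.end()
```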
lexnlp.extract.en.constraints module
Constraint extraction for English.
This module implements basic constraint extraction functionality in English.
Todo:
- Improved unit tests and case coverage
- lexnlp.extract.en.constraints.create_constraint_pattern(constraint_pattern_template, constraint_phrases)
  Create the constraint pattern.
  :param constraint_pattern_template:
  :param constraint_phrases:
  :return:
- lexnlp.extract.en.constraints.get_constraint_annotations(text: str, strict=False) → Generator[lexnlp.extract.common.annotations.constraint_annotation.ConstraintAnnotation, None, None]
  Find possible constraints in natural language.
  :param text:
  :param strict:
  :return:
- lexnlp.extract.en.constraints.get_constraints(text: str, strict=False) → Generator
  Find possible constraints in natural language.
  :param text:
  :param strict:
  :return:
lexnlp.extract.en.copyright module¶
Copyright extraction for English using NLTK and NLTK pre-trained maximum entropy classifier.
This module implements basic Copyright extraction functionality in English relying on the pre-trained NLTK functionality, including POS tagger and NE (fuzzy) chunkers.
Todo: -
-
class
lexnlp.extract.en.copyright.
CopyrightEnParser
¶ Bases:
lexnlp.extract.common.copyrights.copyright_en_style_parser.CopyrightEnStyleParser
-
classmethod
extract_phrases_with_coords
(sentence: str) → List[Tuple[str, int, int]]¶
-
classmethod
-
class
lexnlp.extract.en.copyright.
CopyrightNPExtractor
(grammar=None)¶ Bases:
lexnlp.extract.en.utils.NPExtractor
-
allowed_pos
= ['IN', 'CC', 'NN']¶
-
allowed_sym
= ['&', 'and', 'of', '©']¶
-
static
strip_np
(np)¶
-
-
lexnlp.extract.en.copyright.
get_copyright
(text: str, return_sources: bool = False) → Generator[[Union[Tuple[str, str, str], Tuple[str, str, str, str]], None], None]¶
-
lexnlp.extract.en.copyright.
get_copyright_annotations
(text: str, return_sources=False) → Generator[[lexnlp.extract.common.annotations.copyright_annotation.CopyrightAnnotation, None], None]¶
lexnlp.extract.en.courts module
Court extraction for English.
This module implements extraction functionality for courts in English, including formal names, abbreviations, and aliases.
Todo:
- Add utilities for loading court data
- lexnlp.extract.en.courts.get_court_annotations(text: str, language: str = None) → Generator[lexnlp.extract.common.annotations.court_annotation.CourtAnnotation, None, None]
- lexnlp.extract.en.courts.get_courts(text: str, court_config_list: List[lexnlp.extract.en.dict_entities.DictionaryEntry], priority: bool = False, text_languages: List[str] = None, simplified_normalization: bool = False) → Generator[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias], Any, Any]
  Searches the text for courts from the provided config list and yields tuples of (court_config, court_alias). A court config is (court_id, court_name, [list of aliases]); an alias is (alias_text, language, is_abbrev, alias_id).
  This method uses the general searching routines for dictionary entities from the dict_entities.py module. Methods of the dict_entities module can be used to conveniently create the config: entity_config(), entity_alias(), add_aliases_to_entity().
  :param text:
  :param court_config_list: list of all possible known courts in the form of tuples (id, name, [(alias, lang, is_abbrev), …])
  :param priority: if two courts are found with totally equal matching aliases, use the one with the lowest id
  :param text_languages: language(s) of the source text. If a language is specified, then only aliases of this language will be searched for. For example, this allows ignoring “Island” (a German-language alias of Iceland) for English texts.
  :param simplified_normalization: don’t use NLTK for just “normalizing” the text
  :return: generates tuples (court entity, court alias)
- lexnlp.extract.en.courts.setup_en_parser()
lexnlp.extract.en.cusip module
CUSIP extraction for English.
This module implements CUSIP extraction functionality in English.
Todo:
- Improved unit tests and case coverage
- lexnlp.extract.en.cusip.get_cusip(text: str) → Generator[Dict[str, Any], None, None]
- lexnlp.extract.en.cusip.get_cusip_annotations(text: str) → Generator[lexnlp.extract.common.annotations.cusip_annotation.CusipAnnotation, None, None]
- lexnlp.extract.en.cusip.get_cusip_list(text)
- lexnlp.extract.en.cusip.is_cusip_valid(code, return_checksum=False)
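CUSIP validation, as performed by is_cusip_valid, rests on the standard modulus-10 “double-add-double” check digit computed over the first eight characters. A standalone sketch of that checksum (the helper names here are hypothetical, and unlike is_cusip_valid this sketch does not offer returning the checksum alongside validity):

```python
# Character values: digits 0-9, letters A-Z = 10-35, then '*', '@', '#'.
CUSIP_CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ*@#"

def simple_cusip_checksum(code):
    """Compute the check digit for the first 8 characters of a CUSIP."""
    total = 0
    for i, ch in enumerate(code[:8]):
        value = CUSIP_CHARS.index(ch)
        if i % 2 == 1:              # double every second character
            value *= 2
        total += value // 10 + value % 10   # add the digits of the product
    return (10 - total % 10) % 10

def simple_is_cusip_valid(code):
    """True if the 9th character equals the computed check digit."""
    return (len(code) == 9 and code[8].isdigit()
            and all(ch in CUSIP_CHARS for ch in code[:8])
            and simple_cusip_checksum(code) == int(code[8]))
```

For example, “037833100” (a well-known valid CUSIP) passes, while “037833101” fails.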
lexnlp.extract.en.date_model module
Date extraction for English.
This module implements date extraction functionality in English.
- lexnlp.extract.en.date_model.get_date_features(text: str, start_index: int, end_index: int, include_bigrams: bool = True, window: int = 5, characters=None, norm: bool = True) → Dict[str, int]
  Get features to use for classification of a date as a false positive.
  :param text: raw text around the potential date
  :param start_index: date start index
  :param end_index: date end index
  :param include_bigrams: whether to include bigram/bicharacter features
  :param window: window around the match
  :param characters: characters to use for feature generation, e.g., digits only, alpha only
  :param norm: whether to normalize, i.e., transform counts to proportions
  :return:
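The idea of windowed character features around a candidate date can be sketched as follows. This is a hypothetical simplification in the spirit of get_date_features (no bigrams, no character filtering); `simple_date_features` is made up here and its output does not match the module’s actual feature keys:

```python
from collections import Counter

def simple_date_features(text, start_index, end_index, window=5, norm=True):
    """Count characters in a window on either side of [start_index, end_index)."""
    left = text[max(0, start_index - window):start_index]
    right = text[end_index:end_index + window]
    counts = Counter(left + right)
    if norm:
        total = sum(counts.values()) or 1
        return {c: n / total for c, n in counts.items()}
    return dict(counts)
```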
lexnlp.extract.en.dates module
Date extraction for English.
This module implements date extraction functionality in English.
- class lexnlp.extract.en.dates.DateFeaturesDataframeBuilder
  Bases: object
  - classmethod build_feature_df(dic: Dict[str, float]) → pandas.core.frame.DataFrame
  - feature_df_by_key_count = {}
- class lexnlp.extract.en.dates.FeatureTemplate(df: pandas.core.frame.DataFrame = None, keys: List[str] = None)
  Bases: object
- lexnlp.extract.en.dates.build_date_model(input_examples, output_file, verbose=True)
  Build a sklearn model for classifying date strings as potential false positives.
  :param input_examples:
  :param output_file:
  :param verbose:
  :return:
- lexnlp.extract.en.dates.check_date_parts_are_in_date(date: datetime.datetime, date_props: Dict[str, List[Any]]) → bool
  Checks that, when we transformed a “possible date” (e.g. “13.2 may”) into a date, we found a place for each “token” from the initial phrase.
  :param date:
  :param date_props: {‘time’: [], ‘hours’: [] … ‘digits’: [‘13’, ‘2’] …}
  :return: True if the date is OK
- lexnlp.extract.en.dates.get_date_annotations(text: str, strict=False, base_date=None, threshold=0.5) → Generator[lexnlp.extract.common.annotations.date_annotation.DateAnnotation, None, None]
  Find dates after cleaning false positives.
  :param text: raw text to search
  :param strict: whether to return only complete or strict matches
  :param base_date: base date to use for implied or partial matches
  :param threshold: probability threshold to use for the false positive classifier
  :return:
- lexnlp.extract.en.dates.get_date_features(text, start_index, end_index, include_bigrams=True, window=5, characters=None, norm=True)
  Get features to use for classification of a date as a false positive.
  :param text: raw text around the potential date
  :param start_index: date start index
  :param end_index: date end index
  :param include_bigrams: whether to include bigram/bicharacter features
  :param window: window around the match
  :param characters: characters to use for feature generation, e.g., digits only, alpha only
  :param norm: whether to normalize, i.e., transform counts to proportions
  :return:
- lexnlp.extract.en.dates.get_dates(text: str, strict=False, base_date=None, return_source=False, threshold=0.5) → Generator
  Find dates after cleaning false positives.
  :param text: raw text to search
  :param strict: whether to return only complete or strict matches
  :param base_date: base date to use for implied or partial matches
  :param return_source: whether to return the raw text around the date
  :param threshold: probability threshold to use for the false positive classifier
  :return:
- lexnlp.extract.en.dates.get_dates_list(text, **kwargs) → List
- lexnlp.extract.en.dates.get_month_by_name()
- lexnlp.extract.en.dates.get_raw_date_list(text, strict=False, base_date=None, return_source=False) → List
- lexnlp.extract.en.dates.get_raw_dates(text, strict=False, base_date=None, return_source=False) → Generator
  Find “raw” or potential date matches prior to false positive classification.
  :param text: raw text to search
  :param strict: whether to return only complete or strict matches
  :param base_date: base date to use for implied or partial matches
  :param return_source: whether to return the raw text around the date
  :return:
- lexnlp.extract.en.dates.train_default_model(save=True)
  Train the default model.
  :return:
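The functions above form a two-stage pipeline: get_raw_dates finds candidate matches, and get_dates / get_date_annotations then drop candidates the trained classifier scores below the probability threshold. A heavily simplified, hypothetical sketch (one numeric format only, and a caller-supplied stub scorer standing in for the sklearn model):

```python
import re
from datetime import datetime

RAW_DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")  # M/D/YYYY only

def get_raw_dates_sketch(text):
    """Yield (datetime, span) for every candidate that is a real calendar date."""
    for m in RAW_DATE_RE.finditer(text):
        month, day, year = map(int, m.groups())
        try:
            yield datetime(year, month, day), m.span()
        except ValueError:
            continue  # e.g. 31/31/2010 is not a real date

def get_dates_sketch(text, score_func, threshold=0.5):
    """Keep only candidates the classifier scores at or above the threshold."""
    for date, span in get_raw_dates_sketch(text):
        if score_func(text, span) >= threshold:
            yield date
```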
lexnlp.extract.en.definition_parsing_methods module
Definition extraction for English.
This module implements basic definition extraction functionality in English.
Todo:
- Improved unit tests and case coverage
- class lexnlp.extract.en.definition_parsing_methods.DefinitionCaught(name: str, text: str, coords: Tuple[int, int])
  Bases: object
  Each definition is stored in this class with its name, full text and “coords” within the whole document.
  - coords
  - does_consume_target(target) → int
    :param target: a definition that is, probably, “consumed” by the current one
    :return: 1 if self consumes the target, -1 if the target consumes self, otherwise 0
  - name
  - text
- lexnlp.extract.en.definition_parsing_methods.does_term_are_service_words(term_pos: List[Tuple[str, str, int, int]]) → bool
  Does the term consist of service words only?
- lexnlp.extract.en.definition_parsing_methods.filter_definitions_for_self_repeating(definitions: List[lexnlp.extract.en.definition_parsing_methods.DefinitionCaught]) → List[lexnlp.extract.en.definition_parsing_methods.DefinitionCaught]
  :param definitions:
  :return: excludes definitions that are “overlapped”, leaving unique definitions only
- lexnlp.extract.en.definition_parsing_methods.get_definition_list_in_sentence(sentence_coords: Tuple[int, int, str], decode_unicode=True) → List[lexnlp.extract.en.definition_parsing_methods.DefinitionCaught]
  Find possible definitions in natural language in a single sentence.
  :param sentence_coords: sentence, sentence start, sentence end
  :param decode_unicode:
  :return:
- lexnlp.extract.en.definition_parsing_methods.get_quotes_count_in_string(text: str) → int
  :param text: text to count quotes within
  :return: the count of quotes within the passed text
- lexnlp.extract.en.definition_parsing_methods.join_collection(collection)
- lexnlp.extract.en.definition_parsing_methods.regex_matches_to_word_coords(pattern: Pattern[str], text: str, phrase_start: int = 0) → List[Tuple[str, int, int]]
  :param pattern: pattern to search for matches within the text
  :param text: text to search for matches
  :param phrase_start: a value to be added to each start / end
  :return: tuples of (match_text, start, end) for the regex (pattern) matches in the text
- lexnlp.extract.en.definition_parsing_methods.split_definitions_inside_term(term: str, src_with_coords: Tuple[int, int, str], term_start: int, term_end: int) → List[Tuple[str, int, int]]
  The whole phrase can be considered a definition (“MSRB”, “we”, “us” or “our”), but in fact the phrase can be a collection of definitions. Here we split the definition phrase into a list of definitions.
  The source string could be pre-processed; that’s why we search for each sub-phrase’s coordinates (PhrasePositionFinder).
  :param term: a definition or, possibly, a set of definitions (“MSRB”, “we”, “us” or “our”)
  :param src_with_coords: the sentence (probably) containing the term, plus its coords
  :param term_start: “term” start coordinate within the source sentence
  :param term_end: “term” end coordinate within the source sentence
  :return: [(definition, def_start, def_end), …]
- lexnlp.extract.en.definition_parsing_methods.trim_defined_term(term: str, start: int, end: int) → Tuple[str, int, int, bool]
  Removes a pair of quotes / brackets framing the text, replaces runs of spaces with single spaces, and replaces line breaks with spaces.
  :param term: a phrase that may contain excess framing symbols
  :param start: the original term’s start position; may be changed
  :param end: the original term’s end position; may be changed
  :return: the updated term, start, end, and a flag indicating that the whole phrase was inside quotes
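The trimming behavior described for trim_defined_term can be sketched as follows. This is a hypothetical simplification (`simple_trim_defined_term` and `QUOTE_PAIRS` are made up here) showing the shape of the (term, start, end, was_quoted) result, not the module’s actual rules:

```python
import re

# Framing pairs the sketch strips; only quote pairs set the was_quoted flag.
QUOTE_PAIRS = [('"', '"'), ("\u201c", "\u201d"), ("(", ")"), ("[", "]")]

def simple_trim_defined_term(term, start, end):
    """Strip one framing pair, collapse whitespace, adjust the coordinates."""
    was_quoted = False
    for open_q, close_q in QUOTE_PAIRS:
        if len(term) >= 2 and term.startswith(open_q) and term.endswith(close_q):
            term = term[1:-1]
            start, end = start + 1, end - 1
            was_quoted = open_q in ('"', "\u201c")
            break
    term = re.sub(r"\s+", " ", term).strip()
    return term, start, end, was_quoted
```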
lexnlp.extract.en.definitions module
- lexnlp.extract.en.definitions.get_definition_annotations(text: str, decode_unicode=True, locator_type: lexnlp.extract.common.annotation_locator_type.AnnotationLocatorType = <AnnotationLocatorType.RegexpBased: 1>) → Generator[lexnlp.extract.common.annotations.definition_annotation.DefinitionAnnotation, None, None]
- lexnlp.extract.en.definitions.get_definition_objects_list(text, decode_unicode=True) → List[lexnlp.extract.en.definition_parsing_methods.DefinitionCaught]
  :param text: text to search for definitions
  :param decode_unicode:
  :return: a list of found definitions as DefinitionCaught objects
- lexnlp.extract.en.definitions.get_definitions(text: str, return_sources=False, decode_unicode=True, return_coords=False, locator_type: lexnlp.extract.common.annotation_locator_type.AnnotationLocatorType = <AnnotationLocatorType.RegexpBased: 1>) → Generator
  Find possible definitions in natural language in text. The text is split into sentences first.
  :param text: the input text
  :param return_sources: return a tuple with the extracted term and the source sentence
  :param decode_unicode:
  :param return_coords: return an (x, y) tuple in each record, where x is the definition text start and y is the definition text end
  :param locator_type: use the default (regexp-based) or the ML-based locator
  :return: Generator[name] or Generator[name, text] or Generator[name, text, coords]
- lexnlp.extract.en.definitions.get_definitions_explicit(text, decode_unicode=True, locator_type: lexnlp.extract.common.annotation_locator_type.AnnotationLocatorType = <AnnotationLocatorType.RegexpBased: 1>) → Generator
- lexnlp.extract.en.definitions.get_definitions_in_sentence(sentence: str, return_sources=False, decode_unicode=True) → Generator
lexnlp.extract.en.dict_entities module
Universal extraction of entities for which we have full dictionaries of possible names and aliases from English text.
Examples: Courts, for which we have the full dictionary of known courts with their names and aliases and are able to search the text for each possible court; geo entities, for which we have the full set of known geo entities and can search any text for their occurrences.
The search methods of this module require lists of possible entities with their ids, names and sets of aliases in different languages. To allow using these methods in Celery, and especially to allow building these configuration lists once and using them in multiple Celery tasks, their easy and fast serialization is required. By default Celery uses JSON serialization starting from v4 and does not allow serializing objects of custom classes out of the box. So we have to use either dicts or tuples to avoid requiring special configuration for Celery. Tuples are faster.
To avoid typos in development and to make use of type hints in IDEs, this module provides a few methods for operating on the tuples which represent entities and aliases. They accept named parameter lists and return tuples.
- class lexnlp.extract.en.dict_entities.AliasBanList(aliases: Optional[List[str]] = None, abbreviations: Optional[List[str]] = None)
  Bases: object
- class lexnlp.extract.en.dict_entities.AliasBanRecord(alias: str = '', lang: Optional[str] = '', is_abbrev: bool = False)
  Bases: object
- class lexnlp.extract.en.dict_entities.DictionaryEntity(entity: Any, coords: Tuple[int, int])
  Bases: object
- class lexnlp.extract.en.dict_entities.DictionaryEntry(id: int = 0, name: str = '', priority: int = 0, name_is_alias: bool = True, aliases: Optional[List[lexnlp.extract.en.dict_entities.DictionaryEntryAlias]] = None)
  Bases: object
- class lexnlp.extract.en.dict_entities.DictionaryEntryAlias(alias: str = '', language: str = '', is_abbreviation: bool = False, alias_id: Optional[int] = None, normalized_alias: str = '')
  Bases: object
  - classmethod entity_alias(alias: str, language: str = None, is_abbreviation: bool = False, alias_id: int = None) → lexnlp.extract.en.dict_entities.DictionaryEntryAlias
- class lexnlp.extract.en.dict_entities.SearchResultPosition(entity: lexnlp.extract.en.dict_entities.DictionaryEntry, alias: lexnlp.extract.en.dict_entities.DictionaryEntryAlias, start: int, end: int, source_text: str = '')
  Bases: object
  Represents a position in the normalized source text at which one or more entities have been detected. One or more entities having equal aliases can be detected at a single position in the text.
  - add_entity(entity: lexnlp.extract.en.dict_entities.DictionaryEntry, alias: lexnlp.extract.en.dict_entities.DictionaryEntryAlias) → lexnlp.extract.en.dict_entities.SearchResultPosition
  - alias_text
  - end
  - entities_dict
  - get_entities_aliases() → List[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias]]
  - overlaps(other: lexnlp.extract.en.dict_entities.SearchResultPosition) → bool
  - source_text
  - start
- lexnlp.extract.en.dict_entities.alias_is_banlisted(alias_ban_list: Optional[Dict[str, lexnlp.extract.en.dict_entities.AliasBanList]], norm_alias: str, alias_lang: str, is_abbrev: bool) → bool
- lexnlp.extract.en.dict_entities.conflicts_take_first_by_id(conflicting_entities_aliases: List[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias]]) → List[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias]]
  Default conflict-resolving function: of all entities detected at the same position, drops all except the one having the smallest id. To be used in the find_dict_entities() method.
- lexnlp.extract.en.dict_entities.conflicts_top_by_priority(conflicting_entities_aliases: List[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias]]) → List[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias]]
  Conflict-resolving function: of all entities detected at the same position, drops all except the one having the highest priority. To be used in the find_dict_entities() method.
- lexnlp.extract.en.dict_entities.find_dict_entities(text: str, all_possible_entities: List[lexnlp.extract.en.dict_entities.DictionaryEntry], text_languages: Union[List[str], Tuple[str], Set[str]] = None, conflict_resolving_func: Callable[[List[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias]]], List[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias]]] = None, use_stemmer: bool = False, remove_time_am_pm: bool = True, min_alias_len: int = None, prepared_alias_ban_list: Optional[Dict[str, lexnlp.extract.en.dict_entities.AliasBanList]] = None, simplified_normalization: bool = False) → Generator[lexnlp.extract.en.dict_entities.DictionaryEntity, None, None]
  Finds all entities from the ‘all_possible_entities’ list appearing in the source text. This method takes care of leaving only the longest matching search result when multiple entities have aliases one of which is a substring of another. It takes care of the languages of the text and the aliases: if a language is specified both for the text and for an alias, the alias is used only if the languages match. It may detect multiple possibly matching entities at one position in the text, because there can be entities having the same aliases in the same language; to resolve such conflicts, a special resolving function can be specified. It also takes care of AM/PM time components which can appear in the aliases of some entities: it tries to detect minutes/seconds/milliseconds before AM/PM and ignores such occurrences.
  Algorithm of this method:
  1. Normalize the source text (we need lowercase and non-lowercase versions for abbreviation searches).
  2. Create a shared search context: a map of position → (alias text + list of matching entities).
  3. For each possible entity, search using the shared context:
     3.1. For each alias of the entity:
     3.1.1. Iteratively search for all occurrences of the alias, taking into account its language and abbreviation status. For each found occurrence, check whether another alias and entity have already been found at this position, and leave only the one having the longest alias (“Something” vs “Something Bigger”). If a different entity has already been found at this position with a totally equal alias in the same language, then store them both for this position in the text.
  4. Now we have a map filled with position → (alias text + list of entities having this alias). After sorting the items of this map by position, we can get rid of overlaps between longer and shorter aliases where one is a substring of another (“Bankr. E.D.N.Y.” vs “E.D.N.Y.”).
  5. For each position, check whether it overlaps with the next one [position; position + len(alias)]. If it overlaps, keep the longest alias and drop the shorter one.
  The main complexity of this algorithm is caused by the requirement to detect the longest match for each piece of text, while a longer match can start at an earlier position than a shorter match and there can be multiple aliases of different entities matching the same piece of text.
  Another algorithm for this function could be based on the idea that an or-type regexp returns the longest matching group. We could form regexps containing the possible aliases and apply them to the source text: r’alias1|alias2|longer alias2|…’
  TODO: Compare to other algorithms for time and memory complexity.
  :param text:
  :param all_possible_entities: list of dicts or list of DictEntity: all possible entities to search for
  :param text_languages: if set, then only aliases of these languages will be searched for
  :param conflict_resolving_func: a function for resolving conflicts when multiple entities are detected at the same position in the source text and their detected aliases are of the same length. The function takes a list of conflicting entities and should return a list of one or more entities to be returned.
  :param use_stemmer: use the stemmer instead of the tokenizer. The stemmer converts words to their simple form (singular number, etc.). The stemmer works better when searching for “tables”, “developers”, …; the tokenizer fits “United States”, “Mississippi”, …
  :param remove_time_am_pm: remove from the final results AM/PM abbreviations which look like the end part of time strings, e.g. 11:45 am, 10:00 pm
  :param min_alias_len: the minimal length of an alias/name to search for; can be used to ignore too-short aliases like “M.” while searching
  :param prepared_alias_ban_list: prepared ban list of aliases to exclude from the search; can be used to ignore concrete aliases. Should be a dict of language → tuple (list of normalized non-abbreviations, list of normalized abbreviations)
  :param simplified_normalization: don’t use NLTK for text “normalization”
  :return:
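The longest-match-with-overlap-resolution behavior described above can be sketched without the language handling, normalization and stemming machinery. A hypothetical simplification (`find_simple_dict_entities` is made up here; the real function works on DictionaryEntry objects and a normalized text):

```python
def find_simple_dict_entities(text, entities):
    """entities: list of (entity_id, [aliases]); returns [(id, alias, start, end)]."""
    lowered = text.lower()
    hits = []
    for entity_id, aliases in entities:
        for alias in aliases:
            needle = alias.lower()
            pos = lowered.find(needle)
            while pos >= 0:
                hits.append((pos, pos + len(needle), entity_id, alias))
                pos = lowered.find(needle, pos + 1)
    # Longest alias wins: sort by length (descending), then position, and
    # greedily keep matches that don't overlap an already-kept span.
    kept, result = [], []
    for start, end, entity_id, alias in sorted(hits, key=lambda h: (h[0] - h[1], h[0])):
        if all(end <= s or start >= e for s, e in kept):
            kept.append((start, end))
            result.append((entity_id, alias, start, end))
    return sorted(result, key=lambda r: r[2])
```

With aliases “E.D.N.Y.” and “Bankr. E.D.N.Y.”, the longer alias wins where they overlap, as step 5 of the documented algorithm requires.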
-
lexnlp.extract.en.dict_entities.
normalize_text
(text: str, spaces_on_start_end: bool = True, spaces_after_dots: bool = True, lowercase: bool = True, use_stemmer: bool = False, simple_tokenization: bool = False) → str¶ Normalizes text for substring search operations - extracts tokens, joins them back with spaces, adds missing spaces after dots for abbreviations, e.t.c. Overall aim of this method is to weaken substring matching conditions by normalizing both the text and the substring being searched by the same way removing obsolete differences between them (case, punctuation, …). :param text: :param spaces_on_start_end: :param spaces_after_dots: :param lowercase: :param simple_tokenization: don’t use nltk, just split text by space characters :param use_stemmer: Use stemmer instead of tokenizer. When using stemmer all words will be converted to singular number (or to some the most plain form) before matching. When using tokenizer - the words are compared as is. Using tokenizer should be enough for searches for entities which exist in a single number in the real world - geo entities, courts, …. Stemmer is required for searching for some common objects - table, pen, developer, … :return: “normazlied” string
-
lexnlp.extract.en.dict_entities.
normalize_text_with_map
(text: str, spaces_on_start_end: bool = True, spaces_after_dots: bool = True, lowercase: bool = True, use_stemmer: bool = False, simple_tokenization: bool = False) → Tuple[str, List[int]]¶ Almost like normalize_text, but also returns source-to-resulted char index map: map[i] = I, where i is the character coordinate within the source text,
I is the same character’s coordinate within the resulted text
-
lexnlp.extract.en.dict_entities.
prepare_alias_banlist_dict
(alias_banlist: List[lexnlp.extract.en.dict_entities.AliasBanRecord], use_stemmer: bool = False) → Optional[Dict[str, lexnlp.extract.en.dict_entities.AliasBanList]]¶ Prepare an alias ban list for passing to the find_dict_entities() function. :param alias_banlist: Non-normalized form of the ban list: [(alias, lang, is_abbrev), …] :param use_stemmer: Use the stemmer for alias normalization; otherwise only the tokenizer is used. :return:
-
lexnlp.extract.en.dict_entities.
reverse_src_to_dest_map
(conv_map: List[int], normalized_text_len=0) → List[int]¶ Reverses a source-to-destination character index map. Docstring example:
Source text:     One one Bankr. E.D.N.C. two two two.
Normalized text: One one Bankr . E . D . N . C . two two two .
map:      [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 19, …]
reversed: [0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13, 13, …]
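The reversal shown in the example above can be sketched as follows. This is an illustration of the idea, not the library's code: positions that exist only in the normalized text (inserted spaces) inherit the index of the nearest preceding source character, which is why the reversed map repeats values:

```python
from typing import List

def reverse_map(conv_map: List[int], normalized_text_len: int = 0) -> List[int]:
    """conv_map[i] is the position of source character i in the
    normalized text; the result maps every normalized-text position
    back to a source position."""
    length = max(normalized_text_len, (max(conv_map) + 1) if conv_map else 0)
    reversed_map = [-1] * length
    for src_i, dest_i in enumerate(conv_map):
        reversed_map[dest_i] = src_i
    last = 0
    for j in range(length):
        if reversed_map[j] == -1:
            # inserted character: reuse the nearest preceding source index
            reversed_map[j] = last
        else:
            last = reversed_map[j]
    return reversed_map
```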
lexnlp.extract.en.distances module¶
Distance extraction for English.
This module implements basic distance extraction functionality in English.
-
lexnlp.extract.en.distances.
get_distance_annotations
(text: str, float_digits: int = 4) → Generator[[lexnlp.extract.common.annotations.distance_annotation.DistanceAnnotation, None], None]¶
-
lexnlp.extract.en.distances.
get_distances
(text: str, return_sources: bool = False, float_digits: int = 4) → Generator[[Union[Tuple[decimal.Decimal, str], Tuple[decimal.Decimal, str, str]], None], None]¶
lexnlp.extract.en.durations module¶
This module implements duration extraction functionality in English.
-
class
lexnlp.extract.en.durations.
EnDurationParser
¶ Bases:
lexnlp.extract.common.durations.durations_parser.DurationParser
-
DURATION_MAP
= {'anniversaries': Fraction(365, 1), 'anniversary': Fraction(365, 1), 'annum': Fraction(365, 1), 'day': Fraction(1, 1), 'hour': Fraction(1, 24), 'minute': Fraction(1, 1440), 'month': Fraction(30, 1), 'quarter': Fraction(365, 4), 'second': Fraction(1, 86400), 'week': Fraction(7, 1), 'year': Fraction(365, 1)}¶
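DURATION_MAP expresses every unit as an exact number of days. A minimal sketch of how such a map converts a parsed (amount, unit) pair into days, using exact fractions to avoid float rounding for small units like seconds (this mirrors the table above but is not the parser itself):

```python
from fractions import Fraction

# Each unit expressed as days, mirroring the DURATION_MAP values above.
UNIT_DAYS = {
    'second': Fraction(1, 86400),
    'minute': Fraction(1, 1440),
    'hour': Fraction(1, 24),
    'day': Fraction(1, 1),
    'week': Fraction(7, 1),
    'month': Fraction(30, 1),
    'quarter': Fraction(365, 4),
    'year': Fraction(365, 1),
}

def duration_in_days(amount: int, unit: str) -> Fraction:
    """Convert an (amount, unit) pair to days as an exact fraction."""
    return amount * UNIT_DAYS[unit]
```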
-
DURATION_PTN
= '\n ((\n(?:(?:(?:(?:(?:[\\.\\d][\\d\\.,]*\\s*|\\W|^)\n(?:(?:seventeenths|seventeenth|thirteenths|fourteenths|eighteenths|nineteenths|seventieths|thirteenth|fourteenth|eighteenth|nineteenth|seventieth|fifteenths|sixteenths|twentieths|thirtieths|eightieths|ninetieths|seventeen|fifteenth|sixteenth|twentieth|thirtieth|eightieth|ninetieth|elevenths|fortieths|fiftieths|sixtieths|thirteen|fourteen|eighteen|nineteen|eleventh|fortieth|fiftieth|sixtieth|sevenths|twelfths|fifteen|sixteen|seventy|seventh|twelfth|fourths|eighths|eleven|twelve|twenty|thirty|eighty|ninety|zeroth|second|fourth|eighth|thirds|fifths|sixths|ninths|tenths|three|seven|eight|forty|fifty|sixty|first|third|fifth|sixth|ninth|tenth|zero|four|five|nine|one|two|six|ten|thousandths|thousandth|thousand|trillion|million|billion|trill|bil|mm|k|m|b\n|hundred(?:th(?:s)?)?|dozen|and|a\\s+half|quarters?)[\\s-]*)+)\n(?:(?:no|\\d{1,2})/100)?)|(?<=\\W|^)(?:[\\.\\d][\\d\\.,/]*))(?:\\W|$))(?:\\s{0,2}[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒]+)*)\n (?:\\s*(?:calendar|business|actual))?[\\s-]*\n (second|minute|hour|day|week|month|quarter|year|annum|anniversary|anniversaries)s?)(?:\\W|$)\n '¶
-
DURATION_PTN_RE
= regex.Regex('\n ((\n(?:(?:(?:(?:(?:[\\.\\d][\\d\\.,]*\\s*|\\W|^)\n(?:(?:seventeenths|seventeenth|thirteenths|fourteenths|eighteenths|nineteenths|seventieths|thirteenth|fourteenth|eighteenth|nineteenth|seventieth|fifteenths|sixteenths|twentieths|thirtieths|eightieths|ninetieths|seventeen|fifteenth|sixteenth|twentieth|thirtieth|eightieth|ninetieth|elevenths|fortieths|fiftieths|sixtieths|thirteen|fourteen|eighteen|nineteen|eleventh|fortieth|fiftieth|sixtieth|sevenths|twelfths|fifteen|sixteen|seventy|seventh|twelfth|fourths|eighths|eleven|twelve|twenty|thirty|eighty|ninety|zeroth|second|fourth|eighth|thirds|fifths|sixths|ninths|tenths|three|seven|eight|forty|fifty|sixty|first|third|fifth|sixth|ninth|tenth|zero|four|five|nine|one|two|six|ten|thousandths|thousandth|thousand|trillion|million|billion|trill|bil|mm|k|m|b\n|hundred(?:th(?:s)?)?|dozen|and|a\\s+half|quarters?)[\\s-]*)+)\n(?:(?:no|\\d{1,2})/100)?)|(?<=\\W|^)(?:[\\.\\d][\\d\\.,/]*))(?:\\W|$))(?:\\s{0,2}[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒]+)*)\n (?:\\s*(?:calendar|business|actual))?[\\s-]*\n (second|minute|hour|day|week|month|quarter|year|annum|anniversary|anniversaries)s?)(?:\\W|$)\n ', flags=regex.S | regex.I | regex.M | regex.X | regex.V0)¶
-
INNER_CONJUNCTIONS
= ['and', 'plus']¶
-
INNER_PUNCTUATION
= regex.Regex('[\\s\\,]', flags=regex.V0)¶
-
classmethod
get_all_annotations
(text: str, float_digits: int = 4) → List[lexnlp.extract.common.annotations.duration_annotation.DurationAnnotation]¶
-
-
lexnlp.extract.en.durations.
get_duration_annotations
(text: str, float_digits=4) → Generator[[lexnlp.extract.common.annotations.duration_annotation.DurationAnnotation, None], None]¶
-
lexnlp.extract.en.durations.
get_duration_annotations_list
(text: str, float_digits=4) → List[lexnlp.extract.common.annotations.duration_annotation.DurationAnnotation]¶
-
lexnlp.extract.en.durations.
get_durations
(text: str, return_sources=False, float_digits=4) → Generator¶
lexnlp.extract.en.en_language_tokens module¶
-
class
lexnlp.extract.en.en_language_tokens.
EnLanguageTokens
¶ Bases:
object
-
abbreviations
= {'A.D.', 'A.V.', 'Abbrev.', 'Abd.', 'Aberd.', 'Aberdeensh.', 'Abol.', 'Aborig.', 'Abp.', 'Abr.', 'Abridg.', 'Abridgem.', 'Absol.', 'Abst.', 'Abstr.', 'Acad.', 'Acc.', 'Accept.', 'Accomm.', 'Accompl.', 'Accs.', 'Acct.', 'Accts.', 'Achievem.', 'Add.', 'Addit.', 'Addr.', 'Adm.', 'Admin.', 'Admir.', 'Admon.', 'Admonit.', 'Adv.', 'Advancem.', 'Advert.', 'Advoc.', 'Advt.', 'Advts.', 'Aerodynam.', 'Aeronaut.', 'Aff.', 'Affect.', 'Afr.', 'Agric.', 'Alch.', 'Alg.', 'Alleg.', 'Allit.', 'Alm.', 'Alph.', 'Amer.', 'Anal.', 'Analyt.', 'Anat.', 'Anc.', 'Anecd.', 'Ang.', 'Angl.', 'Anglo-Ind.', 'Anim.', 'Ann.', 'Anniv.', 'Annot.', 'Anon.', 'Answ.', 'Ant.', 'Anthrop.', 'Anthropol.', 'Antiq.', 'Apoc.', 'Apol.', 'App.', 'Appl.', 'Applic.', 'Apr.', 'Arab.', 'Arb.', 'Arch.', 'Archaeol.', 'Archipel.', 'Archit.', 'Argt.', 'Arith.', 'Arithm.', 'Arrangem.', 'Artic.', 'Artific.', 'Artill.', 'Ashm.', 'Assemb.', 'Assoc.', 'Assoc. Football', 'Assyriol.', 'Astr.', 'Astrol.', 'Astron.', 'Astronaut.', 'Att.', 'Attrib.', 'Aug.', 'Austral.', 'Auth.', 'Autobiog.', 'Autobiogr.', 'Ayrsh.', 'B.C.', 'BNC', 'Bacteriol.', 'Bedford.', 'Bedfordsh.', 'Bel & Dr.', 'Belg.', 'Berks.', 'Berksh.', 'Berw.', 'Berwicksh.', 'Bibliogr.', 'Biochem.', 'Biog.', 'Biogr.', 'Biol.', 'Bk.', 'Bks.', 'Bord.', 'Bot.', 'Bp.', 'Braz.', 'Brit.', 'Bucks.', 'Build.', 'Bull.', 'Bur.', 'Cal.', 'Calc.', 'Calend.', 'Calif.', 'Calligr.', 'Camb.', 'Cambr.', 'Campanol.', 'Canad.', 'Canterb.', 'Capt.', 'Cartogr.', 'Catal.', 'Catech.', 'Cath.', 'Cent.', 'Ceram.', 'Cert.', 'Certif.', 'Ch.', 'Ch. Hist.', 'Chamb.', 'Char.', 'Charac.', 'Chas.', 'Chem.', 'Chem. Engin.', 'Chesh.', 'Chr.', 'Chron.', 'Chronol.', 'Chrons.', 'Cinematogr.', 'Circ.', 'Civ. Law', 'Civil Engin.', 'Cl.', 'Class.', 'Class. Antiq.', 'Classif.', 'Climatol.', 'Clin.', 'Col.', 'Coll.', 'Collect.', 'Colloq.', 'Coloss.', 'Com.', 'Comb.', 'Combs.', 'Comm.', 'Comm. Law', 'Commandm.', 'Commend.', 'Commerc.', 'Commiss.', 'Commonw.', 'Communic.', 'Comp.', 'Comp. 
Anat.', 'Compan.', 'Compar.', 'Compend.', 'Compl.', 'Compos.', 'Conc.', 'Conch.', 'Concl.', 'Conf.', 'Confid.', 'Confl.', 'Confut.', 'Congr.', 'Congreg.', 'Congress.', 'Conn.', 'Consc.', 'Consecr.', 'Consid.', 'Consol.', 'Constit.', 'Constit. Hist.', 'Constr.', 'Contemp.', 'Contempl.', 'Contend.', 'Content.', 'Contin.', 'Contradict.', 'Contrib.', 'Controv.', 'Conv.', 'Convent.', 'Conversat.', 'Convoc.', 'Cor.', 'Cornw.', 'Coron.', 'Corr.', 'Corresp.', 'Counc.', 'Courtsh.', 'Craniol.', 'Craniom.', 'Crim.', 'Crim. Law', 'Crit.', 'Crt.', 'Crts.', 'Cryptogr.', 'Crystallogr.', 'Ct.', 'Cumb.', 'Cumberld.', 'Cumbld.', 'Cycl.', 'Cytol.', 'D.C.', 'Dan.', 'Dau.', 'Deb.', 'Dec.', 'Declar.', 'Ded.', 'Def.', 'Deliv.', 'Demonstr.', 'Dep.', 'Depred.', 'Depredat.', 'Dept.', 'Derbysh.', 'Descr.', 'Deut.', 'Devel.', 'Devonsh.', 'Dial.', 'Dict.', 'Diffic.', 'Direct.', 'Dis.', 'Disc.', 'Discipl.', 'Discov.', 'Discrim.', 'Discuss.', 'Diss.', 'Dist.', 'Distemp.', 'Distill.', 'Distrib.', 'Div.', 'Divers.', 'Dk.', 'Doc.', 'Doctr.', 'Domest.', 'Durh.', 'E. Afr.', 'E. Angl.', 'E. Anglian', 'E. Ind.', 'E.D.D.', 'E.E.T.S.', 'East Ind.', 'Eccl.', 'Eccl. Hist.', 'Eccl. Law', 'Eccles.', 'Ecclus.', 'Ecol.', 'Econ.', 'Ed.', 'Edin.', 'Edinb.', 'Educ.', 'Edw.', 'Egypt.', 'Egyptol.', 'Electr.', 'Electr. Engin.', 'Electro-magn.', 'Electro-physiol.', 'Elem.', 'Eliz.', 'Elizab.', 'Emb.', 'Embryol.', 'Encycl.', 'Encycl. Brit.', 'Encycl. Metrop.', 'Eng.', 'Engin.', 'Englishw.', 'Enq.', 'Ent.', 'Enthus.', 'Entom.', 'Entomol.', 'Enzymol.', 'Ep.', 'Eph.', 'Ephes.', 'Epil.', 'Episc.', 'Epist.', 'Epit.', 'Equip.', 'Esd.', 'Ess.', 'Essent.', 'Establ.', 'Esth.', 'Ethnol.', 'Etymol.', 'Eval.', 'Evang.', 'Even.', 'Evid.', 'Evol.', 'Ex. 
doc.', 'Exalt.', 'Exam.', 'Exch.', 'Exec.', 'Exerc.', 'Exhib.', 'Exod.', 'Exped.', 'Exper.', 'Explan.', 'Explic.', 'Explor.', 'Expos.', 'Ezek.', 'Fab.', 'Fam.', 'Farew.', 'Feb.', 'Ff.', 'Fifesh.', 'Footpr.', 'Forfarsh.', 'Fortif.', 'Fortn.', 'Found.', 'Fr.', 'Fragm.', 'Fratern.', 'Friendsh.', 'Fund.', 'Furnit.', 'Gal.', 'Gard.', 'Gastron.', 'Gaz.', 'Gd.', 'Gen.', 'Geo.', 'Geog.', 'Geogr.', 'Geol.', 'Geom.', 'Geomorphol.', 'Ger.', 'Glac.', 'Glasg.', 'Glos.', 'Gloss.', 'Glouc.', 'Gloucestersh.', 'Gosp.', 'Gov.', 'Govt.', 'Gr.', 'Gram.', 'Gramm. Analysis', 'Gt.', 'Gynaecol.', 'Hab.', 'Haematol.', 'Hag.', 'Hampsh.', 'Handbk.', 'Hants.', 'Heb.', 'Hebr.', 'Hen.', 'Her.', 'Herb.', 'Heref.', 'Hereford.', 'Herefordsh.', 'Hertfordsh.', 'Hierogl.', 'Hist.', 'Histol.', 'Hom.', 'Horol.', 'Hort.', 'Hos.', 'Hosp.', 'Househ.', 'Housek.', 'Husb.', 'Hydraul.', 'Hydrol.', 'Ichth.', 'Icthyol.', 'Ideol.', 'Idol.', 'Illustr.', 'Imag.', 'Immunol.', 'Impr.', 'Inaug.', 'Inc.', 'Inclos.', 'Ind.', 'Industr.', 'Industr. Rel.', 'Infl.', 'Innoc.', 'Inorg.', 'Inq.', 'Inst.', 'Instr.', 'Intell.', 'Intellect.', 'Interc.', 'Interl.', 'Internat.', 'Interpr.', 'Intro.', 'Introd.', 'Inv.', 'Invent.', 'Invert. 
Zool.', 'Invertebr.', 'Investig.', 'Investm.', 'Invoc.', 'Ir.', 'Irel.', 'Isa.', 'Ital.', 'Jahrb.', 'Jam.', 'Jan.', 'Jap.', 'Jas.', 'Jer.', 'Josh.', 'Jrnl.', 'Jrnls.', 'Jud.', 'Judg.', 'Jul.', 'Jun.', 'Jurisd.', 'Jurisdict.', 'Jurispr.', 'Justif.', 'Justific.', 'Kent.', 'Kgs.', 'Kingd.', 'King’s Bench Div.', 'Knowl.', 'Kpr.', 'LXX', 'Lab.', 'Lam.', 'Lament', 'Lament.', 'Lanc.', 'Lancash.', 'Lancs.', 'Lang.', 'Langs.', 'Lat.', 'Ld.', 'Lds.', 'Lect.', 'Leechd.', 'Leg.', 'Leicest.', 'Leicester.', 'Leicestersh.', 'Leics.', 'Let.', 'Lett.', 'Lev.', 'Lex.', 'Libr.', 'Limnol.', 'Lincolnsh.', 'Lincs.', 'Ling.', 'Linn.', 'Lit.', 'Lithogr.', 'Lithol.', 'Liturg.', 'Lond.', 'MS.', 'MSS.', 'Macc.', 'Mach.', 'Mag.', 'Magn.', 'Mal.', 'Man.', 'Managem.', 'Manch.', 'Manip.', 'Manuf.', 'Mar.', 'Mass.', 'Math.', 'Matt.', 'Meas.', 'Measurem.', 'Mech.', 'Med.', 'Medit.', 'Mem.', 'Merc.', 'Merch.', 'Metall.', 'Metallif.', 'Metallogr.', 'Metamorph.', 'Metaph.', 'Meteorol.', 'Meth.', 'Metrop.', 'Mex.', 'Mic.', 'Mich.', 'Microbiol.', 'Microsc.', 'Mil.', 'Milit.', 'Min.', 'Mineral.', 'Misc.', 'Miscell.', 'Mod.', 'Monum.', 'Morphol.', 'Mt.', 'Mtg.', 'Mts.', 'Munic.', 'Munif.', 'Munim.', 'Mus.', 'Myst.', 'Myth.', 'Mythol.', 'N. Afr.', 'N. Amer.', 'N. Carolina', 'N. Dakota', 'N. Ir.', 'N. Irel.', 'N.E.', 'N.E.D.', 'N.S. Wales', 'N.S.W.', 'N.T.', 'N.W.', 'N.Y.', 'N.Z.', 'Nah.', 'Narr.', 'Narrat.', 'Nat.', 'Nat. Hist.', 'Nat. Philos.', 'Nat. Sci.', 'Naut.', 'Nav.', 'Navig.', 'Neh.', 'Neighb.', 'Nerv.', 'Neurol.', 'Neurosurg.', 'New Hampsh.', 'Newc.', 'Newspr.', 'No.', 'Non-conf.', 'Nonconf.', 'Norf.', 'Northamptonsh.', 'Northants.', 'Northumb.', 'Northumbld.', 'Northumbr.', 'Norw.', 'Norweg.', 'Notts.', 'Nov.', 'Nucl.', 'Num.', 'Numism.', 'O.E.D.', 'O.T.', 'OE', 'Obad.', 'Obed.', 'Obj.', 'Obs.', 'Observ.', 'Obstet.', 'Obstetr.', 'Obstetr. 
Med.', 'Occas.', 'Occup.', 'Occurr.', 'Oceanogr.', 'Oct.', 'Off.', 'Offic.', 'Okla.', 'Ont.', 'Ophthalm.', 'Ophthalmol.', 'Oppress.', 'Opt.', 'Orac.', 'Ord.', 'Org.', 'Org. Chem.', 'Organ. Chem.', 'Orig.', 'Orkn.', 'Ornith.', 'Ornithol.', 'Orthogr.', 'Outl.', 'Oxf.', 'Oxfordsh.', 'Oxon.', 'P. R.', 'Pa.', 'Palaeobot.', 'Palaeogr.', 'Palaeont.', 'Palaeontol.', 'Paraphr.', 'Parasitol.', 'Parl.', 'Parnass.', 'Path.', 'Pathol.', 'Peculat.', 'Penins.', 'Perf.', 'Periodontol.', 'Pers.', 'Persec.', 'Perthsh.', 'Pet.', 'Petrogr.', 'Petrol.', 'Pharm.', 'Pharmaceut.', 'Pharmacol.', 'Phil.', 'Philad.', 'Philem.', 'Philipp.', 'Philol.', 'Philos.', 'Phoen.', 'Phonol.', 'Photog.', 'Photogr.', 'Phrenol.', 'Phys.', 'Physical Chem.', 'Physical Geogr.', 'Physiogr.', 'Physiol.', 'Pict.', 'Poet.', 'Pol.', 'Pol. Econ.', 'Polit.', 'Polytechn.', 'Pop.', 'Porc.', 'Port.', 'Posth.', 'Postm.', 'Pott.', 'Pract.', 'Predict.', 'Pref.', 'Preh.', 'Prehist.', 'Prerog.', 'Pres.', 'Presb.', 'Preserv.', 'Prim.', 'Princ.', 'Print.', 'Probab.', 'Probl.', 'Proc.', 'Prod.', 'Prol.', 'Pronunc.', 'Prop.', 'Pros.', 'Prov.', 'Provid.', 'Provinc.', 'Provis.', 'Ps.', 'Psych.', 'Psychoanal.', 'Psychoanalyt.', 'Psychol.', 'Psychopathol.', 'Pt.', 'Publ.', 'Purg.', 'Q. Eliz.', 'Qld.', 'Quantum Mech.', 'Queen’s Bench Div.', 'R.A.F.', 'R.C.', 'R.C. Church', 'R.N.', 'Radiol.', 'Reas.', 'Reb.', 'Rebell.', 'Rec.', 'Reclam.', 'Recoll.', 'Redempt.', 'Ref.', 'Refl.', 'Refus.', 'Refut.', 'Reg.', 'Regic.', 'Regist.', 'Regr.', 'Rel.', 'Relig.', 'Reminisc.', 'Remonstr.', 'Renfrewsh.', 'Rep.', 'Reprod.', 'Rept.', 'Repub.', 'Res.', 'Resid.', 'Ret.', 'Retrosp.', 'Rev.', 'Revol.', 'Rhet.', 'Rhode Isl.', 'Rich.', 'Rom.', 'Rom. Antiq.', 'Ross-sh.', 'Roxb.', 'Roy.', 'Rudim.', 'Russ.', 'S. Afr.', 'S. Carolina', 'S. Dakota', 'S.E.', 'S.T.S.', 'S.W.', 'SS.', 'Sam.', 'Sask.', 'Sat.', 'Sax.', 'Sc.', 'Scand.', 'Sch.', 'Sci.', 'Scot.', 'Scotl.', 'Script.', 'Sculpt.', 'Seismol.', 'Sel.', 'Sel. 
comm.', 'Select.', 'Sept.', 'Ser.', 'Serm.', 'Sess.', 'Settlem.', 'Sev.', 'Shakes.', 'Shaks.', 'Sheph.', 'Shetl.', 'Shropsh.', 'Soc.', 'Sociol.', 'Som.', 'Song Sol.', 'Song of Sol.', 'Sonn.', 'Span.', 'Spec.', 'Specif.', 'Specim.', 'Spectrosc.', 'St.', 'Staff.', 'Stafford.', 'Staffordsh.', 'Staffs.', 'Stand.', 'Stat.', 'Statist.', 'Stock Exch.', 'Stratigr.', 'Struct.', 'Stud.', 'Subj.', 'Subscr.', 'Subscript.', 'Suff.', 'Suppl.', 'Supplic.', 'Suppress.', 'Surg.', 'Surv.', 'Sus.', 'Symmetr.', 'Symp.', 'Syst.', 'Taxon.', 'Techn.', 'Technol.', 'Tel.', 'Telecomm.', 'Telegr.', 'Teleph.', 'Teratol.', 'Terminol.', 'Terrestr.', 'Test.', 'Textbk.', 'Theat.', 'Theatr.', 'Theol.', 'Theoret.', 'Thermonucl.', 'Thes.', 'Thess.', 'Tim.', 'Tit.', 'Topogr.', 'Trad.', 'Trag.', 'Trans.', 'Transl.', 'Transubstant.', 'Trav.', 'Treas.', 'Treat.', 'Treatm.', 'Trib.', 'Trig.', 'Trigonom.', 'Trop.', 'Troub.', 'Troubl.', 'Typog.', 'Typogr.', 'U.K.', 'U.S.', 'U.S.A.F.', 'U.S.S.R.', 'Univ.', 'Unnat.', 'Unoffic.', 'Urin.', 'Utilit.', 'Va.', 'Vac.', 'Valedict.', 'Veg.', 'Veg. Phys.', 'Veg. Physiol.', 'Venet.', 'Vertebr.', 'Vet.', 'Vet. Med.', 'Vet. Path.', 'Vet. Sci.', 'Vet. Surg.', 'Vic.', 'Vict.', 'Vind.', 'Vindic.', 'Virg.', 'Virol.', 'Voc.', 'Vocab.', 'Vol.', 'Vols.', 'Voy.', 'Vulg.', 'W. Afr.', 'W. Ind.', 'W. Indies', 'W. Va.', 'Warwicksh.', 'Wd.', 'Westm.', 'Westmld.', 'Westmorld.', 'Westmrld.', 'Will.', 'Wilts.', 'Wiltsh.', 'Wis.', 'Wisd.', 'Wk.', 'Wkly.', 'Wks.', 'Wonderf.', 'Worc.', 'Worcestersh.', 'Worcs.', 'Writ.', 'Yearbk.', 'Yng.', 'Yorks.', 'Yorksh.', 'Yr.', 'Yrs.', 'Zech.', 'Zeitschr.', 'Zeph.', 'Zoogeogr.', 'Zool.', 'abbrev.', 'abl.', 'abs.', 'absol.', 'abstr.', 'acc.', 'accus.', 'act.', 'ad.', 'adj.', 'adj. phr.', 'adjs.', 'adv.', 'advb.', 'advs.', 'agst.', 'alt.', 'aphet.', 'app.', 'appos.', 'arch.', 'art.', 'attrib.', 'bef.', 'betw.', 'cent.', 'cf.', 'cl.', 'cogn. w.', 'collect.', 'colloq.', 'comb. 
form', 'comp.', 'compar.', 'compl.', 'conc.', 'concr.', 'conj.', 'cons.', 'const.', 'contempt.', 'contr.', 'corresp.', 'cpd.', 'dat.', 'def.', 'dem.', 'deriv.', 'derog.', 'dial.', 'dim.', 'dyslog.', 'e. midl.', 'eOE', 'east.', 'ed.', 'ellipt.', 'emph.', 'erron.', 'esp.', 'etym.', 'etymol.', 'euphem.', 'exc.', 'fam.', 'famil.', 'fem.', 'fig.', 'fl.', 'freq.', 'fut.', 'gen.', 'gerund.', 'hist.', 'imit.', 'imp.', 'imperf.', 'impers.', 'impf.', 'improp.', 'inc.', 'ind.', 'indef.', 'indic.', 'indir.', 'infin.', 'infl.', 'instr.', 'int.', 'interj.', 'interrog.', 'intr.', 'intrans.', 'iron.', 'irreg.', 'joc.', 'lOE', 'lit.', 'll.', 'masc.', 'med.', 'metaphor.', 'metr. gr.', 'midl.', 'mispr.', 'mod.', 'n.e.', 'n.w.', 'no.', 'nom.', 'nonce-wd.', 'north.', 'nr.', 'ns.', 'obj.', 'obl.', 'obs.', 'occas.', 'opp.', 'orig.', 'p.', 'pa.', 'pa. pple.', 'pa. t.', 'pass.', 'perf.', 'perh.', 'pers.', 'personif.', 'pf.', 'phonet.', 'phr.', 'pl.', 'plur.', 'poet.', 'pop.', 'poss.', 'ppl.', 'ppl. a.', 'ppl. adj.', 'ppl. adjs.', 'pple.', 'pples.', 'pr.', 'pr. pple.', 'prec.', 'pred.', 'predic.', 'pref.', 'prep.', 'pres.', 'pres. pple.', 'priv.', 'prob.', 'pron.', 'pronunc.', 'prop.', 'propr.', 'prov.', 'pseudo-Sc.', 'pseudo-arch.', 'pseudo-dial.', 'q.v.', 'quot.', 'quots.', 'redupl.', 'refash.', 'refl.', 'reg.', 'rel.', 'repr.', 'rhet.', 's.e.', 's.v.', 's.w.', 'sc.', 'sing.', 'south.', 'sp.', 'spec.', 'str.', 'subj.', 'subjunct.', 'subord.', 'subord. cl.', 'subseq.', 'subst.', 'suff.', 'superl.', 'syll.', 'techn.', 'tr.', 'trans.', 'transf.', 'transl.', 'ult.', 'unkn.', 'unstr.', 'usu.', 'v.r.', 'v.rr.', 'var.', 'varr.', 'vars.', 'vb.', 'vbl.', 'vbl. ns.', 'vbl.n.', 'vbs.', 'viz.', 'vulg.', 'wd.', 'west.', 'wk.'}¶
-
articles
= ['a', 'the', 'an']¶
-
conjunctions
= ['for', 'and', 'nor', 'but', 'or', 'yet', 'so']¶
-
static
init
()¶
-
pronouns
= {'I', 'all', 'another', 'any', 'anybody', 'anyone', 'anything', 'both', 'each', 'each other', 'either', 'enough', 'everybody', 'everyone', 'everything', 'few', 'he', 'her', 'hers', 'herself', 'him', 'himself', 'his', 'i', 'it', 'itself', 'little', 'many', 'me', 'mine', 'more', 'most', 'much', 'myself', 'neither', 'no one', 'nobody', 'none', 'nothing', 'one', 'one another', 'other', 'others', 'ours', 'ourselves', 'several', 'she', 'some', 'somebody', 'someone', 'something', 'such', 'that', 'theirs', 'them', 'themselves', 'these', 'they', 'this', 'those', 'us', 'we', 'what', 'whatever', 'which', 'whichever', 'who', 'whoever', 'whom', 'whomever', 'whose', 'you', 'yours', 'yourself'}¶
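Word lists such as the articles, conjunctions, and pronouns above are typically used to filter function words out of candidate phrases. A hypothetical helper sketching that use (the sets below are copied subsets of the class attributes, not the class itself):

```python
# Subsets of EnLanguageTokens.articles / .conjunctions, for illustration.
ARTICLES = {'a', 'an', 'the'}
CONJUNCTIONS = {'for', 'and', 'nor', 'but', 'or', 'yet', 'so'}

def content_words(tokens):
    """Drop articles and coordinating conjunctions, keeping content words."""
    return [t for t in tokens if t.lower() not in ARTICLES | CONJUNCTIONS]
```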
-
lexnlp.extract.en.geoentities module¶
Geo Entity extraction for English.
This module implements extraction functionality for geo entities in English, including formal names, abbreviations, and aliases.
-
lexnlp.extract.en.geoentities.
get_geoentities
(text: str, geo_config_list: List[lexnlp.extract.en.dict_entities.DictionaryEntry], priority: bool = False, priority_by_id: bool = False, text_languages: List[str] = None, min_alias_len: int = 2, prepared_alias_ban_list: Union[None, Dict[str, Tuple[List[str], List[str]]]] = None, simplified_normalization: bool = False) → Generator[[Tuple[lexnlp.extract.en.dict_entities.DictionaryEntry, lexnlp.extract.en.dict_entities.DictionaryEntryAlias], Any], Any]¶ Searches for geo entities from the provided config list and yields pairs of (entity, alias). Entity is: (entity_id, name, [list of aliases]) Alias is: (alias_text, lang, is_abbrev, alias_id)
This method uses the general dictionary-entity search routines from the dict_entities.py module. Methods of the dict_entities module can be used for conveniently creating the config: entity_config(), entity_alias(), add_aliases_to_entity(). :param text: :param geo_config_list: List of all possible known geo entities in the form of tuples (id, name, [(alias, lang, is_abbrev, alias_id), …]). :param priority: If two entities are found with totally equal matching aliases, use the one with the greatest priority field. :param priority_by_id: If two entities are found with totally equal matching aliases, use the one with the lowest id. :param text_languages: Language(s) of the source text. If a language is specified then only aliases of this language will be searched for. For example, this allows ignoring “Island” - a German-language alias of Iceland - in English texts. :param min_alias_len: Minimal length of geo entity aliases to search for. :param prepared_alias_ban_list: Aliases to exclude from the search, in the form: dict of lang -> (list of normalized non-abbreviation aliases, list of normalized abbreviation aliases). Use dict_entities.prepare_alias_banlist_dict() to prepare this dict. :param simplified_normalization: don’t use NLTK for “normalizing” the text :return: Generates tuples: (entity, alias)
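The dictionary search that get_geoentities builds on can be sketched very roughly as follows. This is a simplification under stated assumptions (no language filtering, no stemming, whitespace-delimited matching only), not the library's algorithm:

```python
def find_aliases(text, entities, min_alias_len=2):
    """Minimal sketch of dictionary-entity search: `entities` maps an
    entity name to a list of its aliases; both the text and the aliases
    are normalized (lowercased, space-padded) before substring search."""
    padded = f' {text.lower()} '
    for name, aliases in entities.items():
        for alias in aliases:
            if len(alias) < min_alias_len:
                continue
            if f' {alias.lower()} ' in padded:
                yield name, alias
```

The real implementation additionally tracks character positions, resolves overlapping matches, and honors language and ban-list constraints.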
-
lexnlp.extract.en.geoentities.
get_geoentity_annotations
(text: str, geo_config_list: List[lexnlp.extract.en.dict_entities.DictionaryEntry], priority: bool = False, priority_by_id: bool = False, text_languages: List[str] = None, min_alias_len: int = 2, prepared_alias_ban_list: Union[None, Dict[str, Tuple[List[str], List[str]]]] = None, simplified_normalization: bool = False) → Generator[[lexnlp.extract.common.annotations.geo_annotation.GeoAnnotation, None], None]¶ See get_geoentities
-
lexnlp.extract.en.geoentities.
load_entities_dict_by_path
(entities_fn: str, aliases_fn: str)¶
lexnlp.extract.en.introductory_words_detector module¶
-
class
lexnlp.extract.en.introductory_words_detector.
IntroductoryWordsDetector
¶ Bases:
object
-
INTRODUCTORY_POS
= [[('RB', {'also', 'so'}), ('VBN', {'known', 'called', 'named'})], [('RB', {'also', 'so'}), ('JJ', {'known', 'called', 'named'})], [('VBN', {'known', 'called', 'named'})]]¶
-
INTRO_ADVERBS
= {'also', 'so'}¶
-
INTRO_VERBS
= {'called', 'known', 'named'}¶
-
MAX_INTRO_LEN
= 2¶
-
PUNCTUATION_POS
= {'\t', '!', '"', '$', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', ':', ';', '?', '@', '\\', ']', '^', '``', '{', '}['}¶
-
static
remove_term_introduction
(term: str, term_pos: List[Tuple[str, str, int, int]]) → str¶ so called “champerty” => “champerty” :param term: source phrase :param term_pos: part-of-speech data for the source phrase
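The effect of remove_term_introduction can be approximated with a regular expression. This is a hypothetical stand-in (the real method works on the POS-tagged term, per INTRODUCTORY_POS above, rather than on raw text):

```python
import re

# Strip leading introductory phrases such as "so called", "also known as".
INTRO_RE = re.compile(
    r'^(?:(?:also|so)\s+)?(?:called|known|named)(?:\s+as)?\s+',
    re.IGNORECASE)

def strip_introduction(term: str) -> str:
    """Remove an introductory-words prefix from a term, if present."""
    return INTRO_RE.sub('', term, count=1)
```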
-
lexnlp.extract.en.money module¶
Money extraction for English.
This module implements basic money extraction functionality in English.
- Todo:
- Improved unit tests and case coverage
-
lexnlp.extract.en.money.
get_money
(text: str, return_sources: bool = False, float_digits: int = 4) → Generator¶
-
lexnlp.extract.en.money.
get_money_annotations
(text: str, float_digits: int = 4) → Generator[[lexnlp.extract.common.annotations.money_annotation.MoneyAnnotation, None], None]¶
lexnlp.extract.en.percents module¶
Percent extraction for English.
This module implements percent extraction functionality in English.
-
lexnlp.extract.en.percents.
get_percent_annotations
(text: str, float_digits: int = 4) → Generator[[lexnlp.extract.common.annotations.percent_annotation.PercentAnnotation, None], None]¶ Get percent usages within text.
-
lexnlp.extract.en.percents.
get_percents
(text: str, return_sources: bool = False, float_digits: int = 4) → Generator[[Union[Tuple[str, decimal.Decimal, decimal.Decimal], Tuple[str, decimal.Decimal, decimal.Decimal, str]], None], None]¶ Get percent usages within text. :param text: :param return_sources: :param float_digits: :return:
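A minimal sketch of the (unit, amount, fraction) tuple shape that get_percents yields. This covers only numeric forms like "2.5%" (the library also handles written numbers such as "twenty percent") and is not the library's pattern:

```python
import re
from decimal import Decimal

PERCENT_RE = re.compile(r'(\d+(?:\.\d+)?)\s*(%|percentage points?|percent)',
                        re.IGNORECASE)

def get_simple_percents(text):
    """Yield (unit, amount, fraction) tuples for numeric percent usages."""
    for m in PERCENT_RE.finditer(text):
        amount = Decimal(m.group(1))
        yield m.group(2).lower(), amount, amount / Decimal(100)
```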
lexnlp.extract.en.pii module¶
PII extraction for English.
This module implements PII extraction functionality in English.
-
lexnlp.extract.en.pii.
get_pii
(text: str, return_sources=False) → Generator¶ Find possible PII references in the text. :param text: :param return_sources: :return:
-
lexnlp.extract.en.pii.
get_pii_annotations
(text: str) → Generator[[lexnlp.extract.common.annotations.text_annotation.TextAnnotation, None], None]¶ Find possible PII references in the text.
-
lexnlp.extract.en.pii.
get_ssn_annotations
(text: str) → Generator[[lexnlp.extract.common.annotations.ssn_annotation.SsnAnnotation, None], None]¶
-
lexnlp.extract.en.pii.
get_ssns
(text, return_sources=False) → Generator¶ Find possible SSN references in the text.
-
lexnlp.extract.en.pii.
get_us_phone_annotations
(text: str) → Generator[[lexnlp.extract.common.annotations.phone_annotation.PhoneAnnotation, None], None]¶ Find possible telephone numbers in the text.
-
lexnlp.extract.en.pii.
get_us_phones
(text: str, return_sources=False) → Generator¶ Find possible telephone numbers in the text.
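The kinds of patterns the PII functions above look for can be sketched with simple regular expressions. These are hedged illustrations, not the library's actual (stricter) patterns:

```python
import re

# Illustrative patterns for US SSNs and phone numbers.
SSN_RE = re.compile(r'\b(\d{3})-(\d{2})-(\d{4})\b')
US_PHONE_RE = re.compile(r'\(?\b(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})\b')

def find_ssns(text):
    """Return SSN-shaped strings found in the text."""
    return ['-'.join(m.groups()) for m in SSN_RE.finditer(text)]

def find_us_phones(text):
    """Return US-phone-shaped strings, normalized to (NNN) NNN-NNNN."""
    return ['({}) {}-{}'.format(*m.groups())
            for m in US_PHONE_RE.finditer(text)]
```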
lexnlp.extract.en.ratios module¶
Ratio extraction for English.
This module implements ratio extraction functionality in English.
- Todo:
- Improved unit tests and case coverage
-
lexnlp.extract.en.ratios.
get_ratio_annotations
(text: str, float_digits: int = 4) → Generator[[lexnlp.extract.common.annotations.ratio_annotation.RatioAnnotation, None], None]¶
-
lexnlp.extract.en.ratios.
get_ratios
(text: str, return_sources: bool = False, float_digits: int = 4) → Generator[[Union[Tuple[decimal.Decimal, decimal.Decimal, decimal.Decimal], Tuple[decimal.Decimal, decimal.Decimal, decimal.Decimal, str]], None], None]¶
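get_ratios yields (left, right, left/right) triples. A minimal sketch covering only the numeric "X:Y" and "X to Y" forms (the library also handles written numbers), not the library's pattern:

```python
import re
from decimal import Decimal

RATIO_RE = re.compile(r'(\d+(?:\.\d+)?)\s*(?::|to)\s*(\d+(?:\.\d+)?)')

def get_simple_ratios(text):
    """Yield (left, right, left/right) for numeric ratio usages."""
    for m in RATIO_RE.finditer(text):
        left, right = Decimal(m.group(1)), Decimal(m.group(2))
        if right != 0:  # skip degenerate ratios like "3:0"
            yield left, right, left / right
```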
lexnlp.extract.en.regulations module¶
Regulation extraction for English.
This module implements regulation extraction functionality in English.
- Todo:
- Improved unit tests and case coverage
-
lexnlp.extract.en.regulations.
get_regulation_annotations
(text: str) → Generator[[lexnlp.extract.common.annotations.regulation_annotation.RegulationAnnotation, None], None]¶ Get regulations. :param text: :return: RegulationAnnotation objects
-
lexnlp.extract.en.regulations.
get_regulations
(text, return_source=False, as_dict=False) → Generator¶ Get regulations. :param text: :param return_source: :param as_dict: :return: tuple or dict (volume, reporter, reporter_full_name, page, page2, court, year[, source text])
lexnlp.extract.en.trademarks module¶
Trademark extraction for English using NLTK and NLTK pre-trained maximum entropy classifier.
This module implements basic Trademark extraction functionality in English relying on the pre-trained NLTK functionality, including POS tagger and NE (fuzzy) chunkers.
-
lexnlp.extract.en.trademarks.
get_trademark_annotations
(text: str) → Generator[[lexnlp.extract.common.annotations.trademark_annotation.TrademarkAnnotation, None], None]¶ Find trademarks in text.
-
lexnlp.extract.en.trademarks.
get_trademarks
(text: str) → Generator[[str, None], None]¶ Find trademarks in text.
lexnlp.extract.en.urls module¶
Urls extraction for English using NLTK and NLTK pre-trained maximum entropy classifier.
This module implements basic urls extraction functionality in English relying on the pre-trained NLTK functionality, including POS tagger and NE (fuzzy) chunkers.
-
lexnlp.extract.en.urls.
get_url_annotations
(text: str) → Generator[[lexnlp.extract.common.annotations.url_annotation.UrlAnnotation, None], None]¶ Find urls in text.
-
lexnlp.extract.en.urls.
get_urls
(text: str) → Generator[[str, None], None]¶ Find urls in text.
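The string matching that get_urls performs can be sketched with a simple pattern. Real URL grammars (RFC 3986) are considerably more involved; this illustrative regex only catches http(s) URLs delimited by whitespace or common punctuation:

```python
import re

# Illustrative URL pattern; not the library's.
URL_RE = re.compile(r'\bhttps?://[^\s<>"\')\]]+', re.IGNORECASE)

def find_urls(text):
    """Return http(s) URL strings found in the text."""
    return URL_RE.findall(text)
```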
lexnlp.extract.en.utils module¶
Extraction utilities for English.
-
class
lexnlp.extract.en.utils.
NPExtractor
(grammar=None)¶ Bases:
object
-
cleanup_leaves
(leaves)¶
-
exception_pos
= ['IN', 'CC']¶
-
exception_sym
= ['&', 'and', 'of']¶
-
get_np
(text: str) → Generator[[str, None], None]¶
-
get_np_with_coords
(text: str) → List[Tuple[str, int, int]]¶
-
get_tokenizer
()¶
-
join
(np_items)¶
-
replace
(text, back=False)¶
-
replacements
= [[('(\\w)&(\\w)', '\\1-=AND=-\\2'), ('-=AND=-', '&')]]¶
-
sep
(n, current_pos, last_pos)¶
-
static
strip_np
(np)¶
-
sym_with_space
= ['(', '&']¶
-
sym_without_space
= ['!', '"', '#', '$', '%', "'", ')', '*', '+', ',', '-', '.', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', "'s"]¶
-
-
lexnlp.extract.en.utils.
strip_unicode_punctuation
(text, valid_punctuation=None)¶ This method strips all Unicode punctuation that is not explicitly accepted. :param text: text to strip :param valid_punctuation: punctuation characters to keep :return:
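The behavior described above can be sketched using Unicode character categories: every character whose category starts with 'P' (punctuation) is dropped unless it appears in the accepted set. A minimal illustration, not the library's code:

```python
import unicodedata

def strip_unicode_punct(text: str, valid_punctuation: str = '') -> str:
    """Drop every character whose Unicode category is punctuation ('P*')
    unless it is explicitly accepted via valid_punctuation."""
    keep = set(valid_punctuation)
    return ''.join(
        ch for ch in text
        if ch in keep or not unicodedata.category(ch).startswith('P')
    )
```

Note this removes typographic quotes and dashes too, since they are Unicode punctuation, while letters in any script are preserved.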