lexnlp.extract.common package

Subpackages

Submodules

lexnlp.extract.common.annotation_locator_type module

class lexnlp.extract.common.annotation_locator_type.AnnotationLocatorType

Bases: enum.Enum

An enumeration.

MlWordVectorBased = 2
RegexpBased = 1
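
A minimal usage sketch for choosing a locator strategy (the branch below is illustrative, not taken from this section):

from lexnlp.extract.common.annotation_locator_type import AnnotationLocatorType

locator = AnnotationLocatorType.RegexpBased
if locator == AnnotationLocatorType.RegexpBased:
    print('using the regular-expression based locator')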

lexnlp.extract.common.annotation_type module

class lexnlp.extract.common.annotation_type.AnnotationType

Bases: enum.Enum

An enumeration.

act = 1
amount = 2
citation = 3
condition = 4
constraint = 5
copyright = 6
court = 7
court_citation = 8
cusip = 9
date = 10
definition = 11
distance = 12
duration = 13
geoentity = 14
laws = 24
money = 15
percent = 16
phone = 18
pii = 17
ratio = 20
regulation = 21
ssn = 19
trademark = 22
url = 23

lexnlp.extract.common.base_path module

lexnlp.extract.common.dates module

lexnlp.extract.common.fact_extracting module

lexnlp.extract.common.language_dictionary_reader module

class lexnlp.extract.common.language_dictionary_reader.LanguageDictionaryReader

Bases: object

This class reads text files in which values are separated by line breaks, strips the values if needed, and returns them as a List or Dict.

We use this class, e.g., when reading the German-locale (“De”) common abbreviations.

static read_str_set(file_path: str, encoding='utf8', strip_symbols=' ') → Set[str]
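
A short usage sketch: write a small dictionary file, then read it back as a set of stripped values (the file name here is made up for illustration):

from pathlib import Path
from lexnlp.extract.common.language_dictionary_reader import LanguageDictionaryReader

Path('abbreviations.txt').write_text(' z. B.\nd. h. \nusw.\n', encoding='utf8')
values = LanguageDictionaryReader.read_str_set('abbreviations.txt', strip_symbols=' ')
# values == {'z. B.', 'd. h.', 'usw.'} (each line stripped of the given symbols)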

lexnlp.extract.common.pattern_found module

class lexnlp.extract.common.pattern_found.PatternFound

Bases: object

Used inside EsDefinitionsParser and SpanishParsingMethods to store intermediate parsing results.

pattern_worse_than_target(p, text: str) → bool

Checks which of the two patterns is better when they are considered duplicates; “text” may be used in derived classes.
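
A hedged sketch of how a derived class might override the check; the start/end coordinate attributes used below are assumptions for illustration, not documented above:

from lexnlp.extract.common.pattern_found import PatternFound

class LongestPatternWins(PatternFound):
    def pattern_worse_than_target(self, p, text: str) -> bool:
        # prefer the longer of two duplicated matches
        # (self.start / self.end are assumed here)
        return (self.end - self.start) < (p.end - p.start)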

lexnlp.extract.common.special_characters module

class lexnlp.extract.common.special_characters.SpecialCharacters

Bases: object

punctuation = {'!', '"', '$', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', ':', ';', '?', '@', '[', '\\', ']', '^', '{', '}'}
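
A small sketch that uses the class-level set to drop the listed punctuation from a token:

from lexnlp.extract.common.special_characters import SpecialCharacters

token = '(hello!)'
cleaned = ''.join(c for c in token if c not in SpecialCharacters.punctuation)
# cleaned == 'hello'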

lexnlp.extract.common.text_beautifier module

class lexnlp.extract.common.text_beautifier.TextBeautifier

Bases: object

APOS_SEPARATORS = {'\t', ' ', '(', ')', ',', '.', ';', '[', ']', '{', '}'}
BRACES_C = {')', ']', '}'}
BRACES_O = {'(', '[', '{'}
BRACE_CL_BY_OP = {'(': ')', '[': ']', '{': '}'}
PAIR_BRACES = {'""', "''", '()', '[]', '``', '{}', '“”'}
PROPER_CLOSE_QUOTE = {'"': '"', '“': '”'}
QUOTES = {'"', '“', '”'}
TRANSFORMED_WORDS = {"''": ['"', '``', '“', '”'], '(': ['(', '[', '{'], ')': [')', ']', '}'], ':': [':', ';', '|'], '``': ['"', '``', '“', '”']}
static find_pair_among_apostrophe(text: str, apos_coords: List[int], quote: Tuple[str, int]) → int
static find_transformed_word(txt: str, word: str, offset: int) → Optional[Tuple[str, int]]

Searches for a transformed word in the text; returns the transformed word with its start position.

static lstrip_string_coords(text: str, start: int, end: int, trim_symbols: Optional[str] = None) → Tuple[str, int, int]
static normalize_smb_preserve_len(text: str) → str

Normalize some of the string characters, preserving the original length.
Parameters:
  • text – string to normalize
Returns:

normalized string
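
A sketch that relies only on the documented length-preserving contract (the sample string is arbitrary):

from lexnlp.extract.common.text_beautifier import TextBeautifier

src = '“Software” license'
out = TextBeautifier.normalize_smb_preserve_len(src)
assert len(out) == len(src)  # the length is preserved by contract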

static rstrip_string_coords(text: str, start: int, end: int, trim_symbols: Optional[str] = None) → Tuple[str, int, int]
static strip_pair_symbols(term_coords: Union[str, Tuple[str, int, int]]) → Union[str, Tuple[str, int, int]]
static strip_string_coords(text: str, start: int, end: int, trim_symbols: Optional[str] = None) → Tuple[str, int, int]
static unify_quotes_braces(text: str, empty_replacement: str = '') → str
static unify_quotes_braces_coords(text: str, start: int, end: int, empty_replacement: str = '') → Tuple[str, int, int]
static unify_quotes_braces_unsafe(text: str, start: int, end: int, empty_replacement: str = '') → Tuple[str, int, int]
Parameters:
  • text – source text to “beautify”
  • start – start coordinate of the text
  • end – end coordinate of the text
  • empty_replacement – replace unbalanced braces / quotes with this substring
Returns:

str with all quotes and braces replaced with their “normal” forms
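
A usage sketch for the coordinate-tracking variant; the sample string is arbitrary, and the expected behavior is paraphrased from the parameter descriptions above:

from lexnlp.extract.common.text_beautifier import TextBeautifier

text, start, end = TextBeautifier.unify_quotes_braces_coords(
    '“Agreement” (the {Contract]', 0, 27, empty_replacement='')
# quotes and braces come back in their "normal" forms; unbalanced
# braces / quotes are replaced with empty_replacement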

lexnlp.extract.common.text_pattern_collector module

class lexnlp.extract.common.text_pattern_collector.TextPatternCollector(parsing_functions: List[Callable[str, List[lexnlp.extract.common.pattern_found.PatternFound]]], split_params: lexnlp.utils.lines_processing.line_processor.LineSplitParams)

Bases: object

Collects pattern matches from text; e.g., the derived EsDefinitionsParser searches for definitions in text according to the rules of Spanish. See the “parse” method.

basic_line_processor = <lexnlp.utils.lines_processing.line_processor.LineProcessor object>

choose_best_matches(matches: List[lexnlp.extract.common.pattern_found.PatternFound]) → List[lexnlp.extract.common.pattern_found.PatternFound]
choose_more_precise_matches(matches: List[lexnlp.extract.common.pattern_found.PatternFound], text: str) → List[lexnlp.extract.common.pattern_found.PatternFound]

Looks for matches “consumed” by other matches and keeps only the consuming matches.

static estimate_match_quality(match: lexnlp.extract.common.pattern_found.PatternFound) → int
make_annotation_from_pattrn(locale: str, ptrn: lexnlp.extract.common.pattern_found.PatternFound, phrase: lexnlp.utils.lines_processing.line_processor.LineOrPhrase) → lexnlp.extract.common.annotations.text_annotation.TextAnnotation
parse(text: str, locale: str = None) → List[lexnlp.extract.common.annotations.text_annotation.TextAnnotation]
Parameters:
  • locale – ‘En’, ‘De’, ‘Es’, …
  • text – source text, e.g., “En este acuerdo, el término “Software” se refiere a: (i) el programa informático”
Returns:

{
  "attrs": {"start": 28, "end": 82},
  "tags": {
    "Extracted Entity Type": "definition",
    "Extracted Entity Definition Name": "Software",
    "Extracted Entity Text": "“Software” se refiere a: (i) el programa informático"
  }
}
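
A hedged calling sketch, assuming “collector” is an already-constructed, locale-specific TextPatternCollector (e.g., a Spanish definitions parser):

annotations = collector.parse(
    'En este acuerdo, el término “Software” se refiere a: (i) el programa informático',
    locale='Es')
for ann in annotations:
    print(ann)  # each item is a TextAnnotation as described above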

remove_prohibited_words(matches: List[lexnlp.extract.common.pattern_found.PatternFound]) → List[lexnlp.extract.common.pattern_found.PatternFound]

lexnlp.extract.common.universal_court_parser module

class lexnlp.extract.common.universal_court_parser.MatchFound(subset, entry_start: int, entry_end: int, text: str)

Bases: object

make_sort_key()
class lexnlp.extract.common.universal_court_parser.ParserInitParams

Bases: object

UniversalCourtsParser initialization parameters

class lexnlp.extract.common.universal_court_parser.UniversalCourtsParser(ptrs: lexnlp.extract.common.universal_court_parser.ParserInitParams)

Bases: object

The class describes a “constructor” for building locale- (and region-) specific parsers that find references to courts within the text.

Use the parse() method to find all references to courts in the provided text. Each reference is a dictionary with two keys:
  • “attrs” – the “coordinates” (starting and ending characters) of the occurrence within the provided text
  • “tags” – another dictionary, which contains the court’s official name, the court’s jurisdiction, …

To parse the text, create your locale- (or region-) specific instance of UniversalCourtsParser; see the constructor signature above and the sketch below:
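
A hedged construction sketch; ParserInitParams is assumed to be filled with locale-specific settings not documented in this section, and the CSV path is hypothetical:

from lexnlp.extract.common.universal_court_parser import ParserInitParams, UniversalCourtsParser

ptrs = ParserInitParams()
# ... set locale-specific parameters on ptrs here ...
processor = UniversalCourtsParser(ptrs)
processor.load_courts(['data/de_courts.csv'])  # hypothetical dataframe path
annotations = processor.parse('...')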

add_annotation(match: lexnlp.extract.common.universal_court_parser.MatchFound)
find_court_by_any_key(phrase: lexnlp.utils.lines_processing.line_processor.LineOrPhrase)
find_court_by_key_column(phrase: lexnlp.utils.lines_processing.line_processor.LineOrPhrase, phrase_finder: lexnlp.utils.lines_processing.phrase_finder.PhraseFinder, column: str) → Tuple[lexnlp.extract.common.universal_court_parser.MatchFound, List[Tuple[str, int, int]]]
find_court_by_name(phrase: lexnlp.utils.lines_processing.line_processor.LineOrPhrase) → List[lexnlp.extract.common.universal_court_parser.MatchFound]
find_court_by_type_and_jurisdiction(phrase: lexnlp.utils.lines_processing.line_processor.LineOrPhrase) → List[lexnlp.extract.common.universal_court_parser.MatchFound]
find_courts_by_alias_in_whole_text(text: str) → None
static get_unique_col_values(col_values)
load_courts(dataframe_paths: List[str])
parse(text: str, locale: str = None) → List[lexnlp.extract.common.annotations.court_annotation.CourtAnnotation]
Parameters:
  • text – the text being processed
  • locale – ‘En’, ‘Es’, …
Returns:

annotations - List[dict]

Here is an example of the method’s call:

ret = processor.parse("Bei dir läuft, deine Verfassungsgerichtshof des Freistaates Sachsen rauchen Joints vor der Kamera")

ret[0]['attrs'] = {'start': 14, 'end': 97}
ret[0]['tags'] = {'Extracted Entity Type': 'court',
                  'Extracted Entity Court Name': 'Verfassungsgerichtshof des Freistaates Sachsen',
                  'Extracted Entity Court Type': 'Verfassungsgericht',
                  'Extracted Entity Court Jurisdiction': 'Sachsen'}

lexnlp.extract.common.year_parser module

class lexnlp.extract.common.year_parser.YearParser

Bases: object

Finds years in the provided string.

check_year_ok(year: int, min_year: int = 1800, max_year=0)
get_years_with_coords_from_string(text: str, min_year: int = 1800, max_year=0) → List[Tuple[int, int, int]]
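
A short usage sketch; reading each tuple as (year, start, end) is an assumption based on the List[Tuple[int, int, int]] return annotation:

from lexnlp.extract.common.year_parser import YearParser

parser = YearParser()
found = parser.get_years_with_coords_from_string('signed in 1999, renewed in 2005')
for year, start, end in found:  # assumed tuple layout
    print(year, start, end)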

Module contents