lexnlp.extract.common.copyrights package

Submodules

lexnlp.extract.common.copyrights.copyright_en_style_parser module

Copyright extraction for English using NLTK and its pre-trained maximum entropy classifier.

This module implements basic copyright extraction for English text. It relies on pre-trained NLTK components, including the POS tagger and the NE (fuzzy) chunkers.

class lexnlp.extract.common.copyrights.copyright_en_style_parser.CopyrightEnStyleParser

Bases: object

copyright_dates_re = regex.Regex('\\d{2,}', flags=regex.V0)
copyright_ptn = '((Copyright\\W\\s*|\\(\\s*[Cc]\\s*\\)\\s*|©)+\\s*(\\d{4}(?:\\s*[-,–]\\s*\\d{4})?)?\\s*(.+))'
copyright_ptn_re = regex.Regex('((Copyright\\W\\s*|\\(\\s*[Cc]\\s*\\)\\s*|©)+\\s*(\\d{4}(?:\\s*[-,–]\\s*\\d{4})?)?\\s*(.+))', flags=regex.V0)
classmethod derive_company_name(ant: lexnlp.extract.common.annotations.copyright_annotation.CopyrightAnnotation, phrase: str) → None
classmethod extract_phrases_with_coords(sentence: str) → List[Tuple[str, int, int]]

Find copyright phrases in the given sentence and return each phrase together with its coordinates in the sentence.

reg_company_name = regex.Regex('[\\p{Lu}]+[\\p{L}\\s]*', flags=regex.V0)
reg_valid_company_name = regex.Regex('\\p{L}[\\p{L}\\s,]+', flags=regex.V0)
classmethod take_best_company_name(names: List[str]) → str
year_ptn = '(\\d{4}(?:\\s*[-,–]\\s*\\d{4})?)'
year_ptn_re = regex.Regex('(\\d{4}(?:\\s*[-,–]\\s*\\d{4})?)$', flags=regex.V0)
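
The capture groups of copyright_ptn can be explored with a short sketch. The two patterns below are copied from the attributes listed above; compiling them with the standard-library re module and the helper name split_copyright are assumptions made for illustration only, and nothing here is the library's own extraction logic.

```python
import re
from typing import List, Optional, Tuple

# Patterns copied from the class attributes listed above. They are compiled
# with the standard-library `re` module, an assumption made for this sketch;
# the library itself uses the third-party `regex` package.
COPYRIGHT_PTN = re.compile(
    r'((Copyright\W\s*|\(\s*[Cc]\s*\)\s*|©)+\s*'
    r'(\d{4}(?:\s*[-,–]\s*\d{4})?)?\s*(.+))'
)
COPYRIGHT_DATES = re.compile(r'\d{2,}')


def split_copyright(sentence: str) -> Optional[Tuple[str, List[str], str]]:
    """Hypothetical helper: return (trigger, years, remainder) or None."""
    m = COPYRIGHT_PTN.search(sentence)
    if m is None:
        return None
    trigger = m.group(2).strip()                        # "Copyright", "(c)" or "©"
    years = COPYRIGHT_DATES.findall(m.group(3) or '')   # group 3 is optional
    remainder = m.group(4)                              # text after the years
    return trigger, years, remainder


print(split_copyright('Copyright 1996 – 2019, Siemens'))
# ('Copyright', ['1996', '2019'], ', Siemens')
print(split_copyright('(c) 2020 Acme Corp'))
# ('(c)', ['2020'], 'Acme Corp')
```

Note that the remainder keeps any leading punctuation (", Siemens" in the first call). A caller that wants a clean company name would need a further cleanup step.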

lexnlp.extract.common.copyrights.copyright_parser module

class lexnlp.extract.common.copyrights.copyright_parser.CopyrightParser(parsing_functions: List[Callable[[str], List[lexnlp.extract.common.pattern_found.PatternFound]]], split_params: lexnlp.utils.lines_processing.line_processor.LineSplitParams)

Bases: lexnlp.extract.common.text_pattern_collector.TextPatternCollector

get_annotations_as_dictionaries() → List[dict]
make_annotation_from_pattrn(locale: str, ptrn: lexnlp.extract.common.pattern_found.PatternFound, phrase: lexnlp.utils.lines_processing.line_processor.LineOrPhrase) → lexnlp.extract.common.annotations.text_annotation.TextAnnotation

lexnlp.extract.common.copyrights.copyright_parsing_methods module

class lexnlp.extract.common.copyrights.copyright_parsing_methods.CopyrightParsingMethods

Bases: object

get_company_name_from_match(text: str, company_search_options: str, years: List[Tuple[int, int, int]]) → str
init_regexes()
init_trigger_words()
match_c_years_word(phrase: str) → List[lexnlp.extract.common.pattern_found.PatternFound]
Parameters: phrase – for example, “Copyright 1996 – 2019, Siemens”
Returns: for example, {name: ‘1996 – 2019, Siemens’, probability: 100, …}
match_word_c_years(phrase: str) → List[lexnlp.extract.common.pattern_found.PatternFound]
Parameters: phrase – for example, “© Siemens 1996 – 2019”
Returns: for example, {name: ‘© Siemens 1996 – 2019’, probability: 100, …}
pre_process_found_matches(matches: List[lexnlp.extract.common.pattern_found.PatternFound], company_search_options: str) → List[lexnlp.extract.common.copyrights.copyright_pattern_found.CopyrightPatternFound]
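
The two phrase orders handled by match_c_years_word and match_word_c_years (trigger, years, holder versus trigger, holder, years) can be told apart with two regular expressions. The sketch below is only an illustration of that idea: the patterns, the Found class and the function names are hypothetical, and the real expressions built by init_regexes are not shown in this reference.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Found:
    """Hypothetical stand-in for PatternFound; the real fields may differ."""
    name: str
    start: int
    end: int


# Illustrative building blocks (assumptions, not the library's patterns).
TRIGGER = r'(?:Copyright\b|\(\s*[Cc]\s*\)|©)'
YEARS = r'\d{4}(?:\s*[-,–]\s*\d{4})?'

# Trigger, years, holder: "Copyright 1996 – 2019, Siemens"
C_YEARS_WORD = re.compile(TRIGGER + r'\s*(?P<name>' + YEARS + r'\W+\w.*)')
# Trigger, holder, years: "© Siemens 1996 – 2019"
WORD_C_YEARS = re.compile(r'(?P<name>' + TRIGGER + r'\s*\D+?\s' + YEARS + r')')


def _collect(pattern: 're.Pattern', phrase: str) -> List[Found]:
    # Record the matched text with its start and end offsets in the phrase.
    return [Found(m.group('name'), m.start('name'), m.end('name'))
            for m in pattern.finditer(phrase)]


def sketch_c_years_word(phrase: str) -> List[Found]:
    return _collect(C_YEARS_WORD, phrase)


def sketch_word_c_years(phrase: str) -> List[Found]:
    return _collect(WORD_C_YEARS, phrase)


print(sketch_c_years_word('Copyright 1996 – 2019, Siemens')[0].name)
# 1996 – 2019, Siemens
print(sketch_word_c_years('© Siemens 1996 – 2019')[0].name)
# © Siemens 1996 – 2019
```

Both functions share the Callable[[str], List[...]] shape that CopyrightParser expects of its parsing functions, and each matches only its own phrase order, so a phrase in the other order yields an empty list.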

lexnlp.extract.common.copyrights.copyright_pattern_found module

class lexnlp.extract.common.copyrights.copyright_pattern_found.CopyrightPatternFound(ptrn: lexnlp.extract.common.pattern_found.PatternFound = None)

Bases: lexnlp.extract.common.pattern_found.PatternFound

get_detalization_level(text: str) → int
get_length() → int
pattern_worse_than_target(p, text: str) → bool

Check which pattern is better when two patterns are considered duplicates. The “text” argument may be used in derived classes.

reg_uppercase = regex.Regex('[\\p{Lu}]+', flags=regex.V0)

Module contents