lexnlp.extract.common.copyrights package¶
Submodules¶
lexnlp.extract.common.copyrights.copyright_en_style_parser module¶
Copyright extraction for English using NLTK and its pre-trained maximum entropy classifier.
This module implements basic copyright extraction for English text, relying on pre-trained NLTK components, including the POS tagger and the NE (fuzzy) chunkers.
-
class
lexnlp.extract.common.copyrights.copyright_en_style_parser.
CopyrightEnStyleParser
¶ Bases:
object
-
copyright_dates_re
= regex.Regex('\\d{2,}', flags=regex.V0)¶
-
copyright_ptn
= '((Copyright\\W\\s*|\\(\\s*[Cc]\\s*\\)\\s*|©)+\\s*(\\d{4}(?:\\s*[-,–]\\s*\\d{4})?)?\\s*(.+))'¶
-
copyright_ptn_re
= regex.Regex('((Copyright\\W\\s*|\\(\\s*[Cc]\\s*\\)\\s*|©)+\\s*(\\d{4}(?:\\s*[-,–]\\s*\\d{4})?)?\\s*(.+))', flags=regex.V0)¶
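To see what the groups of copyright_ptn capture, the pattern can be tried on a few strings. This is a minimal sketch, not the parser itself: it assumes the standard-library re module handles this particular pattern the same way as the regex package (the pattern uses no regex-only syntax), and the helper name split_notice and the sample strings are illustrative only.

```python
import re

# Same pattern string as copyright_ptn above, written as a raw string.
COPYRIGHT_PTN = (
    r'((Copyright\W\s*|\(\s*[Cc]\s*\)\s*|©)+'
    r'\s*(\d{4}(?:\s*[-,–]\s*\d{4})?)?\s*(.+))'
)
COPYRIGHT_RE = re.compile(COPYRIGHT_PTN)


def split_notice(text):
    """Hypothetical helper: return (marker, years, remainder) or None."""
    match = COPYRIGHT_RE.search(text)
    if match is None:
        return None
    # Group 2 is the last copyright marker matched, group 3 the optional
    # year or year range right after it, group 4 the rest of the line.
    return match.group(2), match.group(3), match.group(4)


if __name__ == "__main__":
    for sample in (
        "(c) 2017 Acme Corp",
        "Copyright 1996 – 2019, Siemens",
        "© Siemens 1996 – 2019",
    ):
        print(repr(sample), "->", split_notice(sample))
```

When the years follow the name instead of preceding it, as in the third sample, group 3 is empty and the year range stays inside the remainder.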
-
classmethod
derive_company_name
(ant: lexnlp.extract.common.annotations.copyright_annotation.CopyrightAnnotation, phrase: str) → None¶
-
classmethod
extract_phrases_with_coords
(sentence: str) → List[Tuple[str, int, int]]¶
-
static
get_copyright
(text: str, return_sources=False) → Generator[[lexnlp.extract.common.annotations.copyright_annotation.CopyrightAnnotation, None], None]¶
-
classmethod
get_copyright_annotations
(text: str, return_sources=False) → Generator[[lexnlp.extract.common.annotations.copyright_annotation.CopyrightAnnotation, None], None]¶ Find copyrights in text.
-
reg_company_name
= regex.Regex('[\\p{Lu}]+[\\p{L}\\s]*', flags=regex.V0)¶
-
reg_valid_company_name
= regex.Regex('\\p{L}[\\p{L}\\s,]+', flags=regex.V0)¶
-
classmethod
split_copyright_date
(ant: lexnlp.extract.common.annotations.copyright_annotation.CopyrightAnnotation) → None¶
-
classmethod
take_best_company_name
(names: List[str]) → str¶
-
year_ptn
= '(\\d{4}(?:\\s*[-,–]\\s*\\d{4})?)'¶
-
year_ptn_re
= regex.Regex('(\\d{4}(?:\\s*[-,–]\\s*\\d{4})?)$', flags=regex.V0)¶
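The year patterns can be exercised in isolation. The sketch below assumes that year_ptn_re (anchored with $) is meant to pick up a year or year range at the end of a phrase, and that copyright_dates_re pulls the individual numbers out of it; this is an inference from the patterns themselves, not a description of how split_copyright_date is implemented. It uses the standard-library re module, and trailing_years is a hypothetical helper name.

```python
import re

# Pattern strings copied from year_ptn_re and copyright_dates_re above.
YEAR_AT_END_RE = re.compile(r'(\d{4}(?:\s*[-,–]\s*\d{4})?)$')
DATE_DIGITS_RE = re.compile(r'\d{2,}')


def trailing_years(phrase):
    """Hypothetical helper: split a trailing year or year range off a phrase.

    Returns (name, years), where years is a list of ints (possibly empty).
    """
    match = YEAR_AT_END_RE.search(phrase)
    if match is None:
        return phrase, []
    # Everything before the matched years is treated as the name part.
    name = phrase[:match.start()].strip()
    years = [int(y) for y in DATE_DIGITS_RE.findall(match.group(1))]
    return name, years


if __name__ == "__main__":
    print(trailing_years("Siemens 1996 – 2019"))
    print(trailing_years("Acme 2017"))
    print(trailing_years("Acme Corp"))
```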
-
lexnlp.extract.common.copyrights.copyright_parser module¶
-
class
lexnlp.extract.common.copyrights.copyright_parser.
CopyrightParser
(parsing_functions: List[Callable[str, List[lexnlp.extract.common.pattern_found.PatternFound]]], split_params: lexnlp.utils.lines_processing.line_processor.LineSplitParams)¶ Bases:
lexnlp.extract.common.text_pattern_collector.TextPatternCollector
-
get_annotations_as_dictionaries
() → List[dict]¶
-
make_annotation_from_pattrn
(locale: str, ptrn: lexnlp.extract.common.pattern_found.PatternFound, phrase: lexnlp.utils.lines_processing.line_processor.LineOrPhrase) → lexnlp.extract.common.annotations.text_annotation.TextAnnotation¶
-
lexnlp.extract.common.copyrights.copyright_parsing_methods module¶
-
class
lexnlp.extract.common.copyrights.copyright_parsing_methods.
CopyrightParsingMethods
¶ Bases:
object
-
get_company_name_from_match
(text: str, company_search_options: str, years: List[Tuple[int, int, int]]) → str¶
-
init_regexes
()¶
-
init_trigger_words
()¶
-
match_c_years_word
(phrase: str) → List[lexnlp.extract.common.pattern_found.PatternFound]¶
Parameters: phrase – e.g. “Copyright 1996 – 2019, Siemens”
Returns: e.g. {name: ‘1996 – 2019, Siemens’, probability: 100, …}
-
match_word_c_years
(phrase: str) → List[lexnlp.extract.common.pattern_found.PatternFound]¶
Parameters: phrase – e.g. “© Siemens 1996 – 2019”
Returns: e.g. {name: ‘© Siemens 1996 – 2019’, probability: 100, …}
-
pre_process_found_matches
(matches: List[lexnlp.extract.common.pattern_found.PatternFound], company_search_options: str) → List[lexnlp.extract.common.copyrights.copyright_pattern_found.CopyrightPatternFound]¶
-
lexnlp.extract.common.copyrights.copyright_pattern_found module¶
-
class
lexnlp.extract.common.copyrights.copyright_pattern_found.
CopyrightPatternFound
(ptrn: lexnlp.extract.common.pattern_found.PatternFound = None)¶ Bases:
lexnlp.extract.common.pattern_found.PatternFound
-
get_detalization_level
(text: str) → int¶
-
get_length
() → int¶
-
pattern_worse_than_target
(p, text: str) → bool¶ Check which pattern is better when two patterns are considered duplicates. The “text” argument may be used in derived classes.
-
reg_uppercase
= regex.Regex('[\\p{Lu}]+', flags=regex.V0)¶
-