lexnlp.utils.lines_processing package¶

Submodules¶

lexnlp.utils.lines_processing.line_processor module¶

class lexnlp.utils.lines_processing.line_processor.LineOrPhrase(text='', start=0)¶

Bases: object

get_end()¶

class lexnlp.utils.lines_processing.line_processor.LineProcessor(allow_breaks_in_phrase: bool = True, line_split_params: lexnlp.utils.lines_processing.line_processor.LineSplitParams = None)¶

Bases: object

check_phrase_starts_with_phrase(src_phrase: List[lexnlp.utils.lines_processing.line_processor.SingleWord], check_start: int, checking_phrases) → bool¶

default_length = 95¶

default_split_params = <lexnlp.utils.lines_processing.line_processor.LineSplitParams object>¶

determine_line_length(text: str) → None¶

get_abbreviations_in_text(text: str) → List[Tuple[int, int]]¶

line_tail_percent = 60¶

max_possible_length = 200¶

min_possible_length = 50¶

split_text_on_line_with_endings(text: str, line_split_ptrs: lexnlp.utils.lines_processing.line_processor.LineSplitParams = None) → Generator[[lexnlp.utils.lines_processing.line_processor.LineOrPhrase, None], None]¶

split_text_on_words(text: str) → List[lexnlp.utils.lines_processing.line_processor.SingleWord]¶

word_separator_pattern = regex.Regex('[\\w]+[\\w-]*', flags=regex.V0)¶

words_to_lowercase(words: List[lexnlp.utils.lines_processing.line_processor.SingleWord])¶
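
The word_separator_pattern attribute listed above can be tried out on its own. The following is a minimal sketch, not the library's implementation: it uses the standard re module instead of regex, and split_on_words is a hypothetical helper that returns plain (text, start) tuples rather than SingleWord objects. It also skips separators, which the real split_text_on_words may keep.

```python
import re

# Same pattern as LineProcessor.word_separator_pattern, compiled with the
# standard "re" module (an assumption made for this sketch).
WORD_PATTERN = re.compile(r'[\w]+[\w-]*')


def split_on_words(text):
    """Hypothetical helper: return (word, start_offset) pairs found in text."""
    return [(m.group(), m.start()) for m in WORD_PATTERN.finditer(text)]


if __name__ == '__main__':
    for word, start in split_on_words('Hampden-Sydney College, 45 BC'):
        print(start, word)
```

Because the pattern allows hyphens after the first word character, a hyphenated name such as Hampden-Sydney stays a single word.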

class lexnlp.utils.lines_processing.line_processor.LineSplitParams¶

Bases: object

class lexnlp.utils.lines_processing.line_processor.SingleWord(text: str = '', start: int = 0, is_separator: bool = False)¶

Bases: object

get_end()¶

lexnlp.utils.lines_processing.parsed_text_corrector module¶

class lexnlp.utils.lines_processing.parsed_text_corrector.ParsedTextCorrector¶

Bases: object

“Corrects” the given text if a ParsedTextQualityEstimator instance reports one of the possible violation types.

For now, the only possible “violation” is a number of “unnecessary” line breaks (\n\n).

correct_if_corrupted(text: str) → str¶

Checks the text and corrects it if it is corrupted. Let’s assume the text is:

1.1 Etymology

Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at

Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered

the undoubtable source.

param text: a text containing a number of \n\n sequences, see above

return: the same text with the 2 unnecessary double line breaks removed:

1.1 Etymology

Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source.
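
The correction described above can be approximated with a short standalone sketch. This is not the library's algorithm: join_broken_paragraphs is a hypothetical helper, and the rule it applies (merge a block into the previous one when the previous block is not a numbered header and does not end with a sentence-break character) is an assumption chosen to reproduce the example. The header pattern and the character set are copied from the ParsedTextQualityEstimator attributes documented on this page.

```python
import re

# Copied from the documented ParsedTextQualityEstimator attributes.
SENTENCE_BREAK_CHARS = {'!', ',', '.', ';', '?'}
NUMBERED_HEADER = re.compile(r'(^[\s]*\(?[a-zA-Z]\)?\s)|(^[\s]*[0-9\.]+[\)]?\s)')


def join_broken_paragraphs(text):
    """Hypothetical helper: drop double line breaks that split a sentence."""
    blocks = text.split('\n\n')
    result = [blocks[0]]
    for block in blocks[1:]:
        prev = result[-1]
        prev_is_header = bool(NUMBERED_HEADER.match(prev))
        prev_ends_sentence = prev.rstrip()[-1:] in SENTENCE_BREAK_CHARS
        if prev.strip() and not prev_is_header and not prev_ends_sentence:
            # The previous block stops mid-sentence: glue this block onto it.
            result[-1] = prev.rstrip() + ' ' + block.lstrip()
        else:
            result.append(block)
    return '\n\n'.join(result)


if __name__ == '__main__':
    sample = ('1.1 Etymology\n\nRichard McClintock, a Latin professor at'
              '\n\nHampden-Sydney College, discovered\n\nthe undoubtable source.')
    print(join_broken_paragraphs(sample))
```

The header check keeps the break after "1.1 Etymology", while the two breaks that interrupt the sentence are removed.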

correct_line_breaks(text: str, estimator: lexnlp.utils.lines_processing.parsed_text_quality_estimator.ParsedTextQualityEstimator = None) → str¶

normalize_line_ending(line: lexnlp.utils.lines_processing.line_processor.LineOrPhrase)¶

lexnlp.utils.lines_processing.parsed_text_quality_estimator module¶

class lexnlp.utils.lines_processing.parsed_text_quality_estimator.LineType¶

Bases: enum.Enum

An enumeration.

header = 2¶

paragraph_start = 3¶

regular = 1¶

class lexnlp.utils.lines_processing.parsed_text_quality_estimator.ParsedTextQualityEstimate¶

Bases: object

A complex estimate for a text fragment:

corrupted_prob: the probability of the text being “corrupted”. Currently equal to extra_line_breaks_prob.

extra_line_breaks_prob: the probability (0..100) of the text containing unnecessary line breaks (\n\n).

avg_line_length: the average line length within the text (in characters).

class lexnlp.utils.lines_processing.parsed_text_quality_estimator.ParsedTextQualityEstimator¶

Bases: object

Estimates the probability that the given text is corrupted.

check_line_followed_by_unnecessary_break(line_index: int) → bool¶

determine_line_type(line: lexnlp.utils.lines_processing.parsed_text_quality_estimator.TypedLineOrPhrase)¶

estimate_extra_line_breaks()¶

estimate_line_is_header_prob(line: str) → int¶

estimate_line_is_paragraph_start_prob(line: str) → int¶

estimate_text(text: str) → lexnlp.utils.lines_processing.parsed_text_quality_estimator.ParsedTextQualityEstimate¶

Let’s assume the text is:

Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at

Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered

the undoubtable source.

param text: a text containing a number of \n\n sequences, see above

return:	ParsedTextQualityEstimate: {'avg_line_length': 103, 'extra_line_breaks_prob': 66, 'corrupted_prob': 66}
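
To make the three fields of the estimate concrete, here is a simplified sketch. It is not the estimator's actual scoring: estimate_extra_breaks is a hypothetical function, and treating every non-blank line that does not end with a sentence-break character as evidence of an unnecessary break is an assumption. It does not claim to reproduce the numbers shown in the example above.

```python
# Copied from ParsedTextQualityEstimator.sentence_break_chars.
SENTENCE_BREAK_CHARS = {'!', ',', '.', ';', '?'}


def estimate_extra_breaks(text):
    """Hypothetical helper: return a dict shaped like the documented estimate."""
    lines = [ln for ln in text.split('\n') if ln.strip()]
    if not lines:
        return {'avg_line_length': 0, 'extra_line_breaks_prob': 0,
                'corrupted_prob': 0}
    avg_len = sum(len(ln) for ln in lines) // len(lines)
    # Count lines that stop without sentence-ending punctuation.
    unfinished = sum(1 for ln in lines
                     if ln.rstrip()[-1] not in SENTENCE_BREAK_CHARS)
    prob = 100 * unfinished // len(lines)
    # corrupted_prob mirrors extra_line_breaks_prob, as the docs describe.
    return {'avg_line_length': avg_len, 'extra_line_breaks_prob': prob,
            'corrupted_prob': prob}


if __name__ == '__main__':
    print(estimate_extra_breaks('alpha beta\n\ngamma delta\n\nend.'))
```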

minimal_paragraph_line_length = 250¶

reg_numered_header = re.compile('(^[\\s]*\\(?[a-zA-Z]\\)?\\s)|(^[\\s]*[0-9\\.]+[\\)]?\\s)')¶

reg_paragraph_start = re.compile('(^\\s{2})|(^\\t)')¶

sentence_break_chars = {'!', ',', '.', ';', '?'}¶

split_text_on_lines(text: str)¶
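
The two class-level patterns, reg_numered_header and reg_paragraph_start, suggest how a line could be mapped onto a LineType. The sketch below is an illustration under assumptions: the LineType enum is re-declared locally with the documented values, classify_line is a hypothetical function, and checking the header pattern before the paragraph pattern is a choice made here. The real estimator exposes probability-based methods (estimate_line_is_header_prob, estimate_line_is_paragraph_start_prob) whose logic is not shown on this page.

```python
import enum
import re


class LineType(enum.Enum):
    # Local copy of the documented enumeration values.
    regular = 1
    header = 2
    paragraph_start = 3


# Patterns copied from the documented class attributes.
REG_NUMBERED_HEADER = re.compile(r'(^[\s]*\(?[a-zA-Z]\)?\s)|(^[\s]*[0-9\.]+[\)]?\s)')
REG_PARAGRAPH_START = re.compile(r'(^\s{2})|(^\t)')


def classify_line(line):
    """Hypothetical helper: pick a LineType from the two patterns alone."""
    if REG_NUMBERED_HEADER.match(line):
        return LineType.header
    if REG_PARAGRAPH_START.match(line):
        return LineType.paragraph_start
    return LineType.regular


if __name__ == '__main__':
    for sample in ('1.1 Etymology', '  Indented text', 'Plain text'):
        print(repr(sample), classify_line(sample).name)
```

Note that the header pattern also matches any line starting with a single letter followed by whitespace, so a line such as "a Latin professor" would be classified as a header by this sketch.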

class lexnlp.utils.lines_processing.parsed_text_quality_estimator.TypedLineOrPhrase¶

Bases: lexnlp.utils.lines_processing.line_processor.LineOrPhrase

Extends the LineOrPhrase class (text, ending, position). Adds a LineType attribute specifying the line’s “role” within the text.

static wrap_line(l: lexnlp.utils.lines_processing.line_processor.LineOrPhrase)¶

lexnlp.utils.lines_processing.phrase_finder module¶

class lexnlp.utils.lines_processing.phrase_finder.PhraseFinder(phrase_set: List[str], extra_format_function=None)¶

Bases: object

The class holds a collection of short strings (usually 1 to 3 words). PhraseFinder searches for these strings (phrases) in the given text, either ignoring or respecting case.

find_word(phrase: str, ignore_case: bool = True) → List[Tuple[str, int, int]]¶

Parameters:	phrase – "Tis better using France than trusting France: let us be back'd with God and with the seas"
	ignore_case – True
Returns:	[('better', 4, 9), (' let us ', 46, 52)]

In this example, the PhraseFinder instance was initialized as: PhraseFinder([' let us ', 'better', 'the sea'])
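
The search can be approximated without the library. This is a minimal sketch under assumptions: find_phrases is a hypothetical function, phrases are stripped and matched on word boundaries with the standard re module, and offsets are Python match spans (end exclusive), which may not follow the same convention as the tuples returned by find_word.

```python
import re
from typing import List, Tuple


def find_phrases(text: str, phrases: List[str],
                 ignore_case: bool = True) -> List[Tuple[str, int, int]]:
    """Hypothetical helper: locate whole-word occurrences of each phrase."""
    flags = re.IGNORECASE if ignore_case else 0
    found = []
    for phrase in phrases:
        # Word boundaries keep 'the sea' from matching inside 'the seas'.
        pattern = r'\b' + re.escape(phrase.strip()) + r'\b'
        for match in re.finditer(pattern, text, flags):
            found.append((phrase, match.start(), match.end()))
    # Report hits in the order they appear in the text.
    return sorted(found, key=lambda item: item[1])


if __name__ == '__main__':
    line = ("Tis better using France than trusting France: "
            "let us be back'd with God and with the seas")
    for hit in find_phrases(line, [' let us ', 'better', 'the sea']):
        print(hit)
```

As in the documented example, 'the sea' produces no hit because the text contains "the seas".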

word_to_regex(word: str, ignore_case: bool) → str¶
