lexnlp.utils.lines_processing package

Submodules

lexnlp.utils.lines_processing.line_processor module

class lexnlp.utils.lines_processing.line_processor.LineOrPhrase(text='', start=0)

Bases: object

get_end()
class lexnlp.utils.lines_processing.line_processor.LineProcessor(allow_breaks_in_phrase: bool = True, line_split_params: lexnlp.utils.lines_processing.line_processor.LineSplitParams = None)

Bases: object

check_phrase_starts_with_phrase(src_phrase: List[lexnlp.utils.lines_processing.line_processor.SingleWord], check_start: int, checking_phrases) → bool
default_length = 95
default_split_params = <lexnlp.utils.lines_processing.line_processor.LineSplitParams object>
determine_line_length(text: str) → None
get_abbreviations_in_text(text: str) → List[Tuple[int, int]]
line_tail_percent = 60
max_possible_length = 200
min_possible_length = 50
split_text_on_line_with_endings(text: str, line_split_ptrs: lexnlp.utils.lines_processing.line_processor.LineSplitParams = None) → Generator[[lexnlp.utils.lines_processing.line_processor.LineOrPhrase, None], None]
split_text_on_words(text: str) → List[lexnlp.utils.lines_processing.line_processor.SingleWord]
word_separator_pattern = regex.Regex('[\\w]+[\\w-]*', flags=regex.V0)
words_to_lowercase(words: List[lexnlp.utils.lines_processing.line_processor.SingleWord])
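The word_separator_pattern attribute above can be tried out in isolation. The following is a minimal sketch, not the library's implementation: it uses the standard-library re module in place of regex, and the helper name split_on_words is hypothetical. It returns only (text, start) pairs for matched words; whether split_text_on_words also emits separator entries is not shown in this reference.

```python
import re
from typing import List, Tuple

# Same pattern text as word_separator_pattern, compiled with stdlib `re`
# (the library itself uses the third-party `regex` package).
WORD_PATTERN = re.compile(r"[\w]+[\w-]*")


def split_on_words(text: str) -> List[Tuple[str, int]]:
    """Hypothetical helper: return (word, start offset) for every match."""
    return [(m.group(), m.start()) for m in WORD_PATTERN.finditer(text)]


if __name__ == "__main__":
    for word, start in split_on_words("Hampden-Sydney College, 45 BC"):
        print(start, word)
```

Because the second character class admits hyphens, a hyphenated name such as Hampden-Sydney is kept as a single word.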
class lexnlp.utils.lines_processing.line_processor.LineSplitParams

Bases: object

class lexnlp.utils.lines_processing.line_processor.SingleWord(text: str = '', start: int = 0, is_separator: bool = False)

Bases: object

get_end()

lexnlp.utils.lines_processing.parsed_text_corrector module

class lexnlp.utils.lines_processing.parsed_text_corrector.ParsedTextCorrector

Bases: object

“Corrects” the given text when a ParsedTextQualityEstimator instance reports one of the possible violation types.

For now the only possible “violation” is a number of “unnecessary” line breaks (\n\n).

correct_if_corrupted(text: str) → str

Checks the text and corrects it if corrupted. Let’s assume the text is:

1.1 Etymology

Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at

Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered

the undoubtable source.

param text: a text containing a number of \n\n sequences, see above
return: the same text without 2 double line breaks:

1.1 Etymology

Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source.
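The effect described above can be approximated with a short standalone sketch. It is an illustration under stated assumptions, not the ParsedTextCorrector code: the function name join_broken_lines, the simplified numbered-header pattern, and the rule "merge when the previous block does not end a sentence" are all hypothetical.

```python
import re

# Assumption: a block that ends with one of these characters is complete.
SENTENCE_END = (".", "!", "?")
# Assumption: a simplified numbered-header test, e.g. "1.1 Etymology".
NUMBERED_HEADER = re.compile(r"^\s*[0-9.]+\)?\s")


def join_broken_lines(text: str) -> str:
    """Merge blocks separated by a double line break when the previous
    block neither ends a sentence nor looks like a numbered header."""
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    merged = []
    for block in blocks:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and not prev.endswith(SENTENCE_END)
            and not NUMBERED_HEADER.match(prev)
        ):
            # The break after `prev` is treated as unnecessary.
            merged[-1] = prev + " " + block
        else:
            merged.append(block)
    return "\n\n".join(merged)


if __name__ == "__main__":
    sample = (
        "1.1 Etymology\n\nIt has roots in a piece of\n\n"
        "classical Latin literature\n\nfrom 45 BC."
    )
    print(join_broken_lines(sample))
```

On the sample, the break after the header is kept and the two breaks inside the sentence are removed, mirroring the behaviour of the documented example.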

correct_line_breaks(text: str, estimator: lexnlp.utils.lines_processing.parsed_text_quality_estimator.ParsedTextQualityEstimator = None) → str
normalize_line_ending(line: lexnlp.utils.lines_processing.line_processor.LineOrPhrase)

lexnlp.utils.lines_processing.parsed_text_quality_estimator module

class lexnlp.utils.lines_processing.parsed_text_quality_estimator.LineType

Bases: enum.Enum

An enumeration.

header = 2
paragraph_start = 3
regular = 1
class lexnlp.utils.lines_processing.parsed_text_quality_estimator.ParsedTextQualityEstimate

Bases: object

A complex estimate for a text fragment:
  • corrupted_prob: the probability of the text being “corrupted”. Currently equal to extra_line_breaks_prob.
  • extra_line_breaks_prob: the probability (0..100) of the text containing unnecessary line breaks (\n\n).
  • avg_line_length: the average line length within the text (in characters).
class lexnlp.utils.lines_processing.parsed_text_quality_estimator.ParsedTextQualityEstimator

Bases: object

Estimates the probability that the given text is somewhat corrupted.

check_line_followed_by_unnecessary_break(line_index: int) → bool
determine_line_type(line: lexnlp.utils.lines_processing.parsed_text_quality_estimator.TypedLineOrPhrase)
estimate_extra_line_breaks()
estimate_line_is_header_prob(line: str) → int
estimate_line_is_paragraph_start_prob(line: str) → int
estimate_text(text: str) → lexnlp.utils.lines_processing.parsed_text_quality_estimator.ParsedTextQualityEstimate
Let’s assume the text is:

Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at

Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered

the undoubtable source.

param text: a text containing a number of \n\n sequences, see above
return: ParsedTextQualityEstimate: {'avg_line_length': 103, 'extra_line_breaks_prob': 66, 'corrupted_prob': 66}
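To make the shape of such an estimate concrete, here is a rough standalone sketch. It is not the library's algorithm: the function name estimate_sketch and the rule "a line that does not end with a sentence-break character is followed by an unnecessary break" are assumptions made for illustration, and the numbers it produces need not match those returned by estimate_text.

```python
from typing import Dict

# Same characters as the sentence_break_chars attribute of the estimator.
SENTENCE_BREAK_CHARS = {"!", ",", ".", ";", "?"}


def estimate_sketch(text: str) -> Dict[str, int]:
    """Hypothetical estimate: share of lines (0..100) that end mid-sentence,
    plus the average line length in characters."""
    lines = [ln.strip() for ln in text.split("\n\n") if ln.strip()]
    if not lines:
        return {"avg_line_length": 0, "extra_line_breaks_prob": 0, "corrupted_prob": 0}
    # Assumption: a line ending without punctuation was cut by a stray break.
    suspicious = sum(1 for ln in lines if ln[-1] not in SENTENCE_BREAK_CHARS)
    prob = 100 * suspicious // len(lines)
    avg = sum(len(ln) for ln in lines) // len(lines)
    # corrupted_prob simply mirrors extra_line_breaks_prob, as documented.
    return {"avg_line_length": avg, "extra_line_breaks_prob": prob, "corrupted_prob": prob}


if __name__ == "__main__":
    print(estimate_sketch("alpha beta\n\ngamma delta\n\nthe end."))
```

For a three-line text whose first two lines end mid-sentence, this sketch reports an extra_line_breaks_prob of 66; the library may arrive at its figures differently.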
minimal_paragraph_line_length = 250
reg_numered_header = re.compile('(^[\\s]*\\(?[a-zA-Z]\\)?\\s)|(^[\\s]*[0-9\\.]+[\\)]?\\s)')
reg_paragraph_start = re.compile('(^\\s{2})|(^\\t)')
sentence_break_chars = {'!', ',', '.', ';', '?'}
split_text_on_lines(text: str)
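The two regular expressions listed above (reg_numered_header and reg_paragraph_start) can be exercised directly with the standard re module. The sketch below copies the pattern text from this reference; the constant names, the classify_line helper, and the order in which the checks are applied are assumptions, and the returned strings merely echo the LineType member names.

```python
import re

# Pattern text copied from the class attributes documented above.
HEADER_RE = re.compile(r"(^[\s]*\(?[a-zA-Z]\)?\s)|(^[\s]*[0-9\.]+[\)]?\s)")
PARAGRAPH_START_RE = re.compile(r"(^\s{2})|(^\t)")


def classify_line(line: str) -> str:
    """Hypothetical classifier: header first, then paragraph start."""
    if HEADER_RE.match(line):
        return "header"
    if PARAGRAPH_START_RE.match(line):
        return "paragraph_start"
    return "regular"


if __name__ == "__main__":
    for sample in ["1.1 Etymology", "(a) Definitions", "  Indented text",
                   "\tTabbed text", "Contrary to popular belief"]:
        print(repr(sample), "->", classify_line(sample))
```

One consequence of the first alternative in the header pattern is that any line beginning with a single letter followed by whitespace, such as "A short line", also matches as a header.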
class lexnlp.utils.lines_processing.parsed_text_quality_estimator.TypedLineOrPhrase

Bases: lexnlp.utils.lines_processing.line_processor.LineOrPhrase

Extends the LineOrPhrase class (text - ending - position) by adding a LineType attribute that specifies the line’s “role” within the text.

static wrap_line(l: lexnlp.utils.lines_processing.line_processor.LineOrPhrase)

lexnlp.utils.lines_processing.phrase_finder module

class lexnlp.utils.lines_processing.phrase_finder.PhraseFinder(phrase_set: List[str], extra_format_function=None)

Bases: object

The class holds a collection of short strings (usually 1 to 3 words). PhraseFinder searches for these strings (phrases) in the given text, either ignoring or respecting case.

find_word(phrase: str, ignore_case: bool = True) → List[Tuple[str, int, int]]
Parameters:
  • phrase – “Tis better using France than trusting France: let us be back’d with God and with the seas”
  • ignore_case – True
Returns:

[('better', 4, 9), (' let us ', 46, 52)]

This assumes the PhraseFinder instance was initialized as
PhraseFinder([' let us ', 'better', 'the sea'])
word_to_regex(word: str, ignore_case: bool) → str
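The general idea of searching a text for a fixed set of phrases can be sketched with the standard re module. This is a hedged illustration rather than the PhraseFinder implementation: the helper names phrase_to_pattern and find_phrases are hypothetical, word boundaries are added only where a phrase starts or ends with an alphanumeric character, and offsets follow Python's slicing convention (end exclusive), so they need not coincide with the positions shown in the example above.

```python
import re
from typing import List, Tuple


def phrase_to_pattern(phrase: str) -> str:
    """Escape the phrase and add word-boundary anchors on word-like edges."""
    left = r"\b" if phrase[:1].isalnum() else ""
    right = r"\b" if phrase[-1:].isalnum() else ""
    return left + re.escape(phrase) + right


def find_phrases(text: str, phrases: List[str],
                 ignore_case: bool = True) -> List[Tuple[str, int, int]]:
    """Return (phrase, start, end) for each hit, ordered by start offset."""
    flags = re.IGNORECASE if ignore_case else 0
    found = []
    for phrase in phrases:
        for m in re.finditer(phrase_to_pattern(phrase), text, flags):
            found.append((phrase, m.start(), m.end()))
    return sorted(found, key=lambda item: item[1])


if __name__ == "__main__":
    text = ("Tis better using France than trusting France: "
            "let us be back'd with God and with the seas")
    print(find_phrases(text, [" let us ", "better", "the sea"]))
```

With the boundary anchors in place, 'the sea' does not match inside "the seas", which is consistent with that phrase being absent from the documented result.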

Module contents