lexnlp.utils.lines_processing package¶
Submodules¶
lexnlp.utils.lines_processing.line_processor module¶
-
class
lexnlp.utils.lines_processing.line_processor.
LineOrPhrase
(text='', start=0)¶ Bases:
object
-
get_end
()¶
-
-
class
lexnlp.utils.lines_processing.line_processor.
LineProcessor
(allow_breaks_in_phrase: bool = True, line_split_params: lexnlp.utils.lines_processing.line_processor.LineSplitParams = None)¶ Bases:
object
-
check_phrase_starts_with_phrase
(src_phrase: List[lexnlp.utils.lines_processing.line_processor.SingleWord], check_start: int, checking_phrases) → bool¶
-
default_length
= 95¶
-
default_split_params
= <lexnlp.utils.lines_processing.line_processor.LineSplitParams object>¶
-
determine_line_length
(text: str) → None¶
-
get_abbreviations_in_text
(text: str) → List[Tuple[int, int]]¶
-
line_tail_percent
= 60¶
-
max_possible_length
= 200¶
-
min_possible_length
= 50¶
-
split_text_on_line_with_endings
(text: str, line_split_ptrs: lexnlp.utils.lines_processing.line_processor.LineSplitParams = None) → Generator[[lexnlp.utils.lines_processing.line_processor.LineOrPhrase, None], None]¶
-
split_text_on_words
(text: str) → List[lexnlp.utils.lines_processing.line_processor.SingleWord]¶
-
word_separator_pattern
= regex.Regex('[\\w]+[\\w-]*', flags=regex.V0)¶
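As a rough illustration of what word_separator_pattern matches, the sketch below applies the same expression with the standard-library re module (the attribute itself is compiled with the third-party regex package). The helper name split_on_words and the (word, start) tuple shape are assumptions made for this example; they are not the library's SingleWord type or its split_text_on_words implementation.

```python
import re
from typing import List, Tuple

# Same expression as word_separator_pattern, compiled with the stdlib `re`
# module so the sketch has no third-party dependency (an assumption).
WORD_PATTERN = re.compile(r'[\w]+[\w-]*')


def split_on_words(text: str) -> List[Tuple[str, int]]:
    """Hypothetical helper: return (word, start offset) pairs."""
    return [(m.group(), m.start()) for m in WORD_PATTERN.finditer(text)]


# A hyphen inside a word is kept; a leading hyphen is not part of the word.
words = split_on_words("Hampden-Sydney College, in Virginia")
```

With this pattern a hyphenated name such as "Hampden-Sydney" stays a single word, while punctuation and whitespace act as separators.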
-
words_to_lowercase
(words: List[lexnlp.utils.lines_processing.line_processor.SingleWord])¶
-
-
class
lexnlp.utils.lines_processing.line_processor.
LineSplitParams
¶ Bases:
object
lexnlp.utils.lines_processing.parsed_text_corrector module¶
-
class
lexnlp.utils.lines_processing.parsed_text_corrector.
ParsedTextCorrector
¶ Bases:
object
“Corrects” the given text when a ParsedTextQualityEstimator instance reports one of the possible violation types.
For now, the only possible “violation” is a number of “unnecessary” double line breaks.
-
correct_if_corrupted
(text: str) → str¶ Checks the text and corrects it if it is corrupted. Let’s assume the text is:
1.1 Etymology
Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at
Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered
the undoubtable source.
param text: a text containing a number of double line breaks, as in the example above
return: the same text without 2 double line breaks:
1.1 Etymology
Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source.
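The idea of removing unnecessary breaks can be sketched as follows. This is a simplified heuristic, not LexNLP's implementation: the function name join_broken_lines, the simplified header pattern, and the rule "a double line break is unnecessary when the preceding line is not header-like and does not end with a sentence-break character" are all assumptions for illustration.

```python
import re

# Characters after which a break is treated as intentional (assumption:
# mirrors the sentence_break_chars set listed for ParsedTextQualityEstimator).
SENTENCE_BREAK_CHARS = {'!', ',', '.', ';', '?'}
# Header-like lines such as "1.1 Etymology" (simplified, hypothetical rule).
NUMBERED_HEADER = re.compile(r'^\s*[0-9.]+\)?\s')


def join_broken_lines(text: str) -> str:
    """Hypothetical sketch: merge lines split by unnecessary double breaks."""
    lines = text.split('\n\n')
    result = lines[0]
    prev = lines[0]
    for line in lines[1:]:
        last_char = prev.rstrip()[-1:]  # '' when the previous line is blank
        keep_break = (
            last_char == ''
            or last_char in SENTENCE_BREAK_CHARS
            or NUMBERED_HEADER.match(prev) is not None
        )
        if keep_break:
            result += '\n\n' + line
        else:
            # The sentence continues, so glue the next line on with a space.
            result += ' ' + line.lstrip()
        prev = line
    return result


sample = ("1.1 Etymology\n\nIt has roots in a piece of classical Latin"
          "\n\nliterature from 45 BC.\n\nA new paragraph.")
fixed = join_broken_lines(sample)
```

In this sketch the header keeps its break, the sentence split after "Latin" is rejoined, and the break after "45 BC." survives because the line ends with a period. A continuation line that happens to start with a number would be mistaken for a header, which is one reason a real estimator needs more than this rule.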
-
correct_line_breaks
(text: str, estimator: lexnlp.utils.lines_processing.parsed_text_quality_estimator.ParsedTextQualityEstimator = None) → str¶
-
normalize_line_ending
(line: lexnlp.utils.lines_processing.line_processor.LineOrPhrase)¶
-
lexnlp.utils.lines_processing.parsed_text_quality_estimator module¶
-
class
lexnlp.utils.lines_processing.parsed_text_quality_estimator.
LineType
¶ Bases:
enum.Enum
An enumeration.
-
header
= 2¶
-
paragraph_start
= 3¶
-
regular
= 1¶
-
-
class
lexnlp.utils.lines_processing.parsed_text_quality_estimator.
ParsedTextQualityEstimate
¶ Bases:
object
A complex estimate for a text fragment:
- corrupted_prob: the probability of the text being “corrupted”. Currently equal to extra_line_breaks_prob.
- extra_line_breaks_prob: the probability (0..100) of the text containing unnecessary line breaks.
- avg_line_length: the average line length within the text (in characters).
-
class
lexnlp.utils.lines_processing.parsed_text_quality_estimator.
ParsedTextQualityEstimator
¶ Bases:
object
Estimates the probability that the given text is corrupted.
-
check_line_followed_by_unnecessary_break
(line_index: int) → bool¶
-
determine_line_type
(line: lexnlp.utils.lines_processing.parsed_text_quality_estimator.TypedLineOrPhrase)¶
-
estimate_extra_line_breaks
()¶
-
estimate_line_is_header_prob
(line: str) → int¶
-
estimate_line_is_paragraph_start_prob
(line: str) → int¶
-
estimate_text
(text: str) → lexnlp.utils.lines_processing.parsed_text_quality_estimator.ParsedTextQualityEstimate¶ Let’s assume the text is:
Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at
Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered
the undoubtable source.
param text: a text containing a number of double line breaks, as in the example above
return: ParsedTextQualityEstimate: {‘avg_line_length’: 103, ‘extra_line_breaks_prob’: 66, ‘corrupted_prob’: 66}
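To make the three fields of the estimate concrete, here is a minimal sketch of one way such numbers could be computed. It is not the library's algorithm and is not expected to reproduce the values in the example above: the function name estimate_extra_breaks, the use of a plain dict, and the scoring rule (the share of double line breaks that follow a line not ending in a sentence-break character) are assumptions.

```python
from typing import Dict

# Assumption: the same characters as the sentence_break_chars attribute.
SENTENCE_BREAK_CHARS = {'!', ',', '.', ';', '?'}


def estimate_extra_breaks(text: str) -> Dict[str, int]:
    """Hypothetical sketch of a line-break quality estimate (values 0..100)."""
    lines = [ln.strip() for ln in text.split('\n\n') if ln.strip()]
    if not lines:
        return {'avg_line_length': 0, 'extra_line_breaks_prob': 0,
                'corrupted_prob': 0}
    avg_len = sum(len(ln) for ln in lines) // len(lines)
    # Every line except the last one is followed by a double line break.
    followed = lines[:-1]
    suspicious = sum(1 for ln in followed
                     if ln[-1] not in SENTENCE_BREAK_CHARS)
    prob = 100 * suspicious // len(followed) if followed else 0
    # corrupted_prob simply mirrors extra_line_breaks_prob, as documented.
    return {'avg_line_length': avg_len, 'extra_line_breaks_prob': prob,
            'corrupted_prob': prob}


estimate = estimate_extra_breaks("one two three\n\nfour five.\n\nsix")
```

Here one of the two breaks follows a line that stops mid-sentence, so the sketch reports 50. Header lines would also be counted as suspicious under this rule, which the real estimator presumably avoids through its line-type checks.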
-
minimal_paragraph_line_length
= 250¶
-
reg_numered_header
= re.compile('(^[\\s]*\\(?[a-zA-Z]\\)?\\s)|(^[\\s]*[0-9\\.]+[\\)]?\\s)')¶
-
reg_paragraph_start
= re.compile('(^\\s{2})|(^\\t)')¶
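The two expressions above can be tried directly with the standard re module. The sketch below copies the patterns as listed and checks a few sample lines; the sample strings are made up for illustration, and how the estimator itself applies the patterns is not shown in this reference.

```python
import re

# Patterns copied from the reg_numered_header and reg_paragraph_start
# attributes listed above.
reg_numered_header = re.compile(
    r'(^[\s]*\(?[a-zA-Z]\)?\s)|(^[\s]*[0-9\.]+[\)]?\s)')
reg_paragraph_start = re.compile(r'(^\s{2})|(^\t)')

# Illustrative sample lines (not taken from the library).
samples = [
    "1.1 Etymology",
    "(a) Definitions",
    "Contrary to popular belief",
    "  Indented text",
    "\tTabbed text",
]

header_like = [s for s in samples if reg_numered_header.search(s)]
paragraph_like = [s for s in samples if reg_paragraph_start.search(s)]
```

The header pattern accepts a leading number such as "1.1" or a single letter with optional parentheses, followed by whitespace. As a consequence, a line beginning with a one-letter word such as "A contract" also matches the first alternative.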
-
sentence_break_chars
= {'!', ',', '.', ';', '?'}¶
-
split_text_on_lines
(text: str)¶
-
-
class
lexnlp.utils.lines_processing.parsed_text_quality_estimator.
TypedLineOrPhrase
¶ Bases:
lexnlp.utils.lines_processing.line_processor.LineOrPhrase
Extends the LineOrPhrase class (text, ending, position). Adds a LineType attribute specifying the line’s “role” within the text.
-
static
wrap_line
(l: lexnlp.utils.lines_processing.line_processor.LineOrPhrase)¶
-
lexnlp.utils.lines_processing.phrase_finder module¶
-
class
lexnlp.utils.lines_processing.phrase_finder.
PhraseFinder
(phrase_set: List[str], extra_format_function=None)¶ Bases:
object
The class contains a collection of short strings (usually 1, 2 or 3 words). PhraseFinder searches for these strings (phrases) in the given text, either ignoring or respecting case.
-
find_word
(phrase: str, ignore_case: bool = True) → List[Tuple[str, int, int]]¶
Parameters:
- phrase – the text to search, e.g. “Tis better using France than trusting France: let us be back’d with God and with the seas”
- ignore_case – e.g. True
Returns: e.g. [ (‘better’, 4, 9), (‘ let us ‘, 46, 52) ], provided the PhraseFinder instance had been initialized like PhraseFinder([‘ let us ‘, ‘better’, ‘the sea’])
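A minimal sketch of phrase searching that returns (phrase, start, end) tuples is given below. It uses plain substring matching with the standard re module; the function name find_phrases, the end-exclusive offsets, and the sorting by start position are assumptions for illustration and do not describe how PhraseFinder or word_to_regex actually work.

```python
import re
from typing import List, Tuple


def find_phrases(text: str, phrases: List[str],
                 ignore_case: bool = True) -> List[Tuple[str, int, int]]:
    """Hypothetical sketch: (phrase, start, end) per match, ordered by start."""
    flags = re.IGNORECASE if ignore_case else 0
    found = []
    for phrase in phrases:
        # re.escape keeps characters in the phrase from acting as regex syntax.
        for m in re.finditer(re.escape(phrase), text, flags):
            found.append((phrase, m.start(), m.end()))
    return sorted(found, key=lambda item: item[1])


hits = find_phrases("Tis better using France: let us be back'd with the seas",
                    [' let us ', 'better', 'the sea'])
```

Because this sketch matches substrings, it reports 'the sea' inside "the seas". The documented example output omits that phrase, which suggests the library applies stricter matching than shown here.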
-
word_to_regex
(word: str, ignore_case: bool) → str¶
-