lexnlp.nlp.en.segments package¶
Submodules¶
lexnlp.nlp.en.segments.pages module¶
lexnlp.nlp.en.segments.paragraphs module¶
lexnlp.nlp.en.segments.sections module¶
lexnlp.nlp.en.segments.sentences module¶
Sentence segmentation for English.
This module implements sentence segmentation in English using simple machine learning classifiers.
Todo:

- Standardize model (re-)generation
lexnlp.nlp.en.segments.sentences.build_sentence_model(text, extra_abbrevs=None)¶
Build a sentence model from text, with optional extra abbreviations to include.
lexnlp.nlp.en.segments.sentences.get_sentence_list(text)¶
Get a list of sentences from text.
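LexNLP's segmenter is built on a trained Punkt model with legal abbreviations; as a minimal stdlib-only sketch of the interface (not the library's implementation), a naive splitter on sentence-final punctuation looks like this:

```python
import re

def get_sentence_list(text: str) -> list:
    """Naive sentence splitter: break after ., ? or ! followed by whitespace.

    Illustration of the interface only; the real LexNLP function uses a
    trained Punkt model and handles abbreviations like "U.S." correctly.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

sentences = get_sentence_list("This Agreement is binding. It takes effect today.")
```

A naive splitter like this breaks on "U.S. Person", which is exactly why the library trains a model with extra abbreviations instead.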
lexnlp.nlp.en.segments.sentences.get_sentence_span(text: str) → Generator[Tuple[int, int, str], Any, Any]¶
Given a text, yields (start, end, sentence) tuples for each sentence in the text.
lexnlp.nlp.en.segments.sentences.get_sentence_span_list(text) → List[Tuple[int, int, str]]¶
Given a text, returns a list of (start, end, sentence) tuples for the sentences in the text.
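To illustrate the (start, end, sentence) tuple shape these span functions return, here is a stdlib-only sketch with a deliberately naive sentence pattern (the real functions delegate to a Punkt-based tokenizer):

```python
import re
from typing import List, Tuple

def get_sentence_span_list(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, sentence) tuples over the original text.

    Illustrative only: a "sentence" here is a maximal run of characters
    ending in ., ! or ? (or end of text); offsets index into `text`.
    """
    spans = []
    for match in re.finditer(r'[^.!?]+[.!?]?', text):
        if match.group().strip():
            spans.append((match.start(), match.end(), match.group()))
    return spans

spans = get_sentence_span_list("One. Two.")
```

Because the offsets refer to the original string, `text[start:end]` always reproduces the third tuple element, which makes the spans usable for highlighting or slicing.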
lexnlp.nlp.en.segments.sentences.normalize_text(text: str) → str¶
Simple text pre-processing: replaces "not-quite Unicode" symbols (e.g. curly quotation marks) with their common ASCII equivalents so that the get_sentence_span function parses sentences more reliably. :param text: “U.S. Person” means any Person :return: "U.S. Person" means any Person
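The technique can be sketched as a simple replacement table; the mapping below is a minimal illustrative subset assumed for this example, not the library's actual table:

```python
def normalize_text(text: str) -> str:
    """Replace common "not-quite Unicode" symbols with ASCII equivalents.

    Minimal illustrative subset; the real function's replacement table
    may cover more characters.
    """
    replacements = {
        '\u201c': '"',   # left double quotation mark
        '\u201d': '"',   # right double quotation mark
        '\u2018': "'",   # left single quotation mark
        '\u2019': "'",   # right single quotation mark
        '\u2013': '-',   # en dash
        '\u2014': '-',   # em dash
        '\u00a0': ' ',   # non-breaking space
    }
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    return text
```

This matters for segmentation because a curly closing quote after a period can confuse boundary detection where a plain ASCII quote would not.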
lexnlp.nlp.en.segments.sentences.post_process_sentence(text: str, sent_span: Tuple[int, int]) → Generator[Tuple[int, int], Any, Any]¶
Post-processes a sentence span detected by PunktSentenceTokenizer, additionally splitting out titles, table-of-contents entries, and other short strings that stand alone between empty lines as separate sentences.
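The core idea of splitting a detected span at blank lines can be sketched as follows; this is a simplified sketch of the described behavior, not the library's actual heuristics:

```python
import re
from typing import Generator, Tuple

def post_process_sentence(text: str,
                          sent_span: Tuple[int, int]) -> Generator[Tuple[int, int], None, None]:
    """Split a detected sentence span at blank lines so that short,
    stand-alone lines (titles, TOC entries) become separate spans.

    Simplified sketch; the real post-processing applies further checks.
    """
    start, end = sent_span
    chunk = text[start:end]
    last = 0
    # A blank line (newline, optional whitespace, newline) separates sub-spans
    for m in re.finditer(r'\n\s*\n', chunk):
        if chunk[last:m.start()].strip():
            yield (start + last, start + m.start())
        last = m.end()
    if chunk[last:].strip():
        yield (start + last, start + len(chunk))

parts = list(post_process_sentence("TITLE\n\nThis is a sentence.", (0, 26)))
```

Punkt tends to glue a heading onto the sentence that follows it, since a bare title has no sentence-final punctuation; this pass recovers the heading as its own span.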
lexnlp.nlp.en.segments.sentences.pre_process_document(text: str) → str¶
Pre-processes the text of the specified document before splitting it into sentences. Removes obsolete formatting, page-splitting markers, page numbers, etc.
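As a rough stdlib sketch of this kind of clean-up (assumed patterns, not the library's actual rules): form feeds are treated as page breaks and lines holding nothing but a page number are dropped.

```python
import re

def pre_process_document(text: str) -> str:
    """Remove page-splitting markers and stand-alone page numbers.

    Minimal sketch; the real function handles more formatting artifacts.
    """
    # Form feeds (\x0c) are common page-break markers in plain-text exports
    text = text.replace('\x0c', '\n')
    lines = []
    for line in text.split('\n'):
        # Drop lines containing only a page number, e.g. "12" or "- 12 -"
        if re.fullmatch(r'\s*-?\s*\d+\s*-?\s*', line):
            continue
        lines.append(line)
    return '\n'.join(lines)
```

Stripping page furniture before segmentation prevents a page number from being fused into the middle of a sentence that continues across a page break.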
lexnlp.nlp.en.segments.titles module¶
lexnlp.nlp.en.segments.utils module¶
Utility methods for segmentation classifiers.
This module implements utility methods for segmentation, such as shared methods to generate document character distributions.
lexnlp.nlp.en.segments.utils.build_document_distribution(text, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+, -./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', norm=True)¶
Build a document character distribution over a fixed character set, optionally normalizing the counts.
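The technique amounts to counting each character of a fixed alphabet and optionally normalizing to a probability distribution. A minimal sketch, using `string.printable` as a stand-in for the default character set in the signature above (the real function's feature keys and structure may differ):

```python
import string
from collections import Counter

def build_document_distribution(text, characters=string.printable, norm=True):
    """Count occurrences of each character in a fixed character set,
    optionally normalizing to a probability distribution.

    Sketch of the technique; returns {character: count-or-frequency}.
    """
    counts = Counter(c for c in text if c in characters)
    total = sum(counts.values())
    result = {c: counts.get(c, 0) for c in characters}
    if norm and total > 0:
        result = {c: v / total for c, v in result.items()}
    return result
```

Restricting the distribution to a fixed alphabet gives every document a feature vector of the same dimensionality, which is what a downstream segmentation classifier needs.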
lexnlp.nlp.en.segments.utils.build_document_line_distribution(text, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+, -./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', norm=True)¶
Build document- and line-level character distributions for section segmentation over a fixed character set, optionally normalizing the vector.
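Combining a whole-document distribution with one distribution per line can be sketched as below; the output structure (a dict with 'document' and 'lines' keys) is an assumption for illustration, not the library's actual feature layout:

```python
import string
from collections import Counter

def build_document_line_distribution(text, characters=string.printable, norm=True):
    """Build character distributions for the document as a whole and
    for each individual line, as features for section segmentation.

    Sketch only; the real library's feature naming may differ.
    """
    def distribution(chunk):
        counts = Counter(c for c in chunk if c in characters)
        total = sum(counts.values())
        if norm and total > 0:
            return {c: counts.get(c, 0) / total for c in characters}
        return {c: counts.get(c, 0) for c in characters}

    return {
        'document': distribution(text),
        'lines': [distribution(line) for line in text.split('\n')],
    }
```

Per-line distributions let a classifier spot lines whose character mix differs sharply from the document baseline, e.g. an all-caps section heading.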