lexnlp.nlp.en.segments package

Submodules

lexnlp.nlp.en.segments.pages module

lexnlp.nlp.en.segments.paragraphs module

lexnlp.nlp.en.segments.sections module

lexnlp.nlp.en.segments.sentences module

Sentence segmentation for English.

This module implements sentence segmentation in English using simple machine learning classifiers.

Todo:
  • Standardize model (re-)generation
lexnlp.nlp.en.segments.sentences.build_sentence_model(text, extra_abbrevs=None)

Build a sentence model from text, optionally including extra abbreviations.

lexnlp.nlp.en.segments.sentences.get_sentence_list(text)

Get a list of sentences from text.

lexnlp.nlp.en.segments.sentences.get_sentence_span(text: str) → Generator[Tuple[int, int, str], Any, Any]

Given a text, yields (start, end, sentence) triples for each sentence in the text.

lexnlp.nlp.en.segments.sentences.get_sentence_span_list(text) → List[Tuple[int, int, str]]

Given a text, returns a list of (start, end, sentence) triples for each sentence in the text.
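
The span-based API above can be illustrated with a self-contained sketch. This is not lexnlp's implementation (which is backed by a trained Punkt model); `naive_sentence_spans` and its regex are hypothetical stand-ins that only mirror the return shapes of `get_sentence_span` and `get_sentence_span_list`:

```python
import re
from typing import Generator, List, Tuple

def naive_sentence_spans(text: str) -> Generator[Tuple[int, int, str], None, None]:
    """Yield (start, end, sentence) triples, mirroring get_sentence_span."""
    # Naive rule: a sentence is a run of non-terminal characters followed
    # by optional terminal punctuation and trailing whitespace.
    for match in re.finditer(r'[^.!?]+[.!?]*\s*', text):
        start, end = match.start(), match.end()
        yield start, end, text[start:end]

def naive_sentence_span_list(text: str) -> List[Tuple[int, int, str]]:
    """Materialize the generator, mirroring get_sentence_span_list."""
    return list(naive_sentence_spans(text))
```

Because each triple carries its offsets, the original document can always be reconstructed by concatenating the span texts in order.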

lexnlp.nlp.en.segments.sentences.normalize_text(text: str) → str

Simple text pre-processing: replaces “not-quite Unicode” symbols (e.g. curly quotes) with their common ASCII equivalents so that the get_sentence_span function parses sentences more reliably. Example: the input “U.S. Person” means any Person becomes "U.S. Person" means any Person.
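
A minimal sketch of this kind of normalization, assuming a small replacement table (the table lexnlp actually uses may be larger and differ in details):

```python
# Hypothetical replacement table: curly quotes and dashes to ASCII.
REPLACEMENTS = {
    "\u201c": '"',  # left double curly quote
    "\u201d": '"',  # right double curly quote
    "\u2018": "'",  # left single curly quote
    "\u2019": "'",  # right single curly quote
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
}

def normalize_text_sketch(text: str) -> str:
    """Replace 'not-quite Unicode' symbols with common ASCII equivalents."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    return text
```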

lexnlp.nlp.en.segments.sentences.post_process_sentence(text: str, sent_span: Tuple[int, int]) → Generator[Tuple[int, int], Any, Any]

Post-process a sentence span detected by PunktSentenceTokenizer by additionally extracting titles, table-of-contents entries, and other short strings that stand alone between empty lines into separate sentences.
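
The idea can be sketched as follows. `post_process_sketch` is a hypothetical simplification that only splits a span on blank lines; the real function applies additional heuristics for titles and table-of-contents entries:

```python
from typing import Generator, Tuple

def post_process_sketch(text: str,
                        sent_span: Tuple[int, int]) -> Generator[Tuple[int, int], None, None]:
    """Split a detected sentence span on blank lines so that short strings
    standing alone (e.g. titles) become separate spans."""
    start, end = sent_span
    chunk = text[start:end]
    offset = start
    for part in chunk.split("\n\n"):
        if part.strip():
            # Locate each non-empty part in the original text to keep
            # the yielded offsets document-relative.
            begin = text.index(part, offset)
            yield begin, begin + len(part)
            offset = begin + len(part)
```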

lexnlp.nlp.en.segments.sentences.pre_process_document(text: str) → str

Pre-process the text of the specified document before splitting it into sentences. Removes obsolete formatting, page-splitting markers, page numbers, etc.
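
A self-contained sketch of such pre-processing. The form-feed handling and the page-number pattern are assumptions for illustration; lexnlp's actual rules may differ:

```python
import re

def pre_process_sketch(text: str) -> str:
    """Strip page-splitting markers and standalone page-number lines."""
    # Replace form-feed page-splitting markers with newlines.
    text = text.replace("\x0c", "\n")
    # Drop lines containing only a page number, e.g. "12" or "- 12 -".
    lines = [ln for ln in text.split("\n")
             if not re.fullmatch(r"\s*-?\s*\d+\s*-?\s*", ln)]
    return "\n".join(lines)
```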

lexnlp.nlp.en.segments.titles module

lexnlp.nlp.en.segments.utils module

Utility methods for segmentation classifiers

This module implements utility methods for segmentation, such as shared methods to generate document character distributions.

lexnlp.nlp.en.segments.utils.build_document_distribution(text, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', norm=True)

Build a document character distribution over a fixed character set, optionally normalizing the counts.
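
The distribution builder can be sketched as below. `DEFAULT_CHARACTERS` uses Python's `string.printable`, which matches the default character set shown in the signature; the sketch mirrors the intent rather than lexnlp's exact implementation:

```python
import string
from typing import Dict

DEFAULT_CHARACTERS = string.printable  # digits + letters + punctuation + whitespace

def document_distribution(text: str,
                          characters: str = DEFAULT_CHARACTERS,
                          norm: bool = True) -> Dict[str, float]:
    """Count occurrences of each character in `characters`; if norm is True,
    convert counts to frequencies that sum to 1."""
    counts = {c: 0 for c in characters}
    total = 0
    for ch in text:
        if ch in counts:
            counts[ch] += 1
            total += 1
    if norm and total:
        counts = {c: n / total for c, n in counts.items()}
    return counts
```

Such fixed-length character-frequency vectors are a natural feature representation for the simple segmentation classifiers this package describes.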

lexnlp.nlp.en.segments.utils.build_document_line_distribution(text, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', norm=True)

Build document- and line-level character distributions for section segmenting over a fixed character set, optionally normalizing the resulting vectors.

Module contents