lexnlp.nlp.en.segments package

Submodules

lexnlp.nlp.en.segments.pages module

lexnlp.nlp.en.segments.paragraphs module

lexnlp.nlp.en.segments.sections module

lexnlp.nlp.en.segments.sentences module

Sentence segmentation for English.

This module implements sentence segmentation in English using simple machine learning classifiers.

Todo:
  • Standardize model (re-)generation
lexnlp.nlp.en.segments.sentences.build_sentence_model(text, extra_abbrevs=None)

Build a sentence model from text, optionally including extra abbreviations.

lexnlp.nlp.en.segments.sentences.get_sentence_list(text)

Get a list of sentences from text.

lexnlp.nlp.en.segments.sentences.get_sentence_span(text: str) → Generator[Tuple[int, int, str], Any, Any]

Given a text, yields (start, end, sentence) triples for each sentence in the text.

lexnlp.nlp.en.segments.sentences.get_sentence_span_list(text) → List[Tuple[int, int, str]]

Given a text, returns a list of (start, end, sentence) triples for each sentence in the text.
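
The span-based API above can be illustrated with a self-contained sketch. This is not lexnlp's implementation (which is backed by a trained Punkt model); `naive_sentence_spans` and its regex are hypothetical stand-ins that only mirror the return shapes of `get_sentence_span` and `get_sentence_span_list`:

```python
import re
from typing import Generator, List, Tuple

def naive_sentence_spans(text: str) -> Generator[Tuple[int, int, str], None, None]:
    """Yield (start, end, sentence) triples, mirroring get_sentence_span."""
    # Naive rule: a sentence is a run of non-terminal characters followed
    # by optional terminal punctuation and trailing whitespace.
    for match in re.finditer(r'[^.!?]+[.!?]*\s*', text):
        start, end = match.start(), match.end()
        yield start, end, text[start:end]

def naive_sentence_span_list(text: str) -> List[Tuple[int, int, str]]:
    """Materialize the generator, mirroring get_sentence_span_list."""
    return list(naive_sentence_spans(text))
```

Because each triple carries its offsets, the original document can always be reconstructed by concatenating the span texts in order.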

lexnlp.nlp.en.segments.sentences.normalize_text(text: str) → str

Simple text pre-processing: replaces “not-quite Unicode” symbols (e.g. curly quotes) with their common ASCII equivalents so that the get_sentence_span function parses sentences more reliably. Example: the input “U.S. Person” means any Person becomes "U.S. Person" means any Person.
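
A minimal sketch of this kind of normalization, assuming a small replacement table (the table lexnlp actually uses may be larger and differ in details):

```python
# Hypothetical replacement table: curly quotes and dashes to ASCII.
REPLACEMENTS = {
    "\u201c": '"',  # left double curly quote
    "\u201d": '"',  # right double curly quote
    "\u2018": "'",  # left single curly quote
    "\u2019": "'",  # right single curly quote
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
}

def normalize_text_sketch(text: str) -> str:
    """Replace 'not-quite Unicode' symbols with common ASCII equivalents."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    return text
```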

lexnlp.nlp.en.segments.sentences.post_process_sentence(text: str, sent_span: Tuple[int, int]) → Generator[Tuple[int, int], Any, Any]

Post-process a sentence span detected by PunktSentenceTokenizer by additionally extracting titles, table-of-contents entries, and other short strings that stand alone between empty lines into separate sentences.
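
The idea can be sketched as follows. `post_process_sketch` is a hypothetical simplification that only splits a span on blank lines; the real function applies additional heuristics for titles and table-of-contents entries:

```python
from typing import Generator, Tuple

def post_process_sketch(text: str,
                        sent_span: Tuple[int, int]) -> Generator[Tuple[int, int], None, None]:
    """Split a detected sentence span on blank lines so that short strings
    standing alone (e.g. titles) become separate spans."""
    start, end = sent_span
    chunk = text[start:end]
    offset = start
    for part in chunk.split("\n\n"):
        if part.strip():
            # Locate each non-empty part in the original text to keep
            # the yielded offsets document-relative.
            begin = text.index(part, offset)
            yield begin, begin + len(part)
            offset = begin + len(part)
```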

lexnlp.nlp.en.segments.sentences.pre_process_document(text: str) → str

Pre-process the text of the specified document before splitting it into sentences. Removes obsolete formatting, page-splitting markers, page numbers, etc.
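
A self-contained sketch of such pre-processing. The form-feed handling and the page-number pattern are assumptions for illustration; lexnlp's actual rules may differ:

```python
import re

def pre_process_sketch(text: str) -> str:
    """Strip page-splitting markers and standalone page-number lines."""
    # Replace form-feed page-splitting markers with newlines.
    text = text.replace("\x0c", "\n")
    # Drop lines containing only a page number, e.g. "12" or "- 12 -".
    lines = [ln for ln in text.split("\n")
             if not re.fullmatch(r"\s*-?\s*\d+\s*-?\s*", ln)]
    return "\n".join(lines)
```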

lexnlp.nlp.en.segments.titles module

lexnlp.nlp.en.segments.utils module

Utility methods for segmentation classifiers

This module implements utility methods for segmentation, such as shared methods to generate document character distributions.

lexnlp.nlp.en.segments.utils.build_document_distribution(text, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', norm=True)

Build a document character distribution over a fixed character set, optionally normalizing the counts.
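
The distribution builder can be sketched as below. `DEFAULT_CHARACTERS` uses Python's `string.printable`, which matches the default character set shown in the signature; the sketch mirrors the intent rather than lexnlp's exact implementation:

```python
import string
from typing import Dict

DEFAULT_CHARACTERS = string.printable  # digits + letters + punctuation + whitespace

def document_distribution(text: str,
                          characters: str = DEFAULT_CHARACTERS,
                          norm: bool = True) -> Dict[str, float]:
    """Count occurrences of each character in `characters`; if norm is True,
    convert counts to frequencies that sum to 1."""
    counts = {c: 0 for c in characters}
    total = 0
    for ch in text:
        if ch in counts:
            counts[ch] += 1
            total += 1
    if norm and total:
        counts = {c: n / total for c, n in counts.items()}
    return counts
```

Such fixed-length character-frequency vectors are a natural feature representation for the simple segmentation classifiers this package describes.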

lexnlp.nlp.en.segments.utils.build_document_line_distribution(text, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', norm=True)

Build document- and line-level character distributions for section segmenting over a fixed character set, optionally normalizing the resulting vectors.

Module contents