lexnlp.nlp.en.segments package

Submodules

lexnlp.nlp.en.segments.pages module

Page segmentation for English.

This module implements page segmentation in English using simple machine learning classifiers.

Todo:
  • Standardize model (re-)generation
lexnlp.nlp.en.segments.pages.build_page_break_features(lines, line_id, line_window_pre, line_window_post, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', include_doc=None)

Build a feature vector for a given line ID with given parameters.

Parameters:
  • lines
  • line_id
  • line_window_pre
  • line_window_post
  • characters
  • include_doc
Returns:

lexnlp.nlp.en.segments.pages.get_pages(text, window_pre=3, window_post=3, score_threshold=0.5) → Generator

Get pages from text.

Parameters:
  • text
  • window_pre
  • window_post
  • score_threshold
Returns:
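The real get_pages scores candidate line breaks against a trained model. As a rough illustration only (not the LexNLP implementation), the hypothetical `naive_get_pages` helper below treats form-feed characters, which many print and OCR pipelines emit at page boundaries, as page breaks:

```python
from typing import Generator


def naive_get_pages(text: str) -> Generator[str, None, None]:
    """Toy page segmentation: treat form-feed characters as page breaks.

    A crude stand-in for the classifier behind get_pages, which instead
    scores every candidate line break against a trained model.
    """
    for page in text.split("\f"):
        if page.strip():  # skip empty pages
            yield page
```

For example, `list(naive_get_pages("page one\fpage two"))` yields the two page strings.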

lexnlp.nlp.en.segments.paragraphs module

Paragraph segmentation for English.

This module implements paragraph segmentation in English using simple machine learning classifiers.

Todo:
  • Standardize model (re-)generation
lexnlp.nlp.en.segments.paragraphs.build_paragraph_break_features(lines, line_id, line_window_pre, line_window_post, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', include_doc=None)

Build a feature vector for a given line ID with given parameters.

Parameters:
  • lines
  • line_id
  • line_window_pre
  • line_window_post
  • characters
  • include_doc
Returns:

lexnlp.nlp.en.segments.paragraphs.get_paragraphs(text: str, window_pre=3, window_post=3, score_threshold=0.5, return_spans: bool = False) → Generator

Get paragraphs.
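The classifier-based get_paragraphs can be approximated for illustration by splitting on blank lines; `naive_get_paragraphs` below is a hypothetical stand-in, not the LexNLP implementation, but it mirrors the `return_spans` behavior of yielding (start, end, text) tuples:

```python
import re
from typing import Generator, Tuple, Union


def naive_get_paragraphs(text: str, return_spans: bool = False
                         ) -> Generator[Union[str, Tuple[int, int, str]], None, None]:
    """Toy paragraph segmentation: split on blank lines.

    With return_spans=True, yields (start, end, paragraph) tuples instead
    of plain paragraph strings.
    """
    start = 0
    for match in re.finditer(r"\n\s*\n", text):
        chunk = text[start:match.start()]
        if chunk.strip():
            yield (start, match.start(), chunk.strip()) if return_spans else chunk.strip()
        start = match.end()
    tail = text[start:]
    if tail.strip():
        yield (start, len(text), tail.strip()) if return_spans else tail.strip()
```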

lexnlp.nlp.en.segments.paragraphs.splitlines_with_spans(text: str) → Tuple[List[str], List[Tuple[int, int]]]
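One plausible stdlib implementation of this helper is sketched below as `splitlines_spans`; the exact span semantics (for instance, whether line-ending characters are included) may differ from LexNLP's:

```python
import re
from typing import List, Tuple


def splitlines_spans(text: str) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Split text into lines plus each line's (start, end) character span."""
    lines: List[str] = []
    spans: List[Tuple[int, int]] = []
    start = 0
    for match in re.finditer(r"\r\n|\r|\n", text):
        lines.append(text[start:match.start()])
        spans.append((start, match.start()))
        start = match.end()
    if start < len(text):  # text after the final line break
        lines.append(text[start:])
        spans.append((start, len(text)))
    return lines, spans
```

Unlike str.splitlines, this keeps the offsets needed to map each line back into the original document.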

lexnlp.nlp.en.segments.sections module

Section segmentation for English.

This module implements section segmentation in English using simple machine learning classifiers.

Todo:
  • Standardize model (re-)generation
class lexnlp.nlp.en.segments.sections.SectionLevelParser(sections_hierarchy=None)

Bases: object

DEFAULT_SECTION_HIERARCHY = ['(?i:(appendix|exhibit|schedule|part|title)\\s+\\S+)', '(?i:subtitle\\s+\\S+)', '(?i:section\\s+\\S+)', '(?i:subsection\\s+\\S+)', '(?i:article\\s+\\S+)', '\\p{Lu}+(?:-\\d+(?:\\.\\d+)?)?', '[\\d\\.]+', '\\p{L}+(?:-\\d+(?:\\.\\d+)?)?', '\\([\\p{L}\\d]+\\)']
current_sections_hierarchy
detect(title)
get_from_default()
get_from_detected()
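Note that the default hierarchy uses `\p{...}` character classes, which require the third-party `regex` module rather than the stdlib `re`. A simplified sketch of level detection, covering only the stdlib-compatible patterns (`SECTION_PATTERNS` and `detect_level` are illustrative names, not the LexNLP API):

```python
import re
from typing import Optional

# Simplified stand-ins for the first entries of DEFAULT_SECTION_HIERARCHY;
# the later entries use \p{...} classes from the third-party `regex`
# module and are omitted here.
SECTION_PATTERNS = [
    r"(?i:(?:appendix|exhibit|schedule|part|title)\s+\S+)",
    r"(?i:subtitle\s+\S+)",
    r"(?i:section\s+\S+)",
    r"(?i:subsection\s+\S+)",
    r"(?i:article\s+\S+)",
]


def detect_level(title: str) -> Optional[int]:
    """Return the index of the first pattern matching the title, or None.

    Lower indices correspond to coarser (higher-level) sections.
    """
    for level, pattern in enumerate(SECTION_PATTERNS):
        if re.fullmatch(pattern, title.strip()):
            return level
    return None
```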
lexnlp.nlp.en.segments.sections.build_section_break_features(lines, line_id, line_window_pre, line_window_post, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', include_doc=None)

Build a feature vector for a given line ID with given parameters.

Parameters:
  • lines
  • line_id
  • line_window_pre
  • line_window_post
  • characters
  • include_doc
Returns:

lexnlp.nlp.en.segments.sentences module

Sentence segmentation for English.

This module implements sentence segmentation in English using simple machine learning classifiers.

Todo:
  • Standardize model (re-)generation
lexnlp.nlp.en.segments.sentences.build_sentence_model(text, extra_abbrevs=None)

Build a sentence model from text, with optional extra abbreviations to include.

Parameters:
  • text
  • extra_abbrevs
Returns:

lexnlp.nlp.en.segments.sentences.get_sentence_list(text)

Get sentences from text.

Parameters:
  • text
Returns:

lexnlp.nlp.en.segments.sentences.get_sentence_span(text: str) → Generator[Tuple[int, int, str], Any, Any]

Given a text, generates (start, end, sentence) tuples for the sentences in the text.

lexnlp.nlp.en.segments.sentences.get_sentence_span_list(text) → List[Tuple[int, int, str]]

Given a text, returns a list of (start, end, sentence) tuples for the sentences in the text.
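LexNLP uses a trained Punkt model for sentence boundaries, but the span bookkeeping can be illustrated with a toy punctuation-based splitter (`naive_sentence_spans` is a hypothetical helper, not the LexNLP function):

```python
import re
from typing import List, Tuple


def naive_sentence_spans(text: str) -> List[Tuple[int, int, str]]:
    """Very rough sentence spans: break after ., ! or ? followed by whitespace.

    A toy stand-in for the trained Punkt model behind
    get_sentence_span_list; unlike the real model, it will happily
    mis-split abbreviations such as "U.S.".
    """
    spans: List[Tuple[int, int, str]] = []
    start = 0
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        end = match.end()
        sentence = text[start:end].strip()
        if sentence:
            spans.append((start, end, sentence))
        # skip the whitespace between sentences
        while end < len(text) and text[end].isspace():
            end += 1
        start = end
    if start < len(text) and text[start:].strip():
        spans.append((start, len(text), text[start:].strip()))
    return spans
```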

lexnlp.nlp.en.segments.sentences.normalize_text(text: str) → str

Simple text pre-processing: replaces “not-quite Unicode” symbols with their common equivalents so that get_sentence_span parses sentences more reliably.

Parameters:
  • text (e.g. “U.S. Person” means any Person)
Returns:
  "U.S. Person" means any Person
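A sketch of the kind of replacement normalize_text performs; the actual replacement table in LexNLP is likely larger than this sample:

```python
# Sample table of typographic symbols mapped to ASCII equivalents.
TRANSLATION = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2013": "-",  # en dash
})


def normalize_quotes(text: str) -> str:
    """Replace common typographic symbols with ASCII equivalents."""
    return text.translate(TRANSLATION)
```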

lexnlp.nlp.en.segments.sentences.post_process_sentence(text: str, sent_span: Tuple[int, int]) → Generator[Tuple[int, int], Any, Any]

Post-process a sentence span detected by PunktSentenceTokenizer, additionally extracting titles, table-of-contents entries, and other short strings standing alone between empty lines into separate sentences.

Parameters:
  • text
  • sent_span
Returns:

lexnlp.nlp.en.segments.sentences.pre_process_document(text: str) → str

Pre-process the text of the specified document before splitting it into sentences. Removes obsolete formatting, page-splitting markers, page numbers, etc.

Parameters:
  • text
Returns:
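One kind of noise this step removes is page-number-only lines; a minimal stdlib sketch of that single aspect (`strip_page_numbers` is a hypothetical helper, not the LexNLP function, which does considerably more):

```python
import re

# Lines consisting only of a page number, e.g. "3", "- 12 -" or "Page 7".
PAGE_NUMBER_RE = re.compile(r"^\s*(?:page\s+)?-?\s*\d+\s*-?\s*$", re.IGNORECASE)


def strip_page_numbers(text: str) -> str:
    """Drop page-number-only lines from a document."""
    kept = [line for line in text.split("\n") if not PAGE_NUMBER_RE.match(line)]
    return "\n".join(kept)
```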

lexnlp.nlp.en.segments.titles module

Title segmentation for English.

This module implements title segmentation/location in English using simple machine learning classifiers.

lexnlp.nlp.en.segments.titles.build_document_title_features(text, window_pre=3, window_post=3)

Build document title features given file text.

lexnlp.nlp.en.segments.titles.build_model(training_file_path)

Build a title extraction model given a training file path.

Parameters:
  • training_file_path
Returns:
lexnlp.nlp.en.segments.titles.build_title_features(lines, line_id, line_window_pre, line_window_post, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', include_doc=None)

Build a feature vector for a given line ID with given parameters.

Parameters:
  • lines
  • line_id
  • line_window_pre
  • line_window_post
  • characters
  • include_doc
Returns:

lexnlp.nlp.en.segments.utils module

Utility methods for segmentation classifiers.

This module implements utility methods for segmentation, such as shared methods to generate document character distributions.

lexnlp.nlp.en.segments.utils.build_document_distribution(text, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', norm=True)

Build a document character distribution over a fixed character set, optionally normalized.

Parameters:
  • text
  • characters
  • norm
Returns:
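A plausible sketch of this utility, assuming the default alphabet is Python's `string.printable`; the real function's output format (dict vs. vector, key naming) may differ:

```python
from collections import Counter
import string
from typing import Dict


def document_distribution(text: str, characters: str = string.printable,
                          norm: bool = True) -> Dict[str, float]:
    """Count each alphabet character in text; if norm, scale to frequencies."""
    counts = Counter(c for c in text if c in characters)
    dist = {c: float(counts[c]) for c in characters}
    if norm:
        total = sum(dist.values())
        if total:  # avoid dividing by zero for empty text
            dist = {c: v / total for c, v in dist.items()}
    return dist
```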

lexnlp.nlp.en.segments.utils.build_document_line_distribution(text, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', norm=True)

Build document and line character distributions for section segmentation over a fixed character set, optionally normalizing the vector.

Module contents