lexnlp.nlp.en.segments package

Submodules

lexnlp.nlp.en.segments.pages module

Page segmentation for English.

This module implements page segmentation in English using simple machine learning classifiers.

Todo:
  • Standardize model (re-)generation
lexnlp.nlp.en.segments.pages.build_page_break_features(lines, line_id, line_window_pre, line_window_post, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', include_doc=None)

Build a feature vector for a given line ID with given parameters.

Parameters:
  • lines
  • line_id
  • line_window_pre
  • line_window_post
  • characters
  • include_doc
Returns:

lexnlp.nlp.en.segments.pages.get_pages(text, window_pre=3, window_post=3, score_threshold=0.5) → Generator

Get pages from text.

Parameters:
  • text
  • window_pre
  • window_post
  • score_threshold
Returns:
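The real get_pages scores candidate line breaks against a trained model. As a rough illustration only (not the LexNLP implementation), the hypothetical `naive_get_pages` helper below treats form-feed characters, which many print and OCR pipelines emit at page boundaries, as page breaks:

```python
from typing import Generator


def naive_get_pages(text: str) -> Generator[str, None, None]:
    """Toy page segmentation: treat form-feed characters as page breaks.

    A crude stand-in for the classifier behind get_pages, which instead
    scores every candidate line break against a trained model.
    """
    for page in text.split("\f"):
        if page.strip():  # skip empty pages
            yield page
```

For example, `list(naive_get_pages("page one\fpage two"))` yields the two page strings.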

lexnlp.nlp.en.segments.paragraphs module

Paragraph segmentation for English.

This module implements paragraph segmentation in English using simple machine learning classifiers.

Todo:
  • Standardize model (re-)generation
lexnlp.nlp.en.segments.paragraphs.build_paragraph_break_features(lines, line_id, line_window_pre, line_window_post, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', include_doc=None)

Build a feature vector for a given line ID with given parameters.

Parameters:
  • lines
  • line_id
  • line_window_pre
  • line_window_post
  • characters
  • include_doc
Returns:

lexnlp.nlp.en.segments.paragraphs.get_paragraphs(text: str, window_pre=3, window_post=3, score_threshold=0.5, return_spans: bool = False) → Generator

Get paragraphs.
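The classifier-based get_paragraphs can be approximated for illustration by splitting on blank lines; `naive_get_paragraphs` below is a hypothetical stand-in, not the LexNLP implementation, but it mirrors the `return_spans` behavior of yielding (start, end, text) tuples:

```python
import re
from typing import Generator, Tuple, Union


def naive_get_paragraphs(text: str, return_spans: bool = False
                         ) -> Generator[Union[str, Tuple[int, int, str]], None, None]:
    """Toy paragraph segmentation: split on blank lines.

    With return_spans=True, yields (start, end, paragraph) tuples instead
    of plain paragraph strings.
    """
    start = 0
    for match in re.finditer(r"\n\s*\n", text):
        chunk = text[start:match.start()]
        if chunk.strip():
            yield (start, match.start(), chunk.strip()) if return_spans else chunk.strip()
        start = match.end()
    tail = text[start:]
    if tail.strip():
        yield (start, len(text), tail.strip()) if return_spans else tail.strip()
```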

lexnlp.nlp.en.segments.paragraphs.splitlines_with_spans(text: str) → Tuple[List[str], List[Tuple[int, int]]]
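One plausible stdlib implementation of this helper is sketched below as `splitlines_spans`; the exact span semantics (for instance, whether line-ending characters are included) may differ from LexNLP's:

```python
import re
from typing import List, Tuple


def splitlines_spans(text: str) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Split text into lines plus each line's (start, end) character span."""
    lines: List[str] = []
    spans: List[Tuple[int, int]] = []
    start = 0
    for match in re.finditer(r"\r\n|\r|\n", text):
        lines.append(text[start:match.start()])
        spans.append((start, match.start()))
        start = match.end()
    if start < len(text):  # text after the final line break
        lines.append(text[start:])
        spans.append((start, len(text)))
    return lines, spans
```

Unlike str.splitlines, this keeps the offsets needed to map each line back into the original document.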

lexnlp.nlp.en.segments.sections module

Section segmentation for English.

This module implements section segmentation in English using simple machine learning classifiers.

Todo:
  • Standardize model (re-)generation
class lexnlp.nlp.en.segments.sections.SectionLevelParser(sections_hierarchy=None)

Bases: object

DEFAULT_SECTION_HIERARCHY = ['(?i:(appendix|exhibit|schedule|part|title)\\s+\\S+)', '(?i:subtitle\\s+\\S+)', '(?i:section\\s+\\S+)', '(?i:subsection\\s+\\S+)', '(?i:article\\s+\\S+)', '\\p{Lu}+(?:-\\d+(?:\\.\\d+)?)?', '[\\d\\.]+', '\\p{L}+(?:-\\d+(?:\\.\\d+)?)?', '\\([\\p{L}\\d]+\\)']
current_sections_hierarchy
detect(title)
get_from_default()
get_from_detected()
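Note that the default hierarchy uses `\p{...}` character classes, which require the third-party `regex` module rather than the stdlib `re`. A simplified sketch of level detection, covering only the stdlib-compatible patterns (`SECTION_PATTERNS` and `detect_level` are illustrative names, not the LexNLP API):

```python
import re
from typing import Optional

# Simplified stand-ins for the first entries of DEFAULT_SECTION_HIERARCHY;
# the later entries use \p{...} classes from the third-party `regex`
# module and are omitted here.
SECTION_PATTERNS = [
    r"(?i:(?:appendix|exhibit|schedule|part|title)\s+\S+)",
    r"(?i:subtitle\s+\S+)",
    r"(?i:section\s+\S+)",
    r"(?i:subsection\s+\S+)",
    r"(?i:article\s+\S+)",
]


def detect_level(title: str) -> Optional[int]:
    """Return the index of the first pattern matching the title, or None.

    Lower indices correspond to coarser (higher-level) sections.
    """
    for level, pattern in enumerate(SECTION_PATTERNS):
        if re.fullmatch(pattern, title.strip()):
            return level
    return None
```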
lexnlp.nlp.en.segments.sections.build_section_break_features(lines, line_id, line_window_pre, line_window_post, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', include_doc=None)

Build a feature vector for a given line ID with given parameters.

Parameters:
  • lines
  • line_id
  • line_window_pre
  • line_window_post
  • characters
  • include_doc
Returns:

lexnlp.nlp.en.segments.sentences module

Sentence segmentation for English.

This module implements sentence segmentation in English using simple machine learning classifiers.

Todo:
  • Standardize model (re-)generation
lexnlp.nlp.en.segments.sentences.build_sentence_model(text, extra_abbrevs=None)

Build a sentence model from text, with optional extra abbreviations to include.

Parameters:
  • text
  • extra_abbrevs
Returns:

lexnlp.nlp.en.segments.sentences.get_sentence_list(text)

Get sentences from text.

Parameters:
  • text
Returns:

lexnlp.nlp.en.segments.sentences.get_sentence_span(text: str) → Generator[Tuple[int, int, str], Any, Any]

Given a text, generates (start, end, sentence) tuples for the sentences in the text.

lexnlp.nlp.en.segments.sentences.get_sentence_span_list(text) → List[Tuple[int, int, str]]

Given a text, returns a list of (start, end, sentence) tuples for the sentences in the text.
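LexNLP uses a trained Punkt model for sentence boundaries, but the span bookkeeping can be illustrated with a toy punctuation-based splitter (`naive_sentence_spans` is a hypothetical helper, not the LexNLP function):

```python
import re
from typing import List, Tuple


def naive_sentence_spans(text: str) -> List[Tuple[int, int, str]]:
    """Very rough sentence spans: break after ., ! or ? followed by whitespace.

    A toy stand-in for the trained Punkt model behind
    get_sentence_span_list; unlike the real model, it will happily
    mis-split abbreviations such as "U.S.".
    """
    spans: List[Tuple[int, int, str]] = []
    start = 0
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        end = match.end()
        sentence = text[start:end].strip()
        if sentence:
            spans.append((start, end, sentence))
        # skip the whitespace between sentences
        while end < len(text) and text[end].isspace():
            end += 1
        start = end
    if start < len(text) and text[start:].strip():
        spans.append((start, len(text), text[start:].strip()))
    return spans
```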

lexnlp.nlp.en.segments.sentences.normalize_text(text: str) → str

Simple text pre-processing: replaces “not-quite Unicode” symbols with their common equivalents so that get_sentence_span parses sentences more reliably.

Parameters:
  • text (e.g. “U.S. Person” means any Person)
Returns:
  "U.S. Person" means any Person
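A sketch of the kind of replacement normalize_text performs; the actual replacement table in LexNLP is likely larger than this sample:

```python
# Sample table of typographic symbols mapped to ASCII equivalents.
TRANSLATION = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2013": "-",  # en dash
})


def normalize_quotes(text: str) -> str:
    """Replace common typographic symbols with ASCII equivalents."""
    return text.translate(TRANSLATION)
```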

lexnlp.nlp.en.segments.sentences.post_process_sentence(text: str, sent_span: Tuple[int, int]) → Generator[Tuple[int, int], Any, Any]

Post-process a sentence span detected by PunktSentenceTokenizer, additionally extracting titles, table-of-contents entries, and other short strings standing alone between empty lines into separate sentences.

Parameters:
  • text
  • sent_span
Returns:

lexnlp.nlp.en.segments.sentences.pre_process_document(text: str) → str

Pre-process the text of the specified document before splitting it into sentences. Removes obsolete formatting, page-splitting markers, page numbers, etc.

Parameters:
  • text
Returns:
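One kind of noise this step removes is page-number-only lines; a minimal stdlib sketch of that single aspect (`strip_page_numbers` is a hypothetical helper, not the LexNLP function, which does considerably more):

```python
import re

# Lines consisting only of a page number, e.g. "3", "- 12 -" or "Page 7".
PAGE_NUMBER_RE = re.compile(r"^\s*(?:page\s+)?-?\s*\d+\s*-?\s*$", re.IGNORECASE)


def strip_page_numbers(text: str) -> str:
    """Drop page-number-only lines from a document."""
    kept = [line for line in text.split("\n") if not PAGE_NUMBER_RE.match(line)]
    return "\n".join(kept)
```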

lexnlp.nlp.en.segments.titles module

Title segmentation for English.

This module implements title segmentation/location in English using simple machine learning classifiers.

lexnlp.nlp.en.segments.titles.build_document_title_features(text, window_pre=3, window_post=3)

Build document title features given file text.

lexnlp.nlp.en.segments.titles.build_model(training_file_path)

Build a title extraction model given a training file path.

Parameters:
  • training_file_path
Returns:
lexnlp.nlp.en.segments.titles.build_title_features(lines, line_id, line_window_pre, line_window_post, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', include_doc=None)

Build a feature vector for a given line ID with given parameters.

Parameters:
  • lines
  • line_id
  • line_window_pre
  • line_window_post
  • characters
  • include_doc
Returns:

lexnlp.nlp.en.segments.utils module

Utility methods for segmentation classifiers.

This module implements utility methods for segmentation, such as shared methods to generate document character distributions.

lexnlp.nlp.en.segments.utils.build_document_distribution(text, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', norm=True)

Build a document character distribution over a fixed character set, optionally normalized.

Parameters:
  • text
  • characters
  • norm
Returns:
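A plausible sketch of this utility, assuming the default alphabet is Python's `string.printable`; the real function's output format (dict vs. vector, key naming) may differ:

```python
from collections import Counter
import string
from typing import Dict


def document_distribution(text: str, characters: str = string.printable,
                          norm: bool = True) -> Dict[str, float]:
    """Count each alphabet character in text; if norm, scale to frequencies."""
    counts = Counter(c for c in text if c in characters)
    dist = {c: float(counts[c]) for c in characters}
    if norm:
        total = sum(dist.values())
        if total:  # avoid dividing by zero for empty text
            dist = {c: v / total for c, v in dist.items()}
    return dist
```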

lexnlp.nlp.en.segments.utils.build_document_line_distribution(text, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', norm=True)

Build document and line character distributions for section segmentation over a fixed character set, optionally normalizing the vector.

Module contents