lexnlp.nlp.en.segments package¶
Submodules¶
lexnlp.nlp.en.segments.pages module¶
Page segmentation for English.
This module implements page segmentation in English using simple machine learning classifiers.
Todo: Standardize model (re-)generation
lexnlp.nlp.en.segments.pages.build_page_break_features(lines, line_id, line_window_pre, line_window_post, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', include_doc=None)¶
Build a feature vector for a given line ID with the given parameters.
Parameters: lines, line_id, line_window_pre, line_window_post, characters, include_doc.
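The docstring does not describe the feature set itself. A minimal sketch of the idea of windowed line features follows; the helper names and the concrete features are hypothetical, not lexnlp's actual feature extraction:

```python
from collections import Counter
from typing import Dict, List


def line_features(line: str, characters: str) -> Dict[str, float]:
    """Character-distribution features for one line (hypothetical helper)."""
    counts = Counter(c for c in line if c in characters)
    total = sum(counts.values()) or 1
    feats = {"char_" + c: counts[c] / total for c in counts}
    feats["len"] = float(len(line))
    feats["is_blank"] = float(not line.strip())
    return feats


def build_break_features(lines: List[str], line_id: int,
                         window_pre: int, window_post: int,
                         characters: str = "0123456789 .") -> Dict[str, float]:
    """Collect features for a line plus its neighbours in a window of
    window_pre lines before and window_post lines after, prefixing each
    neighbour's features with its relative offset."""
    features: Dict[str, float] = {}
    for offset in range(-window_pre, window_post + 1):
        idx = line_id + offset
        if 0 <= idx < len(lines):
            for name, value in line_features(lines[idx], characters).items():
                features[f"{offset:+d}_{name}"] = value
    return features


lines = ["Intro text", "", "Page 1", "", "More text"]
feats = build_break_features(lines, line_id=2, window_pre=1, window_post=1)
```

The surrounding-line window is what lets a classifier recognize that page-break candidates tend to sit between blank lines.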
lexnlp.nlp.en.segments.pages.get_pages(text, window_pre=3, window_post=3, score_threshold=0.5) → Generator¶
Get pages from text.
Parameters: text, window_pre, window_post, score_threshold.
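The score_threshold parameter suggests a classifier that scores each line as a potential break and splits wherever the score clears the threshold. A sketch of that loop, with a stand-in scorer instead of lexnlp's trained model:

```python
from typing import Callable, Iterator, List


def get_pages_sketch(text: str,
                     score_line: Callable[[List[str], int], float],
                     score_threshold: float = 0.5) -> Iterator[str]:
    """Split text into pages at lines whose break score meets the threshold.
    score_line stands in for the trained classifier's probability output."""
    lines = text.splitlines()
    current: List[str] = []
    for i, line in enumerate(lines):
        current.append(line)
        if score_line(lines, i) >= score_threshold:
            yield "\n".join(current)
            current = []
    if current:
        yield "\n".join(current)


# Toy scorer: treat a literal BREAK line as a certain page break.
pages = list(get_pages_sketch("page one\nBREAK\npage two",
                              lambda ls, i: 1.0 if ls[i] == "BREAK" else 0.0))
```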
lexnlp.nlp.en.segments.paragraphs module¶
Paragraph segmentation for English.
This module implements paragraph segmentation in English using simple machine learning classifiers.
Todo: Standardize model (re-)generation
lexnlp.nlp.en.segments.paragraphs.build_paragraph_break_features(lines, line_id, line_window_pre, line_window_post, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', include_doc=None)¶
Build a feature vector for a given line ID with the given parameters.
Parameters: lines, line_id, line_window_pre, line_window_post, characters, include_doc.
lexnlp.nlp.en.segments.paragraphs.get_paragraphs(text: str, window_pre=3, window_post=3, score_threshold=0.5, return_spans: bool = False) → Generator¶
Get paragraphs from text.
Parameters: text, window_pre, window_post, score_threshold, return_spans.
lexnlp.nlp.en.segments.paragraphs.splitlines_with_spans(text: str) → Tuple[List[str], List[Tuple[int, int]]]¶
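Keeping (start, end) spans alongside the split lines is what lets downstream callers map paragraphs back to positions in the original text. A simplified reimplementation of the idea (splitting only on CR/LF variants, whereas str.splitlines also honours form feeds and other Unicode line boundaries):

```python
import re
from typing import List, Tuple


def splitlines_with_spans_sketch(text: str) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Split text into lines, also returning each line's (start, end)
    character span in the original text."""
    lines: List[str] = []
    spans: List[Tuple[int, int]] = []
    start = 0
    for m in re.finditer(r"\r\n|\r|\n", text):
        lines.append(text[start:m.start()])
        spans.append((start, m.start()))
        start = m.end()
    if start < len(text):          # trailing text without a newline
        lines.append(text[start:])
        spans.append((start, len(text)))
    return lines, spans
```

By construction, `text[s:e]` for each span reproduces the corresponding line.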
lexnlp.nlp.en.segments.sections module¶
Section segmentation for English.
This module implements section segmentation in English using simple machine learning classifiers.
Todo: Standardize model (re-)generation
class lexnlp.nlp.en.segments.sections.SectionLevelParser(sections_hierarchy=None)¶
Bases: object
DEFAULT_SECTION_HIERARCHY = ['(?i:(appendix|exhibit|schedule|part|title)\\s+\\S+)', '(?i:subtitle\\s+\\S+)', '(?i:section\\s+\\S+)', '(?i:subsection\\s+\\S+)', '(?i:article\\s+\\S+)', '\\p{Lu}+(?:-\\d+(?:\\.\\d+)?)?', '[\\d\\.]+', '\\p{L}+(?:-\\d+(?:\\.\\d+)?)?', '\\([\\p{L}\\d]+\\)']¶
current_sections_hierarchy¶
detect(title)¶
get_from_default()¶
get_from_detected()¶
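The hierarchy is an ordered list of patterns, from outermost ("Appendix …", "Part …") to innermost ("(a)"), so a title's nesting level can be read off as the index of the first pattern that matches it. A sketch of that lookup, using simplified ASCII analogues of the first few default patterns (the real class uses the `regex` module with Unicode classes such as `\p{Lu}`, which the standard `re` module does not support):

```python
import re
from typing import List, Optional

# Simplified ASCII analogues of DEFAULT_SECTION_HIERARCHY (illustrative only).
HIERARCHY: List[str] = [
    r"(?i:(appendix|exhibit|schedule|part|title)\s+\S+)",
    r"(?i:subtitle\s+\S+)",
    r"(?i:section\s+\S+)",
    r"(?i:subsection\s+\S+)",
    r"(?i:article\s+\S+)",
    r"[A-Z]+(?:-\d+(?:\.\d+)?)?",   # stands in for \p{Lu}+...
    r"[\d\.]+",
]


def detect_level(title: str, hierarchy: List[str] = HIERARCHY) -> Optional[int]:
    """Return the index of the first hierarchy pattern that fully matches
    the title, i.e. its nesting level (hypothetical helper, not detect())."""
    for level, pattern in enumerate(hierarchy):
        if re.fullmatch(pattern, title.strip()):
            return level
    return None
```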
lexnlp.nlp.en.segments.sections.build_section_break_features(lines, line_id, line_window_pre, line_window_post, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', include_doc=None)¶
Build a feature vector for a given line ID with the given parameters.
Parameters: lines, line_id, line_window_pre, line_window_post, characters, include_doc.
lexnlp.nlp.en.segments.sentences module¶
Sentence segmentation for English.
This module implements sentence segmentation in English using simple machine learning classifiers.
Todo: Standardize model (re-)generation
lexnlp.nlp.en.segments.sentences.build_sentence_model(text, extra_abbrevs=None)¶
Build a sentence model from text, with optional extra abbreviations to include.
Parameters: text, extra_abbrevs.
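The extra_abbrevs parameter exists because naive sentence splitting breaks after every period, including those in abbreviations like "Dr." or "U.S.". The following toy splitter illustrates why an abbreviation list matters; it is not the library's algorithm (lexnlp builds on NLTK's Punkt model, which learns abbreviations from the training text):

```python
import re
from typing import Iterable, Iterator


def split_sentences(text: str, extra_abbrevs: Iterable[str] = ()) -> Iterator[str]:
    """Naive abbreviation-aware splitter: break after '.', '!' or '?'
    followed by whitespace, unless the preceding token is a known
    abbreviation (illustrative base list plus caller-supplied extras)."""
    abbrevs = {"mr", "mrs", "dr", "inc", "corp", "u.s"}
    abbrevs |= {a.lower() for a in extra_abbrevs}
    start = 0
    for m in re.finditer(r"[.!?]\s+", text):
        token = text[start:m.start() + 1].rsplit(None, 1)[-1].rstrip(".").lower()
        if token in abbrevs:
            continue                      # period belongs to an abbreviation
        yield text[start:m.end()].strip()
        start = m.end()
    if start < len(text):
        yield text[start:].strip()
```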
lexnlp.nlp.en.segments.sentences.get_sentence_list(text)¶
Get sentences from text.
Parameters: text.
lexnlp.nlp.en.segments.sentences.get_sentence_span(text: str) → Generator[Tuple[int, int, str], Any, Any]¶
Given a text, generates the (start, end, sentence) spans of sentences in the text.
lexnlp.nlp.en.segments.sentences.get_sentence_span_list(text) → List[Tuple[int, int, str]]¶
Given a text, returns a list of the (start, end, sentence) spans of sentences in the text.
lexnlp.nlp.en.segments.sentences.normalize_text(text: str) → str¶
Simple text pre-processing: replaces "not-quite Unicode" symbols (such as curly quotation marks) with their common equivalents, so that get_sentence_span parses sentences more reliably.
Example: “U.S. Person” means any Person → "U.S. Person" means any Person
lexnlp.nlp.en.segments.sentences.post_process_sentence(text: str, sent_span: Tuple[int, int]) → Generator[Tuple[int, int], Any, Any]¶
Post-process a sentence span detected by PunktSentenceTokenizer, additionally extracting titles, table-of-contents entries, and other short strings standing alone between empty lines into separate sentences.
Parameters: text, sent_span.
lexnlp.nlp.en.segments.sentences.pre_process_document(text: str) → str¶
Pre-process the text of the specified document before splitting it into sentences. Removes obsolete formatting, page-splitting markers, page numbers, etc.
Parameters: text.
lexnlp.nlp.en.segments.titles module¶
Title segmentation for English.
This module implements title segmentation/location in English using simple machine learning classifiers.
lexnlp.nlp.en.segments.titles.build_document_title_features(text, window_pre=3, window_post=3)¶
Build document title features given the document text.
lexnlp.nlp.en.segments.titles.build_model(training_file_path)¶
Build a title extraction model given a training file path.
Parameters: training_file_path.
lexnlp.nlp.en.segments.titles.build_title_features(lines, line_id, line_window_pre, line_window_post, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', include_doc=None)¶
Build a feature vector for a given line ID with the given parameters.
Parameters: lines, line_id, line_window_pre, line_window_post, characters, include_doc.
lexnlp.nlp.en.segments.utils module¶
Utility methods for segmentation classifiers.
This module implements utility methods for segmentation, such as shared methods to generate document character distributions.
lexnlp.nlp.en.segments.utils.build_document_distribution(text, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', norm=True)¶
Build a document character distribution over a fixed character set, optionally normalizing it.
Parameters: text, characters, norm.
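The long characters default above appears to be exactly Python's string.printable. A sketch of the distribution idea (the library's exact output format may differ):

```python
import string
from collections import Counter
from typing import Dict


def document_distribution(text: str,
                          characters: str = string.printable,
                          norm: bool = True) -> Dict[str, float]:
    """Count occurrences of each character from a fixed alphabet in the
    text and, if norm is True, scale the counts into a probability
    distribution (illustrative sketch)."""
    counts = Counter(c for c in text if c in characters)
    dist = {c: float(counts[c]) for c in characters}
    if norm:
        total = sum(dist.values())
        if total > 0:
            dist = {c: v / total for c, v in dist.items()}
    return dist
```

Such fixed-alphabet distributions give every document, page, or line a feature vector of constant length, which is what the segmentation classifiers in this package consume.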
lexnlp.nlp.en.segments.utils.build_document_line_distribution(text, characters='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', norm=True)¶
Build document and per-line character distributions for section segmentation over a fixed character set, optionally normalizing the vector.
Parameters: text, characters, norm.