lexnlp.extract.en.entities package

Submodules

lexnlp.extract.en.entities.nltk_maxent module

Entity extraction for English using NLTK and NLTK pre-trained maximum entropy classifier.

This module implements basic entity extraction functionality in English relying on the pre-trained NLTK functionality, including POS tagger and NE (fuzzy) chunkers.

Todo:
  • Better define interface for sentences vs. raw text
  • Standardize generator vs list

class lexnlp.extract.en.entities.nltk_maxent.CompanyNPExtractor(grammar=None)

Bases: lexnlp.extract.en.utils.NPExtractor

cleanup_leaves(leaves)
get_tokenizer()
static strip_np(np)
lexnlp.extract.en.entities.nltk_maxent.contains_companies(person: str, companies) → bool
lexnlp.extract.en.entities.nltk_maxent.get_companies(text: str, strict: bool = False, use_gnp: bool = False, detail_type: bool = False, count_unique: bool = False, name_upper: bool = False, parse_name_abbr: bool = False, return_source: bool = False)

Find company names in text, optionally using the stricter article/prefix expression.

Parameters:
  • text
  • strict
  • use_gnp – use get_noun_phrases or NPExtractor
  • detail_type – return detailed type (type, unified type, label) vs. type only
  • count_unique – return only unique companies (case-insensitive)
  • name_upper – return company name in upper case
  • parse_name_abbr – return the abbreviated company name, if one exists
  • return_source
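The count_unique and name_upper options can be pictured as post-processing over the extracted names. The following is a minimal sketch of that idea only; the helper name unique_case_insensitive and the sample names are hypothetical, and this is not the LexNLP implementation.

```python
from typing import Iterable, List


def unique_case_insensitive(names: Iterable[str], name_upper: bool = False) -> List[str]:
    """Keep the first occurrence of each name, comparing case-insensitively.

    Hypothetical helper for illustration; not part of lexnlp.
    """
    seen = set()
    result = []
    for name in names:
        key = name.casefold()  # case-insensitive comparison key
        if key in seen:
            continue
        seen.add(key)
        # Optionally normalize the returned name to upper case.
        result.append(name.upper() if name_upper else name)
    return result


# Made-up sample names, not output from lexnlp.
sample = ["Acme Holdings", "ACME HOLDINGS", "Beta Partners", "acme holdings"]
unique_names = unique_case_insensitive(sample)
upper_names = unique_case_insensitive(sample, name_upper=True)
```

Keeping the first spelling encountered is an arbitrary choice made for this sketch; the library may normalize names differently.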

lexnlp.extract.en.entities.nltk_maxent.get_company_annotations(text: str, strict: bool = False, use_gnp: bool = False, count_unique: bool = False, name_upper: bool = False) → Generator[lexnlp.extract.common.annotations.company_annotation.CompanyAnnotation, None, None]

Find company names in text, optionally using the stricter article/prefix expression.

Parameters:
  • text
  • strict
  • use_gnp – use get_noun_phrases or NPExtractor
  • count_unique – return only unique companies (case-insensitive)
  • name_upper – return company name in upper case

lexnlp.extract.en.entities.nltk_maxent.get_geopolitical(text, strict=False, return_source=False, window=2) → Generator

Get geopolitical entities (GPEs) from text.

Parameters:
  • text
  • strict
  • return_source
  • window

lexnlp.extract.en.entities.nltk_maxent.get_noun_phrases(text, strict=False, return_source=False, window=3, valid_punctuation=None) → Generator

Get proper-noun (NNP) phrases from text.

Parameters:
  • text
  • strict
  • return_source
  • window
  • valid_punctuation

lexnlp.extract.en.entities.nltk_maxent.get_persons(text, strict=False, return_source=False, window=2) → Generator

Get person names from text.

Parameters:
  • text
  • strict
  • return_source
  • window
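The functions above are documented as returning a Generator, which matters to callers because a generator can be consumed only once. The sketch below illustrates this with a hypothetical stand-in, fake_get_persons, that yields adjacent capitalized word pairs; it is not the NLTK-based extractor and exists only so the example runs without LexNLP or NLTK data installed.

```python
from typing import Generator, List


def fake_get_persons(text: str) -> Generator[str, None, None]:
    """Hypothetical stand-in for a generator-returning extractor.

    Yields pairs of adjacent title-cased words; not lexnlp's logic.
    """
    words = text.split()
    for first, second in zip(words, words[1:]):
        if first.istitle() and second.istitle():
            yield f"{first} {second}"


matches = fake_get_persons("The lease was signed by John Smith yesterday")

# Materialize the generator if the results are needed more than once.
first_pass: List[str] = list(matches)

# The generator is now exhausted, so a second pass yields nothing.
second_pass: List[str] = list(matches)
```

Wrapping a call in list(...) is the simplest way to get list semantics from any of the generator-returning functions.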

lexnlp.extract.en.entities.nltk_re module

Entity extraction for English using NLTK and basic regular expressions with master data.

This module implements basic entity extraction functionality in English, but does NOT rely on the pre-trained NLTK maximum entropy classifier. Instead, it uses the NLTK English grammar in combination with regular expressions and tested master data re: company types and abbreviations (e.g., LLC).

Todo:
  • Better define interface for sentences vs. raw text
  • Standardize generator vs list

lexnlp.extract.en.entities.nltk_re.check_backtrack_catastrophy(text: str) → bool
lexnlp.extract.en.entities.nltk_re.create_company_pattern(company_pattern_template=None, company_name_pattern=None, company_type_list=None, company_description_list=None, article_pattern='')

Create a company pattern for use in a regular expression.

Parameters:
  • company_pattern_template
  • company_name_pattern
  • company_type_list
  • company_description_list
  • article_pattern
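The general idea of combining a name pattern, a list of company types, and an optional article prefix into one regular expression can be sketched as follows. This is a simplified illustration under stated assumptions: build_company_pattern, find_companies, the name pattern, and the three company types are all hypothetical and far smaller than the master data the module describes.

```python
import re
from typing import List, Tuple


def build_company_pattern(company_type_list: List[str], article_pattern: str = "") -> re.Pattern:
    """Assemble an (article) + (name) + (type) regular expression.

    Hypothetical sketch; not lexnlp's create_company_pattern.
    """
    # Longest types first so a longer type wins over a shorter prefix of it.
    ordered = sorted(company_type_list, key=len, reverse=True)
    type_pipe = "|".join(re.escape(t) for t in ordered)
    # One or more capitalized words, each followed by whitespace.
    name_pattern = r"(?:[A-Z][\w&'-]*,?\s+)+"
    return re.compile(
        rf"{article_pattern}\b(?P<name>{name_pattern})(?P<type>{type_pipe})(?!\w)"
    )


def find_companies(text: str, pattern: re.Pattern) -> List[Tuple[str, str]]:
    """Return (name, type) pairs for every match of the pattern."""
    return [
        (m.group("name").strip().rstrip(","), m.group("type"))
        for m in pattern.finditer(text)
    ]


types = ["LLC", "Inc.", "Corp."]  # illustrative subset only

loose = build_company_pattern(types)
loose_hits = find_companies(
    "This Agreement is made by Acme Widgets LLC and Beta Holdings Inc. on the date below.",
    loose,
)

# A stricter variant that requires a leading article before the name.
strict = build_company_pattern(types, article_pattern=r"\b[Tt]he\s+")
strict_hits = find_companies("the Gamma Group LLC and Delta Works LLC", strict)
```

Requiring the article trades recall for precision: in the second call only the name preceded by "the" is returned.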

lexnlp.extract.en.entities.nltk_re.get_companies(text: str, use_article: bool = False, use_sentence_splitter: bool = True) → Generator[lexnlp.extract.common.annotations.company_annotation.CompanyAnnotation, None, None]

Find company names in text, optionally using the stricter article/prefix expression.

lexnlp.extract.en.entities.nltk_re.get_company_description_pipe(company_description_list=None)
lexnlp.extract.en.entities.nltk_re.get_company_type_pipe(company_type_list=None)
lexnlp.extract.en.entities.nltk_re.get_parties_as(text: str, detail_type=False) → Generator[Tuple[str, str, str, str], None, None]
Parameters:
  • text – source text to search for companies
  • detail_type – obsolete
Returns:

parties: [(name, company type, company description, party type), …]
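Because each result is a positional 4-tuple, callers may find it easier to work with named fields. The sketch below wraps rows of the documented shape (name, company type, company description, party type) in a NamedTuple; the Party class, the to_parties helper, and the sample row are hypothetical and not part of LexNLP.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Party(NamedTuple):
    """Named view over the documented 4-tuple (hypothetical helper type)."""
    name: str
    company_type: str
    company_description: str
    party_type: str


def to_parties(rows: Iterable[Tuple[str, str, str, str]]) -> List[Party]:
    """Convert positional tuples into Party records."""
    return [Party(*row) for row in rows]


# Made-up row in the documented shape, not real output from get_parties_as.
sample_rows = [
    ("Acme Widgets", "LLC", "a Delaware limited liability company", "Landlord"),
]
parties = to_parties(sample_rows)
```

With real data, the rows argument would be the generator returned by get_parties_as, which to_parties consumes in a single pass.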

lexnlp.extract.en.entities.nltk_tokenizer module

class lexnlp.extract.en.entities.nltk_tokenizer.NltkTokenizer(punctuation: Optional[List[Any]] = None, starting_quotes: Optional[Any] = None)

Bases: nltk.tokenize.treebank.TreebankWordTokenizer

Nearly a copy of TreebankWordTokenizer, except that NltkTokenizer allows the punctuation and starting_quotes settings to be changed.

tokenize(text, convert_parentheses=False, return_str=False)

Return a tokenized copy of text.

Return type: list of str

lexnlp.extract.en.entities.stanford_ner module

Entity extraction for English using Stanford Named Entity Recognition (NER).

This module implements basic entity extraction functionality in English relying on the pre-trained Stanford NLP NER classifiers.

Todo:
  • Better define interface for sentences vs. raw text
  • Standardize generator vs list

lexnlp.extract.en.entities.stanford_ner.get_locations(text, strict=False, return_source=False, window=2) → Generator

Get locations from text using Stanford libraries.

Parameters:
  • text
  • strict
  • return_source
  • window

lexnlp.extract.en.entities.stanford_ner.get_model_file(language)

Return the appropriate model file for each language.

Parameters:
  • language

lexnlp.extract.en.entities.stanford_ner.get_organizations(text, strict=False, return_source=False, window=2) → Generator

Get organizations from text using Stanford libraries.

Parameters:
  • text
  • strict
  • return_source
  • window

lexnlp.extract.en.entities.stanford_ner.get_persons(text, strict=False, return_source=False, window=2) → Generator

Get persons from text using Stanford libraries.

Parameters:
  • text
  • strict
  • return_source
  • window

Module contents