lexnlp.extract.ml.detector package

Submodules

lexnlp.extract.ml.detector.artifact_detector module

class lexnlp.extract.ml.detector.artifact_detector.ArtifactDetector

Bases: object

build_amount_tokens() → List[str]
load(file_path: str)
load_compressed(file_path: str)
load_from_stream(stream: Any)
predict(sample_df: pandas.core.frame.DataFrame, size_limit: int = 0) → Tuple[numpy.ndarray, numpy.ndarray]
predict_text(text: str, join_settings: lexnlp.extract.ml.detector.phrase_constructor.PhraseConstructorSettings = None, feature_mask: List[int] = None) → Generator[Tuple[int, int], None, None]
process_sample(sample_df: pandas.core.frame.DataFrame, build_target_data: bool = False) → Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]]
read_sample_df(train_file: str, train_size: int) → pandas.core.frame.DataFrame
save_compressed_model(save_path)
save_model(save_path: str) → None
train_and_save(settings: lexnlp.extract.ml.detector.detecting_settings.DetectingSettings, train_file: str, train_size: int = -1, save_path: str = '', compress: bool = False) → None

Create a percent identification model using tokens.

:param settings: Model settings
:param train_file: File to load training samples from
:param train_size: Number of records to use
:param save_path: Output (pickle model) file path
:param compress: Save compressed file

train_and_save_on_tokens(tokens: List[str], save_path: str, settings: lexnlp.extract.ml.detector.detecting_settings.DetectingSettings, train_sample_df: pandas.core.frame.DataFrame, punc_set: str = '.,/-', symbol_set: Optional[str] = None, string_checks: bool = False, compress: bool = False)
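
The typical lifecycle is train_and_save followed by load (or load_compressed) and predict_text. The sketch below is illustrative only, assuming hypothetical file paths and default settings; the concrete detector subclasses in lexnlp supply the model specifics.

   # A minimal sketch of the train / persist / predict cycle.
   # 'train_samples.csv' and 'model.pickle' are hypothetical paths.
   from lexnlp.extract.ml.detector.artifact_detector import ArtifactDetector
   from lexnlp.extract.ml.detector.detecting_settings import DetectingSettings

   detector = ArtifactDetector()
   detector.train_and_save(
       settings=DetectingSettings(),
       train_file='train_samples.csv',   # file of training samples
       train_size=10000,                 # records to use; -1 for all
       save_path='model.pickle',
       compress=True)                    # write a compressed pickle

   # Restore the compressed model later and extract (start, end) spans.
   detector.load_compressed('model.pickle')
   for start, end in detector.predict_text('The fee is 12.5 percent of the total.'):
       print(start, end)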

lexnlp.extract.ml.detector.detecting_settings module

class lexnlp.extract.ml.detector.detecting_settings.DetectingSettings(use_spacy: bool = False, pre_window: int = 0, post_window: int = 0, model_type: str = 'random_forest')

Bases: object
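
For illustration, settings that enable spaCy tokenization with a two-token context window on each side; the parameter values here are examples, not recommendations.

   from lexnlp.extract.ml.detector.detecting_settings import DetectingSettings

   # Example values: spaCy tokenization, two context tokens before and
   # after each candidate token, and the default random forest model.
   settings = DetectingSettings(use_spacy=True,
                                pre_window=2,
                                post_window=2,
                                model_type='random_forest')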

lexnlp.extract.ml.detector.phrase_constructor module

class lexnlp.extract.ml.detector.phrase_constructor.PhraseConstructor

Bases: object

Join “empty”, “start”, “middle” and “end” tokens into phrases.

DEFAULT_CONSTRUCTOR_SETTINGS = PhraseConstructorSettings(method=PhraseConstructorMethod.by_class, strict=False)
DEFAULT_TOKEN_CLASSES = <lexnlp.extract.ml.detector.phrase_constructor.PhraseTokenClasses object>
static join_tokens(tokens, predicted_class, feature_mask: List[int] = None, settings: lexnlp.extract.ml.detector.phrase_constructor.PhraseConstructorSettings = None, token_classes: lexnlp.extract.ml.detector.phrase_constructor.PhraseTokenClasses = None) → Generator[Tuple[int, int], None, None]
static join_tokens_by_class(tokens, predicted_class, strict: bool = False, token_classes: lexnlp.extract.ml.detector.phrase_constructor.PhraseTokenClasses = None) → Generator[Tuple[int, int], None, None]

Join consecutive tokens whose predicted classes mark phrase start, inner and end positions into (start, end) phrase spans.

static join_tokens_by_score(tokens, predicted_class, feature_mask: List[int] = None, max_zeros: int = 2, min_token_score: int = 2, token_classes: lexnlp.extract.ml.detector.phrase_constructor.PhraseTokenClasses = None) → Generator[Tuple[int, int], None, None]

Join tokens into (start, end) phrase spans based on per-token scores, tolerating up to max_zeros zero-score tokens inside a phrase and requiring an accumulated score of at least min_token_score.
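
The sketch below shows the intent of by-class joining with the default class codes (0 = outer, 1 = start, 2 = inner, 3 = end). The (text, start_offset, end_offset) token tuples are an assumption made for illustration; the real token structure is produced by the classifier models in lexnlp.extract.ml.classifier.

   from lexnlp.extract.ml.detector.phrase_constructor import (
       PhraseConstructor, PhraseTokenClasses)

   # Hypothetical tokens as (text, start_offset, end_offset) tuples.
   tokens = [('The', 0, 3), ('fee', 4, 7), ('is', 8, 10),
             ('12.5', 11, 15), ('percent', 16, 23), ('.', 23, 24)]
   # Per-token predicted classes: 0 = outer, 1 = start, 3 = end.
   predicted_class = [0, 0, 0, 1, 3, 0]

   for start, end in PhraseConstructor.join_tokens_by_class(
           tokens, predicted_class, token_classes=PhraseTokenClasses()):
       print(start, end)   # one span covering '12.5 percent'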

class lexnlp.extract.ml.detector.phrase_constructor.PhraseConstructorMethod

Bases: enum.Enum

An enumeration.

by_class = 1
by_score = 2
class lexnlp.extract.ml.detector.phrase_constructor.PhraseConstructorSettings(method: lexnlp.extract.ml.detector.phrase_constructor.PhraseConstructorMethod = <PhraseConstructorMethod.by_class: 1>, strict: bool = False, max_zeros: int = 2, min_token_score: int = 2)

Bases: object
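
As an illustration, the settings below switch from the default by_class joining to score-based joining; the max_zeros and min_token_score values mirror the constructor defaults.

   from lexnlp.extract.ml.detector.phrase_constructor import (
       PhraseConstructorMethod, PhraseConstructorSettings)

   # Score-based joining: allow up to two zero-score tokens inside a
   # phrase and require an accumulated token score of at least two.
   join_settings = PhraseConstructorSettings(
       method=PhraseConstructorMethod.by_score,
       max_zeros=2,
       min_token_score=2)
   # Pass as join_settings= to ArtifactDetector.predict_text, or as
   # settings= to PhraseConstructor.join_tokens.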

class lexnlp.extract.ml.detector.phrase_constructor.PhraseTokenClasses(outer_class: int = 0, start_class: int = 1, inner_class: int = 2, end_class: int = 3)

Bases: object

lexnlp.extract.ml.detector.sample_processor module

lexnlp.extract.ml.detector.sample_processor.get_target_start_end_from_corgetes(_: str, column_name_formatted: str, row) → List[Tuple[int, int]]
lexnlp.extract.ml.detector.sample_processor.get_target_start_end_from_text(text: str, column_name_formatted: str, row) → List[Tuple[int, int]]
lexnlp.extract.ml.detector.sample_processor.process_sample(sample_df: pandas.core.frame.DataFrame, s: lexnlp.extract.ml.classifier.base_token_sequence_classifier_model.BaseTokenSequenceClassifierModel, build_target_data: bool = True, pre_alloc_multiple: int = 30, column_name_formatted: str = 'quantity_formatted', outer_class: int = 0, start_class: int = 1, inner_class: int = 2, end_class: int = 3, get_target_start_end: Callable[[str, str, Any], List[Tuple[int, int]]] = <function get_target_start_end_from_text>, feature_mask_column: Optional[str] = None) → Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]]

Process a sample file to create feature and target data.

:param sample_df: dataframe with at least 'sentence' column
:param s: TokenSequenceClassifierModel or SpacyTokenSequenceClassifierModel
:param build_target_data: build target data vector (if true)
:param pre_alloc_multiple:
:param column_name_formatted: "quantity_formatted" or "noun_phrase_formatted" …
:param outer_class:
:param start_class:
:param inner_class:
:param end_class:
:return: (feature_data, target_data) if build_target_data = True, or just feature_data
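
A sketch of preparing feature and target data from a labelled dataframe. The classifier model s is assumed to already exist (its construction lives in lexnlp.extract.ml.classifier and is omitted here), and the sample row is hypothetical.

   import pandas as pd
   from lexnlp.extract.ml.detector.sample_processor import process_sample

   # Hypothetical labelled sample: the raw sentence plus the formatted
   # target phrase in the column named by column_name_formatted.
   sample_df = pd.DataFrame({
       'sentence': ['The fee is 12.5 percent of the total.'],
       'quantity_formatted': ['12.5 percent'],
   })

   # 's' must be a BaseTokenSequenceClassifierModel instance; see
   # lexnlp.extract.ml.classifier for the concrete model classes.
   feature_data, target_data = process_sample(
       sample_df, s,
       build_target_data=True,
       column_name_formatted='quantity_formatted')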

Module contents