normalize_text¶

lexnlp.extract.en.dict_entities.normalize_text(text: str, spaces_on_start_end: bool = True, spaces_after_dots: bool = True, lowercase: bool = True, use_stemmer: bool = False) → str¶: Normalizes text for substring search operations - extracts tokens, joins them back with spaces, adds missing spaces after dots for abbreviations, e.t.c. Overall aim of this method is to weaken substring matching conditions by normalizing both the text and the substring being searched by the same way removing obsolete differences between them (case, punctuation, …). :param text: :param spaces_on_start_end: :param spaces_after_dots: :param lowercase: :param use_stemmer: Use stemmer instead of tokenizer. When using stemmer all words will be converted to singular number (or to some the most plain form) before matching. When using tokenizer - the words are compared as is. Using tokenizer should be enough for searches for entities which exist in a single number in the real world - geo entities, courts, …. Stemmer is required for searching for some common objects - table, pen, developer, … :return: