find_dict_entities

lexnlp.extract.en.dict_entities.find_dict_entities(text: str, all_possible_entities: typing.List[typing.Tuple[int, str, int, typing.List[typing.Tuple]]], text_languages: typing.Union[typing.List[str], typing.Tuple[str], typing.Set[str]] = None, conflict_resolving_func: typing.Callable[[typing.List[typing.Tuple[int, str, typing.List[typing.Tuple]]]], typing.Tuple[typing.List[typing.Tuple[int, str, typing.List[typing.Tuple]]], typing.Tuple]] = None, use_stemmer: bool = False, remove_time_am_pm: bool = True, min_alias_len: int = None, prepared_alias_black_list: typing.Union[NoneType, typing.Dict[str, typing.Tuple[typing.List[str], typing.List[str]]]] = None) → typing.Generator

Find all entities from the ‘all_possible_entities’ list that appear in the source text.

  • Longest match: when several entities have aliases and one alias is a substring of another, only the longest matching search result is kept.
  • Language: if a language is specified for both the text and an alias, the alias is used only when the two languages are the same.
  • Conflicts: several possibly matching entities can be detected at one position in the text, because different entities can have the same alias in the same language. A special resolving function can be specified to resolve such conflicts.
  • Time AM/PM: aliases of some entities can coincide with the AM/PM component of a time string. The method tries to detect minutes/seconds/milliseconds before AM/PM and ignores the match in such cases.
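The AM/PM handling described above could be approximated as follows. This is only a sketch under stated assumptions: the pattern `_TIME_TAIL` and the helper `looks_like_time_suffix` are hypothetical names, not part of LexNLP, and the real implementation may use different rules.

```python
import re

# Hypothetical pattern (an assumption, not LexNLP's own): minutes, optionally
# seconds and milliseconds, then at most one whitespace character, at the very
# end of the text that precedes the candidate match.
_TIME_TAIL = re.compile(r":\d{2}(?::\d{2}(?:\.\d{1,3})?)?\s?$")


def looks_like_time_suffix(text: str, start: int, end: int) -> bool:
    """Return True if text[start:end] is an am/pm token preceded by time digits."""
    token = text[start:end].lower().replace(".", "")
    if token not in ("am", "pm"):
        return False
    # Only the text before the match matters: "11:45 " qualifies, "in " does not.
    return _TIME_TAIL.search(text[:start]) is not None
```

A match on an alias such as "AM" would then be dropped only when this check returns True, so the same two letters elsewhere in the text are still reported.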

Algorithm of this method:

1. Normalize the source text (both lowercase and non-lowercase versions are needed for abbreviation searches).
2. Create a shared search context: a map of position -> (alias text + list of matching entities).
3. For each possible entity, search using the shared context:

3.1. For each alias of the entity:
3.1.1. Iteratively search for all occurrences of the alias, taking into account its language and abbreviation status.
For each occurrence found, check whether another alias and entity have already been found at this position and keep only the one with the longest alias (“Something” vs “Something Bigger”). If a different entity with an identical alias in the same language has already been found at this position, store both for this position in the text.

4. The map is now filled with: position -> (alias text + list of entities having this alias). Sorting the items of this dict by position makes it possible to remove overlaps between longer and shorter aliases where one is a substring of the other (“Bankr. E.D.N.Y.” vs “E.D.N.Y.”).
5. For each position, check whether its span [position; position + len(alias)] overlaps with the next one. If they overlap, keep the longest alias and drop the shorter.
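The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the library's code: the `Entity` shape and the function `find_longest_matches` are hypothetical, the search is a plain case-insensitive substring search, and language, abbreviation and word-boundary handling are left out.

```python
from typing import Dict, List, Tuple

# Hypothetical simplified entity shape: (entity_id, name, aliases).
Entity = Tuple[int, str, List[str]]
Match = Tuple[int, str, List[int]]  # (position, alias, ids of matching entities)


def find_longest_matches(text: str, entities: List[Entity]) -> List[Match]:
    lowered = text.lower()  # step 1, reduced to lowercasing
    # Steps 2-3: shared context, position -> (alias, entity ids with that alias).
    found: Dict[int, Tuple[str, List[int]]] = {}
    for entity_id, _name, aliases in entities:
        for alias in aliases:
            needle = alias.lower()
            if not needle:
                continue
            start = lowered.find(needle)
            while start != -1:
                current = found.get(start)
                if current is None or len(needle) > len(current[0]):
                    found[start] = (needle, [entity_id])  # longer alias wins
                elif current[0] == needle and entity_id not in current[1]:
                    current[1].append(entity_id)  # equal alias: keep both
                start = lowered.find(needle, start + 1)
    # Steps 4-5: walk positions in order and drop the shorter overlapping alias.
    result: List[Match] = []
    for pos in sorted(found):
        alias, ids = found[pos]
        if result:
            prev_pos, prev_alias, _ = result[-1]
            if pos < prev_pos + len(prev_alias):  # overlaps the previous match
                if len(alias) > len(prev_alias):
                    result[-1] = (pos, alias, ids)
                continue
        result.append((pos, alias, ids))
    return result
```

For the text "Filed in Bankr. E.D.N.Y. and later in E.D.N.Y." the shorter alias is dropped where it sits inside the longer one but is still reported at its second, standalone occurrence.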

The main complexity of this algorithm comes from the requirement to detect the longest match for each piece of text, while the longer match can start at an earlier position than the shorter match and multiple aliases of different entities can match the same piece of text.

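Another algorithm for this function could be based on or-kind (alternation) regexps. We could form regexps containing the possible aliases and apply them to the source text: r'alias1|alias2|longer alias2|…'. Note that Python's re module returns the first alternative that matches at a position, not the longest one, so longer aliases would have to be listed before the shorter aliases they contain.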
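A minimal sketch of this regexp-based idea, assuming plain string aliases and case-insensitive matching. The function name `find_aliases_with_regex` is hypothetical, and the sketch ignores languages, abbreviations and word boundaries.

```python
import re
from typing import List, Tuple


def find_aliases_with_regex(text: str, aliases: List[str]) -> List[Tuple[int, str]]:
    """Return (position, matched text) for every non-overlapping alias match."""
    # Alternatives are tried left to right, so sort longest-first to make
    # the longest alias win when several start at the same position.
    ordered = sorted({a for a in aliases if a}, key=len, reverse=True)
    if not ordered:
        return []
    # re.escape keeps characters such as "." in "E.D.N.Y." literal.
    pattern = re.compile("|".join(re.escape(a) for a in ordered), re.IGNORECASE)
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]
```

Because `finditer` scans left to right and returns non-overlapping matches, a longer alias that starts earlier also takes precedence over a shorter one contained in it.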

TODO Compare to other algorithms for time and memory complexity

Parameters:
  • text
  • all_possible_entities – list of dict or list of DictEntity - all possible entities to search for
  • min_alias_len – Minimal length of alias/name to search for. Can be used to ignore too short aliases like “M.” while searching.
  • prepared_alias_black_list – Prepared black list of aliases to exclude from the search. Can be used to ignore specific aliases. Should be: dict of language -> tuple (list of normalized non-abbreviations, list of normalized abbreviations).
  • text_languages – If set, only aliases of these languages will be searched for.
  • conflict_resolving_func – A function for resolving conflicts when multiple entities are detected at the same position in the source text and their detected aliases are of the same length. The function takes a list of conflicting entities and should return a list of one or more entities which should be returned.
  • use_stemmer – Use stemmer instead of tokenizer. The stemmer converts words to their simple form (singular number, etc.). The stemmer works better when searching for “tables”, “developers”, …; the tokenizer fits “United States”, “Mississippi”, …
  • remove_time_am_pm – Remove from the final results AM/PM abbreviations which look like the end part of time strings: 11:45 am, 10:00 pm.
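To make the expected shape of prepared_alias_black_list concrete, here is a small sketch. The sample data and the helper `is_black_listed` are hypothetical, and the normalization used (lowercasing and collapsing whitespace) is an assumption; the library's own normalization may differ.

```python
from typing import Dict, List, Optional, Tuple

# language -> (normalized non-abbreviations, normalized abbreviations)
BlackList = Dict[str, Tuple[List[str], List[str]]]

# Hypothetical sample data in the shape described for prepared_alias_black_list.
black_list: BlackList = {
    "en": (["united"], ["m."]),
}


def is_black_listed(alias: str, language: str, is_abbreviation: bool,
                    prepared: Optional[BlackList]) -> bool:
    """Check an alias against the per-language black list (illustrative only)."""
    if not prepared or language not in prepared:
        return False
    non_abbreviations, abbreviations = prepared[language]
    candidates = abbreviations if is_abbreviation else non_abbreviations
    # Assumed normalization: lowercase and collapse runs of whitespace.
    normalized = " ".join(alias.lower().split())
    return normalized in candidates
```

Keeping abbreviations and non-abbreviations in separate lists lets the same string be excluded in one role and still searched for in the other.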