lexnlp.extract.en.geoentities: Extracting geographic and geopolitical entities

The lexnlp.extract.en.geoentities module contains methods that allow for the extraction of geopolitical or geographic references from text.

Attention

The methods in this module rely heavily on data from the LexPredict Legal Dictionary repository: https://github.com/LexPredict/lexpredict-legal-dictionary

This data is governed by a separate Creative Commons Attribution Share Alike 4.0 license here: https://github.com/LexPredict/lexpredict-legal-dictionary/blob/master/LICENSE

The full list of current unit test cases can be found here: https://github.com/LexPredict/lexpredict-lexnlp/tree/master/test_data/lexnlp/extract/en/tests/test_geoentities

Extracting geoentities

lexnlp.extract.en.geoentities.get_geoentities()

Searches for geo entities from the provided config list and yields pairs of (entity, alias).

  • Entity is: (entity_id, name, [list of aliases])
  • Alias is: (alias_text, lang, is_abbrev, alias_id)
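To make the two tuple layouts concrete, here is a minimal sketch in plain Python. The type aliases, the sample entity, and all id values are hypothetical illustrations of the shapes described above; they are not taken from the LexPredict Legal Dictionary or from the library itself.

```python
from typing import List, Tuple

# Hypothetical type aliases mirroring the layouts described above.
Alias = Tuple[str, str, bool, int]   # (alias_text, lang, is_abbrev, alias_id)
Entity = Tuple[int, str, List[Alias]]  # (entity_id, name, [list of aliases])

# Made-up sample entity; the ids are placeholders, not real dictionary ids.
iceland: Entity = (
    1,
    "Iceland",
    [
        ("Iceland", "en", False, 10),
        ("Island", "de", False, 11),
        ("IS", "en", True, 12),
    ],
)

# Unpack the entity and each alias the same way a consumer of
# (entity, alias) pairs would.
entity_id, name, aliases = iceland
for alias_text, lang, is_abbrev, alias_id in aliases:
    print(entity_id, name, alias_text, lang, is_abbrev, alias_id)
```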

This method uses the general dictionary-entity search routines from the dict_entities.py module. The dict_entities helper methods entity_config(), entity_alias() and add_aliases_to_entity() can be used to build the config conveniently.

Parameters:
  • text – Text to search.
  • geo_config_list – List of all possible known geo entities in the form of tuples (id, name, [(alias, lang, is_abbrev, alias_id), …]).
  • priority – If two entities are found with exactly equal matching aliases, use the one with the greatest priority field.
  • priority_by_id – If two entities are found with exactly equal matching aliases, use the one with the lowest id.
  • text_languages – Language(s) of the source text. If a language is specified, only aliases in that language are searched for. For example, this allows ignoring “Island”, the German-language alias of Iceland, in English texts.
  • min_alias_len – Minimal length of geo entity aliases to search for.
  • prepared_alias_black_list – List of aliases to exclude from searching, in the form: dict of lang -> (list of normalized non-abbreviation aliases, list of normalized abbreviation aliases). Use dict_entities.prepare_alias_blacklist_dict() to prepare this dict.
Returns:

Generates tuples: (entity, alias)
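The effect of text_languages, min_alias_len and priority_by_id can be illustrated with a simplified stand-in written in plain Python. This is only a sketch under stated assumptions, not the library's implementation: find_geoentities_sketch and the sample data are hypothetical, matching is done with a simple word-boundary regular expression, abbreviations are assumed to be matched case-sensitively, and the priority field and alias blacklist are not modeled.

```python
import re
from typing import Dict, Iterator, List, Optional, Sequence, Tuple

Alias = Tuple[str, str, bool, int]     # (alias_text, lang, is_abbrev, alias_id)
Entity = Tuple[int, str, List[Alias]]  # (entity_id, name, [list of aliases])


def find_geoentities_sketch(
    text: str,
    geo_config_list: Sequence[Entity],
    text_languages: Optional[Sequence[str]] = None,
    min_alias_len: int = 2,
    priority_by_id: bool = False,
) -> Iterator[Tuple[Entity, Alias]]:
    """Hypothetical, simplified stand-in that yields (entity, alias) pairs."""
    matches: List[Tuple[Entity, Alias]] = []
    for entity in geo_config_list:
        for alias in entity[2]:
            alias_text, lang, is_abbrev, _alias_id = alias
            # Skip aliases in languages other than those of the source text.
            if text_languages and lang not in text_languages:
                continue
            # Skip aliases that are shorter than the configured minimum.
            if len(alias_text) < min_alias_len:
                continue
            # Assumption: abbreviations match case-sensitively, names do not.
            flags = 0 if is_abbrev else re.IGNORECASE
            pattern = r"\b" + re.escape(alias_text) + r"\b"
            if re.search(pattern, text, flags):
                matches.append((entity, alias))

    if priority_by_id:
        # For identical alias texts keep only the entity with the lowest id.
        lowest: Dict[str, Tuple[Entity, Alias]] = {}
        for entity, alias in matches:
            key = alias[0].lower()
            if key not in lowest or entity[0] < lowest[key][0][0]:
                lowest[key] = (entity, alias)
        matches = list(lowest.values())

    yield from matches


# Made-up config; the ids are placeholders.
GEO_CONFIG: List[Entity] = [
    (1, "Iceland", [("Iceland", "en", False, 10), ("Island", "de", False, 11)]),
    (2, "Germany", [("Germany", "en", False, 20), ("Deutschland", "de", False, 21)]),
]
text = "The island lies between Iceland and Germany."

english_only = [
    (e[1], a[0])
    for e, a in find_geoentities_sketch(text, GEO_CONFIG, text_languages=["en"])
]
all_languages = [(e[1], a[0]) for e, a in find_geoentities_sketch(text, GEO_CONFIG)]

print(english_only)   # [('Iceland', 'Iceland'), ('Germany', 'Germany')]
print(all_languages)  # [('Iceland', 'Iceland'), ('Iceland', 'Island'), ('Germany', 'Germany')]
```

Without a language restriction, the common noun "island" is picked up through the German alias "Island"; restricting the search to English aliases removes that false match, which is the situation the text_languages description refers to.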

Note

For examples of loading and using entities from the LexPredict Legal Dictionary repository, please refer to the source code examples here: https://github.com/LexPredict/lexpredict-lexnlp/blob/master/lexnlp/extract/en/tests/test_geoentities.py