lexnlp.extract.en.distances: Extracting distances¶
The lexnlp.extract.en.distances module contains methods for extracting distance references from text. Distances that are covered by default in this module include:
km
kilometer
mile
miles
mi
The full list of current unit test cases can be found here: https://github.com/LexPredict/lexpredict-lexnlp/tree/master/test_data/lexnlp/extract/en/tests/test_distances
Extracting distances¶
Example
>>> import lexnlp.extract.en.distances
>>> text = "Within 50 miles of office."
>>> print(list(lexnlp.extract.en.distances.get_distances(text)))
[(50.0, 'mile')]
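Because each result is a (value, unit) pair, the output is easy to post-process. The following is a minimal sketch that normalizes such pairs to kilometres. The helper to_kilometres and the table KM_PER_UNIT are hypothetical names introduced here for illustration, not part of LexNLP, and the unit names 'mile' and 'kilometer' are assumed to be the standard types the extractor returns.

```python
# Hypothetical conversion table; the keys are ASSUMED to match the unit
# names returned by get_distances (only 'mile' is confirmed by the example).
KM_PER_UNIT = {"kilometer": 1.0, "mile": 1.609344}


def to_kilometres(distances):
    """Convert (value, unit) pairs to kilometres, skipping unknown units."""
    return [
        float(value) * KM_PER_UNIT[unit]
        for value, unit in distances
        if unit in KM_PER_UNIT
    ]


# Same shape as the output of the example above.
sample = [(50.0, "mile")]
converted = to_kilometres(sample)
print(converted)
```

Units missing from the table are dropped silently in this sketch; raising an error instead may be preferable when every distance must be accounted for.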
Customizing distance extraction¶
Distance extraction can be customized. Three key module variables store the default configuration, and a fourth, DISTANCE_PTN_RE, holds the compiled regular expression that is actually used for matching:
DISTANCE_TOKEN_MAP: A dictionary that maps tokens (full unit names such as "miles") to standard distance types. See the customization example below.
DISTANCE_SYMBOL_MAP: A dictionary that maps abbreviations (such as "mi") to standard distance types. See the customization example below.
DISTANCE_PTN: A string that defines the regular expression pattern used to match distances.
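To make the roles of these variables concrete, here is a simplified, standard-library-only sketch of how two maps and a pattern can work together. It is not LexNLP's implementation: TOKEN_MAP, SYMBOL_MAP, build_pattern, and find_distances are hypothetical stand-ins, the map contents are assumed, and the number pattern handles only plain digits rather than the richer amounts the library recognizes.

```python
import re

# Hypothetical stand-ins for the module's maps; the real contents may differ.
TOKEN_MAP = {"kilometer": "kilometer", "kilometers": "kilometer",
             "mile": "mile", "miles": "mile"}
SYMBOL_MAP = {"km": "kilometer", "mi": "mile"}


def build_pattern(token_map, symbol_map):
    """Build one regex whose unit alternation comes from both maps."""
    # Longest alternatives first, so "miles" is tried before "mile" and "mi".
    units = sorted(list(token_map) + list(symbol_map), key=len, reverse=True)
    alternation = "|".join(re.escape(unit) for unit in units)
    return re.compile(
        r"(\d+(?:\.\d+)?)\s*(" + alternation + r")(?:\W|$)",
        re.IGNORECASE,
    )


def find_distances(text, token_map=TOKEN_MAP, symbol_map=SYMBOL_MAP):
    """Yield (value, standard_unit) pairs found in text."""
    pattern = build_pattern(token_map, symbol_map)
    for match in pattern.finditer(text):
        unit = match.group(2).lower()
        # Tokens and symbols both resolve to a standard distance type.
        yield float(match.group(1)), token_map.get(unit) or symbol_map[unit]


result = list(find_distances("Within 50 miles of office."))
print(result)
```

The point of the sketch is the division of labor: the maps decide which spellings are recognized and what they normalize to, while the pattern decides where in the text a number followed by a unit counts as a match.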
The default behavior of this module can be customized by overriding the value of DISTANCE_PTN_RE with a new compiled regular expression. The example below adds a new distance unit, the yard:
>>> # Out of the box behavior
>>> import lexnlp.extract.en.distances
>>> text = "This improvement shall extend for no more than fifteen yards."
>>> print(list(lexnlp.extract.en.distances.get_distances(text)))
[]
>>> # Customize the regular expression pattern
>>> import regex as re
>>> import lexnlp.extract.en.amounts
>>> lexnlp.extract.en.distances.DISTANCE_TOKEN_MAP["yard"] = "yard"
>>> lexnlp.extract.en.distances.DISTANCE_TOKEN_MAP["yards"] = "yard"
>>> lexnlp.extract.en.distances.DISTANCE_SYMBOL_MAP["yd"] = "yard"
>>> lexnlp.extract.en.distances.DISTANCE_PTN = r"""
... (({num_ptn})\s*
... ({distance_tokens}|{distance_symbols}))(?:\W|$)
... """.format(
...     num_ptn=lexnlp.extract.en.amounts.NUM_PTN.replace('(?:\\W|$)', '').replace('(?<=\\W|^)', ''),
...     distance_symbols='|'.join(lexnlp.extract.en.distances.DISTANCE_SYMBOL_MAP),
...     distance_tokens='|'.join(lexnlp.extract.en.distances.DISTANCE_TOKEN_MAP)
... )
>>> lexnlp.extract.en.distances.DISTANCE_PTN_RE = re.compile(
...     lexnlp.extract.en.distances.DISTANCE_PTN,
...     re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE)
>>> # Run the method again to test
>>> print(list(lexnlp.extract.en.distances.get_distances(text)))
[(15, 'yard')]
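The example rebuilds and recompiles the pattern after editing the maps because a compiled regular expression is a snapshot of the string it was built from; later changes to the dictionaries do not reach it. The standard-library sketch below illustrates that general behavior with a hypothetical token_map and compile_units helper. It does not use LexNLP itself and matches digits only.

```python
import re

# Hypothetical stand-in for a token map; not LexNLP's actual contents.
token_map = {"mile": "mile", "miles": "mile"}


def compile_units(mapping):
    """Compile a pattern whose unit alternation is taken from the mapping."""
    units = sorted(mapping, key=len, reverse=True)  # longest first
    return re.compile(
        r"(\d+)\s*(" + "|".join(re.escape(u) for u in units) + r")\b",
        re.IGNORECASE,
    )


compiled = compile_units(token_map)
text = "no more than 15 yards"

# Adding a unit to the map does not change the already-compiled pattern.
token_map["yards"] = "yard"
before = compiled.findall(text)

# Recompiling picks up the new unit.
compiled = compile_units(token_map)
after = compiled.findall(text)

print(before, after)
```

This is why the customization above updates the maps, rebuilds the pattern string, and reassigns the compiled pattern, in that order.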