lexnlp.extract.en.dates: Extracting date references
The lexnlp.extract.en.dates module contains methods for extracting dates from text. Sample formats handled by this module include:
- February 1, 1998
- 2017-06-01
- 1st day of June, 2017
- 31 October 2016
- 15th of March 2000
The full list of current unit test cases can be found here: https://github.com/LexPredict/lexpredict-lexnlp/tree/master/test_data/lexnlp/extract/en/tests/test_dates
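To make the sample formats above concrete, here is a minimal standard-library sketch that parses exactly those five shapes. It is not how LexNLP works internally. The names `find_sample_dates`, `PATTERNS`, and `MONTHS` are hypothetical, abbreviated month names such as "Apr" are not handled, and results are grouped by pattern rather than by position in the text.

```python
import datetime
import re

# Full English month names mapped to month numbers (illustrative only).
MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

MONTH = r"(?P<month>[A-Za-z]+)"
DAY = r"(?P<day>\d{1,2})(?:st|nd|rd|th)?"
YEAR = r"(?P<year>\d{4})"

# One hypothetical pattern per sample format family.
PATTERNS = [
    re.compile(rf"\b{MONTH} {DAY}, {YEAR}\b"),          # February 1, 1998
    re.compile(r"\b(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\b"),  # 2017-06-01
    re.compile(rf"\b{DAY} day of {MONTH},? {YEAR}\b"),  # 1st day of June, 2017
    re.compile(rf"\b{DAY} (?:of )?{MONTH} {YEAR}\b"),   # 31 October 2016, 15th of March 2000
]


def find_sample_dates(text):
    """Yield datetime.date objects for the sample formats (sketch only)."""
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            month_text = match.group("month").lower()
            month = int(month_text) if month_text.isdigit() else MONTHS.get(month_text)
            if month is None:
                continue  # the word in the month slot is not a month name
            try:
                yield datetime.date(int(match.group("year")), month, int(match.group("day")))
            except ValueError:
                continue  # impossible calendar date, e.g. February 30
```

A pattern-only approach like this accepts any text that happens to fit a pattern, which is the false positive problem that the classifier-based filtering in this module is meant to address.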
Extracting dates
lexnlp.extract.en.dates.get_dates(text, strict=False, base_date=None, return_source=False, threshold=0.5) → typing.Generator
Find dates after cleaning false positives.
:param text: raw text to search
:param strict: whether to return only complete or strict matches
:param base_date: base date to use for implied or partial matches
:param return_source: whether to return raw text around date
:param threshold: probability threshold to use for false positive classifier
:return:
Example
>>> import lexnlp.extract.en.dates
>>> text = "This agreement shall terminate on the 15th day of March, 2020."
>>> print(list(lexnlp.extract.en.dates.get_dates(text)))
[datetime.date(2020, 3, 15)]
>>> text = "This agreement shall terminate on the 2nd of Apr 2030."
>>> print(list(lexnlp.extract.en.dates.get_dates(text)))
[datetime.date(2030, 4, 2)]
Note
This method combines pattern matching with machine learning and NLP to remove false positive matches. If speed is more important than precision, consider the get_raw_dates method below, or train your own model using a smaller feature space or a faster machine learning model type. For more details, see the Advanced usage and customization section below.
Advanced usage and customization
Out of the box, LexNLP uses a cross-validated logistic classifier whose inputs are the distributions of one-character and two-character sequences within a 5-character window of the potential date match. The training and assessment data can be found in train_default_model and in the unit tests.
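The role of the threshold parameter can be illustrated with a small pure-Python sketch of logistic scoring. It assumes the classifier estimates the probability that a candidate is a genuine date and that candidates at or above the threshold are kept. The function names and the weights are hypothetical and do not come from the trained LexNLP model.

```python
import math


def logistic_probability(features, weights, bias):
    """Return sigmoid(w . x + b) for sparse feature and weight dicts."""
    score = bias + sum(weights.get(name, 0.0) * value
                       for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def keep_candidate(features, weights, bias, threshold=0.5):
    """Keep a candidate when its estimated probability reaches the threshold."""
    return logistic_probability(features, weights, bias) >= threshold


# Hypothetical weights: a slash near the match counts toward "date",
# a dollar sign counts against it.
example_weights = {"/": 2.0, "$": -3.0}

print(keep_candidate({"/": 1.0}, example_weights, bias=0.0))                  # kept at the default threshold
print(keep_candidate({"/": 1.0}, example_weights, bias=0.0, threshold=0.95))  # rejected at a stricter one
```

Under this assumption, raising the threshold trades recall for precision: fewer candidates survive, but those that do have a higher estimated probability of being real dates.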
lexnlp.extract.en.dates.get_raw_date_list(text, strict=False, base_date=None, return_source=False) → typing.List
lexnlp.extract.en.dates.get_raw_dates(text, strict=False, base_date=None, return_source=False) → typing.Generator
Find “raw” or potential date matches prior to false positive classification.
:param text: raw text to search
:param strict: whether to return only complete or strict matches
:param base_date: base date to use for implied or partial matches
:param return_source: whether to return raw text around date
:return:
lexnlp.extract.en.dates.get_date_features(text, start_index, end_index, include_bigrams=True, window=5, characters=None, norm=True)
Get features to use for classification of date as false positive.
:param text: raw text around potential date
:param start_index: date start index
:param end_index: date end index
:param include_bigrams: whether to include bigram/bicharacter features
:param window: window around match
:param characters: characters to use for feature generation, e.g., digits only, alpha only
:param norm: whether to norm, i.e., transform to proportion
:return:
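As a rough illustration of character-window features, the sketch below counts single characters and character bigrams in a span around a match and optionally converts the counts to proportions. It is not the LexNLP implementation. The name `window_ngram_features` is hypothetical, the `characters` filter is left out, and the choices to include the match itself in the span and to normalize unigrams and bigrams separately are assumptions.

```python
from collections import Counter


def _proportions(counts):
    """Convert raw counts to proportions of their own total."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()} if total else {}


def window_ngram_features(text, start_index, end_index, window=5,
                          include_bigrams=True, norm=True):
    """Count characters (and optionally bigrams) around a match span."""
    # Take the match plus up to `window` characters on each side.
    left = max(0, start_index - window)
    right = min(len(text), end_index + window)
    snippet = text[left:right]

    unigrams = Counter(snippet)
    bigrams = Counter(snippet[i:i + 2] for i in range(len(snippet) - 1))

    # Keys of length 1 are characters and keys of length 2 are bigrams,
    # so both kinds can share one dict without colliding.
    features = dict(_proportions(unigrams) if norm else unigrams)
    if include_bigrams:
        features.update(_proportions(bigrams) if norm else bigrams)
    return features


text = "due 1/2/20 ok"
features = window_ngram_features(text, 4, 10, window=2, norm=False)
print(features["/"], features["/2"])  # raw counts of "/" and "/2" near the match
```

Features of this kind let a classifier pick up on local context, such as currency symbols or section markers next to a number, without any tokenization or language model.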
lexnlp.extract.en.dates.build_date_model(input_examples, output_file, verbose=True)
Build a sklearn model for classifying date strings as potential false positives.
:param input_examples:
:param output_file:
:param verbose:
:return:
lexnlp.extract.en.dates.train_default_model(save=True)
Train default model.
:return: