lexnlp.utils package
Subpackages
- lexnlp.utils.lines_processing package
- lexnlp.utils.tests package
  - Submodules
    - lexnlp.utils.tests.test_line_processor module
    - lexnlp.utils.tests.test_map module
    - lexnlp.utils.tests.test_parse_df module
    - lexnlp.utils.tests.test_parsed_text_corrector module
    - lexnlp.utils.tests.test_parsed_text_quality_estimator module
    - lexnlp.utils.tests.test_phrase_finder module
  - Module contents
- lexnlp.utils.unicode package
Submodules
lexnlp.utils.decorators module
lexnlp.utils.decorators.safe_failure(func)
    Return None on failure, or skip the result if the wrapped function is a generator.
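The behaviour described for safe_failure can be illustrated with a small stand-in decorator. This is only a sketch of one possible reading of the docstring, not the library's implementation: the name safe_failure_sketch, the helper functions parse_int and parse_ints, and the choice to end a failing generator quietly are all assumptions made for illustration.

```python
import functools
import inspect


def safe_failure_sketch(func):
    """Hypothetical stand-in: swallow exceptions raised by ``func``."""
    if inspect.isgeneratorfunction(func):
        @functools.wraps(func)
        def generator_wrapper(*args, **kwargs):
            try:
                yield from func(*args, **kwargs)
            except Exception:
                # Assumption: stop quietly; items yielded before the
                # failure are kept, the failing item is skipped.
                return
        return generator_wrapper

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Plain functions report failure as None.
            return None
    return wrapper


@safe_failure_sketch
def parse_int(value):
    # Hypothetical example function.
    return int(value)


@safe_failure_sketch
def parse_ints(values):
    # Hypothetical example generator.
    for value in values:
        yield int(value)
```

With this sketch, parse_int('x') evaluates to None instead of raising ValueError, and iterating parse_ints stops at the first value that cannot be converted.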
lexnlp.utils.iterating_helpers module
lexnlp.utils.iterating_helpers.collapse_sequence(sequence: collections.abc.Iterable, predicate: Callable[[Any, Any], Any], accumulator: Any = 0.0) → Any
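The signature of collapse_sequence suggests a fold (reduce) over the sequence. The following is a minimal sketch under that assumption; the name collapse_sequence_sketch is hypothetical, and the argument order predicate(item, accumulator) is a guess that the signature alone does not confirm.

```python
from typing import Any, Callable, Iterable


def collapse_sequence_sketch(sequence: Iterable,
                             predicate: Callable[[Any, Any], Any],
                             accumulator: Any = 0.0) -> Any:
    """Hypothetical fold: thread an accumulator through every item."""
    for item in sequence:
        # Assumption: the callback receives (item, accumulator) and
        # returns the updated accumulator.
        accumulator = predicate(item, accumulator)
    return accumulator


# Sum the lengths of the strings, starting from an integer accumulator.
total_length = collapse_sequence_sketch(
    ["ab", "cde"], lambda item, acc: acc + len(item), 0)
```

An empty sequence returns the initial accumulator unchanged, which is why the default of 0.0 matters for numeric folds.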
lexnlp.utils.iterating_helpers.count_sequence_matches(sequence: collections.abc.Iterable, predicate: Callable[Any, bool]) → int
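Judging by its name and signature, count_sequence_matches counts the items for which the predicate holds. A sketch under that assumption, with the hypothetical name count_sequence_matches_sketch, might look like this:

```python
from typing import Any, Callable, Iterable


def count_sequence_matches_sketch(sequence: Iterable,
                                  predicate: Callable[[Any], bool]) -> int:
    """Hypothetical stand-in: count items for which predicate is truthy."""
    return sum(1 for item in sequence if predicate(item))


# Count the uppercase characters in a string.
upper_count = count_sequence_matches_sketch("LexNLP", str.isupper)
```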
lexnlp.utils.map module
lexnlp.utils.parse_df module
class lexnlp.utils.parse_df.DataframeEntityParser(dataframe, parse_columns, result_columns=None, preformed_entity=None, priority_sort_column=None, priority_sort_ascending=True, cell_values_separator=';', unique_column_values=True, line_processor: lexnlp.utils.lines_processing.line_processor.LineProcessor = None)
    Bases: object
Extracts entities from a text using a collection of known entities stored in a dataframe. By default, the values in the columns used for the search are assumed to be UNIQUE. Each result is a dict holding the start and end positions of the found item in the text, plus any user-defined key-value pairs.
Params:
- dataframe: pandas.DataFrame holding the collection of entities
- parse_columns: list or tuple - columns whose values are searched for in the text
- result_columns: dict - mapping of the form {'dataframe column whose value is taken for the extracted entity': 'new_column_name'}
- preformed_entity: dict - initial, static key-value pairs added to each extracted entity
- priority_sort_column: str - column to sort by when multiple rows match, so that the first row in sort order is taken; if not set, the first matched row is used
- priority_sort_ascending: bool - sort order for priority_sort_column
- cell_values_separator: str or None - separator between multiple values stored in a single dataframe cell
- unique_column_values: bool - whether the dataframe columns have unique values
E.g.:
>>> parse_columns = ('Kurztitel', 'Titel', 'Abkürzung')
>>> result_columns = {'Titel': 'name'}
>>> preformed_entity = {'entity_type': 'Laws and Rules',
...                     'source': 'BaFin',
...                     'country': 'Germany'}
>>> sort_column = 'Titel'
>>> items = DataframeEntityParser(
...     df, parse_columns, result_columns, preformed_entity, sort_column).parse(text)
SEARCH_PTN = '(?:^|\\W)({})(?:\\W|$)'
get_collection_ptn(collection)
    Convert a list of values to a regex pattern.
    :param collection: list of entities to search in
    :return: compiled regex pattern
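One way a list of values could be turned into a pattern with the SEARCH_PTN template is sketched below. The template string is taken from the attribute above; everything else (the function name build_collection_pattern, escaping each value, de-duplicating, and ordering alternatives longest first) is an assumption for illustration and may differ from what get_collection_ptn actually does.

```python
import re

# Same template as the SEARCH_PTN attribute, written as a raw string.
SEARCH_PTN = r'(?:^|\W)({})(?:\W|$)'


def build_collection_pattern(collection):
    """Hypothetical sketch: compile one alternation from many values."""
    # Assumptions: skip empty values, escape regex metacharacters, and
    # try longer alternatives first so they win over shorter prefixes.
    alternatives = sorted(
        {re.escape(value) for value in collection if value},
        key=len, reverse=True)
    return re.compile(SEARCH_PTN.format('|'.join(alternatives)))


pattern = build_collection_pattern(['KWG', 'Kreditwesengesetz'])
match = pattern.search('Nach dem KWG ist dies zulässig.')
```

Because the template also consumes the non-word character on either side of the value, the entity itself is in group 1: here match.group(1) is 'KWG' and match.span(1) gives its start and end positions in the text.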
get_entities(text: str)
get_entities_from_text(text: str) → Generator[[dict, None], None]
get_entity_list(text)
get_formed_entity(match, col_name)
    Get the formed entity from the matched row in the dataframe.
    :param match: re.match object
    :param col_name: df column name
    :return: dict
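The shape of a formed entity, as described in the class docstring (start and end positions plus user-defined key-value pairs), can be sketched as follows. This is an illustration only: the function form_entity_sketch, the key names location_start and location_end, and the use of a plain dict in place of a dataframe row are all assumptions, and the real method takes a column name rather than a row.

```python
import re


def form_entity_sketch(match, row, result_columns=None, preformed_entity=None):
    """Hypothetical sketch: merge static pairs, positions, and row values."""
    # Start from the static key-value pairs shared by every entity.
    entity = dict(preformed_entity or {})
    # Assumed key names; group 1 is the value without boundary characters.
    entity['location_start'], entity['location_end'] = match.span(1)
    # Copy selected row values under their new names.
    for source_column, target_key in (result_columns or {}).items():
        entity[target_key] = row[source_column]
    return entity


# A plain dict stands in for the matched dataframe row.
row = {'Kurztitel': 'KWG', 'Titel': 'Kreditwesengesetz'}
match = re.search(r'(?:^|\W)(KWG)(?:\W|$)', 'Siehe KWG.')
entity = form_entity_sketch(
    match, row, {'Titel': 'name'}, {'country': 'Germany'})
```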
get_single_result(rows)
    By default, all values used to filter the dataframe are assumed to be UNIQUE, so the first row is taken. Implement your own logic to choose among multiple matched dataframe rows.
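Custom selection logic for multiple matched rows could follow the priority_sort_column idea from the constructor parameters. The sketch below uses a list of plain dicts in place of dataframe rows; the function name pick_single_row and the sample data are hypothetical, and the library's own handling may differ.

```python
def pick_single_row(rows, priority_sort_column=None,
                    priority_sort_ascending=True):
    """Hypothetical sketch: choose one row out of several matches."""
    if not rows:
        return None
    if priority_sort_column is None:
        # Default behaviour described above: take the first matched row.
        return rows[0]
    # Otherwise take the first row in the requested sort order.
    ordered = sorted(rows,
                     key=lambda row: row[priority_sort_column],
                     reverse=not priority_sort_ascending)
    return ordered[0]


# Sample rows, invented for illustration.
rows = [
    {'Titel': 'Wertpapierhandelsgesetz', 'Kurztitel': 'WpHG'},
    {'Titel': 'Kreditwesengesetz', 'Kurztitel': 'KWG'},
]
first_match = pick_single_row(rows)
by_title = pick_single_row(rows, priority_sort_column='Titel')
```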
lexnlp.utils.parse_df.get_entities(text: str, config: pandas.core.frame.DataFrame, parse_columns: Union[List[str], Tuple[str]], result_columns: Optional[dict] = None, preformed_entity: Optional[dict] = None, priority_sort_column: Optional[str] = None, priority_sort_ascending: bool = True, cell_values_separator: Optional[str] = ';', unique_column_values: bool = True) → Generator
    Simple wrapper around DataframeEntityParser.
lexnlp.utils.parse_df.get_entity_list(text: str, config: pandas.core.frame.DataFrame, parse_columns: Union[List[str], Tuple[str]], result_columns: Optional[dict] = None, preformed_entity: Optional[dict] = None, priority_sort_column: Optional[str] = None, priority_sort_ascending: bool = True, cell_values_separator: Optional[str] = ';', unique_column_values: bool = True) → List
    Simple wrapper around DataframeEntityParser.