src.preprocessing.extract_dates_regex

Extract the creation dates from the full text of documents.

Classes

ExtractDatesRegex(min_date, max_date)

Extract the creation date of a body of text with a regex.

class src.preprocessing.extract_dates_regex.ExtractDatesRegex(min_date, max_date)

Extract the creation date of a body of text with a regex.

The regex matches every date written as day-month-year or as day {monthname} year. The month name can be in Dutch or English.

min_date: The minimum date to consider, in format %d-%m-%Y.
max_date: The maximum date to consider, in format %d-%m-%Y.
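As a rough illustration of the two date formats described above, the following sketch builds one possible pattern and collects every match. It is not the pattern used by ExtractDatesRegex; the names MONTHS, DATE_PATTERN and find_date_candidates are hypothetical, and the accepted separators and month spellings are assumptions.

```python
import re

# Hypothetical lookup table: Dutch and English month names mapped to month numbers.
MONTHS = {
    "januari": 1, "january": 1,
    "februari": 2, "february": 2,
    "maart": 3, "march": 3,
    "april": 4,
    "mei": 5, "may": 5,
    "juni": 6, "june": 6,
    "juli": 7, "july": 7,
    "augustus": 8, "august": 8,
    "september": 9,
    "oktober": 10, "october": 10,
    "november": 11,
    "december": 12,
}

# Longest names first, so "augustus" is tried before "august".
_month_alt = "|".join(sorted(MONTHS, key=len, reverse=True))

# First alternative: day-month-year with digits only.
# Second alternative: day {monthname} year.
DATE_PATTERN = re.compile(
    rf"\b(\d{{1,2}})-(\d{{1,2}})-(\d{{4}})\b"
    rf"|\b(\d{{1,2}})\s+({_month_alt})\s+(\d{{4}})\b",
    re.IGNORECASE,
)


def find_date_candidates(text):
    """Return a (day, month, year) tuple for every match in text, in order."""
    found = []
    for m in DATE_PATTERN.finditer(text):
        if m.group(1) is not None:
            # Numeric form matched.
            day, month, year = int(m.group(1)), int(m.group(2)), int(m.group(3))
        else:
            # Month-name form matched; translate the name to a number.
            day, month, year = int(m.group(4)), MONTHS[m.group(5).lower()], int(m.group(6))
        found.append((day, month, year))
    return found
```

The candidates are returned unvalidated; checking them against the calendar and against min_date and max_date would be a separate step.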

min_date: InitVar
max_date: InitVar
custom_transform(data, **transform_args)

Generate extracted dates from full body text.

Parameters:
  • data (DataFrame) – The data to transform.

  • transform_args (Never) – [UNUSED] Additional keyword arguments.

Return type:

DataFrame

Returns:

The transformed data.

extract_date_regex(full_text)

Extract the date of creation from a full text with a regular expression.

Parameters:

full_text (str | None) – The full text to extract the date from.

Return type:

Timestamp | NaTType

Returns:

The extracted date or NaT if no valid date was found.
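The behaviour described above (return a date, or a missing value when no valid date is found) might look roughly like the sketch below. It uses only the standard library, so None stands in for NaT and datetime for Timestamp. The function name first_date_in_range is hypothetical, only the numeric day-month-year form is handled, and returning the first in-range match is an assumption; the source does not say which match the real method picks.

```python
import re
from datetime import datetime

# Simplified pattern: numeric day-month-year only.
NUMERIC_DATE = re.compile(r"\b(\d{1,2})-(\d{1,2})-(\d{4})\b")


def first_date_in_range(full_text, min_date, max_date):
    """Return the first calendar-valid date within [min_date, max_date], else None.

    min_date and max_date are strings in %d-%m-%Y format.
    """
    if not full_text:
        # Covers both None and the empty string.
        return None
    lower = datetime.strptime(min_date, "%d-%m-%Y")
    upper = datetime.strptime(max_date, "%d-%m-%Y")
    for day, month, year in NUMERIC_DATE.findall(full_text):
        try:
            candidate = datetime(int(year), int(month), int(day))
        except ValueError:
            # Matched the pattern but is not a real date, e.g. 31-02-2020.
            continue
        if lower <= candidate <= upper:
            return candidate
    return None
```

Bounding the result by min_date and max_date helps discard strings that merely look like dates, such as reference numbers or dates far outside the period the documents can come from.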