src.preprocessing.extract_important_sentences

A TransformationBlock that extracts the most important sentences from the data.

Classes

ExtractImportantSentences()

A TransformationBlock that extracts the most important sentences from the data.

class src.preprocessing.extract_important_sentences.ExtractImportantSentences

A TransformationBlock that extracts the most important sentences from the data.

Expects a pandas dataframe with a full_text column and returns the most important sentences in a summary column.

custom_transform(data, **transform_args)

Extract the most important sentences from the data.

Parameters:
  • data (DataFrame) – A pandas dataframe with a full_text column.

  • transform_args (Never) – [UNUSED] Additional keyword arguments.

Return type:

DataFrame

Returns:

A dataframe with the most important sentences in a summary column.

merge_whitespace(data)

Merge consecutive whitespace (newlines and spaces) in the data.

Parameters:

data (DataFrame) – A dataframe with a full_text column.

Return type:

DataFrame

Returns:

The dataframe with merged newlines and spaces.
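
The whitespace merging described above can be sketched on a single string. This is a minimal illustration, not the block's actual implementation: the helper name merge_whitespace_text and the choice to trim leading and trailing whitespace are assumptions.

```python
import re

# Assumption: every run of whitespace (spaces, tabs, newlines) collapses
# into one space. The real method may treat newlines differently.
_WHITESPACE_RUN = re.compile(r"\s+")


def merge_whitespace_text(text: str) -> str:
    """Collapse whitespace runs into single spaces and trim both ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


print(merge_whitespace_text("First line.\n\n  Second   line.\t"))
# prints "First line. Second line."
```

On a dataframe, a helper like this could be applied row by row, for example with data["full_text"].apply(merge_whitespace_text). Which column the real method writes the result to is not stated here.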

tokenize_sentences(data)

Split the text in the data into sentences.

Parameters:

data (DataFrame) – A dataframe with a filtered_text column.

Return type:

DataFrame

Returns:

A dataframe in which each filtered_text entry is a list of sentences.
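
To illustrate the shape of the output, here is a minimal sentence splitter for a single string. It is only a sketch: the tokenizer the block actually uses is not documented here, and the regular expression and the name split_sentences are assumptions.

```python
import re
from typing import List

# Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the end punctuation."""
    return [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]


print(split_sentences("It works. Does it? Yes!"))
# prints ['It works.', 'Does it?', 'Yes!']
```

A rule this simple also splits after abbreviations such as "Dr.", which is one reason a dedicated sentence tokenizer is usually preferred in practice.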

adjust_summary_size(num_sentences)

Dynamically adjust the number of sentences for the summary.

Parameters:

num_sentences (int) – The number of sentences to summarize.

Return type:

int

Returns:

The number of sentences for the LexRank summary.
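
The sizing rule itself is not documented here. The sketch below shows one possible heuristic, purely as an illustration: the function name, the "one sentence in three" ratio, and the lower and upper bounds are all hypothetical and may differ from what the block really does.

```python
def adjust_summary_size_sketch(
    num_sentences: int,
    keep_one_in: int = 3,   # hypothetical: keep roughly a third of the input
    min_size: int = 1,      # hypothetical lower bound
    max_size: int = 10,     # hypothetical upper bound
) -> int:
    """Pick a summary length that grows with the input but stays bounded."""
    if num_sentences <= 0:
        return 0
    # Ceiling division without floating point.
    target = -(-num_sentences // keep_one_in)
    bounded = max(min_size, min(max_size, target))
    # Never ask for more sentences than the input contains.
    return min(num_sentences, bounded)


for n in (2, 10, 100):
    print(n, adjust_summary_size_sketch(n))
# prints "2 1", "10 4" and "100 10" on separate lines
```

Scaling the summary with the input keeps short texts from being returned nearly whole while capping the length for long ones.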

extract_important_sentences(data)

Extract the important sentences from the data.

Parameters:

data (DataFrame) – A dataframe with a filtered_text column in which each entry is a list of sentences.

Return type:

DataFrame

Returns:

A dataframe with the most important sentences in the summary column.
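
The summary is described as a LexRank summary. As a rough illustration of that idea, the sketch below ranks sentences by running a damped random walk over a sentence-similarity graph. It is a simplified, standard-library version and not the block's implementation: the bag-of-words cosine similarity (instead of TF-IDF), the damping value, the iteration count, and the names lexrank_scores and top_sentences are all assumptions.

```python
import math
import re
from collections import Counter
from typing import List


def _bag_of_words(sentence: str) -> Counter:
    # Lowercased word counts; LexRank proper uses TF-IDF weights.
    return Counter(re.findall(r"\w+", sentence.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(
    sentences: List[str], damping: float = 0.85, iterations: int = 50
) -> List[float]:
    """Score sentences by centrality in the similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    vectors = [_bag_of_words(s) for s in sentences]
    sim = [[_cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    # Row-normalize similarities into transition probabilities.
    transition = []
    for row in sim:
        total = sum(row)
        transition.append([v / total for v in row] if total else [1.0 / n] * n)
    # Power iteration with damping, as in PageRank.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


def top_sentences(sentences: List[str], k: int) -> List[str]:
    """Return the k highest-scoring sentences in their original order."""
    scores = lexrank_scores(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]


example = [
    "Solar panels convert sunlight.",
    "Solar panels and wind turbines produce electricity.",
    "Wind turbines produce electricity.",
]
# The middle sentence shares words with both others, so it ranks first.
print(top_sentences(example, 1))
```

Here k would be the value returned by adjust_summary_size, and the selected sentences would be joined to form the summary column entry; how the block joins them is not specified in this page.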