src.preprocessing

Preprocessing package.

Modules

src.preprocessing.cluster_documents

Compute the membership vectors for each cluster.

src.preprocessing.cluster_explainer

Generate a summary of the clusters.

src.preprocessing.compute_layout

Compute the x and y coordinates for the nodes in the graph, based on the story each row belongs to.

src.preprocessing.compute_topical_distributions

Generate topical distributions.

src.preprocessing.create_events

Cluster the documents based on time and event similarity.

src.preprocessing.extract_dates_regex

Extract the creation dates from the full text of documents.
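
As an illustration of regex-based date extraction, here is a minimal sketch that is not taken from this module. The pattern `ISO_DATE` and the function `extract_first_date` are hypothetical names, and the sketch assumes ISO-style dates (YYYY-MM-DD); the actual module may recognise other formats.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical pattern: ISO-style dates such as 2021-03-15.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_first_date(text: str) -> Optional[date]:
    """Return the first valid ISO-style date found in text, or None."""
    for match in ISO_DATE.finditer(text):
        year, month, day = map(int, match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            # Skip strings that look like dates but are not valid calendar days.
            continue
    return None
```

Validating each match with `datetime.date` keeps strings that merely look like dates from being treated as creation dates.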

src.preprocessing.extract_important_sentences

A TransformationBlock that extracts the most important sentences from the data.

src.preprocessing.filter_redundant_edges

Filter redundant edges from the data.
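
One common reading of "redundant" is an edge whose endpoints are already connected by a longer path. The sketch below assumes that reading and assumes the graph is directed and acyclic; the module itself may use a different criterion. The name `drop_implied_edges` is hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def drop_implied_edges(edges: Iterable[Edge]) -> List[Edge]:
    """Keep only edges (s, t) for which no longer path from s to t exists.

    Assumes a directed acyclic graph (an assumption of this sketch).
    """
    edge_list = list(dict.fromkeys(edges))  # de-duplicate, keep input order
    successors: Dict[Hashable, Set[Hashable]] = {}
    for source, target in edge_list:
        successors.setdefault(source, set()).add(target)

    def reachable_indirectly(source: Hashable, target: Hashable) -> bool:
        # Depth-first search that starts from every successor except target,
        # so the direct edge (source, target) is never used.
        stack = [n for n in successors.get(source, ()) if n != target]
        seen = set(stack)
        while stack:
            node = stack.pop()
            for nxt in successors.get(node, ()):
                if nxt == target:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    return [(s, t) for s, t in edge_list if not reachable_indirectly(s, t)]
```

For example, given the edges a→b, b→c, and a→c, the edge a→c is dropped because c is already reachable from a through b.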

src.preprocessing.find_storylines

Find the storylines in the data.

src.preprocessing.generate_roberta_embedding

Generate the embeddings for the data using a RoBERTa model.

src.preprocessing.impute_dates

Impute missing dates by copying the date of the document with the most similar embedding.
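
Nearest-embedding imputation can be sketched as follows. This assumes cosine similarity as the similarity measure and plain Python lists as inputs; neither is confirmed by the module, and the names `cosine` and `impute_dates` are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def impute_dates(
    embeddings: Sequence[Sequence[float]],
    dates: Sequence[Optional[str]],
) -> List[Optional[str]]:
    """Fill each missing date with the date of the most similar dated document."""
    known = [i for i, d in enumerate(dates) if d is not None]
    result = list(dates)
    if not known:
        return result  # nothing to copy from
    for i, d in enumerate(dates):
        if d is None:
            # Index of the dated document whose embedding is closest to row i.
            best = max(known, key=lambda j: cosine(embeddings[i], embeddings[j]))
            result[i] = dates[best]
    return result
```

Only documents that already have a date are considered as donors, so an imputed date is never propagated to another document.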

src.preprocessing.linear_programming

Run linear programming on the clusters.

src.preprocessing.pdf_to_text

Extract text from PDFs.