src.preprocessing.compute_topical_distributions

Generates topical distributions for documents using a pretrained LDA model.

Classes

TopicalDistribution(...)

Pipeline block that computes topical distributions for documents.

class src.preprocessing.compute_topical_distributions.TopicalDistribution(pretrained_model_name_or_path, dictionary_name_or_path)[source]

Initialize the topical distribution pipeline block.

Parameters:
  • pretrained_model_name_or_path (str) – LDA model name or path.

  • dictionary_name_or_path (str) – Corpus dictionary name or path.

Attributes:
  • _pretrained_lda – Loaded LDA model instance.

  • _lemmatizer – WordNetLemmatizer instance.

  • _dict – Dictionary of the corpus.

pretrained_model_name_or_path: str
dictionary_name_or_path: str
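
A minimal usage sketch for constructing the block follows. The file paths are placeholders, not paths from the project; substitute the locations of the pretrained LDA model and the corpus dictionary.

    from src.preprocessing.compute_topical_distributions import TopicalDistribution

    # Hypothetical paths; point these at the actual pretrained LDA model
    # and corpus dictionary files.
    block = TopicalDistribution(
        pretrained_model_name_or_path="models/lda.model",
        dictionary_name_or_path="models/corpus.dict",
    )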
custom_transform(data, **transform_args)[source]

Ensure the input DataFrame has the relevant columns.

Then compute the topical distributions for each document.

Parameters:
  • data (DataFrame) – The input dataframe.

  • transform_args (Never) – [UNUSED] Additional keyword arguments.

Return type:

DataFrame

Returns:

The transformed data.
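
A hedged example of calling this method, continuing the construction sketch above. The column name "text" is an assumption for illustration; the required columns are validated by custom_transform itself.

    import pandas as pd

    # Assumed input layout: one document per row in a text column.
    df = pd.DataFrame({"text": ["first document ...", "second document ..."]})
    result = block.custom_transform(df)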

preprocess_documents(docs)[source]

Preprocess a list of documents.

Tokenize, remove stopwords, and lemmatize the documents.

Parameters:

docs (Iterable[str]) – The list of document texts.

Return type:

list[list[str]]

Returns:

The preprocessed documents.
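
A minimal sketch of the tokenize / stopword-removal / lemmatization steps this method describes, written with NLTK. It is an approximation, not the project's implementation, and assumes the punkt, stopwords, and wordnet NLTK data packages are available.

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    def preprocess_documents_sketch(docs):
        """Tokenize, drop stopwords, and lemmatize each document."""
        lemmatizer = WordNetLemmatizer()
        stop_words = set(stopwords.words("english"))
        processed = []
        for doc in docs:
            tokens = word_tokenize(doc.lower())
            kept = [t for t in tokens if t.isalpha() and t not in stop_words]
            processed.append([lemmatizer.lemmatize(t) for t in kept])
        return processed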

get_topic_dist(doc_bow)[source]

Compute the topical distribution for a given document.

Parameters:

doc_bow (list[tuple[int, int]]) – BoW representation of the document.

Return type:

ndarray[Any, dtype[float64]]

Returns:

The topical distribution.
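
A sketch of how a topical distribution can be derived from a BoW document with Gensim, which the LDA model, dictionary, and BoW inputs suggest; the use of Gensim and the model/dictionary paths are assumptions, not confirmed by this module's documentation.

    import numpy as np
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    lda = LdaModel.load("models/lda.model")             # assumed path
    dictionary = Dictionary.load("models/corpus.dict")  # assumed path

    doc_bow = dictionary.doc2bow(["topic", "model", "inference"])

    # Convert the sparse (topic_id, probability) pairs into a dense
    # float64 vector over all topics.
    dist = np.zeros(lda.num_topics, dtype=np.float64)
    for topic_id, prob in lda.get_document_topics(doc_bow, minimum_probability=0.0):
        dist[topic_id] = prob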