src.preprocessing.pdf_to_text

Extract text from PDFs.

Classes

PdfToText(files) – Extract the text from a list of PDF files.

class src.preprocessing.pdf_to_text.PdfToText(files)

Extract the text from a list of PDF files.

Parameters: files (list[Path]) – The list of PDF files to extract the text from.

custom_transform(data, **transform_args)

Extract text from a list of PDF files.

Parameters:

Return type: DataFrame

Returns: A DataFrame with the extracted text.
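The pattern described here, one extraction call per file with the results gathered into a table, can be sketched as follows. This is a minimal illustration and not the class's actual implementation: the helper `collect_text_rows`, the stand-in `fake_extract`, and the column names `file` and `text` are all assumptions, since the reference does not state the DataFrame's columns.

```python
from pathlib import Path
from typing import Callable, Iterable


def collect_text_rows(
    files: Iterable[Path],
    extract: Callable[[Path], str],
) -> list[dict[str, str]]:
    """Build one row per file: the file name and its extracted text.

    Hypothetical helper; the column names are assumptions.
    """
    rows = []
    for filepath in files:
        rows.append({"file": filepath.name, "text": extract(filepath)})
    return rows


def fake_extract(filepath: Path) -> str:
    # Stand-in for a real PDF reader so the sketch runs without any PDFs.
    return f"text of {filepath.stem}"


rows = collect_text_rows([Path("a.pdf"), Path("b.pdf")], fake_extract)
# With pandas installed, pandas.DataFrame(rows) turns these rows into a
# two-column table, one row per input file.
```

Passing the extractor in as a callable keeps the row-building logic separate from the PDF library, which makes it easy to test without real documents.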

extract_text_for_doc(filepath)

Extract the text from a PDF file.
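A single-document extraction step could look roughly like the sketch below. The reference does not say which PDF library the project uses; `pypdf` is assumed here purely for illustration, and `join_page_texts` is a hypothetical helper, not part of the documented API.

```python
from pathlib import Path
from typing import Iterable, Optional


def join_page_texts(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, treating pages with no extractable text as empty."""
    return "\n".join(text or "" for text in page_texts)


def extract_text_for_doc(filepath: Path) -> str:
    """Return the text of every page in one PDF, separated by newlines."""
    # Assumption: pypdf is the backend. Imported lazily so that
    # join_page_texts stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(filepath)
    return join_page_texts(page.extract_text() for page in reader.pages)
```

Scanned PDFs that contain only images yield empty strings with this approach; recovering their text would need an OCR step, which is outside this sketch.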