src.preprocessing.pdf_to_text

Extract text from PDFs.

Classes

PdfToText(files)

Extract the text from a list of PDF files.

class src.preprocessing.pdf_to_text.PdfToText(files)

Extract the text from a list of PDF files.

Parameters:

files (list[Path]) – The list of PDF files to extract the text from.
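
The files argument is a plain list of pathlib.Path objects. A minimal sketch of one way to build such a list, assuming the PDFs sit directly inside a single directory; the helper name collect_pdf_files is hypothetical and not part of this module:

```python
from pathlib import Path


def collect_pdf_files(directory: Path) -> list[Path]:
    """Return the PDF files directly inside `directory`, sorted by path.

    Hypothetical helper for illustration; it is not part of this module.
    """
    return sorted(
        path
        for path in directory.iterdir()
        # Skip sub-directories and match the extension case-insensitively.
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

The resulting list could then be passed to the constructor, for example PdfToText(collect_pdf_files(Path("data/pdfs"))), where data/pdfs is a placeholder path.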

files: list[Path]

custom_transform(data, **transform_args)

Extract text from a list of PDF files.

Parameters:
  • data (DataFrame) – The data to transform.

  • transform_args (Never) – [UNUSED] Additional keyword arguments.

Return type:

DataFrame

Returns:

A DataFrame with the extracted text.

extract_text_for_doc(filepath)

Extract the text from a PDF file.

Parameters:

filepath (Path) – The path to the PDF file.

Return type:

dict[str, Any]

Returns:

A dictionary containing the extracted text.
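
The dict[str, Any] return type suggests one record per document. The sketch below illustrates that pattern under stated assumptions: the key names (file_name, num_pages, text) and the read_pages callable are hypothetical stand-ins, not the module's actual schema or PDF backend.

```python
from pathlib import Path
from typing import Any, Callable

# Hypothetical stand-in for a PDF backend: takes a path, returns one string per page.
PageReader = Callable[[Path], list[str]]


def extract_record(filepath: Path, read_pages: PageReader) -> dict[str, Any]:
    """Build one record for a document (the key names are assumptions)."""
    pages = read_pages(filepath)
    return {
        "file_name": filepath.name,
        "num_pages": len(pages),
        "text": "\n".join(pages),
    }


def extract_records(files: list[Path], read_pages: PageReader) -> list[dict[str, Any]]:
    """Build one record per file, preserving the input order."""
    return [extract_record(filepath, read_pages) for filepath in files]
```

A list of such records maps naturally onto tabular rows, which is one plausible way a method like custom_transform could assemble its DataFrame result.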