Preprocess a Woogle Dump

A Woogle dump consists of the following files:

  • woo_dossiers.csv, which contains the metadata of the dossiers.

  • woo_documents.csv, which contains the metadata of the documents.

  • woo_bodytext.csv, which contains the body text of the documents.

To preprocess a Woogle dump, place these files in data/extracted and run split_woogle_dump_per_dossier.py. This creates individual .pkl files for each dossier with at least 20 documents in data/raw. This script is not very efficient, so it may take a few hours to run.

The paths and the minimum number of documents per dossier can be configured in conf/split_woogle_dump_per_dossier.yaml.