Preprocess a Woogle Dump
A Woogle dump consists of the following files:
- woo_dossiers.csv, which contains the metadata of the dossiers.
- woo_documents.csv, which contains the metadata of the documents.
- woo_bodytext.csv, which contains the body text of the documents.
To preprocess a Woogle dump, place these files in data/extracted and run split_woogle_dump_per_dossier.py.
This creates an individual .pkl file in data/raw for each dossier that has at least 20 documents.
This script is not very efficient, so it may take a few hours to run.
The paths and the minimum number of documents per dossier can be configured in conf/split_woogle_dump_per_dossier.yaml.
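
The splitting step can be pictured roughly as follows. This is a minimal sketch using only the Python standard library, not the actual contents of split_woogle_dump_per_dossier.py. The function name split_per_dossier, the column names document_id, dossier_id and bodytext, and the choice to pickle a list of dicts are all assumptions made for illustration; the real dump and script may use different names and a different pickle layout. The sketch also assumes one body-text row per document and leaves out woo_dossiers.csv for brevity.

```python
import csv
import pickle
from collections import defaultdict
from pathlib import Path


def split_per_dossier(documents_csv, bodytext_csv, out_dir, min_documents=20):
    """Write one .pkl per dossier that has at least `min_documents` documents.

    Returns the ids of the dossiers that were written.
    """
    # Hypothetical column names: "document_id", "dossier_id", "bodytext".
    # Map each document id to its body text (assumes one row per document).
    bodytext = {}
    with open(bodytext_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            bodytext[row["document_id"]] = row["bodytext"]

    # Group the document metadata rows by dossier, attaching the body text.
    per_dossier = defaultdict(list)
    with open(documents_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            row["bodytext"] = bodytext.get(row["document_id"], "")
            per_dossier[row["dossier_id"]].append(row)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for dossier_id, docs in per_dossier.items():
        # Skip dossiers below the configured minimum size.
        if len(docs) < min_documents:
            continue
        with open(out_dir / f"{dossier_id}.pkl", "wb") as f:
            pickle.dump(docs, f)
        written.append(dossier_id)
    return written
```

Under these assumptions, a call such as split_per_dossier("data/extracted/woo_documents.csv", "data/extracted/woo_bodytext.csv", "data/raw") would mirror the default paths and threshold described above. Each resulting file can be read back with pickle.load.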