To reproduce the authors' results on the CNN/Daily Mail dataset, you first need to download both the CNN and Daily Mail datasets from Kyunghyun Cho's website (the links next to "Stories") into the same folder. Then download and uncompress the preprocessed archive by running:
```bash
wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm.tgz
tar -xzvf cnn_dm.tgz
```
This should create a directory called `cnn_dm/` with files like `test.source`.
To use your own data, copy that file format: each article to be summarized is on its own line.
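If you are assembling those files yourself, here is a minimal sketch of the expected layout; the texts are placeholders and the output paths simply mirror the `cnn_dm/` naming:

```python
# Write articles and reference summaries in the one-example-per-line format
# used by cnn_dm/ (test.source / test.target). Contents here are placeholders.
articles = [
    "First article, with any internal newlines removed ...",
    "Second article ...",
]
summaries = [
    "Reference summary of the first article.",
    "Reference summary of the second article.",
]

with open("test.source", "w", encoding="utf-8") as src, \
        open("test.target", "w", encoding="utf-8") as tgt:
    for article, summary in zip(articles, summaries):
        # One example per line, so strip newlines inside each text.
        src.write(article.replace("\n", " ").strip() + "\n")
        tgt.write(summary.replace("\n", " ").strip() + "\n")
```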
To create summaries for each article in the dataset, run:
```bash
python evaluate_cnn.py <path_to_test.source> cnn_test_summaries.txt
```
The default batch size of 8 fits in 16 GB of GPU memory, but it may need to be adjusted for your system.
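Under the hood the script performs a generation loop roughly like the following. This is only a sketch, assuming a recent `transformers` release; the checkpoint name and generation parameters are my assumptions, not necessarily the script's exact settings:

```python
# Rough sketch of a generation loop like the one in evaluate_cnn.py (not the
# script itself). Checkpoint, beam settings, and max lengths are assumptions;
# adjust batch_size to your GPU.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device)

with open("test.source", encoding="utf-8") as f:
    articles = [line.strip() for line in f]

batch_size = 8  # the default mentioned above
summaries = []
for i in range(0, len(articles), batch_size):
    batch = tokenizer(articles[i:i + batch_size], return_tensors="pt",
                      truncation=True, max_length=1024, padding=True).to(device)
    generated = model.generate(batch.input_ids,
                               attention_mask=batch.attention_mask,
                               num_beams=4, max_length=142, early_stopping=True)
    summaries.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))

with open("cnn_test_summaries.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(summaries) + "\n")
```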
Run/modify `run_train.sh`.
The core model is in `src/transformers/modeling_bart.py`. This directory only contains examples.
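For orientation only, here is a minimal sketch of a single fine-tuning step with the model class. It assumes a recent `transformers` release (where the `labels` argument is accepted); the checkpoint, sequence lengths, and learning rate are placeholders, not the settings used by `run_train.sh`:

```python
# Minimal sketch of one fine-tuning step with BartForConditionalGeneration.
# Checkpoint, lengths, and learning rate are illustrative; run_train.sh is the
# supported entry point.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

article = "Some long news article ..."    # placeholder source text
reference = "Its reference summary."      # placeholder target text

inputs = tokenizer([article], return_tensors="pt", truncation=True, max_length=1024)
labels = tokenizer([reference], return_tensors="pt", truncation=True, max_length=142).input_ids

# The model builds decoder inputs from the labels internally and returns the LM loss.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```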
To compute ROUGE scores against the reference summaries, first install Java and Stanford CoreNLP, and define a helper for PTB tokenization:

```bash
# Helper: pipe a file through Stanford CoreNLP's PTBTokenizer.
ptb_tokenize () {
    cat $1 | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > $2
}

# Install Java, download Stanford CoreNLP, and put its jars on the CLASSPATH.
sudo apt install openjdk-8-jre-headless
sudo apt-get install ant
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip
cd stanford-corenlp-full-2018-10-05
export CLASSPATH=stanford-corenlp-3.9.2.jar:stanford-corenlp-3.9.2-models.jar
```
Then run `ptb_tokenize` on `test.target` and your generated hypotheses.
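If you would rather drive that step from Python, the shell helper can be wrapped with `subprocess`; the output file names below are hypothetical, and `CLASSPATH` must already be exported as shown above:

```python
# Python wrapper around the ptb_tokenize shell helper above: pipe a file
# through Stanford CoreNLP's PTBTokenizer. CLASSPATH must point at the jars.
import subprocess

def ptb_tokenize(src_path: str, out_path: str) -> None:
    with open(src_path, "rb") as src, open(out_path, "wb") as out:
        subprocess.run(
            ["java", "edu.stanford.nlp.process.PTBTokenizer",
             "-ioFileList", "-preserveLines"],
            stdin=src, stdout=out, check=True,
        )

# Hypothetical output names, reused in the ROUGE step below.
ptb_tokenize("test.target", "test.target.tokenized")
ptb_tokenize("cnn_test_summaries.txt", "cnn_test_summaries.tokenized")
```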
Install `files2rouge` following the instructions [here](https://github.com/pltrdy/files2rouge). I also needed to run `sudo apt-get install libxml-parser-perl`. Then compute the scores:
```python
from files2rouge import files2rouge
from files2rouge import settings

files2rouge.run(<path_to_tokenized_hypo>,
                <path_to_tokenized_target>,
                saveto='rouge_output.txt')
```
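For instance, a concrete (hypothetical) invocation using the tokenized file names from the sketch above, writing the results to `rouge_output.txt`:

```python
# Hypothetical paths: the PTB-tokenized hypotheses and references produced above.
from files2rouge import files2rouge

files2rouge.run("cnn_test_summaries.tokenized",
                "test.target.tokenized",
                saveto="rouge_output.txt")
```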