Description:
BookSum: A Collection of Datasets for Long-form Narrative Summarization
This implementation currently only supports book and chapter summaries.
GitHub: https://github.com/salesforce/booksum
Homepage: https://github.com/salesforce/booksum
Source code:
tfds.datasets.booksum.Builder
Versions:
1.0.0 (default): Initial release.
Download size:
Unknown size
Manual download instructions: This dataset requires you to download the source data manually into `download_config.manual_dir` (defaults to `~/tensorflow_datasets/downloads/manual/`):
1) Go to https://github.com/salesforce/booksum and run steps 1-3. Place the whole `booksum` git project in the manual folder.
2) Download the chapterized books from https://storage.cloud.google.com/sfr-books-dataset-chapters-research/all_chapterized_books.zip and unzip them into the manual folder.
The manual folder should contain the following directories:
- `booksum/`
- `all_chapterized_books/`
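Before building the dataset, it can help to confirm that the manual folder matches the layout above. The following is a minimal sketch using only the Python standard library; the helper name `missing_manual_dirs` is hypothetical and is not part of TFDS.

```python
from pathlib import Path

# Directory names taken from the manual download instructions above.
REQUIRED_DIRS = ("booksum", "all_chapterized_books")


def missing_manual_dirs(manual_dir):
    """Return the required sub-directories that are absent from manual_dir.

    Hypothetical helper for illustration; TFDS does not provide it.
    """
    root = Path(manual_dir).expanduser()
    # Keep only the names that do not exist as directories under root.
    return [name for name in REQUIRED_DIRS if not (root / name).is_dir()]
```

Calling `missing_manual_dirs("~/tensorflow_datasets/downloads/manual/")` returns an empty list when both directories are in place, and otherwise lists the ones still to be added.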
Auto-cached (documentation): Yes (test, validation), Only when `shuffle_files=False` (train)
Feature structure:
FeaturesDict({
    'document': Text(shape=(), dtype=string),
    'summary': Text(shape=(), dtype=string),
})
Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| document | Text | | string | |
| summary | Text | | string | |
Supervised keys (See `as_supervised` doc): ('document', 'summary')
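As a rough illustration of how the supervised keys are used, the sketch below loads one config with `as_supervised=True`, so each element is a `(document, summary)` pair. It assumes `tensorflow_datasets` is installed and that the manual download described on this page has been completed; the helper names `dataset_name` and `load_booksum` are hypothetical and only serve the example.

```python
CONFIGS = ("book", "chapter")  # the two configs listed on this page


def dataset_name(config="book"):
    """Build the TFDS name for a BookSum config, e.g. 'booksum/chapter'."""
    if config not in CONFIGS:
        raise ValueError(f"unknown BookSum config: {config!r}")
    return f"booksum/{config}"


def load_booksum(config="book", split="validation", manual_dir=None):
    """Load one split as (document, summary) pairs (hypothetical helper)."""
    # Imported lazily so the module can be inspected without TFDS installed.
    import tensorflow_datasets as tfds

    prepare_kwargs = None
    if manual_dir is not None:
        # Point TFDS at a non-default manual download folder.
        prepare_kwargs = {
            "download_config": tfds.download.DownloadConfig(manual_dir=manual_dir)
        }
    return tfds.load(
        dataset_name(config),
        split=split,
        as_supervised=True,  # yield (document, summary) tuples
        download_and_prepare_kwargs=prepare_kwargs,
    )
```

With the data prepared, `for document, summary in load_booksum("chapter", "test").take(1): ...` iterates over pairs of scalar string tensors.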
Figure (tfds.show_examples): Not supported.
Citation:
@article{kryscinski2021booksum,
title={BookSum: A Collection of Datasets for Long-form Narrative Summarization},
author={Wojciech Kry{\'s}ci{\'n}ski and Nazneen Rajani and Divyansh Agarwal and Caiming Xiong and Dragomir Radev},
year={2021},
eprint={2105.08209},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
booksum/book (default config)
Config description: Book-level summarization
Dataset size:
208.81 MiB
Splits:
Split | Examples |
---|---|
'test' | 46 |
'train' | 312 |
'validation' | 45 |
booksum/chapter
Config description: Chapter-level summarization
Dataset size:
216.71 MiB
Splits:
Split | Examples |
---|---|
'test' | 1,083 |
'train' | 6,524 |
'validation' | 891 |