booksum

Description:

BookSum: A Collection of Datasets for Long-form Narrative Summarization

This implementation currently only supports book and chapter summaries.

GitHub: https://github.com/salesforce/booksum

Additional Documentation: Explore on Papers With Code
Homepage: https://github.com/salesforce/booksum
Source code: tfds.datasets.booksum.Builder
Versions:
- 1.0.0 (default): Initial release.
Download size: Unknown size
Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):

1) Go to https://github.com/salesforce/booksum and run steps 1-3. Place the whole booksum git project in the manual folder.
2) Download the chapterized books from https://storage.cloud.google.com/sfr-books-dataset-chapters-research/all_chapterized_books.zip and unzip them into the manual folder.

The manual folder should contain the following directories:

- `booksum/`
- `all_chapterized_books/`
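
Before preparing the dataset, it can help to confirm that the manual folder has the layout listed above. The following is a minimal sketch using only the Python standard library; the helper name `missing_manual_dirs` is hypothetical and not part of TFDS, and the path in the usage section is simply the default mentioned in the instructions.

```python
from pathlib import Path

# Directory names taken from the list above.
EXPECTED_DIRS = ("booksum", "all_chapterized_books")


def missing_manual_dirs(manual_dir):
    """Return the expected directories that are absent from manual_dir.

    Hypothetical helper for illustration; it is not a TFDS function.
    """
    root = Path(manual_dir).expanduser()
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Default manual_dir from the instructions above; adjust if yours differs.
    missing = missing_manual_dirs("~/tensorflow_datasets/downloads/manual")
    print("Missing directories:", missing or "none")
```

An empty result only means both directories exist; it does not verify that steps 1-3 of the BookSum repository were completed.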

Auto-cached: Yes (test, validation), only when shuffle_files=False (train)
Feature structure:

FeaturesDict({
    'document': Text(shape=(), dtype=string),
    'summary': Text(shape=(), dtype=string),
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
document	Text	string
summary	Text	string

Supervised keys (See as_supervised doc): ('document', 'summary')
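
Because the supervised keys are `('document', 'summary')`, loading with `as_supervised=True` should yield `(document, summary)` tuples instead of dictionaries. The sketch below assumes the manual download has already been completed; `booksum_name` and `load_booksum` are hypothetical wrappers written for this example, while `tfds.load` and `tfds.download.DownloadConfig` are the TFDS calls they rely on.

```python
def booksum_name(config="book"):
    """Build the TFDS dataset name for a BookSum config ('book' or 'chapter')."""
    if config not in ("book", "chapter"):
        raise ValueError(f"Unknown BookSum config: {config!r}")
    return f"booksum/{config}"


def load_booksum(config="book", split="validation", manual_dir=None):
    """Load one split as (document, summary) pairs.

    Hypothetical convenience wrapper around tfds.load.
    """
    # Imported lazily so booksum_name works without TensorFlow Datasets installed.
    import tensorflow_datasets as tfds

    kwargs = {}
    if manual_dir is not None:
        # Point TFDS at a manual folder other than the default location.
        kwargs["download_and_prepare_kwargs"] = {
            "download_config": tfds.download.DownloadConfig(manual_dir=manual_dir)
        }
    return tfds.load(booksum_name(config), split=split, as_supervised=True, **kwargs)


# Example usage (requires the manual download described in this page):
#   ds = load_booksum("chapter", split="validation")
#   for document, summary in ds.take(1).as_numpy_iterator():
#       # Both values are UTF-8 encoded bytes.
#       print(len(document), len(summary))
```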
Figure (tfds.show_examples): Not supported.
Citation:

@article{kryscinski2021booksum,
      title={BookSum: A Collection of Datasets for Long-form Narrative Summarization},
      author={Wojciech Kry{\'s}ci{\'n}ski and Nazneen Rajani and Divyansh Agarwal and Caiming Xiong and Dragomir Radev},
      year={2021},
      eprint={2105.08209},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

booksum/book (default config)

Config description: Book-level summarization
Dataset size: 208.81 MiB
Splits:

Split	Examples
`'test'`	46
`'train'`	312
`'validation'`	45

booksum/chapter

Config description: Chapter-level summarization
Dataset size: 216.71 MiB
Splits:

Split	Examples
`'test'`	1,083
`'train'`	6,524
`'validation'`	891
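
To compare how the two configs are partitioned, the split tables can be turned into totals and proportions. This is a small illustrative sketch: the counts are copied from the tables on this page, and the names `SPLITS` and `split_fractions` are made up for the example.

```python
# Split sizes copied from the tables for the two configs.
SPLITS = {
    "book": {"test": 46, "train": 312, "validation": 45},
    "chapter": {"test": 1083, "train": 6524, "validation": 891},
}


def split_fractions(counts):
    """Return each split's share of the total number of examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


for config, counts in SPLITS.items():
    total = sum(counts.values())
    shares = ", ".join(
        f"{name}={frac:.1%}" for name, frac in split_fractions(counts).items()
    )
    print(f"{config}: {total} examples ({shares})")
```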