- Description:
This large-scale media interview dataset contains 463.6K transcripts with abstractive summaries, collected from interview transcripts and overview / topic descriptions from NPR and CNN.
Please restrict your use of this dataset to research purposes only.
Please cite our paper: MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization
Ethics
We have used only the publicly available transcript data from the media sources and adhere to their only-for-research-purpose guideline.
As media and guests may have biased views, the transcripts and summaries will likely contain them. The content of the transcripts and summaries reflects only the views of the media and guests, and should be viewed with discretion.
Homepage: https://github.com/zcgzcgzcg1/MediaSum
Source code: tfds.datasets.media_sum.Builder
Versions:
- 1.0.0 (default): Initial release.
Download size: Unknown size
Dataset size: 4.11 GiB
Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
manual_dir should contain the files:
- news_dialogue.json
- train_val_test_split.json
The files can be downloaded and extracted from the dataset's GitHub page: https://github.com/zcgzcgzcg1/MediaSum/tree/main/data
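Before preparing the dataset, it can help to confirm that both files are where TFDS will look for them. The following is a minimal sketch, not part of the dataset's own tooling: `missing_manual_files` and `load_media_sum` are hypothetical helper names, and the loader assumes `tensorflow_datasets` is installed and that the dataset is registered under the name `media_sum`.

```python
from pathlib import Path

# File names taken from the manual download instructions above.
REQUIRED_FILES = ("news_dialogue.json", "train_val_test_split.json")

# Default manual directory stated in the instructions.
DEFAULT_MANUAL_DIR = Path("~/tensorflow_datasets/downloads/manual").expanduser()


def missing_manual_files(manual_dir=DEFAULT_MANUAL_DIR):
    """Return the required file names that are absent from manual_dir."""
    manual_dir = Path(manual_dir).expanduser()
    return [name for name in REQUIRED_FILES if not (manual_dir / name).is_file()]


def load_media_sum(manual_dir=DEFAULT_MANUAL_DIR, split="train"):
    """Hypothetical helper: load one split using a specific manual_dir."""
    # Imported lazily so the file check above works without TFDS installed.
    import tensorflow_datasets as tfds

    missing = missing_manual_files(manual_dir)
    if missing:
        raise FileNotFoundError(f"Missing in {manual_dir}: {missing}")

    config = tfds.download.DownloadConfig(manual_dir=str(manual_dir))
    return tfds.load(
        "media_sum",  # assumed registered dataset name
        split=split,
        download_and_prepare_kwargs={"download_config": config},
    )
```

Calling `missing_manual_files()` with no arguments checks the default directory; an empty list means both files are present.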
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 10,000 |
'train' | 443,596 |
'val' | 10,000 |
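As a quick sanity check, the three split sizes add up to the roughly 463.6K transcripts quoted in the description. The short sketch below only redoes that arithmetic from the numbers in the table; the variable names are illustrative.

```python
# Example counts copied from the Splits table.
SPLIT_SIZES = {"train": 443_596, "val": 10_000, "test": 10_000}

total = sum(SPLIT_SIZES.values())

# Share of the whole dataset that each split represents.
fractions = {name: size / total for name, size in SPLIT_SIZES.items()}

print(total)  # 463596
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```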
- Feature structure:
FeaturesDict({
'date': Text(shape=(), dtype=string),
'id': Text(shape=(), dtype=string),
'program': Text(shape=(), dtype=string),
'speaker': Sequence(Text(shape=(), dtype=string)),
'summary': Text(shape=(), dtype=string),
'url': Text(shape=(), dtype=string),
'utt': Sequence(Text(shape=(), dtype=string)),
})
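The `speaker` and `utt` features are both sequences of strings. The sketch below pairs them into dialogue turns, assuming the two sequences are aligned one-to-one (the i-th speaker said the i-th utterance); that alignment is an assumption, not something stated on this page. The helper names and the sample record are made up for illustration. Text read through `tfds.as_numpy` arrives as `bytes`, so the helper decodes it when needed.

```python
def _to_str(value):
    """Decode bytes (as returned by tfds.as_numpy) to str; pass str through."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def to_turns(example):
    """Pair each speaker with the utterance at the same index."""
    speakers = [_to_str(s) for s in example["speaker"]]
    utterances = [_to_str(u) for u in example["utt"]]
    if len(speakers) != len(utterances):
        raise ValueError("speaker and utt have different lengths")
    return list(zip(speakers, utterances))


def format_dialogue(example):
    """Render an example as one 'SPEAKER: utterance' line per turn."""
    return "\n".join(f"{s}: {u}" for s, u in to_turns(example))


# Made-up record with the same keys as the FeaturesDict above.
sample = {
    "id": "EXAMPLE-1",
    "speaker": ["HOST", b"GUEST"],
    "utt": ["Welcome to the program.", b"Thanks for having me."],
    "summary": "A host welcomes a guest.",
}
print(format_dialogue(sample))
```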
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
date | Text | | string | |
id | Text | | string | |
program | Text | | string | |
speaker | Sequence(Text) | (None,) | string | |
summary | Text | | string | |
url | Text | | string | |
utt | Sequence(Text) | (None,) | string | |
Supervised keys (See as_supervised doc): ('utt', 'summary')
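With `as_supervised=True`, `tfds.load` yields `(input, target)` tuples built from the supervised keys instead of full feature dictionaries. The sketch below mirrors that selection in plain Python and wraps the real call in a lazily importing helper. `to_supervised` and `load_supervised` are hypothetical names, the sample record is made up, and the loader assumes the dataset is registered as `media_sum` and has already been prepared.

```python
# Supervised keys as listed above: (input, target).
SUPERVISED_KEYS = ("utt", "summary")


def to_supervised(example, keys=SUPERVISED_KEYS):
    """Pick the (input, target) pair out of a full feature dictionary."""
    input_key, target_key = keys
    return example[input_key], example[target_key]


def load_supervised(split="val"):
    """Hypothetical helper: load a split as (utt, summary) tuples."""
    # Imported lazily; requires tensorflow_datasets and prepared data.
    import tensorflow_datasets as tfds

    return tfds.load("media_sum", split=split, as_supervised=True)


# Made-up record with the same keys as the FeaturesDict.
sample = {
    "id": "EXAMPLE-1",
    "speaker": ["HOST", "GUEST"],
    "utt": ["Welcome to the program.", "Thanks for having me."],
    "summary": "A host welcomes a guest.",
}
utterances, summary = to_supervised(sample)
```

Note that the supervised pair drops `speaker`, so models that need speaker labels would have to read the full feature dictionary instead.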
Figure (tfds.show_examples): Not supported.
- Citation:
@article{zhu2021mediasum,
title={MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization},
author={Zhu, Chenguang and Liu, Yang and Mei, Jie and Zeng, Michael},
journal={arXiv preprint arXiv:2103.06410},
year={2021}
}