
media_sum

Description:

This large-scale media interview dataset contains 463.6K transcripts with abstractive summaries, collected from interview transcripts and overview / topic descriptions from NPR and CNN.

Please restrict your usage of this dataset to research purposes only.

Please cite our paper: MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization

Ethics

We have used only the publicly available transcript data from the media sources and adhere to their only-for-research-purpose guideline.

As media and guests may have biased views, the transcripts and summaries will likely contain those biases. The content of the transcripts and summaries reflects only the views of the media and guests, and should be viewed with discretion.

Homepage: https://github.com/zcgzcgzcg1/MediaSum
Source code: tfds.datasets.media_sum.Builder
Versions:
- 1.0.0 (default): Initial release.
Download size: Unknown size
Dataset size: 4.11 GiB
Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/).
manual_dir should contain the following files:
- news_dialogue.json
- train_val_test_split.json

The files can be downloaded and extracted from the dataset's GitHub page: https://github.com/zcgzcgzcg1/MediaSum/tree/main/data
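
The manual setup above can be sketched in Python as follows. This is a minimal sketch, not the builder's own logic: the helper names `missing_manual_files` and `load_media_sum` are hypothetical, and it assumes the two JSON files have already been downloaded and extracted by hand. The TFDS import is deferred so the file check can run without TensorFlow installed.

```python
from pathlib import Path

# File names listed in the manual download instructions.
REQUIRED_FILES = ("news_dialogue.json", "train_val_test_split.json")

# Default manual directory named in the instructions.
DEFAULT_MANUAL_DIR = Path.home() / "tensorflow_datasets" / "downloads" / "manual"


def missing_manual_files(manual_dir=DEFAULT_MANUAL_DIR):
    """Return the required file names that are not present in manual_dir."""
    manual_dir = Path(manual_dir).expanduser()
    return [name for name in REQUIRED_FILES if not (manual_dir / name).is_file()]


def load_media_sum(split="train", manual_dir=DEFAULT_MANUAL_DIR):
    """Prepare media_sum from manually downloaded files and load one split."""
    # Imported lazily: only this function needs tensorflow_datasets.
    import tensorflow_datasets as tfds

    missing = missing_manual_files(manual_dir)
    if missing:
        raise FileNotFoundError(f"Missing in {manual_dir}: {missing}")

    # Point the builder at the directory holding the manual files.
    config = tfds.download.DownloadConfig(manual_dir=str(manual_dir))
    builder = tfds.builder("media_sum")
    builder.download_and_prepare(download_config=config)
    return builder.as_dataset(split=split)
```

Calling `missing_manual_files()` before `load_media_sum()` gives a quick report of which files still need to be copied into place.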

Auto-cached (documentation): No
Splits:

Split	Examples
`'test'`	10,000
`'train'`	443,596
`'val'`	10,000

Feature structure:

FeaturesDict({
    'date': Text(shape=(), dtype=string),
    'id': Text(shape=(), dtype=string),
    'program': Text(shape=(), dtype=string),
    'speaker': Sequence(Text(shape=(), dtype=string)),
    'summary': Text(shape=(), dtype=string),
    'url': Text(shape=(), dtype=string),
    'utt': Sequence(Text(shape=(), dtype=string)),
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
date	Text		string
id	Text		string
program	Text		string
speaker	Sequence(Text)	(None,)	string
summary	Text		string
url	Text		string
utt	Sequence(Text)	(None,)	string
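
As an illustration of the feature structure, the sketch below renders one record as a readable dialogue. It assumes that `speaker` and `utt` are parallel sequences (the i-th speaker said the i-th utterance), which the feature listing does not state explicitly. The helper `format_dialogue` and the sample record are hypothetical and do not come from the dataset.

```python
def format_dialogue(speakers, utterances):
    """Render parallel speaker/utterance sequences as 'SPEAKER: text' lines."""
    lines = []
    for speaker, utt in zip(speakers, utterances):
        # tfds.as_numpy yields Text features as bytes; decode when needed.
        if isinstance(speaker, bytes):
            speaker = speaker.decode("utf-8")
        if isinstance(utt, bytes):
            utt = utt.decode("utf-8")
        lines.append(f"{speaker}: {utt}")
    return "\n".join(lines)


# Hypothetical record shaped like one media_sum example (not real data).
example = {
    "speaker": [b"HOST", b"GUEST"],
    "utt": [b"Welcome to the program.", b"Thanks for having me."],
}
print(format_dialogue(example["speaker"], example["utt"]))
```

With a prepared dataset, the same function could be applied to each element of `tfds.as_numpy(ds)`, passing `ex["speaker"]` and `ex["utt"]`.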

Supervised keys (See as_supervised doc): ('utt', 'summary')
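
With these supervised keys, loading with `as_supervised=True` yields `(utt, summary)` tuples rather than feature dictionaries. The sketch below shows one way to turn such a tuple into a source/target text pair for summarization. The helpers `load_supervised` and `to_text_pair` are hypothetical, and joining utterances with a single space is an assumption, not something the dataset prescribes.

```python
def load_supervised(split="val"):
    """Return a dataset of (utt, summary) tuples from a prepared media_sum."""
    # Imported lazily: only this function needs tensorflow_datasets.
    import tensorflow_datasets as tfds

    return tfds.load("media_sum", split=split, as_supervised=True)


def to_text_pair(utt, summary, sep=" "):
    """Join utterances into one source string and decode the target summary."""

    def decode(value):
        # Text features arrive as bytes when iterating with tfds.as_numpy.
        return value.decode("utf-8") if isinstance(value, bytes) else value

    source = sep.join(decode(u) for u in utt)
    return source, decode(summary)


# Usage, once the dataset has been prepared from the manual files:
#   import tensorflow_datasets as tfds
#   for utt, summary in tfds.as_numpy(load_supervised().take(1)):
#       source, target = to_text_pair(utt, summary)
```

Note that this pairing drops the `speaker` feature; a model that needs speaker turns would have to load the full feature dictionary instead.
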
Figure (tfds.show_examples): Not supported.

Citation:

@article{zhu2021mediasum,
  title={MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization},
  author={Zhu, Chenguang and Liu, Yang and Mei, Jie and Zeng, Michael},
  journal={arXiv preprint arXiv:2103.06410},
  year={2021}
}
