reddit | TensorFlow Datasets


Description:

This corpus contains preprocessed posts from the Reddit dataset. The dataset consists of 3,848,330 posts with an average length of 270 words for content, and 28 words for the summary.
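As a rough illustration of how the average lengths quoted above could be checked, the sketch below averages word counts over an iterable of examples. The helper name `average_word_counts` is hypothetical, and whitespace splitting is an assumption: the source does not say how words were counted, so results on the real corpus may differ from 270 and 28.

```python
from typing import Dict, Iterable, Tuple


def average_word_counts(examples: Iterable[Dict[str, str]]) -> Tuple[float, float]:
    """Return (mean content words, mean summary words).

    Assumption: a "word" is a whitespace-separated token.
    """
    n = 0
    content_total = 0
    summary_total = 0
    for ex in examples:
        content_total += len(ex["content"].split())
        summary_total += len(ex["summary"].split())
        n += 1
    if n == 0:
        raise ValueError("no examples given")
    return content_total / n, summary_total / n


# Toy records standing in for real posts (not taken from the corpus).
toy = [
    {"content": "one two three four", "summary": "one two"},
    {"content": "five six", "summary": "seven eight nine ten"},
]
print(average_word_counts(toy))  # (3.0, 3.0)
```

Because the function takes any iterable, it can be fed decoded examples in a streaming fashion rather than loading all 3,848,330 posts into memory.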

All features are strings: author, body, normalizedBody, content, summary, subreddit, subreddit_id, id. The content field serves as the document and the summary field as its summary.
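A minimal sketch of reading (document, summary) pairs, assuming the `tensorflow_datasets` package is installed and the dataset has been downloaded and prepared (2.93 GiB download, 18.09 GiB on disk). The helper names `decode_pair` and `iter_document_summary_pairs` are hypothetical. The import is kept inside the function so the decoding helper can be used without TFDS.

```python
from typing import Iterator, Tuple


def decode_pair(pair: Tuple[bytes, bytes]) -> Tuple[str, str]:
    """Decode one (content, summary) pair of UTF-8 byte strings."""
    content, summary = pair
    return content.decode("utf-8"), summary.decode("utf-8")


def iter_document_summary_pairs(limit: int = 3) -> Iterator[Tuple[str, str]]:
    """Yield up to `limit` decoded (content, summary) pairs from 'train'."""
    # Imported lazily: this call triggers the download/preparation step
    # if the dataset is not already on disk.
    import tensorflow_datasets as tfds

    # as_supervised=True yields tuples ordered by the supervised keys,
    # here ('content', 'summary').
    ds = tfds.load("reddit", split="train", as_supervised=True)
    for pair in tfds.as_numpy(ds.take(limit)):
        yield decode_pair(pair)


# Runs without TFDS: decoding a hand-written pair.
print(decode_pair((b"a long post", b"short")))  # ('a long post', 'short')
```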

Additional Documentation: Explore on Papers With Code
Homepage: https://github.com/webis-de/webis-tldr-17-corpus
Source code: tfds.datasets.reddit.Builder
Versions:
- 1.0.0 (default): No release notes.
Download size: 2.93 GiB
Dataset size: 18.09 GiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'train'`	3,848,330
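
Since only a 'train' split is provided, a held-out set has to be carved out by the user. The sketch below builds TFDS split-slicing strings for that purpose. The helper names and the 10% default are assumptions for illustration, not something the dataset prescribes.

```python
from typing import Tuple


def holdout_splits(holdout_percent: int = 10) -> Tuple[str, str]:
    """Return (train_spec, holdout_spec) slicing strings over 'train'."""
    if not 0 < holdout_percent < 100:
        raise ValueError("holdout_percent must be between 1 and 99")
    cut = 100 - holdout_percent
    return f"train[:{cut}%]", f"train[{cut}%:]"


def load_train_and_holdout(holdout_percent: int = 10):
    """Load the two slices; requires tensorflow_datasets and the data on disk."""
    import tensorflow_datasets as tfds  # imported lazily on purpose

    train_spec, holdout_spec = holdout_splits(holdout_percent)
    # Passing a list of split specs returns a list of datasets.
    return tfds.load("reddit", split=[train_spec, holdout_spec])


print(holdout_splits(10))  # ('train[:90%]', 'train[90%:]')
```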

Feature structure:

FeaturesDict({
    'author': string,
    'body': string,
    'content': string,
    'id': string,
    'normalizedBody': string,
    'subreddit': string,
    'subreddit_id': string,
    'summary': string,
})
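
When examples are loaded as dictionaries, a quick sanity check against the structure above can catch a mismatched or truncated record early. This is a small sketch: the names `EXPECTED_FEATURES` and `missing_features` are hypothetical, and the key set is copied from the FeaturesDict listing.

```python
from typing import List, Mapping

# Feature names copied from the FeaturesDict listing above.
EXPECTED_FEATURES = frozenset({
    "author", "body", "content", "id",
    "normalizedBody", "subreddit", "subreddit_id", "summary",
})


def missing_features(example: Mapping[str, object]) -> List[str]:
    """Return the expected feature names absent from `example`, sorted."""
    return sorted(EXPECTED_FEATURES - set(example))


partial = {"author": "a", "content": "c", "summary": "s"}
print(missing_features(partial))
# ['body', 'id', 'normalizedBody', 'subreddit', 'subreddit_id']
```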

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
author	Tensor	string
body	Tensor	string
content	Tensor	string
id	Tensor	string
normalizedBody	Tensor	string
subreddit	Tensor	string
subreddit_id	Tensor	string
summary	Tensor	string

Supervised keys (See as_supervised doc): ('content', 'summary')
Figure (tfds.show_examples): Not supported.

Citation:

@inproceedings{volske-etal-2017-tl,
    title = "{TL};{DR}: Mining {R}eddit to Learn Automatic Summarization",
    author = {V{\"o}lske, Michael  and
      Potthast, Martin  and
      Syed, Shahbaz  and
      Stein, Benno},
    booktitle = "Proceedings of the Workshop on New Frontiers in Summarization",
    month = sep,
    year = "2017",
    address = "Copenhagen, Denmark",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W17-4508",
    doi = "10.18653/v1/W17-4508",
    pages = "59--63",
    abstract = "Recent advances in automatic text summarization have used deep neural networks to generate high-quality abstractive summaries, but the performance of these models strongly depends on large amounts of suitable training data. We propose a new method for mining social media for author-provided summaries, taking advantage of the common practice of appending a {``}TL;DR{''} to long posts. A case study using a large Reddit crawl yields the Webis-TLDR-17 dataset, complementing existing corpora primarily from the news genre. Our technique is likely applicable to other social media sites and general web crawls.",
}