reddit_tifu

  • Description:

Reddit dataset, where TIFU denotes the name of the subreddit /r/tifu. As defined in the publication, the "short" style uses the post title as the summary and the "long" style uses the TLDR as the summary.

Features include:

  • documents: post text without the TLDR.
  • tldr: the TLDR line.
  • title: trimmed title without the TLDR.
  • ups: upvotes.
  • score: score.
  • num_comments: number of comments.
  • upvote_ratio: upvote ratio.

  • Additional Documentation: Explore on Papers With Code

  • Homepage: https://github.com/ctr4si/MMN

  • Source code: tfds.datasets.reddit_tifu.Builder

  • Versions:

    • 1.1.0: Remove empty document and summary strings.
    • 1.1.1: Add train, dev and test (80/10/10) splits, which are used in PEGASUS (https://arxiv.org/abs/1912.08777), in a separate config. These were created randomly using the tfds split function and are being released to ensure that results on Reddit TIFU Long are reproducible and comparable. Also adds an id to the datapoints.
    • 1.1.2 (default): Corrected splits uploaded.
  • Feature structure:

FeaturesDict({
    'documents': Text(shape=(), dtype=string),
    'id': Text(shape=(), dtype=string),
    'num_comments': float32,
    'score': float32,
    'title': Text(shape=(), dtype=string),
    'tldr': Text(shape=(), dtype=string),
    'ups': float32,
    'upvote_ratio': float32,
})
  • Feature documentation:
Feature        Class         Shape  Dtype    Description
FeaturesDict
documents      Text                 string
id             Text                 string
num_comments   Tensor               float32
score          Tensor               float32
title          Text                 string
tldr           Text                 string
ups            Tensor               float32
upvote_ratio   Tensor               float32
  • Citation:

@misc{kim2018abstractive,
    title={Abstractive Summarization of Reddit Posts with Multi-level Memory Networks},
    author={Byeongchang Kim and Hyunwoo Kim and Gunhee Kim},
    year={2018},
    eprint={1811.00783},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
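
As a minimal sketch (assuming the tensorflow-datasets package is installed and the source data can be downloaded on your machine), the dataset can be loaded and a single record inspected; the feature keys follow the FeaturesDict above.

import tensorflow_datasets as tfds

# Load the default config (reddit_tifu/short) and read one example.
ds = tfds.load('reddit_tifu/short', split='train')
for example in ds.take(1):
    print(example['documents'].numpy()[:200])  # post body without the TLDR
    print(example['title'].numpy())            # title, the summary in the "short" style
    print(float(example['upvote_ratio']))      # scalar float32 metadata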

reddit_tifu/short (default config)

  • Config description: Using title as summary.

  • Download size: 639.54 MiB

  • Dataset size: 141.46 MiB

  • Auto-cached (documentation): Only when shuffle_files=False (train)

  • Splits:

Split Examples
'train' 79,740

reddit_tifu/long

  • Config description: Using TLDR as summary.

  • Download size: 639.54 MiB

  • Dataset size: 93.10 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'train' 42,139
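
A hedged sketch (assuming tensorflow and tensorflow-datasets are installed) of turning this config into (document, summary) pairs for abstractive summarization, where the TLDR serves as the summary:

import tensorflow as tf
import tensorflow_datasets as tfds

ds = tfds.load('reddit_tifu/long', split='train')
# Pair the post body with its TLDR line.
pairs = ds.map(lambda ex: (ex['documents'], ex['tldr']),
               num_parallel_calls=tf.data.AUTOTUNE)
for document, summary in pairs.take(1):
    print(document.numpy()[:200], summary.numpy())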

reddit_tifu/long_split

  • Config description: Using TLDR as summary and returning train/test/dev splits.

  • Download size: 639.94 MiB

  • Dataset size: 93.10 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'test' 4,214
'train' 33,711
'validation' 4,214
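
A minimal sketch of loading the PEGASUS-style splits from this config, assuming tensorflow-datasets is installed:

import tensorflow_datasets as tfds

# The three named splits match the 80/10/10 partition described in version 1.1.1.
train_ds, val_ds, test_ds = tfds.load(
    'reddit_tifu/long_split', split=['train', 'validation', 'test'])
for example in val_ds.take(1):
    print(example['id'].numpy())    # stable example id added in 1.1.1
    print(example['tldr'].numpy())  # TLDR line used as the summary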