wikihow

  • Description:

WikiHow is a large-scale text summarization dataset built from the online WikiHow (http://www.wikihow.com/) knowledge base.

The two main features are:

  • text: the text of the WikiHow answers.
  • headline: the bold lines, used as the summary.

There are two separate versions:

  • all: the concatenation of all paragraphs as the article, with the bold lines as the reference summary.
  • sep: each paragraph paired with its own summary.

Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in the manual folder (see https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig). Train/validation/test splits are provided by the authors. Preprocessing removes short articles (abstract length < 0.75 article length) and cleans up extra commas.
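The preprocessing step can be illustrated with a minimal sketch. It is not the dataset's actual implementation: the function name `filter_and_clean` is hypothetical, the threshold is read as "keep a pair only when the abstract is shorter than 0.75 times the article, measured in characters", and "extra commas" is assumed to mean a comma directly following a period.

```python
from typing import Optional, Tuple


def filter_and_clean(
    abstract: str, article: str, max_ratio: float = 0.75
) -> Optional[Tuple[str, str]]:
    """Return a cleaned (abstract, article) pair, or None if the pair is dropped.

    Hypothetical helper; the threshold reading and the comma rule are assumptions.
    """
    # Drop pairs with a missing abstract or a missing article.
    if not abstract or not article:
        return None
    # Assumed reading of the threshold: keep the pair only when the abstract
    # is shorter than max_ratio times the article length (in characters).
    if len(abstract) >= max_ratio * len(article):
        return None
    # Assumed meaning of "extra commas": a comma right after a period.
    abstract = abstract.replace(".,", ".")
    article = article.replace(".,", ".")
    return abstract, article
```

Returning `None` for dropped pairs keeps the filtering decision separate from the cleanup, so a caller can count how many examples the threshold removes.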

@misc{koupaee2018wikihow,
    title={WikiHow: A Large Scale Text Summarization Dataset},
    author={Mahnaz Koupaee and William Yang Wang},
    year={2018},
    eprint={1810.09305},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

wikihow/all (default config)

  • Config description: Use the concatenation of all paragraphs as the articles and the bold lines as the reference summaries.

  • Splits:

Split         Examples
'test'           5,577
'train'        157,252
'validation'     5,599
  • Features:
FeaturesDict({
    'headline': Text(shape=(), dtype=tf.string),
    'text': Text(shape=(), dtype=tf.string),
    'title': Text(shape=(), dtype=tf.string),
})
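As a rough sketch of how this config might be loaded once the CSV files are in place: the helper names `missing_manual_files` and `load_wikihow_all` are hypothetical, the manual directory path is whatever location you chose, and checking for both CSV files is an assumption (a single config may need only one of them). The `tensorflow_datasets` import is deferred so the file check can run without the library installed.

```python
from pathlib import Path

# File names taken from the download instructions for this dataset.
REQUIRED_FILES = ("wikihowAll.csv", "wikihowSep.csv")


def missing_manual_files(manual_dir, required=REQUIRED_FILES):
    """Return the names in `required` that are not files inside manual_dir."""
    root = Path(manual_dir)
    return [name for name in required if not (root / name).is_file()]


def load_wikihow_all(manual_dir, split="train"):
    """Load the wikihow/all config from a manually populated directory."""
    # Imported lazily; requires tensorflow_datasets to be installed.
    import tensorflow_datasets as tfds

    missing = missing_manual_files(manual_dir)
    if missing:
        raise FileNotFoundError(f"Missing in {manual_dir}: {missing}")

    # Point the builder at the folder that holds the downloaded CSV files.
    config = tfds.download.DownloadConfig(manual_dir=str(manual_dir))
    return tfds.load(
        "wikihow/all",
        split=split,
        download_and_prepare_kwargs={"download_config": config},
    )
```

Each element of the returned dataset should be a dictionary with the 'headline', 'text' and 'title' keys listed in the FeaturesDict.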

wikihow/sep

  • Config description: Use each paragraph and its summary.

  • Splits:

Split          Examples
'test'           37,800
'train'       1,060,732
'validation'     37,932
  • Features:
FeaturesDict({
    'headline': Text(shape=(), dtype=tf.string),
    'overview': Text(shape=(), dtype=tf.string),
    'sectionLabel': Text(shape=(), dtype=tf.string),
    'text': Text(shape=(), dtype=tf.string),
    'title': Text(shape=(), dtype=tf.string),
})
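To make the relationship between paragraph-level and article-level records concrete, the sketch below groups records shaped like the 'sep' features by 'title' and joins their 'headline' and 'text' fields. It only approximates what the 'all' config contains: the function name `merge_by_title` is hypothetical, the newline separator is an assumption, and the records in the usage example are invented for illustration.

```python
from collections import defaultdict


def merge_by_title(sep_examples, separator="\n"):
    """Group paragraph-level records by title and join their fields.

    Hypothetical helper: the separator and grouping key are assumptions.
    """
    grouped = defaultdict(lambda: {"headline": [], "text": []})
    for example in sep_examples:
        parts = grouped[example["title"]]
        parts["headline"].append(example["headline"])
        parts["text"].append(example["text"])
    # Titles come out in the order they were first seen in the input.
    return [
        {
            "title": title,
            "headline": separator.join(parts["headline"]),
            "text": separator.join(parts["text"]),
        }
        for title, parts in grouped.items()
    ]


# Invented toy records using the 'sep' feature names.
toy_records = [
    {"title": "How to Bake Bread", "sectionLabel": "Steps", "overview": "",
     "headline": "Mix the dough.", "text": "Combine flour and water."},
    {"title": "How to Bake Bread", "sectionLabel": "Steps", "overview": "",
     "headline": "Bake it.", "text": "Place in a hot oven."},
    {"title": "How to Brew Tea", "sectionLabel": "Steps", "overview": "",
     "headline": "Boil water.", "text": "Use a kettle."},
]
merged = merge_by_title(toy_records)
```

With real data the strings would first need decoding from `tf.string` values; the toy records use plain Python strings to keep the sketch self-contained.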