wikihow

  • Description:

WikiHow is a large-scale text summarization dataset built from the online WikiHow (http://www.wikihow.com/) knowledge base.

There are two features:

  - text: the text of the WikiHow answers.
  - headline: the bold lines, used as the summary.

There are two separate versions:

  - all: the concatenation of all paragraphs as the articles and the bold lines as the reference summaries.
  - sep: each paragraph and its summary.

Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in the manual download folder (see https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig).

Train/validation/test splits are provided by the authors. Preprocessing removes articles that are short relative to their summaries (an example is kept only if abstract length < 0.75 × article length) and cleans up extra commas.
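The preprocessing step can be sketched roughly as follows. This is a minimal illustration, not the dataset's actual code: it assumes the 0.75 threshold is a keep condition (an example survives when its abstract is shorter than 0.75 of the article), and the function name and the exact comma-cleanup patterns are hypothetical.

```python
import re

# Threshold taken from the description; treating it as a keep condition is an assumption.
MAX_SUMMARY_RATIO = 0.75


def filter_and_clean(abstract, article):
    """Return a cleaned (abstract, article) pair, or None if the example is dropped.

    Hypothetical helper for illustration only.
    """
    abstract, article = abstract.strip(), article.strip()
    if not abstract or not article:
        return None
    # Drop examples whose summary is not clearly shorter than the article.
    if len(abstract) >= MAX_SUMMARY_RATIO * len(article):
        return None
    # Assumed cleanup: remove a stray comma that directly follows a period.
    abstract = abstract.replace(".,", ".")
    article = re.sub(r"\.+\n+,", ".\n", article)
    return abstract, article
```

Returning None for dropped examples lets a caller skip them with a simple `if` while iterating over the CSV rows.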

@misc{koupaee2018wikihow,
    title={WikiHow: A Large Scale Text Summarization Dataset},
    author={Mahnaz Koupaee and William Yang Wang},
    year={2018},
    eprint={1810.09305},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

wikihow/all (default config)

  • Config description: Use the concatenation of all paragraphs as the articles and the bold lines as the reference summaries.

  • Dataset size: 531.56 MiB

  • Splits:

Split         Examples
'test'        5,577
'train'       157,252
'validation'  5,599
  • Feature structure:
FeaturesDict({
    'headline': Text(shape=(), dtype=string),
    'text': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})
  • Feature documentation:
Feature   Class         Shape  Dtype   Description
          FeaturesDict
headline  Text                 string
text      Text                 string
title     Text                 string

wikihow/sep

  • Config description: Use each paragraph and its summary.

  • Dataset size: 1.07 GiB

  • Splits:

Split         Examples
'test'        37,800
'train'       1,060,732
'validation'  37,932
  • Feature structure:
FeaturesDict({
    'headline': Text(shape=(), dtype=string),
    'overview': Text(shape=(), dtype=string),
    'sectionLabel': Text(shape=(), dtype=string),
    'text': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})
  • Feature documentation:
Feature       Class         Shape  Dtype   Description
              FeaturesDict
headline      Text                 string
overview      Text                 string
sectionLabel  Text                 string
text          Text                 string
title         Text                 string
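
Because the CSV files must be downloaded manually, loading either config means pointing the builder at the folder that holds them. The sketch below shows one way this might look. The helper names and the mapping from config name to CSV file are assumptions made for illustration; `tfds.builder`, `download_and_prepare`, and `tfds.download.DownloadConfig(manual_dir=...)` are the library calls it relies on.

```python
from pathlib import Path

# Assumed mapping from config name to the manually downloaded CSV file.
CONFIG_FILES = {"all": "wikihowAll.csv", "sep": "wikihowSep.csv"}


def missing_manual_file(config, manual_dir):
    """Return the CSV file name the config needs if it is absent, else None."""
    name = CONFIG_FILES[config]
    return None if (Path(manual_dir) / name).is_file() else name


def load_wikihow(config, manual_dir, split="train"):
    """Prepare and load 'wikihow/all' or 'wikihow/sep' (hypothetical helper)."""
    missing = missing_manual_file(config, manual_dir)
    if missing:
        raise FileNotFoundError(f"{missing} not found in {manual_dir}")
    # Imported lazily so the file check works without tensorflow-datasets installed.
    import tensorflow_datasets as tfds

    builder = tfds.builder(f"wikihow/{config}")
    builder.download_and_prepare(
        download_config=tfds.download.DownloadConfig(manual_dir=str(manual_dir))
    )
    return builder.as_dataset(split=split)
```

With the files in place, `load_wikihow("sep", manual_dir, split="validation")` would yield dictionaries keyed by the feature names listed above, such as `headline` and `text`.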