wikihow (Manual download)

WikiHow is a large-scale text summarization dataset built from the online WikiHow (http://www.wikihow.com/) knowledge base.

There are two features:

  • text: the WikiHow answer text.

  • headline: the bold lines, used as the summary.

There are two separate versions:

  • all: the concatenation of all paragraphs serves as the article and the bold lines as the reference summary.

  • sep: each paragraph is paired with its own summary.
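The relationship between the two versions can be sketched in a few lines of Python. This is only an illustration: the `sep_to_all` helper, the sample records, and the newline separators are hypothetical, and the actual wikihowAll.csv file is provided by the dataset authors rather than derived this way.

```python
from collections import OrderedDict


def sep_to_all(sep_records):
    """Group paragraph-level records by title into article-level records.

    Hypothetical helper: the newline separators are an assumption, not
    the dataset authors' exact procedure.
    """
    grouped = OrderedDict()
    for rec in sep_records:
        entry = grouped.setdefault(rec["title"], {"text": [], "headline": []})
        entry["text"].append(rec["text"])
        entry["headline"].append(rec["headline"])
    return [
        {
            "title": title,
            "text": "\n".join(parts["text"]),
            "headline": "\n".join(parts["headline"]),
        }
        for title, parts in grouped.items()
    ]


# Made-up records, for illustration only.
sep_records = [
    {"title": "How to Bake Bread",
     "headline": "Mix the dough.",
     "text": "Combine flour, water, yeast, and salt in a large bowl."},
    {"title": "How to Bake Bread",
     "headline": "Bake the loaf.",
     "text": "Place the dough in a hot oven until the crust is golden."},
]
all_records = sep_to_all(sep_records)
```

Here two sep-style records with the same title collapse into one all-style record whose text and headline each contain both parts.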

Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in the manual folder (see https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig). Train/validation/test splits are provided by the authors. Preprocessing removes articles that are short relative to their summaries (an example is kept only if abstract length < 0.75 × article length) and cleans up extra commas.
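The preprocessing step can be sketched as follows. This is a minimal sketch under stated assumptions, not necessarily the exact code used to build the dataset: it reads the 0.75 threshold as the condition for keeping an example, measures length in characters, and treats "extra commas" as commas left directly after a period. The function name `filter_and_clean` and the sample strings are illustrative.

```python
import re


def filter_and_clean(abstract, article, ratio=0.75):
    """Return cleaned (abstract, article), or None if the example is dropped."""
    # Assumption: keep only examples whose summary is shorter than
    # `ratio` times the article, with lengths measured in characters.
    if len(abstract) >= ratio * len(article):
        return None
    # Assumption: an "extra comma" is a comma left right after a period.
    abstract = abstract.replace(".,", ".")
    article = re.sub(r"\.+\n+,", ".\n", article)
    return abstract, article


# Made-up strings, for illustration only.
kept = filter_and_clean(
    "Mix the batter.,Bake the cake.",
    "Whisk flour, eggs, and milk until smooth.\n,"
    "Place the tin in the oven for forty minutes.",
)
dropped = filter_and_clean("A long summary of a short text.", "Short text.")
```

In this sketch `kept` holds the cleaned pair, while `dropped` is None because the summary is longer than the article.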

wikihow is configured with tfds.summarization.wikihow.WikihowConfig and has the following configurations predefined (defaults to the first one):

  • all (v1.2.0) (Size: 5.21 MiB): Use the concatenation of all paragraphs as the articles and the bold lines as the reference summaries.

  • sep (v1.2.0) (Size: 5.21 MiB): Use each paragraph and its summary.

wikihow/all

Use the concatenation of all paragraphs as the articles and the bold lines as the reference summaries.

Versions:

  • 1.2.0 (default):

WARNING: This dataset requires you to download the source data manually into manual_dir (defaults to ~/tensorflow_datasets/manual/wikihow/). Links to the files can be found at https://github.com/mahnazkoupaee/WikiHow-Dataset. Please download both wikihowAll.csv and wikihowSep.csv.
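Before loading, it can help to confirm that both files are in place. The sketch below assumes the default manual_dir given above. The helpers `missing_manual_files` and `load_wikihow` are hypothetical names, and `tensorflow_datasets` is imported lazily so that the file check runs even without TensorFlow installed.

```python
from pathlib import Path

DEFAULT_MANUAL_DIR = "~/tensorflow_datasets/manual/wikihow/"
REQUIRED_FILES = ("wikihowAll.csv", "wikihowSep.csv")


def missing_manual_files(manual_dir=DEFAULT_MANUAL_DIR):
    """Return the required CSV files that are not present in manual_dir."""
    manual_dir = Path(manual_dir).expanduser()
    return [name for name in REQUIRED_FILES
            if not (manual_dir / name).is_file()]


def load_wikihow(config="all", split="train"):
    """Load a wikihow config from the default manual_dir (hypothetical helper)."""
    missing = missing_manual_files()
    if missing:
        raise FileNotFoundError(
            f"Missing in {DEFAULT_MANUAL_DIR}: {', '.join(missing)}")
    # Imported here so the check above does not require TensorFlow.
    import tensorflow_datasets as tfds
    return tfds.load(f"wikihow/{config}", split=split)
```

With both files downloaded, a call such as `load_wikihow("sep", split="validation")` would prepare and return the requested split.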

Statistics

Split Examples
ALL 168,428
TRAIN 157,252
VALIDATION 5,599
TEST 5,577
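As a quick arithmetic check on the table above, the three splits should add up to the ALL row. The snippet below is a small sketch that verifies this and computes each split's share. The counts are copied from the table, while the `split_fractions` helper is a hypothetical name.

```python
def split_fractions(counts):
    """Return each split's share of the total number of examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Counts from the wikihow/all statistics table.
ALL_CONFIG_SPLITS = {"train": 157_252, "validation": 5_599, "test": 5_577}

total = sum(ALL_CONFIG_SPLITS.values())
fractions = split_fractions(ALL_CONFIG_SPLITS)

print(f"total: {total:,}")
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```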

Features

FeaturesDict({
    'headline': Text(shape=(), dtype=tf.string),
    'text': Text(shape=(), dtype=tf.string),
    'title': Text(shape=(), dtype=tf.string),
})

Homepage: https://github.com/mahnazkoupaee/WikiHow-Dataset

Supervised keys (for as_supervised=True)

('text', 'headline')
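The supervised keys determine how a feature dictionary is reduced to an (input, target) pair: the text is the input and the headline is the target. The plain-Python sketch below mimics that mapping on a made-up example. The `to_supervised` helper is hypothetical and is not part of TFDS.

```python
SUPERVISED_KEYS = ("text", "headline")


def to_supervised(example, keys=SUPERVISED_KEYS):
    """Map a feature dictionary to an (input, target) pair."""
    input_key, target_key = keys
    return example[input_key], example[target_key]


# Made-up example using the wikihow/all feature names.
example = {
    "title": "How to Bake Bread",
    "text": "Combine flour, water, yeast, and salt in a large bowl.",
    "headline": "Mix the dough.",
}
article, summary = to_supervised(example)
```

With TFDS itself, `tfds.load("wikihow/all", split="train", as_supervised=True)` yields (text, headline) pairs of string tensors instead of dictionaries.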

wikihow/sep

Use each paragraph and its summary.

Versions:

  • 1.2.0 (default):

WARNING: This dataset requires you to download the source data manually into manual_dir (defaults to ~/tensorflow_datasets/manual/wikihow/). Links to the files can be found at https://github.com/mahnazkoupaee/WikiHow-Dataset. Please download both wikihowAll.csv and wikihowSep.csv.

Statistics

Split Examples
ALL 1,136,464
TRAIN 1,060,732
VALIDATION 37,932
TEST 37,800

Features

FeaturesDict({
    'headline': Text(shape=(), dtype=tf.string),
    'overview': Text(shape=(), dtype=tf.string),
    'sectionLabel': Text(shape=(), dtype=tf.string),
    'text': Text(shape=(), dtype=tf.string),
    'title': Text(shape=(), dtype=tf.string),
})

Homepage: https://github.com/mahnazkoupaee/WikiHow-Dataset

Supervised keys (for as_supervised=True)

('text', 'headline')

Citation

@misc{koupaee2018wikihow,
    title={WikiHow: A Large Scale Text Summarization Dataset},
    author={Mahnaz Koupaee and William Yang Wang},
    year={2018},
    eprint={1810.09305},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}