c4

  • Description:

A colossal, cleaned version of Common Crawl's web crawl corpus.

Based on Common Crawl dataset: https://commoncrawl.org

To generate this dataset, please follow the instructions from t5.

Due to the overhead of cleaning the dataset, it is recommended that you prepare it with a distributed service like Cloud Dataflow. More info at https://www.tensorflow.org/datasets/beam_datasets
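The preparation step described above could be sketched roughly as follows. This is a minimal sketch, not the official t5 procedure: the project ID, bucket name, region, and job name are placeholders, the helper names `dataflow_flags` and `prepare_c4` are hypothetical, and it assumes `tensorflow_datasets` and `apache_beam[gcp]` are installed and that any prerequisites from the t5 instructions are already in place.

```python
def dataflow_flags(project, bucket, region="us-central1", job_name="c4-prepare"):
    """Build Beam pipeline flags for a Cloud Dataflow run.

    Every value here is a placeholder chosen for illustration.
    """
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--job_name={job_name}",
        f"--staging_location=gs://{bucket}/binaries",
        f"--temp_location=gs://{bucket}/temp",
    ]


def prepare_c4(config="en", data_dir="gs://my-bucket/tensorflow_datasets", flags=None):
    """Generate one c4 config with Apache Beam (hypothetical helper)."""
    # Imports are deferred so the flag helper above works without these packages.
    import apache_beam as beam
    import tensorflow_datasets as tfds

    beam_options = beam.options.pipeline_options.PipelineOptions(flags=flags or [])
    download_config = tfds.download.DownloadConfig(beam_options=beam_options)
    builder = tfds.builder(f"c4/{config}", data_dir=data_dir)
    builder.download_and_prepare(download_config=download_config)
    return builder


# Assumed project and bucket names; replace with your own before running.
flags = dataflow_flags(project="my-gcp-project", bucket="my-bucket")
# prepare_c4(config="en", flags=flags)  # launches the (long-running) job
```

A real Dataflow job usually needs further flags, for example to install dependencies on the workers; the Beam guide linked above covers those.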

  • Features:

FeaturesDict({
    'content-length': Text(shape=(), dtype=tf.string),
    'content-type': Text(shape=(), dtype=tf.string),
    'text': Text(shape=(), dtype=tf.string),
    'timestamp': Text(shape=(), dtype=tf.string),
    'url': Text(shape=(), dtype=tf.string),
})
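Because every feature above is a `tf.string`, examples read as NumPy values arrive as byte strings. The sketch below shows one way to turn such a record into ordinary Python values. The `decode_example` helper is hypothetical, the sample record is made up for illustration, and treating `content-length` as an integer is an assumption about its contents.

```python
def decode_example(example):
    """Decode byte-string features into Python strings (hypothetical helper)."""
    decoded = {
        key: value.decode("utf-8") if isinstance(value, bytes) else value
        for key, value in example.items()
    }
    # 'content-length' is stored as text; converting it to int is an assumption.
    decoded["content-length"] = int(decoded["content-length"])
    return decoded


# Made-up record with the same keys as the FeaturesDict above.
sample = {
    "content-length": b"13",
    "content-type": b"text/plain",
    "text": b"Hello, world!",
    "timestamp": b"2019-04-25T12:57:54Z",
    "url": b"https://example.com/",
}

record = decode_example(sample)
print(record["url"], record["content-length"])
```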
  • Citation:

@article{2019t5,
  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal = {arXiv e-prints},
  year = {2019},
  archivePrefix = {arXiv},
  eprint = {1910.10683},
}

c4/en (default config)

  • Config description: English C4 dataset.

  • Download size: 12.19 MiB

  • Dataset size: 803.19 GiB

  • Splits:

Split Examples
'train' 362,791,396
'validation' 362,626

c4/en.noclean

  • Config description: Disables all cleaning (deduplication, removal based on bad words, etc.)

  • Download size: 12.18 MiB

  • Dataset size: 6.21 TiB

  • Splits:

Split Examples
'train' 1,063,805,308
'validation' 1,065,026

c4/en.realnewslike

  • Config description: Filters from the default config to only include content from the domains used in the 'RealNews' dataset (Zellers et al., 2019).

  • Download size: 12.32 MiB

  • Dataset size: 36.51 GiB

  • Splits:

Split Examples
'train' 13,621,531
'validation' 13,687

c4/en.webtextlike

  • Config description: Filters from the default config to only include content from the URLs in OpenWebText (https://github.com/jcpeterson/openwebtext).

  • Download size: 12.19 MiB

  • Dataset size: 17.81 GiB

  • Splits:

Split Examples
'train' 4,437,916
'validation' 4,409