c4

  • Description:

A colossal, cleaned version of Common Crawl's web crawl corpus.

Based on the Common Crawl dataset: https://commoncrawl.org

To generate this dataset, follow the instructions from the t5 (Text-to-Text Transfer Transformer) project.

Due to the overhead of cleaning the dataset, it is recommended that you prepare it with a distributed service such as Cloud Dataflow. More info at https://www.tensorflow.org/datasets/beam_datasets
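
As a rough illustration of the Dataflow route, the sketch below follows the pattern shown in the TFDS Beam guide linked above. The project ID, bucket name, region, and both helper names are placeholders rather than values from this page, and the heavy imports are deferred so the flag-building helper can be inspected without Apache Beam installed.

```python
def dataflow_flags(project, bucket, region="us-central1"):
    """Build Beam pipeline flags for a Dataflow run (all values are placeholders)."""
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location=gs://{bucket}/tmp",
        f"--staging_location=gs://{bucket}/staging",
    ]


def prepare_c4(config="en", project="my-project", bucket="my-bucket"):
    """Prepare one C4 config on Dataflow. Not called here: it starts a large job."""
    # Imported lazily: both packages are heavy and only needed for a real run.
    from apache_beam.options.pipeline_options import PipelineOptions
    import tensorflow_datasets as tfds

    options = PipelineOptions(flags=dataflow_flags(project, bucket))
    dl_config = tfds.download.DownloadConfig(beam_options=options)
    builder = tfds.builder(
        f"c4/{config}", data_dir=f"gs://{bucket}/tensorflow_datasets"
    )
    builder.download_and_prepare(download_config=dl_config)


# Only the flag list is built here; prepare_c4() is left for a real environment.
print(dataflow_flags("my-project", "my-bucket"))
```

A real run would also need TFDS available on the Dataflow workers, plus whatever additional setup the t5 instructions call for.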

  • Features:

FeaturesDict({
    'content-length': Text(shape=(), dtype=tf.string),
    'content-type': Text(shape=(), dtype=tf.string),
    'text': Text(shape=(), dtype=tf.string),
    'timestamp': Text(shape=(), dtype=tf.string),
    'url': Text(shape=(), dtype=tf.string),
})
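
To make the schema concrete, here is a small sketch that checks one decoded example against the five string features listed above. It assumes examples arrive as dictionaries of `bytes` values, as they would when iterating with `tfds.as_numpy`. The sample record and the `check_example` helper are invented for illustration and do not come from the dataset.

```python
C4_FEATURES = ("content-length", "content-type", "text", "timestamp", "url")


def check_example(example):
    """Return the example decoded to str, or raise if it does not match the schema."""
    missing = [k for k in C4_FEATURES if k not in example]
    extra = [k for k in example if k not in C4_FEATURES]
    if missing or extra:
        raise ValueError(f"missing={missing} extra={extra}")
    # Every feature is a scalar tf.string, so each value is expected to be bytes.
    return {k: example[k].decode("utf-8") for k in C4_FEATURES}


# Hypothetical record, made up for this sketch.
sample = {
    "content-length": b"29",
    "content-type": b"text/plain",
    "text": b"A short placeholder document.",
    "timestamp": b"2019-04-25T12:57:54Z",
    "url": b"https://example.com/page",
}
decoded = check_example(sample)
# 'content-length' is stored as text, so it has to be parsed before numeric use.
print(decoded["url"], int(decoded["content-length"]))
```

Note that even the numeric-looking fields (`content-length`, `timestamp`) are plain strings in this schema, so any parsing is left to the consumer.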

  • Citation:

@article{2019t5,
  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal = {arXiv e-prints},
  year = {2019},
  archivePrefix = {arXiv},
  eprint = {1910.10683},
}

c4/en (default config)

  • Config description: English C4 dataset.

  • Download size: 12.28 MiB

  • Dataset size: 806.92 GiB

  • Auto-cached (documentation): No

  • Splits:

Split         Examples
'train'       364,868,901
'validation'  364,608
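
Because the prepared default config is roughly 807 GiB and is not auto-cached, it is often convenient to read only a slice of a split. The sketch below assumes the dataset has already been prepared under a `data_dir` of your choosing (a placeholder path here) and uses the TFDS split-slicing syntax. `split_slice` and `load_c4_slice` are illustrative helpers, not part of TFDS.

```python
# Split names for c4/en, taken from the table above.
C4_EN_SPLITS = ("train", "validation")


def split_slice(split, percent):
    """Return a TFDS split string such as 'validation[:1%]'."""
    if split not in C4_EN_SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    if not isinstance(percent, int) or not 0 < percent <= 100:
        raise ValueError("percent must be an integer in 1..100")
    return f"{split}[:{percent}%]"


def load_c4_slice(split="validation", percent=1,
                  data_dir="/path/to/tensorflow_datasets"):
    """Load part of an already prepared c4/en split. Not called in this sketch."""
    import tensorflow_datasets as tfds  # deferred: only needed for a real load

    return tfds.load(
        "c4/en",
        split=split_slice(split, percent),
        data_dir=data_dir,
        download=False,  # fail fast instead of trying to regenerate the corpus
    )


print(split_slice("validation", 1))
```

Passing `download=False` is a deliberate choice in this sketch: if the prepared files are missing, an immediate error is preferable to accidentally starting a local generation of the full corpus.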

c4/en.noclean

  • Config description: Disables all cleaning (deduplication, removal based on bad words, etc.)

  • Download size: 12.25 MiB

  • Dataset size: 6.21 TiB

  • Auto-cached (documentation): No

  • Splits:

Split         Examples
'train'       1,063,805,381
'validation'  1,065,029
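
Comparing these figures with the default `c4/en` config gives a rough sense of how much the cleaning removes. The short calculation below uses only the numbers on this page, treats 1 TiB as 1024 GiB, and assumes both configs were built from the same crawl.

```python
# Figures copied from the c4/en and c4/en.noclean entries on this page.
en = {"train": 364_868_901, "validation": 364_608, "size_gib": 806.92}
noclean = {"train": 1_063_805_381, "validation": 1_065_029, "size_gib": 6.21 * 1024}

# Fraction of training examples and of on-disk size that survives cleaning.
kept_examples = en["train"] / noclean["train"]
kept_bytes = en["size_gib"] / noclean["size_gib"]

print(f"train examples kept after cleaning: {kept_examples:.1%}")
print(f"dataset size kept after cleaning:   {kept_bytes:.1%}")
```

On these numbers, roughly a third of the training examples but only about an eighth of the bytes remain, which suggests the cleaning also shortens or drops the largest documents rather than only removing whole pages at random.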

c4/realnewslike

  • Config description: Filters from the default config to only include content from the domains used in the 'RealNews' dataset (Zellers et al., 2019).

  • Download size: 12.41 MiB

  • Dataset size: 36.89 GiB

  • Auto-cached (documentation): No

  • Splits:

Split         Examples
'train'       13,799,838
'validation'  13,863
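
The domain-based filtering described for this config can be pictured with a small sketch. This is not the t5 implementation: the allow-list below is a made-up placeholder rather than the RealNews domain list, and matching on the URL's host name with the standard library is an assumption chosen for brevity.

```python
from urllib.parse import urlsplit

# Placeholder allow-list; the real config uses the RealNews domains.
ALLOWED_DOMAINS = {"news.example.com", "example.org"}


def in_allowed_domain(url, allowed=ALLOWED_DOMAINS):
    """True if the URL's host equals an allowed domain or is a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


def filter_by_domain(examples, allowed=ALLOWED_DOMAINS):
    """Yield only the examples whose 'url' feature passes the domain check."""
    for example in examples:
        if in_allowed_domain(example["url"], allowed):
            yield example


# Hypothetical records carrying only the fields this sketch needs.
# URLs are str here; bytes read from TFDS would need decoding first.
records = [
    {"url": "https://news.example.com/story", "text": "kept"},
    {"url": "https://www.example.org/a", "text": "kept"},
    {"url": "https://blog.example.net/post", "text": "dropped"},
]
print([r["text"] for r in filter_by_domain(records)])
```

Matching on a leading-dot suffix avoids accepting look-alike hosts such as `notexample.org` while still admitting subdomains of an allowed domain.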

c4/webtextlike

  • Splits:

Split         Examples
'train'       4,500,790
'validation'  4,493

c4/multilingual

  • Config description: Multilingual C4 (mC4) has 102 languages and is generated from 71 Common Crawl dumps.

  • Download size: Unknown size

  • Dataset size: Unknown size

  • Auto-cached (documentation): Unknown

  • Splits:

Split Examples