c4

  • Description:

A colossal, cleaned version of Common Crawl's web crawl corpus.

Based on the Common Crawl dataset: https://commoncrawl.org

To generate this dataset, please follow the instructions in the T5 repository (linked under Homepage below).

Due to the overhead of cleaning the dataset, it is recommended that you prepare it with a distributed service like Cloud Dataflow. More info at https://www.tensorflow.org/datasets/beam_datasets.
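
As a rough illustration (assuming the English config has already been prepared, or is available through TFDS), loading and iterating the dataset might look like the following sketch:

import tensorflow_datasets as tfds

# Minimal sketch: load the prepared default English config.
# Preparing C4 from scratch is expensive, which is why the note above
# recommends running generation on a Beam runner such as Cloud Dataflow.
ds = tfds.load("c4/en", split="train", shuffle_files=True)

for example in ds.take(1):
    # Each example follows the feature spec listed further below:
    # 'text', 'url', 'timestamp', 'content-length', 'content-type'.
    print(example["url"].numpy())
    print(example["text"].numpy()[:200])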

  • Homepage: https://github.com/google-research/text-to-text-transfer-transformer#datasets

  • Source code: tfds.text.C4

  • Versions:

    • 3.0.1 (default) : No release notes.

    • 2.3.1: No release notes.

    • 2.3.0: No release notes.

    • 2.2.1: No release notes.

    • 2.2.0: No release notes.

  • Download size: Unknown size

  • Dataset size: Unknown size

  • Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
    You are using a C4 config that requires some files to be manually downloaded:
    • For c4/webtextlike, download OpenWebText.zip from https://mega.nz/#F!EZZD0YwJ!9_PlEQzdMVLaNdKv_ICNVQ
    • For c4/multilingual and c4/en.noclean, download the Common Crawl WET files.
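
For the configs that need manual files, a minimal sketch of pointing TFDS at the manual directory is shown below (the path is an illustrative placeholder; by default TFDS looks in ~/tensorflow_datasets/downloads/manual/):

import tensorflow_datasets as tfds

# Sketch: prepare c4/webtextlike after manually placing OpenWebText.zip
# into the manual directory. The path below is an illustrative placeholder.
builder = tfds.builder("c4/webtextlike")
download_config = tfds.download.DownloadConfig(
    manual_dir="/path/to/tensorflow_datasets/downloads/manual"
)
builder.download_and_prepare(download_config=download_config)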

  • Auto-cached (documentation): Unknown

  • Splits:

Split Examples
  • Features:
FeaturesDict({
    'content-length': Text(shape=(), dtype=tf.string),
    'content-type': Text(shape=(), dtype=tf.string),
    'text': Text(shape=(), dtype=tf.string),
    'timestamp': Text(shape=(), dtype=tf.string),
    'url': Text(shape=(), dtype=tf.string),
})
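
The feature spec above can also be inspected programmatically without downloading or preparing any data; a small sketch:

import tensorflow_datasets as tfds

# Sketch: read the feature spec of the default config from the builder info.
builder = tfds.builder("c4/en")
print(builder.info.features)          # the FeaturesDict listed above
print(builder.info.features["text"])  # Text(shape=(), dtype=tf.string)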

  • Citation:

@article{2019t5,
  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal = {arXiv e-prints},
  year = {2019},
  archivePrefix = {arXiv},
  eprint = {1910.10683},
}

c4/en (default config)

  • Config description: English C4 dataset.

c4/en.noclean

  • Config description: Disables all cleaning (deduplication, removal based on bad words, etc.).

c4/realnewslike

  • Config description: Filters from the default config to only include content from the domains used in the 'RealNews' dataset (Zellers et al., 2019).

c4/webtextlike

  • Config description: Filters from the default config to only include content from the URLs used in OpenWebText (see the manual download instructions above).

c4/multilingual

  • Config description: Multilingual C4 (mC4) has 101 languages and is generated from 71 Common Crawl dumps.
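
Any of the configs above can be selected by name, optionally pinned to a version; a brief sketch (note that en.noclean, webtextlike, and multilingual additionally require the manual downloads described earlier):

import tensorflow_datasets as tfds

# Sketch: select a non-default config, optionally pinning the version.
realnews = tfds.load("c4/realnewslike", split="train")
pinned = tfds.load("c4/en:3.0.1", split="train")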