
c4

A colossal, cleaned version of Common Crawl's web crawl corpus.

Based on the Common Crawl dataset: https://commoncrawl.org

Due to the overhead of cleaning the dataset, it is recommended that you prepare it with a distributed service like Cloud Dataflow. More info at https://www.tensorflow.org/datasets/beam_datasets.
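As a rough sketch of what preparing the dataset on Cloud Dataflow might look like, the snippet below builds a list of Apache Beam pipeline flags and passes them to the builder through tfds.download.DownloadConfig. The helper names (dataflow_flags, prepare_c4), the project, bucket, region, and job name are all placeholders assumed for illustration; a real job will likely need further options (for example worker dependencies) described in the Beam datasets guide linked above.

```python
def dataflow_flags(project, bucket, job_name="c4-gen", region="us-central1"):
    """Build Beam flags for a Dataflow run. All values here are placeholders."""
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--job_name={job_name}",
        f"--staging_location=gs://{bucket}/binaries",
        f"--temp_location=gs://{bucket}/temp",
    ]


def prepare_c4(config="en", data_dir=None, flags=None):
    """Generate one c4 config with Beam (hypothetical helper, not part of TFDS)."""
    # Imports are deferred so the flag helper above can be used without them.
    import tensorflow_datasets as tfds
    from apache_beam.options.pipeline_options import PipelineOptions

    download_config = tfds.download.DownloadConfig(
        beam_options=PipelineOptions(flags=flags or []),
    )
    builder = tfds.builder(f"c4/{config}", data_dir=data_dir)
    builder.download_and_prepare(download_config=download_config)
    return builder
```

A call such as prepare_c4("en", data_dir="gs://my-bucket/tfds", flags=dataflow_flags("my-project", "my-bucket")) would then submit the generation job, assuming those resources exist and credentials are configured.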

c4 is configured with tfds.text.c4.C4Config and has the following configurations predefined (defaults to the first one):

  • en (v1.1.0) (Size: ?? GiB): English C4 dataset.

  • en.noclean (v1.1.0) (Size: ?? GiB): Disables all cleaning (deduplication, removal based on bad words, etc.)

  • en.realnewslike (v1.1.0) (Size: ?? GiB): Filters from the default config to only include content from the domains used in the 'RealNews' dataset (Zellers et al., 2019).

  • en.webtextlike (v1.1.0) (Size: ?? GiB): Filters from the default config to only include content from the URLs in OpenWebText (https://github.com/jcpeterson/openwebtext).
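To illustrate how a configuration is selected, the sketch below composes the "name/config:version" string that tfds.load accepts and loads the result. The helpers c4_name and load_c4 are hypothetical, and the "train" split is an assumption not stated on this page; loading only works once the chosen config has been prepared.

```python
# Config names as listed above; the first one is the default.
C4_CONFIGS = ("en", "en.noclean", "en.realnewslike", "en.webtextlike")


def c4_name(config=None, version=None):
    """Return a TFDS dataset name such as 'c4/en.noclean:1.1.0'."""
    name = "c4"
    if config is not None:
        if config not in C4_CONFIGS:
            raise ValueError(f"unknown c4 config: {config!r}")
        name += f"/{config}"
    if version is not None:
        name += f":{version}"
    return name


def load_c4(config=None, version=None, split="train", data_dir=None):
    """Load an already-prepared c4 config (split name is an assumption)."""
    import tensorflow_datasets as tfds  # deferred: heavy import

    return tfds.load(c4_name(config, version), split=split, data_dir=data_dir)
```

Passing no config, as in load_c4(), falls back to the first predefined configuration (en).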

c4/en

English C4 dataset.

Versions:

  • 1.1.0 (default):
  • 1.0.0: None
  • 1.0.1: None

Statistics

None computed

Features

FeaturesDict({
    'content-length': Text(shape=(), dtype=tf.string),
    'content-type': Text(shape=(), dtype=tf.string),
    'text': Text(shape=(), dtype=tf.string),
    'timestamp': Text(shape=(), dtype=tf.string),
    'url': Text(shape=(), dtype=tf.string),
})
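Every feature is a scalar tf.string, so records read through tfds.as_numpy arrive as dictionaries of bytes values. A minimal sketch of turning one such record into Python strings follows; decode_c4_example is a hypothetical helper, and treating 'content-length' as an integer is an assumption about its contents rather than something this page guarantees.

```python
def decode_c4_example(example):
    """Decode a c4 record whose values are UTF-8 bytes into Python strings."""
    decoded = {}
    for key, value in example.items():
        if isinstance(value, bytes):
            decoded[key] = value.decode("utf-8")
        else:
            decoded[key] = str(value)
    # Assumption: 'content-length' holds a decimal number stored as text.
    length = decoded.get("content-length", "")
    if length.isdigit():
        decoded["content-length"] = int(length)
    return decoded
```

With a prepared dataset, this could be applied as: for record in tfds.as_numpy(dataset): row = decode_c4_example(record).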

Homepage

c4/en.noclean

Disables all cleaning (deduplication, removal based on bad words, etc.)

Versions:

  • 1.1.0 (default):
  • 1.0.0: None
  • 1.0.1: None

Statistics

None computed

Features

FeaturesDict({
    'content-length': Text(shape=(), dtype=tf.string),
    'content-type': Text(shape=(), dtype=tf.string),
    'text': Text(shape=(), dtype=tf.string),
    'timestamp': Text(shape=(), dtype=tf.string),
    'url': Text(shape=(), dtype=tf.string),
})

Homepage

c4/en.realnewslike

Filters from the default config to only include content from the domains used in the 'RealNews' dataset (Zellers et al., 2019).

Versions:

  • 1.1.0 (default):
  • 1.0.0: None
  • 1.0.1: None

Statistics

None computed

Features

FeaturesDict({
    'content-length': Text(shape=(), dtype=tf.string),
    'content-type': Text(shape=(), dtype=tf.string),
    'text': Text(shape=(), dtype=tf.string),
    'timestamp': Text(shape=(), dtype=tf.string),
    'url': Text(shape=(), dtype=tf.string),
})

Homepage

c4/en.webtextlike

Filters from the default config to only include content from the URLs in OpenWebText (https://github.com/jcpeterson/openwebtext).

Versions:

  • 1.1.0 (default):
  • 1.0.0: None
  • 1.0.1: None

Statistics

None computed

Features

FeaturesDict({
    'content-length': Text(shape=(), dtype=tf.string),
    'content-type': Text(shape=(), dtype=tf.string),
    'text': Text(shape=(), dtype=tf.string),
    'timestamp': Text(shape=(), dtype=tf.string),
    'url': Text(shape=(), dtype=tf.string),
})

Homepage

Citation

@article{2019t5,
  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal = {arXiv e-prints},
  year = {2019},
  archivePrefix = {arXiv},
  eprint = {1910.10683},
}