c4

  • Description:

A colossal, cleaned version of Common Crawl's web crawl corpus.

Based on the Common Crawl dataset: https://commoncrawl.org

To generate this dataset, please follow the instructions in the T5 repository (linked under Homepage below).

Due to the overhead of cleaning the dataset, it is recommended that you prepare it with a distributed service like Cloud Dataflow. More info at https://www.tensorflow.org/datasets/beam_datasets.
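
As a rough illustration (assuming the English config has already been prepared, or is available through TFDS), loading and iterating the dataset might look like the following sketch:

import tensorflow_datasets as tfds

# Minimal sketch: load the prepared default English config.
# Preparing C4 from scratch is expensive, which is why the note above
# recommends running generation on a Beam runner such as Cloud Dataflow.
ds = tfds.load("c4/en", split="train", shuffle_files=True)

for example in ds.take(1):
    # Each example follows the feature spec listed further below:
    # 'text', 'url', 'timestamp', 'content-length', 'content-type'.
    print(example["url"].numpy())
    print(example["text"].numpy()[:200])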

  • Homepage: https://github.com/google-research/text-to-text-transfer-transformer#datasets

  • Source code: tfds.text.C4

  • Versions:

    • 3.0.1 (default) : No release notes.

    • 2.3.1: No release notes.

    • 2.3.0: No release notes.

    • 2.2.1: No release notes.

    • 2.2.0: No release notes.

  • Download size: Unknown size

  • Dataset size: Unknown size

  • Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
    You are using a C4 config that requires some files to be manually downloaded:
    • For c4/webtextlike, download OpenWebText.zip from https://mega.nz/#F!EZZD0YwJ!9_PlEQzdMVLaNdKv_ICNVQ
    • For c4/multilingual and c4/en.noclean, download the Common Crawl WET files.
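
For the configs that need manual files, a minimal sketch of pointing TFDS at the manual directory is shown below (the path is an illustrative placeholder; by default TFDS looks in ~/tensorflow_datasets/downloads/manual/):

import tensorflow_datasets as tfds

# Sketch: prepare c4/webtextlike after manually placing OpenWebText.zip
# into the manual directory. The path below is an illustrative placeholder.
builder = tfds.builder("c4/webtextlike")
download_config = tfds.download.DownloadConfig(
    manual_dir="/path/to/tensorflow_datasets/downloads/manual"
)
builder.download_and_prepare(download_config=download_config)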

  • Auto-cached (documentation): Unknown

  • Splits:

Split Examples
  • Features:
FeaturesDict({
    'content-length': Text(shape=(), dtype=tf.string),
    'content-type': Text(shape=(), dtype=tf.string),
    'text': Text(shape=(), dtype=tf.string),
    'timestamp': Text(shape=(), dtype=tf.string),
    'url': Text(shape=(), dtype=tf.string),
})
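
The feature spec above can also be inspected programmatically without downloading or preparing any data; a small sketch:

import tensorflow_datasets as tfds

# Sketch: read the feature spec of the default config from the builder info.
builder = tfds.builder("c4/en")
print(builder.info.features)          # the FeaturesDict listed above
print(builder.info.features["text"])  # Text(shape=(), dtype=tf.string)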

  • Citation:

@article{2019t5,
  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal = {arXiv e-prints},
  year = {2019},
  archivePrefix = {arXiv},
  eprint = {1910.10683},
}

c4/en (default config)

  • Config description: English C4 dataset.

c4/en.noclean

  • Config description: Disables all cleaning (deduplication, removal based on bad words, etc.).

c4/realnewslike

  • Config description: Filters from the default config to only include content from the domains used in the 'RealNews' dataset (Zellers et al., 2019).

c4/webtextlike

  • Config description: Filters from the default config to only include content from the URLs used in OpenWebText (see the manual download instructions above).

c4/multilingual

  • Config description: Multilingual C4 (mC4) has 101 languages and is generated from 71 Common Crawl dumps.
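
Any of the configs above can be selected by name, optionally pinned to a version; a brief sketch (note that en.noclean, webtextlike, and multilingual additionally require the manual downloads described earlier):

import tensorflow_datasets as tfds

# Sketch: select a non-default config, optionally pinning the version.
realnews = tfds.load("c4/realnewslike", split="train")
pinned = tfds.load("c4/en:3.0.1", split="train")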