- Description:
Mozilla Common Voice Dataset
Additional Documentation: Explore on Papers With Code
Homepage: https://voice.mozilla.org/en/datasets
Source code:
tfds.audio.CommonVoice
Versions:
1.0.0: Initial release.
2.0.0 (default): Updated to corpus 6.1 from 2020-12-11.
Feature structure:
FeaturesDict({
'accent': Text(shape=(), dtype=string),
'age': Text(shape=(), dtype=string),
'client_id': Text(shape=(), dtype=string),
'downvotes': Scalar(shape=(), dtype=int32, description=Number of people who said audio does not match text),
'gender': ClassLabel(shape=(), dtype=int64, num_classes=3),
'segment': Text(shape=(), dtype=string),
'sentence': Text(shape=(), dtype=string),
'upvotes': Scalar(shape=(), dtype=int32, description=Number of people who said audio matches the text),
'voice': Audio(shape=(None,), dtype=int64),
})
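A minimal sketch of loading one language config and reading the features listed above. It assumes `tensorflow_datasets` is installed and that the dataset can be downloaded and prepared in your environment. The helper names `config_name`, `load_common_voice`, and `print_first_example` are illustrative and not part of TFDS; the import is deferred so the name helper works without TensorFlow.

```python
def config_name(language="en", version=None):
    """Build a TFDS name such as 'common_voice/en' or 'common_voice/en:2.0.0'."""
    name = f"common_voice/{language}"
    return f"{name}:{version}" if version else name


def load_common_voice(language="en", split="train", version=None, data_dir=None):
    """Load one config and return (dataset, info). Requires tensorflow_datasets."""
    import tensorflow_datasets as tfds  # deferred: heavy optional dependency

    return tfds.load(
        config_name(language, version),
        split=split,
        data_dir=data_dir,
        with_info=True,
    )


def print_first_example(language="ab"):
    """Print the transcript, gender name, and audio shape of one example."""
    ds, info = load_common_voice(language, split="train")
    gender = info.features["gender"]  # ClassLabel: maps integer ids to names
    for example in ds.take(1):
        print(example["sentence"].numpy().decode("utf-8"))
        print(gender.int2str(int(example["gender"].numpy())))
        print(example["voice"].shape)  # 1-D tensor of audio samples
```

Calling `print_first_example("ab")` is a cheap way to try this out, since `common_voice/ab` is one of the smallest configs listed on this page.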
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
FeaturesDict | | | | |
accent | Text | | string | Accent of the speaker, see https://github.com/common-voice/common-voice/blob/main/web/src/stores/demographics.ts |
age | Text | | string | Age bucket of the speaker (e.g. teens, or fourties), see https://github.com/common-voice/common-voice/blob/main/web/src/stores/demographics.ts |
client_id | Text | | string | Hashed UUID of a given user |
downvotes | Scalar | | int32 | Number of people who said audio does not match text |
gender | ClassLabel | | int64 | Gender of the speaker |
segment | Text | | string | If sentence belongs to a custom dataset segment, it will be listed here |
sentence | Text | | string | Supposed transcription of the audio |
upvotes | Scalar | | int32 | Number of people who said audio matches the text |
voice | Audio | (None,) | int64 | |
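The `upvotes` and `downvotes` fields can be used to keep only clips that reviewers mostly agreed on. The sketch below works on plain Python dicts with the same keys; the function name, the `min_margin` threshold, and the three sample rows are assumptions made for illustration, not values from the dataset.

```python
def is_well_reviewed(example, min_margin=2):
    """True when upvotes exceed downvotes by at least `min_margin`."""
    return example["upvotes"] - example["downvotes"] >= min_margin


# Hypothetical stand-in rows; real examples come from the loaded dataset.
examples = [
    {"sentence": "clip a", "upvotes": 3, "downvotes": 0},
    {"sentence": "clip b", "upvotes": 2, "downvotes": 2},
    {"sentence": "clip c", "upvotes": 4, "downvotes": 2},
]
kept = [e["sentence"] for e in examples if is_well_reviewed(e)]
print(kept)  # ['clip a', 'clip c']
```

On a `tf.data.Dataset`, the same comparison can be passed to `ds.filter`, for example `ds.filter(lambda e: e["upvotes"] - e["downvotes"] >= 2)`.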
Supervised keys (see the as_supervised doc): None
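Because no supervised keys are defined, `as_supervised=True` cannot return `(input, target)` pairs for this dataset, so the pair has to be built by hand. A sketch, assuming `voice` is the input and `sentence` is the target; that pairing is a choice made here for speech recognition, not something the catalog specifies, and `to_supervised` is a hypothetical helper name.

```python
def to_supervised(example):
    """Map a feature dict to an (audio, transcript) pair."""
    return example["voice"], example["sentence"]


# Works on any mapping with these keys, e.g. a hand-made stand-in example.
audio, text = to_supervised({"voice": [0, 12, -7], "sentence": "hello"})
print(len(audio), text)  # 3 hello
```

The same function can be applied to a loaded dataset with `ds.map(to_supervised)`.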
Figure (tfds.show_examples): Not supported.
Citation:
common_voice/en (default config)
Config description: Language Code: en
Download size:
56.45 GiB
Dataset size:
2.79 TiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 16,164 |
'test' | 16,164 |
'train' | 564,337 |
'validation' | 1,224,864 |
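With 564,337 `train` examples in the `en` config, a small slice is often enough for a quick experiment. TFDS accepts slice strings such as `train[:1%]` as the `split` argument; the helper below only builds that string, and its name is hypothetical. Slicing selects which prepared examples are read; this sketch assumes nothing about whether it reduces the download.

```python
def first_percent(split, percent):
    """Return a TFDS split string selecting the first `percent`% of `split`."""
    if not isinstance(percent, int) or not 0 < percent <= 100:
        raise ValueError("percent must be an integer in (0, 100]")
    return f"{split}[:{percent}%]"


print(first_percent("train", 1))  # train[:1%]
```

The result can be passed directly, for example `tfds.load("common_voice/en", split=first_percent("train", 1))`.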
- Examples (tfds.as_dataframe):
common_voice/ab
Config description: Language Code: ab
Download size:
39.14 MiB
Dataset size:
133.24 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 9 |
'train' | 22 |
'validation' | 31 |
- Examples (tfds.as_dataframe):
common_voice/ar
Config description: Language Code: ar
Download size:
1.64 GiB
Dataset size:
67.16 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 7,517 |
'test' | 7,622 |
'train' | 14,227 |
'validation' | 43,291 |
- Examples (tfds.as_dataframe):
common_voice/as
Config description: Language Code: as
Download size:
21.20 MiB
Dataset size:
1.65 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 124 |
'test' | 110 |
'train' | 270 |
'validation' | 504 |
- Examples (tfds.as_dataframe):
common_voice/br
Config description: Language Code: br
Download size:
443.72 MiB
Dataset size:
13.46 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 1,997 |
'test' | 2,087 |
'train' | 2,780 |
'validation' | 8,560 |
- Examples (tfds.as_dataframe):
common_voice/ca
Config description: Language Code: ca
Download size:
19.32 GiB
Dataset size:
1.19 TiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 15,724 |
'test' | 15,724 |
'train' | 285,584 |
'validation' | 416,701 |
- Examples (tfds.as_dataframe):
common_voice/cnh
Config description: Language Code: cnh
Download size:
153.86 MiB
Dataset size:
5.12 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 756 |
'test' | 752 |
'train' | 807 |
'validation' | 2,432 |
- Examples (tfds.as_dataframe):
common_voice/cs
Config description: Language Code: cs
Download size:
1.18 GiB
Dataset size:
56.89 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 4,118 |
'test' | 4,144 |
'train' | 5,655 |
'validation' | 30,431 |
- Examples (tfds.as_dataframe):
common_voice/cv
Config description: Language Code: cv
Download size:
418.98 MiB
Dataset size:
8.10 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 818 |
'test' | 788 |
'train' | 931 |
'validation' | 3,496 |
- Examples (tfds.as_dataframe):
common_voice/cy
Config description: Language Code: cy
Download size:
3.20 GiB
Dataset size:
128.68 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 4,776 |
'test' | 4,820 |
'train' | 6,839 |
'validation' | 72,984 |
- Examples (tfds.as_dataframe):
common_voice/de
Config description: Language Code: de
Download size:
21.68 GiB
Dataset size:
1.29 TiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 15,588 |
'test' | 15,588 |
'train' | 246,525 |
'validation' | 565,186 |
- Examples (tfds.as_dataframe):
common_voice/dv
Config description: Language Code: dv
Download size:
515.45 MiB
Dataset size:
31.59 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 2,077 |
'test' | 2,202 |
'train' | 2,680 |
'validation' | 11,866 |
- Examples (tfds.as_dataframe):
common_voice/el
Config description: Language Code: el
Download size:
363.89 MiB
Dataset size:
14.62 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 1,401 |
'test' | 1,522 |
'train' | 2,316 |
'validation' | 5,996 |
- Examples (tfds.as_dataframe):
common_voice/eo
Config description: Language Code: eo
Download size:
2.69 GiB
Dataset size:
167.14 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 8,987 |
'test' | 8,969 |
'train' | 19,587 |
'validation' | 58,094 |
- Examples (tfds.as_dataframe):
common_voice/es
Config description: Language Code: es
Download size:
15.08 GiB
Dataset size:
684.66 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 15,089 |
'test' | 15,089 |
'train' | 161,813 |
'validation' | 236,314 |
- Examples (tfds.as_dataframe):
common_voice/et
Config description: Language Code: et
Download size:
731.63 MiB
Dataset size:
37.95 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 2,507 |
'test' | 2,509 |
'train' | 2,966 |
'validation' | 10,683 |
- Examples (tfds.as_dataframe):
common_voice/eu
Config description: Language Code: eu
Download size:
3.41 GiB
Dataset size:
127.60 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 5,172 |
'test' | 5,172 |
'train' | 7,505 |
'validation' | 63,009 |
- Examples (tfds.as_dataframe):
common_voice/fa
Config description: Language Code: fa
Download size:
8.27 GiB
Dataset size:
328.61 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 5,213 |
'test' | 5,213 |
'train' | 7,593 |
'validation' | 251,659 |
- Examples (tfds.as_dataframe):
common_voice/fi
Config description: Language Code: fi
Download size:
47.57 MiB
Dataset size:
3.41 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 415 |
'test' | 428 |
'train' | 460 |
'validation' | 1,305 |
- Examples (tfds.as_dataframe):
common_voice/fr
Config description: Language Code: fr
Download size:
17.82 GiB
Dataset size:
1.17 TiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 15,763 |
'test' | 15,763 |
'train' | 298,982 |
'validation' | 461,004 |
- Examples (tfds.as_dataframe):
common_voice/fy-NL
Config description: Language Code: fy-NL
Download size:
1.15 GiB
Dataset size:
29.93 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 2,790 |
'test' | 3,020 |
'train' | 3,927 |
'validation' | 10,495 |
- Examples (tfds.as_dataframe):
common_voice/ga-IE
Config description: Language Code: ga-IE
Download size:
149.30 MiB
Dataset size:
5.11 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 497 |
'test' | 506 |
'train' | 541 |
'validation' | 3,352 |
- Examples (tfds.as_dataframe):
common_voice/hi
Config description: Language Code: hi
Download size:
20.43 MiB
Dataset size:
1.15 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 135 |
'test' | 127 |
'train' | 157 |
'validation' | 419 |
- Examples (tfds.as_dataframe):
common_voice/hsb
Config description: Language Code: hsb
Download size:
75.69 MiB
Dataset size:
5.67 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 172 |
'test' | 387 |
'train' | 808 |
'validation' | 1,367 |
- Examples (tfds.as_dataframe):
common_voice/hu
Config description: Language Code: hu
Download size:
231.51 MiB
Dataset size:
17.07 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 1,434 |
'test' | 1,649 |
'train' | 3,348 |
'validation' | 6,457 |
- Examples (tfds.as_dataframe):
common_voice/ia
Config description: Language Code: ia
Download size:
216.01 MiB
Dataset size:
14.99 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 1,601 |
'test' | 899 |
'train' | 3,477 |
'validation' | 5,978 |
- Examples (tfds.as_dataframe):
common_voice/id
Config description: Language Code: id
Download size:
453.87 MiB
Dataset size:
17.20 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 1,835 |
'test' | 1,844 |
'train' | 2,130 |
'validation' | 8,696 |
- Examples (tfds.as_dataframe):
common_voice/it
Config description: Language Code: it
Download size:
5.20 GiB
Dataset size:
316.38 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 12,928 |
'test' | 12,928 |
'train' | 58,015 |
'validation' | 102,579 |
- Examples (tfds.as_dataframe):
common_voice/ja
Config description: Language Code: ja
Download size:
145.80 MiB
Dataset size:
6.83 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 586 |
'test' | 632 |
'train' | 722 |
'validation' | 3,072 |
- Examples (tfds.as_dataframe):
common_voice/ka
Config description: Language Code: ka
Download size:
99.45 MiB
Dataset size:
7.51 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 527 |
'test' | 656 |
'train' | 1,058 |
'validation' | 2,275 |
- Examples (tfds.as_dataframe):
common_voice/kab
Config description: Language Code: kab
Download size:
15.99 GiB
Dataset size:
718.51 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 14,622 |
'test' | 14,622 |
'train' | 120,530 |
'validation' | 573,718 |
- Examples (tfds.as_dataframe):
common_voice/ky
Config description: Language Code: ky
Download size:
552.60 MiB
Dataset size:
18.70 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 1,511 |
'test' | 1,503 |
'train' | 1,955 |
'validation' | 9,236 |
- Examples (tfds.as_dataframe):
common_voice/lg
Config description: Language Code: lg
Download size:
198.55 MiB
Dataset size:
6.65 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 384 |
'test' | 584 |
'train' | 1,250 |
'validation' | 2,220 |
- Examples (tfds.as_dataframe):
common_voice/lt
Config description: Language Code: lt
Download size:
129.03 MiB
Dataset size:
4.79 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 244 |
'test' | 466 |
'train' | 931 |
'validation' | 1,644 |
- Examples (tfds.as_dataframe):
common_voice/lv
Config description: Language Code: lv
Download size:
198.66 MiB
Dataset size:
13.07 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 2,002 |
'test' | 1,882 |
'train' | 2,552 |
'validation' | 6,444 |
- Examples (tfds.as_dataframe):
common_voice/mn
Config description: Language Code: mn
Download size:
463.84 MiB
Dataset size:
22.09 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 1,837 |
'test' | 1,862 |
'train' | 2,183 |
'validation' | 7,487 |
- Examples (tfds.as_dataframe):
common_voice/mt
Config description: Language Code: mt
Download size:
405.42 MiB
Dataset size:
15.09 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 1,516 |
'test' | 1,617 |
'train' | 2,036 |
'validation' | 5,747 |
- Examples (tfds.as_dataframe):
common_voice/nl
Config description: Language Code: nl
Download size:
1.62 GiB
Dataset size:
90.20 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 4,938 |
'test' | 5,708 |
'train' | 9,460 |
'validation' | 52,488 |
- Examples (tfds.as_dataframe):
common_voice/or
Config description: Language Code: or
Download size:
189.85 MiB
Dataset size:
1.97 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 129 |
'test' | 98 |
'train' | 388 |
'validation' | 615 |
- Examples (tfds.as_dataframe):
common_voice/pa-IN
Config description: Language Code: pa-IN
Download size:
66.52 MiB
Dataset size:
1.03 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 44 |
'test' | 116 |
'train' | 211 |
'validation' | 371 |
- Examples (tfds.as_dataframe):
common_voice/pl
Config description: Language Code: pl
Download size:
3.29 GiB
Dataset size:
141.06 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 5,153 |
'test' | 5,153 |
'train' | 7,468 |
'validation' | 90,791 |
- Examples (tfds.as_dataframe):
common_voice/pt
Config description: Language Code: pt
Download size:
1.59 GiB
Dataset size:
75.64 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 4,592 |
'test' | 4,641 |
'train' | 6,514 |
'validation' | 41,584 |
- Examples (tfds.as_dataframe):
common_voice/rm-sursilv
Config description: Language Code: rm-sursilv
Download size:
263.17 MiB
Dataset size:
12.31 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 1,205 |
'test' | 1,194 |
'train' | 1,384 |
'validation' | 3,783 |
- Examples (tfds.as_dataframe):
common_voice/rm-vallader
Config description: Language Code: rm-vallader
Download size:
103.11 MiB
Dataset size:
4.89 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 357 |
'test' | 378 |
'train' | 574 |
'validation' | 1,316 |
- Examples (tfds.as_dataframe):
common_voice/ro
Config description: Language Code: ro
Download size:
249.84 MiB
Dataset size:
14.54 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 858 |
'test' | 1,778 |
'train' | 3,399 |
'validation' | 6,039 |
- Examples (tfds.as_dataframe):
common_voice/ru
Config description: Language Code: ru
Download size:
3.40 GiB
Dataset size:
175.04 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 7,963 |
'test' | 8,007 |
'train' | 15,481 |
'validation' | 74,256 |
- Examples (tfds.as_dataframe):
common_voice/rw
Config description: Language Code: rw
Download size:
39.62 GiB
Dataset size:
2.18 TiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 15,032 |
'test' | 15,724 |
'train' | 515,197 |
'validation' | 832,929 |
- Examples (tfds.as_dataframe):
common_voice/sah
Config description: Language Code: sah
Download size:
172.85 MiB
Dataset size:
9.42 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 405 |
'test' | 757 |
'train' | 1,442 |
'validation' | 2,606 |
- Examples (tfds.as_dataframe):
common_voice/sl
Config description: Language Code: sl
Download size:
212.43 MiB
Dataset size:
9.67 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 556 |
'test' | 881 |
'train' | 2,038 |
'validation' | 4,669 |
- Examples (tfds.as_dataframe):
common_voice/sv-SE
Config description: Language Code: sv-SE
Download size:
401.91 MiB
Dataset size:
18.27 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 2,019 |
'test' | 2,027 |
'train' | 2,331 |
'validation' | 12,552 |
- Examples (tfds.as_dataframe):
common_voice/ta
Config description: Language Code: ta
Download size:
648.28 MiB
Dataset size:
24.06 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 1,779 |
'test' | 1,781 |
'train' | 2,009 |
'validation' | 12,652 |
- Examples (tfds.as_dataframe):
common_voice/th
Config description: Language Code: th
Download size:
325.49 MiB
Dataset size:
18.32 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 1,922 |
'test' | 2,188 |
'train' | 2,917 |
'validation' | 7,028 |
- Examples (tfds.as_dataframe):
common_voice/tr
Config description: Language Code: tr
Download size:
592.09 MiB
Dataset size:
28.21 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 1,647 |
'test' | 1,647 |
'train' | 1,831 |
'validation' | 18,685 |
- Examples (tfds.as_dataframe):
common_voice/tt
Config description: Language Code: tt
Download size:
741.15 MiB
Dataset size:
46.85 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 2,127 |
'test' | 4,485 |
'train' | 11,211 |
'validation' | 25,781 |
- Examples (tfds.as_dataframe):
common_voice/uk
Config description: Language Code: uk
Download size:
1.13 GiB
Dataset size:
49.66 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 3,236 |
'test' | 3,235 |
'train' | 4,035 |
'validation' | 22,337 |
- Examples (tfds.as_dataframe):
common_voice/vi
Config description: Language Code: vi
Download size:
49.52 MiB
Dataset size:
1.47 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 200 |
'test' | 198 |
'train' | 221 |
'validation' | 619 |
- Examples (tfds.as_dataframe):
common_voice/vot
Config description: Language Code: vot
Download size:
7.43 MiB
Dataset size:
11.39 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'train' | 3 |
'validation' | 3 |
- Examples (tfds.as_dataframe):
common_voice/zh-CN
Config description: Language Code: zh-CN
Download size:
2.03 GiB
Dataset size:
122.54 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 8,743 |
'test' | 8,760 |
'train' | 18,541 |
'validation' | 36,405 |
- Examples (tfds.as_dataframe):
common_voice/zh-HK
Config description: Language Code: zh-HK
Download size:
2.58 GiB
Dataset size:
78.80 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 5,172 |
'test' | 5,172 |
'train' | 7,506 |
'validation' | 41,835 |
- Examples (tfds.as_dataframe):
common_voice/zh-TW
Config description: Language Code: zh-TW
Download size:
2.03 GiB
Dataset size:
69.06 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'dev' | 2,895 |
'test' | 2,895 |
'train' | 3,507 |
'validation' | 61,232 |
- Examples (tfds.as_dataframe):