- Description:
Cleaned-up text for 40+ Wikipedia language editions of pages that correspond to entities. The dataset has train/dev/test splits per language (the dev split is exposed as 'validation'). It is cleaned up by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the Wikidata ID of the entity and the full Wikipedia article after page processing that removes non-content sections and structured objects. The language models trained on this corpus, including 41 monolingual models and 2 multilingual models, can be found at https://tfhub.dev/google/collections/wiki40b-lm/1
Homepage: https://research.google/pubs/pub49029/
Source code:
tfds.text.Wiki40b
Versions:
1.3.0 (default): No release notes.
Download size:
Unknown size
Feature structure:
FeaturesDict({
    'text': Text(shape=(), dtype=string),
    'version_id': Text(shape=(), dtype=string),
    'wikidata_id': Text(shape=(), dtype=string),
})
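As a minimal sketch of how these three string features might be read, the snippet below builds a config name and loads a few examples from one split with tfds.load. The helper names config_name and iter_articles are hypothetical and not part of TFDS, and try_gcs=True is an assumption: it presumes the prepared data is readable from the public TFDS bucket, which this page does not confirm.

```python
def config_name(lang: str) -> str:
    """Return the TFDS config string for a language code, e.g. 'wiki40b/en'."""
    if not lang or "/" in lang:
        raise ValueError(f"unexpected language code: {lang!r}")
    return f"wiki40b/{lang}"


def iter_articles(lang: str = "en", split: str = "validation", limit: int = 3):
    """Yield (wikidata_id, version_id, text) tuples as decoded Python strings."""
    # Imported lazily so config_name() stays usable without TensorFlow installed.
    import tensorflow_datasets as tfds

    # Assumption: try_gcs=True lets TFDS read already-prepared data from its
    # public bucket instead of generating the dataset locally.
    ds = tfds.load(config_name(lang), split=split, try_gcs=True)
    for example in tfds.as_numpy(ds.take(limit)):
        # Every feature is a scalar string, which arrives as UTF-8 bytes.
        yield (
            example["wikidata_id"].decode("utf-8"),
            example["version_id"].decode("utf-8"),
            example["text"].decode("utf-8"),
        )
```

Calling list(iter_articles("da", "test")) would then return up to three tuples for the Danish test split, provided the data can be fetched.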
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
text | Text | | string | |
version_id | Text | | string | |
wikidata_id | Text | | string | |
Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Citation:
@inproceedings{49029,
title = {Wiki-40B: Multilingual Language Model Dataset},
author = {Mandy Guo and Zihang Dai and Denny Vrandecic and Rami Al-Rfou},
year = {2020},
booktitle = {LREC 2020}
}
wiki40b/en (default config)
Config description: Wiki40B dataset for en.
Dataset size:
9.91 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 162,274 |
'train' | 2,926,536 |
'validation' | 163,597 |
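The English split sizes listed here work out to roughly a 90/5/5 partition. The short sketch below redoes that arithmetic; the names EN_SPLITS and split_fractions are illustrative only, and the counts are copied from the table for wiki40b/en.

```python
# Example counts for wiki40b/en, copied from the Splits table.
EN_SPLITS = {"train": 2_926_536, "validation": 163_597, "test": 162_274}


def split_fractions(counts: dict) -> dict:
    """Return each split's share of the total number of examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


fractions = split_fractions(EN_SPLITS)
for name, frac in fractions.items():
    # Prints one line per split with its percentage of all examples.
    print(f"{name}: {frac:.1%}")
```

The same helper can be applied to any other config by substituting that config's counts.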
wiki40b/ar
Config description: Wiki40B dataset for ar.
Dataset size:
833.20 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 12,271 |
'train' | 220,885 |
'validation' | 12,198 |
wiki40b/zh-cn
Config description: Wiki40B dataset for zh-cn.
Dataset size:
985.53 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 30,355 |
'train' | 549,672 |
'validation' | 30,299 |
wiki40b/zh-tw
Config description: Wiki40B dataset for zh-tw.
Dataset size:
986.45 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 30,670 |
'train' | 552,031 |
'validation' | 30,739 |
wiki40b/nl
Config description: Wiki40B dataset for nl.
Dataset size:
961.82 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 24,776 |
'train' | 447,555 |
'validation' | 25,201 |
wiki40b/fr
Config description: Wiki40B dataset for fr.
Dataset size:
3.37 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 68,004 |
'train' | 1,227,206 |
'validation' | 68,655 |
wiki40b/de
Config description: Wiki40B dataset for de.
Dataset size:
4.78 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 86,594 |
'train' | 1,554,910 |
'validation' | 86,068 |
wiki40b/it
Config description: Wiki40B dataset for it.
Dataset size:
2.00 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 40,443 |
'train' | 732,609 |
'validation' | 40,684 |
wiki40b/ja
Config description: Wiki40B dataset for ja.
Dataset size:
2.19 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 41,268 |
'train' | 745,392 |
'validation' | 41,576 |
wiki40b/ko
Config description: Wiki40B dataset for ko.
Dataset size:
453.98 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 10,802 |
'train' | 194,977 |
'validation' | 10,805 |
wiki40b/pl
Config description: Wiki40B dataset for pl.
Dataset size:
1.03 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 27,987 |
'train' | 505,191 |
'validation' | 28,310 |
wiki40b/pt
Config description: Wiki40B dataset for pt.
Dataset size:
1.08 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 22,693 |
'train' | 406,507 |
'validation' | 22,301 |
wiki40b/ru
Config description: Wiki40B dataset for ru.
Dataset size:
4.13 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 51,885 |
'train' | 926,037 |
'validation' | 51,287 |
wiki40b/es
Config description: Wiki40B dataset for es.
Dataset size:
2.70 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 48,764 |
'train' | 872,541 |
'validation' | 48,592 |
wiki40b/th
Config description: Wiki40B dataset for th.
Dataset size:
326.29 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 3,114 |
'train' | 56,798 |
'validation' | 3,093 |
wiki40b/tr
Config description: Wiki40B dataset for tr.
Dataset size:
308.87 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 7,890 |
'train' | 142,576 |
'validation' | 7,845 |
wiki40b/bg
Config description: Wiki40B dataset for bg.
Dataset size:
433.20 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 7,289 |
'train' | 130,670 |
'validation' | 7,259 |
wiki40b/ca
Config description: Wiki40B dataset for ca.
Dataset size:
753.00 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 15,568 |
'train' | 277,313 |
'validation' | 15,362 |
wiki40b/cs
Config description: Wiki40B dataset for cs.
Dataset size:
631.84 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 12,984 |
'train' | 235,971 |
'validation' | 13,096 |
wiki40b/da
Config description: Wiki40B dataset for da.
Dataset size:
240.51 MiB
Auto-cached (documentation): Yes (test, validation), Only when shuffle_files=False (train)
Splits:
Split | Examples |
---|---|
'test' | 6,219 |
'train' | 109,486 |
'validation' | 6,173 |
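The "Auto-cached" entries on this page are consistent with a simple rule: a split is cached automatically when the whole dataset is smaller than about 250 MiB and either file shuffling is off or the split is read from a single shard. The sketch below models that rule as an assumption drawn from the general TFDS performance guidance, not from this page. The names size_to_mib and would_auto_cache are hypothetical, and the shard counts passed in are made-up inputs, since this page does not list shard counts.

```python
AUTO_CACHE_LIMIT_MIB = 250.0  # assumed threshold, not stated on this page

_UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def size_to_mib(size: str) -> float:
    """Convert a size string such as '240.51 MiB' or '9.91 GiB' to MiB."""
    value, unit = size.split()
    return float(value) * _UNIT_TO_MIB[unit]


def would_auto_cache(size: str, shuffle_files: bool, num_shards: int) -> bool:
    """Model of the assumed rule; num_shards is a hypothetical input."""
    small_enough = size_to_mib(size) < AUTO_CACHE_LIMIT_MIB
    # Reading order is stable when shuffling is off or only one shard is read.
    stable_order = (not shuffle_files) or num_shards == 1
    return small_enough and stable_order


# wiki40b/da is 240.51 MiB: cached for a multi-shard split only without shuffling.
print(would_auto_cache("240.51 MiB", shuffle_files=False, num_shards=4))
print(would_auto_cache("240.51 MiB", shuffle_files=True, num_shards=4))
```

Under this model, a config such as wiki40b/en (9.91 GiB) is never auto-cached, which matches its "No" entry.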
wiki40b/el
Config description: Wiki40B dataset for el.
Dataset size:
524.77 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 5,261 |
'train' | 93,596 |
'validation' | 5,130 |
wiki40b/et
Config description: Wiki40B dataset for et.
Dataset size:
184.07 MiB
Auto-cached (documentation): Yes (test, validation), Only when shuffle_files=False (train)
Splits:
Split | Examples |
---|---|
'test' | 6,205 |
'train' | 114,464 |
'validation' | 6,351 |
wiki40b/fa
Config description: Wiki40B dataset for fa.
Dataset size:
482.55 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 11,262 |
'train' | 203,145 |
'validation' | 11,180 |
wiki40b/fi
Config description: Wiki40B dataset for fi.
Dataset size:
534.13 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 14,179 |
'train' | 255,822 |
'validation' | 13,962 |
wiki40b/he
Config description: Wiki40B dataset for he.
Dataset size:
869.51 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 9,344 |
'train' | 165,359 |
'validation' | 9,231 |
wiki40b/hi
Config description: Wiki40B dataset for hi.
Dataset size:
277.56 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 2,643 |
'train' | 45,737 |
'validation' | 2,596 |
wiki40b/hr
Config description: Wiki40B dataset for hr.
Dataset size:
235.58 MiB
Auto-cached (documentation): Yes (test, validation), Only when shuffle_files=False (train)
Splits:
Split | Examples |
---|---|
'test' | 5,724 |
'train' | 103,857 |
'validation' | 5,792 |
wiki40b/hu
Config description: Wiki40B dataset for hu.
Dataset size:
634.25 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 15,258 |
'train' | 273,248 |
'validation' | 15,208 |
wiki40b/id
Config description: Wiki40B dataset for id.
Dataset size:
334.06 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 8,598 |
'train' | 156,255 |
'validation' | 8,714 |
wiki40b/lt
Config description: Wiki40B dataset for lt.
Dataset size:
140.46 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 4,683 |
'train' | 84,854 |
'validation' | 4,754 |
wiki40b/lv
Config description: Wiki40B dataset for lv.
Dataset size:
80.07 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 1,932 |
'train' | 33,064 |
'validation' | 1,857 |
wiki40b/ms
Config description: Wiki40B dataset for ms.
Dataset size:
142.49 MiB
Auto-cached (documentation): Yes (test, validation), Only when shuffle_files=False (train)
Splits:
Split | Examples |
---|---|
'test' | 5,235 |
'train' | 97,509 |
'validation' | 5,357 |
wiki40b/no
Config description: Wiki40B dataset for no.
Dataset size:
382.03 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 10,588 |
'train' | 190,588 |
'validation' | 10,547 |
wiki40b/ro
Config description: Wiki40B dataset for ro.
Dataset size:
319.68 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 7,870 |
'train' | 139,615 |
'validation' | 7,624 |
wiki40b/sk
Config description: Wiki40B dataset for sk.
Dataset size:
170.20 MiB
Auto-cached (documentation): Yes (test, validation), Only when shuffle_files=False (train)
Splits:
Split | Examples |
---|---|
'test' | 5,741 |
'train' | 103,095 |
'validation' | 5,604 |
wiki40b/sl
Config description: Wiki40B dataset for sl.
Dataset size:
157.38 MiB
Auto-cached (documentation): Yes (test, validation), Only when shuffle_files=False (train)
Splits:
Split | Examples |
---|---|
'test' | 3,341 |
'train' | 60,927 |
'validation' | 3,287 |
wiki40b/sr
Config description: Wiki40B dataset for sr.
Dataset size:
582.20 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 17,997 |
'train' | 327,313 |
'validation' | 18,100 |
wiki40b/sv
Config description: Wiki40B dataset for sv.
Dataset size:
613.62 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 22,291 |
'train' | 400,742 |
'validation' | 22,263 |
wiki40b/tl
Config description: Wiki40B dataset for tl.
Dataset size:
29.04 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 1,446 |
'train' | 25,940 |
'validation' | 1,472 |
wiki40b/uk
Config description: Wiki40B dataset for uk.
Dataset size:
1.67 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 26,581 |
'train' | 477,618 |
'validation' | 26,324 |
wiki40b/vi
Config description: Wiki40B dataset for vi.
Dataset size:
497.70 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 7,942 |
'train' | 146,255 |
'validation' | 8,195 |