Translate dataset based on the data from statmt.org.
Versions exist for the different years, each using a combination of multiple data
sources. The base wmt_translate allows you to create your own config to choose
your own data and language pair by creating a custom tfds.translate.wmt.WmtConfig:
import tensorflow_datasets as tfds

# Custom config: French-German pair built from the named source subsets.
config = tfds.translate.wmt.WmtConfig(
    version="0.0.1",
    language_pair=("fr", "de"),
    subsets={
        tfds.Split.TRAIN: ["commoncrawl_frde"],
        tfds.Split.VALIDATION: ["euelections_dev2019"],
    },
)
builder = tfds.builder("wmt_translate", config=config)
- URL: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/translate_ende.py
- DatasetBuilder: tfds.translate.wmt_t2t.WmtT2tTranslate
wmt_t2t_translate is configured with tfds.translate.wmt.WmtConfig and has
the following configurations predefined (defaults to the first one):
- de-en (v0.0.1) (Size: 1.61 GiB): WMT T2T EnDe translation task dataset.
wmt_t2t_translate/de-en
WMT T2T EnDe translation task dataset.
Versions:
- 0.0.1 (default)
- 1.0.0: None
WARNING: This dataset requires you to download the source data manually into
manual_dir (defaults to ~/tensorflow_datasets/manual/wmt_t2t_translate/): Some
of the wmt configs here require a manual download. Please look into wmt.py to
see the exact path (and file name) that has to be downloaded.
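The manual-download step in the warning can be sketched roughly as follows. This is a minimal illustration, not the library's prescribed workflow: the helper names default_manual_dir and prepare_de_en are hypothetical, the check only verifies that the default directory exists and is non-empty, and the required file names still have to be taken from wmt.py. It assumes the default data directory (~/tensorflow_datasets) is in use.

```python
from pathlib import Path


def default_manual_dir(dataset_name="wmt_t2t_translate"):
    """Return the default manual directory named in the warning (hypothetical helper)."""
    return Path.home() / "tensorflow_datasets" / "manual" / dataset_name


def prepare_de_en():
    """Prepare wmt_t2t_translate/de-en, assuming the source files were
    already placed in the default manual directory by hand."""
    # Imported lazily so the path helper above works without TFDS installed.
    import tensorflow_datasets as tfds

    manual_dir = default_manual_dir()
    if not manual_dir.is_dir() or not any(manual_dir.iterdir()):
        raise FileNotFoundError(
            f"Expected manually downloaded files in {manual_dir}; "
            "see wmt.py for the exact file names."
        )

    builder = tfds.builder("wmt_t2t_translate/de-en")
    builder.download_and_prepare()  # picks up files from the default manual_dir
    return builder


print(default_manual_dir())
```

Failing early with a clear message is a design choice of this sketch; it only saves a long wait before discovering that the manual files are missing.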
Statistics
Split | Examples |
---|---|
ALL | 4,598,292 |
TRAIN | 4,592,289 |
TEST | 3,003 |
VALIDATION | 3,000 |
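As a quick sanity check on the table above, the per-split counts should add up to the ALL row. The snippet below is a small sketch that copies the numbers from the table by hand; the variable names are illustrative only.

```python
# Example counts copied from the Statistics table.
split_examples = {"TRAIN": 4_592_289, "TEST": 3_003, "VALIDATION": 3_000}
reported_total = 4_598_292  # the ALL row

total = sum(split_examples.values())
assert total == reported_total, "split counts do not add up to ALL"

# Show each split's share of the full dataset.
for name, count in split_examples.items():
    print(f"{name}: {count:,} examples ({count / total:.4%} of ALL)")
```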
Features
Translation({
    'de': Text(shape=(), dtype=tf.string),
    'en': Text(shape=(), dtype=tf.string),
})
Homepage
Supervised keys (for as_supervised=True)
(u'de', u'en')
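To illustrate what the supervised keys mean in practice: with as_supervised=True, each example is returned as an (input, target) pair, here (de, en), instead of a feature dictionary. The sketch below assumes the dataset has already been prepared; load_de_en_pairs and to_supervised are hypothetical helper names, and to_supervised is only a pure-Python illustration of the mapping, not TFDS code.

```python
def load_de_en_pairs(split="train"):
    """Load (de, en) pairs; requires TFDS and the prepared dataset."""
    # Imported lazily so the illustration below runs without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load("wmt_t2t_translate/de-en", split=split, as_supervised=True)


def to_supervised(example, keys=("de", "en")):
    """Pick (input, target) from a feature dict, mirroring the supervised keys."""
    input_key, target_key = keys
    return example[input_key], example[target_key]


# Made-up sentence pair, used only to show the shape of the result.
pair = to_supervised({"de": "Guten Morgen", "en": "Good morning"})
print(pair)
```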
Citation
@InProceedings{bojar-EtAl:2014:W14-33,
author = {Bojar, Ondrej and Buck, Christian and Federmann, Christian and Haddow, Barry and Koehn, Philipp and Leveling, Johannes and Monz, Christof and Pecina, Pavel and Post, Matt and Saint-Amand, Herve and Soricut, Radu and Specia, Lucia and Tamchyna, Ale\v{s}},
title = {Findings of the 2014 Workshop on Statistical Machine Translation},
booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
month = {June},
year = {2014},
address = {Baltimore, Maryland, USA},
publisher = {Association for Computational Linguistics},
pages = {12--58},
url = {http://www.aclweb.org/anthology/W/W14/W14-3302}
}