wmt_t2t_translate (Manual download)

Translate dataset based on the data from statmt.org.

Versions exist for the different years, each using a combination of multiple data sources. The base wmt_translate builder lets you choose your own data sources and language pair by creating a custom tfds.translate.wmt.WmtConfig:

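import tensorflow_datasets as tfds

# Custom config: pick the language pair and the data sources used for each split.
config = tfds.translate.wmt.WmtConfig(
    version="0.0.1",
    language_pair=("fr", "de"),
    subsets={
        tfds.Split.TRAIN: ["commoncrawl_frde"],
        tfds.Split.VALIDATION: ["euelections_dev2019"],
    },
)
# Pass the custom config to the base wmt_translate builder.
builder = tfds.builder("wmt_translate", config=config)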

wmt_t2t_translate is configured with tfds.translate.wmt.WmtConfig and has the following configurations predefined (defaults to the first one):

  • de-en (v0.0.1) (Size: 1.61 GiB): WMT T2T EnDe translation task dataset.

wmt_t2t_translate/de-en

WMT T2T EnDe translation task dataset.

Versions:

  • 0.0.1 (default)
  • 1.0.0

WARNING: This dataset requires you to download the source data manually into manual_dir (defaults to ~/tensorflow_datasets/manual/wmt_t2t_translate/). Some of the WMT configs require a manual download; see wmt.py for the exact path and file name of each file that has to be downloaded.
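
As a rough illustration of the manual-download step, the sketch below checks whether anything has been placed in the default manual_dir before preparing the dataset. The helpers default_manual_dir, list_manual_files, and prepare_if_ready are hypothetical names introduced here, and the default path is simply the one quoted in the warning. Which files are actually required is not listed on this page, so the check only tests that the directory is non-empty; wmt.py remains the reference for exact file names.

```python
import os


def default_manual_dir(dataset_name="wmt_t2t_translate"):
    """Return the default manual_dir quoted in the warning (hypothetical helper)."""
    return os.path.join(
        os.path.expanduser("~"), "tensorflow_datasets", "manual", dataset_name
    )


def list_manual_files(manual_dir):
    """List entries already placed in manual_dir; empty if the directory is missing."""
    if not os.path.isdir(manual_dir):
        return []
    return sorted(os.listdir(manual_dir))


def prepare_if_ready(manual_dir=None):
    """Prepare the de-en config only if manual_dir contains something.

    Assumption: the files were placed in the default manual_dir, so no
    extra download configuration is passed to TFDS.
    """
    manual_dir = manual_dir or default_manual_dir()
    if not list_manual_files(manual_dir):
        raise FileNotFoundError(
            "No manually downloaded files found in %s; see wmt.py for the "
            "expected paths and file names." % manual_dir
        )
    import tensorflow_datasets as tfds  # deferred: only needed for the build

    builder = tfds.builder("wmt_t2t_translate/de-en")
    builder.download_and_prepare()
    return builder
```

Calling prepare_if_ready() fails early with a clear message when the directory is empty, rather than partway through dataset preparation.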

Statistics

Split       Examples
ALL         4,598,292
TRAIN       4,592,289
TEST        3,003
VALIDATION  3,000

Features

Translation({
    'de': Text(shape=(), dtype=tf.string),
    'en': Text(shape=(), dtype=tf.string),
})

Homepage

Supervised keys (for as_supervised=True)

(u'de', u'en')
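
To illustrate what the supervised keys mean: with as_supervised=True, each example comes back as an (input, target) tuple in the key order listed above rather than as a feature dictionary. The sketch below mimics that mapping in plain Python. Both to_supervised and load_supervised are hypothetical helpers introduced for illustration, and the TFDS import is deferred so the mapping can be tried without the prepared dataset.

```python
SUPERVISED_KEYS = ("de", "en")  # taken from the supervised keys listed above


def to_supervised(example, keys=SUPERVISED_KEYS):
    """Turn a feature dict into an (input, target) tuple, mirroring as_supervised=True."""
    input_key, target_key = keys
    return example[input_key], example[target_key]


def load_supervised(split="train"):
    """Load the de-en config as (de, en) pairs; requires the prepared dataset."""
    import tensorflow_datasets as tfds  # deferred: only needed for real loading

    return tfds.load("wmt_t2t_translate/de-en", split=split, as_supervised=True)


# Plain-Python demonstration with a made-up sentence pair.
pair = to_supervised({"de": "Guten Morgen", "en": "Good morning"})
print(pair)
```

With the real dataset, load_supervised() would yield string tensors instead of Python strings, but the (de, en) ordering is the same.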

Citation

@InProceedings{bojar-EtAl:2014:W14-33,
  author    = {Bojar, Ondrej  and  Buck, Christian  and  Federmann, Christian  and  Haddow, Barry  and  Koehn, Philipp  and  Leveling, Johannes  and  Monz, Christof  and  Pecina, Pavel  and  Post, Matt  and  Saint-Amand, Herve  and  Soricut, Radu  and  Specia, Lucia  and  Tamchyna, Ale\v{s}},
  title     = {Findings of the 2014 Workshop on Statistical Machine Translation},
  booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, Maryland, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {12--58},
  url       = {http://www.aclweb.org/anthology/W/W14/W14-3302}
}