flores

Evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English.

flores is configured with tfds.translate.flores.FloresConfig and has the following predefined configurations (the first one is the default):

  • neen_plain_text (v0.0.3) (Size: 984.65 KiB): Translation dataset from Nepali (ne) to English (en), using the plain_text encoder.

  • sien_plain_text (v0.0.3) (Size: 984.65 KiB): Translation dataset from Sinhala (si) to English (en), using the plain_text encoder.

flores/neen_plain_text

Translation({
    'en': Text(shape=(), dtype=tf.string),
    'ne': Text(shape=(), dtype=tf.string),
})

flores/sien_plain_text

Translation({
    'en': Text(shape=(), dtype=tf.string),
    'si': Text(shape=(), dtype=tf.string),
})
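
As a minimal sketch of how the two feature dictionaries above might be read, the snippet below loads one configuration with tfds.load and decodes each example into a pair of Python strings. The helper names (flores_name, decode_pair, first_pairs) are hypothetical and not part of TFDS. The config names and the lowercase "validation" split name are taken from this page (v0.0.3), so they may differ in other releases.

```python
from typing import List, Tuple


def flores_name(source_lang: str) -> str:
    """Build the dataset name for a config listed on this page."""
    if source_lang not in ("ne", "si"):
        raise ValueError("flores only pairs 'ne' or 'si' with 'en'")
    return "flores/{}en_plain_text".format(source_lang)


def decode_pair(example: dict, source_lang: str) -> Tuple[str, str]:
    """Convert one NumPy example (UTF-8 bytes values) into (source, English) strings."""
    return example[source_lang].decode("utf-8"), example["en"].decode("utf-8")


def first_pairs(source_lang: str = "ne", split: str = "validation",
                n: int = 3) -> List[Tuple[str, str]]:
    """Download the config if needed and return the first n sentence pairs."""
    # Imported lazily so the helpers above work without TensorFlow installed.
    import tensorflow_datasets as tfds

    ds = tfds.load(flores_name(source_lang), split=split)
    # tfds.as_numpy yields dicts keyed by feature name ('en' plus 'ne' or 'si').
    return [decode_pair(ex, source_lang) for ex in tfds.as_numpy(ds.take(n))]
```

Calling first_pairs("si", split="test") would do the same for the Sinhala-English test split, assuming the dataset can be downloaded in the current environment.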

Statistics

Split        Examples
ALL          5,664
VALIDATION   2,898
TEST         2,766

Urls

Supervised keys (for as_supervised=True)

(u'si', u'en')
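
The sketch below illustrates what the supervised keys mean in practice: with as_supervised=True, each example is returned as an (input, target) tuple instead of a feature dictionary. It assumes the tuple listed above belongs to the sien_plain_text configuration, which this page does not state explicitly. The names SUPERVISED_KEYS, to_supervised and load_supervised are hypothetical helpers, not TFDS API.

```python
# Supervised keys as listed on this page: (input feature, target feature).
SUPERVISED_KEYS = ("si", "en")


def to_supervised(example, keys=SUPERVISED_KEYS):
    """Mimic as_supervised=True on a single feature dict: return (input, target)."""
    source_key, target_key = keys
    return example[source_key], example[target_key]


def load_supervised(split="test"):
    """Load the Sinhala-English config so that it yields (si, en) tuples."""
    # Imported lazily so to_supervised can be used without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load("flores/sien_plain_text", split=split, as_supervised=True)
```

Iterating over load_supervised() should then give pairs of scalar string tensors, Sinhala source first and English target second.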

Citation

@misc{guzmn2019new,
    title={Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English},
    author={Francisco Guzm{\'a}n and Peng-Jen Chen and Myle Ott and Juan Pino and Guillaume Lample and Philipp Koehn and Vishrav Chaudhary and Marc'Aurelio Ranzato},
    year={2019},
    eprint={1902.01382},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}