Ayuda a proteger la Gran Barrera de Coral con TensorFlow en Kaggle Únete Challenge

joya

  • Descripción:

GEM es un entorno de referencia para la Generación de Lenguaje Natural con un enfoque en su evaluación, tanto a través de anotaciones humanos y métricas automatizadas.

GEM tiene como objetivo: (1) medir el progreso de NLG en 13 conjuntos de datos que abarcan muchas tareas e idiomas de NLG. (2) proporcionar un análisis en profundidad de datos y modelos presentados a través de declaraciones de datos y conjuntos de desafíos. (3) desarrollar estándares para la evaluación del texto generado utilizando métricas tanto automatizadas como humanas.

Más información se puede encontrar en https://gem-benchmark.com .

gem / common_gen (configuración predeterminada)

  • Descripción config: CommonGen es una tarea de generación de texto restringido, asociado a un conjunto de datos de referencia, a las máquinas de prueba explícita para la capacidad de razonamiento de sentido común generativa. Dado un conjunto de conceptos comunes; la tarea es generar una oración coherente que describa un escenario cotidiano utilizando estos conceptos.

  • Tamaño del paquete: 1.84 MiB

  • Conjunto de datos de tamaño: 16.84 MiB

  • Auto-caché ( documentación ): Sí

  • Fraccionamientos:

Separar Ejemplos de
'challenge_test_scramble' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 1,497
'train' 67,389
'validation' 993
  • características:
FeaturesDict({
    'concept_set_id': tf.int32,
    'concepts': Sequence(tf.string),
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'target': tf.string,
})
  • Cita:
@inproceedings{lin2020commongen,
  title = "CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
  author = "Lin, Bill Yuchen  and
    Zhou, Wangchunshu  and
    Shen, Ming  and
    Zhou, Pei  and
    Bhagavatula, Chandra  and
    Choi, Yejin  and
    Ren, Xiang",
  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
  pages = "1823--1840",
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / cs_restaurants

  • Descripción config: La tarea está generando respuestas en el contexto de una (hipotética) sistema de diálogo que proporciona información sobre los restaurantes. La entrada es un tipo de acto de intención / diálogo básico y una lista de espacios (atributos) y sus valores. El resultado es una oración en lenguaje natural.

  • Tamaño del paquete: 1.46 MiB

  • Conjunto de datos de tamaño: 2.71 MiB

  • Auto-caché ( documentación ): Sí

  • Fraccionamientos:

Separar Ejemplos de
'challenge_test_scramble' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 842
'train' 3,569
'validation' 781
  • características:
FeaturesDict({
    'dialog_act': tf.string,
    'dialog_act_delexicalized': tf.string,
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'target': tf.string,
    'target_delexicalized': tf.string,
})
  • Cita:
@inproceedings{cs_restaurants,
  address = {Tokyo, Japan},
  title = {Neural {Generation} for {Czech}: {Data} and {Baselines} },
  shorttitle = {Neural {Generation} for {Czech} },
  url = {https://www.aclweb.org/anthology/W19-8670/},
  urldate = {2019-10-18},
  booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
  author = {Dušek, Ondřej and Jurčíček, Filip},
  month = oct,
  year = {2019},
  pages = {563--574}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gema / dardo

  • Descripción config: DART es un grande y abierto en el dominio estructurado datos de registro a la generación de corpus de texto con anotaciones de frases de alta calidad con ser cada entrada un conjunto de triples Entidad-Relación tras un árbol estructurado ontología.

  • Tamaño del paquete: 28.01 MiB

  • Conjunto de datos de tamaño: 33.78 MiB

  • Auto-caché ( documentación ): Sí

  • Fraccionamientos:

Separar Ejemplos de
'test' 6,959
'train' 62.659
'validation' 2,768
  • características:
FeaturesDict({
    'dart_id': tf.int32,
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'subtree_was_extended': tf.bool,
    'target': tf.string,
    'target_sources': Sequence(tf.string),
    'tripleset': Sequence(tf.string),
})
  • Cita:
@article{radev2020dart,
  title=Dart: Open-domain structured data record to text generation,
  author={Radev, Dragomir and Zhang, Rui and Rau, Amrit and Sivaprasad, Abhinand and Hsieh, Chiachun and Rajani, Nazneen Fatema and Tang, Xiangru and Vyas, Aadit and Verma, Neha and Krishna, Pranav and others},
  journal={arXiv preprint arXiv:2007.02871},
  year={2020}
}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / e2e_nlg

  • Descripción config: El conjunto de datos E2E está diseñado para una tarea de dominio limitado-de datos a texto - generación de descripciones / recomendaciones de restaurantes basa en hasta 8 atributos diferentes (nombre, superficie, nivel de precios, etc.)

  • Tamaño del paquete: 13.99 MiB

  • Conjunto de datos de tamaño: 16.92 MiB

  • Auto-caché ( documentación ): Sí

  • Fraccionamientos:

Separar Ejemplos de
'challenge_test_scramble' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 4.693
'train' 33,525
'validation' 4.299
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'meaning_representation': tf.string,
    'references': Sequence(tf.string),
    'target': tf.string,
})
  • Cita:
@inproceedings{e2e_cleaned,
  address = {Tokyo, Japan},
  title = {Semantic {Noise} {Matters} for {Neural} {Natural} {Language} {Generation} },
  url = {https://www.aclweb.org/anthology/W19-8652/},
  booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
  author = {Dušek, Ondřej and Howcroft, David M and Rieser, Verena},
  year = {2019},
  pages = {421--426},
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / mlsum_de

  • Descripción config: MLSum es un gran escala multilingüe resumen conjunto de datos. Está construido a partir de medios de comunicación en línea, esta división se centra en el alemán.

  • Tamaño del paquete: 345.98 MiB

  • Conjunto de datos de tamaño: 963.60 MiB

  • Auto-caché ( documentación ): No

  • Fraccionamientos:

Separar Ejemplos de
'challenge_test_covid' 5,058
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 10,695
'train' 220,748
'validation' 11,392
  • características:
FeaturesDict({
    'date': tf.string,
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'target': tf.string,
    'text': tf.string,
    'title': tf.string,
    'topic': tf.string,
    'url': tf.string,
})
  • Cita:
@inproceedings{scialom-etal-2020-mlsum,
    title = "{MLSUM}: The Multilingual Summarization Corpus",
    author = {Scialom, Thomas  and Dray, Paul-Alexis  and Lamprier, Sylvain  and Piwowarski, Benjamin  and Staiano, Jacopo},
    booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year = {2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / mlsum_es

  • Descripción config: MLSum es un gran escala multilingüe resumen conjunto de datos. Está construido a partir de medios de comunicación en línea, esta división se centra en el español.

  • Tamaño del paquete: 501.27 MiB

  • Tamaño Conjunto de datos: 1.29 GiB

  • Auto-caché ( documentación ): No

  • Fraccionamientos:

Separar Ejemplos de
'challenge_test_covid' 1.938
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 13,366
'train' 259,888
'validation' 9,977
  • características:
FeaturesDict({
    'date': tf.string,
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'target': tf.string,
    'text': tf.string,
    'title': tf.string,
    'topic': tf.string,
    'url': tf.string,
})
  • Cita:
@inproceedings{scialom-etal-2020-mlsum,
    title = "{MLSUM}: The Multilingual Summarization Corpus",
    author = {Scialom, Thomas  and Dray, Paul-Alexis  and Lamprier, Sylvain  and Piwowarski, Benjamin  and Staiano, Jacopo},
    booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year = {2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / schema_guided_dialog

  • Descripción config: Diálogo guiada-Esquema El (SGD) conjunto de datos contiene 18 quilates diálogos orientados a tareas multi-dominio entre un un asistente virtual, que cubre 17 dominios que van desde bancos y eventos para los medios de comunicación, el calendario, los viajes y el tiempo y humanos.

  • Tamaño del paquete: 17.00 MiB

  • Conjunto de datos de tamaño: 201.19 MiB

  • Auto-caché ( documentación ): Sí (challenge_test_backtranslation, challenge_test_bfp02, challenge_test_bfp05, challenge_test_nopunc, challenge_test_scramble, challenge_train_sample, challenge_validation_sample, prueba, validación), solamente cuando está shuffle_files=False (tren)

  • Fraccionamientos:

Separar Ejemplos de
'challenge_test_backtranslation' 500
'challenge_test_bfp02' 500
'challenge_test_bfp05' 500
'challenge_test_nopunc' 500
'challenge_test_scramble' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 10,000
'train' 164,982
'validation' 10,000
  • características:
FeaturesDict({
    'context': Sequence(tf.string),
    'dialog_acts': Sequence({
        'act': ClassLabel(shape=(), dtype=tf.int64, num_classes=18),
        'slot': tf.string,
        'values': Sequence(tf.string),
    }),
    'dialog_id': tf.string,
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'prompt': tf.string,
    'references': Sequence(tf.string),
    'service': tf.string,
    'target': tf.string,
    'turn_id': tf.int32,
})
  • Cita:
@article{rastogi2019towards,
  title={Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset},
  author={Rastogi, Abhinav and Zang, Xiaoxue and Sunkara, Srinivas and Gupta, Raghav and Khaitan, Pranav},
  journal={arXiv preprint arXiv:1909.05855},
  year={2019}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gema / totto

  • Descripción config: Totto es una tarea NLG-tabla a texto. La tarea es la siguiente: Dada una tabla de Wikipedia con nombres de filas, nombres de columnas y celdas de tabla, con un subconjunto de celdas resaltadas, genere una descripción en lenguaje natural para la parte resaltada de la tabla.

  • Tamaño del paquete: 180.75 MiB

  • Conjunto de datos de tamaño: 645.86 MiB

  • Auto-caché ( documentación ): No

  • Fraccionamientos:

Separar Ejemplos de
'challenge_test_scramble' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 7.700
'train' 121,153
'validation' 7.700
  • características:
FeaturesDict({
    'example_id': tf.string,
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'highlighted_cells': Sequence(Sequence(tf.int32)),
    'overlap_subset': tf.string,
    'references': Sequence(tf.string),
    'sentence_annotations': Sequence({
        'final_sentence': tf.string,
        'original_sentence': tf.string,
        'sentence_after_ambiguity': tf.string,
        'sentence_after_deletion': tf.string,
    }),
    'table': Sequence(Sequence({
        'column_span': tf.int32,
        'is_header': tf.bool,
        'row_span': tf.int32,
        'value': tf.string,
    })),
    'table_page_title': tf.string,
    'table_section_text': tf.string,
    'table_section_title': tf.string,
    'table_webpage_url': tf.string,
    'target': tf.string,
    'totto_id': tf.int32,
})
  • Cita:
@inproceedings{parikh2020totto,
  title=ToTTo: A Controlled Table-To-Text Generation Dataset,
  author={Parikh, Ankur and Wang, Xuezhi and Gehrmann, Sebastian and Faruqui, Manaal and Dhingra, Bhuwan and Yang, Diyi and Das, Dipanjan},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  pages={1173--1186},
  year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / web_nlg_en

  • Descripción config: WebNLG es un conjunto de datos bilingüe (Inglés, ruso) de DBpedia paralelos conjuntos triples y textos cortos que cubren aproximadamente 450 diferentes propiedades DBpedia. Los datos de WebNLG se crearon originalmente para promover el desarrollo de verbalizadores RDF capaces de generar texto corto y manejar la microplaneación.

  • Tamaño del paquete: 12.57 MiB

  • Conjunto de datos de tamaño: 19.91 MiB

  • Auto-caché ( documentación ): Sí

  • Fraccionamientos:

Separar Ejemplos de
'challenge_test_numbers' 500
'challenge_test_scramble' 500
'challenge_train_sample' 502
'challenge_validation_sample' 499
'test' 1,779
'train' 35,426
'validation' 1,667
  • características:
FeaturesDict({
    'category': tf.string,
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'input': Sequence(tf.string),
    'references': Sequence(tf.string),
    'target': tf.string,
    'webnlg_id': tf.string,
})
  • Cita:
@inproceedings{gardent2017creating,
  author = "Gardent, Claire
    and Shimorina, Anastasia
    and Narayan, Shashi
    and Perez-Beltrachini, Laura",
  title = "Creating Training Corpora for NLG Micro-Planners",
  booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  year = "2017",
  publisher = "Association for Computational Linguistics",
  pages = "179--188",
  location = "Vancouver, Canada",
  doi = "10.18653/v1/P17-1017",
  url = "http://www.aclweb.org/anthology/P17-1017"
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / web_nlg_ru

  • Descripción config: WebNLG es un conjunto de datos bilingüe (Inglés, ruso) de DBpedia paralelos conjuntos triples y textos cortos que cubren aproximadamente 450 diferentes propiedades DBpedia. Los datos de WebNLG se crearon originalmente para promover el desarrollo de verbalizadores RDF capaces de generar texto corto y manejar la microplaneación.

  • Tamaño del paquete: 7.49 MiB

  • Conjunto de datos de tamaño: 11.30 MiB

  • Auto-caché ( documentación ): Sí

  • Fraccionamientos:

Separar Ejemplos de
'challenge_test_scramble' 500
'challenge_train_sample' 501
'challenge_validation_sample' 500
'test' 1.102
'train' 14,630
'validation' 790
  • características:
FeaturesDict({
    'category': tf.string,
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'input': Sequence(tf.string),
    'references': Sequence(tf.string),
    'target': tf.string,
    'webnlg_id': tf.string,
})
  • Cita:
@inproceedings{gardent2017creating,
  author = "Gardent, Claire
    and Shimorina, Anastasia
    and Narayan, Shashi
    and Perez-Beltrachini, Laura",
  title = "Creating Training Corpora for NLG Micro-Planners",
  booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  year = "2017",
  publisher = "Association for Computational Linguistics",
  pages = "179--188",
  location = "Vancouver, Canada",
  doi = "10.18653/v1/P17-1017",
  url = "http://www.aclweb.org/anthology/P17-1017"
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_auto_asset_turk

  • Descripción config: WikiAuto proporciona un conjunto de frases de Inglés alineados Wikipedia y simple Inglés Wikipedia como un recurso para capacitar a los sistemas de simplificación de oraciones. ASSET y TURK son conjuntos de datos de simplificación de alta calidad que se utilizan para realizar pruebas.

  • Tamaño del paquete: 121.01 MiB

  • Conjunto de datos de tamaño: 202.40 MiB

  • Auto-caché ( documentación ): Sí (challenge_test_asset_backtranslation, challenge_test_asset_bfp02, challenge_test_asset_bfp05, challenge_test_asset_nopunc, challenge_test_turk_backtranslation, challenge_test_turk_bfp02, challenge_test_turk_bfp05, challenge_test_turk_nopunc, challenge_train_sample, challenge_validation_sample, test_asset, test_turk, validación), solamente cuando está shuffle_files=False (tren)

  • Fraccionamientos:

Separar Ejemplos de
'challenge_test_asset_backtranslation' 359
'challenge_test_asset_bfp02' 359
'challenge_test_asset_bfp05' 359
'challenge_test_asset_nopunc' 359
'challenge_test_turk_backtranslation' 359
'challenge_test_turk_bfp02' 359
'challenge_test_turk_bfp05' 359
'challenge_test_turk_nopunc' 359
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test_asset' 359
'test_turk' 359
'train' 483,801
'validation' 20.000
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'target': tf.string,
})
  • Cita:
@inproceedings{jiang-etal-2020-neural,
    title = "Neural {CRF} Model for Sentence Alignment in Text Simplification",
    author = "Jiang, Chao  and
      Maddela, Mounica  and
      Lan, Wuwei  and
      Zhong, Yang  and
      Xu, Wei",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.709",
    doi = "10.18653/v1/2020.acl-main.709",
    pages = "7943--7960",
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gema / xsum

  • Descripción config: El conjunto de datos es para la tarea de resumen abstractiva en su forma extrema, es acerca de un documento que resume en una sola frase.

  • Tamaño del paquete: 246.31 MiB

  • Conjunto de datos de tamaño: 78.89 MiB

  • Auto-caché ( documentación ): Sí

  • Fraccionamientos:

Separar Ejemplos de
'challenge_test_backtranslation' 500
'challenge_test_bfp_02' 500
'challenge_test_bfp_05' 500
'challenge_test_covid' 401
'challenge_test_nopunc' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 1,166
'train' 23,206
'validation' 1,117
  • características:
FeaturesDict({
    'document': tf.string,
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'target': tf.string,
    'xsum_id': tf.string,
})
  • Cita:
@inproceedings{Narayan2018dont,
  author = "Shashi Narayan and Shay B. Cohen and Mirella Lapata",
  title = "Don't Give Me the Details, Just the Summary! {T}opic-Aware Convolutional Neural Networks for Extreme Summarization",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing ",
  year = "2018",
  address = "Brussels, Belgium",
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_arabic_ar

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 56.25 MiB

  • Conjunto de datos de tamaño: 291.42 MiB

  • Auto-caché ( documentación ): No

  • Fraccionamientos:

Separar Ejemplos de
'test' 5.841
'train' 20,441
'validation' 2,919
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'ar': Text(shape=(), dtype=tf.string),
        'en': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'ar': Text(shape=(), dtype=tf.string),
        'en': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_chinese_zh

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 31.38 MiB

  • Conjunto de datos de tamaño: 122.06 MiB

  • Auto-caché ( documentación ): Sí

  • Fraccionamientos:

Separar Ejemplos de
'test' 3.775
'train' 13,211
'validation' 1,886
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'zh': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'zh': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_czech_cs

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 13.84 MiB

  • Conjunto de datos de tamaño: 58.05 MiB

  • Auto-caché ( documentación ): Sí

  • Fraccionamientos:

Separar Ejemplos de
'test' 1.438
'train' 5,033
'validation' 718
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'cs': Text(shape=(), dtype=tf.string),
        'en': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'cs': Text(shape=(), dtype=tf.string),
        'en': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_dutch_nl

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 53.88 MiB

  • Conjunto de datos de tamaño: 237.97 MiB

  • Auto-caché ( documentación ): Sí (ensayo, validación), solamente cuando está shuffle_files=False (tren)

  • Fraccionamientos:

Separar Ejemplos de
'test' 6.248
'train' 21,866
'validation' 3.123
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'nl': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'nl': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_english_en

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 112.56 MiB

  • Conjunto de datos de tamaño: 657.51 MiB

  • Auto-caché ( documentación ): No

  • Fraccionamientos:

Separar Ejemplos de
'test' 28,614
'train' 99.020
'validation' 13,823
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_french_fr

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 113.26 MiB

  • Conjunto de datos de tamaño: 522.28 MiB

  • Auto-caché ( documentación ): No

  • Fraccionamientos:

Separar Ejemplos de
'test' 12,731
'train' 44,556
'validation' 6.364
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'fr': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'fr': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_german_de

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 102.65 MiB

  • Conjunto de datos de tamaño: 452.46 MiB

  • Auto-caché ( documentación ): No

  • Fraccionamientos:

Separar Ejemplos de
'test' 11,669
'train' 40,839
'validation' 5.833
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'de': Text(shape=(), dtype=tf.string),
        'en': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'de': Text(shape=(), dtype=tf.string),
        'en': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_hindi_hi

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 20.07 MiB

  • Conjunto de datos de tamaño: 138.06 MiB

  • Auto-caché ( documentación ): Sí

  • Fraccionamientos:

Separar Ejemplos de
'test' 1.984
'train' 6,942
'validation' 991
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'hi': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'hi': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_indonesian_id

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 80.08 MiB

  • Conjunto de datos de tamaño: 370.63 MiB

  • Auto-caché ( documentación ): No

  • Fraccionamientos:

Separar Ejemplos de
'test' 9.497
'train' 33,237
'validation' 4.747
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'id': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'id': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_italian_it

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 84.80 MiB

  • Conjunto de datos de tamaño: 374.40 MiB

  • Auto-caché ( documentación ): No

  • Fraccionamientos:

Separar Ejemplos de
'test' 10.189
'train' 35,661
'validation' 5.093
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'it': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'it': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_japanese_ja

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 21.75 MiB

  • Conjunto de datos de tamaño: 103.19 MiB

  • Auto-caché ( documentación ): Sí

  • Fraccionamientos:

Separar Ejemplos de
'test' 2.530
'train' 8.853
'validation' 1,264
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'ja': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'ja': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_korean_ko

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 22.26 MiB

  • Conjunto de datos de tamaño: 102.35 MiB

  • Auto-caché ( documentación ): Sí

  • Fraccionamientos:

Separar Ejemplos de
'test' 2,436
'train' 8.524
'validation' 1.216
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'ko': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'ko': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_portuguese_pt

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 131.17 MiB

  • Conjunto de datos de tamaño: 570.46 MiB

  • Auto-caché ( documentación ): No

  • Fraccionamientos:

Separar Ejemplos de
'test' 16,331
'train' 57,159
'validation' 8.165
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'pt': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'pt': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_russian_ru

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 101.36 MiB

  • Conjunto de datos de tamaño: 564.69 MiB

  • Auto-caché ( documentación ): No

  • Fraccionamientos:

Separar Ejemplos de
'test' 10,580
'train' 37.028
'validation' 5.288
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'ru': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'ru': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_spanish_es

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 189.06 MiB

  • Conjunto de datos de tamaño: 849.75 MiB

  • Auto-caché ( documentación ): No

  • Fraccionamientos:

Separar Ejemplos de
'test' 22,632
'train' 79,212
'validation' 11,316
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'es': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'es': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_thai_th

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 28.60 MiB

  • Conjunto de datos de tamaño: 193.77 MiB

  • Auto-caché ( documentación ): Sí (ensayo, validación), solamente cuando está shuffle_files=False (tren)

  • Fraccionamientos:

Separar Ejemplos de
'test' 2,950
'train' 10,325
'validation' 1,475
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'th': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'th': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_turkish_tr

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 6.73 MiB

  • Conjunto de datos de tamaño: 30.75 MiB

  • Auto-caché ( documentación ): Sí

  • Fraccionamientos:

Separar Ejemplos de
'test' 900
'train' 3,148
'validation' 449
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'tr': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'tr': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."

gem / wiki_lingua_vietnamese_vi

  • Descripción config: Wikilingua es un gran escala, conjunto de datos multilingüe para la evaluación de los sistemas de resumen abstractiva cruzada lingual ..

  • Tamaño del paquete: 36.27 MiB

  • Conjunto de datos de tamaño: 179.77 MiB

  • Auto-caché ( documentación ): Sí

  • Fraccionamientos:

Separar Ejemplos de
'test' 3.917
'train' 13,707
'validation' 1.957
  • características:
FeaturesDict({
    'gem_id': tf.string,
    'gem_parent_id': tf.string,
    'references': Sequence(tf.string),
    'source': tf.string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'vi': Text(shape=(), dtype=tf.string),
    }),
    'target': tf.string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=tf.string),
        'vi': Text(shape=(), dtype=tf.string),
    }),
})
  • Cita:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."