gem

 • Description:

GEM is a benchmark environment for Natural Language Generation with a focus on its Evaluation, both through human annotations and automated Metrics.

GEM aims to: (1) measure NLG progress across 13 datasets spanning many NLG tasks and languages; (2) provide an in-depth analysis of data and models, presented via data statements and challenge sets; (3) develop standards for the evaluation of generated text using both automated and human metrics.

More information can be found at https://gem-benchmark.com.

gem/common_gen (default config)

 • Config description: CommonGen is a constrained text generation task, associated with a benchmark dataset, that explicitly tests machines for the ability of generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts.

 • Download size: 1.84 MiB

 • Dataset size: 16.84 MiB

 • Auto-cached (documentation): Yes

 • Splits:

Split Examples
'challenge_test_scramble' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 1,497
'train' 67,389
'validation' 993
 • Features:
FeaturesDict({
  'concept_set_id': tf.int32,
  'concepts': Sequence(tf.string),
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'target': tf.string,
})
 • Citation:
@inproceedings{lin2020commongen,
 title = "CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
 author = "Lin, Bill Yuchen and
  Zhou, Wangchunshu and
  Shen, Ming and
  Zhou, Pei and
  Bhagavatula, Chandra and
  Choi, Yejin and
  Ren, Xiang",
 booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
 month = nov,
 year = "2020",
 address = "Online",
 publisher = "Association for Computational Linguistics",
 url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
 pages = "1823--1840",
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
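
As a rough illustration of the gem/common_gen task described above, the sketch below reports which input concepts are missing from a sentence. The record is hypothetical (its values are invented to match the FeaturesDict), `missing_concepts` is a helper introduced here rather than part of GEM or TFDS, and the substring match is a deliberately naive assumption that misses inflected forms such as "caught" for "catch". The `load_common_gen` wrapper assumes `tensorflow_datasets` is installed and downloads the data on first call.

```python
def load_common_gen(split="validation"):
    """Load gem/common_gen through TFDS (assumes tensorflow_datasets is installed)."""
    import tensorflow_datasets as tfds  # imported lazily so the helper below works without it

    return tfds.load("gem/common_gen", split=split)


def missing_concepts(concepts, sentence):
    """Return the concepts that do not occur in the sentence (naive, case-insensitive substring match)."""
    lowered = sentence.lower()
    return [concept for concept in concepts if concept.lower() not in lowered]


# Hypothetical record shaped like the FeaturesDict above; the values are invented.
example = {
    "concept_set_id": 0,
    "concepts": ["dog", "frisbee", "catch"],
    "target": "A dog jumps to catch a frisbee in the park.",
    "references": ["A dog jumps to catch a frisbee in the park."],
}

print(missing_concepts(example["concepts"], example["target"]))  # []
```

A stricter check would lemmatize both sides before comparing; that is left out to keep the sketch dependency-free.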

gem/cs_restaurants

 • Config description: The task is to generate responses in the context of a (hypothetical) dialogue system that provides information about restaurants. The input is a basic intent/dialogue act type and a list of slots (attributes) and their values. The output is a natural language sentence.

 • Download size: 1.46 MiB

 • Dataset size: 2.71 MiB

 • Auto-cached (documentation): Yes

 • Splits:

Split Examples
'challenge_test_scramble' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 842
'train' 3,569
'validation' 781
 • Features:
FeaturesDict({
  'dialog_act': tf.string,
  'dialog_act_delexicalized': tf.string,
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'target': tf.string,
  'target_delexicalized': tf.string,
})
 • Citation:
@inproceedings{cs_restaurants,
 address = {Tokyo, Japan},
 title = {Neural {Generation} for {Czech}: {Data} and {Baselines}},
 shorttitle = {Neural {Generation} for {Czech}},
 url = {https://www.aclweb.org/anthology/W19-8670/},
 urldate = {2019-10-18},
 booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
 author = {Dušek, Ondřej and Jurčíček, Filip},
 month = oct,
 year = {2019},
 pages = {563--574}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/dart

 • Config description: DART is a large, open-domain structured DAta Record to Text generation corpus with high-quality sentence annotations, where each input is a set of entity-relation triples following a tree-structured ontology.

 • Download size: 28.01 MiB

 • Dataset size: 33.78 MiB

 • Auto-cached (documentation): Yes

 • Splits:

Split Examples
'test' 6,959
'train' 62,659
'validation' 2,768
 • Features:
FeaturesDict({
  'dart_id': tf.int32,
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'subtree_was_extended': tf.bool,
  'target': tf.string,
  'target_sources': Sequence(tf.string),
  'tripleset': Sequence(tf.string),
})
 • Citation:
@article{radev2020dart,
 title={Dart: Open-domain structured data record to text generation},
 author={Radev, Dragomir and Zhang, Rui and Rau, Amrit and Sivaprasad, Abhinand and Hsieh, Chiachun and Rajani, Nazneen Fatema and Tang, Xiangru and Vyas, Aadit and Verma, Neha and Krishna, Pranav and others},
 journal={arXiv preprint arXiv:2007.02871},
 year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/e2e_nlg

 • Config description: The E2E dataset is designed for a limited-domain data-to-text task: generating restaurant descriptions/recommendations based on up to 8 different attributes (name, area, price range, etc.).

 • Download size: 13.99 MiB

 • Dataset size: 16.92 MiB

 • Auto-cached (documentation): Yes

 • Splits:

Split Examples
'challenge_test_scramble' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 4,693
'train' 33,525
'validation' 4,299
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'meaning_representation': tf.string,
  'references': Sequence(tf.string),
  'target': tf.string,
})
 • Citation:
@inproceedings{e2e_cleaned,
 address = {Tokyo, Japan},
 title = {Semantic {Noise} {Matters} for {Neural} {Natural} {Language} {Generation}},
 url = {https://www.aclweb.org/anthology/W19-8652/},
 booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
 author = {Dušek, Ondřej and Howcroft, David M and Rieser, Verena},
 year = {2019},
 pages = {421--426},
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/mlsum_de

 • Config description: MLSum is a large-scale multilingual summarization dataset. It is built from online news outlets; this split focuses on German.

 • Download size: 345.98 MiB

 • Dataset size: 963.60 MiB

 • Auto-cached (documentation): No

 • Splits:

Split Examples
'challenge_test_covid' 5,058
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 10,695
'train' 220,748
'validation' 11,392
 • Features:
FeaturesDict({
  'date': tf.string,
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'target': tf.string,
  'text': tf.string,
  'title': tf.string,
  'topic': tf.string,
  'url': tf.string,
})
 • Citation:
@inproceedings{scialom-etal-2020-mlsum,
  title = "{MLSUM}: The Multilingual Summarization Corpus",
  author = {Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year = {2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
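
Because gem/mlsum_de is large and not auto-cached, it can help to start from the 500-example challenge_train_sample split or from a small slice of train. The sketch below builds a TFDS split string using the percent-slicing syntax (for example `train[:1%]`). The names `percent_slice` and `load_mlsum_de` are invented here, and the loader assumes `tensorflow_datasets` is installed. TFDS generally still downloads and prepares the whole config before a slice is read, so slicing reduces what is iterated over, not what is stored on disk.

```python
def percent_slice(split, percent):
    """Build a TFDS split string selecting the first `percent` percent of a split."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return f"{split}[:{percent}%]"


def load_mlsum_de(split_spec):
    """Load part of gem/mlsum_de (assumes tensorflow_datasets is installed)."""
    import tensorflow_datasets as tfds  # imported lazily; triggers a download on first use

    return tfds.load("gem/mlsum_de", split=split_spec)


print(percent_slice("train", 1))  # train[:1%]
```

A call such as `load_mlsum_de(percent_slice("train", 1))` would then read roughly the first percent of the training split.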

gem/mlsum_es

 • Config description: MLSum is a large-scale multilingual summarization dataset. It is built from online news outlets; this split focuses on Spanish.

 • Download size: 501.27 MiB

 • Dataset size: 1.29 GiB

 • Auto-cached (documentation): No

 • Splits:

Split Examples
'challenge_test_covid' 1,938
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 13,366
'train' 259,888
'validation' 9,977
 • Features:
FeaturesDict({
  'date': tf.string,
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'target': tf.string,
  'text': tf.string,
  'title': tf.string,
  'topic': tf.string,
  'url': tf.string,
})
 • Citation:
@inproceedings{scialom-etal-2020-mlsum,
  title = "{MLSUM}: The Multilingual Summarization Corpus",
  author = {Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year = {2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/schema_guided_dialog

 • Config description: The Schema-Guided Dialogue (SGD) dataset contains 18K multi-domain, task-oriented dialogues between a human and a virtual assistant, covering 17 domains ranging from banks and events to media, calendar, travel, and weather.

 • Download size: 17.00 MiB

 • Dataset size: 201.19 MiB

 • Auto-cached (documentation): Yes (challenge_test_backtranslation, challenge_test_bfp02, challenge_test_bfp05, challenge_test_nopunc, challenge_test_scramble, challenge_train_sample, challenge_validation_sample, test, validation), Only when shuffle_files=False (train)

 • Splits:

Split Examples
'challenge_test_backtranslation' 500
'challenge_test_bfp02' 500
'challenge_test_bfp05' 500
'challenge_test_nopunc' 500
'challenge_test_scramble' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 10,000
'train' 164,982
'validation' 10,000
 • Features:
FeaturesDict({
  'context': Sequence(tf.string),
  'dialog_acts': Sequence({
    'act': ClassLabel(shape=(), dtype=tf.int64, num_classes=18),
    'slot': tf.string,
    'values': Sequence(tf.string),
  }),
  'dialog_id': tf.string,
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'prompt': tf.string,
  'references': Sequence(tf.string),
  'service': tf.string,
  'target': tf.string,
  'turn_id': tf.int32,
})
 • Citation:
@article{rastogi2019towards,
 title={Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset},
 author={Rastogi, Abhinav and Zang, Xiaoxue and Sunkara, Srinivas and Gupta, Raghav and Khaitan, Pranav},
 journal={arXiv preprint arXiv:1909.05855},
 year={2019}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
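
TFDS typically returns a Sequence of dictionaries, such as dialog_acts in gem/schema_guided_dialog, as a dictionary of parallel lists rather than as a list of dictionaries. The sketch below regroups such a structure into one dictionary per dialogue act. `regroup_sequence` is a helper introduced here, not a TFDS function, and the sample values (including the integer act ids) are invented for illustration.

```python
def regroup_sequence(columns):
    """Convert {'act': [a0, a1], 'slot': [s0, s1]} into [{'act': a0, 'slot': s0}, ...]."""
    keys = list(columns)
    lengths = {len(columns[key]) for key in keys}
    if len(lengths) > 1:
        raise ValueError(f"fields have different lengths: {sorted(lengths)}")
    # zip(*...) walks the parallel lists in lockstep, one row per sequence element.
    return [dict(zip(keys, row)) for row in zip(*(columns[key] for key in keys))]


# Hypothetical dialog_acts value in the dict-of-lists layout; all values are invented.
dialog_acts = {
    "act": [3, 7],
    "slot": ["city", "date"],
    "values": [["Paris"], ["tomorrow"]],
}

for act in regroup_sequence(dialog_acts):
    print(act)
```

When working with tensors rather than Python lists, the same regrouping would need a conversion step first (for example via NumPy arrays), which is omitted here.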

gem/totto

 • Config description: ToTTo is a Table-to-Text NLG task. The task is as follows: given a Wikipedia table with row names, column names and table cells, with a subset of cells highlighted, generate a natural language description for the highlighted part of the table.

 • Download size: 180.75 MiB

 • Dataset size: 645.86 MiB

 • Auto-cached (documentation): No

 • Splits:

Split Examples
'challenge_test_scramble' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 7,700
'train' 121,153
'validation' 7,700
 • Features:
FeaturesDict({
  'example_id': tf.string,
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'highlighted_cells': Sequence(Sequence(tf.int32)),
  'overlap_subset': tf.string,
  'references': Sequence(tf.string),
  'sentence_annotations': Sequence({
    'final_sentence': tf.string,
    'original_sentence': tf.string,
    'sentence_after_ambiguity': tf.string,
    'sentence_after_deletion': tf.string,
  }),
  'table': Sequence(Sequence({
    'column_span': tf.int32,
    'is_header': tf.bool,
    'row_span': tf.int32,
    'value': tf.string,
  })),
  'table_page_title': tf.string,
  'table_section_text': tf.string,
  'table_section_title': tf.string,
  'table_webpage_url': tf.string,
  'target': tf.string,
  'totto_id': tf.int32,
})
 • Citation:
@inproceedings{parikh2020totto,
 title={ToTTo: A Controlled Table-To-Text Generation Dataset},
 author={Parikh, Ankur and Wang, Xuezhi and Gehrmann, Sebastian and Faruqui, Manaal and Dhingra, Bhuwan and Yang, Diyi and Das, Dipanjan},
 booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
 pages={1173--1186},
 year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
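
To make the relation between table and highlighted_cells in gem/totto concrete, the sketch below looks up the highlighted cell values in a toy table. It assumes that each highlighted_cells entry is a [row, column] index pair into table, following ToTTo's published format. The table contents and the `highlighted_values` helper are invented for illustration, and the nested list-of-dicts layout is a plain-Python stand-in for whatever structure TFDS actually returns.

```python
def highlighted_values(table, highlighted_cells):
    """Return the 'value' of every highlighted cell, in the order the cells are listed."""
    return [table[row][col]["value"] for row, col in highlighted_cells]


# Toy table: a list of rows, each row a list of cell dicts (values invented).
table = [
    [{"value": "Year", "is_header": True}, {"value": "Title", "is_header": True}],
    [{"value": "2001", "is_header": False}, {"value": "First Film", "is_header": False}],
    [{"value": "2003", "is_header": False}, {"value": "Second Film", "is_header": False}],
]

print(highlighted_values(table, [[1, 0], [1, 1]]))  # ['2001', 'First Film']
```

A generation model would receive these highlighted values together with the page and section titles and be expected to produce a sentence like the target field.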

gem/web_nlg_en

 • Config description: WebNLG is a bilingual dataset (English, Russian) of parallel DBpedia triple sets and short texts that cover about 450 different DBpedia properties. The WebNLG data was originally created to promote the development of RDF verbalisers able to generate short text and to handle micro-planning.

 • Download size: 12.57 MiB

 • Dataset size: 19.91 MiB

 • Auto-cached (documentation): Yes

 • Splits:

Split Examples
'challenge_test_numbers' 500
'challenge_test_scramble' 500
'challenge_train_sample' 502
'challenge_validation_sample' 499
'test' 1,779
'train' 35,426
'validation' 1,667
 • Features:
FeaturesDict({
  'category': tf.string,
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'input': Sequence(tf.string),
  'references': Sequence(tf.string),
  'target': tf.string,
  'webnlg_id': tf.string,
})
 • Citation:
@inproceedings{gardent2017creating,
 author = "Gardent, Claire
  and Shimorina, Anastasia
  and Narayan, Shashi
  and Perez-Beltrachini, Laura",
 title = "Creating Training Corpora for NLG Micro-Planners",
 booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
 year = "2017",
 publisher = "Association for Computational Linguistics",
 pages = "179--188",
 location = "Vancouver, Canada",
 doi = "10.18653/v1/P17-1017",
 url = "http://www.aclweb.org/anthology/P17-1017"
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/web_nlg_ru

 • Config description: WebNLG is a bilingual dataset (English, Russian) of parallel DBpedia triple sets and short texts that cover about 450 different DBpedia properties. The WebNLG data was originally created to promote the development of RDF verbalisers able to generate short text and to handle micro-planning.

 • Download size: 7.49 MiB

 • Dataset size: 11.30 MiB

 • Auto-cached (documentation): Yes

 • Splits:

Split Examples
'challenge_test_scramble' 500
'challenge_train_sample' 501
'challenge_validation_sample' 500
'test' 1,102
'train' 14,630
'validation' 790
 • Features:
FeaturesDict({
  'category': tf.string,
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'input': Sequence(tf.string),
  'references': Sequence(tf.string),
  'target': tf.string,
  'webnlg_id': tf.string,
})
 • Citation:
@inproceedings{gardent2017creating,
 author = "Gardent, Claire
  and Shimorina, Anastasia
  and Narayan, Shashi
  and Perez-Beltrachini, Laura",
 title = "Creating Training Corpora for NLG Micro-Planners",
 booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
 year = "2017",
 publisher = "Association for Computational Linguistics",
 pages = "179--188",
 location = "Vancouver, Canada",
 doi = "10.18653/v1/P17-1017",
 url = "http://www.aclweb.org/anthology/P17-1017"
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_auto_asset_turk

 • Config description: WikiAuto provides a set of aligned sentences from English Wikipedia and Simple English Wikipedia as a resource to train sentence simplification systems. ASSET and TURK are high-quality simplification datasets used for testing.

 • Download size: 121.01 MiB

 • Dataset size: 202.40 MiB

 • Auto-cached (documentation): Yes (challenge_test_asset_backtranslation, challenge_test_asset_bfp02, challenge_test_asset_bfp05, challenge_test_asset_nopunc, challenge_test_turk_backtranslation, challenge_test_turk_bfp02, challenge_test_turk_bfp05, challenge_test_turk_nopunc, challenge_train_sample, challenge_validation_sample, test_asset, test_turk, validation), Only when shuffle_files=False (train)

 • Splits:

Split Examples
'challenge_test_asset_backtranslation' 359
'challenge_test_asset_bfp02' 359
'challenge_test_asset_bfp05' 359
'challenge_test_asset_nopunc' 359
'challenge_test_turk_backtranslation' 359
'challenge_test_turk_bfp02' 359
'challenge_test_turk_bfp05' 359
'challenge_test_turk_nopunc' 359
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test_asset' 359
'test_turk' 359
'train' 483,801
'validation' 20,000
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'target': tf.string,
})
 • Citation:
@inproceedings{jiang-etal-2020-neural,
  title = "Neural {CRF} Model for Sentence Alignment in Text Simplification",
  author = "Jiang, Chao and
   Maddela, Mounica and
   Lan, Wuwei and
   Zhong, Yang and
   Xu, Wei",
  booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
  month = jul,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/2020.acl-main.709",
  doi = "10.18653/v1/2020.acl-main.709",
  pages = "7943--7960",
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
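
The auto-cached entry for gem/wiki_auto_asset_turk says the train split is cached only when shuffle_files=False. A minimal sketch of how a load call might pass that flag is shown here. It assumes the tensorflow_datasets package is installed; build_load_kwargs and load_split are hypothetical helper names introduced for illustration, not part of the catalog.

```python
from typing import Any, Dict

# Config name as listed in this section of the catalog.
CONFIG_NAME = "gem/wiki_auto_asset_turk"


def build_load_kwargs(split: str, shuffle_files: bool = False) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for tfds.load.

    shuffle_files defaults to False, the setting under which the
    catalog reports the train split as auto-cached.
    """
    return {"name": CONFIG_NAME, "split": split, "shuffle_files": shuffle_files}


def load_split(split: str):
    """Load one split. Downloads data on first use."""
    # Assumption: tensorflow_datasets is installed; imported lazily so the
    # helper above can be used without it.
    import tensorflow_datasets as tfds

    return tfds.load(**build_load_kwargs(split))


kwargs = build_load_kwargs("train")
print(kwargs)
```

Calling load_split("test_asset") would then return a tf.data.Dataset whose elements follow the features listed above.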

gem/xsum

 • Config description: This dataset is for the task of abstractive summarization in its extreme form: summarizing a document in a single sentence.

 • Download size: 246.31 MiB

 • Dataset size: 78.89 MiB

 • Auto-cached (documentation): Yes

 • Splits:

Split Examples
'challenge_test_backtranslation' 500
'challenge_test_bfp_02' 500
'challenge_test_bfp_05' 500
'challenge_test_covid' 401
'challenge_test_nopunc' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 1,166
'train' 23,206
'validation' 1,117
 • Features:
FeaturesDict({
  'document': tf.string,
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'target': tf.string,
  'xsum_id': tf.string,
})
 • Citation:
@inproceedings{Narayan2018dont,
 author = "Shashi Narayan and Shay B. Cohen and Mirella Lapata",
 title = "Don't Give Me the Details, Just the Summary! {T}opic-Aware Convolutional Neural Networks for Extreme Summarization",
 booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
 year = "2018",
 address = "Brussels, Belgium",
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_arabic_ar

 • Config description: Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 56.25 MiB

 • Dataset size: 291.42 MiB

 • Auto-cached (documentation): No

 • Splits:

Split Examples
'test' 5,841
'train' 20,441
'validation' 2,919
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'ar': Text(shape=(), dtype=tf.string),
    'en': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'ar': Text(shape=(), dtype=tf.string),
    'en': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
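
The wiki_lingua features above nest per-language text under source_aligned and target_aligned. The sketch below shows one way such a record could be modeled and read in plain Python. It assumes each Translation feature decodes to a mapping from language code to text; the WikiLinguaExample type, the aligned_pair helper, and all field values are hypothetical placeholders, not real dataset content.

```python
from typing import Dict, List, Tuple, TypedDict


class WikiLinguaExample(TypedDict):
    """Mirror of the FeaturesDict listed for gem/wiki_lingua_arabic_ar."""

    gem_id: str
    gem_parent_id: str
    references: List[str]
    source: str
    source_aligned: Dict[str, str]  # language code -> source text
    target: str
    target_aligned: Dict[str, str]  # language code -> summary text


def aligned_pair(example: WikiLinguaExample, lang: str) -> Tuple[str, str]:
    """Return the (source, target) texts stored for one language code."""
    return example["source_aligned"][lang], example["target_aligned"][lang]


# Placeholder record: every value here is invented for illustration.
example: WikiLinguaExample = {
    "gem_id": "<gem id>",
    "gem_parent_id": "<gem parent id>",
    "references": ["<reference summary>"],
    "source": "<Arabic source text>",
    "source_aligned": {"ar": "<Arabic source text>", "en": "<English source text>"},
    "target": "<Arabic summary>",
    "target_aligned": {"ar": "<Arabic summary>", "en": "<English summary>"},
}

print(aligned_pair(example, "en"))
```

The same access pattern would apply to the other wiki_lingua configs by swapping the language code, for example "zh" or "cs".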

gem/wiki_lingua_chinese_zh

 • Config description: Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 31.38 MiB

 • Dataset size: 122.06 MiB

 • Auto-cached (documentation): Yes

 • Splits:

Split Examples
'test' 3,775
'train' 13,211
'validation' 1,886
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'zh': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'zh': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_czech_cs

 • Config description: Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 13.84 MiB

 • Dataset size: 58.05 MiB

 • Auto-cached (documentation): Yes

 • Splits:

Split Examples
'test' 1,438
'train' 5,033
'validation' 718
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'cs': Text(shape=(), dtype=tf.string),
    'en': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'cs': Text(shape=(), dtype=tf.string),
    'en': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
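
The split counts listed for the Arabic, Chinese, and Czech wiki_lingua configs can be compared as fractions of each config's total. The sketch below does that arithmetic with the counts copied from the tables above; split_fractions is a hypothetical helper name. For these three configs the result comes out close to a 70/10/20 train/validation/test division, which is an observation from the numbers rather than a documented design rule.

```python
from typing import Dict

# Example counts copied from the split tables of three configs above.
SPLIT_COUNTS: Dict[str, Dict[str, int]] = {
    "wiki_lingua_arabic_ar": {"train": 20441, "validation": 2919, "test": 5841},
    "wiki_lingua_chinese_zh": {"train": 13211, "validation": 1886, "test": 3775},
    "wiki_lingua_czech_cs": {"train": 5033, "validation": 718, "test": 1438},
}


def split_fractions(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the config's total example count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


for config, counts in SPLIT_COUNTS.items():
    fractions = split_fractions(counts)
    rounded = {name: round(value, 3) for name, value in fractions.items()}
    print(config, sum(counts.values()), rounded)
```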

gem/wiki_lingua_dutch_nl

 • Config description: Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 53.88 MiB

 • Dataset size: 237.97 MiB

 • Auto-cached (documentation): Yes (test, validation), Only when shuffle_files=False (train)

 • Splits:

Split Examples
'test' 6,248
'train' 21,866
'validation' 3,123
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'nl': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'nl': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_english_en

 • Config description: Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 112.56 MiB

 • Dataset size: 657.51 MiB

 • Auto-cached (documentation): No

 • Splits:

Split Examples
'test' 28,614
'train' 99,020
'validation' 13,823
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_french_fr

 • Config description: Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 113.26 MiB

 • Dataset size: 522.28 MiB

 • Auto-cached (documentation): No

 • Splits:

Split Examples
'test' 12,731
'train' 44,556
'validation' 6,364
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'fr': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'fr': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_german_de

 • Config description: Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 102.65 MiB

 • Dataset size: 452.46 MiB

 • Auto-cached (documentation): No

 • Splits:

Split Examples
'test' 11,669
'train' 40,839
'validation' 5,833
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'de': Text(shape=(), dtype=tf.string),
    'en': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'de': Text(shape=(), dtype=tf.string),
    'en': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_hindi_hi

 • Config description: Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 20.07 MiB

 • Dataset size: 138.06 MiB

 • Auto-cached (documentation): Yes

 • Splits:

Split Examples
'test' 1,984
'train' 6,942
'validation' 991
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'hi': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'hi': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_indonesian_id

 • Config description: Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 80.08 MiB

 • Dataset size: 370.63 MiB

 • Auto-cached (documentation): No

 • Splits:

Split Examples
'test' 9,497
'train' 33,237
'validation' 4,747
 • מאפיינים:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'id': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'id': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source for
the correct citation of each contained dataset.
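
As a minimal sketch of how the WikiLingua configs listed here could be loaded, the snippet below builds the config name from a language and passes it to `tfds.load`. The helper names `wiki_lingua_config` and `load_wiki_lingua` and the `WIKI_LINGUA_LANGUAGES` table are illustrative assumptions, not part of the catalog; the table only covers the configs shown in this part of the page.

```python
# Language name -> language code, taken from the config names in this catalog.
WIKI_LINGUA_LANGUAGES = {
    "indonesian": "id",
    "italian": "it",
    "japanese": "ja",
    "korean": "ko",
    "portuguese": "pt",
    "russian": "ru",
    "spanish": "es",
    "thai": "th",
    "turkish": "tr",
    "vietnamese": "vi",
}


def wiki_lingua_config(language: str) -> str:
    """Return the config name, e.g. 'gem/wiki_lingua_indonesian_id'."""
    code = WIKI_LINGUA_LANGUAGES[language]
    return f"gem/wiki_lingua_{language}_{code}"


def load_wiki_lingua(language: str, split: str = "validation"):
    """Hypothetical convenience wrapper around tfds.load.

    tensorflow_datasets is imported lazily so that the name helper above
    can be used without the package installed.
    """
    import tensorflow_datasets as tfds

    return tfds.load(wiki_lingua_config(language), split=split)


print(wiki_lingua_config("indonesian"))
```

Each example returned by such a call would be a dictionary keyed by the feature names listed above, for instance `example["source"]` and `example["target"]`.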

gem/wiki_lingua_italian_it

 • Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 84.80 MiB

 • Dataset size: 374.40 MiB

 • Auto-cached (documentation): No

 • Splits:

Split Examples
'test' 10,189
'train' 35,661
'validation' 5,093
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'it': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'it': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source for
the correct citation of each contained dataset.

gem/wiki_lingua_japanese_ja

 • Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 21.75 MiB

 • Dataset size: 103.19 MiB

 • Auto-cached (documentation): Yes

 • Splits:

Split Examples
'test' 2,530
'train' 8,853
'validation' 1,264
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'ja': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'ja': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source for
the correct citation of each contained dataset.

gem/wiki_lingua_korean_ko

 • Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 22.26 MiB

 • Dataset size: 102.35 MiB

 • Auto-cached (documentation): Yes

 • Splits:

Split Examples
'test' 2,436
'train' 8,524
'validation' 1,216
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'ko': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'ko': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source for
the correct citation of each contained dataset.

gem/wiki_lingua_portuguese_pt

 • Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 131.17 MiB

 • Dataset size: 570.46 MiB

 • Auto-cached (documentation): No

 • Splits:

Split Examples
'test' 16,331
'train' 57,159
'validation' 8,165
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'pt': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'pt': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source for
the correct citation of each contained dataset.

gem/wiki_lingua_russian_ru

 • Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 101.36 MiB

 • Dataset size: 564.69 MiB

 • Auto-cached (documentation): No

 • Splits:

Split Examples
'test' 10,580
'train' 37,028
'validation' 5,288
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'ru': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'ru': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source for
the correct citation of each contained dataset.

gem/wiki_lingua_spanish_es

 • Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 189.06 MiB

 • Dataset size: 849.75 MiB

 • Auto-cached (documentation): No

 • Splits:

Split Examples
'test' 22,632
'train' 79,212
'validation' 11,316
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'es': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'es': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source for
the correct citation of each contained dataset.

gem/wiki_lingua_thai_th

 • Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 28.60 MiB

 • Dataset size: 193.77 MiB

 • Auto-cached (documentation): Yes (test, validation), Only when shuffle_files=False (train)

 • Splits:

Split Examples
'test' 2,950
'train' 10,325
'validation' 1,475
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'th': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'th': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source for
the correct citation of each contained dataset.
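
The Thai config is the only one in this part of the catalog whose auto-caching depends on the split and on `shuffle_files`. The sketch below simply encodes that catalog entry as a lookup and shows where `shuffle_files` would be passed to `tfds.load`. The helper names `is_auto_cached` and `load_thai` are hypothetical, and the function mirrors the entry above rather than TFDS internals.

```python
THAI_CONFIG = "gem/wiki_lingua_thai_th"


def is_auto_cached(split: str, shuffle_files: bool) -> bool:
    """Mirror the catalog entry for the Thai config.

    'test' and 'validation' are listed as auto-cached; 'train' is listed
    as auto-cached only when shuffle_files=False.
    """
    if split in ("test", "validation"):
        return True
    if split == "train":
        return not shuffle_files
    raise ValueError(f"unknown split: {split!r}")


def load_thai(split: str, shuffle_files: bool = False):
    """Hypothetical wrapper; tensorflow_datasets is imported lazily."""
    import tensorflow_datasets as tfds

    return tfds.load(THAI_CONFIG, split=split, shuffle_files=shuffle_files)


for split in ("train", "validation", "test"):
    print(split, is_auto_cached(split, shuffle_files=True))
```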

gem/wiki_lingua_turkish_tr

 • Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 6.73 MiB

 • Dataset size: 30.75 MiB

 • Auto-cached (documentation): Yes

 • Splits:

Split Examples
'test' 900
'train' 3,148
'validation' 449
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'tr': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'tr': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source for
the correct citation of each contained dataset.

gem/wiki_lingua_vietnamese_vi

 • Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

 • Download size: 36.27 MiB

 • Dataset size: 179.77 MiB

 • Auto-cached (documentation): Yes

 • Splits:

Split Examples
'test' 3,917
'train' 13,707
'validation' 1,957
 • Features:
FeaturesDict({
  'gem_id': tf.string,
  'gem_parent_id': tf.string,
  'references': Sequence(tf.string),
  'source': tf.string,
  'source_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'vi': Text(shape=(), dtype=tf.string),
  }),
  'target': tf.string,
  'target_aligned': Translation({
    'en': Text(shape=(), dtype=tf.string),
    'vi': Text(shape=(), dtype=tf.string),
  }),
})
 • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
 author  = {Sebastian Gehrmann and
        Tosin P. Adewumi and
        Karmanya Aggarwal and
        Pawan Sasanka Ammanamanchi and
        Aremu Anuoluwapo and
        Antoine Bosselut and
        Khyathi Raghavi Chandu and
        Miruna{-}Adriana Clinciu and
        Dipanjan Das and
        Kaustubh D. Dhole and
        Wanyu Du and
        Esin Durmus and
        Ondrej Dusek and
        Chris Emezue and
        Varun Gangal and
        Cristina Garbacea and
        Tatsunori Hashimoto and
        Yufang Hou and
        Yacine Jernite and
        Harsh Jhamtani and
        Yangfeng Ji and
        Shailza Jolly and
        Dhruv Kumar and
        Faisal Ladhak and
        Aman Madaan and
        Mounica Maddela and
        Khyati Mahajan and
        Saad Mahamood and
        Bodhisattwa Prasad Majumder and
        Pedro Henrique Martins and
        Angelina McMillan{-}Major and
        Simon Mille and
        Emiel van Miltenburg and
        Moin Nadeem and
        Shashi Narayan and
        Vitaly Nikolaev and
        Rubungo Andre Niyongabo and
        Salomey Osei and
        Ankur P. Parikh and
        Laura Perez{-}Beltrachini and
        Niranjan Ramesh Rao and
        Vikas Raunak and
        Juan Diego Rodriguez and
        Sashank Santhanam and
        Jo{\~{a}}o Sedoc and
        Thibault Sellam and
        Samira Shaikh and
        Anastasia Shimorina and
        Marco Antonio Sobrevilla Cabezudo and
        Hendrik Strobelt and
        Nishant Subramani and
        Wei Xu and
        Diyi Yang and
        Akhila Yerukola and
        Jiawei Zhou},
 title   = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
        Metrics},
 journal  = {CoRR},
 volume  = {abs/2102.01672},
 year   = {2021},
 url    = {https://arxiv.org/abs/2102.01672},
 archivePrefix = {arXiv},
 eprint  = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source for
the correct citation of each contained dataset.
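
The split counts listed for the WikiLingua configs above can be compared directly. The following sketch copies those counts into a dictionary (the name `SPLITS` and the helper `split_fractions` are illustrative, not part of the catalog) and computes each split's share of its config. Under these numbers every config comes out at roughly 70% train, 20% test and 10% validation.

```python
# Example counts per split, copied from the catalog entries above.
SPLITS = {
    "indonesian_id": {"test": 9497, "train": 33237, "validation": 4747},
    "italian_it": {"test": 10189, "train": 35661, "validation": 5093},
    "japanese_ja": {"test": 2530, "train": 8853, "validation": 1264},
    "korean_ko": {"test": 2436, "train": 8524, "validation": 1216},
    "portuguese_pt": {"test": 16331, "train": 57159, "validation": 8165},
    "russian_ru": {"test": 10580, "train": 37028, "validation": 5288},
    "spanish_es": {"test": 22632, "train": 79212, "validation": 11316},
    "thai_th": {"test": 2950, "train": 10325, "validation": 1475},
    "turkish_tr": {"test": 900, "train": 3148, "validation": 449},
    "vietnamese_vi": {"test": 3917, "train": 13707, "validation": 1957},
}


def split_fractions(counts: dict) -> dict:
    """Return each split's share of the config total, rounded to 2 decimals."""
    total = sum(counts.values())
    return {name: round(n / total, 2) for name, n in counts.items()}


for config, counts in SPLITS.items():
    print(config, sum(counts.values()), split_fractions(counts))
```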