- תיאור:
GEM היא סביבת benchmark עבור Natural Language הדור עם דגש על ההערכה שלו, הוא באמצעות הערות אדם מדדים אוטומטיים.
GEM שואפת: (1) למדוד התקדמות NLG על פני 13 מערכי נתונים המשתרעים על משימות ושפות NLG רבות. (2) לספק ניתוח מעמיק של נתונים ומודלים המוצגים באמצעות הצהרות נתונים ומערכות אתגרים. (3) לפתח תקנים להערכת טקסט שנוצר באמצעות מדדים אוטומטיים ואנושיים כאחד.
ניתן למצוא מידע נוסף באתר https://gem-benchmark.com .
דף הבית: https://gem-benchmark.com
קוד מקור:
tfds.text.gem.Gem
גרסאות:
-
1.0.0
: גרסה ראשונית -
1.0.1
: מסנן קישורים רעים עדכון עבור MLSum -
1.1.0
(ברירת המחדל): שחרור של סטים אתגר
-
מפתחות השגחה (ראה
as_supervised
doc ):None
איור ( tfds.show_examples ): לא נתמך.
gem/common_gen (הגדרת ברירת מחדל)
תיאור Config: CommonGen היא משימה בדור טקסט מאולצת, הקשורים במערך benchmark, למכונות בדיקה במפורש את היכולת של חשיבת שכל ישר יוצרות. בהינתן מכלול של מושגים נפוצים; המשימה היא ליצור משפט קוהרנטי המתאר תרחיש יומיומי באמצעות מושגים אלה.
גודל ההורדה:
1.84 MiB
מערך נתונים גודל:
16.84 MiB
Auto-במטמון ( תיעוד ): כן
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 1,497 |
'train' | 67,389 |
'validation' | 993 |
- מאפיינים:
FeaturesDict({
'concept_set_id': tf.int32,
'concepts': Sequence(tf.string),
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'target': tf.string,
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{lin2020commongen,
title = "CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
author = "Lin, Bill Yuchen and
Zhou, Wangchunshu and
Shen, Ming and
Zhou, Pei and
Bhagavatula, Chandra and
Choi, Yejin and
Ren, Xiang",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
pages = "1823--1840",
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/cs_restaurants
תיאור Config: המשימה היא לייצר תגובות בהקשר של מערכת דיאלוג (היפותטי) המספק מידע על מסעדות. הקלט הוא סוג מעשה בסיסי של כוונה/דיאלוג ורשימת חריצים (תכונות) וערכיהם. הפלט הוא משפט בשפה טבעית.
גודל ההורדה:
1.46 MiB
מערך נתונים גודל:
2.71 MiB
Auto-במטמון ( תיעוד ): כן
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 842 |
'train' | 3,569 |
'validation' | 781 |
- מאפיינים:
FeaturesDict({
'dialog_act': tf.string,
'dialog_act_delexicalized': tf.string,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'target': tf.string,
'target_delexicalized': tf.string,
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{cs_restaurants,
address = {Tokyo, Japan},
title = {Neural {Generation} for {Czech}: {Data} and {Baselines} },
shorttitle = {Neural {Generation} for {Czech} },
url = {https://www.aclweb.org/anthology/W19-8670/},
urldate = {2019-10-18},
booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
author = {Dušek, Ondřej and Jurčíček, Filip},
month = oct,
year = {2019},
pages = {563--574}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/חץ
תיאור Config: DART הוא תחום-פתוח וגדול מובן נתון שיא אל קורפוס דור טקסט עם הסברי משפט איכותיים עם כול קלט להיות קבוצה של משולשי יישות-קשר הבא עץ מובנה האונטולוגיה.
גודל ההורדה:
28.01 MiB
מערך נתונים גודל:
33.78 MiB
Auto-במטמון ( תיעוד ): כן
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 6,959 |
'train' | 62,659 |
'validation' | 2,768 |
- מאפיינים:
FeaturesDict({
'dart_id': tf.int32,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'subtree_was_extended': tf.bool,
'target': tf.string,
'target_sources': Sequence(tf.string),
'tripleset': Sequence(tf.string),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@article{radev2020dart,
title=Dart: Open-domain structured data record to text generation,
author={Radev, Dragomir and Zhang, Rui and Rau, Amrit and Sivaprasad, Abhinand and Hsieh, Chiachun and Rajani, Nazneen Fatema and Tang, Xiangru and Vyas, Aadit and Verma, Neha and Krishna, Pranav and others},
journal={arXiv preprint arXiv:2007.02871},
year={2020}
}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/e2e_nlg
תיאור Config: בסיס נתון E2E מיועד משימה-תחום מוגבל נתונים אלי טקסט - דור של תיאורי מסעדה / המלצות המבוסס על עד 8 תכונות שונות (שם, באזור, בטווח מחירים וכו ')
גודל ההורדה:
13.99 MiB
מערך נתונים גודל:
16.92 MiB
Auto-במטמון ( תיעוד ): כן
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 4,693 |
'train' | 33,525 |
'validation' | 4,299 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'meaning_representation': tf.string,
'references': Sequence(tf.string),
'target': tf.string,
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{e2e_cleaned,
address = {Tokyo, Japan},
title = {Semantic {Noise} {Matters} for {Neural} {Natural} {Language} {Generation} },
url = {https://www.aclweb.org/anthology/W19-8652/},
booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
author = {Dušek, Ondřej and Howcroft, David M and Rieser, Verena},
year = {2019},
pages = {421--426},
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/mlsum_de
תיאור Config: MLSum הוא במערך תמצות רב לשוני בקנה מידה גדול. הוא נרכש מכלי חדשות מקוונים, הפיצול הזה מתמקד בגרמנית.
גודל ההורדה:
345.98 MiB
מערך נתונים גודל:
963.60 MiB
Auto-במטמון ( תיעוד ): אין
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'challenge_test_covid' | 5,058 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 10,695 |
'train' | 220,748 |
'validation' | 11,392 |
- מאפיינים:
FeaturesDict({
'date': tf.string,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'target': tf.string,
'text': tf.string,
'title': tf.string,
'topic': tf.string,
'url': tf.string,
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{scialom-etal-2020-mlsum,
title = "{MLSUM}: The Multilingual Summarization Corpus",
author = {Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = {2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/mlsum_es
תיאור Config: MLSum הוא במערך תמצות רב לשוני בקנה מידה גדול. הוא נרכש מכלי חדשות מקוונים, הפיצול המתמקד בספרדית.
גודל ההורדה:
501.27 MiB
גודל בסיס הנתונים:
1.29 GiB
Auto-במטמון ( תיעוד ): אין
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'challenge_test_covid' | 1,938 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 13,366 |
'train' | 259,888 |
'validation' | 9,977 |
- מאפיינים:
FeaturesDict({
'date': tf.string,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'target': tf.string,
'text': tf.string,
'title': tf.string,
'topic': tf.string,
'url': tf.string,
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{scialom-etal-2020-mlsum,
title = "{MLSUM}: The Multilingual Summarization Corpus",
author = {Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = {2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/schema_guided_dialog
תיאור Config: דיאלוג סכימה-מודרך (SGD) נתון מכילים 18K דיאלוגי תחום מרובה-משימתית בין אדם לבין עוזר וירטואלי, אשר מכסה 17 תחומים, החל מבנקים ואירועים למדיה, לוח שנה, נסיעות, ומזג אוויר.
גודל ההורדה:
17.00 MiB
מערך נתונים גודל:
201.19 MiB
Auto-במטמון ( תיעוד ): כן (challenge_test_backtranslation, challenge_test_bfp02, challenge_test_bfp05, challenge_test_nopunc, challenge_test_scramble, challenge_train_sample, challenge_validation_sample, מבחן, אימות), רק כאשר
shuffle_files=False
(הרכבת)פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'challenge_test_backtranslation' | 500 |
'challenge_test_bfp02' | 500 |
'challenge_test_bfp05' | 500 |
'challenge_test_nopunc' | 500 |
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 10,000 |
'train' | 164,982 |
'validation' | 10,000 |
- מאפיינים:
FeaturesDict({
'context': Sequence(tf.string),
'dialog_acts': Sequence({
'act': ClassLabel(shape=(), dtype=tf.int64, num_classes=18),
'slot': tf.string,
'values': Sequence(tf.string),
}),
'dialog_id': tf.string,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'prompt': tf.string,
'references': Sequence(tf.string),
'service': tf.string,
'target': tf.string,
'turn_id': tf.int32,
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@article{rastogi2019towards,
title={Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset},
author={Rastogi, Abhinav and Zang, Xiaoxue and Sunkara, Srinivas and Gupta, Raghav and Khaitan, Pranav},
journal={arXiv preprint arXiv:1909.05855},
year={2019}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/טוטו
תיאור Config: Totto היא משימה NLG שולחן-to-Text. המשימה היא כדלקמן: בהתחשב בטבלת ויקיפדיה עם שמות שורות, שמות עמודות ותאי טבלה, עם קבוצת תאים מודגשת, יוצרים תיאור בשפה טבעית עבור החלק המודגש של הטבלה.
גודל ההורדה:
180.75 MiB
מערך נתונים גודל:
645.86 MiB
Auto-במטמון ( תיעוד ): אין
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 7,700 |
'train' | 121,153 |
'validation' | 7,700 |
- מאפיינים:
FeaturesDict({
'example_id': tf.string,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'highlighted_cells': Sequence(Sequence(tf.int32)),
'overlap_subset': tf.string,
'references': Sequence(tf.string),
'sentence_annotations': Sequence({
'final_sentence': tf.string,
'original_sentence': tf.string,
'sentence_after_ambiguity': tf.string,
'sentence_after_deletion': tf.string,
}),
'table': Sequence(Sequence({
'column_span': tf.int32,
'is_header': tf.bool,
'row_span': tf.int32,
'value': tf.string,
})),
'table_page_title': tf.string,
'table_section_text': tf.string,
'table_section_title': tf.string,
'table_webpage_url': tf.string,
'target': tf.string,
'totto_id': tf.int32,
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{parikh2020totto,
title=ToTTo: A Controlled Table-To-Text Generation Dataset,
author={Parikh, Ankur and Wang, Xuezhi and Gehrmann, Sebastian and Faruqui, Manaal and Dhingra, Bhuwan and Yang, Diyi and Das, Dipanjan},
booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
pages={1173--1186},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/web_nlg_en
תיאור Config: WebNLG הוא במערך דו-לשוני (אנגלית, רוסית) של DBpedia במקביל סטים משולשים וטקסטים קצרים כיסוי על 450 תכונות DBpedia שונות. נתוני WebNLG נוצרו במקור על מנת לקדם את פיתוחם של RDF verbalisers המסוגלים ליצור טקסט קצר ולטפל בתכנון מיקרו.
גודל ההורדה:
12.57 MiB
מערך נתונים גודל:
19.91 MiB
Auto-במטמון ( תיעוד ): כן
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'challenge_test_numbers' | 500 |
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 502 |
'challenge_validation_sample' | 499 |
'test' | 1,779 |
'train' | 35,426 |
'validation' | 1,667 |
- מאפיינים:
FeaturesDict({
'category': tf.string,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'input': Sequence(tf.string),
'references': Sequence(tf.string),
'target': tf.string,
'webnlg_id': tf.string,
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{gardent2017creating,
author = "Gardent, Claire
and Shimorina, Anastasia
and Narayan, Shashi
and Perez-Beltrachini, Laura",
title = "Creating Training Corpora for NLG Micro-Planners",
booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2017",
publisher = "Association for Computational Linguistics",
pages = "179--188",
location = "Vancouver, Canada",
doi = "10.18653/v1/P17-1017",
url = "http://www.aclweb.org/anthology/P17-1017"
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/web_nlg_ru
תיאור Config: WebNLG הוא במערך דו-לשוני (אנגלית, רוסית) של DBpedia במקביל סטים משולשים וטקסטים קצרים כיסוי על 450 תכונות DBpedia שונות. נתוני WebNLG נוצרו במקור על מנת לקדם את פיתוחם של RDF verbalisers המסוגלים ליצור טקסט קצר ולטפל בתכנון מיקרו.
גודל ההורדה:
7.49 MiB
מערך נתונים גודל:
11.30 MiB
Auto-במטמון ( תיעוד ): כן
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 501 |
'challenge_validation_sample' | 500 |
'test' | 1,102 |
'train' | 14,630 |
'validation' | 790 |
- מאפיינים:
FeaturesDict({
'category': tf.string,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'input': Sequence(tf.string),
'references': Sequence(tf.string),
'target': tf.string,
'webnlg_id': tf.string,
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{gardent2017creating,
author = "Gardent, Claire
and Shimorina, Anastasia
and Narayan, Shashi
and Perez-Beltrachini, Laura",
title = "Creating Training Corpora for NLG Micro-Planners",
booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2017",
publisher = "Association for Computational Linguistics",
pages = "179--188",
location = "Vancouver, Canada",
doi = "10.18653/v1/P17-1017",
url = "http://www.aclweb.org/anthology/P17-1017"
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_auto_asset_turk
Config תיאור: WikiAuto מספק סט של משפטים מיושרים מוויקיפדיה באנגלית ויקיפדיה באנגלית פשוט כמשאב להכשיר מערכות פישוט משפט. ASSET ו- TURK הם מערכות נתונים מפשטות באיכות גבוהה המשמשות לבדיקה.
גודל ההורדה:
121.01 MiB
מערך נתונים גודל:
202.40 MiB
Auto-במטמון ( תיעוד ): כן (challenge_test_asset_backtranslation, challenge_test_asset_bfp02, challenge_test_asset_bfp05, challenge_test_asset_nopunc, challenge_test_turk_backtranslation, challenge_test_turk_bfp02, challenge_test_turk_bfp05, challenge_test_turk_nopunc, challenge_train_sample, challenge_validation_sample, test_asset, test_turk, אימות), רק כאשר
shuffle_files=False
(הרכבת)פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'challenge_test_asset_backtranslation' | 359 |
'challenge_test_asset_bfp02' | 359 |
'challenge_test_asset_bfp05' | 359 |
'challenge_test_asset_nopunc' | 359 |
'challenge_test_turk_backtranslation' | 359 |
'challenge_test_turk_bfp02' | 359 |
'challenge_test_turk_bfp05' | 359 |
'challenge_test_turk_nopunc' | 359 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test_asset' | 359 |
'test_turk' | 359 |
'train' | 483,801 |
'validation' | 20,000 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'target': tf.string,
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{jiang-etal-2020-neural,
title = "Neural {CRF} Model for Sentence Alignment in Text Simplification",
author = "Jiang, Chao and
Maddela, Mounica and
Lan, Wuwei and
Zhong, Yang and
Xu, Wei",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.709",
doi = "10.18653/v1/2020.acl-main.709",
pages = "7943--7960",
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/קסום
Config תיאור: מערך הנתונים למשימה של תמצות abstractive בצורתה הקיצונית, שלה כ המסכם מסמך במשפט אחד.
גודל ההורדה:
246.31 MiB
מערך נתונים גודל:
78.89 MiB
Auto-במטמון ( תיעוד ): כן
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'challenge_test_backtranslation' | 500 |
'challenge_test_bfp_02' | 500 |
'challenge_test_bfp_05' | 500 |
'challenge_test_covid' | 401 |
'challenge_test_nopunc' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 1,166 |
'train' | 23,206 |
'validation' | 1,117 |
- מאפיינים:
FeaturesDict({
'document': tf.string,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'target': tf.string,
'xsum_id': tf.string,
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{Narayan2018dont,
author = "Shashi Narayan and Shay B. Cohen and Mirella Lapata",
title = "Don't Give Me the Details, Just the Summary! {T}opic-Aware Convolutional Neural Networks for Extreme Summarization",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing ",
year = "2018",
address = "Brussels, Belgium",
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_arabic_ar
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
56.25 MiB
מערך נתונים גודל:
291.42 MiB
Auto-במטמון ( תיעוד ): אין
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 5,841 |
'train' | 20,441 |
'validation' | 2,919 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'ar': Text(shape=(), dtype=tf.string),
'en': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'ar': Text(shape=(), dtype=tf.string),
'en': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_chinese_zh
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
31.38 MiB
מערך נתונים גודל:
122.06 MiB
Auto-במטמון ( תיעוד ): כן
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 3,775 |
'train' | 13,211 |
'validation' | 1,886 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'zh': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'zh': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_czech_cs
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
13.84 MiB
מערך נתונים גודל:
58.05 MiB
Auto-במטמון ( תיעוד ): כן
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 1,438 |
'train' | 5,033 |
'validation' | 718 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'cs': Text(shape=(), dtype=tf.string),
'en': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'cs': Text(shape=(), dtype=tf.string),
'en': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_dutch_nl
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
53.88 MiB
מערך נתונים גודל:
237.97 MiB
Auto-במטמון ( תיעוד ): כן (מבחן, אימות), רק כאשר
shuffle_files=False
(הרכבת)פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 6,248 |
'train' | 21,866 |
'validation' | 3,123 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'nl': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'nl': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_english_en
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
112.56 MiB
מערך נתונים גודל:
657.51 MiB
Auto-במטמון ( תיעוד ): אין
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 28,614 |
'train' | 99,020 |
'validation' | 13,823 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_french_fr
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
113.26 MiB
מערך נתונים גודל:
522.28 MiB
Auto-במטמון ( תיעוד ): אין
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 12,731 |
'train' | 44,556 |
'validation' | 6,364 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'fr': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'fr': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_german_de
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
102.65 MiB
מערך נתונים גודל:
452.46 MiB
Auto-במטמון ( תיעוד ): אין
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 11,669 |
'train' | 40,839 |
'validation' | 5,833 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'de': Text(shape=(), dtype=tf.string),
'en': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'de': Text(shape=(), dtype=tf.string),
'en': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_hindi_hi
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
20.07 MiB
מערך נתונים גודל:
138.06 MiB
Auto-במטמון ( תיעוד ): כן
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 1,984 |
'train' | 6,942 |
'validation' | 991 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'hi': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'hi': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_indonesian_id
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
80.08 MiB
מערך נתונים גודל:
370.63 MiB
Auto-במטמון ( תיעוד ): אין
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 9,497 |
'train' | 33,237 |
'validation' | 4,747 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'id': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'id': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_italian_it
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
84.80 MiB
מערך נתונים גודל:
374.40 MiB
Auto-במטמון ( תיעוד ): אין
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 10,189 |
'train' | 35,661 |
'validation' | 5,093 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'it': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'it': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_japanese_ja
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
21.75 MiB
מערך נתונים גודל:
103.19 MiB
Auto-במטמון ( תיעוד ): כן
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 2,530 |
'train' | 8,853 |
'validation' | 1,264 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'ja': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'ja': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_korean_ko
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
22.26 MiB
מערך נתונים גודל:
102.35 MiB
Auto-במטמון ( תיעוד ): כן
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 2,436 |
'train' | 8,524 |
'validation' | 1,216 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'ko': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'ko': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_portuguese_pt
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
131.17 MiB
מערך נתונים גודל:
570.46 MiB
Auto-במטמון ( תיעוד ): אין
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 16,331 |
'train' | 57,159 |
'validation' | 8,165 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'pt': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'pt': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_russian_ru
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
101.36 MiB
מערך נתונים גודל:
564.69 MiB
Auto-במטמון ( תיעוד ): אין
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 10,580 |
'train' | 37,028 |
'validation' | 5,288 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'ru': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'ru': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_spanish_es
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
189.06 MiB
מערך נתונים גודל:
849.75 MiB
Auto-במטמון ( תיעוד ): אין
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 22,632 |
'train' | 79,212 |
'validation' | 11,316 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'es': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'es': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_thai_th
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
28.60 MiB
מערך נתונים גודל:
193.77 MiB
Auto-במטמון ( תיעוד ): כן (מבחן, אימות), רק כאשר
shuffle_files=False
(הרכבת)פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 2,950 |
'train' | 10,325 |
'validation' | 1,475 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'th': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'th': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_turkish_tr
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
6.73 MiB
מערך נתונים גודל:
30.75 MiB
Auto-במטמון ( תיעוד ): כן
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 900 |
'train' | 3,148 |
'validation' | 449 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'tr': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'tr': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
פנינה/wiki_lingua_vietnamese_vi
תיאור Config: Wikilingua היא בקנה מידה גדול, בסיס הנתונים הרב-לשוני עבור הערכה של מערכות תמצות abstractive צולבות לשוני ..
גודל ההורדה:
36.27 MiB
מערך נתונים גודל:
179.77 MiB
Auto-במטמון ( תיעוד ): כן
פיצולים:
לְפַצֵל | דוגמאות |
---|---|
'test' | 3,917 |
'train' | 13,707 |
'validation' | 1,957 |
- מאפיינים:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'vi': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'vi': Text(shape=(), dtype=tf.string),
}),
})
- דוגמאות ( tfds.as_dataframe ):
- ציטוט:
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."