gem


  • Description:

GEM is a benchmark environment for Natural Language Generation with a focus on its Evaluation, both through human annotations and automated Metrics.

GEM aims to: (1) measure NLG progress across 13 datasets spanning many NLG tasks and languages; (2) provide an in-depth analysis of data and models presented via data statements and challenge sets; (3) develop standards for evaluation of generated text using both automated and human metrics.

More information can be found at https://gem-benchmark.com.
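Each config on this page is addressed in TensorFlow Datasets as `gem/<config>`. The following is a minimal sketch of how a config could be loaded, not an official recipe. The `GEM_CONFIGS` list covers only the configs described in this section, and the helper names `tfds_name` and `load_gem` are hypothetical. The only library call used is `tfds.load(name, split=...)`, which downloads and prepares the data on first use.

```python
# Config names taken from this section of the catalog page (not the full list).
GEM_CONFIGS = [
    "common_gen", "cs_restaurants", "dart", "e2e_nlg", "mlsum_de",
    "mlsum_es", "schema_guided_dialog", "totto", "web_nlg_en",
]


def tfds_name(config: str) -> str:
    """Build the TFDS dataset name for a GEM config, e.g. 'gem/common_gen'."""
    if config not in GEM_CONFIGS:
        raise ValueError(f"unknown GEM config: {config!r}")
    return f"gem/{config}"


def load_gem(config: str, split: str = "validation"):
    """Load one split of a GEM config as a tf.data.Dataset.

    Hypothetical helper: the import is done lazily so that the name
    handling above can be used without TensorFlow Datasets installed.
    """
    import tensorflow_datasets as tfds

    return tfds.load(tfds_name(config), split=split)
```

For example, `load_gem("common_gen", "validation")` would request the 993-example validation split listed for that config.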

gem/common_gen (default config)

  • Config description: CommonGen is a constrained text generation task, associated with a benchmark dataset, that explicitly tests machines for the ability of generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts.

  • Download size: 1.84 MiB

  • Dataset size: 16.84 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'challenge_test_scramble' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 1,497
'train' 67,389
'validation' 993
  • Feature structure:
FeaturesDict({
    'concept_set_id': int32,
    'concepts': Sequence(string),
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'target': string,
})
  • Feature documentation:
Feature         Class             Shape    Dtype   Description
                FeaturesDict
concept_set_id  Tensor                     int32
concepts        Sequence(Tensor)  (None,)  string
gem_id          Tensor                     string
gem_parent_id   Tensor                     string
references      Sequence(Tensor)  (None,)  string
target          Tensor                     string
  • Citation:
@inproceedings{lin2020commongen,
  title = "CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
  author = "Lin, Bill Yuchen  and
    Zhou, Wangchunshu  and
    Shen, Ming  and
    Zhou, Pei  and
    Bhagavatula, Chandra  and
    Choi, Yejin  and
    Ren, Xiang",
  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
  pages = "1823--1840",
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
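The CommonGen task described above (generate one sentence that uses every concept in a set) can be illustrated with a small helper that reports which concepts a candidate sentence fails to mention. This is only a sketch: the record below is a hypothetical example shaped like the `concepts` and `target` features, and prefix matching is a crude stand-in for proper lemmatisation, not the benchmark's official coverage metric.

```python
def build_input(concepts):
    """Join a concept set into a single source string for a seq2seq model."""
    return " ".join(concepts)


def missing_concepts(concepts, sentence):
    """Return the concepts that no token in the sentence starts with.

    Prefix matching lets "throw" match "throws"; it will miss irregular
    forms, so treat the result as a rough check only.
    """
    tokens = [t.strip(".,!?;:").lower() for t in sentence.split()]
    return [c for c in concepts if not any(t.startswith(c.lower()) for t in tokens)]


# Hypothetical record, not taken from the dataset.
example = {
    "concepts": ["dog", "frisbee", "catch", "throw"],
    "target": "A man throws a frisbee and his dog catches it.",
}

print(build_input(example["concepts"]))
print(missing_concepts(example["concepts"], example["target"]))
```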

gem/cs_restaurants

  • Config description: The task is to generate responses in the context of a (hypothetical) dialogue system that provides information about restaurants. The input is a basic intent/dialogue act type and a list of slots (attributes) and their values. The output is a natural language sentence.

  • Download size: 1.46 MiB

  • Dataset size: 2.71 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'challenge_test_scramble' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 842
'train' 3,569
'validation' 781
  • Feature structure:
FeaturesDict({
    'dialog_act': string,
    'dialog_act_delexicalized': string,
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'target': string,
    'target_delexicalized': string,
})
  • Feature documentation:
Feature                   Class             Shape    Dtype   Description
                          FeaturesDict
dialog_act                Tensor                     string
dialog_act_delexicalized  Tensor                     string
gem_id                    Tensor                     string
gem_parent_id             Tensor                     string
references                Sequence(Tensor)  (None,)  string
target                    Tensor                     string
target_delexicalized      Tensor                     string
  • Citation:
@inproceedings{cs_restaurants,
  address = {Tokyo, Japan},
  title = {Neural {Generation} for {Czech}: {Data} and {Baselines}},
  shorttitle = {Neural {Generation} for {Czech}},
  url = {https://www.aclweb.org/anthology/W19-8670/},
  urldate = {2019-10-18},
  booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
  author = {Dušek, Ondřej and Jurčíček, Filip},
  month = oct,
  year = {2019},
  pages = {563--574}
}

gem/dart

  • Config description: DART is a large, open-domain structured DAta Record to Text generation corpus with high-quality sentence annotations, where each input is a set of entity-relation triples following a tree-structured ontology.

  • Download size: 28.01 MiB

  • Dataset size: 33.78 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'test' 6,959
'train' 62,659
'validation' 2,768
  • Feature structure:
FeaturesDict({
    'dart_id': int32,
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'subtree_was_extended': bool,
    'target': string,
    'target_sources': Sequence(string),
    'tripleset': Sequence(string),
})
  • Feature documentation:
Feature               Class             Shape    Dtype   Description
                      FeaturesDict
dart_id               Tensor                     int32
gem_id                Tensor                     string
gem_parent_id         Tensor                     string
references            Sequence(Tensor)  (None,)  string
subtree_was_extended  Tensor                     bool
target                Tensor                     string
target_sources        Sequence(Tensor)  (None,)  string
tripleset             Sequence(Tensor)  (None,)  string
  • Citation:
@article{radev2020dart,
  title={Dart: Open-domain structured data record to text generation},
  author={Radev, Dragomir and Zhang, Rui and Rau, Amrit and Sivaprasad, Abhinand and Hsieh, Chiachun and Rajani, Nazneen Fatema and Tang, Xiangru and Vyas, Aadit and Verma, Neha and Krishna, Pranav and others},
  journal={arXiv preprint arXiv:2007.02871},
  year={2020}
}

gem/e2e_nlg

  • Config description: The E2E dataset is designed for a limited-domain data-to-text task: generating restaurant descriptions/recommendations based on up to 8 different attributes (name, area, price range, etc.).

  • Download size: 13.99 MiB

  • Dataset size: 16.92 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'challenge_test_scramble' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 4,693
'train' 33,525
'validation' 4,299
  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'meaning_representation': string,
    'references': Sequence(string),
    'target': string,
})
  • Feature documentation:
Feature                 Class             Shape    Dtype   Description
                        FeaturesDict
gem_id                  Tensor                     string
gem_parent_id           Tensor                     string
meaning_representation  Tensor                     string
references              Sequence(Tensor)  (None,)  string
target                  Tensor                     string
  • Citation:
@inproceedings{e2e_cleaned,
  address = {Tokyo, Japan},
  title = {Semantic {Noise} {Matters} for {Neural} {Natural} {Language} {Generation}},
  url = {https://www.aclweb.org/anthology/W19-8652/},
  booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
  author = {Dušek, Ondřej and Howcroft, David M and Rieser, Verena},
  year = {2019},
  pages = {421--426},
}
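In gem/e2e_nlg, the `meaning_representation` feature holds the attributes as one string. The sketch below assumes the usual E2E layout of comma-separated `slot[value]` pairs; the sample string is a hypothetical illustration, not a record copied from the dataset, and `parse_mr` is a hypothetical helper name.

```python
import re

# One slot: a name without brackets or commas, followed by a bracketed value.
MR_SLOT = re.compile(r"\s*([^\[\],]+?)\[([^\]]*)\]")


def parse_mr(mr: str) -> dict:
    """Parse 'name[The Eagle], eatType[coffee shop]' into a slot -> value dict."""
    return {name.strip(): value for name, value in MR_SLOT.findall(mr)}


# Hypothetical meaning representation in the assumed format.
sample_mr = "name[The Eagle], eatType[coffee shop], priceRange[moderate]"
slots = parse_mr(sample_mr)
print(slots)
```

Keeping the slots in a dict makes it straightforward to check whether a generated description mentions each attribute value.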

gem/mlsum_de

  • Config description: MLSum is a large-scale multilingual summarization dataset. It is built from online news outlets; this config focuses on German.

  • Download size: 345.98 MiB

  • Dataset size: 963.60 MiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'challenge_test_covid' 5,058
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 10,695
'train' 220,748
'validation' 11,392
  • Feature structure:
FeaturesDict({
    'date': string,
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'target': string,
    'text': string,
    'title': string,
    'topic': string,
    'url': string,
})
  • Feature documentation:
Feature        Class             Shape    Dtype   Description
               FeaturesDict
date           Tensor                     string
gem_id         Tensor                     string
gem_parent_id  Tensor                     string
references     Sequence(Tensor)  (None,)  string
target         Tensor                     string
text           Tensor                     string
title          Tensor                     string
topic          Tensor                     string
url            Tensor                     string
  • Citation:
@inproceedings{scialom-etal-2020-mlsum,
    title = "{MLSUM}: The Multilingual Summarization Corpus",
    author = {Scialom, Thomas  and Dray, Paul-Alexis  and Lamprier, Sylvain  and Piwowarski, Benjamin  and Staiano, Jacopo},
    booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year = {2020}
}

gem/mlsum_es

  • Config description: MLSum is a large-scale multilingual summarization dataset. It is built from online news outlets; this config focuses on Spanish.

  • Download size: 501.27 MiB

  • Dataset size: 1.29 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'challenge_test_covid' 1,938
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 13,366
'train' 259,888
'validation' 9,977
  • Feature structure:
FeaturesDict({
    'date': string,
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'target': string,
    'text': string,
    'title': string,
    'topic': string,
    'url': string,
})
  • Feature documentation:
Feature        Class             Shape    Dtype   Description
               FeaturesDict
date           Tensor                     string
gem_id         Tensor                     string
gem_parent_id  Tensor                     string
references     Sequence(Tensor)  (None,)  string
target         Tensor                     string
text           Tensor                     string
title          Tensor                     string
topic          Tensor                     string
url            Tensor                     string
  • Citation:
@inproceedings{scialom-etal-2020-mlsum,
    title = "{MLSUM}: The Multilingual Summarization Corpus",
    author = {Scialom, Thomas  and Dray, Paul-Alexis  and Lamprier, Sylvain  and Piwowarski, Benjamin  and Staiano, Jacopo},
    booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year = {2020}
}

gem/schema_guided_dialog

  • Config description: The Schema-Guided Dialogue (SGD) dataset contains 18K multi-domain task-oriented dialogues between a human and a virtual assistant, covering 17 domains ranging from banks and events to media, calendar, travel, and weather.

  • Download size: 17.00 MiB

  • Dataset size: 201.19 MiB

  • Auto-cached (documentation): Yes (challenge_test_backtranslation, challenge_test_bfp02, challenge_test_bfp05, challenge_test_nopunc, challenge_test_scramble, challenge_train_sample, challenge_validation_sample, test, validation), only when shuffle_files=False (train)

  • Splits:

Split Examples
'challenge_test_backtranslation' 500
'challenge_test_bfp02' 500
'challenge_test_bfp05' 500
'challenge_test_nopunc' 500
'challenge_test_scramble' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 10,000
'train' 164,982
'validation' 10,000
  • Feature structure:
FeaturesDict({
    'context': Sequence(string),
    'dialog_acts': Sequence({
        'act': ClassLabel(shape=(), dtype=int64, num_classes=18),
        'slot': string,
        'values': Sequence(string),
    }),
    'dialog_id': string,
    'gem_id': string,
    'gem_parent_id': string,
    'prompt': string,
    'references': Sequence(string),
    'service': string,
    'target': string,
    'turn_id': int32,
})
  • Feature documentation:
Feature             Class             Shape    Dtype   Description
                    FeaturesDict
context             Sequence(Tensor)  (None,)  string
dialog_acts         Sequence
dialog_acts/act     ClassLabel                 int64
dialog_acts/slot    Tensor                     string
dialog_acts/values  Sequence(Tensor)  (None,)  string
dialog_id           Tensor                     string
gem_id              Tensor                     string
gem_parent_id       Tensor                     string
prompt              Tensor                     string
references          Sequence(Tensor)  (None,)  string
service             Tensor                     string
target              Tensor                     string
turn_id             Tensor                     int32
  • Citation:
@article{rastogi2019towards,
  title={Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset},
  author={Rastogi, Abhinav and Zang, Xiaoxue and Sunkara, Srinivas and Gupta, Raghav and Khaitan, Pranav},
  journal={arXiv preprint arXiv:1909.05855},
  year={2019}
}
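In gem/schema_guided_dialog, `dialog_acts` is a sequence of `{act, slot, values}` entries. The sketch below assumes the record has already been converted to plain Python and that the sequence arrives as a dict of parallel lists, which is how TFDS typically exposes a `Sequence` of dicts. The act ids, the `ACT_NAMES` mapping and the sample values are hypothetical; the real names of the 18 act classes are not listed on this page.

```python
def acts_to_records(dialog_acts):
    """Turn {'act': [...], 'slot': [...], 'values': [[...], ...]} into a list of dicts."""
    return [
        {"act": act, "slot": slot, "values": list(values)}
        for act, slot, values in zip(
            dialog_acts["act"], dialog_acts["slot"], dialog_acts["values"]
        )
    ]


def linearize(records, act_names):
    """Render each act as NAME(slot=v1|v2), joined by spaces."""
    parts = []
    for record in records:
        # Fall back to a generic label when the id has no known name.
        name = act_names.get(record["act"], "ACT_%d" % record["act"])
        joined_values = "|".join(record["values"])
        parts.append("%s(%s=%s)" % (name, record["slot"], joined_values))
    return " ".join(parts)


# Hypothetical data: ids and names are placeholders, not the dataset's labels.
ACT_NAMES = {3: "INFORM"}
sample_acts = {
    "act": [3, 7],
    "slot": ["city", "date"],
    "values": [["Paris"], ["today", "tomorrow"]],
}

records = acts_to_records(sample_acts)
print(linearize(records, ACT_NAMES))
```

A flat string of this kind is one common way to feed structured dialogue acts to a text-to-text model; other linearizations are equally valid.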

gem/totto

  • Config description: ToTTo is a Table-to-Text NLG task. The task is as follows: given a Wikipedia table with row names, column names and table cells, with a subset of cells highlighted, generate a natural language description for the highlighted part of the table.

  • Download size: 180.75 MiB

  • Dataset size: 645.86 MiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'challenge_test_scramble' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 7,700
'train' 121,153
'validation' 7,700
  • Feature structure:
FeaturesDict({
    'example_id': string,
    'gem_id': string,
    'gem_parent_id': string,
    'highlighted_cells': Sequence(Sequence(int32)),
    'overlap_subset': string,
    'references': Sequence(string),
    'sentence_annotations': Sequence({
        'final_sentence': string,
        'original_sentence': string,
        'sentence_after_ambiguity': string,
        'sentence_after_deletion': string,
    }),
    'table': Sequence(Sequence({
        'column_span': int32,
        'is_header': bool,
        'row_span': int32,
        'value': string,
    })),
    'table_page_title': string,
    'table_section_text': string,
    'table_section_title': string,
    'table_webpage_url': string,
    'target': string,
    'totto_id': int32,
})
  • Feature documentation:
Feature                                        Class                       Shape         Dtype   Description
                                               FeaturesDict
example_id                                     Tensor                                    string
gem_id                                         Tensor                                    string
gem_parent_id                                  Tensor                                    string
highlighted_cells                              Sequence(Sequence(Tensor))  (None, None)  int32
overlap_subset                                 Tensor                                    string
references                                     Sequence(Tensor)            (None,)       string
sentence_annotations                           Sequence
sentence_annotations/final_sentence            Tensor                                    string
sentence_annotations/original_sentence         Tensor                                    string
sentence_annotations/sentence_after_ambiguity  Tensor                                    string
sentence_annotations/sentence_after_deletion   Tensor                                    string
table                                          Sequence
table/column_span                              Tensor                                    int32
table/is_header                                Tensor                                    bool
table/row_span                                 Tensor                                    int32
table/value                                    Tensor                                    string
table_page_title                               Tensor                                    string
table_section_text                             Tensor                                    string
table_section_title                            Tensor                                    string
table_webpage_url                              Tensor                                    string
target                                         Tensor                                    string
totto_id                                       Tensor                                    int32
  • Citation:
@inproceedings{parikh2020totto,
  title={ToTTo: A Controlled Table-To-Text Generation Dataset},
  author={Parikh, Ankur and Wang, Xuezhi and Gehrmann, Sebastian and Faruqui, Manaal and Dhingra, Bhuwan and Yang, Diyi and Das, Dipanjan},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  pages={1173--1186},
  year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a} }o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/web_nlg_en

  • Config description: WebNLG is a bilingual dataset (English, Russian) of parallel DBpedia triple sets and short texts that covers about 450 different DBpedia properties. The WebNLG data was originally created to promote the development of RDF verbalizers able to generate short text and to handle micro-planning.

  • Download size: 12.57 MiB

  • Dataset size: 19.91 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'challenge_test_numbers' 500
'challenge_test_scramble' 500
'challenge_train_sample' 502
'challenge_validation_sample' 499
'test' 1,779
'train' 35,426
'validation' 1,667
  • Feature structure:
FeaturesDict({
    'category': string,
    'gem_id': string,
    'gem_parent_id': string,
    'input': Sequence(string),
    'references': Sequence(string),
    'target': string,
    'webnlg_id': string,
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
category Tensor string
gem_id Tensor string
gem_parent_id Tensor string
input Sequence(Tensor) (None,) string
references Sequence(Tensor) (None,) string
target Tensor string
webnlg_id Tensor string
  • Citation:
@inproceedings{gardent2017creating,
  author = "Gardent, Claire
    and Shimorina, Anastasia
    and Narayan, Shashi
    and Perez-Beltrachini, Laura",
  title = "Creating Training Corpora for NLG Micro-Planners",
  booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  year = "2017",
  publisher = "Association for Computational Linguistics",
  pages = "179--188",
  location = "Vancouver, Canada",
  doi = "10.18653/v1/P17-1017",
  url = "http://www.aclweb.org/anthology/P17-1017"
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
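
As an illustration of the gem/web_nlg_en feature structure, the sketch below walks one hand-written record with the same keys as the FeaturesDict. It assumes that each string in `input` holds a single triple written as `subject | predicate | object`; that separator, the helper names `parse_triple` and `linearize`, and every value in `example` are assumptions made for illustration and are not taken from the dataset.

```python
from typing import Dict, List, Tuple

SEPARATOR = " | "  # assumed field separator inside each `input` string


def parse_triple(triple: str) -> Tuple[str, str, str]:
    """Split one 'subject | predicate | object' string into its three fields."""
    parts = [part.strip() for part in triple.split(SEPARATOR)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {triple!r}")
    return parts[0], parts[1], parts[2]


def linearize(triples: List[str]) -> str:
    """Join a triple set into one flat string, e.g. as text input for a model."""
    return " ; ".join(" ".join(parse_triple(t)) for t in triples)


# Hand-written record with the same keys as the FeaturesDict; values are made up.
example: Dict[str, object] = {
    "category": "Airport",
    "gem_id": "hypothetical-id-0",
    "gem_parent_id": "hypothetical-id-0",
    "input": ["Aarhus_Airport | cityServed | Aarhus"],
    "references": ["Aarhus Airport serves the city of Aarhus."],
    "target": "Aarhus Airport serves the city of Aarhus.",
    "webnlg_id": "hypothetical/0",
}

print(linearize(example["input"]))
```

If the real separator differs, only `SEPARATOR` needs to change; the length check in `parse_triple` will flag strings that do not split into exactly three fields.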

gem/web_nlg_ru

  • Config description: WebNLG is a bilingual dataset (English, Russian) of parallel DBpedia triple sets and short texts that covers about 450 different DBpedia properties. The WebNLG data was originally created to promote the development of RDF verbalizers able to generate short text and to handle micro-planning.

  • Download size: 7.49 MiB

  • Dataset size: 11.30 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'challenge_test_scramble' 500
'challenge_train_sample' 501
'challenge_validation_sample' 500
'test' 1,102
'train' 14,630
'validation' 790
  • Feature structure:
FeaturesDict({
    'category': string,
    'gem_id': string,
    'gem_parent_id': string,
    'input': Sequence(string),
    'references': Sequence(string),
    'target': string,
    'webnlg_id': string,
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
category Tensor string
gem_id Tensor string
gem_parent_id Tensor string
input Sequence(Tensor) (None,) string
references Sequence(Tensor) (None,) string
target Tensor string
webnlg_id Tensor string
  • Citation:
@inproceedings{gardent2017creating,
  author = "Gardent, Claire
    and Shimorina, Anastasia
    and Narayan, Shashi
    and Perez-Beltrachini, Laura",
  title = "Creating Training Corpora for NLG Micro-Planners",
  booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  year = "2017",
  publisher = "Association for Computational Linguistics",
  pages = "179--188",
  location = "Vancouver, Canada",
  doi = "10.18653/v1/P17-1017",
  url = "http://www.aclweb.org/anthology/P17-1017"
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_auto_asset_turk

  • Config description: WikiAuto provides a set of aligned sentences from English Wikipedia and Simple English Wikipedia as a resource to train sentence simplification systems. ASSET and TURK are high-quality simplification datasets used for testing.

  • Download size: 121.01 MiB

  • Dataset size: 202.40 MiB

  • Auto-cached (documentation): Yes (challenge_test_asset_backtranslation, challenge_test_asset_bfp02, challenge_test_asset_bfp05, challenge_test_asset_nopunc, challenge_test_turk_backtranslation, challenge_test_turk_bfp02, challenge_test_turk_bfp05, challenge_test_turk_nopunc, challenge_train_sample, challenge_validation_sample, test_asset, test_turk, validation), Only when shuffle_files=False (train)

  • Splits:

Split Examples
'challenge_test_asset_backtranslation' 359
'challenge_test_asset_bfp02' 359
'challenge_test_asset_bfp05' 359
'challenge_test_asset_nopunc' 359
'challenge_test_turk_backtranslation' 359
'challenge_test_turk_bfp02' 359
'challenge_test_turk_bfp05' 359
'challenge_test_turk_nopunc' 359
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test_asset' 359
'test_turk' 359
'train' 483,801
'validation' 20,000
  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'target': string,
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
gem_id Tensor string
gem_parent_id Tensor string
references Sequence(Tensor) (None,) string
source Tensor string
target Tensor string
  • Citation:
@inproceedings{jiang-etal-2020-neural,
    title = "Neural {CRF} Model for Sentence Alignment in Text Simplification",
    author = "Jiang, Chao  and
      Maddela, Mounica  and
      Lan, Wuwei  and
      Zhong, Yang  and
      Xu, Wei",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.709",
    doi = "10.18653/v1/2020.acl-main.709",
    pages = "7943--7960",
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
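
The auto-cached entry for gem/wiki_auto_asset_turk applies to the train split only when shuffle_files=False. The sketch below shows one way to pass that flag through `tfds.load`. The helper names `load_kwargs` and `load_split` are hypothetical, and the TFDS import is kept inside `load_split` so that the argument-building part runs without the library installed.

```python
from typing import Any, Dict

CONFIG = "gem/wiki_auto_asset_turk"


def load_kwargs(split: str, shuffle: bool = False) -> Dict[str, Any]:
    """Build keyword arguments for tfds.load for one split of this config."""
    return {"name": CONFIG, "split": split, "shuffle_files": shuffle}


def load_split(split: str, shuffle: bool = False):
    """Load one split; downloads and prepares the data on first use."""
    # Imported lazily: requires the `tensorflow-datasets` package.
    import tensorflow_datasets as tfds

    return tfds.load(**load_kwargs(split, shuffle))


# Leaving shuffle at its default of False keeps the train split eligible for
# auto-caching, according to the auto-cached entry for this config.
print(load_kwargs("train"))
```

Calling `load_split("train")` would then return a `tf.data.Dataset`; shuffling can still be applied afterwards at the example level with the dataset's own `shuffle` method.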

gem/xsum

  • Config description: The dataset is for the task of abstractive summarization in its extreme form: summarizing a document in a single sentence.

  • Download size: 246.31 MiB

  • Dataset size: 78.89 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'challenge_test_backtranslation' 500
'challenge_test_bfp_02' 500
'challenge_test_bfp_05' 500
'challenge_test_covid' 401
'challenge_test_nopunc' 500
'challenge_train_sample' 500
'challenge_validation_sample' 500
'test' 1,166
'train' 23,206
'validation' 1,117
  • Feature structure:
FeaturesDict({
    'document': string,
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'target': string,
    'xsum_id': string,
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
document Tensor string
gem_id Tensor string
gem_parent_id Tensor string
references Sequence(Tensor) (None,) string
target Tensor string
xsum_id Tensor string
  • Citation:
@inproceedings{Narayan2018dont,
  author = "Shashi Narayan and Shay B. Cohen and Mirella Lapata",
  title = "Don't Give Me the Details, Just the Summary! {T}opic-Aware Convolutional Neural Networks for Extreme Summarization",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  year = "2018",
  address = "Brussels, Belgium",
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
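
The gem/xsum split table mixes the three standard splits with several challenge sets. The sketch below separates the two groups by name. Treating the `challenge_` prefix as the marker is an assumption inferred from the split names in the table, and `partition_splits` is a hypothetical helper, not part of TFDS.

```python
from typing import Dict, Tuple

# Split sizes copied from the gem/xsum split table.
XSUM_SPLITS: Dict[str, int] = {
    "challenge_test_backtranslation": 500,
    "challenge_test_bfp_02": 500,
    "challenge_test_bfp_05": 500,
    "challenge_test_covid": 401,
    "challenge_test_nopunc": 500,
    "challenge_train_sample": 500,
    "challenge_validation_sample": 500,
    "test": 1166,
    "train": 23206,
    "validation": 1117,
}


def partition_splits(splits: Dict[str, int]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (standard, challenge) dicts, keyed on the 'challenge_' name prefix."""
    challenge = {k: v for k, v in splits.items() if k.startswith("challenge_")}
    standard = {k: v for k, v in splits.items() if k not in challenge}
    return standard, challenge


standard, challenge = partition_splits(XSUM_SPLITS)
print(sorted(standard))         # names of the standard splits
print(sum(challenge.values()))  # total examples across the challenge sets
```

The same helper works for the other configs on this page if their split tables are copied into a dictionary in the same way.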

gem/wiki_lingua_arabic_ar

  • Config description: Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

  • Download size: 56.25 MiB

  • Dataset size: 291.42 MiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'test' 5,841
'train' 20,441
'validation' 2,919
  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'ar': Text(shape=(), dtype=string),
        'en': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'ar': Text(shape=(), dtype=string),
        'en': Text(shape=(), dtype=string),
    }),
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
gem_id Tensor string
gem_parent_id Tensor string
references Sequence(Tensor) (None,) string
source Tensor string
source_aligned Translation
source_aligned/ar Text string
source_aligned/en Text string
target Tensor string
target_aligned Translation
target_aligned/ar Text string
target_aligned/en Text string
  • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
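
In the gem/wiki_lingua_arabic_ar feature structure, `source_aligned` and `target_aligned` are Translation features, which TFDS exposes as dictionaries keyed by language code. The sketch below reads one language out of both. The record is a hand-written plain dictionary standing in for a decoded example, all of its values are placeholders, and `aligned_pair` is a hypothetical helper.

```python
from typing import Dict, Tuple

# Hand-written record shaped like the FeaturesDict; all values are placeholders.
example: Dict[str, object] = {
    "gem_id": "hypothetical-id-0",
    "gem_parent_id": "hypothetical-id-0",
    "references": ["<summary text>"],
    "source": "<article text>",
    "source_aligned": {"ar": "<article text, ar>", "en": "<article text, en>"},
    "target": "<summary text>",
    "target_aligned": {"ar": "<summary text, ar>", "en": "<summary text, en>"},
}


def aligned_pair(record: Dict[str, object], lang: str) -> Tuple[str, str]:
    """Return (article, summary) in one language from the aligned sub-features."""
    source = record["source_aligned"]
    target = record["target_aligned"]
    if lang not in source or lang not in target:
        raise KeyError(f"language {lang!r} not present in this record")
    return source[lang], target[lang]


print(aligned_pair(example, "en"))
```

With real TFDS examples the values arrive as string tensors (or bytes after conversion to NumPy), so a decoding step would be needed before treating them as Python strings.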

gem/wiki_lingua_chinese_zh

  • Config description: Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

  • Download size: 31.38 MiB

  • Dataset size: 122.06 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'test' 3,775
'train' 13,211
'validation' 1,886
  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'zh': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'zh': Text(shape=(), dtype=string),
    }),
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
gem_id Tensor string
gem_parent_id Tensor string
references Sequence(Tensor) (None,) string
source Tensor string
source_aligned Translation
source_aligned/en Text string
source_aligned/zh Text string
target Tensor string
target_aligned Translation
target_aligned/en Text string
target_aligned/zh Text string
  • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_czech_cs

  • Config description: Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

  • Download size: 13.84 MiB

  • Dataset size: 58.05 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'test' 1,438
'train' 5,033
'validation' 718
  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'cs': Text(shape=(), dtype=string),
        'en': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'cs': Text(shape=(), dtype=string),
        'en': Text(shape=(), dtype=string),
    }),
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
gem_id Tensor string
gem_parent_id Tensor string
references Sequence(Tensor) (None,) string
source Tensor string
source_aligned Translation
source_aligned/cs Text string
source_aligned/en Text string
target Tensor string
target_aligned Translation
target_aligned/cs Text string
target_aligned/en Text string
  • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
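
The wiki_lingua configs on this page appear to follow the naming pattern `gem/wiki_lingua_<language>_<code>`, where `<code>` matches the non-English key of the aligned Translation features. The sketch below splits a config name on that pattern. The pattern is inferred from the config names listed here rather than from a documented rule, and `parse_config` is a hypothetical helper.

```python
from typing import Tuple

PREFIX = "gem/wiki_lingua_"


def parse_config(name: str) -> Tuple[str, str]:
    """Split 'gem/wiki_lingua_<language>_<code>' into (language, code)."""
    if not name.startswith(PREFIX):
        raise ValueError(f"not a wiki_lingua config: {name!r}")
    # The language code is whatever follows the last underscore.
    language, _, code = name[len(PREFIX):].rpartition("_")
    if not language or not code:
        raise ValueError(f"missing language or code in {name!r}")
    return language, code


for config in (
    "gem/wiki_lingua_arabic_ar",
    "gem/wiki_lingua_chinese_zh",
    "gem/wiki_lingua_czech_cs",
):
    language, code = parse_config(config)
    # Expected keys of source_aligned / target_aligned for this config.
    print(f"{config}: {language}, aligned keys {sorted({code, 'en'})}")
```

Using `rpartition` rather than `split` keeps the helper correct for any language name that itself contains an underscore.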

gem/wiki_lingua_dutch_nl

  • Config description: Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.

  • Download size: 53.88 MiB

  • Dataset size: 237.97 MiB

  • Auto-cached (documentation): Yes (test, validation), Only when shuffle_files=False (train)

  • Splits:

Split Examples
'test' 6,248
'train' 21,866
'validation' 3,123
  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'nl': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'nl': Text(shape=(), dtype=string),
    }),
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
gem_id Tensor string
gem_parent_id Tensor string
references Sequence(Tensor) (None,) string
source Tensor string
source_aligned Translation
source_aligned/en Text string
source_aligned/nl Text string
target Tensor string
target_aligned Translation
target_aligned/en Text string
target_aligned/nl Text string
  • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_english_en

  • Config description: WikiLingua is a large-scale multilingual dataset for evaluating cross-lingual abstractive summarization systems.

  • Download size: 112.56 MiB

  • Dataset size: 657.51 MiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'test' 28,614
'train' 99,020
'validation' 13,823
  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
    }),
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
gem_id Tensor string
gem_parent_id Tensor string
references Sequence(Tensor) (None,) string
source Tensor string
source_aligned Translation
source_aligned/en Text string
target Tensor string
target_aligned Translation
target_aligned/en Text string
  • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
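
The config above can be requested by name through TensorFlow Datasets. The following is a minimal sketch, assuming `tensorflow_datasets` is installed; the helper names `gem_config_name` and `load_wiki_lingua` are hypothetical and not part of any library. Only `tfds.load` with a `dataset/config` name and a `split` argument is relied on.

```python
def gem_config_name(language: str, code: str) -> str:
    """Build a config name such as 'gem/wiki_lingua_english_en'."""
    return f"gem/wiki_lingua_{language}_{code}"


def load_wiki_lingua(language: str, code: str, split: str = "validation"):
    """Load one split of a WikiLingua config (hypothetical helper)."""
    # Imported lazily so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(gem_config_name(language, code), split=split)
```

For example, `load_wiki_lingua("english", "en")` would request the 13,823-example validation split. Because this config is not auto-cached, the first call is likely to download about 112.56 MiB and prepare about 657.51 MiB on disk.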

gem/wiki_lingua_french_fr

  • Config description: WikiLingua is a large-scale multilingual dataset for evaluating cross-lingual abstractive summarization systems.

  • Download size: 113.26 MiB

  • Dataset size: 522.28 MiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'test' 12,731
'train' 44,556
'validation' 6,364
  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'fr': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'fr': Text(shape=(), dtype=string),
    }),
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
gem_id Tensor string
gem_parent_id Tensor string
references Sequence(Tensor) (None,) string
source Tensor string
source_aligned Translation
source_aligned/en Text string
source_aligned/fr Text string
target Tensor string
target_aligned Translation
target_aligned/en Text string
target_aligned/fr Text string
  • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
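
In the feature structure above, `source_aligned` and `target_aligned` are `Translation` features, which decode to dictionaries keyed by language code. The sketch below shows one way to pull out the text for a single language. It assumes the example has already been converted to plain Python or NumPy values (for instance with `tfds.as_numpy`, which yields `bytes` for string features). The helper `aligned_texts` and the sample values are hypothetical.

```python
from typing import Tuple


def aligned_texts(example: dict, lang: str) -> Tuple[str, str]:
    """Return (source, target) text for one language of a decoded example."""

    def as_str(value) -> str:
        # String features arrive as bytes after NumPy conversion; decode them.
        return value.decode("utf-8") if isinstance(value, bytes) else str(value)

    return (
        as_str(example["source_aligned"][lang]),
        as_str(example["target_aligned"][lang]),
    )


# Invented example shaped like the FeaturesDict of the French config.
example = {
    "gem_id": b"demo-0",
    "gem_parent_id": b"demo-0",
    "references": [b"resume en francais"],
    "source": b"article en francais",
    "source_aligned": {"en": b"article in English", "fr": b"article en francais"},
    "target": b"resume en francais",
    "target_aligned": {"en": b"summary in English", "fr": b"resume en francais"},
}

french_source, french_target = aligned_texts(example, "fr")
```

Passing `"en"` instead of `"fr"` returns the English side of the same aligned pair.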

gem/wiki_lingua_german_de

  • Config description: WikiLingua is a large-scale multilingual dataset for evaluating cross-lingual abstractive summarization systems.

  • Download size: 102.65 MiB

  • Dataset size: 452.46 MiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'test' 11,669
'train' 40,839
'validation' 5,833
  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'de': Text(shape=(), dtype=string),
        'en': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'de': Text(shape=(), dtype=string),
        'en': Text(shape=(), dtype=string),
    }),
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
gem_id Tensor string
gem_parent_id Tensor string
references Sequence(Tensor) (None,) string
source Tensor string
source_aligned Translation
source_aligned/de Text string
source_aligned/en Text string
target Tensor string
target_aligned Translation
target_aligned/de Text string
target_aligned/en Text string
  • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_hindi_hi

  • Config description: WikiLingua is a large-scale multilingual dataset for evaluating cross-lingual abstractive summarization systems.

  • Download size: 20.07 MiB

  • Dataset size: 138.06 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'test' 1,984
'train' 6,942
'validation' 991
  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'hi': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'hi': Text(shape=(), dtype=string),
    }),
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
gem_id Tensor string
gem_parent_id Tensor string
references Sequence(Tensor) (None,) string
source Tensor string
source_aligned Translation
source_aligned/en Text string
source_aligned/hi Text string
target Tensor string
target_aligned Translation
target_aligned/en Text string
target_aligned/hi Text string
  • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_indonesian_id

  • Config description: WikiLingua is a large-scale multilingual dataset for evaluating cross-lingual abstractive summarization systems.

  • Download size: 80.08 MiB

  • Dataset size: 370.63 MiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'test' 9,497
'train' 33,237
'validation' 4,747
  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'id': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'id': Text(shape=(), dtype=string),
    }),
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
gem_id Tensor string
gem_parent_id Tensor string
references Sequence(Tensor) (None,) string
source Tensor string
source_aligned Translation
source_aligned/en Text string
source_aligned/id Text string
target Tensor string
target_aligned Translation
target_aligned/en Text string
target_aligned/id Text string
  • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_italian_it

  • Config description: WikiLingua is a large-scale multilingual dataset for evaluating cross-lingual abstractive summarization systems.

  • Download size: 84.80 MiB

  • Dataset size: 374.40 MiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'test' 10,189
'train' 35,661
'validation' 5,093
  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'it': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'it': Text(shape=(), dtype=string),
    }),
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
gem_id Tensor string
gem_parent_id Tensor string
references Sequence(Tensor) (None,) string
source Tensor string
source_aligned Translation
source_aligned/en Text string
source_aligned/it Text string
target Tensor string
target_aligned Translation
target_aligned/en Text string
target_aligned/it Text string
  • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_japanese_ja

  • Config description: WikiLingua is a large-scale multilingual dataset for evaluating cross-lingual abstractive summarization systems.

  • Download size: 21.75 MiB

  • Dataset size: 103.19 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'test' 2,530
'train' 8,853
'validation' 1,264
  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'ja': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'ja': Text(shape=(), dtype=string),
    }),
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
gem_id Tensor string
gem_parent_id Tensor string
references Sequence(Tensor) (None,) string
source Tensor string
source_aligned Translation
source_aligned/en Text string
source_aligned/ja Text string
target Tensor string
target_aligned Translation
target_aligned/en Text string
target_aligned/ja Text string
  • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_korean_ko

  • Config description: WikiLingua is a large-scale multilingual dataset for evaluating cross-lingual abstractive summarization systems.

  • Download size: 22.26 MiB

  • Dataset size: 102.35 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'test' 2,436
'train' 8,524
'validation' 1,216
  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'ko': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'ko': Text(shape=(), dtype=string),
    }),
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
gem_id Tensor string
gem_parent_id Tensor string
references Sequence(Tensor) (None,) string
source Tensor string
source_aligned Translation
source_aligned/en Text string
source_aligned/ko Text string
target Tensor string
target_aligned Translation
target_aligned/en Text string
target_aligned/ko Text string
  • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_portuguese_pt

  • Config description: WikiLingua is a large-scale multilingual dataset for evaluating cross-lingual abstractive summarization systems.

  • Download size: 131.17 MiB

  • Dataset size: 570.46 MiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'test' 16,331
'train' 57,159
'validation' 8,165
  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'pt': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'pt': Text(shape=(), dtype=string),
    }),
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
gem_id Tensor string
gem_parent_id Tensor string
references Sequence(Tensor) (None,) string
source Tensor string
source_aligned Translation
source_aligned/en Text string
source_aligned/pt Text string
target Tensor string
target_aligned Translation
target_aligned/en Text string
target_aligned/pt Text string
  • Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
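
As a usage sketch for the gem/wiki_lingua_portuguese_pt configuration above, the snippet below loads one split with TensorFlow Datasets and prints the start of an article, its summary, and the aligned English article. The helper names (`config_name`, `load_split`, `preview`) are illustrative and not part of TFDS. The feature keys are taken from the feature structure listed for this configuration, and the first call is assumed to download and prepare the data.

```python
def config_name(language: str, code: str) -> str:
    """Build a config name such as 'gem/wiki_lingua_portuguese_pt'."""
    return f"gem/wiki_lingua_{language}_{code}"


def load_split(language: str, code: str, split: str = "validation"):
    """Return one split as a tf.data.Dataset (hypothetical helper)."""
    # Imported lazily so the pure-Python helper above works without TFDS.
    import tensorflow_datasets as tfds

    return tfds.load(config_name(language, code), split=split)


def preview(language: str, code: str, chars: int = 200) -> None:
    """Print the start of the first article, its summary, and the English side."""
    for example in load_split(language, code).take(1):
        # String features arrive as scalar byte tensors, so decode them.
        print(example["source"].numpy().decode("utf-8")[:chars])
        print(example["target"].numpy().decode("utf-8")[:chars])
        print(example["source_aligned"]["en"].numpy().decode("utf-8")[:chars])
```

Calling `preview("portuguese", "pt")` would exercise this configuration; the same helpers should apply to the other WikiLingua configurations by changing the language name and code.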

gem/wiki_lingua_russian_ru

  • Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.

  • Download size: 101.36 MiB

  • Dataset size: 564.69 MiB

  • Auto-cached (documentation): No

  • Splits:

| Split | Examples |
| --- | --- |
| 'test' | 10,580 |
| 'train' | 37,028 |
| 'validation' | 5,288 |

  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'ru': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'ru': Text(shape=(), dtype=string),
    }),
})
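  • Feature documentation:

| Feature | Class | Shape | Dtype | Description |
| --- | --- | --- | --- | --- |
| | FeaturesDict | | | |
| gem_id | Tensor | | string | |
| gem_parent_id | Tensor | | string | |
| references | Sequence(Tensor) | (None,) | string | |
| source | Tensor | | string | |
| source_aligned | Translation | | | |
| source_aligned/en | Text | | string | |
| source_aligned/ru | Text | | string | |
| target | Tensor | | string | |
| target_aligned | Translation | | | |
| target_aligned/en | Text | | string | |
| target_aligned/ru | Text | | string | |

  • Citation: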
@inproceedings{ladhak-wiki-2020,
  title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
  author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
  booktitle={Findings of EMNLP, 2020},
  year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
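
The split sizes listed for gem/wiki_lingua_russian_ru can be turned into proportions with a little arithmetic. The sketch below only uses the counts from the splits table above; the function name `split_shares` is illustrative.

```python
# Example counts copied from the splits table of gem/wiki_lingua_russian_ru.
SPLITS = {"test": 10_580, "train": 37_028, "validation": 5_288}


def split_shares(counts: dict) -> dict:
    """Return each split's share of all examples, in percent (one decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = split_shares(SPLITS)
print(shares)  # {'test': 20.0, 'train': 70.0, 'validation': 10.0}
```

For this configuration the counts work out to roughly 70% train, 20% test, and 10% validation.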

gem/wiki_lingua_spanish_es

  • Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.

  • Download size: 189.06 MiB

  • Dataset size: 849.75 MiB

  • Auto-cached (documentation): No

  • Splits:

| Split | Examples |
| --- | --- |
| 'test' | 22,632 |
| 'train' | 79,212 |
| 'validation' | 11,316 |

  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'es': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'es': Text(shape=(), dtype=string),
    }),
})
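  • Feature documentation:

| Feature | Class | Shape | Dtype | Description |
| --- | --- | --- | --- | --- |
| | FeaturesDict | | | |
| gem_id | Tensor | | string | |
| gem_parent_id | Tensor | | string | |
| references | Sequence(Tensor) | (None,) | string | |
| source | Tensor | | string | |
| source_aligned | Translation | | | |
| source_aligned/en | Text | | string | |
| source_aligned/es | Text | | string | |
| target | Tensor | | string | |
| target_aligned | Translation | | | |
| target_aligned/en | Text | | string | |
| target_aligned/es | Text | | string | |

  • Citation: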
@inproceedings{ladhak-wiki-2020,
  title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
  author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
  booktitle={Findings of EMNLP, 2020},
  year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_thai_th

  • Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.

  • Download size: 28.60 MiB

  • Dataset size: 193.77 MiB

  • Auto-cached (documentation): Yes (test, validation), only when shuffle_files=False (train)

  • Splits:

| Split | Examples |
| --- | --- |
| 'test' | 2,950 |
| 'train' | 10,325 |
| 'validation' | 1,475 |

  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'th': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'th': Text(shape=(), dtype=string),
    }),
})
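  • Feature documentation:

| Feature | Class | Shape | Dtype | Description |
| --- | --- | --- | --- | --- |
| | FeaturesDict | | | |
| gem_id | Tensor | | string | |
| gem_parent_id | Tensor | | string | |
| references | Sequence(Tensor) | (None,) | string | |
| source | Tensor | | string | |
| source_aligned | Translation | | | |
| source_aligned/en | Text | | string | |
| source_aligned/th | Text | | string | |
| target | Tensor | | string | |
| target_aligned | Translation | | | |
| target_aligned/en | Text | | string | |
| target_aligned/th | Text | | string | |

  • Citation: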
@inproceedings{ladhak-wiki-2020,
  title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
  author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
  booktitle={Findings of EMNLP, 2020},
  year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
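
The auto-caching entry for gem/wiki_lingua_thai_th says the train split is cached only when `shuffle_files=False`. The following is a minimal sketch, assuming you want to stay on the cached path and still read examples in random order: keep file shuffling off and shuffle at the example level with `tf.data` instead. `thai_load_kwargs` and `load_thai_train` are hypothetical helpers; `shuffle_files` is an existing `tfds.load` argument, and the buffer size is an assumption set to the 10,325 train examples listed above.

```python
TRAIN_EXAMPLES = 10_325  # from the splits table for gem/wiki_lingua_thai_th


def thai_load_kwargs(split: str) -> dict:
    """Keyword arguments for tfds.load that keep file shuffling disabled."""
    return {
        "name": "gem/wiki_lingua_thai_th",
        "split": split,
        "shuffle_files": False,
    }


def load_thai_train(seed: int = 0):
    """Load the train split without file shuffling, then shuffle examples."""
    import tensorflow_datasets as tfds  # lazy import: heavy dependency

    ds = tfds.load(**thai_load_kwargs("train"))
    # Example-level shuffling; the buffer covers the whole split.
    return ds.shuffle(TRAIN_EXAMPLES, seed=seed)
```

Whether the cache is actually used depends on the TFDS version and runtime settings, so treat this as one possible arrangement rather than a guarantee.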

gem/wiki_lingua_turkish_tr

  • Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.

  • Download size: 6.73 MiB

  • Dataset size: 30.75 MiB

  • Auto-cached (documentation): Yes

  • Splits:

| Split | Examples |
| --- | --- |
| 'test' | 900 |
| 'train' | 3,148 |
| 'validation' | 449 |

  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'tr': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'tr': Text(shape=(), dtype=string),
    }),
})
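  • Feature documentation:

| Feature | Class | Shape | Dtype | Description |
| --- | --- | --- | --- | --- |
| | FeaturesDict | | | |
| gem_id | Tensor | | string | |
| gem_parent_id | Tensor | | string | |
| references | Sequence(Tensor) | (None,) | string | |
| source | Tensor | | string | |
| source_aligned | Translation | | | |
| source_aligned/en | Text | | string | |
| source_aligned/tr | Text | | string | |
| target | Tensor | | string | |
| target_aligned | Translation | | | |
| target_aligned/en | Text | | string | |
| target_aligned/tr | Text | | string | |

  • Citation: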
@inproceedings{ladhak-wiki-2020,
  title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
  author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
  booktitle={Findings of EMNLP, 2020},
  year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_vietnamese_vi

  • Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.

  • Download size: 36.27 MiB

  • Dataset size: 179.77 MiB

  • Auto-cached (documentation): Yes

  • Splits:

| Split | Examples |
| --- | --- |
| 'test' | 3,917 |
| 'train' | 13,707 |
| 'validation' | 1,957 |

  • Feature structure:
FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'vi': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'vi': Text(shape=(), dtype=string),
    }),
})
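  • Feature documentation:

| Feature | Class | Shape | Dtype | Description |
| --- | --- | --- | --- | --- |
| | FeaturesDict | | | |
| gem_id | Tensor | | string | |
| gem_parent_id | Tensor | | string | |
| references | Sequence(Tensor) | (None,) | string | |
| source | Tensor | | string | |
| source_aligned | Translation | | | |
| source_aligned/en | Text | | string | |
| source_aligned/vi | Text | | string | |
| target | Tensor | | string | |
| target_aligned | Translation | | | |
| target_aligned/en | Text | | string | |
| target_aligned/vi | Text | | string | |

  • Citation: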
@inproceedings{ladhak-wiki-2020,
  title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
  author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
  booktitle={Findings of EMNLP, 2020},
  year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
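
The feature structure listed for gem/wiki_lingua_vietnamese_vi can also be read as a checklist for a single record. The sketch below is a pure-Python illustration, assuming the record has already been converted to plain Python values (strings, lists, dicts) rather than tensors; `check_example` and the sample record are hypothetical and not part of TFDS or GEM.

```python
# Scalar string features named in the feature structure.
SCALAR_KEYS = ("gem_id", "gem_parent_id", "source", "target")


def check_example(example: dict, lang: str) -> list:
    """Return a list of mismatches against the WikiLingua feature structure."""
    problems = []
    for key in SCALAR_KEYS:
        if not isinstance(example.get(key), str):
            problems.append(f"{key}: expected a string")
    refs = example.get("references")
    if not (isinstance(refs, list) and all(isinstance(r, str) for r in refs)):
        problems.append("references: expected a list of strings")
    for key in ("source_aligned", "target_aligned"):
        aligned = example.get(key)
        # Translation features hold one text per language code.
        if not isinstance(aligned, dict) or set(aligned) != {"en", lang}:
            problems.append(f"{key}: expected keys 'en' and '{lang}'")
    return problems


# Made-up record used only to exercise the check.
sample = {
    "gem_id": "id-0",
    "gem_parent_id": "id-0",
    "references": ["summary"],
    "source": "article",
    "source_aligned": {"en": "article", "vi": "article"},
    "target": "summary",
    "target_aligned": {"en": "summary", "vi": "summary"},
}
print(check_example(sample, "vi"))  # []
```

Passing a different language code, such as `"tr"`, would check a record from the corresponding configuration in the same way.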