natural_questions

Description:

The NQ corpus contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. The inclusion of real user questions, and the requirement that solutions read an entire page to find the answer, make NQ a more realistic and challenging task than prior QA datasets.

Additional documentation: Explore on Papers With Code
Homepage: https://ai.google.com/research/NaturalQuestions/dataset
Source code: tfds.datasets.natural_questions.Builder
Versions:
- 0.0.2: No release notes.
- 0.1.0 (default): No release notes.
Download size: 41.97 GiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'train'`	307,373
`'validation'`	7,830
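
As a minimal sketch of how the config names, the default version, and the splits above fit together, the helpers below build a `dataset/config:version` name and a split slice for `tfds.load`. The helper names (`builder_name`, `slice_spec`, `load_subset`) are hypothetical and not part of TFDS. The split sizes are copied from the table above and are assumed here to apply to whichever config is requested. Slicing a split limits what is read, but preparing the dataset for the first time still triggers the full download.

```python
DATASET = "natural_questions"
# Split sizes copied from the splits table above.
SPLIT_SIZES = {"train": 307_373, "validation": 7_830}


def builder_name(config: str = "default", version: str = "0.1.0") -> str:
    """Build a 'dataset/config:version' name string for tfds.load."""
    return f"{DATASET}/{config}:{version}"


def slice_spec(split: str, n: int) -> str:
    """Return a split spec for the first n examples, e.g. 'validation[:100]'."""
    if split not in SPLIT_SIZES:
        raise ValueError(f"unknown split: {split!r}")
    n = min(n, SPLIT_SIZES[split])  # never ask for more than the split holds
    return f"{split}[:{n}]"


def load_subset(config: str = "default", split: str = "validation", n: int = 100):
    """Load the first n examples of a split as a tf.data.Dataset."""
    # Imported lazily so the helpers above work without tensorflow-datasets.
    import tensorflow_datasets as tfds

    return tfds.load(builder_name(config), split=slice_spec(split, n))
```

For example, `load_subset("longt5", "validation", 50)` would request `natural_questions/longt5:0.1.0` with the split `validation[:50]`.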

Supervised keys (see as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Citation:

@article{47761,
title = {Natural Questions: a Benchmark for Question Answering Research},
author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov},
year = {2019},
journal = {Transactions of the Association of Computational Linguistics}
}

natural_questions/default (default config)

Config description: Default natural_questions config
Dataset size: 90.26 GiB
Feature structure:

FeaturesDict({
    'annotations': Sequence({
        'id': string,
        'long_answer': FeaturesDict({
            'end_byte': int64,
            'end_token': int64,
            'start_byte': int64,
            'start_token': int64,
        }),
        'short_answers': Sequence({
            'end_byte': int64,
            'end_token': int64,
            'start_byte': int64,
            'start_token': int64,
            'text': Text(shape=(), dtype=string),
        }),
        'yes_no_answer': ClassLabel(shape=(), dtype=int64, num_classes=2),
    }),
    'document': FeaturesDict({
        'html': Text(shape=(), dtype=string),
        'title': Text(shape=(), dtype=string),
        'tokens': Sequence({
            'is_html': bool,
            'token': Text(shape=(), dtype=string),
        }),
        'url': Text(shape=(), dtype=string),
    }),
    'id': string,
    'question': FeaturesDict({
        'text': Text(shape=(), dtype=string),
        'tokens': Sequence(string),
    }),
})
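
The structure above stores answers as token and byte offsets into the document rather than as text. The sketch below shows one way the long answer could be rebuilt from `document/tokens`. It assumes the example has already been converted to NumPy-style values (for instance via `tfds.as_numpy`), that offsets are start-inclusive and end-exclusive, and that a negative start marks a missing answer, as in the original NQ release; none of these conventions is stated on this page. The function name `long_answer_text` is hypothetical.

```python
def long_answer_text(example, annotation_index=0):
    """Return one annotation's long answer as plain text, or None if absent."""
    long_answer = example["annotations"]["long_answer"]
    start = int(long_answer["start_token"][annotation_index])
    end = int(long_answer["end_token"][annotation_index])
    # Assumption: a negative start (or an empty span) means "no long answer".
    if start < 0 or end <= start:
        return None

    tokens = example["document"]["tokens"]
    words = []
    for token, is_html in zip(tokens["token"][start:end],
                              tokens["is_html"][start:end]):
        if is_html:
            continue  # skip markup tokens such as <P> or </P>
        # TFDS string features arrive as bytes in NumPy form.
        words.append(token.decode("utf-8") if isinstance(token, bytes) else str(token))
    return " ".join(words)


# A hand-made toy example with the same nesting as the feature structure.
toy = {
    "annotations": {
        "long_answer": {"start_token": [1, -1], "end_token": [6, -1]},
    },
    "document": {
        "tokens": {
            "token": [b"<H1>", b"<P>", b"Paris", b"is", b"the", b"capital", b"</P>"],
            "is_html": [True, True, False, False, False, False, True],
        },
    },
}
print(long_answer_text(toy, 0))  # Paris is the capital
print(long_answer_text(toy, 1))  # None
```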

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
annotations	Sequence
annotations/id	Tensor		string
annotations/long_answer	FeaturesDict
annotations/long_answer/end_byte	Tensor		int64
annotations/long_answer/end_token	Tensor		int64
annotations/long_answer/start_byte	Tensor		int64
annotations/long_answer/start_token	Tensor		int64
annotations/short_answers	Sequence
annotations/short_answers/end_byte	Tensor		int64
annotations/short_answers/end_token	Tensor		int64
annotations/short_answers/start_byte	Tensor		int64
annotations/short_answers/start_token	Tensor		int64
annotations/short_answers/text	Text		string
annotations/yes_no_answer	ClassLabel		int64
document	FeaturesDict
document/html	Text		string
document/title	Text		string
document/tokens	Sequence
document/tokens/is_html	Tensor		bool
document/tokens/token	Text		string
document/url	Text		string
id	Tensor		string
question	FeaturesDict
question/text	Text		string
question/tokens	Sequence(Tensor)	(None,)	string
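
The slash-separated feature names in the table mirror the nesting of the example dictionary. As a small illustration, the hypothetical helpers below (`feature_paths` and `get_by_path` are not TFDS functions) list such paths for a nested dict and look a value up by path. The toy dictionary only imitates a fragment of the structure and contains made-up values.

```python
def feature_paths(spec, prefix=""):
    """Yield slash-separated paths for every key in a nested dict, parents first."""
    for key, value in spec.items():
        path = f"{prefix}/{key}" if prefix else key
        yield path
        if isinstance(value, dict):
            yield from feature_paths(value, path)


def get_by_path(example, path):
    """Follow a slash-separated path such as 'question/text' into a nested dict."""
    node = example
    for key in path.split("/"):
        node = node[key]
    return node


# Toy fragment shaped like the 'id' and 'question' features (values are made up).
toy_example = {
    "id": "0",
    "question": {"text": "who wrote hamlet", "tokens": ["who", "wrote", "hamlet"]},
}
print(list(feature_paths(toy_example)))
# ['id', 'question', 'question/text', 'question/tokens']
print(get_by_path(toy_example, "question/text"))  # who wrote hamlet
```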

Examples (tfds.as_dataframe):

natural_questions/longt5

Config description: natural_questions preprocessed as in the longT5 benchmark
Dataset size: 8.91 GiB
Feature structure:

FeaturesDict({
    'all_answers': Sequence(Text(shape=(), dtype=string)),
    'answer': Text(shape=(), dtype=string),
    'context': Text(shape=(), dtype=string),
    'id': Text(shape=(), dtype=string),
    'question': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})
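
Because every field in this config is plain text, an example can be turned directly into a source and target string pair for a text-to-text model. The sketch below is only an illustration: the `question: ... context: ...` prompt layout and the lowercase, whitespace-normalized match against `all_answers` are assumptions made here, not the format or metric used by the LongT5 benchmark, and the helper names are hypothetical.

```python
def as_str(value):
    """Decode bytes (the NumPy form of TFDS strings) to str; pass str through."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def to_text_pair(example):
    """Build (source, target) strings from one longt5-config example."""
    # Assumed prompt layout, chosen for illustration only.
    source = f"question: {as_str(example['question'])} context: {as_str(example['context'])}"
    target = as_str(example["answer"])
    return source, target


def matches_any_answer(prediction, example):
    """True if the prediction equals any entry of all_answers after normalization."""
    def normalize(text):
        return " ".join(as_str(text).lower().split())

    return normalize(prediction) in {normalize(a) for a in example["all_answers"]}


# Toy example with made-up values, shaped like the feature structure above.
toy = {
    "all_answers": [b"William Shakespeare", b"Shakespeare"],
    "answer": b"William Shakespeare",
    "context": b"Hamlet is a tragedy written by William Shakespeare.",
    "id": b"0",
    "question": b"who wrote hamlet",
    "title": b"Hamlet",
}
source, target = to_text_pair(toy)
print(source)
print(target)
print(matches_any_answer("  shakespeare ", toy))  # True
```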

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
all_answers	Sequence(Text)	(None,)	string
answer	Text		string
context	Text		string
id	Text		string
question	Text		string
title	Text		string

Examples (tfds.as_dataframe):