natural_questions

  • Description:

The NQ corpus contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. The inclusion of real user questions, and the requirement that solutions should read an entire page to find the answer, cause NQ to be a more realistic and challenging task than prior QA datasets.

Split          Examples
'train'        307,373
'validation'   7,830

  • Citation:

@article{47761,
title = {Natural Questions: a Benchmark for Question Answering Research},
author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov},
year = {2019},
journal = {Transactions of the Association of Computational Linguistics}
}
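
  • Example usage:

A minimal loading sketch for the splits above, using the standard tensorflow_datasets API; the builder name 'natural_questions' comes from this entry, and the rest is illustrative rather than an official recipe:

import tensorflow_datasets as tfds

# Load the 'train' and 'validation' splits listed above. On first use,
# tfds.load downloads and prepares the data (the default config is ~90 GiB).
ds_train, ds_val = tfds.load('natural_questions', split=['train', 'validation'])

# Each element is a nested dict mirroring the feature structures documented below.
for example in ds_train.take(1):
    print(example['question']['text'].numpy().decode('utf-8'))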

natural_questions/default (default config)

  • Config description: Default natural_questions config

  • Dataset size: 90.26 GiB

  • Feature structure:

FeaturesDict({
    'annotations': Sequence({
        'id': string,
        'long_answer': FeaturesDict({
            'end_byte': int64,
            'end_token': int64,
            'start_byte': int64,
            'start_token': int64,
        }),
        'short_answers': Sequence({
            'end_byte': int64,
            'end_token': int64,
            'start_byte': int64,
            'start_token': int64,
            'text': Text(shape=(), dtype=string),
        }),
        'yes_no_answer': ClassLabel(shape=(), dtype=int64, num_classes=2),
    }),
    'document': FeaturesDict({
        'html': Text(shape=(), dtype=string),
        'title': Text(shape=(), dtype=string),
        'tokens': Sequence({
            'is_html': bool,
            'token': Text(shape=(), dtype=string),
        }),
        'url': Text(shape=(), dtype=string),
    }),
    'id': string,
    'question': FeaturesDict({
        'text': Text(shape=(), dtype=string),
        'tokens': Sequence(string),
    }),
})
  • Feature documentation:
Feature                                 Class             Shape    Dtype   Description
                                        FeaturesDict
annotations                             Sequence
annotations/id                          Tensor                     string
annotations/long_answer                 FeaturesDict
annotations/long_answer/end_byte        Tensor                     int64
annotations/long_answer/end_token       Tensor                     int64
annotations/long_answer/start_byte      Tensor                     int64
annotations/long_answer/start_token     Tensor                     int64
annotations/short_answers               Sequence
annotations/short_answers/end_byte      Tensor                     int64
annotations/short_answers/end_token     Tensor                     int64
annotations/short_answers/start_byte    Tensor                     int64
annotations/short_answers/start_token   Tensor                     int64
annotations/short_answers/text          Text                       string
annotations/yes_no_answer               ClassLabel                 int64
document                                FeaturesDict
document/html                           Text                       string
document/title                          Text                       string
document/tokens                         Sequence
document/tokens/is_html                 Tensor                     bool
document/tokens/token                   Text                       string
document/url                            Text                       string
id                                      Tensor                     string
question                                FeaturesDict
question/text                           Text                       string
question/tokens                         Sequence(Tensor)  (None,)  string
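
  • Example usage:

A sketch of reading the annotation offsets from the default config. It assumes, following the original NQ data format, that start_byte/end_byte index into the UTF-8 bytes of document/html and that -1 marks an absent long answer; verify both against the downloaded data:

import tensorflow_datasets as tfds

ds = tfds.load('natural_questions/default', split='validation')

for example in ds.take(1):
    html_bytes = example['document']['html'].numpy()  # page HTML as bytes
    question = example['question']['text'].numpy().decode('utf-8')

    # 'annotations' is a Sequence, so each leaf is a 1-D tensor with one
    # entry per annotation.
    starts = example['annotations']['long_answer']['start_byte'].numpy()
    ends = example['annotations']['long_answer']['end_byte'].numpy()
    for start, end in zip(starts, ends):
        if start >= 0:  # assumed "no long answer" sentinel is -1
            span = html_bytes[start:end].decode('utf-8', errors='ignore')
            print(question, '->', span[:200])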

natural_questions/longt5

  • Config description: natural_questions preprocessed as in the longT5 benchmark

  • Dataset size: 8.91 GiB

  • Feature structure:

FeaturesDict({
    'all_answers': Sequence(Text(shape=(), dtype=string)),
    'answer': Text(shape=(), dtype=string),
    'context': Text(shape=(), dtype=string),
    'id': Text(shape=(), dtype=string),
    'question': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})
  • Feature documentation:
Feature      Class           Shape    Dtype   Description
             FeaturesDict
all_answers  Sequence(Text)  (None,)  string
answer       Text                     string
context      Text                     string
id           Text                     string
question     Text                     string
title        Text                     string
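
  • Example usage:

A sketch of turning the flat longt5 features into seq2seq (inputs, targets) pairs; the 'question: ... context: ...' input layout is an assumed T5-style convention, not something the dataset itself prescribes:

import tensorflow as tf
import tensorflow_datasets as tfds

ds = tfds.load('natural_questions/longt5', split='train')

def to_seq2seq(example):
    # Concatenate question and context into a single model input, keeping the
    # reference answer as the target string.
    inputs = tf.strings.join(
        ['question: ', example['question'], ' context: ', example['context']])
    return {'inputs': inputs, 'targets': example['answer']}

ds = ds.map(to_seq2seq, num_parallel_calls=tf.data.AUTOTUNE)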