squad

Description:

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset of questions posed by crowdworkers on a set of Wikipedia articles. The answer to each question is a segment of text (a span) from the corresponding reading passage, or the question may be unanswerable.

Additional Documentation: Explore on Papers With Code
Homepage: https://rajpurkar.github.io/SQuAD-explorer/
Source code: tfds.datasets.squad.Builder
Versions:
- 3.0.0 (default): Fixes an issue with a small number of examples (19) whose answer spans were misaligned because whitespace had been removed from the context.
Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Citation:

@article{2016arXiv160605250R,
       author = { {Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev},
                 Konstantin and {Liang}, Percy},
        title = "{SQuAD: 100,000+ Questions for Machine Comprehension of Text}",
      journal = {arXiv e-prints},
         year = 2016,
          eid = {arXiv:1606.05250},
        pages = {arXiv:1606.05250},
archivePrefix = {arXiv},
       eprint = {1606.05250},
}

squad/v1.1 (default config)

Config description: Version 1.1.0 of SQuAD
Download size: 33.51 MiB
Dataset size: 94.06 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'train'`	87,599
`'validation'`	10,570
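
As a minimal sketch of how the config and split names above might be used, the snippet below builds the `squad/<config>` name and passes it to `tfds.load`. The helper names `squad_dataset_name` and `load_squad` are illustrative and not part of TFDS. Because the supervised keys are `None`, the sketch does not set `as_supervised` and each element is returned as a feature dictionary.

```python
# Config names listed on this page; the helper names below are illustrative.
SQUAD_CONFIGS = ("v1.1", "v2.0")


def squad_dataset_name(config="v1.1"):
    """Build the 'squad/<config>' name string passed to tfds.load."""
    if config not in SQUAD_CONFIGS:
        raise ValueError(f"unknown SQuAD config: {config!r}")
    return f"squad/{config}"


def load_squad(config="v1.1", split="validation"):
    """Load one split as a tf.data.Dataset together with its DatasetInfo."""
    # Imported lazily so the name helper above works without TFDS installed.
    import tensorflow_datasets as tfds

    # Supervised keys are None for this dataset, so as_supervised is not set
    # and every element is a dict keyed by feature name.
    return tfds.load(squad_dataset_name(config), split=split, with_info=True)
```

Calling `ds, info = load_squad("v1.1", "validation")` should download and prepare the data on first use. After that, `info.splits["validation"].num_examples` can be compared against the split table, and `tfds.as_dataframe(ds.take(4), info)` gives a quick tabular preview.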

Feature structure:

FeaturesDict({
    'answers': Sequence({
        'answer_start': int32,
        'text': Text(shape=(), dtype=string),
    }),
    'context': Text(shape=(), dtype=string),
    'id': string,
    'question': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})
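
The version 3.0.0 note mentions answer spans that were misaligned with the context. A check along the following lines can confirm that each `answer_start` points at its `text`. This is a sketch under two assumptions: that `answer_start` is a character offset into the decoded `context`, and that the `answers` sequence arrives as a dict of parallel lists (for example after `tfds.as_numpy`). The function names and the sample record are hypothetical.

```python
def to_str(value):
    """Decode UTF-8 bytes (as TFDS returns for string features) to str."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def answer_spans_align(example):
    """Return True if every answer text sits at its answer_start offset."""
    context = to_str(example["context"])
    texts = [to_str(t) for t in example["answers"]["text"]]
    starts = [int(s) for s in example["answers"]["answer_start"]]
    # Assumption: offsets count characters in the decoded context, not bytes.
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(texts, starts)
    )


# Hypothetical record shaped like the feature structure above.
sample = {
    "context": b"The Eiffel Tower is in Paris.",
    "id": b"example-0",
    "question": b"Where is the Eiffel Tower?",
    "title": b"Eiffel_Tower",
    "answers": {"text": [b"Paris"], "answer_start": [23]},
}
print(answer_spans_align(sample))
```

Decoding before slicing matters for contexts with non-ASCII characters, where byte offsets and character offsets diverge.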

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
answers	Sequence
answers/answer_start	Tensor	int32
answers/text	Text	string
context	Text	string
id	Tensor	string
question	Text	string
title	Text	string


squad/v2.0

Config description: Version 2.0.0 of SQuAD
Download size: 44.34 MiB
Dataset size: 148.54 MiB
Auto-cached (documentation): Yes (validation), Only when shuffle_files=False (train)
Splits:

Split	Examples
`'train'`	130,319
`'validation'`	11,873

Feature structure:

FeaturesDict({
    'answers': Sequence({
        'answer_start': int32,
        'text': Text(shape=(), dtype=string),
    }),
    'context': Text(shape=(), dtype=string),
    'id': string,
    'is_impossible': bool,
    'plausible_answers': Sequence({
        'answer_start': int32,
        'text': Text(shape=(), dtype=string),
    }),
    'question': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})
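
The `is_impossible` and `plausible_answers` features exist only in this config. One way to turn a v2.0 record into reference answers for evaluation is sketched below. It assumes that unanswerable questions carry `is_impossible=True` with their candidate spans stored under `plausible_answers` rather than `answers`, and it represents "no answer" as an empty string, which is a convention chosen for this example. The function name and sample records are hypothetical.

```python
def reference_answers(example):
    """Return the gold answer texts, or [""] if the question is unanswerable."""
    if bool(example["is_impossible"]):
        # plausible_answers are distractor spans, not gold answers, so they
        # are deliberately ignored here.
        return [""]
    return list(example["answers"]["text"])


# Hypothetical records shaped like the v2.0 feature structure above.
answerable = {
    "is_impossible": False,
    "answers": {"text": ["Paris"], "answer_start": [23]},
    "plausible_answers": {"text": [], "answer_start": []},
}
unanswerable = {
    "is_impossible": True,
    "answers": {"text": [], "answer_start": []},
    "plausible_answers": {"text": ["Paris"], "answer_start": [23]},
}
print(reference_answers(answerable))
print(reference_answers(unanswerable))
```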

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
answers	Sequence
answers/answer_start	Tensor	int32
answers/text	Text	string
context	Text	string
id	Tensor	string
is_impossible	Tensor	bool
plausible_answers	Sequence
plausible_answers/answer_start	Tensor	int32
plausible_answers/text	Text	string
question	Text	string
title	Text	string
