natural_questions

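Description:

The NQ corpus contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. The inclusion of real user questions, and the requirement that solutions read an entire page to find the answer, make NQ a more realistic and challenging task than prior QA datasets.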

Additional documentation: Explore on Papers With Code
Homepage: https://ai.google.com/research/NaturalQuestions/dataset
Source code: tfds.datasets.natural_questions.Builder
Versions:
- 0.0.2 : No release notes.
- 0.1.0 (default): No release notes.
Download size: 41.97 GiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'train'`	307,373
`'validation'`	7,830
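
The split names above can be combined with the TFDS split-slicing syntax to read only part of a split. The following is a minimal sketch, not an official recipe: the helper names `split_slice` and `load_subset` are hypothetical, and `tensorflow_datasets` is assumed to be installed. Note that slicing only limits what is read; the first call still downloads and prepares the full dataset.

```python
def split_slice(split: str, percent: int) -> str:
    """Build a TFDS split-slicing string such as 'train[:1%]'."""
    if split not in ("train", "validation"):
        raise ValueError(f"unknown split: {split!r}")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    return f"{split}[:{percent}%]"


def load_subset(split: str = "validation", percent: int = 1):
    """Load the first `percent` percent of a split of the default config.

    Hypothetical helper: the first call downloads and prepares the whole
    dataset, so enough disk space for the sizes listed on this page is needed.
    """
    # Imported lazily so split_slice() can be used without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load("natural_questions/default", split=split_slice(split, percent))
```

For example, `load_subset("validation", 1)` requests the split string `'validation[:1%]'`.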

Supervised keys (see as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Citation:

@article{47761,
title = {Natural Questions: a Benchmark for Question Answering Research},
author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov},
year = {2019},
journal = {Transactions of the Association of Computational Linguistics}
}

natural_questions/default (default config)

Config description: Default natural_questions config
Dataset size: 90.26 GiB
Feature structure:

FeaturesDict({
    'annotations': Sequence({
        'id': string,
        'long_answer': FeaturesDict({
            'end_byte': int64,
            'end_token': int64,
            'start_byte': int64,
            'start_token': int64,
        }),
        'short_answers': Sequence({
            'end_byte': int64,
            'end_token': int64,
            'start_byte': int64,
            'start_token': int64,
            'text': Text(shape=(), dtype=string),
        }),
        'yes_no_answer': ClassLabel(shape=(), dtype=int64, num_classes=2),
    }),
    'document': FeaturesDict({
        'html': Text(shape=(), dtype=string),
        'title': Text(shape=(), dtype=string),
        'tokens': Sequence({
            'is_html': bool,
            'token': Text(shape=(), dtype=string),
        }),
        'url': Text(shape=(), dtype=string),
    }),
    'id': string,
    'question': FeaturesDict({
        'text': Text(shape=(), dtype=string),
        'tokens': Sequence(string),
    }),
})
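
As an illustration of how the token offsets relate to the document tokens, the sketch below recovers the text of a long answer from one example. It rests on several assumptions that the structure above does not state: that `end_token` is exclusive, that a negative offset means "no long answer", and that a `Sequence` of dicts is exposed as a dict of parallel lists. The function name and the toy example are hypothetical, and real examples hold byte strings and arrays rather than Python `str` and `list`.

```python
def long_answer_text(example: dict, annotation_index: int = 0) -> str:
    """Join the non-HTML tokens inside one annotation's long-answer span."""
    long_answer = example["annotations"]["long_answer"]
    start = long_answer["start_token"][annotation_index]
    end = long_answer["end_token"][annotation_index]
    if start < 0 or end < 0:
        # Assumption: negative offsets mark a missing long answer.
        return ""
    tokens = example["document"]["tokens"]
    # Assumption: end_token is exclusive, so a plain slice covers the span.
    span = zip(tokens["token"][start:end], tokens["is_html"][start:end])
    return " ".join(token for token, is_html in span if not is_html)


# Hypothetical toy example that mirrors only the fields used above.
toy_example = {
    "annotations": {
        "long_answer": {"start_token": [1], "end_token": [6]},
    },
    "document": {
        "tokens": {
            "token": ["<H1>", "<P>", "Paris", "is", "the", "capital", "</P>"],
            "is_html": [True, True, False, False, False, False, True],
        },
    },
}

print(long_answer_text(toy_example))
```

Filtering on `is_html` drops markup tokens such as `<P>`, which leaves only the readable text of the span.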

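Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
annotations	Sequence
annotations/id	Tensor		string
annotations/long_answer	FeaturesDict
annotations/long_answer/end_byte	Tensor		int64
annotations/long_answer/end_token	Tensor		int64
annotations/long_answer/start_byte	Tensor		int64
annotations/long_answer/start_token	Tensor		int64
annotations/short_answers	Sequence
annotations/short_answers/end_byte	Tensor		int64
annotations/short_answers/end_token	Tensor		int64
annotations/short_answers/start_byte	Tensor		int64
annotations/short_answers/start_token	Tensor		int64
annotations/short_answers/text	Text		string
annotations/yes_no_answer	ClassLabel		int64
document	FeaturesDict
document/html	Text		string
document/title	Text		string
document/tokens	Sequence
document/tokens/is_html	Tensor		bool
document/tokens/token	Text		string
document/url	Text		string
id	Tensor		string
question	FeaturesDict
question/text	Text		string
question/tokens	Sequence(Tensor)	(None,)	string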

Examples (tfds.as_dataframe):

natural_questions/longt5

Config description: natural_questions preprocessed as in the longT5 benchmark
Dataset size: 8.91 GiB
Feature structure:

FeaturesDict({
    'all_answers': Sequence(Text(shape=(), dtype=string)),
    'answer': Text(shape=(), dtype=string),
    'context': Text(shape=(), dtype=string),
    'id': Text(shape=(), dtype=string),
    'question': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})
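
One plausible use of the `all_answers` field is to score a predicted answer against every reference answer. The sketch below is only an illustration under stated assumptions: the normalization (lowercasing and collapsing whitespace) is a simplification chosen here, not the official benchmark metric, and the function names and toy example are hypothetical.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (illustrative normalization only)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, example: dict) -> bool:
    """Return True if the prediction equals any reference in 'all_answers'."""
    references = {normalize(answer) for answer in example["all_answers"]}
    return normalize(prediction) in references


# Hypothetical toy example with the same keys as the structure above.
toy_example = {
    "id": "0",
    "title": "Paris",
    "question": "what is the capital of france",
    "context": "Paris is the capital of France.",
    "answer": "Paris",
    "all_answers": ["Paris", "Paris, France"],
}

print(exact_match("  PARIS ", toy_example))
```

When reading real examples, string features arrive as byte strings and would need decoding before a comparison like this.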

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
all_answers	Sequence(Text)	(None,)	string
answer	Text		string
context	Text		string
id	Text		string
question	Text		string
title	Text		string

Examples (tfds.as_dataframe):