Google I / O kehrt vom 18. bis 20. Mai zurück! Reservieren Sie Platz und erstellen Sie Ihren Zeitplan Registrieren Sie sich jetzt
Diese Seite wurde von der Cloud Translation API übersetzt.
Switch to English

TensorFlow-Datensätze

TFDS bietet eine Sammlung gebrauchsfertiger Datensätze zur Verwendung mit TensorFlow, Jax und anderen Frameworks für maschinelles Lernen.

Es behandelt das deterministische Herunterladen und Aufbereiten der Daten sowie dastf.data.Dataset einestf.data.Dataset (oder np.array ).

Ansicht auf TensorFlow.org In Google Colab ausführen Quelle auf GitHub anzeigen Notizbuch herunterladen

Installation

TFDS existiert in zwei Paketen:

  • pip install tensorflow-datasets : Die stabile Version, die alle paar Monate veröffentlicht wird.
  • pip install tfds-nightly : pip install tfds-nightly jeden Tag veröffentlicht und enthält die letzten Versionen der Datensätze.

Dieses Colab verwendet tfds-nightly :

pip install -q tfds-nightly tensorflow matplotlib
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds

Finden Sie verfügbare Datensätze

Alle Dataset Builder sind Unterklassen von tfds.core.DatasetBuilder . Um die Liste der verfügbaren Builder tfds.list_builders() , verwenden Sie tfds.list_builders() oder sehen Sie sich unseren Katalog an .

tfds.list_builders()
['abstract_reasoning',
 'accentdb',
 'aeslc',
 'aflw2k3d',
 'ag_news_subset',
 'ai2_arc',
 'ai2_arc_with_ir',
 'amazon_us_reviews',
 'anli',
 'arc',
 'bair_robot_pushing_small',
 'bccd',
 'beans',
 'big_patent',
 'bigearthnet',
 'billsum',
 'binarized_mnist',
 'binary_alpha_digits',
 'blimp',
 'bool_q',
 'c4',
 'caltech101',
 'caltech_birds2010',
 'caltech_birds2011',
 'cars196',
 'cassava',
 'cats_vs_dogs',
 'celeb_a',
 'celeb_a_hq',
 'cfq',
 'cherry_blossoms',
 'chexpert',
 'cifar10',
 'cifar100',
 'cifar10_1',
 'cifar10_corrupted',
 'citrus_leaves',
 'cityscapes',
 'civil_comments',
 'clevr',
 'clic',
 'clinc_oos',
 'cmaterdb',
 'cnn_dailymail',
 'coco',
 'coco_captions',
 'coil100',
 'colorectal_histology',
 'colorectal_histology_large',
 'common_voice',
 'coqa',
 'cos_e',
 'cosmos_qa',
 'covid19sum',
 'crema_d',
 'curated_breast_imaging_ddsm',
 'cycle_gan',
 'dart',
 'davis',
 'deep_weeds',
 'definite_pronoun_resolution',
 'dementiabank',
 'diabetic_retinopathy_detection',
 'div2k',
 'dmlab',
 'dolphin_number_word',
 'downsampled_imagenet',
 'drop',
 'dsprites',
 'dtd',
 'duke_ultrasound',
 'e2e_cleaned',
 'efron_morris75',
 'emnist',
 'eraser_multi_rc',
 'esnli',
 'eurosat',
 'fashion_mnist',
 'flic',
 'flores',
 'food101',
 'forest_fires',
 'fuss',
 'gap',
 'geirhos_conflict_stimuli',
 'gem',
 'genomics_ood',
 'german_credit_numeric',
 'gigaword',
 'glue',
 'goemotions',
 'gpt3',
 'gref',
 'groove',
 'gtzan',
 'gtzan_music_speech',
 'hellaswag',
 'higgs',
 'horses_or_humans',
 'howell',
 'i_naturalist2017',
 'imagenet2012',
 'imagenet2012_corrupted',
 'imagenet2012_real',
 'imagenet2012_subset',
 'imagenet_a',
 'imagenet_r',
 'imagenet_resized',
 'imagenet_v2',
 'imagenette',
 'imagewang',
 'imdb_reviews',
 'irc_disentanglement',
 'iris',
 'kitti',
 'kmnist',
 'lambada',
 'lfw',
 'librispeech',
 'librispeech_lm',
 'libritts',
 'ljspeech',
 'lm1b',
 'lost_and_found',
 'lsun',
 'lvis',
 'malaria',
 'math_dataset',
 'mctaco',
 'mlqa',
 'mnist',
 'mnist_corrupted',
 'movie_lens',
 'movie_rationales',
 'movielens',
 'moving_mnist',
 'multi_news',
 'multi_nli',
 'multi_nli_mismatch',
 'natural_questions',
 'natural_questions_open',
 'newsroom',
 'nsynth',
 'nyu_depth_v2',
 'ogbg_molpcba',
 'omniglot',
 'open_images_challenge2019_detection',
 'open_images_v4',
 'openbookqa',
 'opinion_abstracts',
 'opinosis',
 'opus',
 'oxford_flowers102',
 'oxford_iiit_pet',
 'para_crawl',
 'patch_camelyon',
 'paws_wiki',
 'paws_x_wiki',
 'pet_finder',
 'pg19',
 'piqa',
 'places365_small',
 'plant_leaves',
 'plant_village',
 'plantae_k',
 'qa4mre',
 'qasc',
 'quac',
 'quickdraw_bitmap',
 'race',
 'radon',
 'reddit',
 'reddit_disentanglement',
 'reddit_tifu',
 'resisc45',
 'robonet',
 'rock_paper_scissors',
 'rock_you',
 's3o4d',
 'salient_span_wikipedia',
 'samsum',
 'savee',
 'scan',
 'scene_parse150',
 'scicite',
 'scientific_papers',
 'sentiment140',
 'shapes3d',
 'siscore',
 'smallnorb',
 'snli',
 'so2sat',
 'speech_commands',
 'spoken_digit',
 'squad',
 'stanford_dogs',
 'stanford_online_products',
 'star_cfq',
 'starcraft_video',
 'stl10',
 'story_cloze',
 'sun397',
 'super_glue',
 'svhn_cropped',
 'tao',
 'ted_hrlr_translate',
 'ted_multi_translate',
 'tedlium',
 'tf_flowers',
 'the300w_lp',
 'tiny_shakespeare',
 'titanic',
 'trec',
 'trivia_qa',
 'tydi_qa',
 'uc_merced',
 'ucf101',
 'vctk',
 'vgg_face2',
 'visual_domain_decathlon',
 'voc',
 'voxceleb',
 'voxforge',
 'waymo_open_dataset',
 'web_nlg',
 'web_questions',
 'wider_face',
 'wiki40b',
 'wiki_bio',
 'wiki_table_questions',
 'wiki_table_text',
 'wikiann',
 'wikihow',
 'wikipedia',
 'wikipedia_toxicity_subtypes',
 'wine_quality',
 'winogrande',
 'wmt13_translate',
 'wmt14_translate',
 'wmt15_translate',
 'wmt16_translate',
 'wmt17_translate',
 'wmt18_translate',
 'wmt19_translate',
 'wmt_t2t_translate',
 'wmt_translate',
 'wordnet',
 'wsc273',
 'xnli',
 'xquad',
 'xsum',
 'xtreme_pawsx',
 'xtreme_xnli',
 'yelp_polarity_reviews',
 'yes_no',
 'youtube_vis',
 'huggingface:acronym_identification',
 'huggingface:ade_corpus_v2',
 'huggingface:adversarial_qa',
 'huggingface:aeslc',
 'huggingface:afrikaans_ner_corpus',
 'huggingface:ag_news',
 'huggingface:ai2_arc',
 'huggingface:air_dialogue',
 'huggingface:ajgt_twitter_ar',
 'huggingface:allegro_reviews',
 'huggingface:allocine',
 'huggingface:alt',
 'huggingface:amazon_polarity',
 'huggingface:amazon_reviews_multi',
 'huggingface:amazon_us_reviews',
 'huggingface:ambig_qa',
 'huggingface:amttl',
 'huggingface:anli',
 'huggingface:app_reviews',
 'huggingface:aqua_rat',
 'huggingface:aquamuse',
 'huggingface:ar_cov19',
 'huggingface:ar_res_reviews',
 'huggingface:ar_sarcasm',
 'huggingface:arabic_billion_words',
 'huggingface:arabic_pos_dialect',
 'huggingface:arabic_speech_corpus',
 'huggingface:arcd',
 'huggingface:arsentd_lev',
 'huggingface:art',
 'huggingface:arxiv_dataset',
 'huggingface:aslg_pc12',
 'huggingface:asnq',
 'huggingface:asset',
 'huggingface:assin',
 'huggingface:assin2',
 'huggingface:atomic',
 'huggingface:autshumato',
 'huggingface:bbc_hindi_nli',
 'huggingface:bc2gm_corpus',
 'huggingface:best2009',
 'huggingface:bianet',
 'huggingface:bible_para',
 'huggingface:big_patent',
 'huggingface:billsum',
 'huggingface:bing_coronavirus_query_set',
 'huggingface:biomrc',
 'huggingface:blended_skill_talk',
 'huggingface:blimp',
 'huggingface:blog_authorship_corpus',
 'huggingface:bn_hate_speech',
 'huggingface:bookcorpus',
 'huggingface:bookcorpusopen',
 'huggingface:boolq',
 'huggingface:bprec',
 'huggingface:break_data',
 'huggingface:brwac',
 'huggingface:bsd_ja_en',
 'huggingface:bswac',
 'huggingface:c3',
 'huggingface:c4',
 'huggingface:cail2018',
 'huggingface:caner',
 'huggingface:capes',
 'huggingface:catalonia_independence',
 'huggingface:cawac',
 'huggingface:cc100',
 'huggingface:cc_news',
 'huggingface:cdsc',
 'huggingface:cdt',
 'huggingface:cfq',
 'huggingface:chr_en',
 'huggingface:cifar10',
 'huggingface:cifar100',
 'huggingface:circa',
 'huggingface:civil_comments',
 'huggingface:clickbait_news_bg',
 'huggingface:climate_fever',
 'huggingface:clinc_oos',
 'huggingface:clue',
 'huggingface:cmrc2018',
 'huggingface:cnn_dailymail',
 'huggingface:coached_conv_pref',
 'huggingface:coarse_discourse',
 'huggingface:codah',
 'huggingface:code_search_net',
 'huggingface:com_qa',
 'huggingface:common_gen',
 'huggingface:commonsense_qa',
 'huggingface:compguesswhat',
 'huggingface:conceptnet5',
 'huggingface:conll2000',
 'huggingface:conll2002',
 'huggingface:conll2003',
 'huggingface:conv_ai',
 'huggingface:conv_ai_2',
 'huggingface:conv_ai_3',
 'huggingface:coqa',
 'huggingface:cord19',
 'huggingface:cornell_movie_dialog',
 'huggingface:cos_e',
 'huggingface:cosmos_qa',
 'huggingface:counter',
 'huggingface:covid_qa_castorini',
 'huggingface:covid_qa_deepset',
 'huggingface:covid_qa_ucsd',
 'huggingface:covid_tweets_japanese',
 'huggingface:craigslist_bargains',
 'huggingface:crawl_domain',
 'huggingface:crd3',
 'huggingface:crime_and_punish',
 'huggingface:crows_pairs',
 'huggingface:cs_restaurants',
 'huggingface:curiosity_dialogs',
 'huggingface:daily_dialog',
 'huggingface:dane',
 'huggingface:danish_political_comments',
 'huggingface:dart',
 'huggingface:datacommons_factcheck',
 'huggingface:dbpedia_14',
 'huggingface:dbrd',
 'huggingface:deal_or_no_dialog',
 'huggingface:definite_pronoun_resolution',
 'huggingface:dengue_filipino',
 'huggingface:dialog_re',
 'huggingface:diplomacy_detection',
 'huggingface:disaster_response_messages',
 'huggingface:discofuse',
 'huggingface:discovery',
 'huggingface:doc2dial',
 'huggingface:docred',
 'huggingface:doqa',
 'huggingface:dream',
 'huggingface:drop',
 'huggingface:duorc',
 'huggingface:dutch_social',
 'huggingface:dyk',
 'huggingface:e2e_nlg',
 'huggingface:e2e_nlg_cleaned',
 'huggingface:ecb',
 'huggingface:ehealth_kd',
 'huggingface:eitb_parcc',
 'huggingface:eli5',
 'huggingface:emea',
 'huggingface:emo',
 'huggingface:emotion',
 'huggingface:emotone_ar',
 'huggingface:empathetic_dialogues',
 'huggingface:enriched_web_nlg',
 'huggingface:eraser_multi_rc',
 'huggingface:esnli',
 'huggingface:eth_py150_open',
 'huggingface:ethos',
 'huggingface:euronews',
 'huggingface:europa_eac_tm',
 'huggingface:europa_ecdc_tm',
 'huggingface:event2Mind',
 'huggingface:evidence_infer_treatment',
 'huggingface:exams',
 'huggingface:factckbr',
 'huggingface:fake_news_english',
 'huggingface:fake_news_filipino',
 'huggingface:farsi_news',
 'huggingface:fever',
 'huggingface:finer',
 'huggingface:flores',
 'huggingface:flue',
 'huggingface:fquad',
 'huggingface:freebase_qa',
 'huggingface:gap',
 'huggingface:gem',
 'huggingface:generated_reviews_enth',
 'huggingface:generics_kb',
 'huggingface:german_legal_entity_recognition',
 'huggingface:germaner',
 'huggingface:germeval_14',
 'huggingface:giga_fren',
 'huggingface:gigaword',
 'huggingface:glucose',
 'huggingface:glue',
 'huggingface:gnad10',
 'huggingface:go_emotions',
 'huggingface:google_wellformed_query',
 'huggingface:grail_qa',
 'huggingface:great_code',
 'huggingface:guardian_authorship',
 'huggingface:gutenberg_time',
 'huggingface:hans',
 'huggingface:hansards',
 'huggingface:hard',
 'huggingface:harem',
 'huggingface:has_part',
 'huggingface:hate_offensive',
 'huggingface:hate_speech18',
 'huggingface:hate_speech_filipino',
 'huggingface:hate_speech_offensive',
 'huggingface:hate_speech_pl',
 'huggingface:hate_speech_portuguese',
 'huggingface:hatexplain',
 'huggingface:hausa_voa_ner',
 'huggingface:hausa_voa_topics',
 'huggingface:hda_nli_hindi',
 'huggingface:head_qa',
 'huggingface:health_fact',
 'huggingface:hebrew_projectbenyehuda',
 'huggingface:hebrew_sentiment',
 'huggingface:hebrew_this_world',
 'huggingface:hellaswag',
 'huggingface:hind_encorp',
 'huggingface:hindi_discourse',
 'huggingface:hippocorpus',
 'huggingface:hkcancor',
 'huggingface:hope_edi',
 'huggingface:hotpot_qa',
 'huggingface:hover',
 'huggingface:hrenwac_para',
 'huggingface:hrwac',
 'huggingface:humicroedit',
 'huggingface:hybrid_qa',
 'huggingface:hyperpartisan_news_detection',
 'huggingface:id_clickbait',
 'huggingface:id_liputan6',
 'huggingface:id_nergrit_corpus',
 'huggingface:id_newspapers_2018',
 'huggingface:id_panl_bppt',
 'huggingface:id_puisi',
 'huggingface:igbo_english_machine_translation',
 'huggingface:igbo_monolingual',
 'huggingface:igbo_ner',
 'huggingface:ilist',
 'huggingface:imdb',
 'huggingface:imdb_urdu_reviews',
 'huggingface:imppres',
 'huggingface:indic_glue',
 'huggingface:indonlu',
 'huggingface:inquisitive_qg',
 'huggingface:interpress_news_category_tr',
 'huggingface:irc_disentangle',
 'huggingface:isixhosa_ner_corpus',
 'huggingface:isizulu_ner_corpus',
 'huggingface:iwslt2017',
 'huggingface:jeopardy',
 'huggingface:jfleg',
 'huggingface:jigsaw_toxicity_pred',
 'huggingface:jnlpba',
 'huggingface:journalists_questions',
 'huggingface:kannada_news',
 'huggingface:kd_conv',
 'huggingface:kde4',
 'huggingface:kelm',
 'huggingface:kilt_tasks',
 'huggingface:kilt_wikipedia',
 'huggingface:kinnews_kirnews',
 'huggingface:kor_3i4k',
 'huggingface:kor_hate',
 'huggingface:kor_ner',
 'huggingface:kor_nli',
 'huggingface:kor_nlu',
 'huggingface:kor_qpair',
 'huggingface:kor_sae',
 'huggingface:kor_sarcasm',
 'huggingface:labr',
 'huggingface:lama',
 'huggingface:lambada',
 'huggingface:large_spanish_corpus',
 'huggingface:lc_quad',
 'huggingface:lener_br',
 'huggingface:liar',
 'huggingface:librispeech_asr',
 'huggingface:librispeech_lm',
 'huggingface:limit',
 'huggingface:lince',
 'huggingface:linnaeus',
 'huggingface:liveqa',
 'huggingface:lj_speech',
 'huggingface:lm1b',
 'huggingface:lst20',
 'huggingface:mac_morpho',
 'huggingface:makhzan',
 'huggingface:math_dataset',
 'huggingface:math_qa',
 'huggingface:matinf',
 'huggingface:mc_taco',
 'huggingface:md_gender_bias',
 'huggingface:med_hop',
 'huggingface:medal',
 'huggingface:medical_dialog',
 'huggingface:medical_questions_pairs',
 'huggingface:menyo20k_mt',
 'huggingface:meta_woz',
 'huggingface:metooma',
 'huggingface:metrec',
 'huggingface:mkb',
 'huggingface:mkqa',
 'huggingface:mlqa',
 'huggingface:mlsum',
 'huggingface:mnist',
 'huggingface:mocha',
 'huggingface:movie_rationales',
 'huggingface:mrqa',
 'huggingface:ms_marco',
 'huggingface:ms_terms',
 'huggingface:msr_genomics_kbcomp',
 'huggingface:msr_sqa',
 'huggingface:msr_text_compression',
 'huggingface:msr_zhen_translation_parity',
 'huggingface:msra_ner',
 'huggingface:mt_eng_vietnamese',
 'huggingface:muchocine',
 'huggingface:multi_booked',
 'huggingface:multi_news',
 'huggingface:multi_nli',
 'huggingface:multi_nli_mismatch',
 'huggingface:multi_para_crawl',
 'huggingface:multi_re_qa',
 'huggingface:multi_woz_v22',
 'huggingface:multi_x_science_sum',
 'huggingface:mutual_friends',
 'huggingface:mwsc',
 'huggingface:myanmar_news',
 'huggingface:narrativeqa',
 'huggingface:narrativeqa_manual',
 'huggingface:natural_questions',
 'huggingface:ncbi_disease',
 'huggingface:nchlt',
 'huggingface:ncslgr',
 'huggingface:nell',
 'huggingface:neural_code_search',
 'huggingface:news_commentary',
 'huggingface:newsgroup',
 'huggingface:newsph',
 'huggingface:newsph_nli',
 'huggingface:newsqa',
 'huggingface:newsroom',
 'huggingface:nkjp-ner',
 'huggingface:nli_tr',
 'huggingface:norwegian_ner',
 'huggingface:nq_open',
 'huggingface:nsmc',
 'huggingface:numer_sense',
 'huggingface:numeric_fused_head',
 'huggingface:oclar',
 'huggingface:offcombr',
 'huggingface:offenseval2020_tr',
 'huggingface:offenseval_dravidian',
 'huggingface:ofis_publik',
 'huggingface:ohsumed',
 'huggingface:ollie',
 'huggingface:omp',
 'huggingface:onestop_english',
 'huggingface:open_subtitles',
 'huggingface:openbookqa',
 'huggingface:openwebtext',
 'huggingface:opinosis',
 'huggingface:opus100',
 'huggingface:opus_books',
 'huggingface:opus_dgt',
 'huggingface:opus_dogc',
 'huggingface:opus_elhuyar',
 'huggingface:opus_euconst',
 'huggingface:opus_finlex',
 'huggingface:opus_fiskmo',
 'huggingface:opus_gnome',
 'huggingface:opus_infopankki',
 'huggingface:opus_memat',
 'huggingface:opus_montenegrinsubs',
 'huggingface:opus_openoffice',
 'huggingface:opus_paracrawl',
 'huggingface:opus_rf',
 'huggingface:opus_tedtalks',
 'huggingface:opus_ubuntu',
 'huggingface:opus_wikipedia',
 'huggingface:opus_xhosanavy',
 'huggingface:orange_sum',
 'huggingface:oscar',
 'huggingface:para_crawl',
 'huggingface:para_pat',
 'huggingface:paws',
 'huggingface:paws-x',
 'huggingface:pec',
 'huggingface:peer_read',
 'huggingface:peoples_daily_ner',
 'huggingface:per_sent',
 'huggingface:persian_ner',
 'huggingface:pg19',
 'huggingface:php',
 'huggingface:piaf',
 'huggingface:pib',
 'huggingface:piqa',
 'huggingface:pn_summary',
 'huggingface:poem_sentiment',
 'huggingface:polemo2',
 'huggingface:poleval2019_cyberbullying',
 'huggingface:poleval2019_mt',
 'huggingface:polsum',
 'huggingface:polyglot_ner',
 'huggingface:prachathai67k',
 'huggingface:pragmeval',
 'huggingface:proto_qa',
 'huggingface:psc',
 'huggingface:ptb_text_only',
 'huggingface:pubmed',
 'huggingface:pubmed_qa',
 'huggingface:py_ast',
 'huggingface:qa4mre',
 'huggingface:qa_srl',
 'huggingface:qa_zre',
 'huggingface:qangaroo',
 'huggingface:qanta',
 'huggingface:qasc',
 'huggingface:qed',
 'huggingface:qed_amara',
 'huggingface:quac',
 'huggingface:quail',
 'huggingface:quarel',
 'huggingface:quartz',
 'huggingface:quora',
 'huggingface:quoref',
 'huggingface:race',
 'huggingface:re_dial',
 'huggingface:reasoning_bg',
 'huggingface:recipe_nlg',
 'huggingface:reclor',
 'huggingface:reddit',
 'huggingface:reddit_tifu',
 'huggingface:refresd',
 'huggingface:reuters21578',
 'huggingface:roman_urdu',
 'huggingface:ronec',
 'huggingface:ropes',
 'huggingface:rotten_tomatoes',
 'huggingface:s2orc',
 'huggingface:samsum',
 'huggingface:sanskrit_classic',
 'huggingface:saudinewsnet',
 'huggingface:scan',
 'huggingface:scb_mt_enth_2020',
 'huggingface:schema_guided_dstc8',
 'huggingface:scicite',
 'huggingface:scielo',
 'huggingface:scientific_papers',
 'huggingface:scifact',
 'huggingface:sciq',
 'huggingface:scitail',
 'huggingface:scitldr',
 'huggingface:search_qa',
 'huggingface:selqa',
 'huggingface:sem_eval_2010_task_8',
 'huggingface:sem_eval_2014_task_1',
 'huggingface:sem_eval_2020_task_11',
 'huggingface:sent_comp',
 'huggingface:senti_lex',
 'huggingface:senti_ws',
 'huggingface:sentiment140',
 'huggingface:sepedi_ner',
 'huggingface:sesotho_ner_corpus',
 'huggingface:setimes',
 'huggingface:setswana_ner_corpus',
 'huggingface:sharc',
 'huggingface:sharc_modified',
 'huggingface:sick',
 'huggingface:silicone',
 'huggingface:simple_questions_v2',
 'huggingface:siswati_ner_corpus',
 'huggingface:smartdata',
 'huggingface:sms_spam',
 'huggingface:snips_built_in_intents',
 'huggingface:snli',
 'huggingface:snow_simplified_japanese_corpus',
 'huggingface:so_stacksample',
 'huggingface:social_bias_frames',
 'huggingface:social_i_qa',
 'huggingface:sofc_materials_articles',
 'huggingface:sogou_news',
 'huggingface:spanish_billion_words',
 'huggingface:spc',
 'huggingface:species_800',
 'huggingface:spider',
 'huggingface:squad',
 'huggingface:squad_adversarial',
 'huggingface:squad_es',
 'huggingface:squad_it',
 'huggingface:squad_kor_v1',
 'huggingface:squad_kor_v2',
 'huggingface:squad_v1_pt',
 'huggingface:squad_v2',
 'huggingface:squadshifts',
 'huggingface:srwac',
 'huggingface:stereoset',
 'huggingface:stsb_mt_sv',
 'huggingface:style_change_detection',
 'huggingface:super_glue',
 'huggingface:swag',
 'huggingface:swahili',
 'huggingface:swahili_news',
 'huggingface:swda',
 'huggingface:swedish_ner_corpus',
 'huggingface:swedish_reviews',
 'huggingface:tab_fact',
 'huggingface:tamilmixsentiment',
 'huggingface:tanzil',
 'huggingface:tapaco',
 'huggingface:tashkeela',
 'huggingface:taskmaster1',
 'huggingface:taskmaster2',
 'huggingface:taskmaster3',
 'huggingface:tatoeba',
 'huggingface:ted_hrlr',
 'huggingface:ted_iwlst2013',
 'huggingface:ted_multi',
 'huggingface:ted_talks_iwslt',
 'huggingface:telugu_books',
 'huggingface:telugu_news',
 'huggingface:tep_en_fa_para',
 'huggingface:thai_toxicity_tweet',
 'huggingface:thainer',
 'huggingface:thaiqa_squad',
 'huggingface:thaisum',
 'huggingface:tilde_model',
 'huggingface:times_of_india_news_headlines',
 'huggingface:tiny_shakespeare',
 'huggingface:tlc',
 'huggingface:tmu_gfm_dataset',
 'huggingface:totto',
 'huggingface:trec',
 'huggingface:trivia_qa',
 'huggingface:tsac',
 'huggingface:ttc4900',
 'huggingface:tunizi',
 'huggingface:tuple_ie',
 'huggingface:turk',
 'huggingface:turkish_movie_sentiment',
 'huggingface:turkish_ner',
 'huggingface:turkish_product_reviews',
 'huggingface:turkish_shrinked_ner',
 'huggingface:turku_ner_corpus',
 'huggingface:tweet_eval',
 'huggingface:tweet_qa',
 'huggingface:tweets_ar_en_parallel',
 'huggingface:tweets_hate_speech_detection',
 'huggingface:twi_text_c3',
 'huggingface:twi_wordsim353',
 'huggingface:tydiqa',
 'huggingface:ubuntu_dialogs_corpus',
 'huggingface:udhr',
 'huggingface:um005',
 'huggingface:un_ga',
 'huggingface:un_multi',
 'huggingface:un_pc',
 'huggingface:universal_dependencies',
 'huggingface:universal_morphologies',
 'huggingface:urdu_fake_news',
 'huggingface:urdu_sentiment_corpus',
 'huggingface:web_nlg',
 'huggingface:web_of_science',
 'huggingface:web_questions',
 'huggingface:weibo_ner',
 'huggingface:wi_locness',
 'huggingface:wiki40b',
 'huggingface:wiki_asp',
 'huggingface:wiki_atomic_edits',
 'huggingface:wiki_auto',
 'huggingface:wiki_bio',
 'huggingface:wiki_dpr',
 'huggingface:wiki_hop',
 'huggingface:wiki_lingua',
 'huggingface:wiki_movies',
 'huggingface:wiki_qa',
 'huggingface:wiki_qa_ar',
 'huggingface:wiki_snippets',
 'huggingface:wiki_source',
 'huggingface:wiki_split',
 'huggingface:wiki_summary',
 'huggingface:wikiann',
 'huggingface:wikicorpus',
 'huggingface:wikihow',
 'huggingface:wikipedia',
 'huggingface:wikisql',
 'huggingface:wikitext',
 'huggingface:wikitext_tl39',
 'huggingface:wili_2018',
 'huggingface:wino_bias',
 'huggingface:winograd_wsc',
 'huggingface:winogrande',
 'huggingface:wiqa',
 'huggingface:wisesight1000',
 'huggingface:wisesight_sentiment',
 'huggingface:wmt14',
 'huggingface:wmt15',
 'huggingface:wmt16',
 'huggingface:wmt17',
 'huggingface:wmt18',
 'huggingface:wmt19',
 'huggingface:wmt20_mlqe_task1',
 'huggingface:wmt20_mlqe_task2',
 'huggingface:wmt20_mlqe_task3',
 'huggingface:wmt_t2t',
 'huggingface:wnut_17',
 'huggingface:wongnai_reviews',
 'huggingface:woz_dialogue',
 'huggingface:wrbsc',
 'huggingface:x_stance',
 'huggingface:xcopa',
 'huggingface:xed_en_fi',
 'huggingface:xglue',
 'huggingface:xnli',
 'huggingface:xor_tydi_qa',
 'huggingface:xquad',
 'huggingface:xquad_r',
 'huggingface:xsum',
 'huggingface:xsum_factuality',
 'huggingface:xtreme',
 'huggingface:yahoo_answers_qa',
 'huggingface:yahoo_answers_topics',
 'huggingface:yelp_polarity',
 'huggingface:yelp_review_full',
 'huggingface:yoruba_bbc_topics',
 'huggingface:yoruba_gv_ner',
 'huggingface:yoruba_text_c3',
 'huggingface:yoruba_wordsim353',
 'huggingface:youtube_caption_corrections',
 'huggingface:zest']

Laden Sie einen Datensatz

tfds.load

Der einfachste Weg, ein Dataset zu tfds.load ist tfds.load . Es wird:

  1. Laden Sie die Daten herunter und speichern Sie sie als tfrecord Dateien.
  2. Laden Sie das tfrecord und erstellen Sie dastf.data.Dataset .
ds = tfds.load('mnist', split='train', shuffle_files=True)
assert isinstance(ds, tf.data.Dataset)
print(ds)
<_OptionsDataset shapes: {image: (28, 28, 1), label: ()}, types: {image: tf.uint8, label: tf.int64}>

Einige häufige Argumente:

  • split= : Welche Aufteilung soll gelesen werden (zB 'train' , ['train', 'test'] , 'train[80%:]' , ...). Weitere Informationen finden Sie in unserem Split-API-Handbuch .
  • shuffle_files= : shuffle_files= ob die Dateien zwischen den einzelnen Epochen shuffle_files= (TFDS speichert große Datenmengen in mehreren kleineren Dateien).
  • data_dir= : data_dir= des Datasets (standardmäßig ~/tensorflow_datasets/ )
  • with_info=True : Gibt die tfds.core.DatasetInfo die Dataset-Metadaten enthält
  • download=False : Download deaktivieren

tfds.builder

tfds.load ist ein dünner Wrapper um tfds.core.DatasetBuilder . Sie können dieselbe Ausgabe mit der API tfds.core.DatasetBuilder :

builder = tfds.builder('mnist')
# 1. Create the tfrecord files (no-op if already exists)
builder.download_and_prepare()
# 2. Load the `tf.data.Dataset`
ds = builder.as_dataset(split='train', shuffle_files=True)
print(ds)
<_OptionsDataset shapes: {image: (28, 28, 1), label: ()}, types: {image: tf.uint8, label: tf.int64}>

tfds build CLI

Wenn Sie ein bestimmtes Dataset generieren möchten, können Sie die tfds verwenden . Beispielsweise:

tfds build mnist

Informationen zu verfügbaren Flags finden Sie im Dokument .

Iterieren Sie über einen Datensatz

Wie diktiert

Standardmäßig ist dastf.data.Dataset enthält Objekt ein dict von tf.Tensor s:

ds = tfds.load('mnist', split='train')
ds = ds.take(1)  # Only take a single example

for example in ds:  # example is `{'image': tf.Tensor, 'label': tf.Tensor}`
  print(list(example.keys()))
  image = example["image"]
  label = example["label"]
  print(image.shape, label)
['image', 'label']
(28, 28, 1) tf.Tensor(4, shape=(), dtype=int64)

dict Namen und der Struktur der dict finden Sie in der Dokumentation zum Datensatz in unserem Katalog . Zum Beispiel: Mnist-Dokumentation .

Als Tupel ( as_supervised=True )

Wenn Sie as_supervised=True , können Sie stattdessen ein Tupel (features, label) für überwachte Datasets abrufen.

ds = tfds.load('mnist', split='train', as_supervised=True)
ds = ds.take(1)

for image, label in ds:  # example is (image, label)
  print(image.shape, label)
(28, 28, 1) tf.Tensor(4, shape=(), dtype=int64)

Als numpy ( tfds.as_numpy )

Verwendet tfds.as_numpy zum Konvertieren:

  • tf.Tensor -> np.array
  • tf.data.Dataset -> Iterator[Tree[np.array]] ( Tree kann beliebig verschachtelt sein Dict , Tuple )
ds = tfds.load('mnist', split='train', as_supervised=True)
ds = ds.take(1)

for image, label in tfds.as_numpy(ds):
  print(type(image), type(label), label)
<class 'numpy.ndarray'> <class 'numpy.int64'> 4

Wie gestapelt tf.Tensor ( batch_size=-1 )

Mit batch_size=-1 können Sie den gesamten Datensatz in einen einzelnen Stapel laden.

Dies kann mit as_supervised=True und tfds.as_numpy , um die Daten als (np.array, np.array) :

image, label = tfds.as_numpy(tfds.load(
    'mnist',
    split='test',
    batch_size=-1,
    as_supervised=True,
))

print(type(image), image.shape)
<class 'numpy.ndarray'> (10000, 28, 28, 1)

Achten Sie darauf, dass Ihr Datensatz in den Speicher passt und dass alle Beispiele dieselbe Form haben.

Benchmarking Ihrer Datensätze

Das Benchmarking eines Datasets ist ein einfacher Aufruf von tfds.benchmark für eine beliebigetf.data.Dataset tfds.as_numpy (z. B.tf.data.Dataset , tfds.as_numpy , ...).

ds = tfds.load('mnist', split='train')
ds = ds.batch(32).prefetch(1)

tfds.benchmark(ds, batch_size=32)
tfds.benchmark(ds, batch_size=32)  # Second epoch much faster due to auto-caching
************ Summary ************

Examples/sec (First included) 37648.72 ex/sec (total: 60000 ex, 1.59 sec)
Examples/sec (First only) 53.84 ex/sec (total: 32 ex, 0.59 sec)
Examples/sec (First excluded) 60009.42 ex/sec (total: 59968 ex, 1.00 sec)

************ Summary ************

Examples/sec (First included) 300819.05 ex/sec (total: 60000 ex, 0.20 sec)
Examples/sec (First only) 2555.37 ex/sec (total: 32 ex, 0.01 sec)
Examples/sec (First excluded) 320799.77 ex/sec (total: 59968 ex, 0.19 sec)

Erstellen Sie eine End-to-End-Pipeline

Um weiter zu gehen, können Sie schauen:

Visualisierung

tfds.as_dataframe

tf.data.Dataset Objekte können mit tfds.as_dataframe in pandas.DataFrame tfds.as_dataframe werden, um in Colab visualisiert zu werden.

  • Fügen Sie tfds.core.DatasetInfo als zweites Argument von tfds.as_dataframe , um Bilder, Audio, Texte, Videos, ...
  • Verwenden Sie ds.take(x) um nur die ersten x Beispiele anzuzeigen. pandas.DataFrame lädt den gesamten Datensatz in den Speicher und kann sehr teuer in der Anzeige sein.
ds, info = tfds.load('mnist', split='train', with_info=True)

tfds.as_dataframe(ds.take(4), info)

tfds.show_examples

tfds.show_examples gibt eine matplotlib.figure.Figure (nur Bilddatensätze werden jetzt unterstützt):

ds, info = tfds.load('mnist', split='train', with_info=True)

fig = tfds.show_examples(ds, info)

png

Greifen Sie auf die Dataset-Metadaten zu

Alle Builder enthalten ein tfds.core.DatasetInfo Objekt, das die Dataset-Metadaten enthält.

Der Zugriff erfolgt über:

ds, info = tfds.load('mnist', with_info=True)
builder = tfds.builder('mnist')
info = builder.info

Die Datensatzinformationen enthalten zusätzliche Informationen zum Datensatz (Version, Zitat, Homepage, Beschreibung, ...).

print(info)
tfds.core.DatasetInfo(
    name='mnist',
    full_name='mnist/3.0.1',
    description="""
    The MNIST database of handwritten digits.
    """,
    homepage='http://yann.lecun.com/exdb/mnist/',
    data_path='gs://tensorflow-datasets/datasets/mnist/3.0.1',
    download_size=11.06 MiB,
    dataset_size=21.00 MiB,
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    supervised_keys=('image', 'label'),
    splits={
        'test': <SplitInfo num_examples=10000, num_shards=1>,
        'train': <SplitInfo num_examples=60000, num_shards=1>,
    },
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
      volume={2},
      year={2010}
    }""",
)

Features Metadaten (Beschriftungsnamen, Bildform, ...)

tfds.features.FeatureDict :

info.features
FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

Anzahl der Klassen, Labelnamen:

print(info.features["label"].num_classes)
print(info.features["label"].names)
print(info.features["label"].int2str(7))  # Human readable version (8 -> 'cat')
print(info.features["label"].str2int('7'))
10
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
7
7

Formen, dtypen:

print(info.features.shape)
print(info.features.dtype)
print(info.features['image'].shape)
print(info.features['image'].dtype)
{'image': (28, 28, 1), 'label': ()}
{'image': tf.uint8, 'label': tf.int64}
(28, 28, 1)
<dtype: 'uint8'>

Geteilte Metadaten (zB geteilte Namen, Anzahl der Beispiele, ...)

tfds.core.SplitDict :

print(info.splits)
{'test': <SplitInfo num_examples=10000, num_shards=1>, 'train': <SplitInfo num_examples=60000, num_shards=1>}

Verfügbare Splits:

print(list(info.splits.keys()))
['test', 'train']

Informieren Sie sich über die individuelle Aufteilung:

print(info.splits['train'].num_examples)
print(info.splits['train'].filenames)
print(info.splits['train'].num_shards)
60000
['mnist-train.tfrecord-00000-of-00001']
1

Es funktioniert auch mit der Subsplit-API:

print(info.splits['train[15%:75%]'].num_examples)
print(info.splits['train[15%:75%]'].file_instructions)
36000
[FileInstruction(filename='mnist-train.tfrecord-00000-of-00001', skip=9000, take=36000, num_examples=36000)]

Fehlerbehebung

Manueller Download (wenn der Download fehlschlägt)

Wenn der Download aus irgendeinem Grund fehlschlägt (z. B. offline, ...). Sie können die Daten jederzeit manuell herunterladen und im manual_dir (standardmäßig ~/tensorflow_datasets/download/manual/ .

Um herauszufinden, welche URLs heruntergeladen werden müssen, schauen Sie in:

NonMatchingChecksumError

TFDS stellen Determinismus sicher, indem sie die Prüfsummen der heruntergeladenen URLs validieren. Wenn NonMatchingChecksumError wird, kann dies NonMatchingChecksumError anzeigen:

  • Die Website ist möglicherweise nicht erreichbar (z. B. 503 status code ). Bitte überprüfen Sie die URL.
  • Versuchen Sie es später bei Google Drive-URLs erneut, da Drive Downloads manchmal ablehnt, wenn zu viele Personen auf dieselbe URL zugreifen. Siehe Fehler
  • Die ursprünglichen Datasets-Dateien wurden möglicherweise aktualisiert. In diesem Fall sollte der TFDS-Dataset-Builder aktualisiert werden. Bitte öffnen Sie eine neue Github-Ausgabe oder PR:
    • Registrieren Sie die neuen Prüfsummen bei tfds build --register_checksums
    • Aktualisieren Sie schließlich den Datensatzgenerierungscode.
    • Aktualisieren Sie den Datensatz VERSION
    • Aktualisieren Sie den Datensatz RELEASE_NOTES : Was hat dazu geführt, dass sich die Prüfsummen geändert haben? Haben sich einige Beispiele geändert?
    • Stellen Sie sicher, dass der Datensatz noch erstellt werden kann.
    • Senden Sie uns eine PR

Zitat

Wenn Sie tensorflow-datasets für ein Papier verwenden, tensorflow-datasets Sie bitte zusätzlich zu den für die verwendeten Datensätze spezifischen Zitaten (die im Datensatzkatalog enthalten sind) das folgende Zitat hinzu.

@misc{TFDS,
  title = { {TensorFlow Datasets}, A collection of ready-to-use datasets},
  howpublished = {\url{https://www.tensorflow.org/datasets} },
}