Ajuda a proteger a Grande Barreira de Corais com TensorFlow em Kaggle Junte Desafio

Conjuntos de dados TensorFlow

TFDS fornece uma coleção de conjuntos de dados prontos para uso com TensorFlow, Jax e outras estruturas de aprendizado de máquina.

Ele lida com o download e preparar os dados determinística e construindo uma tf.data.Dataset (ou np.array ).

Ver no TensorFlow.org Executar no Google Colab Ver fonte no GitHub Baixar caderno

Instalação

TFDS existe em dois pacotes:

  • pip install tensorflow-datasets : A versão estável, lançado a cada poucos meses.
  • pip install tfds-nightly : Lançado a cada dia, contém as últimas versões dos conjuntos de dados.

Este colab usa tfds-nightly :

pip install -q tfds-nightly tensorflow matplotlib
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds

Encontre conjuntos de dados disponíveis

Todos os construtores de conjuntos de dados são subclasse de tfds.core.DatasetBuilder . Para obter a lista de construtores disponíveis, utilize tfds.list_builders() ou olhada no nosso catálogo .

tfds.list_builders()
['abstract_reasoning',
 'accentdb',
 'aeslc',
 'aflw2k3d',
 'ag_news_subset',
 'ai2_arc',
 'ai2_arc_with_ir',
 'amazon_us_reviews',
 'anli',
 'arc',
 'bair_robot_pushing_small',
 'bccd',
 'beans',
 'big_patent',
 'bigearthnet',
 'billsum',
 'binarized_mnist',
 'binary_alpha_digits',
 'blimp',
 'booksum',
 'bool_q',
 'c4',
 'caltech101',
 'caltech_birds2010',
 'caltech_birds2011',
 'cardiotox',
 'cars196',
 'cassava',
 'cats_vs_dogs',
 'celeb_a',
 'celeb_a_hq',
 'cfq',
 'cherry_blossoms',
 'chexpert',
 'cifar10',
 'cifar100',
 'cifar10_1',
 'cifar10_corrupted',
 'citrus_leaves',
 'cityscapes',
 'civil_comments',
 'clevr',
 'clic',
 'clinc_oos',
 'cmaterdb',
 'cnn_dailymail',
 'coco',
 'coco_captions',
 'coil100',
 'colorectal_histology',
 'colorectal_histology_large',
 'common_voice',
 'coqa',
 'cos_e',
 'cosmos_qa',
 'covid19',
 'covid19sum',
 'crema_d',
 'curated_breast_imaging_ddsm',
 'cycle_gan',
 'd4rl_adroit_door',
 'd4rl_adroit_hammer',
 'd4rl_adroit_pen',
 'd4rl_adroit_relocate',
 'd4rl_mujoco_ant',
 'd4rl_mujoco_halfcheetah',
 'd4rl_mujoco_hopper',
 'd4rl_mujoco_walker2d',
 'dart',
 'davis',
 'deep_weeds',
 'definite_pronoun_resolution',
 'dementiabank',
 'diabetic_retinopathy_detection',
 'div2k',
 'dmlab',
 'doc_nli',
 'dolphin_number_word',
 'domainnet',
 'downsampled_imagenet',
 'drop',
 'dsprites',
 'dtd',
 'duke_ultrasound',
 'e2e_cleaned',
 'efron_morris75',
 'emnist',
 'eraser_multi_rc',
 'esnli',
 'eurosat',
 'fashion_mnist',
 'flic',
 'flores',
 'food101',
 'forest_fires',
 'fuss',
 'gap',
 'geirhos_conflict_stimuli',
 'gem',
 'genomics_ood',
 'german_credit_numeric',
 'gigaword',
 'glue',
 'goemotions',
 'gov_report',
 'gpt3',
 'gref',
 'groove',
 'gsm8k',
 'gtzan',
 'gtzan_music_speech',
 'hellaswag',
 'higgs',
 'horses_or_humans',
 'howell',
 'i_naturalist2017',
 'i_naturalist2018',
 'imagenet2012',
 'imagenet2012_corrupted',
 'imagenet2012_multilabel',
 'imagenet2012_real',
 'imagenet2012_subset',
 'imagenet_a',
 'imagenet_r',
 'imagenet_resized',
 'imagenet_v2',
 'imagenette',
 'imagewang',
 'imdb_reviews',
 'irc_disentanglement',
 'iris',
 'istella',
 'kddcup99',
 'kitti',
 'kmnist',
 'lambada',
 'lfw',
 'librispeech',
 'librispeech_lm',
 'libritts',
 'ljspeech',
 'lm1b',
 'lost_and_found',
 'lsun',
 'lvis',
 'malaria',
 'math_dataset',
 'math_qa',
 'mctaco',
 'mlqa',
 'mnist',
 'mnist_corrupted',
 'movie_lens',
 'movie_rationales',
 'movielens',
 'moving_mnist',
 'mslr_web',
 'multi_news',
 'multi_nli',
 'multi_nli_mismatch',
 'natural_questions',
 'natural_questions_open',
 'newsroom',
 'nsynth',
 'nyu_depth_v2',
 'ogbg_molpcba',
 'omniglot',
 'open_images_challenge2019_detection',
 'open_images_v4',
 'openbookqa',
 'opinion_abstracts',
 'opinosis',
 'opus',
 'oxford_flowers102',
 'oxford_iiit_pet',
 'para_crawl',
 'pass',
 'patch_camelyon',
 'paws_wiki',
 'paws_x_wiki',
 'penguins',
 'pet_finder',
 'pg19',
 'piqa',
 'places365_small',
 'plant_leaves',
 'plant_village',
 'plantae_k',
 'protein_net',
 'qa4mre',
 'qasc',
 'quac',
 'quickdraw_bitmap',
 'race',
 'radon',
 'reddit',
 'reddit_disentanglement',
 'reddit_tifu',
 'ref_coco',
 'resisc45',
 'rlu_atari',
 'rlu_dmlab_explore_object_rewards_few',
 'rlu_dmlab_explore_object_rewards_many',
 'rlu_dmlab_rooms_select_nonmatching_object',
 'rlu_dmlab_rooms_watermaze',
 'rlu_dmlab_seekavoid_arena01',
 'rlu_rwrl',
 'robonet',
 'robosuite_panda_pick_place_can',
 'rock_paper_scissors',
 'rock_you',
 's3o4d',
 'salient_span_wikipedia',
 'samsum',
 'savee',
 'scan',
 'scene_parse150',
 'schema_guided_dialogue',
 'scicite',
 'scientific_papers',
 'sentiment140',
 'shapes3d',
 'siscore',
 'smallnorb',
 'snli',
 'so2sat',
 'speech_commands',
 'spoken_digit',
 'squad',
 'squad_question_generation',
 'stanford_dogs',
 'stanford_online_products',
 'star_cfq',
 'starcraft_video',
 'stl10',
 'story_cloze',
 'summscreen',
 'sun397',
 'super_glue',
 'svhn_cropped',
 'symmetric_solids',
 'tao',
 'ted_hrlr_translate',
 'ted_multi_translate',
 'tedlium',
 'tf_flowers',
 'the300w_lp',
 'tiny_shakespeare',
 'titanic',
 'trec',
 'trivia_qa',
 'tydi_qa',
 'uc_merced',
 'ucf101',
 'vctk',
 'visual_domain_decathlon',
 'voc',
 'voxceleb',
 'voxforge',
 'waymo_open_dataset',
 'web_nlg',
 'web_questions',
 'wider_face',
 'wiki40b',
 'wiki_bio',
 'wiki_table_questions',
 'wiki_table_text',
 'wikiann',
 'wikihow',
 'wikipedia',
 'wikipedia_toxicity_subtypes',
 'wine_quality',
 'winogrande',
 'wit',
 'wit_kaggle',
 'wmt13_translate',
 'wmt14_translate',
 'wmt15_translate',
 'wmt16_translate',
 'wmt17_translate',
 'wmt18_translate',
 'wmt19_translate',
 'wmt_t2t_translate',
 'wmt_translate',
 'wordnet',
 'wsc273',
 'xnli',
 'xquad',
 'xsum',
 'xtreme_pawsx',
 'xtreme_xnli',
 'yelp_polarity_reviews',
 'yes_no',
 'youtube_vis',
 'huggingface:acronym_identification',
 'huggingface:ade_corpus_v2',
 'huggingface:adversarial_qa',
 'huggingface:aeslc',
 'huggingface:afrikaans_ner_corpus',
 'huggingface:ag_news',
 'huggingface:ai2_arc',
 'huggingface:air_dialogue',
 'huggingface:ajgt_twitter_ar',
 'huggingface:allegro_reviews',
 'huggingface:allocine',
 'huggingface:alt',
 'huggingface:amazon_polarity',
 'huggingface:amazon_reviews_multi',
 'huggingface:amazon_us_reviews',
 'huggingface:ambig_qa',
 'huggingface:amttl',
 'huggingface:anli',
 'huggingface:app_reviews',
 'huggingface:aqua_rat',
 'huggingface:aquamuse',
 'huggingface:ar_cov19',
 'huggingface:ar_res_reviews',
 'huggingface:ar_sarcasm',
 'huggingface:arabic_billion_words',
 'huggingface:arabic_pos_dialect',
 'huggingface:arabic_speech_corpus',
 'huggingface:arcd',
 'huggingface:arsentd_lev',
 'huggingface:art',
 'huggingface:arxiv_dataset',
 'huggingface:aslg_pc12',
 'huggingface:asnq',
 'huggingface:asset',
 'huggingface:assin',
 'huggingface:assin2',
 'huggingface:atomic',
 'huggingface:autshumato',
 'huggingface:bbc_hindi_nli',
 'huggingface:bc2gm_corpus',
 'huggingface:best2009',
 'huggingface:bianet',
 'huggingface:bible_para',
 'huggingface:big_patent',
 'huggingface:billsum',
 'huggingface:bing_coronavirus_query_set',
 'huggingface:biomrc',
 'huggingface:blended_skill_talk',
 'huggingface:blimp',
 'huggingface:blog_authorship_corpus',
 'huggingface:bn_hate_speech',
 'huggingface:bookcorpus',
 'huggingface:bookcorpusopen',
 'huggingface:boolq',
 'huggingface:bprec',
 'huggingface:break_data',
 'huggingface:brwac',
 'huggingface:bsd_ja_en',
 'huggingface:bswac',
 'huggingface:c3',
 'huggingface:c4',
 'huggingface:cail2018',
 'huggingface:caner',
 'huggingface:capes',
 'huggingface:catalonia_independence',
 'huggingface:cawac',
 'huggingface:cc100',
 'huggingface:cc_news',
 'huggingface:cdsc',
 'huggingface:cdt',
 'huggingface:cfq',
 'huggingface:chr_en',
 'huggingface:cifar10',
 'huggingface:cifar100',
 'huggingface:circa',
 'huggingface:civil_comments',
 'huggingface:clickbait_news_bg',
 'huggingface:climate_fever',
 'huggingface:clinc_oos',
 'huggingface:clue',
 'huggingface:cmrc2018',
 'huggingface:cnn_dailymail',
 'huggingface:coached_conv_pref',
 'huggingface:coarse_discourse',
 'huggingface:codah',
 'huggingface:code_search_net',
 'huggingface:com_qa',
 'huggingface:common_gen',
 'huggingface:commonsense_qa',
 'huggingface:compguesswhat',
 'huggingface:conceptnet5',
 'huggingface:conll2000',
 'huggingface:conll2002',
 'huggingface:conll2003',
 'huggingface:conv_ai',
 'huggingface:conv_ai_2',
 'huggingface:conv_ai_3',
 'huggingface:coqa',
 'huggingface:cord19',
 'huggingface:cornell_movie_dialog',
 'huggingface:cos_e',
 'huggingface:cosmos_qa',
 'huggingface:counter',
 'huggingface:covid_qa_castorini',
 'huggingface:covid_qa_deepset',
 'huggingface:covid_qa_ucsd',
 'huggingface:covid_tweets_japanese',
 'huggingface:craigslist_bargains',
 'huggingface:crawl_domain',
 'huggingface:crd3',
 'huggingface:crime_and_punish',
 'huggingface:crows_pairs',
 'huggingface:cs_restaurants',
 'huggingface:curiosity_dialogs',
 'huggingface:daily_dialog',
 'huggingface:dane',
 'huggingface:danish_political_comments',
 'huggingface:dart',
 'huggingface:datacommons_factcheck',
 'huggingface:dbpedia_14',
 'huggingface:dbrd',
 'huggingface:deal_or_no_dialog',
 'huggingface:definite_pronoun_resolution',
 'huggingface:dengue_filipino',
 'huggingface:dialog_re',
 'huggingface:diplomacy_detection',
 'huggingface:disaster_response_messages',
 'huggingface:discofuse',
 'huggingface:discovery',
 'huggingface:doc2dial',
 'huggingface:docred',
 'huggingface:doqa',
 'huggingface:dream',
 'huggingface:drop',
 'huggingface:duorc',
 'huggingface:dutch_social',
 'huggingface:dyk',
 'huggingface:e2e_nlg',
 'huggingface:e2e_nlg_cleaned',
 'huggingface:ecb',
 'huggingface:ehealth_kd',
 'huggingface:eitb_parcc',
 'huggingface:eli5',
 'huggingface:emea',
 'huggingface:emo',
 'huggingface:emotion',
 'huggingface:emotone_ar',
 'huggingface:empathetic_dialogues',
 'huggingface:enriched_web_nlg',
 'huggingface:eraser_multi_rc',
 'huggingface:esnli',
 'huggingface:eth_py150_open',
 'huggingface:ethos',
 'huggingface:euronews',
 'huggingface:europa_eac_tm',
 'huggingface:europa_ecdc_tm',
 'huggingface:event2Mind',
 'huggingface:evidence_infer_treatment',
 'huggingface:exams',
 'huggingface:factckbr',
 'huggingface:fake_news_english',
 'huggingface:fake_news_filipino',
 'huggingface:farsi_news',
 'huggingface:fever',
 'huggingface:finer',
 'huggingface:flores',
 'huggingface:flue',
 'huggingface:fquad',
 'huggingface:freebase_qa',
 'huggingface:gap',
 'huggingface:gem',
 'huggingface:generated_reviews_enth',
 'huggingface:generics_kb',
 'huggingface:german_legal_entity_recognition',
 'huggingface:germaner',
 'huggingface:germeval_14',
 'huggingface:giga_fren',
 'huggingface:gigaword',
 'huggingface:glucose',
 'huggingface:glue',
 'huggingface:gnad10',
 'huggingface:go_emotions',
 'huggingface:google_wellformed_query',
 'huggingface:grail_qa',
 'huggingface:great_code',
 'huggingface:guardian_authorship',
 'huggingface:gutenberg_time',
 'huggingface:hans',
 'huggingface:hansards',
 'huggingface:hard',
 'huggingface:harem',
 'huggingface:has_part',
 'huggingface:hate_offensive',
 'huggingface:hate_speech18',
 'huggingface:hate_speech_filipino',
 'huggingface:hate_speech_offensive',
 'huggingface:hate_speech_pl',
 'huggingface:hate_speech_portuguese',
 'huggingface:hatexplain',
 'huggingface:hausa_voa_ner',
 'huggingface:hausa_voa_topics',
 'huggingface:hda_nli_hindi',
 'huggingface:head_qa',
 'huggingface:health_fact',
 'huggingface:hebrew_projectbenyehuda',
 'huggingface:hebrew_sentiment',
 'huggingface:hebrew_this_world',
 'huggingface:hellaswag',
 'huggingface:hind_encorp',
 'huggingface:hindi_discourse',
 'huggingface:hippocorpus',
 'huggingface:hkcancor',
 'huggingface:hope_edi',
 'huggingface:hotpot_qa',
 'huggingface:hover',
 'huggingface:hrenwac_para',
 'huggingface:hrwac',
 'huggingface:humicroedit',
 'huggingface:hybrid_qa',
 'huggingface:hyperpartisan_news_detection',
 'huggingface:id_clickbait',
 'huggingface:id_liputan6',
 'huggingface:id_nergrit_corpus',
 'huggingface:id_newspapers_2018',
 'huggingface:id_panl_bppt',
 'huggingface:id_puisi',
 'huggingface:igbo_english_machine_translation',
 'huggingface:igbo_monolingual',
 'huggingface:igbo_ner',
 'huggingface:ilist',
 'huggingface:imdb',
 'huggingface:imdb_urdu_reviews',
 'huggingface:imppres',
 'huggingface:indic_glue',
 'huggingface:indonlu',
 'huggingface:inquisitive_qg',
 'huggingface:interpress_news_category_tr',
 'huggingface:irc_disentangle',
 'huggingface:isixhosa_ner_corpus',
 'huggingface:isizulu_ner_corpus',
 'huggingface:iwslt2017',
 'huggingface:jeopardy',
 'huggingface:jfleg',
 'huggingface:jigsaw_toxicity_pred',
 'huggingface:jnlpba',
 'huggingface:journalists_questions',
 'huggingface:kannada_news',
 'huggingface:kd_conv',
 'huggingface:kde4',
 'huggingface:kelm',
 'huggingface:kilt_tasks',
 'huggingface:kilt_wikipedia',
 'huggingface:kinnews_kirnews',
 'huggingface:kor_3i4k',
 'huggingface:kor_hate',
 'huggingface:kor_ner',
 'huggingface:kor_nli',
 'huggingface:kor_nlu',
 'huggingface:kor_qpair',
 'huggingface:kor_sae',
 'huggingface:kor_sarcasm',
 'huggingface:labr',
 'huggingface:lama',
 'huggingface:lambada',
 'huggingface:large_spanish_corpus',
 'huggingface:lc_quad',
 'huggingface:lener_br',
 'huggingface:liar',
 'huggingface:librispeech_asr',
 'huggingface:librispeech_lm',
 'huggingface:limit',
 'huggingface:lince',
 'huggingface:linnaeus',
 'huggingface:liveqa',
 'huggingface:lj_speech',
 'huggingface:lm1b',
 'huggingface:lst20',
 'huggingface:mac_morpho',
 'huggingface:makhzan',
 'huggingface:math_dataset',
 'huggingface:math_qa',
 'huggingface:matinf',
 'huggingface:mc_taco',
 'huggingface:md_gender_bias',
 'huggingface:med_hop',
 'huggingface:medal',
 'huggingface:medical_dialog',
 'huggingface:medical_questions_pairs',
 'huggingface:menyo20k_mt',
 'huggingface:meta_woz',
 'huggingface:metooma',
 'huggingface:metrec',
 'huggingface:mkb',
 'huggingface:mkqa',
 'huggingface:mlqa',
 'huggingface:mlsum',
 'huggingface:mnist',
 'huggingface:mocha',
 'huggingface:movie_rationales',
 'huggingface:mrqa',
 'huggingface:ms_marco',
 'huggingface:ms_terms',
 'huggingface:msr_genomics_kbcomp',
 'huggingface:msr_sqa',
 'huggingface:msr_text_compression',
 'huggingface:msr_zhen_translation_parity',
 'huggingface:msra_ner',
 'huggingface:mt_eng_vietnamese',
 'huggingface:muchocine',
 'huggingface:multi_booked',
 'huggingface:multi_news',
 'huggingface:multi_nli',
 'huggingface:multi_nli_mismatch',
 'huggingface:multi_para_crawl',
 'huggingface:multi_re_qa',
 'huggingface:multi_woz_v22',
 'huggingface:multi_x_science_sum',
 'huggingface:mutual_friends',
 'huggingface:mwsc',
 'huggingface:myanmar_news',
 'huggingface:narrativeqa',
 'huggingface:narrativeqa_manual',
 'huggingface:natural_questions',
 'huggingface:ncbi_disease',
 'huggingface:nchlt',
 'huggingface:ncslgr',
 'huggingface:nell',
 'huggingface:neural_code_search',
 'huggingface:news_commentary',
 'huggingface:newsgroup',
 'huggingface:newsph',
 'huggingface:newsph_nli',
 'huggingface:newsqa',
 'huggingface:newsroom',
 'huggingface:nkjp-ner',
 'huggingface:nli_tr',
 'huggingface:norwegian_ner',
 'huggingface:nq_open',
 'huggingface:nsmc',
 'huggingface:numer_sense',
 'huggingface:numeric_fused_head',
 'huggingface:oclar',
 'huggingface:offcombr',
 'huggingface:offenseval2020_tr',
 'huggingface:offenseval_dravidian',
 'huggingface:ofis_publik',
 'huggingface:ohsumed',
 'huggingface:ollie',
 'huggingface:omp',
 'huggingface:onestop_english',
 'huggingface:open_subtitles',
 'huggingface:openbookqa',
 'huggingface:openwebtext',
 'huggingface:opinosis',
 'huggingface:opus100',
 'huggingface:opus_books',
 'huggingface:opus_dgt',
 'huggingface:opus_dogc',
 'huggingface:opus_elhuyar',
 'huggingface:opus_euconst',
 'huggingface:opus_finlex',
 'huggingface:opus_fiskmo',
 'huggingface:opus_gnome',
 'huggingface:opus_infopankki',
 'huggingface:opus_memat',
 'huggingface:opus_montenegrinsubs',
 'huggingface:opus_openoffice',
 'huggingface:opus_paracrawl',
 'huggingface:opus_rf',
 'huggingface:opus_tedtalks',
 'huggingface:opus_ubuntu',
 'huggingface:opus_wikipedia',
 'huggingface:opus_xhosanavy',
 'huggingface:orange_sum',
 'huggingface:oscar',
 'huggingface:para_crawl',
 'huggingface:para_pat',
 'huggingface:paws',
 'huggingface:paws-x',
 'huggingface:pec',
 'huggingface:peer_read',
 'huggingface:peoples_daily_ner',
 'huggingface:per_sent',
 'huggingface:persian_ner',
 'huggingface:pg19',
 'huggingface:php',
 'huggingface:piaf',
 'huggingface:pib',
 'huggingface:piqa',
 'huggingface:pn_summary',
 'huggingface:poem_sentiment',
 'huggingface:polemo2',
 'huggingface:poleval2019_cyberbullying',
 'huggingface:poleval2019_mt',
 'huggingface:polsum',
 'huggingface:polyglot_ner',
 'huggingface:prachathai67k',
 'huggingface:pragmeval',
 'huggingface:proto_qa',
 'huggingface:psc',
 'huggingface:ptb_text_only',
 'huggingface:pubmed',
 'huggingface:pubmed_qa',
 'huggingface:py_ast',
 'huggingface:qa4mre',
 'huggingface:qa_srl',
 'huggingface:qa_zre',
 'huggingface:qangaroo',
 'huggingface:qanta',
 'huggingface:qasc',
 'huggingface:qed',
 'huggingface:qed_amara',
 'huggingface:quac',
 'huggingface:quail',
 'huggingface:quarel',
 'huggingface:quartz',
 'huggingface:quora',
 'huggingface:quoref',
 'huggingface:race',
 'huggingface:re_dial',
 'huggingface:reasoning_bg',
 'huggingface:recipe_nlg',
 'huggingface:reclor',
 'huggingface:reddit',
 'huggingface:reddit_tifu',
 'huggingface:refresd',
 'huggingface:reuters21578',
 'huggingface:roman_urdu',
 'huggingface:ronec',
 'huggingface:ropes',
 'huggingface:rotten_tomatoes',
 'huggingface:s2orc',
 'huggingface:samsum',
 'huggingface:sanskrit_classic',
 'huggingface:saudinewsnet',
 'huggingface:scan',
 'huggingface:scb_mt_enth_2020',
 'huggingface:schema_guided_dstc8',
 'huggingface:scicite',
 'huggingface:scielo',
 'huggingface:scientific_papers',
 'huggingface:scifact',
 'huggingface:sciq',
 'huggingface:scitail',
 'huggingface:scitldr',
 'huggingface:search_qa',
 'huggingface:selqa',
 'huggingface:sem_eval_2010_task_8',
 'huggingface:sem_eval_2014_task_1',
 'huggingface:sem_eval_2020_task_11',
 'huggingface:sent_comp',
 'huggingface:senti_lex',
 'huggingface:senti_ws',
 'huggingface:sentiment140',
 'huggingface:sepedi_ner',
 'huggingface:sesotho_ner_corpus',
 'huggingface:setimes',
 'huggingface:setswana_ner_corpus',
 'huggingface:sharc',
 'huggingface:sharc_modified',
 'huggingface:sick',
 'huggingface:silicone',
 'huggingface:simple_questions_v2',
 'huggingface:siswati_ner_corpus',
 'huggingface:smartdata',
 'huggingface:sms_spam',
 'huggingface:snips_built_in_intents',
 'huggingface:snli',
 'huggingface:snow_simplified_japanese_corpus',
 'huggingface:so_stacksample',
 'huggingface:social_bias_frames',
 'huggingface:social_i_qa',
 'huggingface:sofc_materials_articles',
 'huggingface:sogou_news',
 'huggingface:spanish_billion_words',
 'huggingface:spc',
 'huggingface:species_800',
 'huggingface:spider',
 'huggingface:squad',
 'huggingface:squad_adversarial',
 'huggingface:squad_es',
 'huggingface:squad_it',
 'huggingface:squad_kor_v1',
 'huggingface:squad_kor_v2',
 'huggingface:squad_v1_pt',
 'huggingface:squad_v2',
 'huggingface:squadshifts',
 'huggingface:srwac',
 'huggingface:stereoset',
 'huggingface:stsb_mt_sv',
 'huggingface:style_change_detection',
 'huggingface:super_glue',
 'huggingface:swag',
 'huggingface:swahili',
 'huggingface:swahili_news',
 'huggingface:swda',
 'huggingface:swedish_ner_corpus',
 'huggingface:swedish_reviews',
 'huggingface:tab_fact',
 'huggingface:tamilmixsentiment',
 'huggingface:tanzil',
 'huggingface:tapaco',
 'huggingface:tashkeela',
 'huggingface:taskmaster1',
 'huggingface:taskmaster2',
 'huggingface:taskmaster3',
 'huggingface:tatoeba',
 'huggingface:ted_hrlr',
 'huggingface:ted_iwlst2013',
 'huggingface:ted_multi',
 'huggingface:ted_talks_iwslt',
 'huggingface:telugu_books',
 'huggingface:telugu_news',
 'huggingface:tep_en_fa_para',
 'huggingface:thai_toxicity_tweet',
 'huggingface:thainer',
 'huggingface:thaiqa_squad',
 'huggingface:thaisum',
 'huggingface:tilde_model',
 'huggingface:times_of_india_news_headlines',
 'huggingface:tiny_shakespeare',
 'huggingface:tlc',
 'huggingface:tmu_gfm_dataset',
 'huggingface:totto',
 'huggingface:trec',
 'huggingface:trivia_qa',
 'huggingface:tsac',
 'huggingface:ttc4900',
 'huggingface:tunizi',
 'huggingface:tuple_ie',
 'huggingface:turk',
 'huggingface:turkish_movie_sentiment',
 'huggingface:turkish_ner',
 'huggingface:turkish_product_reviews',
 'huggingface:turkish_shrinked_ner',
 'huggingface:turku_ner_corpus',
 'huggingface:tweet_eval',
 'huggingface:tweet_qa',
 'huggingface:tweets_ar_en_parallel',
 'huggingface:tweets_hate_speech_detection',
 'huggingface:twi_text_c3',
 'huggingface:twi_wordsim353',
 'huggingface:tydiqa',
 'huggingface:ubuntu_dialogs_corpus',
 'huggingface:udhr',
 'huggingface:um005',
 'huggingface:un_ga',
 'huggingface:un_multi',
 'huggingface:un_pc',
 'huggingface:universal_dependencies',
 'huggingface:universal_morphologies',
 'huggingface:urdu_fake_news',
 'huggingface:urdu_sentiment_corpus',
 'huggingface:web_nlg',
 'huggingface:web_of_science',
 'huggingface:web_questions',
 'huggingface:weibo_ner',
 'huggingface:wi_locness',
 'huggingface:wiki40b',
 'huggingface:wiki_asp',
 'huggingface:wiki_atomic_edits',
 'huggingface:wiki_auto',
 'huggingface:wiki_bio',
 'huggingface:wiki_dpr',
 'huggingface:wiki_hop',
 'huggingface:wiki_lingua',
 'huggingface:wiki_movies',
 'huggingface:wiki_qa',
 'huggingface:wiki_qa_ar',
 'huggingface:wiki_snippets',
 'huggingface:wiki_source',
 'huggingface:wiki_split',
 'huggingface:wiki_summary',
 'huggingface:wikiann',
 'huggingface:wikicorpus',
 'huggingface:wikihow',
 'huggingface:wikipedia',
 'huggingface:wikisql',
 'huggingface:wikitext',
 'huggingface:wikitext_tl39',
 'huggingface:wili_2018',
 'huggingface:wino_bias',
 'huggingface:winograd_wsc',
 'huggingface:winogrande',
 'huggingface:wiqa',
 'huggingface:wisesight1000',
 'huggingface:wisesight_sentiment',
 'huggingface:wmt14',
 'huggingface:wmt15',
 'huggingface:wmt16',
 'huggingface:wmt17',
 'huggingface:wmt18',
 'huggingface:wmt19',
 'huggingface:wmt20_mlqe_task1',
 'huggingface:wmt20_mlqe_task2',
 'huggingface:wmt20_mlqe_task3',
 'huggingface:wmt_t2t',
 'huggingface:wnut_17',
 'huggingface:wongnai_reviews',
 'huggingface:woz_dialogue',
 'huggingface:wrbsc',
 'huggingface:x_stance',
 'huggingface:xcopa',
 'huggingface:xed_en_fi',
 'huggingface:xglue',
 'huggingface:xnli',
 'huggingface:xor_tydi_qa',
 'huggingface:xquad',
 'huggingface:xquad_r',
 'huggingface:xsum',
 'huggingface:xsum_factuality',
 'huggingface:xtreme',
 'huggingface:yahoo_answers_qa',
 'huggingface:yahoo_answers_topics',
 'huggingface:yelp_polarity',
 'huggingface:yelp_review_full',
 'huggingface:yoruba_bbc_topics',
 'huggingface:yoruba_gv_ner',
 'huggingface:yoruba_text_c3',
 'huggingface:yoruba_wordsim353',
 'huggingface:youtube_caption_corrections',
 'huggingface:zest']

Carregar um conjunto de dados

tfds.load

A maneira mais fácil de carregar um conjunto de dados é tfds.load . Será:

  1. Baixar os dados e salvá-lo como tfrecord arquivos.
  2. Carregar o tfrecord e criar o tf.data.Dataset .
ds = tfds.load('mnist', split='train', shuffle_files=True)
assert isinstance(ds, tf.data.Dataset)
print(ds)
<_OptionsDataset shapes: {image: (28, 28, 1), label: ()}, types: {image: tf.uint8, label: tf.int64}>
2021-12-04 12:22:13.107028: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

Alguns argumentos comuns:

  • split= : que dividiu a ler (por exemplo, 'train' , ['train', 'test'] , 'train[80%:]' ...). Veja o nosso guia de divisão API .
  • shuffle_files= : Controlar se a embaralhar os arquivos entre cada época (TFDS armazenar grandes conjuntos de dados em vários arquivos menores).
  • data_dir= : Local onde o conjunto de dados é salvo (o padrão é ~/tensorflow_datasets/ )
  • with_info=True : Retorna o tfds.core.DatasetInfo contendo conjunto de dados de metadados
  • download=False : Download Disable

tfds.builder

tfds.load é um wrapper fino ao redor tfds.core.DatasetBuilder . Você pode obter o mesmo resultado usando o tfds.core.DatasetBuilder API:

builder = tfds.builder('mnist')
# 1. Create the tfrecord files (no-op if already exists)
builder.download_and_prepare()
# 2. Load the `tf.data.Dataset`
ds = builder.as_dataset(split='train', shuffle_files=True)
print(ds)
<_OptionsDataset shapes: {image: (28, 28, 1), label: ()}, types: {image: tf.uint8, label: tf.int64}>

tfds build CLI

Se você deseja gerar um conjunto de dados específico, você pode usar o tfds linha de comando . Por exemplo:

tfds build mnist

Veja o doc para bandeiras disponíveis.

Iterar sobre um conjunto de dados

Como dict

Por padrão, o tf.data.Dataset objeto contém um dict de tf.Tensor s:

ds = tfds.load('mnist', split='train')
ds = ds.take(1)  # Only take a single example

for example in ds:  # example is `{'image': tf.Tensor, 'label': tf.Tensor}`
  print(list(example.keys()))
  image = example["image"]
  label = example["label"]
  print(image.shape, label)
['image', 'label']
(28, 28, 1) tf.Tensor(4, shape=(), dtype=int64)
2021-12-04 12:22:14.934497: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.

Para descobrir os dict nomes-chave e estrutura, olhada na documentação do conjunto de dados no nosso catálogo . Por exemplo: documentação mnist .

Como tuplo ( as_supervised=True )

Usando as_supervised=True , você pode obter uma tupla (features, label) em vez de conjuntos de dados supervisionadas.

ds = tfds.load('mnist', split='train', as_supervised=True)
ds = ds.take(1)

for image, label in ds:  # example is (image, label)
  print(image.shape, label)
(28, 28, 1) tf.Tensor(4, shape=(), dtype=int64)
2021-12-04 12:22:15.545132: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.

Como numpy ( tfds.as_numpy )

Usos tfds.as_numpy para converter:

ds = tfds.load('mnist', split='train', as_supervised=True)
ds = ds.take(1)

for image, label in tfds.as_numpy(ds):
  print(type(image), type(label), label)
<class 'numpy.ndarray'> <class 'numpy.int64'> 4
2021-12-04 12:22:16.171056: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.

Como tf.Tensor em lote ( batch_size=-1 )

Usando batch_size=-1 , você pode carregar o conjunto de dados completo em um único lote.

Isto pode ser combinado com as_supervised=True e tfds.as_numpy para obter o os dados como (np.array, np.array) :

image, label = tfds.as_numpy(tfds.load(
    'mnist',
    split='test',
    batch_size=-1,
    as_supervised=True,
))

print(type(image), image.shape)
<class 'numpy.ndarray'> (10000, 28, 28, 1)

Tenha cuidado para que seu conjunto de dados possa caber na memória e que todos os exemplos tenham o mesmo formato.

Compare seus conjuntos de dados

Aferimento um conjunto de dados é um simples tfds.benchmark chamada em qualquer iterable (por exemplo tf.data.Dataset , tfds.as_numpy , ...).

ds = tfds.load('mnist', split='train')
ds = ds.batch(32).prefetch(1)

tfds.benchmark(ds, batch_size=32)
tfds.benchmark(ds, batch_size=32)  # Second epoch much faster due to auto-caching
************ Summary ************

Examples/sec (First included) 46340.60 ex/sec (total: 60000 ex, 1.29 sec)
Examples/sec (First only) 148.15 ex/sec (total: 32 ex, 0.22 sec)
Examples/sec (First excluded) 55589.21 ex/sec (total: 59968 ex, 1.08 sec)

************ Summary ************

Examples/sec (First included) 278502.78 ex/sec (total: 60000 ex, 0.22 sec)
Examples/sec (First only) 1525.77 ex/sec (total: 32 ex, 0.02 sec)
Examples/sec (First excluded) 308374.75 ex/sec (total: 59968 ex, 0.19 sec)
  • Não se esqueça de normalizar os resultados por tamanho de lote com o batch_size= kwarg.
  • No resumo, o primeiro lote de aquecimento é separado dos outros para capturar tf.data.Dataset tempo de configuração adicional (por exemplo tampões de inicialização, ...).
  • Observe como a segunda iteração é muito mais rápido devido à TFDS auto-caching .
  • tfds.benchmark retorna um tfds.core.BenchmarkResult que pode ser inspeccionada, para posterior análise.

Construir pipeline de ponta a ponta

Para ir mais longe, você pode olhar:

Visualização

tfds.as_dataframe

tf.data.Dataset objetos podem ser convertidos em pandas.DataFrame com tfds.as_dataframe para ser visualizado em Colab .

  • Adicione o tfds.core.DatasetInfo como segundo argumento do tfds.as_dataframe para visualizar imagens, áudio, textos, vídeos, ...
  • Uso ds.take(x) para exibir apenas os primeiros x exemplos. pandas.DataFrame irá carregar o conjunto completo de dados in-memory, e pode ser muito caro para exibição.
ds, info = tfds.load('mnist', split='train', with_info=True)

tfds.as_dataframe(ds.take(4), info)
2021-12-04 12:22:19.580797: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.

tfds.show_examples

tfds.show_examples retornos uma matplotlib.figure.Figure (apenas conjuntos de dados de imagem suportados agora):

ds, info = tfds.load('mnist', split='train', with_info=True)

fig = tfds.show_examples(ds, info)
2021-12-04 12:22:20.564763: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.

png

Acesse os metadados do conjunto de dados

Todos os construtores incluem um tfds.core.DatasetInfo objeto que contém os metadados do conjunto de dados.

Ele pode ser acessado através de:

ds, info = tfds.load('mnist', with_info=True)
builder = tfds.builder('mnist')
info = builder.info

As informações do conjunto de dados contém informações adicionais sobre o conjunto de dados (versão, citação, página inicial, descrição, ...).

print(info)
tfds.core.DatasetInfo(
    name='mnist',
    full_name='mnist/3.0.1',
    description="""
    The MNIST database of handwritten digits.
    """,
    homepage='http://yann.lecun.com/exdb/mnist/',
    data_path='gs://tensorflow-datasets/datasets/mnist/3.0.1',
    download_size=11.06 MiB,
    dataset_size=21.00 MiB,
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    supervised_keys=('image', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=10000, num_shards=1>,
        'train': <SplitInfo num_examples=60000, num_shards=1>,
    },
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
      volume={2},
      year={2010}
    }""",
)

Metadados de recursos (nomes de rótulos, formato de imagem, ...)

Acesse o tfds.features.FeatureDict :

info.features
FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

Número de classes, nomes de rótulos:

print(info.features["label"].num_classes)
print(info.features["label"].names)
print(info.features["label"].int2str(7))  # Human readable version (8 -> 'cat')
print(info.features["label"].str2int('7'))
10
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
7
7

Formas, tipos d:

print(info.features.shape)
print(info.features.dtype)
print(info.features['image'].shape)
print(info.features['image'].dtype)
{'image': (28, 28, 1), 'label': ()}
{'image': tf.uint8, 'label': tf.int64}
(28, 28, 1)
<dtype: 'uint8'>

Metadados de divisão (por exemplo, nomes de divisão, número de exemplos, ...)

Acesse o tfds.core.SplitDict :

print(info.splits)
{'test': <SplitInfo num_examples=10000, num_shards=1>, 'train': <SplitInfo num_examples=60000, num_shards=1>}

Divisões disponíveis:

print(list(info.splits.keys()))
['test', 'train']

Obtenha informações sobre a divisão individual:

print(info.splits['train'].num_examples)
print(info.splits['train'].filenames)
print(info.splits['train'].num_shards)
60000
['mnist-train.tfrecord-00000-of-00001']
1

Ele também funciona com a API subsplit:

print(info.splits['train[15%:75%]'].num_examples)
print(info.splits['train[15%:75%]'].file_instructions)
36000
[FileInstruction(filename='mnist-train.tfrecord-00000-of-00001', skip=9000, take=36000, num_examples=36000)]

Solução de problemas

Download manual (se o download falhar)

Se o download falhar por algum motivo (por exemplo, offline, ...). Você sempre pode baixar manualmente os dados a si mesmo e colocá-lo no manual_dir (o padrão é ~/tensorflow_datasets/download/manual/ .

Para descobrir quais URLs baixar, verifique:

fixação NonMatchingChecksumError

O TFDS garante o determinismo validando as somas de verificação dos urls baixados. Se NonMatchingChecksumError é levantada, pode indicar:

  • O site pode ser baixo (por exemplo, 503 status code ). Por favor, verifique o url.
  • Para URLs do Google Drive, tente novamente mais tarde, pois o Drive às vezes rejeita downloads quando muitas pessoas acessam o mesmo URL. Veja bug
  • Os arquivos de conjuntos de dados originais podem ter sido atualizados. Neste caso, o construtor do conjunto de dados TFDS deve ser atualizado. Abra um novo problema ou RP do Github:
    • Registrar as novas somas de verificação com tfds build --register_checksums
    • Eventualmente, atualize o código de geração do conjunto de dados.
    • Atualizar o conjunto de dados VERSION
    • Atualizar o conjunto de dados RELEASE_NOTES : O que causou as somas de verificação para a mudança? Alguns exemplos mudaram?
    • Certifique-se de que o conjunto de dados ainda possa ser construído.
    • Envie-nos um PR

Citação

Se você estiver usando tensorflow-datasets para um papel, por favor inclua a seguinte citação, para além de qualquer citação específica para os conjuntos de dados usados (que podem ser encontradas no catálogo conjunto de dados ).

@misc{TFDS,
  title = { {TensorFlow Datasets}, A collection of ready-to-use datasets},
  howpublished = {\url{https://www.tensorflow.org/datasets} },
}