Datasets

Usage

import tensorflow as tf
import tensorflow_datasets as tfds

# See all registered datasets
tfds.list_builders()

# Load a given dataset by name, along with the DatasetInfo
data, info = tfds.load("mnist", with_info=True)
train_data, test_data = data['train'], data['test']
assert isinstance(train_data, tf.data.Dataset)
assert info.features['label'].num_classes == 10
assert info.splits['train'].num_examples == 60000

# You can also access a builder directly
builder = tfds.builder("mnist")
assert builder.info.splits['train'].num_examples == 60000
builder.download_and_prepare()
datasets = builder.as_dataset()

# If you need NumPy arrays
np_datasets = tfds.as_numpy(datasets)
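
Iterating over a dataset yields feature dictionaries; a minimal sketch using the MNIST data loaded above:

# Each example here is a dict of NumPy arrays.
for example in tfds.as_numpy(train_data.take(2)):
    image, label = example["image"], example["label"]
    print(image.shape, label)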

All Datasets


audio

"groove"

The Groove MIDI Dataset (GMD) is composed of 13.6 hours of aligned MIDI and (synthesized) audio of human-performed, tempo-aligned expressive drumming captured on a Roland TD-11 V-Drum electronic drum kit.

groove is configured with tfds.audio.groove.GrooveConfig and has the following configurations predefined (defaults to the first one); a loading example follows the list:

  • "full-midionly" (v1.0.0) (Size: 3.11 MiB): Groove dataset without audio, unsplit.

  • "full-16000hz" (v1.0.0) (Size: 4.76 GiB): Groove dataset with audio, unsplit.

  • "2bar-midionly" (v1.0.0) (Size: 3.11 MiB): Groove dataset without audio, split into 2-bar chunks.

  • "2bar-16000hz" (v1.0.0) (Size: 4.76 GiB): Groove dataset with audio, split into 2-bar chunks.

  • "4bar-midionly" (v1.0.0) (Size: 3.11 MiB): Groove dataset without audio, split into 4-bar chunks.

"groove/full-midionly"

FeaturesDict({
    'bpm': Tensor(shape=(), dtype=tf.int32),
    'drummer': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    'id': Tensor(shape=(), dtype=tf.string),
    'midi': Tensor(shape=(), dtype=tf.string),
    'style': FeaturesDict({
        'primary': ClassLabel(shape=(), dtype=tf.int64, num_classes=18),
        'secondary': Tensor(shape=(), dtype=tf.string),
    }),
    'time_signature': ClassLabel(shape=(), dtype=tf.int64, num_classes=5),
    'type': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

"groove/full-16000hz"

FeaturesDict({
    'audio': Tensor(shape=[None], dtype=tf.float32),
    'bpm': Tensor(shape=(), dtype=tf.int32),
    'drummer': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    'id': Tensor(shape=(), dtype=tf.string),
    'midi': Tensor(shape=(), dtype=tf.string),
    'style': FeaturesDict({
        'primary': ClassLabel(shape=(), dtype=tf.int64, num_classes=18),
        'secondary': Tensor(shape=(), dtype=tf.string),
    }),
    'time_signature': ClassLabel(shape=(), dtype=tf.int64, num_classes=5),
    'type': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

"groove/2bar-midionly"

FeaturesDict({
    'bpm': Tensor(shape=(), dtype=tf.int32),
    'drummer': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    'id': Tensor(shape=(), dtype=tf.string),
    'midi': Tensor(shape=(), dtype=tf.string),
    'style': FeaturesDict({
        'primary': ClassLabel(shape=(), dtype=tf.int64, num_classes=18),
        'secondary': Tensor(shape=(), dtype=tf.string),
    }),
    'time_signature': ClassLabel(shape=(), dtype=tf.int64, num_classes=5),
    'type': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

"groove/2bar-16000hz"

FeaturesDict({
    'audio': Tensor(shape=[None], dtype=tf.float32),
    'bpm': Tensor(shape=(), dtype=tf.int32),
    'drummer': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    'id': Tensor(shape=(), dtype=tf.string),
    'midi': Tensor(shape=(), dtype=tf.string),
    'style': FeaturesDict({
        'primary': ClassLabel(shape=(), dtype=tf.int64, num_classes=18),
        'secondary': Tensor(shape=(), dtype=tf.string),
    }),
    'time_signature': ClassLabel(shape=(), dtype=tf.int64, num_classes=5),
    'type': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

"groove/4bar-midionly"

FeaturesDict({
    'bpm': Tensor(shape=(), dtype=tf.int32),
    'drummer': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    'id': Tensor(shape=(), dtype=tf.string),
    'midi': Tensor(shape=(), dtype=tf.string),
    'style': FeaturesDict({
        'primary': ClassLabel(shape=(), dtype=tf.int64, num_classes=18),
        'secondary': Tensor(shape=(), dtype=tf.string),
    }),
    'time_signature': ClassLabel(shape=(), dtype=tf.int64, num_classes=5),
    'type': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

Statistics

Split Examples
ALL 21,415
TRAIN 17,261
VALIDATION 2,121
TEST 2,033

Urls

Supervised keys (for as_supervised=True)

None

Citation

@inproceedings{groove2019,
    Author = {Jon Gillick and Adam Roberts and Jesse Engel and Douglas Eck and David Bamman},
    Title = {Learning to Groove with Inverse Sequence Transformations},
    Booktitle   = {International Conference on Machine Learning (ICML)},
    Year = {2019},
}

"nsynth"

The NSynth Dataset is an audio dataset containing ~300k musical notes, each with a unique pitch, timbre, and envelope. Each note is annotated with three additional pieces of information based on a combination of human evaluation and heuristic algorithms:

  • Source: the method of sound production for the note's instrument.

  • Family: the high-level family of which the note's instrument is a member.

  • Qualities: sonic qualities of the note.

The dataset is split into train, valid, and test sets, with no instruments overlapping between the train set and the valid/test sets.

Features

FeaturesDict({
    'audio': Tensor(shape=(64000,), dtype=tf.float32),
    'id': Tensor(shape=(), dtype=tf.string),
    'instrument': FeaturesDict({
        'family': ClassLabel(shape=(), dtype=tf.int64, num_classes=11),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1006),
        'source': ClassLabel(shape=(), dtype=tf.int64, num_classes=3),
    }),
    'pitch': ClassLabel(shape=(), dtype=tf.int64, num_classes=128),
    'qualities': FeaturesDict({
        'bright': Tensor(shape=(), dtype=tf.bool),
        'dark': Tensor(shape=(), dtype=tf.bool),
        'distortion': Tensor(shape=(), dtype=tf.bool),
        'fast_decay': Tensor(shape=(), dtype=tf.bool),
        'long_release': Tensor(shape=(), dtype=tf.bool),
        'multiphonic': Tensor(shape=(), dtype=tf.bool),
        'nonlinear_env': Tensor(shape=(), dtype=tf.bool),
        'percussive': Tensor(shape=(), dtype=tf.bool),
        'reverb': Tensor(shape=(), dtype=tf.bool),
        'tempo-synced': Tensor(shape=(), dtype=tf.bool),
    }),
    'velocity': ClassLabel(shape=(), dtype=tf.int64, num_classes=128),
})
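
Since no supervised keys are defined (see below), training pairs can be built from the feature dict; a minimal sketch mapping examples to (audio, pitch) pairs:

import tensorflow_datasets as tfds

ds = tfds.load("nsynth", split="train")
# Build (audio, pitch) tuples from the features shown above.
ds = ds.map(lambda ex: (ex["audio"], ex["pitch"]))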

Statistics

Split Examples
ALL 305,979
TRAIN 289,205
VALID 12,678
TEST 4,096

Urls

Supervised keys (for as_supervised=True)

None

Citation

@InProceedings{pmlr-v70-engel17a,
  title =    {Neural Audio Synthesis of Musical Notes with {W}ave{N}et Autoencoders},
  author =   {Jesse Engel and Cinjon Resnick and Adam Roberts and Sander Dieleman and Mohammad Norouzi and Douglas Eck and Karen Simonyan},
  booktitle =    {Proceedings of the 34th International Conference on Machine Learning},
  pages =    {1068--1077},
  year =     {2017},
  editor =   {Doina Precup and Yee Whye Teh},
  volume =   {70},
  series =   {Proceedings of Machine Learning Research},
  address =      {International Convention Centre, Sydney, Australia},
  month =    {06--11 Aug},
  publisher =    {PMLR},
  pdf =      {http://proceedings.mlr.press/v70/engel17a/engel17a.pdf},
  url =      {http://proceedings.mlr.press/v70/engel17a.html},
}

image

"abstract_reasoning"

Procedurally Generated Matrices (PGM) data from the paper Measuring Abstract Reasoning in Neural Networks, Barrett, Hill, Santoro et al. 2018. The goal is to infer the correct answer from the context panels based on abstract reasoning.

To use this dataset, please download all the *.tar.gz files from the dataset page and place them in ~/tensorflow_datasets/abstract_reasoning/.
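
Once the archives are in place, the builder can be prepared as usual; a minimal sketch (assuming the default data directory above):

import tensorflow_datasets as tfds

builder = tfds.builder("abstract_reasoning")
# Reads the manually downloaded *.tar.gz archives.
builder.download_and_prepare()
ds = builder.as_dataset(split="train")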

$R$ denotes the set of relation types (progression, XOR, OR, AND, consistent union), $O$ denotes the object types (shape, line), and $A$ denotes the attribute types (size, colour, position, number). The structure of a matrix, $S$, is the set of triples $S = \{[r, o, a]\}$ that determine the challenge posed by a particular matrix.

abstract_reasoning is configured with tfds.image.abstract_reasoning.AbstractReasoningConfig and has the following configurations predefined (defaults to the first one):

  • "neutral" (v0.0.2) (Size: ?? GiB): The structures encoding the matrices in both the
    training and testing sets contain any triples $[r, o, a]$ for $r \in R$,
    $o \in O$, and $a \in A$. Training and testing sets are disjoint, with
    separation occurring at the level of the input variables (i.e. pixel
    manifestations).

  • "interpolation" (v0.0.2) (Size: ?? GiB): As in the neutral split, $S$ consisted of any
    triples $[r, o, a]$. For interpolation, in the training set, when the
    attribute was "colour" or "size" (i.e., the ordered attributes), the values of
    the attributes were restricted to even-indexed members of a discrete set,
    whereas in the test set only odd-indexed values were permitted. Note that all
    $S$ contained some triple $[r, o, a]$ with the colour or size attribute.
    Thus, generalisation is required for every question in the test set.

  • "extrapolation" (v0.0.2) (Size: ?? GiB): Same as in interpolation, but the values of
    the attributes were restricted to the lower half of the discrete set during
    training, whereas in the test set they took values in the upper half.

  • "attr.rel.pairs" (v0.0.2) (Size: ?? GiB): All $S$ contained at least two triples,
    $([r_1,o_1,a_1],[r_2,o_2,a_2]) = (t_1, t_2)$, of which 400 are viable. We
    randomly allocated 360 to the training set and 40 to the test set. Members
    $(t_1, t_2)$ of the 40 held-out pairs did not occur together in structures $S$
    in the training set, and all structures $S$ had at least one such pair
    $(t_1, t_2)$ as a subset.

  • "attr.rels" (v0.0.2) (Size: ?? GiB): In our dataset, there are 29 possible unique
    triples $[r,o,a]$. We allocated seven of these for the test set, at random,
    but such that each of the attributes was represented exactly once in this set.
    These held-out triples never occurred in questions in the training set, and
    every $S$ in the test set contained at least one of them.

  • "attrs.pairs" (v0.0.2) (Size: ?? GiB): $S$ contained at least two triples. There are 20
    (unordered) viable pairs of attributes $(a_1, a_2)$ such that for some
    $r_i, o_i, ([r_1,o_1,a_1],[r_2,o_2,a_2])$ is a viable triple pair
    $([r_1,o_1,a_1],[r_2,o_2,a_2]) = (t_1, t_2)$. We allocated 16 of these pairs
    for training and four for testing. For a pair $(a_1, a_2)$ in the test set,
    $S$ in the training set contained triples with $a_1$ or $a_2$. In the test
    set, all $S$ contained triples with $a_1$ and $a_2$.

  • "attrs.shape.color" (v0.0.2) (Size: ?? GiB): Held-out attribute shape-colour. $S$ in
    the training set contained no triples with $o$=shape and $a$=colour.
    All structures governing puzzles in the test set contained at least one triple
    with $o$=shape and $a$=colour.

  • "attrs.line.type" (v0.0.2) (Size: ?? GiB): Held-out attribute line-type. $S$ in
    the training set contained no triples with $o$=line and $a$=type.
    All structures governing puzzles in the test set contained at least one triple
    with $o$=line and $a$=type.

"abstract_reasoning/neutral"

FeaturesDict({
    'answers': Video(shape=(8, 160, 160, 1), dtype=tf.uint8, feature=Image(shape=(160, 160, 1), dtype=tf.uint8)),
    'context': Video(shape=(8, 160, 160, 1), dtype=tf.uint8, feature=Image(shape=(160, 160, 1), dtype=tf.uint8)),
    'filename': Text(shape=(), dtype=tf.string, encoder=None),
    'meta_target': Tensor(shape=[12], dtype=tf.int64),
    'relation_structure_encoded': Tensor(shape=[4, 12], dtype=tf.int64),
    'target': ClassLabel(shape=(), dtype=tf.int64, num_classes=8),
})

"abstract_reasoning/interpolation"

FeaturesDict({
    'answers': Video(shape=(8, 160, 160, 1), dtype=tf.uint8, feature=Image(shape=(160, 160, 1), dtype=tf.uint8)),
    'context': Video(shape=(8, 160, 160, 1), dtype=tf.uint8, feature=Image(shape=(160, 160, 1), dtype=tf.uint8)),
    'filename': Text(shape=(), dtype=tf.string, encoder=None),
    'meta_target': Tensor(shape=[12], dtype=tf.int64),
    'relation_structure_encoded': Tensor(shape=[4, 12], dtype=tf.int64),
    'target': ClassLabel(shape=(), dtype=tf.int64, num_classes=8),
})

"abstract_reasoning/extrapolation"

FeaturesDict({
    'answers': Video(shape=(8, 160, 160, 1), dtype=tf.uint8, feature=Image(shape=(160, 160, 1), dtype=tf.uint8)),
    'context': Video(shape=(8, 160, 160, 1), dtype=tf.uint8, feature=Image(shape=(160, 160, 1), dtype=tf.uint8)),
    'filename': Text(shape=(), dtype=tf.string, encoder=None),
    'meta_target': Tensor(shape=[12], dtype=tf.int64),
    'relation_structure_encoded': Tensor(shape=[4, 12], dtype=tf.int64),
    'target': ClassLabel(shape=(), dtype=tf.int64, num_classes=8),
})

"abstract_reasoning/attr.rel.pairs"

FeaturesDict({
    'answers': Video(shape=(8, 160, 160, 1), dtype=tf.uint8, feature=Image(shape=(160, 160, 1), dtype=tf.uint8)),
    'context': Video(shape=(8, 160, 160, 1), dtype=tf.uint8, feature=Image(shape=(160, 160, 1), dtype=tf.uint8)),
    'filename': Text(shape=(), dtype=tf.string, encoder=None),
    'meta_target': Tensor(shape=[12], dtype=tf.int64),
    'relation_structure_encoded': Tensor(shape=[4, 12], dtype=tf.int64),
    'target': ClassLabel(shape=(), dtype=tf.int64, num_classes=8),
})

"abstract_reasoning/attr.rels"

FeaturesDict({
    'answers': Video(shape=(8, 160, 160, 1), dtype=tf.uint8, feature=Image(shape=(160, 160, 1), dtype=tf.uint8)),
    'context': Video(shape=(8, 160, 160, 1), dtype=tf.uint8, feature=Image(shape=(160, 160, 1), dtype=tf.uint8)),
    'filename': Text(shape=(), dtype=tf.string, encoder=None),
    'meta_target': Tensor(shape=[12], dtype=tf.int64),
    'relation_structure_encoded': Tensor(shape=[4, 12], dtype=tf.int64),
    'target': ClassLabel(shape=(), dtype=tf.int64, num_classes=8),
})

"abstract_reasoning/attrs.pairs"

FeaturesDict({
    'answers': Video(shape=(8, 160, 160, 1), dtype=tf.uint8, feature=Image(shape=(160, 160, 1), dtype=tf.uint8)),
    'context': Video(shape=(8, 160, 160, 1), dtype=tf.uint8, feature=Image(shape=(160, 160, 1), dtype=tf.uint8)),
    'filename': Text(shape=(), dtype=tf.string, encoder=None),
    'meta_target': Tensor(shape=[12], dtype=tf.int64),
    'relation_structure_encoded': Tensor(shape=[4, 12], dtype=tf.int64),
    'target': ClassLabel(shape=(), dtype=tf.int64, num_classes=8),
})

"abstract_reasoning/attrs.shape.color"

FeaturesDict({
    'answers': Video(shape=(8, 160, 160, 1), dtype=tf.uint8, feature=Image(shape=(160, 160, 1), dtype=tf.uint8)),
    'context': Video(shape=(8, 160, 160, 1), dtype=tf.uint8, feature=Image(shape=(160, 160, 1), dtype=tf.uint8)),
    'filename': Text(shape=(), dtype=tf.string, encoder=None),
    'meta_target': Tensor(shape=[12], dtype=tf.int64),
    'relation_structure_encoded': Tensor(shape=[4, 12], dtype=tf.int64),
    'target': ClassLabel(shape=(), dtype=tf.int64, num_classes=8),
})

"abstract_reasoning/attrs.line.type"

FeaturesDict({
    'answers': Video(shape=(8, 160, 160, 1), dtype=tf.uint8, feature=Image(shape=(160, 160, 1), dtype=tf.uint8)),
    'context': Video(shape=(8, 160, 160, 1), dtype=tf.uint8, feature=Image(shape=(160, 160, 1), dtype=tf.uint8)),
    'filename': Text(shape=(), dtype=tf.string, encoder=None),
    'meta_target': Tensor(shape=[12], dtype=tf.int64),
    'relation_structure_encoded': Tensor(shape=[4, 12], dtype=tf.int64),
    'target': ClassLabel(shape=(), dtype=tf.int64, num_classes=8),
})

Statistics

None computed

Urls

Supervised keys (for as_supervised=True)

None

Citation

@InProceedings{pmlr-v80-barrett18a,
  title =    {Measuring abstract reasoning in neural networks},
  author =   {Barrett, David and Hill, Felix and Santoro, Adam and Morcos, Ari and Lillicrap, Timothy},
  booktitle =    {Proceedings of the 35th International Conference on Machine Learning},
  pages =    {511--520},
  year =     {2018},
  editor =   {Dy, Jennifer and Krause, Andreas},
  volume =   {80},
  series =   {Proceedings of Machine Learning Research},
  address =      {Stockholmsmassan, Stockholm Sweden},
  month =    {10--15 Jul},
  publisher =    {PMLR},
  pdf =      {http://proceedings.mlr.press/v80/barrett18a/barrett18a.pdf},
  url =      {http://proceedings.mlr.press/v80/barrett18a.html},
  abstract =     {Whether neural networks can learn abstract reasoning or whether they merely rely on superficial statistics is a topic of recent debate. Here, we propose a dataset and challenge designed to probe abstract reasoning, inspired by a well-known human IQ test. To succeed at this challenge, models must cope with various generalisation 'regimes' in which the training data and test questions differ in clearly-defined ways. We show that popular models such as ResNets perform poorly, even when the training and test sets differ only minimally, and we present a novel architecture, with structure designed to encourage reasoning, that does significantly better. When we vary the way in which the test questions and training data differ, we find that our model is notably proficient at certain forms of generalisation, but notably weak at others. We further show that the model's ability to generalise improves markedly if it is trained to predict symbolic explanations for its answers. Altogether, we introduce and explore ways to both measure and induce stronger abstract reasoning in neural networks. Our freely-available dataset should motivate further progress in this direction.}
}

"bigearthnet"

BigEarthNet is a new large-scale Sentinel-2 benchmark archive consisting of 590,326 Sentinel-2 image patches. Each patch covers 1.2 x 1.2 km on the ground, with image size varying by channel resolution. This is a multi-label dataset with 43 imbalanced labels.

To construct BigEarthNet, 125 Sentinel-2 tiles acquired between June 2017 and May 2018 over 10 European countries (Austria, Belgium, Finland, Ireland, Kosovo, Lithuania, Luxembourg, Portugal, Serbia, Switzerland) were initially selected. All tiles were atmospherically corrected by the Sentinel-2 Level 2A product generation and formatting tool (sen2cor), then divided into 590,326 non-overlapping image patches. Each image patch was annotated with the multiple land-cover classes (i.e., multi-labels) provided by the CORINE Land Cover database of the year 2018 (CLC 2018).

Bands and pixel resolution in meters:

  • B01: Coastal aerosol; 60m
  • B02: Blue; 10m
  • B03: Green; 10m
  • B04: Red; 10m
  • B05: Vegetation red edge; 20m
  • B06: Vegetation red edge; 20m
  • B07: Vegetation red edge; 20m
  • B08: NIR; 10m
  • B09: Water vapor; 60m
  • B11: SWIR; 20m
  • B12: SWIR; 20m
  • B8A: Narrow NIR; 20m

License: Community Data License Agreement - Permissive, Version 1.0.

URL: http://bigearth.net/

bigearthnet is configured with tfds.image.bigearthnet.BigearthnetConfig and has the following configurations predefined (defaults to the first one):

  • "rgb" (v0.0.2) (Size: ?? GiB): Sentinel-2 RGB channels

  • "all" (v0.0.2) (Size: ?? GiB): 13 Sentinel-2 channels

"bigearthnet/rgb"

FeaturesDict({
    'filename': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(120, 120, 3), dtype=tf.uint8),
    'labels': Sequence(shape=(None,), dtype=tf.int64, feature=ClassLabel(shape=(), dtype=tf.int64, num_classes=43)),
    'metadata': FeaturesDict({
        'acquisition_date': Text(shape=(), dtype=tf.string, encoder=None),
        'coordinates': FeaturesDict({
            'lrx': Tensor(shape=(), dtype=tf.int64),
            'lry': Tensor(shape=(), dtype=tf.int64),
            'ulx': Tensor(shape=(), dtype=tf.int64),
            'uly': Tensor(shape=(), dtype=tf.int64),
        }),
        'projection': Text(shape=(), dtype=tf.string, encoder=None),
        'tile_source': Text(shape=(), dtype=tf.string, encoder=None),
    }),
})

"bigearthnet/all"

FeaturesDict({
    'B01': Tensor(shape=[20, 20], dtype=tf.float32),
    'B02': Tensor(shape=[120, 120], dtype=tf.float32),
    'B03': Tensor(shape=[120, 120], dtype=tf.float32),
    'B04': Tensor(shape=[120, 120], dtype=tf.float32),
    'B05': Tensor(shape=[60, 60], dtype=tf.float32),
    'B06': Tensor(shape=[60, 60], dtype=tf.float32),
    'B07': Tensor(shape=[60, 60], dtype=tf.float32),
    'B08': Tensor(shape=[120, 120], dtype=tf.float32),
    'B09': Tensor(shape=[20, 20], dtype=tf.float32),
    'B11': Tensor(shape=[60, 60], dtype=tf.float32),
    'B12': Tensor(shape=[60, 60], dtype=tf.float32),
    'B8A': Tensor(shape=[60, 60], dtype=tf.float32),
    'filename': Text(shape=(), dtype=tf.string, encoder=None),
    'labels': Sequence(shape=(None,), dtype=tf.int64, feature=ClassLabel(shape=(), dtype=tf.int64, num_classes=43)),
    'metadata': FeaturesDict({
        'acquisition_date': Text(shape=(), dtype=tf.string, encoder=None),
        'coordinates': FeaturesDict({
            'lrx': Tensor(shape=(), dtype=tf.int64),
            'lry': Tensor(shape=(), dtype=tf.int64),
            'ulx': Tensor(shape=(), dtype=tf.int64),
            'uly': Tensor(shape=(), dtype=tf.int64),
        }),
        'projection': Text(shape=(), dtype=tf.string, encoder=None),
        'tile_source': Text(shape=(), dtype=tf.string, encoder=None),
    }),
})
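
As an illustration, the three 10m visible bands of the "all" config can be stacked into an RGB-like tensor; a sketch assuming the band features above and a "train" split:

import tensorflow as tf
import tensorflow_datasets as tfds

ds = tfds.load("bigearthnet/all", split="train")
# Stack B04 (red), B03 (green), B02 (blue) into a (120, 120, 3) tensor.
rgb = ds.map(lambda ex: tf.stack([ex["B04"], ex["B03"], ex["B02"]], axis=-1))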

Statistics

None computed

Urls

Supervised keys (for as_supervised=True)

None

Citation

@article{Sumbul2019BigEarthNetAL,
  title={BigEarthNet: A Large-Scale Benchmark Archive For Remote Sensing Image Understanding},
  author={Gencer Sumbul and Marcela Charfuelan and Beg{\"u}m Demir and Volker Markl},
  journal={CoRR},
  year={2019},
  volume={abs/1902.06148}
}

"caltech101"

Caltech-101 consists of pictures of objects belonging to 101 classes, plus one background clutter class. Each image is labelled with a single object. Each class contains roughly 40 to 800 images, totalling around 9k images. Images are of variable sizes, with typical edge lengths of 200-300 pixels. This version contains image-level labels only. The original dataset also contains bounding boxes.

Features

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'image/file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=102),
})

Statistics

Split Examples
ALL 9,801
TEST 6,741
TRAIN 3,060

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@article{FeiFei2004LearningGV,
  title={Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories},
  author={Li Fei-Fei and Rob Fergus and Pietro Perona},
  journal={Computer Vision and Pattern Recognition Workshop},
  year={2004},
}

"cats_vs_dogs"

A large set of images of cats and dogs. There are 1,738 corrupted images that are dropped.

Features

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

Statistics

Split Examples
TRAIN 23,262
ALL 23,262

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@inproceedings{asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization,
author = {Elson, Jeremy and Douceur, John (JD) and Howell, Jon and Saul, Jared},
title = {Asirra: A CAPTCHA that Exploits Interest-Aligned Manual Image Categorization},
booktitle = {Proceedings of 14th ACM Conference on Computer and Communications Security (CCS)},
year = {2007},
month = {October},
publisher = {Association for Computing Machinery, Inc.},
url = {https://www.microsoft.com/en-us/research/publication/asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization/},
edition = {Proceedings of 14th ACM Conference on Computer and Communications Security (CCS)},
}

"celeb_a"

CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations, including:

  • 10,177 identities,

  • 202,599 face images, and

  • 5 landmark locations and 40 binary attribute annotations per image.

The dataset can be employed as the training and test sets for the following computer vision tasks: face attribute recognition, face detection, and landmark (or facial part) localization.

Features

FeaturesDict({
    'attributes': FeaturesDict({
        '5_o_Clock_Shadow': Tensor(shape=(), dtype=tf.bool),
        'Arched_Eyebrows': Tensor(shape=(), dtype=tf.bool),
        'Attractive': Tensor(shape=(), dtype=tf.bool),
        'Bags_Under_Eyes': Tensor(shape=(), dtype=tf.bool),
        'Bald': Tensor(shape=(), dtype=tf.bool),
        'Bangs': Tensor(shape=(), dtype=tf.bool),
        'Big_Lips': Tensor(shape=(), dtype=tf.bool),
        'Big_Nose': Tensor(shape=(), dtype=tf.bool),
        'Black_Hair': Tensor(shape=(), dtype=tf.bool),
        'Blond_Hair': Tensor(shape=(), dtype=tf.bool),
        'Blurry': Tensor(shape=(), dtype=tf.bool),
        'Brown_Hair': Tensor(shape=(), dtype=tf.bool),
        'Bushy_Eyebrows': Tensor(shape=(), dtype=tf.bool),
        'Chubby': Tensor(shape=(), dtype=tf.bool),
        'Double_Chin': Tensor(shape=(), dtype=tf.bool),
        'Eyeglasses': Tensor(shape=(), dtype=tf.bool),
        'Goatee': Tensor(shape=(), dtype=tf.bool),
        'Gray_Hair': Tensor(shape=(), dtype=tf.bool),
        'Heavy_Makeup': Tensor(shape=(), dtype=tf.bool),
        'High_Cheekbones': Tensor(shape=(), dtype=tf.bool),
        'Male': Tensor(shape=(), dtype=tf.bool),
        'Mouth_Slightly_Open': Tensor(shape=(), dtype=tf.bool),
        'Mustache': Tensor(shape=(), dtype=tf.bool),
        'Narrow_Eyes': Tensor(shape=(), dtype=tf.bool),
        'No_Beard': Tensor(shape=(), dtype=tf.bool),
        'Oval_Face': Tensor(shape=(), dtype=tf.bool),
        'Pale_Skin': Tensor(shape=(), dtype=tf.bool),
        'Pointy_Nose': Tensor(shape=(), dtype=tf.bool),
        'Receding_Hairline': Tensor(shape=(), dtype=tf.bool),
        'Rosy_Cheeks': Tensor(shape=(), dtype=tf.bool),
        'Sideburns': Tensor(shape=(), dtype=tf.bool),
        'Smiling': Tensor(shape=(), dtype=tf.bool),
        'Straight_Hair': Tensor(shape=(), dtype=tf.bool),
        'Wavy_Hair': Tensor(shape=(), dtype=tf.bool),
        'Wearing_Earrings': Tensor(shape=(), dtype=tf.bool),
        'Wearing_Hat': Tensor(shape=(), dtype=tf.bool),
        'Wearing_Lipstick': Tensor(shape=(), dtype=tf.bool),
        'Wearing_Necklace': Tensor(shape=(), dtype=tf.bool),
        'Wearing_Necktie': Tensor(shape=(), dtype=tf.bool),
        'Young': Tensor(shape=(), dtype=tf.bool),
    }),
    'image': Image(shape=(218, 178, 3), dtype=tf.uint8),
    'landmarks': FeaturesDict({
        'lefteye_x': Tensor(shape=(), dtype=tf.int64),
        'lefteye_y': Tensor(shape=(), dtype=tf.int64),
        'leftmouth_x': Tensor(shape=(), dtype=tf.int64),
        'leftmouth_y': Tensor(shape=(), dtype=tf.int64),
        'nose_x': Tensor(shape=(), dtype=tf.int64),
        'nose_y': Tensor(shape=(), dtype=tf.int64),
        'righteye_x': Tensor(shape=(), dtype=tf.int64),
        'righteye_y': Tensor(shape=(), dtype=tf.int64),
        'rightmouth_x': Tensor(shape=(), dtype=tf.int64),
        'rightmouth_y': Tensor(shape=(), dtype=tf.int64),
    }),
})
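
A minimal sketch of filtering on one of the boolean attributes above:

import tensorflow_datasets as tfds

ds = tfds.load("celeb_a", split="train")
# Keep only examples whose 'Smiling' attribute is True.
smiling = ds.filter(lambda ex: ex["attributes"]["Smiling"])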

Statistics

Split Examples
ALL 202,599
TRAIN 162,770
TEST 19,962
VALIDATION 19,867

Urls

Supervised keys (for as_supervised=True)

None

Citation

@inproceedings{conf/iccv/LiuLWT15,
  added-at = {2018-10-09T00:00:00.000+0200},
  author = {Liu, Ziwei and Luo, Ping and Wang, Xiaogang and Tang, Xiaoou},
  biburl = {https://www.bibsonomy.org/bibtex/250e4959be61db325d2f02c1d8cd7bfbb/dblp},
  booktitle = {ICCV},
  crossref = {conf/iccv/2015},
  ee = {http://doi.ieeecomputersociety.org/10.1109/ICCV.2015.425},
  interhash = {3f735aaa11957e73914bbe2ca9d5e702},
  intrahash = {50e4959be61db325d2f02c1d8cd7bfbb},
  isbn = {978-1-4673-8391-2},
  keywords = {dblp},
  pages = {3730-3738},
  publisher = {IEEE Computer Society},
  timestamp = {2018-10-11T11:43:28.000+0200},
  title = {Deep Learning Face Attributes in the Wild.},
  url = {http://dblp.uni-trier.de/db/conf/iccv/iccv2015.html#LiuLWT15},
  year = 2015
}

"celeb_a_hq"

High-quality version of the CelebA dataset, consisting of 30,000 images at 1024 x 1024 resolution.

WARNING: This dataset currently requires you to prepare images on your own.

celeb_a_hq is configured with tfds.image.celebahq.CelebaHQConfig and has the following configurations predefined (defaults to the first one):

  • "1024" (v0.1.0) (Size: ?? GiB): CelebaHQ images in 1024 x 1024 resolution

  • "512" (v0.1.0) (Size: ?? GiB): CelebaHQ images in 512 x 512 resolution

  • "256" (v0.1.0) (Size: ?? GiB): CelebaHQ images in 256 x 256 resolution

  • "128" (v0.1.0) (Size: ?? GiB): CelebaHQ images in 128 x 128 resolution

  • "64" (v0.1.0) (Size: ?? GiB): CelebaHQ images in 64 x 64 resolution

  • "32" (v0.1.0) (Size: ?? GiB): CelebaHQ images in 32 x 32 resolution

  • "16" (v0.1.0) (Size: ?? GiB): CelebaHQ images in 16 x 16 resolution

  • "8" (v0.1.0) (Size: ?? GiB): CelebaHQ images in 8 x 8 resolution

  • "4" (v0.1.0) (Size: ?? GiB): CelebaHQ images in 4 x 4 resolution

  • "2" (v0.1.0) (Size: ?? GiB): CelebaHQ images in 2 x 2 resolution

  • "1" (v0.1.0) (Size: ?? GiB): CelebaHQ images in 1 x 1 resolution

"celeb_a_hq/1024"

FeaturesDict({
    'image': Image(shape=(1024, 1024, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
})

"celeb_a_hq/512"

FeaturesDict({
    'image': Image(shape=(512, 512, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
})

"celeb_a_hq/256"

FeaturesDict({
    'image': Image(shape=(256, 256, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
})

"celeb_a_hq/128"

FeaturesDict({
    'image': Image(shape=(128, 128, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
})

"celeb_a_hq/64"

FeaturesDict({
    'image': Image(shape=(64, 64, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
})

"celeb_a_hq/32"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
})

"celeb_a_hq/16"

FeaturesDict({
    'image': Image(shape=(16, 16, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
})

"celeb_a_hq/8"

FeaturesDict({
    'image': Image(shape=(8, 8, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
})

"celeb_a_hq/4"

FeaturesDict({
    'image': Image(shape=(4, 4, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
})

"celeb_a_hq/2"

FeaturesDict({
    'image': Image(shape=(2, 2, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
})

"celeb_a_hq/1"

FeaturesDict({
    'image': Image(shape=(1, 1, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
})

Statistics

Split Examples
TRAIN 30,000
ALL 30,000

Urls

Supervised keys (for as_supervised=True)

None

Citation

@article{DBLP:journals/corr/abs-1710-10196,
  author    = {Tero Karras and
               Timo Aila and
               Samuli Laine and
               Jaakko Lehtinen},
  title     = {Progressive Growing of GANs for Improved Quality, Stability, and Variation},
  journal   = {CoRR},
  volume    = {abs/1710.10196},
  year      = {2017},
  url       = {http://arxiv.org/abs/1710.10196},
  archivePrefix = {arXiv},
  eprint    = {1710.10196},
  timestamp = {Mon, 13 Aug 2018 16:46:42 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1710-10196},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

"cifar10"

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

Features

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

Statistics

Split Examples
ALL 60,000
TRAIN 50,000
TEST 10,000

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')
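
With supervised keys defined, as_supervised=True yields (image, label) tuples; a minimal sketch:

import tensorflow_datasets as tfds

ds = tfds.load("cifar10", split="train", as_supervised=True)
for image, label in tfds.as_numpy(ds.take(1)):
    print(image.shape, label)  # (32, 32, 3) and an integer class id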

Citation

@TECHREPORT{Krizhevsky09learningmultiple,
    author = {Alex Krizhevsky},
    title = {Learning multiple layers of features from tiny images},
    institution = {},
    year = {2009}
}

"cifar100"

This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).

Features

FeaturesDict({
    'coarse_label': ClassLabel(shape=(), dtype=tf.int64, num_classes=20),
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=100),
})

Statistics

Split Examples
ALL 60,000
TRAIN 50,000
TEST 10,000

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@TECHREPORT{Krizhevsky09learningmultiple,
    author = {Alex Krizhevsky},
    title = {Learning multiple layers of features from tiny images},
    institution = {},
    year = {2009}
}

"cifar10_corrupted"

Cifar10Corrupted is a dataset generated by adding 15 common corruptions to the test images of the CIFAR-10 dataset. This dataset wraps the corrupted CIFAR-10 test images uploaded by the original authors.

cifar10_corrupted is configured with tfds.image.cifar10_corrupted.Cifar10CorruptedConfig and has the following configurations predefined (defaults to the first one); a loading example follows the list:

  • "brightness_1" (v0.0.1) (Size: 2.72 GiB): Corruption method: brightness, severity level: 1

  • "brightness_2" (v0.0.1) (Size: 2.72 GiB): Corruption method: brightness, severity level: 2

  • "brightness_3" (v0.0.1) (Size: 2.72 GiB): Corruption method: brightness, severity level: 3

  • "brightness_4" (v0.0.1) (Size: 2.72 GiB): Corruption method: brightness, severity level: 4

  • "brightness_5" (v0.0.1) (Size: 2.72 GiB): Corruption method: brightness, severity level: 5

  • "contrast_1" (v0.0.1) (Size: 2.72 GiB): Corruption method: contrast, severity level: 1

  • "contrast_2" (v0.0.1) (Size: 2.72 GiB): Corruption method: contrast, severity level: 2

  • "contrast_3" (v0.0.1) (Size: 2.72 GiB): Corruption method: contrast, severity level: 3

  • "contrast_4" (v0.0.1) (Size: 2.72 GiB): Corruption method: contrast, severity level: 4

  • "contrast_5" (v0.0.1) (Size: 2.72 GiB): Corruption method: contrast, severity level: 5

  • "defocus_blur_1" (v0.0.1) (Size: 2.72 GiB): Corruption method: defocus_blur, severity level: 1

  • "defocus_blur_2" (v0.0.1) (Size: 2.72 GiB): Corruption method: defocus_blur, severity level: 2

  • "defocus_blur_3" (v0.0.1) (Size: 2.72 GiB): Corruption method: defocus_blur, severity level: 3

  • "defocus_blur_4" (v0.0.1) (Size: 2.72 GiB): Corruption method: defocus_blur, severity level: 4

  • "defocus_blur_5" (v0.0.1) (Size: 2.72 GiB): Corruption method: defocus_blur, severity level: 5

  • "elastic_1" (v0.0.1) (Size: 2.72 GiB): Corruption method: elastic, severity level: 1

  • "elastic_2" (v0.0.1) (Size: 2.72 GiB): Corruption method: elastic, severity level: 2

  • "elastic_3" (v0.0.1) (Size: 2.72 GiB): Corruption method: elastic, severity level: 3

  • "elastic_4" (v0.0.1) (Size: 2.72 GiB): Corruption method: elastic, severity level: 4

  • "elastic_5" (v0.0.1) (Size: 2.72 GiB): Corruption method: elastic, severity level: 5

  • "fog_1" (v0.0.1) (Size: 2.72 GiB): Corruption method: fog, severity level: 1

  • "fog_2" (v0.0.1) (Size: 2.72 GiB): Corruption method: fog, severity level: 2

  • "fog_3" (v0.0.1) (Size: 2.72 GiB): Corruption method: fog, severity level: 3

  • "fog_4" (v0.0.1) (Size: 2.72 GiB): Corruption method: fog, severity level: 4

  • "fog_5" (v0.0.1) (Size: 2.72 GiB): Corruption method: fog, severity level: 5

  • "frost_1" (v0.0.1) (Size: 2.72 GiB): Corruption method: frost, severity level: 1

  • "frost_2" (v0.0.1) (Size: 2.72 GiB): Corruption method: frost, severity level: 2

  • "frost_3" (v0.0.1) (Size: 2.72 GiB): Corruption method: frost, severity level: 3

  • "frost_4" (v0.0.1) (Size: 2.72 GiB): Corruption method: frost, severity level: 4

  • "frost_5" (v0.0.1) (Size: 2.72 GiB): Corruption method: frost, severity level: 5

  • "frosted_glass_blur_1" (v0.0.1) (Size: 2.72 GiB): Corruption method: frosted_glass_blur, severity level: 1

  • "frosted_glass_blur_2" (v0.0.1) (Size: 2.72 GiB): Corruption method: frosted_glass_blur, severity level: 2

  • "frosted_glass_blur_3" (v0.0.1) (Size: 2.72 GiB): Corruption method: frosted_glass_blur, severity level: 3

  • "frosted_glass_blur_4" (v0.0.1) (Size: 2.72 GiB): Corruption method: frosted_glass_blur, severity level: 4

  • "frosted_glass_blur_5" (v0.0.1) (Size: 2.72 GiB): Corruption method: frosted_glass_blur, severity level: 5

  • "gaussian_noise_1" (v0.0.1) (Size: 2.72 GiB): Corruption method: gaussian_noise, severity level: 1

  • "gaussian_noise_2" (v0.0.1) (Size: 2.72 GiB): Corruption method: gaussian_noise, severity level: 2

  • "gaussian_noise_3" (v0.0.1) (Size: 2.72 GiB): Corruption method: gaussian_noise, severity level: 3

  • "gaussian_noise_4" (v0.0.1) (Size: 2.72 GiB): Corruption method: gaussian_noise, severity level: 4

  • "gaussian_noise_5" (v0.0.1) (Size: 2.72 GiB): Corruption method: gaussian_noise, severity level: 5

  • "impulse_noise_1" (v0.0.1) (Size: 2.72 GiB): Corruption method: impulse_noise, severity level: 1

  • "impulse_noise_2" (v0.0.1) (Size: 2.72 GiB): Corruption method: impulse_noise, severity level: 2

  • "impulse_noise_3" (v0.0.1) (Size: 2.72 GiB): Corruption method: impulse_noise, severity level: 3

  • "impulse_noise_4" (v0.0.1) (Size: 2.72 GiB): Corruption method: impulse_noise, severity level: 4

  • "impulse_noise_5" (v0.0.1) (Size: 2.72 GiB): Corruption method: impulse_noise, severity level: 5

  • "jpeg_compression_1" (v0.0.1) (Size: 2.72 GiB): Corruption method: jpeg_compression, severity level: 1

  • "jpeg_compression_2" (v0.0.1) (Size: 2.72 GiB): Corruption method: jpeg_compression, severity level: 2

  • "jpeg_compression_3" (v0.0.1) (Size: 2.72 GiB): Corruption method: jpeg_compression, severity level: 3

  • "jpeg_compression_4" (v0.0.1) (Size: 2.72 GiB): Corruption method: jpeg_compression, severity level: 4

  • "jpeg_compression_5" (v0.0.1) (Size: 2.72 GiB): Corruption method: jpeg_compression, severity level: 5

  • "motion_blur_1" (v0.0.1) (Size: 2.72 GiB): Corruption method: motion_blur, severity level: 1

  • "motion_blur_2" (v0.0.1) (Size: 2.72 GiB): Corruption method: motion_blur, severity level: 2

  • "motion_blur_3" (v0.0.1) (Size: 2.72 GiB): Corruption method: motion_blur, severity level: 3

  • "motion_blur_4" (v0.0.1) (Size: 2.72 GiB): Corruption method: motion_blur, severity level: 4

  • "motion_blur_5" (v0.0.1) (Size: 2.72 GiB): Corruption method: motion_blur, severity level: 5

  • "pixelate_1" (v0.0.1) (Size: 2.72 GiB): Corruption method: pixelate, severity level: 1

  • "pixelate_2" (v0.0.1) (Size: 2.72 GiB): Corruption method: pixelate, severity level: 2

  • "pixelate_3" (v0.0.1) (Size: 2.72 GiB): Corruption method: pixelate, severity level: 3

  • "pixelate_4" (v0.0.1) (Size: 2.72 GiB): Corruption method: pixelate, severity level: 4

  • "pixelate_5" (v0.0.1) (Size: 2.72 GiB): Corruption method: pixelate, severity level: 5

  • "shot_noise_1" (v0.0.1) (Size: 2.72 GiB): Corruption method: shot_noise, severity level: 1

  • "shot_noise_2" (v0.0.1) (Size: 2.72 GiB): Corruption method: shot_noise, severity level: 2

  • "shot_noise_3" (v0.0.1) (Size: 2.72 GiB): Corruption method: shot_noise, severity level: 3

  • "shot_noise_4" (v0.0.1) (Size: 2.72 GiB): Corruption method: shot_noise, severity level: 4

  • "shot_noise_5" (v0.0.1) (Size: 2.72 GiB): Corruption method: shot_noise, severity level: 5

  • "snow_1" (v0.0.1) (Size: 2.72 GiB): Corruption method: snow, severity level: 1

  • "snow_2" (v0.0.1) (Size: 2.72 GiB): Corruption method: snow, severity level: 2

  • "snow_3" (v0.0.1) (Size: 2.72 GiB): Corruption method: snow, severity level: 3

  • "snow_4" (v0.0.1) (Size: 2.72 GiB): Corruption method: snow, severity level: 4

  • "snow_5" (v0.0.1) (Size: 2.72 GiB): Corruption method: snow, severity level: 5

  • "zoom_blur_1" (v0.0.1) (Size: 2.72 GiB): Corruption method: zoom_blur, severity level: 1

  • "zoom_blur_2" (v0.0.1) (Size: 2.72 GiB): Corruption method: zoom_blur, severity level: 2

  • "zoom_blur_3" (v0.0.1) (Size: 2.72 GiB): Corruption method: zoom_blur, severity level: 3

  • "zoom_blur_4" (v0.0.1) (Size: 2.72 GiB): Corruption method: zoom_blur, severity level: 4

  • "zoom_blur_5" (v0.0.1) (Size: 2.72 GiB): Corruption method: zoom_blur, severity level: 5

"cifar10_corrupted/brightness_1"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/brightness_2"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/brightness_3"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/brightness_4"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/brightness_5"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/contrast_1"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/contrast_2"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/contrast_3"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/contrast_4"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/contrast_5"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/defocus_blur_1"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/defocus_blur_2"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/defocus_blur_3"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/defocus_blur_4"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/defocus_blur_5"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/elastic_1"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/elastic_2"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/elastic_3"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/elastic_4"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/elastic_5"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/fog_1"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/fog_2"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/fog_3"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/fog_4"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/fog_5"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/frost_1"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/frost_2"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/frost_3"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/frost_4"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/frost_5"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/frosted_glass_blur_1"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/frosted_glass_blur_2"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/frosted_glass_blur_3"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/frosted_glass_blur_4"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/frosted_glass_blur_5"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/gaussian_noise_1"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/gaussian_noise_2"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/gaussian_noise_3"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/gaussian_noise_4"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/gaussian_noise_5"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/impulse_noise_1"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/impulse_noise_2"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/impulse_noise_3"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/impulse_noise_4"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/impulse_noise_5"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/jpeg_compression_1"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/jpeg_compression_2"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/jpeg_compression_3"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/jpeg_compression_4"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/jpeg_compression_5"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/motion_blur_1"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/motion_blur_2"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/motion_blur_3"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/motion_blur_4"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/motion_blur_5"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/pixelate_1"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/pixelate_2"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/pixelate_3"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/pixelate_4"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/pixelate_5"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/shot_noise_1"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/shot_noise_2"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/shot_noise_3"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/shot_noise_4"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/shot_noise_5"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/snow_1"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/snow_2"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/snow_3"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/snow_4"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/snow_5"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/zoom_blur_1"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/zoom_blur_2"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/zoom_blur_3"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/zoom_blur_4"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"cifar10_corrupted/zoom_blur_5"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

Statistics

Split Examples
TEST 10,000
ALL 10,000

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@inproceedings{
  hendrycks2018benchmarking,
  title={Benchmarking Neural Network Robustness to Common Corruptions and Perturbations},
  author={Dan Hendrycks and Thomas Dietterich},
  booktitle={International Conference on Learning Representations},
  year={2019},
  url={https://openreview.net/forum?id=HJz6tiCqYm},
}

"clevr"

CLEVR is a diagnostic dataset that tests a range of visual reasoning abilities. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires.

Features

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'objects': Sequence({'size': TensorInfo(shape=(None,), dtype=tf.int64), 'color': TensorInfo(shape=(None,), dtype=tf.int64), 'shape': TensorInfo(shape=(None,), dtype=tf.int64), '3d_coords': TensorInfo(shape=(None, 3), dtype=tf.float32), 'pixel_coords': TensorInfo(shape=(None, 3), dtype=tf.float32), 'material': TensorInfo(shape=(None,), dtype=tf.int64), 'rotation': TensorInfo(shape=(None,), dtype=tf.float32)}),
})
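
A sketch of inspecting the variable-length object annotations above:

import tensorflow_datasets as tfds

ds = tfds.load("clevr", split="train")
for ex in tfds.as_numpy(ds.take(1)):
    # One row per object in the scene.
    print(ex["objects"]["3d_coords"].shape)  # (num_objects, 3)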

Statistics

Split Examples
ALL 100,000
TRAIN 70,000
VALIDATION 15,000
TEST 15,000

Urls

Supervised keys (for as_supervised=True)

None

Citation

@inproceedings{johnson2017clevr,
  title={ {CLEVR}: A diagnostic dataset for compositional language and elementary visual reasoning},
  author={Johnson, Justin and Hariharan, Bharath and van der Maaten, Laurens and Fei-Fei, Li and Lawrence Zitnick, C and Girshick, Ross},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2017}
}

"coco2014"

COCO is a large-scale object detection, segmentation, and captioning dataset. This version contains images, bounding boxes, and labels for the 2014 release. Note:

  • Some images from the train and validation sets don't have annotations.

  • The test split doesn't have any annotations (only images).

  • COCO defines 91 classes, but the data only includes 80 classes.

Features

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
    'objects': Sequence({'is_crowd': TensorInfo(shape=(None,), dtype=tf.bool), 'bbox': TensorInfo(shape=(None, 4), dtype=tf.float32), 'label': TensorInfo(shape=(None,), dtype=tf.int64)}),
})
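
The 'objects' feature is a variable-length sequence per image. A hedged sketch of reading it, assuming the standard TFDS bounding-box convention of normalized [ymin, xmin, ymax, xmax] coordinates:

import tensorflow_datasets as tfds

ds = tfds.load("coco2014", split="train")
for ex in tfds.as_numpy(ds.take(1)):
    boxes = ex["objects"]["bbox"]     # float32 array, shape (num_objects, 4)
    labels = ex["objects"]["label"]   # int64 class ids (80 classes in the data)
    print(ex["image/filename"], boxes.shape, labels[:5])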

Statistics

Split Examples
ALL 245,496
TRAIN 82,783
TEST2015 81,434
TEST 40,775
VALIDATION 40,504

Urls

Supervised keys (for as_supervised=True)

None

Citation

@article{DBLP:journals/corr/LinMBHPRDZ14,
  author    = {Tsung{-}Yi Lin and
               Michael Maire and
               Serge J. Belongie and
               Lubomir D. Bourdev and
               Ross B. Girshick and
               James Hays and
               Pietro Perona and
               Deva Ramanan and
               Piotr Doll{\'{a}}r and
               C. Lawrence Zitnick},
  title     = {Microsoft {COCO:} Common Objects in Context},
  journal   = {CoRR},
  volume    = {abs/1405.0312},
  year      = {2014},
  url       = {http://arxiv.org/abs/1405.0312},
  archivePrefix = {arXiv},
  eprint    = {1405.0312},
  timestamp = {Mon, 13 Aug 2018 16:48:13 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/LinMBHPRDZ14},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

"colorectal_histology"

Classification of textures in colorectal cancer histology. Each example is a 150 x 150 x 3 RGB image of one of 8 classes.

Features

FeaturesDict({
    'filename': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(150, 150, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=8),
})

Statistics

Split Examples
TRAIN 5,000
ALL 5,000

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@article{kather2016multi,
  title={Multi-class texture analysis in colorectal cancer histology},
  author={Kather, Jakob Nikolas and Weis, Cleo-Aron and Bianconi, Francesco and Melchers, Susanne M and Schad, Lothar R and Gaiser, Timo and Marx, Alexander and Z{\"o}llner, Frank Gerrit},
  journal={Scientific reports},
  volume={6},
  pages={27988},
  year={2016},
  publisher={Nature Publishing Group}
}

"colorectal_histology_large"

10 large 5000 x 5000 pixel textured colorectal cancer histology images.

Features

FeaturesDict({
    'filename': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(5000, 5000, 3), dtype=tf.uint8),
})

Statistics

Split Examples
TEST 10
ALL 10

Urls

Supervised keys (for as_supervised=True)

None

Citation

@article{kather2016multi,
  title={Multi-class texture analysis in colorectal cancer histology},
  author={Kather, Jakob Nikolas and Weis, Cleo-Aron and Bianconi, Francesco and Melchers, Susanne M and Schad, Lothar R and Gaiser, Timo and Marx, Alexander and Z{\"o}llner, Frank Gerrit},
  journal={Scientific reports},
  volume={6},
  pages={27988},
  year={2016},
  publisher={Nature Publishing Group}
}

"curated_breast_imaging_ddsm"

The CBIS-DDSM (Curated Breast Imaging Subset of DDSM) is an updated and standardized version of the Digital Database for Screening Mammography (DDSM). The DDSM is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information.

The default config is made of patches extracted from the original mammograms, following the description from http://arxiv.org/abs/1708.09427, in order to frame the task as a traditional image classification problem.

Because special software and libraries are needed to download and read the images contained in the dataset, TFDS assumes that the user has downloaded the original DICOM files and converted them to PNG.

The following commands (or equivalent) should be used to generate the PNG files, in order to guarantee reproducible results:

# Convert each DICOM file to PNG in parallel (requires dcmtk's dcmj2pnm and
# ImageMagick's convert); $DATASET_DCIM_DIR points at the downloaded .dcm files.
find $DATASET_DCIM_DIR -name '*.dcm' |
xargs -n1 -P8 -I{} bash -c 'f={}; dcmj2pnm $f | convert - ${f/.dcm/.png}'

curated_breast_imaging_ddsm is configured with tfds.image.cbis_ddsm.CuratedBreastImagingDDSMConfig and has the following configurations predefined (defaults to the first one):

  • "patches" (v0.1.0) (Size: 2.01 MiB): Patches containing both calsification and mass cases, plus pathces with no abnormalities. Designed as a traditional 5-class classification task.

  • "original-calc" (v0.1.0) (Size: 1.06 MiB): Original images of the calcification cases compressed in lossless PNG.

  • "original-mass" (v0.1.0) (Size: 966.57 KiB): Original images of the mass cases compressed in lossless PNG.

"curated_breast_imaging_ddsm/patches"

FeaturesDict({
    'id': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=5),
})

"curated_breast_imaging_ddsm/original-calc"

FeaturesDict({
    'abnormalities': Sequence({'assessment': TensorInfo(shape=(None,), dtype=tf.int64), 'calc_distribution': TensorInfo(shape=(None,), dtype=tf.int64), 'calc_type': TensorInfo(shape=(None,), dtype=tf.int64), 'id': TensorInfo(shape=(None,), dtype=tf.int32), 'mask': TensorInfo(shape=(None, None, None, 1), dtype=tf.uint8), 'subtlety': TensorInfo(shape=(None,), dtype=tf.int64), 'pathology': TensorInfo(shape=(None,), dtype=tf.int64)}),
    'breast': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'id': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 1), dtype=tf.uint8),
    'patient': Text(shape=(), dtype=tf.string, encoder=None),
    'view': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

"curated_breast_imaging_ddsm/original-mass"

FeaturesDict({
    'abnormalities': Sequence({'assessment': TensorInfo(shape=(None,), dtype=tf.int64), 'mass_shape': TensorInfo(shape=(None,), dtype=tf.int64), 'id': TensorInfo(shape=(None,), dtype=tf.int32), 'mask': TensorInfo(shape=(None, None, None, 1), dtype=tf.uint8), 'subtlety': TensorInfo(shape=(None,), dtype=tf.int64), 'pathology': TensorInfo(shape=(None,), dtype=tf.int64), 'mass_margins': TensorInfo(shape=(None,), dtype=tf.int64)}),
    'breast': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'id': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 1), dtype=tf.uint8),
    'patient': Text(shape=(), dtype=tf.string, encoder=None),
    'view': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

Statistics

Split Examples
ALL 1,514
TRAIN 1,166
TEST 348

Urls

Supervised keys (for as_supervised=True)

None

Citation

@misc{CBIS_DDSM_Citation,
  doi = {10.7937/k9/tcia.2016.7o02s9cy},
  url = {https://wiki.cancerimagingarchive.net/x/lZNXAQ},
  author = {Sawyer-Lee,  Rebecca and Gimenez,  Francisco and Hoogi,  Assaf and Rubin,  Daniel},
  title = {Curated Breast Imaging Subset of DDSM},
  publisher = {The Cancer Imaging Archive},
  year = {2016},
}
@article{TCIA_Citation,
  author = {
    K. Clark and B. Vendt and K. Smith and J. Freymann and J. Kirby and
    P. Koppel and S. Moore and S. Phillips and D. Maffitt and M. Pringle and
    L. Tarbox and F. Prior
  },
  title = { {The Cancer Imaging Archive (TCIA): Maintaining and Operating a
  Public Information Repository}},
  journal = {Journal of Digital Imaging},
  volume = {26},
  month = {December},
  year = {2013},
  pages = {1045-1057},
}
@article{DBLP:journals/corr/abs-1708-09427,
  author    = {Li Shen},
  title     = {End-to-end Training for Whole Image Breast Cancer Diagnosis using
               An All Convolutional Design},
  journal   = {CoRR},
  volume    = {abs/1708.09427},
  year      = {2017},
  url       = {http://arxiv.org/abs/1708.09427},
  archivePrefix = {arXiv},
  eprint    = {1708.09427},
  timestamp = {Mon, 13 Aug 2018 16:48:35 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1708-09427},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

"cycle_gan"

A dataset with images from two classes (see the config name for information on the specific classes).

cycle_gan is configured with tfds.image.cycle_gan.CycleGANConfig and has the following configurations predefined (defaults to the first one):

  • "apple2orange" (v0.1.0) (Size: 74.82 MiB): A dataset consisting of images from two classes A and B (For example: horses/zebras, apple/orange,...)

  • "summer2winter_yosemite" (v0.1.0) (Size: 126.50 MiB): A dataset consisting of images from two classes A and B (For example: horses/zebras, apple/orange,...)

  • "horse2zebra" (v0.1.0) (Size: 111.45 MiB): A dataset consisting of images from two classes A and B (For example: horses/zebras, apple/orange,...)

  • "monet2photo" (v0.1.0) (Size: 291.09 MiB): A dataset consisting of images from two classes A and B (For example: horses/zebras, apple/orange,...)

  • "cezanne2photo" (v0.1.0) (Size: 266.92 MiB): A dataset consisting of images from two classes A and B (For example: horses/zebras, apple/orange,...)

  • "ukiyoe2photo" (v0.1.0) (Size: 279.38 MiB): A dataset consisting of images from two classes A and B (For example: horses/zebras, apple/orange,...)

  • "vangogh2photo" (v0.1.0) (Size: 292.39 MiB): A dataset consisting of images from two classes A and B (For example: horses/zebras, apple/orange,...)

  • "maps" (v0.1.0) (Size: 1.38 GiB): A dataset consisting of images from two classes A and B (For example: horses/zebras, apple/orange,...)

  • "cityscapes" (v0.1.0) (Size: 266.65 MiB): A dataset consisting of images from two classes A and B (For example: horses/zebras, apple/orange,...)

  • "facades" (v0.1.0) (Size: 33.51 MiB): A dataset consisting of images from two classes A and B (For example: horses/zebras, apple/orange,...)

  • "iphone2dslr_flower" (v0.1.0) (Size: 324.22 MiB): A dataset consisting of images from two classes A and B (For example: horses/zebras, apple/orange,...)

"cycle_gan/apple2orange"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

"cycle_gan/summer2winter_yosemite"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

"cycle_gan/horse2zebra"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

"cycle_gan/monet2photo"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

"cycle_gan/cezanne2photo"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

"cycle_gan/ukiyoe2photo"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

"cycle_gan/vangogh2photo"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

"cycle_gan/maps"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

"cycle_gan/cityscapes"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

"cycle_gan/facades"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

"cycle_gan/iphone2dslr_flower"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

Statistics

Split Examples
ALL 6,186
TRAINB 3,325
TRAINA 1,812
TESTA 569
TESTB 480

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')
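
The splits are per-domain rather than train/test, so each side of an image pair is loaded by name. A minimal sketch, assuming the split names match the statistics above:

import tensorflow_datasets as tfds

# Domain A and domain B are separate splits of the same config.
train_a = tfds.load("cycle_gan/horse2zebra", split="trainA", as_supervised=True)
train_b = tfds.load("cycle_gan/horse2zebra", split="trainB", as_supervised=True)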


"diabetic_retinopathy_detection"

A large set of high-resolution retina images taken under a variety of imaging conditions.

diabetic_retinopathy_detection is configured with tfds.image.diabetic_retinopathy_detection.DiabeticRetinopathyDetectionConfig and has the following configurations predefined (defaults to the first one):

  • "original" (v2.0.0) (Size: 1.13 MiB): Images at their original resolution and quality.

  • "1M" (v2.1.0) (Size: 1.13 MiB): Images have roughly 1,000,000 pixels, at 72 quality.

  • "250K" (v2.1.0) (Size: 1.13 MiB): Images have roughly 250,000 pixels, at 72 quality.

"diabetic_retinopathy_detection/original"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=5),
    'name': Text(shape=(), dtype=tf.string, encoder=None),
})

"diabetic_retinopathy_detection/1M"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=5),
    'name': Text(shape=(), dtype=tf.string, encoder=None),
})

"diabetic_retinopathy_detection/250K"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=5),
    'name': Text(shape=(), dtype=tf.string, encoder=None),
})

Statistics

Split Examples
ALL 88,712
TEST 42,670
TRAIN 35,126
VALIDATION 10,906
SAMPLE 10

Urls

Supervised keys (for as_supervised=True)

None

Citation

@ONLINE {kaggle-diabetic-retinopathy,
    author = "Kaggle and EyePacs",
    title  = "Kaggle Diabetic Retinopathy Detection",
    month  = "jul",
    year   = "2015",
    url    = "https://www.kaggle.com/c/diabetic-retinopathy-detection/data"
}

"downsampled_imagenet"

A dataset with images at two resolutions (see the config name for the resolution). It is used for density estimation and generative modeling experiments.

downsampled_imagenet is configured with tfds.image.downsampled_imagenet.DownsampledImagenetConfig and has the following configurations predefined (defaults to the first one):

  • "32x32" (v0.1.0) (Size: ?? GiB): A dataset consisting of Train and Validation images of 32x32 resolution.

  • "64x64" (v0.1.0) (Size: ?? GiB): A dataset consisting of Train and Validation images of 64x64 resolution.

"downsampled_imagenet/32x32"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
})

"downsampled_imagenet/64x64"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
})

Statistics

None computed

Urls

Supervised keys (for as_supervised=True)

None


"dsprites"

dSprites is a dataset of 2D shapes procedurally generated from 6 ground-truth independent latent factors. These factors are color, shape, scale, rotation, and the x and y positions of a sprite.

All possible combinations of these latents are present exactly once, generating N = 737280 total images.

Latent factor values

  • Color: white
  • Shape: square, ellipse, heart
  • Scale: 6 values linearly spaced in [0.5, 1]
  • Orientation: 40 values in [0, 2 pi]
  • Position X: 32 values in [0, 1]
  • Position Y: 32 values in [0, 1]

We varied one latent at a time (starting from Position Y, then Position X, etc), and sequentially stored the images in fixed order. Hence the order along the first dimension is fixed and allows you to map back to the value of the latents corresponding to that image.

We chose the latents values deliberately to have the smallest step changes while ensuring that all pixel outputs were different. No noise was added.
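
Based on the description above, an index into the generated archive can be decoded back into its latent class indices. A hypothetical sketch, assuming the nesting order color (slowest) through position Y (fastest); note the index refers to the original archive order, which TFDS may not preserve after shuffling during preparation:

import numpy as np

# Latent sizes from slowest- to fastest-varying:
# color, shape, scale, orientation, pos_x, pos_y.
LATENT_SIZES = np.array([1, 3, 6, 40, 32, 32])

def index_to_latents(index):
    # bases[i] = product of the sizes of all faster-varying latents.
    bases = np.concatenate([np.cumprod(LATENT_SIZES[::-1])[::-1][1:], [1]])
    return (index // bases) % LATENT_SIZES

print(index_to_latents(0))       # [0 0 0 0 0 0]
print(index_to_latents(737279))  # [0 2 5 39 31 31]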

Features

FeaturesDict({
    'image': Image(shape=(64, 64, 1), dtype=tf.uint8),
    'label_orientation': ClassLabel(shape=(), dtype=tf.int64, num_classes=40),
    'label_scale': ClassLabel(shape=(), dtype=tf.int64, num_classes=6),
    'label_shape': ClassLabel(shape=(), dtype=tf.int64, num_classes=3),
    'label_x_position': ClassLabel(shape=(), dtype=tf.int64, num_classes=32),
    'label_y_position': ClassLabel(shape=(), dtype=tf.int64, num_classes=32),
    'value_orientation': Tensor(shape=[], dtype=tf.float32),
    'value_scale': Tensor(shape=[], dtype=tf.float32),
    'value_shape': Tensor(shape=[], dtype=tf.float32),
    'value_x_position': Tensor(shape=[], dtype=tf.float32),
    'value_y_position': Tensor(shape=[], dtype=tf.float32),
})

Statistics

Split Examples
TRAIN 737,280
ALL 737,280

Urls

Supervised keys (for as_supervised=True)

None

Citation

@misc{dsprites17,
author = {Loic Matthey and Irina Higgins and Demis Hassabis and Alexander Lerchner},
title = {dSprites: Disentanglement testing Sprites dataset},
howpublished= {https://github.com/deepmind/dsprites-dataset/},
year = "2017",
}

"dtd"

The Describable Textures Dataset (DTD) is an evolving collection of textural images in the wild, annotated with a series of human-centric attributes, inspired by the perceptual properties of textures. This data is made available to the computer vision community for research purposes.

The "label" of each example is its "key attribute" (see the official website). The official release of the dataset defines a 10-fold cross-validation partition. Our TRAIN/TEST/VALIDATION splits are those of the first fold.

Features

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=47),
})

Statistics

Split Examples
ALL 5,640
VALIDATION 1,880
TRAIN 1,880
TEST 1,880

Urls

Supervised keys (for as_supervised=True)

None

Citation

@InProceedings{cimpoi14describing,
Author    = {M. Cimpoi and S. Maji and I. Kokkinos and S. Mohamed and A. Vedaldi},
Title     = {Describing Textures in the Wild},
Booktitle = {Proceedings of the {IEEE} Conf. on Computer Vision and Pattern Recognition ({CVPR})},
Year      = {2014}}

"emnist"

The EMNIST dataset is a set of handwritten characters and digits derived from the NIST Special Database 19 and converted to a 28x28 pixel image format and dataset structure that directly matches the MNIST dataset.

emnist is configured with tfds.image.mnist.EMNISTConfig and has the following configurations predefined (defaults to the first one):

  • "byclass" (v1.0.1) (Size: 535.73 MiB): EMNIST ByClass

  • "bymerge" (v1.0.1) (Size: 535.73 MiB): EMNIST ByMerge

  • "balanced" (v1.0.1) (Size: 535.73 MiB): EMNIST Balanced

  • "letters" (v1.0.1) (Size: 535.73 MiB): EMNIST Letters

  • "digits" (v1.0.1) (Size: 535.73 MiB): EMNIST Digits

  • "mnist" (v1.0.1) (Size: 535.73 MiB): EMNIST MNIST

"emnist/byclass"

FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=62),
})

"emnist/bymerge"

FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=47),
})

"emnist/balanced"

FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=47),
})

"emnist/letters"

FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=37),
})

"emnist/digits"

FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"emnist/mnist"

FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

Statistics

Split Examples
ALL 70,000
TRAIN 60,000
TEST 10,000

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@article{cohen_afshar_tapson_schaik_2017, 
    title={EMNIST: Extending MNIST to handwritten letters}, 
    DOI={10.1109/ijcnn.2017.7966217}, 
    journal={2017 International Joint Conference on Neural Networks (IJCNN)}, 
    author={Cohen, Gregory and Afshar, Saeed and Tapson, Jonathan and Schaik, Andre Van}, 
    year={2017}
}

"eurosat"

The EuroSAT dataset is based on Sentinel-2 satellite images covering 13 spectral bands and consists of 10 classes with 27,000 labeled and geo-referenced samples.

Two datasets are offered:

  • rgb: Contains only the optical R, G, B frequency bands encoded as JPEG images.

  • all: Contains all 13 bands in the original value range (float32).

URL: https://github.com/phelber/eurosat

eurosat is configured with tfds.image.eurosat.EurosatConfig and has the following configurations predefined (defaults to the first one):

  • "rgb" (v0.0.1) (Size: ?? GiB): Sentinel-2 RGB channels

  • "all" (v0.0.1) (Size: ?? GiB): 13 Sentinel-2 channels

"eurosat/rgb"

FeaturesDict({
    'filename': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(64, 64, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

"eurosat/all"

FeaturesDict({
    'filename': Text(shape=(), dtype=tf.string, encoder=None),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    'sentinel2': Tensor(shape=[64, 64, 13], dtype=tf.float32),
})

Statistics

None computed

Urls

Supervised keys (for as_supervised=True)

(u'sentinel2', u'label')
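
Note that the supervised keys above use 'sentinel2', matching the "all" config; the "rgb" config exposes a standard 'image' feature instead. A hedged sketch of inspecting both:

import tensorflow_datasets as tfds

rgb, rgb_info = tfds.load("eurosat/rgb", with_info=True)
all_bands, all_info = tfds.load("eurosat/all", with_info=True)
print(rgb_info.features["image"].shape)      # (64, 64, 3)
print(all_info.features["sentinel2"].shape)  # (64, 64, 13)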

Citation

@misc{helber2017eurosat,
    title={EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification},
    author={Patrick Helber and Benjamin Bischke and Andreas Dengel and Damian Borth},
    year={2017},
    eprint={1709.00029},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

"fashion_mnist"

Fashion-MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.

Features

FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

Statistics

Split Examples
ALL 70,000
TRAIN 60,000
TEST 10,000

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@article{DBLP:journals/corr/abs-1708-07747,
  author    = {Han Xiao and
               Kashif Rasul and
               Roland Vollgraf},
  title     = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning
               Algorithms},
  journal   = {CoRR},
  volume    = {abs/1708.07747},
  year      = {2017},
  url       = {http://arxiv.org/abs/1708.07747},
  archivePrefix = {arXiv},
  eprint    = {1708.07747},
  timestamp = {Mon, 13 Aug 2018 16:47:27 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1708-07747},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

"horses_or_humans"

A large set of images of horses and humans.

Features

FeaturesDict({
    'image': Image(shape=(300, 300, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

Statistics

Split Examples
ALL 1,283
TRAIN 1,027
TEST 256

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@ONLINE {horses_or_humans,
author = "Laurence Moroney",
title = "Horses or Humans Dataset",
month = "feb",
year = "2019",
url = "http://laurencemoroney.com/horses-or-humans-dataset"
}

"image_label_folder"

Generic image classification dataset.

Features

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=None),
})

Statistics

None computed

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')


"imagenet2012"

ILSVRC 2012, aka ImageNet, is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, the majority of them nouns (80,000+). In ImageNet, we aim to provide on average 1,000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. Upon its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.

Features

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

Statistics

Split Examples
ALL 1,331,167
TRAIN 1,281,167
VALIDATION 50,000

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@article{ILSVRC15,
Author = {Olga Russakovsky and Jia Deng and Hao Su and Jonathan Krause and Sanjeev Satheesh and Sean Ma and Zhiheng Huang and Andrej Karpathy and Aditya Khosla and Michael Bernstein and Alexander C. Berg and Li Fei-Fei},
Title = { {ImageNet Large Scale Visual Recognition Challenge}},
Year = {2015},
journal   = {International Journal of Computer Vision (IJCV)},
doi = {10.1007/s11263-015-0816-y},
volume={115},
number={3},
pages={211-252}
}

"imagenet2012_corrupted"

Imagenet2012Corrupted is a dataset generated by adding common corruptions to the validation images in the ImageNet dataset. The original paper defines 15 different corruptions, each with 5 levels of severity. This dataset implements 12 of the 15: Gaussian noise, shot noise, impulse noise, defocus blur, frosted glass blur, zoom blur, fog, brightness, contrast, elastic, pixelate, and JPEG compression. The randomness is fixed so that regeneration is deterministic (see the loading sketch after the config list below).

imagenet2012_corrupted is configured with tfds.image.imagenet2012_corrupted.Imagenet2012CorruptedConfig and has the following configurations predefined (defaults to the first one):

  • "gaussian_noise_1" (v0.0.1) (Size: ?? GiB): corruption type = gaussian_noise, severity = 1

  • "gaussian_noise_2" (v0.0.1) (Size: ?? GiB): corruption type = gaussian_noise, severity = 2

  • "gaussian_noise_3" (v0.0.1) (Size: ?? GiB): corruption type = gaussian_noise, severity = 3

  • "gaussian_noise_4" (v0.0.1) (Size: ?? GiB): corruption type = gaussian_noise, severity = 4

  • "gaussian_noise_5" (v0.0.1) (Size: ?? GiB): corruption type = gaussian_noise, severity = 5

  • "shot_noise_1" (v0.0.1) (Size: ?? GiB): corruption type = shot_noise, severity = 1

  • "shot_noise_2" (v0.0.1) (Size: ?? GiB): corruption type = shot_noise, severity = 2

  • "shot_noise_3" (v0.0.1) (Size: ?? GiB): corruption type = shot_noise, severity = 3

  • "shot_noise_4" (v0.0.1) (Size: ?? GiB): corruption type = shot_noise, severity = 4

  • "shot_noise_5" (v0.0.1) (Size: ?? GiB): corruption type = shot_noise, severity = 5

  • "impulse_noise_1" (v0.0.1) (Size: ?? GiB): corruption type = impulse_noise, severity = 1

  • "impulse_noise_2" (v0.0.1) (Size: ?? GiB): corruption type = impulse_noise, severity = 2

  • "impulse_noise_3" (v0.0.1) (Size: ?? GiB): corruption type = impulse_noise, severity = 3

  • "impulse_noise_4" (v0.0.1) (Size: ?? GiB): corruption type = impulse_noise, severity = 4

  • "impulse_noise_5" (v0.0.1) (Size: ?? GiB): corruption type = impulse_noise, severity = 5

  • "defocus_blur_1" (v0.0.1) (Size: ?? GiB): corruption type = defocus_blur, severity = 1

  • "defocus_blur_2" (v0.0.1) (Size: ?? GiB): corruption type = defocus_blur, severity = 2

  • "defocus_blur_3" (v0.0.1) (Size: ?? GiB): corruption type = defocus_blur, severity = 3

  • "defocus_blur_4" (v0.0.1) (Size: ?? GiB): corruption type = defocus_blur, severity = 4

  • "defocus_blur_5" (v0.0.1) (Size: ?? GiB): corruption type = defocus_blur, severity = 5

  • "frosted_glass_blur_1" (v0.0.1) (Size: ?? GiB): corruption type = frosted_glass_blur, severity = 1

  • "frosted_glass_blur_2" (v0.0.1) (Size: ?? GiB): corruption type = frosted_glass_blur, severity = 2

  • "frosted_glass_blur_3" (v0.0.1) (Size: ?? GiB): corruption type = frosted_glass_blur, severity = 3

  • "frosted_glass_blur_4" (v0.0.1) (Size: ?? GiB): corruption type = frosted_glass_blur, severity = 4

  • "frosted_glass_blur_5" (v0.0.1) (Size: ?? GiB): corruption type = frosted_glass_blur, severity = 5

  • "zoom_blur_1" (v0.0.1) (Size: ?? GiB): corruption type = zoom_blur, severity = 1

  • "zoom_blur_2" (v0.0.1) (Size: ?? GiB): corruption type = zoom_blur, severity = 2

  • "zoom_blur_3" (v0.0.1) (Size: ?? GiB): corruption type = zoom_blur, severity = 3

  • "zoom_blur_4" (v0.0.1) (Size: ?? GiB): corruption type = zoom_blur, severity = 4

  • "zoom_blur_5" (v0.0.1) (Size: ?? GiB): corruption type = zoom_blur, severity = 5

  • "fog_1" (v0.0.1) (Size: ?? GiB): corruption type = fog, severity = 1

  • "fog_2" (v0.0.1) (Size: ?? GiB): corruption type = fog, severity = 2

  • "fog_3" (v0.0.1) (Size: ?? GiB): corruption type = fog, severity = 3

  • "fog_4" (v0.0.1) (Size: ?? GiB): corruption type = fog, severity = 4

  • "fog_5" (v0.0.1) (Size: ?? GiB): corruption type = fog, severity = 5

  • "brightness_1" (v0.0.1) (Size: ?? GiB): corruption type = brightness, severity = 1

  • "brightness_2" (v0.0.1) (Size: ?? GiB): corruption type = brightness, severity = 2

  • "brightness_3" (v0.0.1) (Size: ?? GiB): corruption type = brightness, severity = 3

  • "brightness_4" (v0.0.1) (Size: ?? GiB): corruption type = brightness, severity = 4

  • "brightness_5" (v0.0.1) (Size: ?? GiB): corruption type = brightness, severity = 5

  • "contrast_1" (v0.0.1) (Size: ?? GiB): corruption type = contrast, severity = 1

  • "contrast_2" (v0.0.1) (Size: ?? GiB): corruption type = contrast, severity = 2

  • "contrast_3" (v0.0.1) (Size: ?? GiB): corruption type = contrast, severity = 3

  • "contrast_4" (v0.0.1) (Size: ?? GiB): corruption type = contrast, severity = 4

  • "contrast_5" (v0.0.1) (Size: ?? GiB): corruption type = contrast, severity = 5

  • "elastic_1" (v0.0.1) (Size: ?? GiB): corruption type = elastic, severity = 1

  • "elastic_2" (v0.0.1) (Size: ?? GiB): corruption type = elastic, severity = 2

  • "elastic_3" (v0.0.1) (Size: ?? GiB): corruption type = elastic, severity = 3

  • "elastic_4" (v0.0.1) (Size: ?? GiB): corruption type = elastic, severity = 4

  • "elastic_5" (v0.0.1) (Size: ?? GiB): corruption type = elastic, severity = 5

  • "pixelate_1" (v0.0.1) (Size: ?? GiB): corruption type = pixelate, severity = 1

  • "pixelate_2" (v0.0.1) (Size: ?? GiB): corruption type = pixelate, severity = 2

  • "pixelate_3" (v0.0.1) (Size: ?? GiB): corruption type = pixelate, severity = 3

  • "pixelate_4" (v0.0.1) (Size: ?? GiB): corruption type = pixelate, severity = 4

  • "pixelate_5" (v0.0.1) (Size: ?? GiB): corruption type = pixelate, severity = 5

  • "jpeg_compression_1" (v0.0.1) (Size: ?? GiB): corruption type = jpeg_compression, severity = 1

  • "jpeg_compression_2" (v0.0.1) (Size: ?? GiB): corruption type = jpeg_compression, severity = 2

  • "jpeg_compression_3" (v0.0.1) (Size: ?? GiB): corruption type = jpeg_compression, severity = 3

  • "jpeg_compression_4" (v0.0.1) (Size: ?? GiB): corruption type = jpeg_compression, severity = 4

  • "jpeg_compression_5" (v0.0.1) (Size: ?? GiB): corruption type = jpeg_compression, severity = 5

"imagenet2012_corrupted/gaussian_noise_1"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/gaussian_noise_2"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/gaussian_noise_3"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/gaussian_noise_4"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/gaussian_noise_5"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/shot_noise_1"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/shot_noise_2"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/shot_noise_3"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/shot_noise_4"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/shot_noise_5"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/impulse_noise_1"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/impulse_noise_2"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/impulse_noise_3"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/impulse_noise_4"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/impulse_noise_5"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/defocus_blur_1"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/defocus_blur_2"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/defocus_blur_3"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/defocus_blur_4"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/defocus_blur_5"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/frosted_glass_blur_1"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/frosted_glass_blur_2"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/frosted_glass_blur_3"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/frosted_glass_blur_4"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/frosted_glass_blur_5"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/zoom_blur_1"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/zoom_blur_2"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/zoom_blur_3"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/zoom_blur_4"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/zoom_blur_5"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/fog_1"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/fog_2"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/fog_3"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/fog_4"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/fog_5"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/brightness_1"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/brightness_2"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/brightness_3"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/brightness_4"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/brightness_5"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/contrast_1"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/contrast_2"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/contrast_3"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/contrast_4"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/contrast_5"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/elastic_1"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/elastic_2"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/elastic_3"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/elastic_4"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/elastic_5"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/pixelate_1"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/pixelate_2"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/pixelate_3"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/pixelate_4"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/pixelate_5"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/jpeg_compression_1"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/jpeg_compression_2"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/jpeg_compression_3"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/jpeg_compression_4"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

"imagenet2012_corrupted/jpeg_compression_5"

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1000),
})

Statistics

Split Examples
VALIDATION 50,000
ALL 50,000

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@inproceedings{
  hendrycks2018benchmarking,
  title={Benchmarking Neural Network Robustness to Common Corruptions and Perturbations},
  author={Dan Hendrycks and Thomas Dietterich},
  booktitle={International Conference on Learning Representations},
  year={2019},
  url={https://openreview.net/forum?id=HJz6tiCqYm},
}

"kmnist"

Kuzushiji-MNIST is a drop-in replacement for the MNIST dataset (28x28 grayscale, 70,000 images), provided in the original MNIST format as well as a NumPy format. Since MNIST restricts us to 10 classes, we chose one character to represent each of the 10 rows of Hiragana when creating Kuzushiji-MNIST.

Features

FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

Statistics

Split Examples
ALL 70,000
TRAIN 60,000
TEST 10,000

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@online{clanuwat2018deep,
  author       = {Tarin Clanuwat and Mikel Bober-Irizar and Asanobu Kitamoto and Alex Lamb and Kazuaki Yamamoto and David Ha},
  title        = {Deep Learning for Classical Japanese Literature},
  date         = {2018-12-03},
  year         = {2018},
  eprintclass  = {cs.CV},
  eprinttype   = {arXiv},
  eprint       = {cs.CV/1812.01718},
}

"lsun"

Large-scale images showing different objects from given categories like bedroom, tower, etc.

lsun is configured with tfds.image.lsun.BuilderConfig and has the following configurations predefined (defaults to the first one):

  • "classroom" (v0.1.1) (Size: 3.06 GiB): Images of category classroom

  • "bedroom" (v0.1.1) (Size: 42.77 GiB): Images of category bedroom

  • "bridge" (v0.1.1) (Size: 15.35 GiB): Images of category bridge

  • "church_outdoor" (v0.1.1) (Size: 2.29 GiB): Images of category church_outdoor

  • "conference_room" (v0.1.1) (Size: 3.78 GiB): Images of category conference_room

  • "dining_room" (v0.1.1) (Size: 10.80 GiB): Images of category dining_room

  • "kitchen" (v0.1.1) (Size: 33.34 GiB): Images of category kitchen

  • "living_room" (v0.1.1) (Size: 21.23 GiB): Images of category living_room

  • "restaurant" (v0.1.1) (Size: 12.57 GiB): Images of category restaurant

  • "tower" (v0.1.1) (Size: 11.19 GiB): Images of category tower

"lsun/classroom"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
})

"lsun/bedroom"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
})

"lsun/bridge"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
})

"lsun/church_outdoor"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
})

"lsun/conference_room"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
})

"lsun/dining_room"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
})

"lsun/kitchen"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
})

"lsun/living_room"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
})

"lsun/restaurant"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
})

"lsun/tower"

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
})

Statistics

Split Examples
ALL 708,564
TRAIN 708,264
VALIDATION 300

Urls

Supervised keys (for as_supervised=True)

None

Citation

@article{journals/corr/YuZSSX15,
  added-at = {2018-08-13T00:00:00.000+0200},
  author = {Yu, Fisher and Zhang, Yinda and Song, Shuran and Seff, Ari and Xiao, Jianxiong},
  biburl = {https://www.bibsonomy.org/bibtex/2446d4ffb99a5d7d2ab6e5417a12e195f/dblp},
  ee = {http://arxiv.org/abs/1506.03365},
  interhash = {3e9306c4ce2ead125f3b2ab0e25adc85},
  intrahash = {446d4ffb99a5d7d2ab6e5417a12e195f},
  journal = {CoRR},
  keywords = {dblp},
  timestamp = {2018-08-14T15:08:59.000+0200},
  title = {LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop.},
  url = {http://dblp.uni-trier.de/db/journals/corr/corr1506.html#YuZSSX15},
  volume = {abs/1506.03365},
  year = 2015
}

"mnist"

The MNIST database of handwritten digits.

Features

FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

Statistics

Split Examples
ALL 70,000
TRAIN 60,000
TEST 10,000

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@article{lecun2010mnist,
  title={MNIST handwritten digit database},
  author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
  journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
  volume={2},
  year={2010}
}

"omniglot"

Omniglot data set for one-shot learning. This dataset contains 1623 different handwritten characters from 50 different alphabets.

Features

FeaturesDict({
    'alphabet': ClassLabel(shape=(), dtype=tf.int64, num_classes=50),
    'alphabet_char_id': Tensor(shape=(), dtype=tf.int64),
    'image': Image(shape=(105, 105, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=1623),
})

Statistics

Split Examples
ALL 38,300
TRAIN 19,280
TEST 13,180
SMALL2 3,120
SMALL1 2,720

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@article{lake2015human,
  title={Human-level concept learning through probabilistic program induction},
  author={Lake, Brenden M and Salakhutdinov, Ruslan and Tenenbaum, Joshua B},
  journal={Science},
  volume={350},
  number={6266},
  pages={1332--1338},
  year={2015},
  publisher={American Association for the Advancement of Science}
}

"open_images_v4"

Open Images is a dataset of ~9M images that have been annotated with image-level labels and object bounding boxes.

The training set of V4 contains 14.6M bounding boxes for 600 object classes on 1.74M images, making it the largest existing dataset with object location annotations. The boxes have been largely manually drawn by professional annotators to ensure accuracy and consistency. The images are very diverse and often contain complex scenes with several objects (8.4 per image on average). Moreover, the dataset is annotated with image-level labels spanning thousands of classes.

open_images_v4 is configured with tfds.image.open_images.OpenImagesV4Config and has the following configurations predefined (defaults to the first one):

  • "original" (v0.2.0) (Size: 565.11 GiB): Images at their original resolution and quality.

  • "300k" (v0.2.1) (Size: 565.11 GiB): Images have roughly 300,000 pixels, at 72 JPEG quality.

  • "200k" (v0.2.1) (Size: 565.11 GiB): Images have roughly 200,000 pixels, at 72 JPEG quality.

"open_images_v4/original"

FeaturesDict({
    'bobjects': Sequence({'is_group_of': TensorInfo(shape=(None,), dtype=tf.int8), 'is_truncated': TensorInfo(shape=(None,), dtype=tf.int8), 'is_occluded': TensorInfo(shape=(None,), dtype=tf.int8), 'is_depiction': TensorInfo(shape=(None,), dtype=tf.int8), 'bbox': TensorInfo(shape=(None, 4), dtype=tf.float32), 'source': TensorInfo(shape=(None,), dtype=tf.int64), 'is_inside': TensorInfo(shape=(None,), dtype=tf.int8), 'label': TensorInfo(shape=(None,), dtype=tf.int64)}),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
    'objects': Sequence({'source': TensorInfo(shape=(None,), dtype=tf.int64), 'confidence': TensorInfo(shape=(None,), dtype=tf.int32), 'label': TensorInfo(shape=(None,), dtype=tf.int64)}),
    'objects_trainable': Sequence({'source': TensorInfo(shape=(None,), dtype=tf.int64), 'confidence': TensorInfo(shape=(None,), dtype=tf.int32), 'label': TensorInfo(shape=(None,), dtype=tf.int64)}),
})

"open_images_v4/300k"

FeaturesDict({
    'bobjects': Sequence({'is_group_of': TensorInfo(shape=(None,), dtype=tf.int8), 'is_truncated': TensorInfo(shape=(None,), dtype=tf.int8), 'is_occluded': TensorInfo(shape=(None,), dtype=tf.int8), 'is_depiction': TensorInfo(shape=(None,), dtype=tf.int8), 'bbox': TensorInfo(shape=(None, 4), dtype=tf.float32), 'source': TensorInfo(shape=(None,), dtype=tf.int64), 'is_inside': TensorInfo(shape=(None,), dtype=tf.int8), 'label': TensorInfo(shape=(None,), dtype=tf.int64)}),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
    'objects': Sequence({'source': TensorInfo(shape=(None,), dtype=tf.int64), 'confidence': TensorInfo(shape=(None,), dtype=tf.int32), 'label': TensorInfo(shape=(None,), dtype=tf.int64)}),
    'objects_trainable': Sequence({'source': TensorInfo(shape=(None,), dtype=tf.int64), 'confidence': TensorInfo(shape=(None,), dtype=tf.int32), 'label': TensorInfo(shape=(None,), dtype=tf.int64)}),
})

"open_images_v4/200k"

FeaturesDict({
    'bobjects': Sequence({'is_group_of': TensorInfo(shape=(None,), dtype=tf.int8), 'is_truncated': TensorInfo(shape=(None,), dtype=tf.int8), 'is_occluded': TensorInfo(shape=(None,), dtype=tf.int8), 'is_depiction': TensorInfo(shape=(None,), dtype=tf.int8), 'bbox': TensorInfo(shape=(None, 4), dtype=tf.float32), 'source': TensorInfo(shape=(None,), dtype=tf.int64), 'is_inside': TensorInfo(shape=(None,), dtype=tf.int8), 'label': TensorInfo(shape=(None,), dtype=tf.int64)}),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
    'objects': Sequence({'source': TensorInfo(shape=(None,), dtype=tf.int64), 'confidence': TensorInfo(shape=(None,), dtype=tf.int32), 'label': TensorInfo(shape=(None,), dtype=tf.int64)}),
    'objects_trainable': Sequence({'source': TensorInfo(shape=(None,), dtype=tf.int64), 'confidence': TensorInfo(shape=(None,), dtype=tf.int32), 'label': TensorInfo(shape=(None,), dtype=tf.int64)}),
})
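
Note the two annotation levels: 'bobjects' carries box-level annotations, while 'objects' and 'objects_trainable' carry image-level labels with verification confidences. A hedged sketch of reading both (mind the very large download size listed above):

import tensorflow_datasets as tfds

ds = tfds.load("open_images_v4", split="validation")
for ex in tfds.as_numpy(ds.take(1)):
    print(ex["bobjects"]["bbox"].shape)     # (num_boxes, 4)
    print(ex["objects"]["label"][:5])       # image-level label ids
    print(ex["objects"]["confidence"][:5])  # annotation confidences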

Statistics

Split Examples
ALL 1,910,098
TRAIN 1,743,042
TEST 125,436
VALIDATION 41,620

Urls

Supervised keys (for as_supervised=True)

None

Citation

@article{OpenImages,
  author = {Alina Kuznetsova and
            Hassan Rom and
            Neil Alldrin and
            Jasper Uijlings and
            Ivan Krasin and
            Jordi Pont-Tuset and
            Shahab Kamali and
            Stefan Popov and
            Matteo Malloci and
            Tom Duerig and
            Vittorio Ferrari},
  title = {The Open Images Dataset V4: Unified image classification,
           object detection, and visual relationship detection at scale},
  year = {2018},
  journal = {arXiv:1811.00982}
}
@article{OpenImages2,
  author = {Krasin, Ivan and
            Duerig, Tom and
            Alldrin, Neil and
            Ferrari, Vittorio
            and Abu-El-Haija, Sami and
            Kuznetsova, Alina and
            Rom, Hassan and
            Uijlings, Jasper and
            Popov, Stefan and
            Kamali, Shahab and
            Malloci, Matteo and
            Pont-Tuset, Jordi and
            Veit, Andreas and
            Belongie, Serge and
            Gomes, Victor and
            Gupta, Abhinav and
            Sun, Chen and
            Chechik, Gal and
            Cai, David and
            Feng, Zheyun and
            Narayanan, Dhyanesh and
            Murphy, Kevin},
  title = {OpenImages: A public dataset for large-scale multi-label and
           multi-class image classification.},
  journal = {Dataset available from
             https://storage.googleapis.com/openimages/web/index.html},
  year={2017}
}

"oxford_flowers102"

The Oxford Flowers 102 dataset consists of 102 flower categories commonly occurring in the United Kingdom. Each class consists of between 40 and 258 images. The images have large scale, pose and light variations. In addition, there are categories that have large variations within the category and several very similar categories.

The dataset is divided into a training set, a validation set and a test set. The training set and validation set each consist of 10 images per class (totalling 1,020 images each). The test set consists of the remaining 6,149 images (minimum 20 per class).
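
For example, these split sizes can be read back from the DatasetInfo at load time (a minimal sketch, assuming the dataset has already been downloaded and prepared):

import tensorflow_datasets as tfds

# Load the dataset together with its metadata.
data, info = tfds.load("oxford_flowers102", with_info=True)

# The split sizes match the statistics listed below.
assert info.features['label'].num_classes == 102
assert info.splits['train'].num_examples == 1020
assert info.splits['validation'].num_examples == 1020
assert info.splits['test'].num_examples == 6149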

Features

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=102),
})

Statistics

Split Examples
ALL 8,189
TEST 6,149
VALIDATION 1,020
TRAIN 1,020

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@InProceedings{Nilsback08,
   author = "Nilsback, M-E. and Zisserman, A.",
   title = "Automated Flower Classification over a Large Number of Classes",
   booktitle = "Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing",
   year = "2008",
   month = "Dec"
}

"oxford_iiit_pet"

The Oxford-IIIT Pet dataset is a 37-category pet image dataset with roughly 200 images for each class. The images have large variations in scale, pose and lighting. All images have an associated ground-truth annotation of breed.

Features

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=37),
})

Statistics

Split Examples
ALL 7,349
TRAIN 3,680
TEST 3,669

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@InProceedings{parkhi12a,
  author       = "Parkhi, O. M. and Vedaldi, A. and Zisserman, A. and Jawahar, C.~V.",
  title        = "Cats and Dogs",
  booktitle    = "IEEE Conference on Computer Vision and Pattern Recognition",
  year         = "2012",
}

"quickdraw_bitmap"

The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game Quick, Draw!. The bitmap dataset contains these drawings converted from vector format into 28x28 grayscale images.
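
A minimal sketch of loading the bitmaps and checking their shape (this assumes the fairly large download has already completed):

import tensorflow_datasets as tfds

data, info = tfds.load("quickdraw_bitmap", with_info=True)

# Every drawing is rasterized to a 28x28 single-channel image.
assert info.features['image'].shape == (28, 28, 1)
assert info.features['label'].num_classes == 345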

Features

FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=345),
})

Statistics

Split Examples
TRAIN 50,426,266
ALL 50,426,266

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@article{DBLP:journals/corr/HaE17,
  author    = {David Ha and
               Douglas Eck},
  title     = {A Neural Representation of Sketch Drawings},
  journal   = {CoRR},
  volume    = {abs/1704.03477},
  year      = {2017},
  url       = {http://arxiv.org/abs/1704.03477},
  archivePrefix = {arXiv},
  eprint    = {1704.03477},
  timestamp = {Mon, 13 Aug 2018 16:48:30 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/HaE17},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

"resisc45"

RESISC45 dataset is a publicly available benchmark for Remote Sensing Image Scene Classification (RESISC), created by Northwestern Polytechnical University (NWPU). This dataset contains 31,500 images, covering 45 scene classes with 700 images in each class.

Features

FeaturesDict({
    'filename': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(256, 256, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=45),
})

Statistics

None computed

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@article{Cheng_2017,
   title={Remote Sensing Image Scene Classification: Benchmark and State of the Art},
   volume={105},
   ISSN={1558-2256},
   url={http://dx.doi.org/10.1109/JPROC.2017.2675998},
   DOI={10.1109/jproc.2017.2675998},
   number={10},
   journal={Proceedings of the IEEE},
   publisher={Institute of Electrical and Electronics Engineers (IEEE)},
   author={Cheng, Gong and Han, Junwei and Lu, Xiaoqiang},
   year={2017},
   month={Oct},
   pages={1865-1883}
}

"rock_paper_scissors"

Images of hands playing the rock, paper, scissors game.

Features

FeaturesDict({
    'image': Image(shape=(300, 300, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=3),
})

Statistics

Split Examples
ALL 2,892
TRAIN 2,520
TEST 372

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@ONLINE {rps,
author = "Laurence Moroney",
title = "Rock, Paper, Scissors Dataset",
month = "feb",
year = "2019",
url = "http://laurencemoroney.com/rock-paper-scissors-dataset"
}

"shapes3d"

3dshapes is a dataset of 3D shapes procedurally generated from 6 ground truth independent latent factors. These factors are floor colour, wall colour, object colour, scale, shape and orientation.

All possible combinations of these latents are present exactly once, generating N = 480000 total images.

Latent factor values

  • floor hue: 10 values linearly spaced in [0, 1]
  • wall hue: 10 values linearly spaced in [0, 1]
  • object hue: 10 values linearly spaced in [0, 1]
  • scale: 8 values linearly spaced in [0, 1]
  • shape: 4 values in [0, 1, 2, 3]
  • orientation: 15 values linearly spaced in [-30, 30]

We varied one latent at a time (starting from orientation, then shape, etc), and sequentially stored the images in fixed order in the images array. The corresponding values of the factors are stored in the same order in the labels array.
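
Each example therefore carries both the discretized factor indices (the label_* features) and the raw factor values (the value_* features). A minimal sketch of reading them back:

import tensorflow_datasets as tfds

ds = tfds.load("shapes3d", split="train")
for example in tfds.as_numpy(ds.take(1)):
    # Discretized index of each factor, e.g. one of 10 floor hues ...
    print(example['label_floor_hue'], example['label_shape'])
    # ... and the corresponding raw value, e.g. orientation in [-30, 30].
    print(example['value_floor_hue'], example['value_orientation'])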

Features

FeaturesDict({
    'image': Image(shape=(64, 64, 3), dtype=tf.uint8),
    'label_floor_hue': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    'label_object_hue': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    'label_orientation': ClassLabel(shape=(), dtype=tf.int64, num_classes=15),
    'label_scale': ClassLabel(shape=(), dtype=tf.int64, num_classes=8),
    'label_shape': ClassLabel(shape=(), dtype=tf.int64, num_classes=4),
    'label_wall_hue': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    'value_floor_hue': Tensor(shape=[], dtype=tf.float32),
    'value_object_hue': Tensor(shape=[], dtype=tf.float32),
    'value_orientation': Tensor(shape=[], dtype=tf.float32),
    'value_scale': Tensor(shape=[], dtype=tf.float32),
    'value_shape': Tensor(shape=[], dtype=tf.float32),
    'value_wall_hue': Tensor(shape=[], dtype=tf.float32),
})

Statistics

Split Examples
TRAIN 480,000
ALL 480,000

Urls

Supervised keys (for as_supervised=True)

None

Citation

@misc{3dshapes18,
  title={3D Shapes Dataset},
  author={Burgess, Chris and Kim, Hyunjik},
  howpublished={https://github.com/deepmind/3dshapes-dataset/},
  year={2018}
}

"smallnorb"


This database is intended for experiments in 3D object recognition from shape. It contains images of 50 toys belonging to 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. The objects were imaged by two cameras under 6 lighting conditions, 9 elevations (30 to 70 degrees every 5 degrees), and 18 azimuths (0 to 340 every 20 degrees).

The training set is composed of 5 instances of each category (instances 4, 6, 7, 8 and 9), and the test set of the remaining 5 instances (instances 0, 1, 2, 3, and 5).
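
Each example keeps both camera views ('image' and 'image2') alongside the pose annotations; with as_supervised=True only the ('image', 'label_category') pair is returned. A minimal sketch:

import tensorflow_datasets as tfds

ds = tfds.load("smallnorb", split="train")
for example in tfds.as_numpy(ds.take(1)):
    # The stereo pair: one 96x96 grayscale image per camera.
    assert example['image'].shape == (96, 96, 1)
    assert example['image2'].shape == (96, 96, 1)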

Features

FeaturesDict({
    'image': Image(shape=(96, 96, 1), dtype=tf.uint8),
    'image2': Image(shape=(96, 96, 1), dtype=tf.uint8),
    'instance': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    'label_azimuth': ClassLabel(shape=(), dtype=tf.int64, num_classes=18),
    'label_category': ClassLabel(shape=(), dtype=tf.int64, num_classes=5),
    'label_elevation': ClassLabel(shape=(), dtype=tf.int64, num_classes=9),
    'label_lighting': ClassLabel(shape=(), dtype=tf.int64, num_classes=6),
})

Statistics

Split Examples
ALL 48,600
TRAIN 24,300
TEST 24,300

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label_category')

Citation

@article{LeCun2004LearningMF,
  title={Learning methods for generic object recognition with invariance to pose and lighting},
  author={Yann LeCun and Fu Jie Huang and L{\'e}on Bottou},
  journal={Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition},
  year={2004},
  volume={2},
  pages={II-104 Vol.2}
}

"so2sat"

So2Sat LCZ42 is a dataset consisting of co-registered synthetic aperture radar and multispectral optical image patches acquired by the Sentinel-1 and Sentinel-2 remote sensing satellites, and the corresponding local climate zone (LCZ) label. The dataset is distributed over 42 cities across different continents and cultural regions of the world.

The full dataset (all) consists of 8 Sentinel-1 and 10 Sentinel-2 channels. Alternatively, one can select the rgb subset, which contains only the optical frequency bands of Sentinel-2, rescaled and encoded as JPEG.

Dataset URL: http://doi.org/10.14459/2018MP1454690
License: http://creativecommons.org/licenses/by/4.0

so2sat is configured with tfds.image.so2sat.So2satConfig and has the following configurations predefined (defaults to the first one):

  • "rgb" (v0.0.1) (Size: ?? GiB): Sentinel-2 RGB channels

  • "all" (v0.0.1) (Size: ?? GiB): 8 Sentinel-1 and 10 Sentinel-2 channels

"so2sat/rgb"

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=17),
    'sample_id': Tensor(shape=(), dtype=tf.int64),
})

"so2sat/all"

FeaturesDict({
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=17),
    'sample_id': Tensor(shape=(), dtype=tf.int64),
    'sentinel1': Tensor(shape=[32, 32, 8], dtype=tf.float32),
    'sentinel2': Tensor(shape=[32, 32, 10], dtype=tf.float32),
})

Statistics

None computed

Urls

Supervised keys (for as_supervised=True)

None

"sun397"

The database contains 108,753 images of 397 categories, used in the Scene UNderstanding (SUN) benchmark. The number of images varies across categories, but there are at least 100 images per category.

The official release of the dataset defines 10 overlapping partitions of the dataset, with 50 testing and training images in each. Since TFDS requires the splits not to overlap, we provide a single split for the entire dataset (named "full"). All images are converted to RGB.
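
A minimal sketch of requesting that single split by name (assuming it is exposed under the lowercase form of the name shown in the statistics):

import tensorflow_datasets as tfds

# The entire dataset lives in one split named "full".
ds, info = tfds.load("sun397", split="full", with_info=True)
assert info.splits['full'].num_examples == 108753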

Features

FeaturesDict({
    'file_name': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=397),
})

Statistics

Split Examples
FULL 108,753
ALL 108,753

Urls

Supervised keys (for as_supervised=True)

None

Citation

@INPROCEEDINGS{Xiao:2010,
author={J. {Xiao} and J. {Hays} and K. A. {Ehinger} and A. {Oliva} and A. {Torralba}},
booktitle={2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition},
title={SUN database: Large-scale scene recognition from abbey to zoo},
year={2010},
volume={},
number={},
pages={3485-3492},
keywords={computer vision;human factors;image classification;object recognition;visual databases;SUN database;large-scale scene recognition;abbey;zoo;scene categorization;computer vision;scene understanding research;scene category;object categorization;scene understanding database;state-of-the-art algorithms;human scene classification performance;finer-grained scene representation;Sun;Large-scale systems;Layout;Humans;Image databases;Computer vision;Anthropometry;Bridges;Legged locomotion;Spatial databases}, 
doi={10.1109/CVPR.2010.5539970},
ISSN={1063-6919},
month={June},}

"svhn_cropped"

The Street View House Numbers (SVHN) Dataset is a digit recognition dataset of over 600,000 digit images obtained from real-world data. Images are cropped to 32x32.

Features

FeaturesDict({
    'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

Statistics

Split Examples
ALL 630,420
EXTRA 531,131
TRAIN 73,257
TEST 26,032

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@inproceedings{Netzer2011,
author = {Netzer, Yuval and Wang, Tao and Coates, Adam and Bissacco, Alessandro and Wu, Bo and Ng, Andrew Y},
booktitle = {Advances in Neural Information Processing Systems ({NIPS})},
title = {Reading Digits in Natural Images with Unsupervised Feature Learning},
year = {2011}
}

"tf_flowers"

A large set of images of flowers.

Features

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=5),
})

Statistics

Split Examples
TRAIN 3,670
ALL 3,670

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@ONLINE {tfflowers,
author = "The TensorFlow Team",
title = "Flowers",
month = "jan",
year = "2019",
url = "http://download.tensorflow.org/example_images/flower_photos.tgz" }

"uc_merced"

UC Merced is a 21-class land-use remote sensing image dataset, with 100 images per class. The images were manually extracted from large images in the USGS National Map Urban Area Imagery collection for various urban areas around the United States. The pixel resolution of this public-domain imagery is 0.3 m. Each image measures 256x256 pixels.

Features

FeaturesDict({
    'filename': Text(shape=(), dtype=tf.string, encoder=None),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=21),
})

Statistics

Split Examples
TRAIN 2,100
ALL 2,100

Urls

Supervised keys (for as_supervised=True)

(u'image', u'label')

Citation

@InProceedings{Yang10,
   author = "Yang, Yi and Newsam, Shawn",
   title = "Bag-Of-Visual-Words and Spatial Extensions for Land-Use Classification",
   booktitle = "ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS)",
   year = "2010",
}

"voc2007"

This dataset contains the data from the PASCAL Visual Object Classes Challenge 2007, a.k.a. VOC2007, corresponding to the Classification and Detection competitions. A total of 9,963 images are included in this dataset, where each image contains a set of objects, out of 20 different classes, making a total of 24,640 annotated objects. In the Classification competition, the goal is to predict the set of labels contained in the image, while in the Detection competition the goal is to predict the bounding box and label of each individual object.

Features

FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string, encoder=None),
    'labels': Sequence(shape=(None,), dtype=tf.int64, feature=ClassLabel(shape=(), dtype=tf.int64, num_classes=20)),
    'labels_no_difficult': Sequence(shape=(None,), dtype=tf.int64, feature=ClassLabel(shape=(), dtype=tf.int64, num_classes=20)),
    'objects': Sequence({'is_truncated': TensorInfo(shape=(None,), dtype=tf.bool), 'is_difficult': TensorInfo(shape=(None,), dtype=tf.bool), 'label': TensorInfo(shape=(None,), dtype=tf.int64), 'bbox': TensorInfo(shape=(None, 4), dtype=tf.float32), 'pose': TensorInfo(shape=(None,), dtype=tf.int64)}),
})

Statistics

Split Examples
ALL 9,963
TEST 4,952
VALIDATION 2,510
TRAIN 2,501

Urls

Supervised keys (for as_supervised=True)

None

Citation

@misc{pascal-voc-2007,
  author = "Everingham, M. and Van~Gool, L. and Williams, C. K. I. and Winn, J. and Zisserman, A.",
  title = "The {PASCAL} {V}isual {O}bject {C}lasses {C}hallenge 2007 {(VOC2007)} {R}esults",
  howpublished = "http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html"}

structured

"higgs"

The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes. There is an interest in using deep learning methods to obviate the need for physicists to manually develop such features. Benchmark results using Bayesian Decision Trees from a standard physics package and 5-layer neural networks are presented in the original paper.
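
For example, the seven derived high-level features can be separated from the 21 low-level kinematic ones with an ordinary tf.data transformation (a sketch; the feature names are taken from the feature dictionary below):

import tensorflow_datasets as tfds

HIGH_LEVEL = ['m_bb', 'm_jj', 'm_jjj', 'm_jlv', 'm_lv', 'm_wbb', 'm_wwbb']

ds = tfds.load("higgs", split="train")
# Keep only the class label and the seven physicist-derived features.
ds = ds.map(lambda ex: {k: ex[k] for k in ['class_label'] + HIGH_LEVEL})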

Features

FeaturesDict({
    'class_label': Tensor(shape=(), dtype=tf.float32),
    'jet_1_b-tag': Tensor(shape=(), dtype=tf.float64),
    'jet_1_eta': Tensor(shape=(), dtype=tf.float64),
    'jet_1_phi': Tensor(shape=(), dtype=tf.float64),
    'jet_1_pt': Tensor(shape=(), dtype=tf.float64),
    'jet_2_b-tag': Tensor(shape=(), dtype=tf.float64),
    'jet_2_eta': Tensor(shape=(), dtype=tf.float64),
    'jet_2_phi': Tensor(shape=(), dtype=tf.float64),
    'jet_2_pt': Tensor(shape=(), dtype=tf.float64),
    'jet_3_b-tag': Tensor(shape=(), dtype=tf.float64),
    'jet_3_eta': Tensor(shape=(), dtype=tf.float64),
    'jet_3_phi': Tensor(shape=(), dtype=tf.float64),
    'jet_3_pt': Tensor(shape=(), dtype=tf.float64),
    'jet_4_b-tag': Tensor(shape=(), dtype=tf.float64),
    'jet_4_eta': Tensor(shape=(), dtype=tf.float64),
    'jet_4_phi': Tensor(shape=(), dtype=tf.float64),
    'jet_4_pt': Tensor(shape=(), dtype=tf.float64),
    'lepton_eta': Tensor(shape=(), dtype=tf.float64),
    'lepton_pT': Tensor(shape=(), dtype=tf.float64),
    'lepton_phi': Tensor(shape=(), dtype=tf.float64),
    'm_bb': Tensor(shape=(), dtype=tf.float64),
    'm_jj': Tensor(shape=(), dtype=tf.float64),
    'm_jjj': Tensor(shape=(), dtype=tf.float64),
    'm_jlv': Tensor(shape=(), dtype=tf.float64),
    'm_lv': Tensor(shape=(), dtype=tf.float64),
    'm_wbb': Tensor(shape=(), dtype=tf.float64),
    'm_wwbb': Tensor(shape=(), dtype=tf.float64),
    'missing_energy_magnitude': Tensor(shape=(), dtype=tf.float64),
    'missing_energy_phi': Tensor(shape=(), dtype=tf.float64),
})

Statistics

Split Examples
TRAIN 11,000,000
ALL 11,000,000

Urls

Supervised keys (for as_supervised=True)

None

Citation

@article{Baldi:2014kfa,
      author         = "Baldi, Pierre and Sadowski, Peter and Whiteson, Daniel",
      title          = "{Searching for Exotic Particles in High-Energy Physics
                        with Deep Learning}",
      journal        = "Nature Commun.",
      volume         = "5",
      year           = "2014",
      pages          = "4308",
      doi            = "10.1038/ncomms5308",
      eprint         = "1402.4735",
      archivePrefix  = "arXiv",
      primaryClass   = "hep-ph",
      SLACcitation   = "%%CITATION = ARXIV:1402.4735;%%"
}

"iris"

This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

Features

FeaturesDict({
    'features': Tensor(shape=(4,), dtype=tf.float32),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=3),
})

Statistics

Split Examples
TRAIN 150
ALL 150

Urls

Supervised keys (for as_supervised=True)

(u'features', u'label')

Citation

@misc{Dua:2019,
author = "Dua, Dheeru and Graff, Casey",
year = "2017",
title = "{UCI} Machine Learning Repository",
url = "http://archive.ics.uci.edu/ml",
institution = "University of California, Irvine, School of Information and Computer Sciences"
}

"titanic"

Dataset describing the survival status of individual passengers on the Titanic. Missing values, represented as '?' in the original dataset, are replaced here with -1 for float and int features and with 'Unknown' for string features.
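
Given that encoding, the sentinel values can be filtered out with tf.data (a minimal sketch, assuming the dataset is prepared):

import tensorflow_datasets as tfds

ds = tfds.load("titanic", split="train")
# Drop passengers whose age was missing in the source data (encoded as -1).
ds = ds.filter(lambda ex: ex['features']['age'] >= 0)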

Features

FeaturesDict({
    'features': FeaturesDict({
        'age': Tensor(shape=(), dtype=tf.float32),
        'boat': Tensor(shape=(), dtype=tf.string),
        'body': Tensor(shape=(), dtype=tf.int32),
        'cabin': Tensor(shape=(), dtype=tf.string),
        'embarked': ClassLabel(shape=(), dtype=tf.int64, num_classes=4),
        'fare': Tensor(shape=(), dtype=tf.float32),
        'home.dest': Tensor(shape=(), dtype=tf.string),
        'name': Tensor(shape=(), dtype=tf.string),
        'parch': Tensor(shape=(), dtype=tf.int32),
        'pclass': ClassLabel(shape=(), dtype=tf.int64, num_classes=3),
        'sex': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'sibsp': Tensor(shape=(), dtype=tf.int32),
        'ticket': Tensor(shape=(), dtype=tf.string),
    }),
    'survived': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
})

Statistics

Split Examples
TRAIN 1,309
ALL 1,309

Urls

Supervised keys (for as_supervised=True)

(u'features', u'survived')

Citation

@ONLINE {titanic,
author = "Frank E. Harrell Jr., Thomas Cason",
title  = "Titanic dataset",
month  = "oct",
year   = "2017",
url    = "https://www.openml.org/d/40945"
}

text

"cnn_dailymail"

CNN/DailyMail non-anonymized summarization dataset.

There are two features:

  • article: text of news article, used as the document to be summarized

  • highlights: joined text of highlights with <s> and </s> around each highlight, which is the target summary

cnn_dailymail is configured with tfds.text.cnn_dailymail.CnnDailymailConfig and has the following configurations predefined (defaults to the first one):

  • "plain_text" (v0.0.2) (Size: 558.32 MiB): Plain text

  • "bytes" (v0.0.2) (Size: 558.32 MiB): Uses byte-level text encoding with tfds.features.text.ByteTextEncoder

  • "subwords32k" (v0.0.2) (Size: 558.32 MiB): Uses tfds.features.text.SubwordTextEncoder with 32k vocab size

"cnn_dailymail/plain_text"

FeaturesDict({
    'article': Text(shape=(), dtype=tf.string, encoder=None),
    'highlights': Text(shape=(), dtype=tf.string, encoder=None),
})

"cnn_dailymail/bytes"

FeaturesDict({
    'article': Text(shape=(None,), dtype=tf.int64, encoder=<ByteTextEncoder vocab_size=257>),
    'highlights': Text(shape=(None,), dtype=tf.int64, encoder=<ByteTextEncoder vocab_size=257>),
})

"cnn_dailymail/subwords32k"

FeaturesDict({
    'article': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=32915>),
    'highlights': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=32915>),
})

Statistics

Split Examples
ALL 311,971
TRAIN 287,113
VALIDATION 13,368
TEST 11,490

Urls

Supervised keys (for as_supervised=True)

(u'article', u'highlights')

Citation

@article{DBLP:journals/corr/SeeLM17,
  author    = {Abigail See and
               Peter J. Liu and
               Christopher D. Manning},
  title     = {Get To The Point: Summarization with Pointer-Generator Networks},
  journal   = {CoRR},
  volume    = {abs/1704.04368},
  year      = {2017},
  url       = {http://arxiv.org/abs/1704.04368},
  archivePrefix = {arXiv},
  eprint    = {1704.04368},
  timestamp = {Mon, 13 Aug 2018 16:46:08 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/SeeLM17},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

@inproceedings{hermann2015teaching,
  title={Teaching machines to read and comprehend},
  author={Hermann, Karl Moritz and Kocisky, Tomas and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil},
  booktitle={Advances in neural information processing systems},
  pages={1693--1701},
  year={2015}
}

"glue"

The Winograd Schema Challenge (Levesque et al., 2011) is a reading comprehension task in which a system must read a sentence with a pronoun and select the referent of that pronoun from a list of choices. The examples are manually constructed to foil simple statistical methods: Each one is contingent on contextual information provided by a single word or phrase in the sentence. To convert the problem into sentence pair classification, we construct sentence pairs by replacing the ambiguous pronoun with each possible referent. The task is to predict if the sentence with the pronoun substituted is entailed by the original sentence. We use a small evaluation set consisting of new examples derived from fiction books that was shared privately by the authors of the original corpus. While the included training set is balanced between two classes, the test set is imbalanced between them (65% not entailment). Also, due to a data quirk, the development set is adversarial: hypotheses are sometimes shared between training and development examples, so if a model memorizes the training examples, it will predict the wrong label on the corresponding development set example. As with QNLI, each example is evaluated separately, so there is not a systematic correspondence between a model's score on this task and its score on the unconverted original task. We call the converted dataset WNLI (Winograd NLI).

glue is configured with tfds.text.glue.GlueConfig and has the following configurations predefined (defaults to the first one):

  • "cola" (v0.0.2) (Size: 368.14 KiB): The Corpus of Linguistic Acceptability consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is a grammatical English sentence.

  • "sst2" (v0.0.2) (Size: 7.09 MiB): The Stanford Sentiment Treebank consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. We use the two-way (positive/negative) class split, and use only sentence-level labels.

  • "mrpc" (v0.0.2) (Size: 1.43 MiB): The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.

  • "qqp" (v0.0.2) (Size: 57.73 MiB): The Quora Question Pairs2 dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent.

  • "stsb" (v0.0.2) (Size: 784.05 KiB): The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is human-annotated with a similarity score from 1 to 5.

  • "mnli" (v0.0.2) (Size: 298.29 MiB): The Multi-Genre Natural Language Inference Corpusn is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral). The premise sentences are gathered from ten different sources, including transcribed speech, fiction, and government reports. We use the standard test set, for which we obtained private labels from the authors, and evaluate on both the matched (in-domain) and mismatched (cross-domain) section. We also use and recommend the SNLI corpus as 550k examples of auxiliary training data.

  • "qnli" (v0.0.2) (Size: 10.14 MiB): The Stanford Question Answering Dataset is a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator). We convert the task into sentence pair classification by forming a pair between each question and each sentence in the corresponding context, and filtering out pairs with low lexical overlap between the question and the context sentence. The task is to determine whether the context sentence contains the answer to the question. This modified version of the original task removes the requirement that the model select the exact answer, but also removes the simplifying assumptions that the answer is always present in the input and that lexical overlap is a reliable cue.

  • "rte" (v0.0.2) (Size: 680.81 KiB): The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges. We combine the data from RTE1 (Dagan et al., 2006), RTE2 (Bar Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009).4 Examples are constructed based on news and Wikipedia text. We convert all datasets to a two-class split, where for three-class datasets we collapse neutral and contradiction into not entailment, for consistency.

  • "wnli" (v0.0.2) (Size: 28.32 KiB): The Winograd Schema Challenge (Levesque et al., 2011) is a reading comprehension task in which a system must read a sentence with a pronoun and select the referent of that pronoun from a list of choices. The examples are manually constructed to foil simple statistical methods: Each one is contingent on contextual information provided by a single word or phrase in the sentence. To convert the problem into sentence pair classification, we construct sentence pairs by replacing the ambiguous pronoun with each possible referent. The task is to predict if the sentence with the pronoun substituted is entailed by the original sentence. We use a small evaluation set consisting of new examples derived from fiction books that was shared privately by the authors of the original corpus. While the included training set is balanced between two classes, the test set is imbalanced between them (65% not entailment). Also, due to a data quirk, the development set is adversarial: hypotheses are sometimes shared between training and development examples, so if a model memorizes the training examples, they will predict the wrong label on corresponding development set example. As with QNLI, each example is evaluated separately, so there is not a systematic correspondence between a model's score on this task and its score on the unconverted original task. We call converted dataset WNLI (Winograd NLI).

"glue/cola"

FeaturesDict({
    'idx': Tensor(shape=(), dtype=tf.int32),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'sentence': Text(shape=(), dtype=tf.string, encoder=None),
})

"glue/sst2"

FeaturesDict({
    'idx': Tensor(shape=(), dtype=tf.int32),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'sentence': Text(shape=(), dtype=tf.string, encoder=None),
})

"glue/mrpc"

FeaturesDict({
    'idx': Tensor(shape=(), dtype=tf.int32),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'sentence1': Text(shape=(), dtype=tf.string, encoder=None),
    'sentence2': Text(shape=(), dtype=tf.string, encoder=None),
})

"glue/qqp"

FeaturesDict({
    'idx': Tensor(shape=(), dtype=tf.int32),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'question1': Text(shape=(), dtype=tf.string, encoder=None),
    'question2': Text(shape=(), dtype=tf.string, encoder=None),
})

"glue/stsb"

FeaturesDict({
    'idx': Tensor(shape=(), dtype=tf.int32),
    'label': Tensor(shape=(), dtype=tf.float32),
    'sentence1': Text(shape=(), dtype=tf.string, encoder=None),
    'sentence2': Text(shape=(), dtype=tf.string, encoder=None),
})

"glue/mnli"

FeaturesDict({
    'hypothesis': Text(shape=(), dtype=tf.string, encoder=None),
    'idx': Tensor(shape=(), dtype=tf.int32),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=3),
    'premise': Text(shape=(), dtype=tf.string, encoder=None),
})

"glue/qnli"

FeaturesDict({
    'idx': Tensor(shape=(), dtype=tf.int32),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'question': Text(shape=(), dtype=tf.string, encoder=None),
    'sentence': Text(shape=(), dtype=tf.string, encoder=None),
})

"glue/rte"

FeaturesDict({
    'idx': Tensor(shape=(), dtype=tf.int32),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'sentence1': Text(shape=(), dtype=tf.string, encoder=None),
    'sentence2': Text(shape=(), dtype=tf.string, encoder=None),
})

"glue/wnli"

FeaturesDict({
    'idx': Tensor(shape=(), dtype=tf.int32),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'sentence1': Text(shape=(), dtype=tf.string, encoder=None),
    'sentence2': Text(shape=(), dtype=tf.string, encoder=None),
})

Statistics

Split Examples
ALL 852
TRAIN 635
TEST 146
VALIDATION 71

Urls

Supervised keys (for as_supervised=True)

None

Citation

@inproceedings{levesque2012winograd,
              title={The winograd schema challenge},
              author={Levesque, Hector and Davis, Ernest and Morgenstern, Leora},
              booktitle={Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning},
              year={2012}
            }
@inproceedings{wang2019glue,
  title={ {GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
  note={In the Proceedings of ICLR.},
  year={2019}
}

Note that each GLUE dataset has its own citation. Please see the source for
the correct citation for each contained dataset.

"imdb_reviews"

Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

imdb_reviews is configured with tfds.text.imdb.IMDBReviewsConfig and has the following configurations predefined (defaults to the first one):

  • "plain_text" (v0.1.0) (Size: 80.23 MiB): Plain text

  • "bytes" (v0.1.0) (Size: 80.23 MiB): Uses byte-level text encoding with tfds.features.text.ByteTextEncoder

  • "subwords8k" (v0.1.0) (Size: 80.23 MiB): Uses tfds.features.text.SubwordTextEncoder with 8k vocab size

  • "subwords32k" (v0.1.0) (Size: 80.23 MiB): Uses tfds.features.text.SubwordTextEncoder with 32k vocab size

"imdb_reviews/plain_text"

FeaturesDict({
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'text': Text(shape=(), dtype=tf.string, encoder=None),
})

"imdb_reviews/bytes"

FeaturesDict({
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<ByteTextEncoder vocab_size=257>),
})

"imdb_reviews/subwords8k"

FeaturesDict({
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8185>),
})

"imdb_reviews/subwords32k"

FeaturesDict({
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=32650>),
})

Statistics

Split Examples
ALL 100,000
UNSUPERVISED 50,000
TRAIN 25,000
TEST 25,000

Urls

Supervised keys (for as_supervised=True)

(u'text', u'label')

Citation

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

"lm1b"

A benchmark corpus to be used for measuring progress in statistical language modeling. This has almost one billion words in the training data.

lm1b is configured with tfds.text.lm1b.Lm1bConfig and has the following configurations predefined (defaults to the first one):

  • "plain_text" (v0.0.1) (Size: 1.67 GiB): Plain text

  • "bytes" (v0.0.1) (Size: 1.67 GiB): Uses byte-level text encoding with tfds.features.text.ByteTextEncoder

  • "subwords8k" (v0.0.2) (Size: 1.67 GiB): Uses tfds.features.text.SubwordTextEncoder with 8k vocab size

  • "subwords32k" (v0.0.2) (Size: 1.67 GiB): Uses tfds.features.text.SubwordTextEncoder with 32k vocab size

"lm1b/plain_text"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
})

"lm1b/bytes"

FeaturesDict({
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<ByteTextEncoder vocab_size=257>),
})

"lm1b/subwords8k"

FeaturesDict({
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8189>),
})

"lm1b/subwords32k"

FeaturesDict({
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=32711>),
})

Statistics

Split Examples
ALL 30,607,716
TRAIN 30,301,028
TEST 306,688

Urls

Supervised keys (for as_supervised=True)

(u'text', u'text')

Citation

@article{DBLP:journals/corr/ChelbaMSGBK13,
  author    = {Ciprian Chelba and
               Tomas Mikolov and
               Mike Schuster and
               Qi Ge and
               Thorsten Brants and
               Phillipp Koehn},
  title     = {One Billion Word Benchmark for Measuring Progress in Statistical Language
               Modeling},
  journal   = {CoRR},
  volume    = {abs/1312.3005},
  year      = {2013},
  url       = {http://arxiv.org/abs/1312.3005},
  archivePrefix = {arXiv},
  eprint    = {1312.3005},
  timestamp = {Mon, 13 Aug 2018 16:46:16 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/ChelbaMSGBK13},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

"multi_nli"

The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task of the RepEval 2017 Workshop at EMNLP in Copenhagen.

multi_nli is configured with tfds.text.multi_nli.MultiNLIConfig and has the following configurations predefined (defaults to the first one):

  • "plain_text" (v0.0.2) (Size: 216.34 MiB): Plain text

"multi_nli/plain_text"

FeaturesDict({
    'hypothesis': Text(shape=(), dtype=tf.string, encoder=None),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=3),
    'premise': Text(shape=(), dtype=tf.string, encoder=None),
})

Statistics

Split Examples
ALL 412,349
TRAIN 392,702
VALIDATION_MISMATCHED 9,832
VALIDATION_MATCHED 9,815
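
The matched and mismatched validation sets are exposed as separate splits (a sketch, assuming the split names follow the lowercase form of the statistics above):

import tensorflow_datasets as tfds

matched = tfds.load("multi_nli", split="validation_matched")
mismatched = tfds.load("multi_nli", split="validation_mismatched")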

Urls

Supervised keys (for as_supervised=True)

None

Citation

@InProceedings{N18-1101,
  author = "Williams, Adina
            and Nangia, Nikita
            and Bowman, Samuel",
  title = "A Broad-Coverage Challenge Corpus for
           Sentence Understanding through Inference",
  booktitle = "Proceedings of the 2018 Conference of
               the North American Chapter of the
               Association for Computational Linguistics:
               Human Language Technologies, Volume 1 (Long
               Papers)",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  pages = "1112--1122",
  location = "New Orleans, Louisiana",
  url = "http://aclweb.org/anthology/N18-1101"
}

"squad"

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
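
Each example bundles the context and question with a variable-length Sequence of reference answers (a minimal sketch):

import tensorflow_datasets as tfds

ds = tfds.load("squad", split="validation")
for example in tfds.as_numpy(ds.take(1)):
    print(example['question'])
    # 'answers' is a Sequence: parallel arrays of texts and start offsets.
    print(example['answers']['text'], example['answers']['answer_start'])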

squad is configured with tfds.text.squad.SquadConfig and has the following configurations predefined (defaults to the first one):

  • "plain_text" (v0.1.0) (Size: 33.51 MiB): Plain text

"squad/plain_text"

FeaturesDict({
    'answers': Sequence({'text': TensorInfo(shape=(None,), dtype=tf.string), 'answer_start': TensorInfo(shape=(None,), dtype=tf.int32)}),
    'context': Text(shape=(), dtype=tf.string, encoder=None),
    'id': Tensor(shape=(), dtype=tf.string),
    'question': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

Statistics

Split Examples
ALL 98,169
TRAIN 87,599
VALIDATION 10,570

Urls

Supervised keys (for as_supervised=True)

None

Citation

@article{2016arXiv160605250R,
       author = { {Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev},
                 Konstantin and {Liang}, Percy},
        title = "{SQuAD: 100,000+ Questions for Machine Comprehension of Text}",
      journal = {arXiv e-prints},
         year = 2016,
          eid = {arXiv:1606.05250},
        pages = {arXiv:1606.05250},
archivePrefix = {arXiv},
       eprint = {1606.05250},
}

"super_glue"

The Winograd Schema Challenge (WSC, Levesque et al., 2012) is a reading comprehension task in which a system must read a sentence with a pronoun and select the referent of that pronoun from a list of choices. Given the difficulty of this task and the headroom still left, we have included WSC in SuperGLUE and recast the dataset into its coreference form. The task is cast as a binary classification problem, as opposed to N-multiple choice, in order to isolate the model's ability to understand the coreference links within a sentence as opposed to various other strategies that may come into play in multiple choice conditions. With that in mind, we create a split with 65% negative majority class in the validation set, reflecting the distribution of the hidden test set, and 52% negative class in the training set. The training and validation examples are drawn from the original Winograd Schema dataset (Levesque et al., 2012), as well as those distributed by the affiliated organization Commonsense Reasoning. The test examples are derived from fiction books and have been shared with us by the authors of the original dataset. Previously, a version of WSC recast as NLI was included in GLUE, known as WNLI. No substantial progress was made on WNLI, with many submissions opting to submit only majority class predictions. WNLI was made especially difficult due to an adversarial train/dev split: Premise sentences that appeared in the training set sometimes appeared in the development set with a different hypothesis and a flipped label. If a system memorized the training set without meaningfully generalizing, which was easy due to the small size of the training set, it could perform far below chance on the development set. We remove this adversarial design in the SuperGLUE version of WSC by ensuring that no sentences are shared between the training, validation, and test sets.

However, the validation and test sets come from different domains, with the validation set consisting of ambiguous examples such that changing one non-noun phrase word will change the coreference dependencies in the sentence. The test set consists only of more straightforward examples, with a high number of noun phrases (and thus more choices for the model), but low to no ambiguity.

This version fixes issues where the spans are not actually substrings of the text.

super_glue is configured with tfds.text.super_glue.SuperGlueConfig and has the following configurations predefined (defaults to the first one):

  • "cb" (v0.0.2) (Size: 73.56 KiB): The CommitmentBank (De Marneffe et al., 2019) is a corpus of short texts in which at least one sentence contains an embedded clause. Each of these embedded clauses is annotated with the degree to which we expect that the person who wrote the text is committed to the truth of the clause. The resulting task framed as three-class textual entailment on examples that are drawn from the Wall Street Journal, fiction from the British National Corpus, and Switchboard. Each example consists of a premise containing an embedded clause and the corresponding hypothesis is the extraction of that clause. We use a subset of the data that had inter-annotator agreement above 0.85. The data is imbalanced (relatively fewer neutral examples), so we evaluate using accuracy and F1, where for multi-class F1 we compute the unweighted average of the F1 per class.

  • "copa" (v0.0.2) (Size: 42.79 KiB): The Choice Of Plausible Alternatives (COPA, Roemmele et al., 2011) dataset is a causal reasoning task in which a system is given a premise sentence and two possible alternatives. The system must choose the alternative which has the more plausible causal relationship with the premise. The method used for the construction of the alternatives ensures that the task requires causal reasoning to solve. Examples either deal with alternative possible causes or alternative possible effects of the premise sentence, accompanied by a simple question disambiguating between the two instance types for the model. All examples are handcrafted and focus on topics from online blogs and a photography-related encyclopedia. Following the recommendation of the authors, we evaluate using accuracy.

  • "multirc" (v0.0.2) (Size: 1.16 MiB): The Multi-Sentence Reading Comprehension dataset (MultiRC, Khashabi et al., 2018) is a true/false question-answering task. Each example consists of a context paragraph, a question about that paragraph, and a list of possible answers to that question which must be labeled as true or false. Question-answering (QA) is a popular problem with many datasets. We use MultiRC because of a number of desirable properties: (i) each question can have multiple possible correct answers, so each question-answer pair must be evaluated independent of other pairs, (ii) the questions are designed such that answering each question requires drawing facts from multiple context sentences, and (iii) the question-answer pair format more closely matches the API of other SuperGLUE tasks than span-based extractive QA does. The paragraphs are drawn from seven domains including news, fiction, and historical text.

  • "rte" (v0.0.2) (Size: 733.16 KiB): The Recognizing Textual Entailment (RTE) datasets come from a series of annual competitions on textual entailment, the problem of predicting whether a given premise sentence entails a given hypothesis sentence (also known as natural language inference, NLI). RTE was previously included in GLUE, and we use the same data and format as before: We merge data from RTE1 (Dagan et al., 2006), RTE2 (Bar Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009). All datasets are combined and converted to two-class classification: entailment and not_entailment. Of all the GLUE tasks, RTE was among those that benefited from transfer learning the most, jumping from near random-chance performance (~56%) at the time of GLUE's launch to 85% accuracy (Liu et al., 2019c) at the time of writing. Given the eight point gap with respect to human performance, however, the task is not yet solved by machines, and we expect the remaining gap to be difficult to close.

  • "wic" (v0.0.2) (Size: 347.15 KiB): The Word-in-Context (WiC, Pilehvar and Camacho-Collados, 2019) dataset supports a word sense disambiguation task cast as binary classification over sentence pairs. Given two sentences and a polysemous (sense-ambiguous) word that appears in both sentences, the task is to determine whether the word is used with the same sense in both sentences. Sentences are drawn from WordNet (Miller, 1995), VerbNet (Schuler, 2005), and Wiktionary. We follow the original work and evaluate using accuracy.

  • "wsc" (v0.0.2) (Size: 31.84 KiB): The Winograd Schema Challenge (WSC, Levesque et al., 2012) is a reading comprehension task in which a system must read a sentence with a pronoun and select the referent of that pronoun from a list of choices. Given the difficulty of this task and the headroom still left, we have included WSC in SuperGLUE and recast the dataset into its coreference form. The task is cast as a binary classification problem, as opposed to N-multiple choice, in order to isolate the model's ability to understand the coreference links within a sentence as opposed to various other strategies that may come into play in multiple choice conditions. With that in mind, we create a split with 65% negative majority class in the validation set, reflecting the distribution of the hidden test set, and 52% negative class in the training set. The training and validation examples are drawn from the original Winograd Schema dataset (Levesque et al., 2012), as well as those distributed by the affiliated organization Commonsense Reasoning. The test examples are derived from fiction books and have been shared with us by the authors of the original dataset. Previously, a version of WSC recast as NLI as included in GLUE, known as WNLI. No substantial progress was made on WNLI, with many submissions opting to submit only majority class predictions. WNLI was made especially difficult due to an adversarial train/dev split: Premise sentences that appeared in the training set sometimes appeared in the development set with a different hypothesis and a flipped label. If a system memorized the training set without meaningfully generalizing, which was easy due to the small size of the training set, it could perform far below chance on the development set. We remove this adversarial design in the SuperGLUE version of WSC by ensuring that no sentences are shared between the training, validation, and test sets.

However, the validation and test sets come from different domains, with the validation set consisting of ambiguous examples such that changing one non-noun phrase word will change the coreference dependencies in the sentence. The test set consists only of more straightforward examples, with a high number of noun phrases (and thus more choices for the model), but low to no ambiguity.

  • "wsc.fixed" (v0.0.2) (Size: 31.84 KiB): The Winograd Schema Challenge (WSC, Levesque et al., 2012) is a reading comprehension task in which a system must read a sentence with a pronoun and select the referent of that pronoun from a list of choices. Given the difficulty of this task and the headroom still left, we have included WSC in SuperGLUE and recast the dataset into its coreference form. The task is cast as a binary classification problem, as opposed to N-multiple choice, in order to isolate the model's ability to understand the coreference links within a sentence as opposed to various other strategies that may come into play in multiple choice conditions. With that in mind, we create a split with 65% negative majority class in the validation set, reflecting the distribution of the hidden test set, and 52% negative class in the training set. The training and validation examples are drawn from the original Winograd Schema dataset (Levesque et al., 2012), as well as those distributed by the affiliated organization Commonsense Reasoning. The test examples are derived from fiction books and have been shared with us by the authors of the original dataset. Previously, a version of WSC recast as NLI as included in GLUE, known as WNLI. No substantial progress was made on WNLI, with many submissions opting to submit only majority class predictions. WNLI was made especially difficult due to an adversarial train/dev split: Premise sentences that appeared in the training set sometimes appeared in the development set with a different hypothesis and a flipped label. If a system memorized the training set without meaningfully generalizing, which was easy due to the small size of the training set, it could perform far below chance on the development set. We remove this adversarial design in the SuperGLUE version of WSC by ensuring that no sentences are shared between the training, validation, and test sets.

However, the validation and test sets come from different domains, with the validation set consisting of ambiguous examples such that changing one non-noun phrase word will change the coreference dependencies in the sentence. The test set consists only of more straightforward examples, with a high number of noun phrases (and thus more choices for the model), but low to no ambiguity.

This version fixes issues where the spans are not actually substrings of the text.

"super_glue/cb"

FeaturesDict({
    'hypothesis': Text(shape=(), dtype=tf.string, encoder=None),
    'idx': Tensor(shape=(), dtype=tf.int32),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=3),
    'premise': Text(shape=(), dtype=tf.string, encoder=None),
})

"super_glue/copa"

FeaturesDict({
    'choice1': Text(shape=(), dtype=tf.string, encoder=None),
    'choice2': Text(shape=(), dtype=tf.string, encoder=None),
    'idx': Tensor(shape=(), dtype=tf.int32),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'premise': Text(shape=(), dtype=tf.string, encoder=None),
    'question': Text(shape=(), dtype=tf.string, encoder=None),
})

"super_glue/multirc"

FeaturesDict({
    'answer': Text(shape=(), dtype=tf.string, encoder=None),
    'idx': FeaturesDict({
        'answer': Tensor(shape=(), dtype=tf.int32),
        'paragraph': Tensor(shape=(), dtype=tf.int32),
        'question': Tensor(shape=(), dtype=tf.int32),
    }),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'paragraph': Text(shape=(), dtype=tf.string, encoder=None),
    'question': Text(shape=(), dtype=tf.string, encoder=None),
})

"super_glue/rte"

FeaturesDict({
    'hypothesis': Text(shape=(), dtype=tf.string, encoder=None),
    'idx': Tensor(shape=(), dtype=tf.int32),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'premise': Text(shape=(), dtype=tf.string, encoder=None),
})

"super_glue/wic"

FeaturesDict({
    'idx': Tensor(shape=(), dtype=tf.int32),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'pos': Text(shape=(), dtype=tf.string, encoder=None),
    'sentence1': Text(shape=(), dtype=tf.string, encoder=None),
    'sentence2': Text(shape=(), dtype=tf.string, encoder=None),
    'word': Text(shape=(), dtype=tf.string, encoder=None),
})

"super_glue/wsc"

FeaturesDict({
    'idx': Tensor(shape=(), dtype=tf.int32),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'span1_index': Tensor(shape=(), dtype=tf.int32),
    'span1_text': Text(shape=(), dtype=tf.string, encoder=None),
    'span2_index': Tensor(shape=(), dtype=tf.int32),
    'span2_text': Text(shape=(), dtype=tf.string, encoder=None),
    'text': Text(shape=(), dtype=tf.string, encoder=None),
})

"super_glue/wsc.fixed"

FeaturesDict({
    'idx': Tensor(shape=(), dtype=tf.int32),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'span1_index': Tensor(shape=(), dtype=tf.int32),
    'span1_text': Text(shape=(), dtype=tf.string, encoder=None),
    'span2_index': Tensor(shape=(), dtype=tf.int32),
    'span2_text': Text(shape=(), dtype=tf.string, encoder=None),
    'text': Text(shape=(), dtype=tf.string, encoder=None),
})

Statistics

Split Examples
ALL 804
TRAIN 554
TEST 146
VALIDATION 104

Urls

Supervised keys (for as_supervised=True)

None

Citation

@inproceedings{levesque2012winograd,
  title={The winograd schema challenge},
  author={Levesque, Hector and Davis, Ernest and Morgenstern, Leora},
  booktitle={Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning},
  year={2012}
}
@article{wang2019superglue,
  title={SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems},
  author={Wang, Alex and Pruksachatkun, Yada and Nangia, Nikita and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R},
  journal={arXiv preprint arXiv:1905.00537},
  year={2019}
}

Note that each SuperGLUE dataset has its own citation. Please see the source to
get the correct citation for each contained dataset.

"wikipedia"

Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.).

wikipedia is configured with tfds.text.wikipedia.WikipediaConfig and has the following configurations predefined (defaults to the first one):

  • "20190301.aa" (v0.0.2) (Size: 44.09 KiB): Wikipedia dataset for aa, parsed from 20190301 dump.

  • "20190301.ab" (v0.0.2) (Size: 1.31 MiB): Wikipedia dataset for ab, parsed from 20190301 dump.

  • "20190301.ace" (v0.0.2) (Size: 2.66 MiB): Wikipedia dataset for ace, parsed from 20190301 dump.

  • "20190301.ady" (v0.0.2) (Size: 349.43 KiB): Wikipedia dataset for ady, parsed from 20190301 dump.

  • "20190301.af" (v0.0.2) (Size: 84.13 MiB): Wikipedia dataset for af, parsed from 20190301 dump.

  • "20190301.ak" (v0.0.2) (Size: 377.84 KiB): Wikipedia dataset for ak, parsed from 20190301 dump.

  • "20190301.als" (v0.0.2) (Size: 46.90 MiB): Wikipedia dataset for als, parsed from 20190301 dump.

  • "20190301.am" (v0.0.2) (Size: 6.54 MiB): Wikipedia dataset for am, parsed from 20190301 dump.

  • "20190301.an" (v0.0.2) (Size: 31.39 MiB): Wikipedia dataset for an, parsed from 20190301 dump.

  • "20190301.ang" (v0.0.2) (Size: 3.77 MiB): Wikipedia dataset for ang, parsed from 20190301 dump.

  • "20190301.ar" (v0.0.2) (Size: 805.82 MiB): Wikipedia dataset for ar, parsed from 20190301 dump.

  • "20190301.arc" (v0.0.2) (Size: 952.49 KiB): Wikipedia dataset for arc, parsed from 20190301 dump.

  • "20190301.arz" (v0.0.2) (Size: 20.32 MiB): Wikipedia dataset for arz, parsed from 20190301 dump.

  • "20190301.as" (v0.0.2) (Size: 19.06 MiB): Wikipedia dataset for as, parsed from 20190301 dump.

  • "20190301.ast" (v0.0.2) (Size: 216.68 MiB): Wikipedia dataset for ast, parsed from 20190301 dump.

  • "20190301.atj" (v0.0.2) (Size: 467.05 KiB): Wikipedia dataset for atj, parsed from 20190301 dump.

  • "20190301.av" (v0.0.2) (Size: 3.61 MiB): Wikipedia dataset for av, parsed from 20190301 dump.

  • "20190301.ay" (v0.0.2) (Size: 2.06 MiB): Wikipedia dataset for ay, parsed from 20190301 dump.

  • "20190301.az" (v0.0.2) (Size: 163.04 MiB): Wikipedia dataset for az, parsed from 20190301 dump.

  • "20190301.azb" (v0.0.2) (Size: 50.59 MiB): Wikipedia dataset for azb, parsed from 20190301 dump.

  • "20190301.ba" (v0.0.2) (Size: 55.04 MiB): Wikipedia dataset for ba, parsed from 20190301 dump.

  • "20190301.bar" (v0.0.2) (Size: 30.14 MiB): Wikipedia dataset for bar, parsed from 20190301 dump.

  • "20190301.bat-smg" (v0.0.2) (Size: 4.61 MiB): Wikipedia dataset for bat-smg, parsed from 20190301 dump.

  • "20190301.bcl" (v0.0.2) (Size: 6.18 MiB): Wikipedia dataset for bcl, parsed from 20190301 dump.

  • "20190301.be" (v0.0.2) (Size: 192.23 MiB): Wikipedia dataset for be, parsed from 20190301 dump.

  • "20190301.be-x-old" (v0.0.2) (Size: 74.77 MiB): Wikipedia dataset for be-x-old, parsed from 20190301 dump.

  • "20190301.bg" (v0.0.2) (Size: 326.20 MiB): Wikipedia dataset for bg, parsed from 20190301 dump.

  • "20190301.bh" (v0.0.2) (Size: 13.28 MiB): Wikipedia dataset for bh, parsed from 20190301 dump.

  • "20190301.bi" (v0.0.2) (Size: 424.88 KiB): Wikipedia dataset for bi, parsed from 20190301 dump.

  • "20190301.bjn" (v0.0.2) (Size: 2.09 MiB): Wikipedia dataset for bjn, parsed from 20190301 dump.

  • "20190301.bm" (v0.0.2) (Size: 447.98 KiB): Wikipedia dataset for bm, parsed from 20190301 dump.

  • "20190301.bn" (v0.0.2) (Size: 145.04 MiB): Wikipedia dataset for bn, parsed from 20190301 dump.

  • "20190301.bo" (v0.0.2) (Size: 12.41 MiB): Wikipedia dataset for bo, parsed from 20190301 dump.

  • "20190301.bpy" (v0.0.2) (Size: 5.05 MiB): Wikipedia dataset for bpy, parsed from 20190301 dump.

  • "20190301.br" (v0.0.2) (Size: 49.14 MiB): Wikipedia dataset for br, parsed from 20190301 dump.

  • "20190301.bs" (v0.0.2) (Size: 103.26 MiB): Wikipedia dataset for bs, parsed from 20190301 dump.

  • "20190301.bug" (v0.0.2) (Size: 1.76 MiB): Wikipedia dataset for bug, parsed from 20190301 dump.

  • "20190301.bxr" (v0.0.2) (Size: 3.21 MiB): Wikipedia dataset for bxr, parsed from 20190301 dump.

  • "20190301.ca" (v0.0.2) (Size: 849.65 MiB): Wikipedia dataset for ca, parsed from 20190301 dump.

  • "20190301.cbk-zam" (v0.0.2) (Size: 1.84 MiB): Wikipedia dataset for cbk-zam, parsed from 20190301 dump.

  • "20190301.cdo" (v0.0.2) (Size: 3.22 MiB): Wikipedia dataset for cdo, parsed from 20190301 dump.

  • "20190301.ce" (v0.0.2) (Size: 43.89 MiB): Wikipedia dataset for ce, parsed from 20190301 dump.

  • "20190301.ceb" (v0.0.2) (Size: ?? GiB): Wikipedia dataset for ceb, parsed from 20190301 dump.

  • "20190301.ch" (v0.0.2) (Size: 684.97 KiB): Wikipedia dataset for ch, parsed from 20190301 dump.

  • "20190301.cho" (v0.0.2) (Size: 25.99 KiB): Wikipedia dataset for cho, parsed from 20190301 dump.

  • "20190301.chr" (v0.0.2) (Size: 651.25 KiB): Wikipedia dataset for chr, parsed from 20190301 dump.

  • "20190301.chy" (v0.0.2) (Size: 325.90 KiB): Wikipedia dataset for chy, parsed from 20190301 dump.

  • "20190301.ckb" (v0.0.2) (Size: 22.16 MiB): Wikipedia dataset for ckb, parsed from 20190301 dump.

  • "20190301.co" (v0.0.2) (Size: 3.38 MiB): Wikipedia dataset for co, parsed from 20190301 dump.

  • "20190301.cr" (v0.0.2) (Size: 259.71 KiB): Wikipedia dataset for cr, parsed from 20190301 dump.

  • "20190301.crh" (v0.0.2) (Size: 4.01 MiB): Wikipedia dataset for crh, parsed from 20190301 dump.

  • "20190301.cs" (v0.0.2) (Size: 759.21 MiB): Wikipedia dataset for cs, parsed from 20190301 dump.

  • "20190301.csb" (v0.0.2) (Size: 2.03 MiB): Wikipedia dataset for csb, parsed from 20190301 dump.

  • "20190301.cu" (v0.0.2) (Size: 631.49 KiB): Wikipedia dataset for cu, parsed from 20190301 dump.

  • "20190301.cv" (v0.0.2) (Size: 22.23 MiB): Wikipedia dataset for cv, parsed from 20190301 dump.

  • "20190301.cy" (v0.0.2) (Size: 64.37 MiB): Wikipedia dataset for cy, parsed from 20190301 dump.

  • "20190301.da" (v0.0.2) (Size: 323.53 MiB): Wikipedia dataset for da, parsed from 20190301 dump.

  • "20190301.de" (v0.0.2) (Size: 4.97 GiB): Wikipedia dataset for de, parsed from 20190301 dump.

  • "20190301.din" (v0.0.2) (Size: 457.06 KiB): Wikipedia dataset for din, parsed from 20190301 dump.

  • "20190301.diq" (v0.0.2) (Size: 7.24 MiB): Wikipedia dataset for diq, parsed from 20190301 dump.

  • "20190301.dsb" (v0.0.2) (Size: 3.54 MiB): Wikipedia dataset for dsb, parsed from 20190301 dump.

  • "20190301.dty" (v0.0.2) (Size: 4.95 MiB): Wikipedia dataset for dty, parsed from 20190301 dump.

  • "20190301.dv" (v0.0.2) (Size: 4.24 MiB): Wikipedia dataset for dv, parsed from 20190301 dump.

  • "20190301.dz" (v0.0.2) (Size: 360.01 KiB): Wikipedia dataset for dz, parsed from 20190301 dump.

  • "20190301.ee" (v0.0.2) (Size: 434.14 KiB): Wikipedia dataset for ee, parsed from 20190301 dump.

  • "20190301.el" (v0.0.2) (Size: 324.40 MiB): Wikipedia dataset for el, parsed from 20190301 dump.

  • "20190301.eml" (v0.0.2) (Size: 7.72 MiB): Wikipedia dataset for eml, parsed from 20190301 dump.

  • "20190301.en" (v0.0.2) (Size: 15.72 GiB): Wikipedia dataset for en, parsed from 20190301 dump.

  • "20190301.eo" (v0.0.2) (Size: 245.73 MiB): Wikipedia dataset for eo, parsed from 20190301 dump.

  • "20190301.es" (v0.0.2) (Size: 2.93 GiB): Wikipedia dataset for es, parsed from 20190301 dump.

  • "20190301.et" (v0.0.2) (Size: 196.03 MiB): Wikipedia dataset for et, parsed from 20190301 dump.

  • "20190301.eu" (v0.0.2) (Size: 180.35 MiB): Wikipedia dataset for eu, parsed from 20190301 dump.

  • "20190301.ext" (v0.0.2) (Size: 2.40 MiB): Wikipedia dataset for ext, parsed from 20190301 dump.

  • "20190301.fa" (v0.0.2) (Size: 693.84 MiB): Wikipedia dataset for fa, parsed from 20190301 dump.

  • "20190301.ff" (v0.0.2) (Size: 387.75 KiB): Wikipedia dataset for ff, parsed from 20190301 dump.

  • "20190301.fi" (v0.0.2) (Size: 656.44 MiB): Wikipedia dataset for fi, parsed from 20190301 dump.

  • "20190301.fiu-vro" (v0.0.2) (Size: 2.00 MiB): Wikipedia dataset for fiu-vro, parsed from 20190301 dump.

  • "20190301.fj" (v0.0.2) (Size: 262.98 KiB): Wikipedia dataset for fj, parsed from 20190301 dump.

  • "20190301.fo" (v0.0.2) (Size: 13.67 MiB): Wikipedia dataset for fo, parsed from 20190301 dump.

  • "20190301.fr" (v0.0.2) (Size: 4.14 GiB): Wikipedia dataset for fr, parsed from 20190301 dump.

  • "20190301.frp" (v0.0.2) (Size: 2.03 MiB): Wikipedia dataset for frp, parsed from 20190301 dump.

  • "20190301.frr" (v0.0.2) (Size: 7.88 MiB): Wikipedia dataset for frr, parsed from 20190301 dump.

  • "20190301.fur" (v0.0.2) (Size: 2.29 MiB): Wikipedia dataset for fur, parsed from 20190301 dump.

  • "20190301.fy" (v0.0.2) (Size: 45.52 MiB): Wikipedia dataset for fy, parsed from 20190301 dump.

  • "20190301.ga" (v0.0.2) (Size: 24.78 MiB): Wikipedia dataset for ga, parsed from 20190301 dump.

  • "20190301.gag" (v0.0.2) (Size: 2.04 MiB): Wikipedia dataset for gag, parsed from 20190301 dump.

  • "20190301.gan" (v0.0.2) (Size: 3.82 MiB): Wikipedia dataset for gan, parsed from 20190301 dump.

  • "20190301.gd" (v0.0.2) (Size: 8.51 MiB): Wikipedia dataset for gd, parsed from 20190301 dump.

  • "20190301.gl" (v0.0.2) (Size: 235.07 MiB): Wikipedia dataset for gl, parsed from 20190301 dump.

  • "20190301.glk" (v0.0.2) (Size: 1.91 MiB): Wikipedia dataset for glk, parsed from 20190301 dump.

  • "20190301.gn" (v0.0.2) (Size: 3.37 MiB): Wikipedia dataset for gn, parsed from 20190301 dump.

  • "20190301.gom" (v0.0.2) (Size: 6.07 MiB): Wikipedia dataset for gom, parsed from 20190301 dump.

  • "20190301.gor" (v0.0.2) (Size: 1.28 MiB): Wikipedia dataset for gor, parsed from 20190301 dump.

  • "20190301.got" (v0.0.2) (Size: 604.10 KiB): Wikipedia dataset for got, parsed from 20190301 dump.

  • "20190301.gu" (v0.0.2) (Size: 27.23 MiB): Wikipedia dataset for gu, parsed from 20190301 dump.

  • "20190301.gv" (v0.0.2) (Size: 5.32 MiB): Wikipedia dataset for gv, parsed from 20190301 dump.

  • "20190301.ha" (v0.0.2) (Size: 1.62 MiB): Wikipedia dataset for ha, parsed from 20190301 dump.

  • "20190301.hak" (v0.0.2) (Size: 3.28 MiB): Wikipedia dataset for hak, parsed from 20190301 dump.

  • "20190301.haw" (v0.0.2) (Size: 1017.76 KiB): Wikipedia dataset for haw, parsed from 20190301 dump.

  • "20190301.he" (v0.0.2) (Size: 572.30 MiB): Wikipedia dataset for he, parsed from 20190301 dump.

  • "20190301.hi" (v0.0.2) (Size: 137.86 MiB): Wikipedia dataset for hi, parsed from 20190301 dump.

  • "20190301.hif" (v0.0.2) (Size: 4.57 MiB): Wikipedia dataset for hif, parsed from 20190301 dump.

  • "20190301.ho" (v0.0.2) (Size: 18.37 KiB): Wikipedia dataset for ho, parsed from 20190301 dump.

  • "20190301.hr" (v0.0.2) (Size: 246.05 MiB): Wikipedia dataset for hr, parsed from 20190301 dump.

  • "20190301.hsb" (v0.0.2) (Size: 10.38 MiB): Wikipedia dataset for hsb, parsed from 20190301 dump.

  • "20190301.ht" (v0.0.2) (Size: 10.23 MiB): Wikipedia dataset for ht, parsed from 20190301 dump.

  • "20190301.hu" (v0.0.2) (Size: 810.17 MiB): Wikipedia dataset for hu, parsed from 20190301 dump.

  • "20190301.hy" (v0.0.2) (Size: 277.53 MiB): Wikipedia dataset for hy, parsed from 20190301 dump.

  • "20190301.hz" (v0.0.2) (Size: 16.35 KiB): Wikipedia dataset for hz, parsed from 20190301 dump.

  • "20190301.ia" (v0.0.2) (Size: 7.85 MiB): Wikipedia dataset for ia, parsed from 20190301 dump.

  • "20190301.id" (v0.0.2) (Size: 523.94 MiB): Wikipedia dataset for id, parsed from 20190301 dump.

  • "20190301.ie" (v0.0.2) (Size: 1.70 MiB): Wikipedia dataset for ie, parsed from 20190301 dump.

  • "20190301.ig" (v0.0.2) (Size: 1.00 MiB): Wikipedia dataset for ig, parsed from 20190301 dump.

  • "20190301.ii" (v0.0.2) (Size: 30.88 KiB): Wikipedia dataset for ii, parsed from 20190301 dump.

  • "20190301.ik" (v0.0.2) (Size: 238.12 KiB): Wikipedia dataset for ik, parsed from 20190301 dump.

  • "20190301.ilo" (v0.0.2) (Size: 15.22 MiB): Wikipedia dataset for ilo, parsed from 20190301 dump.

  • "20190301.inh" (v0.0.2) (Size: 1.26 MiB): Wikipedia dataset for inh, parsed from 20190301 dump.

  • "20190301.io" (v0.0.2) (Size: 12.56 MiB): Wikipedia dataset for io, parsed from 20190301 dump.

  • "20190301.is" (v0.0.2) (Size: 41.86 MiB): Wikipedia dataset for is, parsed from 20190301 dump.

  • "20190301.it" (v0.0.2) (Size: 2.66 GiB): Wikipedia dataset for it, parsed from 20190301 dump.

  • "20190301.iu" (v0.0.2) (Size: 284.06 KiB): Wikipedia dataset for iu, parsed from 20190301 dump.

  • "20190301.ja" (v0.0.2) (Size: 2.74 GiB): Wikipedia dataset for ja, parsed from 20190301 dump.

  • "20190301.jam" (v0.0.2) (Size: 895.29 KiB): Wikipedia dataset for jam, parsed from 20190301 dump.

  • "20190301.jbo" (v0.0.2) (Size: 1.06 MiB): Wikipedia dataset for jbo, parsed from 20190301 dump.

  • "20190301.jv" (v0.0.2) (Size: 39.32 MiB): Wikipedia dataset for jv, parsed from 20190301 dump.

  • "20190301.ka" (v0.0.2) (Size: 131.78 MiB): Wikipedia dataset for ka, parsed from 20190301 dump.

  • "20190301.kaa" (v0.0.2) (Size: 1.35 MiB): Wikipedia dataset for kaa, parsed from 20190301 dump.

  • "20190301.kab" (v0.0.2) (Size: 3.62 MiB): Wikipedia dataset for kab, parsed from 20190301 dump.

  • "20190301.kbd" (v0.0.2) (Size: 1.65 MiB): Wikipedia dataset for kbd, parsed from 20190301 dump.

  • "20190301.kbp" (v0.0.2) (Size: 1.24 MiB): Wikipedia dataset for kbp, parsed from 20190301 dump.

  • "20190301.kg" (v0.0.2) (Size: 439.26 KiB): Wikipedia dataset for kg, parsed from 20190301 dump.

  • "20190301.ki" (v0.0.2) (Size: 370.78 KiB): Wikipedia dataset for ki, parsed from 20190301 dump.

  • "20190301.kj" (v0.0.2) (Size: 16.58 KiB): Wikipedia dataset for kj, parsed from 20190301 dump.

  • "20190301.kk" (v0.0.2) (Size: 113.46 MiB): Wikipedia dataset for kk, parsed from 20190301 dump.

  • "20190301.kl" (v0.0.2) (Size: 862.51 KiB): Wikipedia dataset for kl, parsed from 20190301 dump.

  • "20190301.km" (v0.0.2) (Size: 21.92 MiB): Wikipedia dataset for km, parsed from 20190301 dump.

  • "20190301.kn" (v0.0.2) (Size: 69.62 MiB): Wikipedia dataset for kn, parsed from 20190301 dump.

  • "20190301.ko" (v0.0.2) (Size: 625.16 MiB): Wikipedia dataset for ko, parsed from 20190301 dump.

  • "20190301.koi" (v0.0.2) (Size: 2.12 MiB): Wikipedia dataset for koi, parsed from 20190301 dump.

  • "20190301.kr" (v0.0.2) (Size: 13.89 KiB): Wikipedia dataset for kr, parsed from 20190301 dump.

  • "20190301.krc" (v0.0.2) (Size: 3.16 MiB): Wikipedia dataset for krc, parsed from 20190301 dump.

  • "20190301.ks" (v0.0.2) (Size: 309.15 KiB): Wikipedia dataset for ks, parsed from 20190301 dump.

  • "20190301.ksh" (v0.0.2) (Size: 3.07 MiB): Wikipedia dataset for ksh, parsed from 20190301 dump.

  • "20190301.ku" (v0.0.2) (Size: 17.09 MiB): Wikipedia dataset for ku, parsed from 20190301 dump.

  • "20190301.kv" (v0.0.2) (Size: 3.36 MiB): Wikipedia dataset for kv, parsed from 20190301 dump.

  • "20190301.kw" (v0.0.2) (Size: 1.71 MiB): Wikipedia dataset for kw, parsed from 20190301 dump.

  • "20190301.ky" (v0.0.2) (Size: 33.13 MiB): Wikipedia dataset for ky, parsed from 20190301 dump.

  • "20190301.la" (v0.0.2) (Size: 82.72 MiB): Wikipedia dataset for la, parsed from 20190301 dump.

  • "20190301.lad" (v0.0.2) (Size: 3.39 MiB): Wikipedia dataset for lad, parsed from 20190301 dump.

  • "20190301.lb" (v0.0.2) (Size: 45.70 MiB): Wikipedia dataset for lb, parsed from 20190301 dump.

  • "20190301.lbe" (v0.0.2) (Size: 1.22 MiB): Wikipedia dataset for lbe, parsed from 20190301 dump.

  • "20190301.lez" (v0.0.2) (Size: 4.16 MiB): Wikipedia dataset for lez, parsed from 20190301 dump.

  • "20190301.lfn" (v0.0.2) (Size: 2.81 MiB): Wikipedia dataset for lfn, parsed from 20190301 dump.

  • "20190301.lg" (v0.0.2) (Size: 1.58 MiB): Wikipedia dataset for lg, parsed from 20190301 dump.

  • "20190301.li" (v0.0.2) (Size: 13.86 MiB): Wikipedia dataset for li, parsed from 20190301 dump.

  • "20190301.lij" (v0.0.2) (Size: 2.73 MiB): Wikipedia dataset for lij, parsed from 20190301 dump.

  • "20190301.lmo" (v0.0.2) (Size: 21.34 MiB): Wikipedia dataset for lmo, parsed from 20190301 dump.

  • "20190301.ln" (v0.0.2) (Size: 1.83 MiB): Wikipedia dataset for ln, parsed from 20190301 dump.

  • "20190301.lo" (v0.0.2) (Size: 3.44 MiB): Wikipedia dataset for lo, parsed from 20190301 dump.

  • "20190301.lrc" (v0.0.2) (Size: 4.71 MiB): Wikipedia dataset for lrc, parsed from 20190301 dump.

  • "20190301.lt" (v0.0.2) (Size: 174.73 MiB): Wikipedia dataset for lt, parsed from 20190301 dump.

  • "20190301.ltg" (v0.0.2) (Size: 798.18 KiB): Wikipedia dataset for ltg, parsed from 20190301 dump.

  • "20190301.lv" (v0.0.2) (Size: 127.47 MiB): Wikipedia dataset for lv, parsed from 20190301 dump.

  • "20190301.mai" (v0.0.2) (Size: 10.80 MiB): Wikipedia dataset for mai, parsed from 20190301 dump.

  • "20190301.map-bms" (v0.0.2) (Size: 4.49 MiB): Wikipedia dataset for map-bms, parsed from 20190301 dump.

  • "20190301.mdf" (v0.0.2) (Size: 1.04 MiB): Wikipedia dataset for mdf, parsed from 20190301 dump.

  • "20190301.mg" (v0.0.2) (Size: 25.64 MiB): Wikipedia dataset for mg, parsed from 20190301 dump.

  • "20190301.mh" (v0.0.2) (Size: 27.71 KiB): Wikipedia dataset for mh, parsed from 20190301 dump.

  • "20190301.mhr" (v0.0.2) (Size: 5.69 MiB): Wikipedia dataset for mhr, parsed from 20190301 dump.

  • "20190301.mi" (v0.0.2) (Size: 1.96 MiB): Wikipedia dataset for mi, parsed from 20190301 dump.

  • "20190301.min" (v0.0.2) (Size: 25.05 MiB): Wikipedia dataset for min, parsed from 20190301 dump.

  • "20190301.mk" (v0.0.2) (Size: 140.69 MiB): Wikipedia dataset for mk, parsed from 20190301 dump.

  • "20190301.ml" (v0.0.2) (Size: 117.24 MiB): Wikipedia dataset for ml, parsed from 20190301 dump.

  • "20190301.mn" (v0.0.2) (Size: 28.23 MiB): Wikipedia dataset for mn, parsed from 20190301 dump.

  • "20190301.mr" (v0.0.2) (Size: 49.58 MiB): Wikipedia dataset for mr, parsed from 20190301 dump.

  • "20190301.mrj" (v0.0.2) (Size: 3.01 MiB): Wikipedia dataset for mrj, parsed from 20190301 dump.

  • "20190301.ms" (v0.0.2) (Size: 205.79 MiB): Wikipedia dataset for ms, parsed from 20190301 dump.

  • "20190301.mt" (v0.0.2) (Size: 8.21 MiB): Wikipedia dataset for mt, parsed from 20190301 dump.

  • "20190301.mus" (v0.0.2) (Size: 14.20 KiB): Wikipedia dataset for mus, parsed from 20190301 dump.

  • "20190301.mwl" (v0.0.2) (Size: 8.95 MiB): Wikipedia dataset for mwl, parsed from 20190301 dump.

  • "20190301.my" (v0.0.2) (Size: 34.60 MiB): Wikipedia dataset for my, parsed from 20190301 dump.

  • "20190301.myv" (v0.0.2) (Size: 7.79 MiB): Wikipedia dataset for myv, parsed from 20190301 dump.

  • "20190301.mzn" (v0.0.2) (Size: 6.47 MiB): Wikipedia dataset for mzn, parsed from 20190301 dump.

  • "20190301.na" (v0.0.2) (Size: 480.57 KiB): Wikipedia dataset for na, parsed from 20190301 dump.

  • "20190301.nah" (v0.0.2) (Size: 4.30 MiB): Wikipedia dataset for nah, parsed from 20190301 dump.

  • "20190301.nap" (v0.0.2) (Size: 5.55 MiB): Wikipedia dataset for nap, parsed from 20190301 dump.

  • "20190301.nds" (v0.0.2) (Size: 33.28 MiB): Wikipedia dataset for nds, parsed from 20190301 dump.

  • "20190301.nds-nl" (v0.0.2) (Size: 6.67 MiB): Wikipedia dataset for nds-nl, parsed from 20190301 dump.

  • "20190301.ne" (v0.0.2) (Size: 29.26 MiB): Wikipedia dataset for ne, parsed from 20190301 dump.

  • "20190301.new" (v0.0.2) (Size: 16.91 MiB): Wikipedia dataset for new, parsed from 20190301 dump.

  • "20190301.ng" (v0.0.2) (Size: 91.11 KiB): Wikipedia dataset for ng, parsed from 20190301 dump.

  • "20190301.nl" (v0.0.2) (Size: 1.38 GiB): Wikipedia dataset for nl, parsed from 20190301 dump.

  • "20190301.nn" (v0.0.2) (Size: 126.01 MiB): Wikipedia dataset for nn, parsed from 20190301 dump.

  • "20190301.no" (v0.0.2) (Size: 610.74 MiB): Wikipedia dataset for no, parsed from 20190301 dump.

  • "20190301.nov" (v0.0.2) (Size: 1.12 MiB): Wikipedia dataset for nov, parsed from 20190301 dump.

  • "20190301.nrm" (v0.0.2) (Size: 1.56 MiB): Wikipedia dataset for nrm, parsed from 20190301 dump.

  • "20190301.nso" (v0.0.2) (Size: 2.20 MiB): Wikipedia dataset for nso, parsed from 20190301 dump.

  • "20190301.nv" (v0.0.2) (Size: 2.52 MiB): Wikipedia dataset for nv, parsed from 20190301 dump.

  • "20190301.ny" (v0.0.2) (Size: 1.18 MiB): Wikipedia dataset for ny, parsed from 20190301 dump.

  • "20190301.oc" (v0.0.2) (Size: 70.97 MiB): Wikipedia dataset for oc, parsed from 20190301 dump.

  • "20190301.olo" (v0.0.2) (Size: 1.55 MiB): Wikipedia dataset for olo, parsed from 20190301 dump.

  • "20190301.om" (v0.0.2) (Size: 1.06 MiB): Wikipedia dataset for om, parsed from 20190301 dump.

  • "20190301.or" (v0.0.2) (Size: 24.90 MiB): Wikipedia dataset for or, parsed from 20190301 dump.

  • "20190301.os" (v0.0.2) (Size: 7.31 MiB): Wikipedia dataset for os, parsed from 20190301 dump.

  • "20190301.pa" (v0.0.2) (Size: 40.39 MiB): Wikipedia dataset for pa, parsed from 20190301 dump.

  • "20190301.pag" (v0.0.2) (Size: 1.29 MiB): Wikipedia dataset for pag, parsed from 20190301 dump.

  • "20190301.pam" (v0.0.2) (Size: 8.17 MiB): Wikipedia dataset for pam, parsed from 20190301 dump.

  • "20190301.pap" (v0.0.2) (Size: 1.33 MiB): Wikipedia dataset for pap, parsed from 20190301 dump.

  • "20190301.pcd" (v0.0.2) (Size: 4.14 MiB): Wikipedia dataset for pcd, parsed from 20190301 dump.

  • "20190301.pdc" (v0.0.2) (Size: 1.10 MiB): Wikipedia dataset for pdc, parsed from 20190301 dump.

  • "20190301.pfl" (v0.0.2) (Size: 3.22 MiB): Wikipedia dataset for pfl, parsed from 20190301 dump.

  • "20190301.pi" (v0.0.2) (Size: 586.77 KiB): Wikipedia dataset for pi, parsed from 20190301 dump.

  • "20190301.pih" (v0.0.2) (Size: 654.11 KiB): Wikipedia dataset for pih, parsed from 20190301 dump.

  • "20190301.pl" (v0.0.2) (Size: 1.76 GiB): Wikipedia dataset for pl, parsed from 20190301 dump.

  • "20190301.pms" (v0.0.2) (Size: 13.42 MiB): Wikipedia dataset for pms, parsed from 20190301 dump.

  • "20190301.pnb" (v0.0.2) (Size: 24.31 MiB): Wikipedia dataset for pnb, parsed from 20190301 dump.

  • "20190301.pnt" (v0.0.2) (Size: 533.84 KiB): Wikipedia dataset for pnt, parsed from 20190301 dump.

  • "20190301.ps" (v0.0.2) (Size: 14.09 MiB): Wikipedia dataset for ps, parsed from 20190301 dump.

  • "20190301.pt" (v0.0.2) (Size: 1.58 GiB): Wikipedia dataset for pt, parsed from 20190301 dump.

  • "20190301.qu" (v0.0.2) (Size: 11.42 MiB): Wikipedia dataset for qu, parsed from 20190301 dump.

  • "20190301.rm" (v0.0.2) (Size: 5.85 MiB): Wikipedia dataset for rm, parsed from 20190301 dump.

  • "20190301.rmy" (v0.0.2) (Size: 509.61 KiB): Wikipedia dataset for rmy, parsed from 20190301 dump.

  • "20190301.rn" (v0.0.2) (Size: 779.25 KiB): Wikipedia dataset for rn, parsed from 20190301 dump.

  • "20190301.ro" (v0.0.2) (Size: 449.49 MiB): Wikipedia dataset for ro, parsed from 20190301 dump.

  • "20190301.roa-rup" (v0.0.2) (Size: 931.23 KiB): Wikipedia dataset for roa-rup, parsed from 20190301 dump.

  • "20190301.roa-tara" (v0.0.2) (Size: 5.98 MiB): Wikipedia dataset for roa-tara, parsed from 20190301 dump.

  • "20190301.ru" (v0.0.2) (Size: 3.51 GiB): Wikipedia dataset for ru, parsed from 20190301 dump.

  • "20190301.rue" (v0.0.2) (Size: 4.11 MiB): Wikipedia dataset for rue, parsed from 20190301 dump.

  • "20190301.rw" (v0.0.2) (Size: 904.81 KiB): Wikipedia dataset for rw, parsed from 20190301 dump.

  • "20190301.sa" (v0.0.2) (Size: 14.29 MiB): Wikipedia dataset for sa, parsed from 20190301 dump.

  • "20190301.sah" (v0.0.2) (Size: 11.88 MiB): Wikipedia dataset for sah, parsed from 20190301 dump.

  • "20190301.sat" (v0.0.2) (Size: 2.36 MiB): Wikipedia dataset for sat, parsed from 20190301 dump.

  • "20190301.sc" (v0.0.2) (Size: 4.39 MiB): Wikipedia dataset for sc, parsed from 20190301 dump.

  • "20190301.scn" (v0.0.2) (Size: 11.83 MiB): Wikipedia dataset for scn, parsed from 20190301 dump.

  • "20190301.sco" (v0.0.2) (Size: 57.80 MiB): Wikipedia dataset for sco, parsed from 20190301 dump.

  • "20190301.sd" (v0.0.2) (Size: 12.62 MiB): Wikipedia dataset for sd, parsed from 20190301 dump.

  • "20190301.se" (v0.0.2) (Size: 3.30 MiB): Wikipedia dataset for se, parsed from 20190301 dump.

  • "20190301.sg" (v0.0.2) (Size: 286.02 KiB): Wikipedia dataset for sg, parsed from 20190301 dump.

  • "20190301.sh" (v0.0.2) (Size: 406.72 MiB): Wikipedia dataset for sh, parsed from 20190301 dump.

  • "20190301.si" (v0.0.2) (Size: 36.84 MiB): Wikipedia dataset for si, parsed from 20190301 dump.

  • "20190301.simple" (v0.0.2) (Size: 156.11 MiB): Wikipedia dataset for simple, parsed from 20190301 dump.

  • "20190301.sk" (v0.0.2) (Size: 254.37 MiB): Wikipedia dataset for sk, parsed from 20190301 dump.

  • "20190301.sl" (v0.0.2) (Size: 201.41 MiB): Wikipedia dataset for sl, parsed from 20190301 dump.

  • "20190301.sm" (v0.0.2) (Size: 678.46 KiB): Wikipedia dataset for sm, parsed from 20190301 dump.

  • "20190301.sn" (v0.0.2) (Size: 2.02 MiB): Wikipedia dataset for sn, parsed from 20190301 dump.

  • "20190301.so" (v0.0.2) (Size: 8.17 MiB): Wikipedia dataset for so, parsed from 20190301 dump.

  • "20190301.sq" (v0.0.2) (Size: 77.55 MiB): Wikipedia dataset for sq, parsed from 20190301 dump.

  • "20190301.sr" (v0.0.2) (Size: 725.30 MiB): Wikipedia dataset for sr, parsed from 20190301 dump.

  • "20190301.srn" (v0.0.2) (Size: 634.21 KiB): Wikipedia dataset for srn, parsed from 20190301 dump.

  • "20190301.ss" (v0.0.2) (Size: 737.58 KiB): Wikipedia dataset for ss, parsed from 20190301 dump.

  • "20190301.st" (v0.0.2) (Size: 482.27 KiB): Wikipedia dataset for st, parsed from 20190301 dump.

  • "20190301.stq" (v0.0.2) (Size: 3.26 MiB): Wikipedia dataset for stq, parsed from 20190301 dump.

  • "20190301.su" (v0.0.2) (Size: 20.52 MiB): Wikipedia dataset for su, parsed from 20190301 dump.

  • "20190301.sv" (v0.0.2) (Size: ?? GiB): Wikipedia dataset for sv, parsed from 20190301 dump.

  • "20190301.sw" (v0.0.2) (Size: 27.60 MiB): Wikipedia dataset for sw, parsed from 20190301 dump.

  • "20190301.szl" (v0.0.2) (Size: 4.06 MiB): Wikipedia dataset for szl, parsed from 20190301 dump.

  • "20190301.ta" (v0.0.2) (Size: 141.07 MiB): Wikipedia dataset for ta, parsed from 20190301 dump.

  • "20190301.tcy" (v0.0.2) (Size: 2.33 MiB): Wikipedia dataset for tcy, parsed from 20190301 dump.

  • "20190301.te" (v0.0.2) (Size: 113.16 MiB): Wikipedia dataset for te, parsed from 20190301 dump.

  • "20190301.tet" (v0.0.2) (Size: 1.06 MiB): Wikipedia dataset for tet, parsed from 20190301 dump.

  • "20190301.tg" (v0.0.2) (Size: 36.95 MiB): Wikipedia dataset for tg, parsed from 20190301 dump.

  • "20190301.th" (v0.0.2) (Size: 254.00 MiB): Wikipedia dataset for th, parsed from 20190301 dump.

  • "20190301.ti" (v0.0.2) (Size: 309.72 KiB): Wikipedia dataset for ti, parsed from 20190301 dump.

  • "20190301.tk" (v0.0.2) (Size: 4.50 MiB): Wikipedia dataset for tk, parsed from 20190301 dump.

  • "20190301.tl" (v0.0.2) (Size: 50.85 MiB): Wikipedia dataset for tl, parsed from 20190301 dump.

  • "20190301.tn" (v0.0.2) (Size: 1.21 MiB): Wikipedia dataset for tn, parsed from 20190301 dump.

  • "20190301.to" (v0.0.2) (Size: 775.10 KiB): Wikipedia dataset for to, parsed from 20190301 dump.

  • "20190301.tpi" (v0.0.2) (Size: 1.39 MiB): Wikipedia dataset for tpi, parsed from 20190301 dump.

  • "20190301.tr" (v0.0.2) (Size: 497.19 MiB): Wikipedia dataset for tr, parsed from 20190301 dump.

  • "20190301.ts" (v0.0.2) (Size: 1.39 MiB): Wikipedia dataset for ts, parsed from 20190301 dump.

  • "20190301.tt" (v0.0.2) (Size: 53.23 MiB): Wikipedia dataset for tt, parsed from 20190301 dump.

  • "20190301.tum" (v0.0.2) (Size: 309.58 KiB): Wikipedia dataset for tum, parsed from 20190301 dump.

  • "20190301.tw" (v0.0.2) (Size: 345.96 KiB): Wikipedia dataset for tw, parsed from 20190301 dump.

  • "20190301.ty" (v0.0.2) (Size: 485.56 KiB): Wikipedia dataset for ty, parsed from 20190301 dump.

  • "20190301.tyv" (v0.0.2) (Size: 2.60 MiB): Wikipedia dataset for tyv, parsed from 20190301 dump.

  • "20190301.udm" (v0.0.2) (Size: 2.94 MiB): Wikipedia dataset for udm, parsed from 20190301 dump.

  • "20190301.ug" (v0.0.2) (Size: 5.64 MiB): Wikipedia dataset for ug, parsed from 20190301 dump.

  • "20190301.uk" (v0.0.2) (Size: 1.28 GiB): Wikipedia dataset for uk, parsed from 20190301 dump.

  • "20190301.ur" (v0.0.2) (Size: 129.57 MiB): Wikipedia dataset for ur, parsed from 20190301 dump.

  • "20190301.uz" (v0.0.2) (Size: 60.85 MiB): Wikipedia dataset for uz, parsed from 20190301 dump.

  • "20190301.ve" (v0.0.2) (Size: 257.59 KiB): Wikipedia dataset for ve, parsed from 20190301 dump.

  • "20190301.vec" (v0.0.2) (Size: 10.65 MiB): Wikipedia dataset for vec, parsed from 20190301 dump.

  • "20190301.vep" (v0.0.2) (Size: 4.59 MiB): Wikipedia dataset for vep, parsed from 20190301 dump.

  • "20190301.vi" (v0.0.2) (Size: 623.74 MiB): Wikipedia dataset for vi, parsed from 20190301 dump.

  • "20190301.vls" (v0.0.2) (Size: 6.58 MiB): Wikipedia dataset for vls, parsed from 20190301 dump.

  • "20190301.vo" (v0.0.2) (Size: 23.80 MiB): Wikipedia dataset for vo, parsed from 20190301 dump.

  • "20190301.wa" (v0.0.2) (Size: 8.75 MiB): Wikipedia dataset for wa, parsed from 20190301 dump.

  • "20190301.war" (v0.0.2) (Size: 256.72 MiB): Wikipedia dataset for war, parsed from 20190301 dump.

  • "20190301.wo" (v0.0.2) (Size: 1.54 MiB): Wikipedia dataset for wo, parsed from 20190301 dump.

  • "20190301.wuu" (v0.0.2) (Size: 9.08 MiB): Wikipedia dataset for wuu, parsed from 20190301 dump.

  • "20190301.xal" (v0.0.2) (Size: 1.64 MiB): Wikipedia dataset for xal, parsed from 20190301 dump.

  • "20190301.xh" (v0.0.2) (Size: 1.26 MiB): Wikipedia dataset for xh, parsed from 20190301 dump.

  • "20190301.xmf" (v0.0.2) (Size: 9.40 MiB): Wikipedia dataset for xmf, parsed from 20190301 dump.

  • "20190301.yi" (v0.0.2) (Size: 11.56 MiB): Wikipedia dataset for yi, parsed from 20190301 dump.

  • "20190301.yo" (v0.0.2) (Size: 11.55 MiB): Wikipedia dataset for yo, parsed from 20190301 dump.

  • "20190301.za" (v0.0.2) (Size: 735.93 KiB): Wikipedia dataset for za, parsed from 20190301 dump.

  • "20190301.zea" (v0.0.2) (Size: 2.47 MiB): Wikipedia dataset for zea, parsed from 20190301 dump.

  • "20190301.zh" (v0.0.2) (Size: 1.71 GiB): Wikipedia dataset for zh, parsed from 20190301 dump.

  • "20190301.zh-classical" (v0.0.2) (Size: 13.37 MiB): Wikipedia dataset for zh-classical, parsed from 20190301 dump.

  • "20190301.zh-min-nan" (v0.0.2) (Size: 50.30 MiB): Wikipedia dataset for zh-min-nan, parsed from 20190301 dump.

  • "20190301.zh-yue" (v0.0.2) (Size: 52.41 MiB): Wikipedia dataset for zh-yue, parsed from 20190301 dump.

  • "20190301.zu" (v0.0.2) (Size: 1.50 MiB): Wikipedia dataset for zu, parsed from 20190301 dump.

"wikipedia/20190301.aa"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ab"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ace"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ady"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.af"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ak"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.als"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.am"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.an"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ang"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ar"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.arc"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.arz"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.as"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ast"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.atj"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.av"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ay"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.az"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.azb"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ba"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.bar"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.bat-smg"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.bcl"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.be"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.be-x-old"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.bg"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.bh"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.bi"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.bjn"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.bm"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.bn"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.bo"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.bpy"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.br"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.bs"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.bug"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.bxr"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ca"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.cbk-zam"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.cdo"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ce"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ceb"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ch"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.cho"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.chr"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.chy"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ckb"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.co"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.cr"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.crh"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.cs"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.csb"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.cu"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.cv"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.cy"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.da"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.de"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.din"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.diq"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.dsb"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.dty"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.dv"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.dz"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ee"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.el"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.eml"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.en"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.eo"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.es"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.et"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.eu"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ext"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.fa"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ff"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.fi"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.fiu-vro"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.fj"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.fo"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.fr"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.frp"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.frr"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.fur"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.fy"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ga"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.gag"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.gan"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.gd"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.gl"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.glk"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.gn"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.gom"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.gor"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.got"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.gu"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.gv"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ha"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.hak"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.haw"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.he"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.hi"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.hif"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ho"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.hr"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.hsb"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ht"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.hu"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.hy"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.hz"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ia"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.id"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ie"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ig"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ii"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ik"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ilo"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.inh"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.io"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.is"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.it"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.iu"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ja"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.jam"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.jbo"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.jv"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ka"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.kaa"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.kab"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.kbd"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.kbp"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.kg"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ki"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.kj"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.kk"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.kl"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.km"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.kn"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ko"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.koi"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.kr"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.krc"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ks"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ksh"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ku"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.kv"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.kw"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ky"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.la"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.lad"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.lb"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.lbe"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.lez"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.lfn"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.lg"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.li"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.lij"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.lmo"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ln"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.lo"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.lrc"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.lt"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ltg"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.lv"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.mai"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.map-bms"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.mdf"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.mg"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.mh"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.mhr"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.mi"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.min"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.mk"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ml"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.mn"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.mr"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.mrj"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ms"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.mt"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.mus"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.mwl"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.my"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.myv"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.mzn"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.na"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.nah"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.nap"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.nds"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.nds-nl"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ne"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.new"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ng"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.nl"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.nn"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.no"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.nov"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.nrm"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.nso"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.nv"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ny"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.oc"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.olo"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.om"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.or"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.os"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.pa"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.pag"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.pam"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.pap"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.pcd"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.pdc"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.pfl"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.pi"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.pih"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.pl"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.pms"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.pnb"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.pnt"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ps"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.pt"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.qu"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.rm"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.rmy"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.rn"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ro"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.roa-rup"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.roa-tara"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ru"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.rue"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.rw"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.sa"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.sah"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.sat"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.sc"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.scn"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.sco"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.sd"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.se"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.sg"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.sh"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.si"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.simple"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.sk"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.sl"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.sm"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.sn"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.so"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.sq"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.sr"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.srn"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ss"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.st"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.stq"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.su"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.sv"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.sw"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.szl"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ta"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.tcy"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.te"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.tet"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.tg"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.th"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ti"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.tk"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.tl"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.tn"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.to"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.tpi"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.tr"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ts"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.tt"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.tum"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.tw"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ty"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.tyv"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.udm"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ug"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.uk"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ur"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.uz"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.ve"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.vec"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.vep"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.vi"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.vls"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.vo"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.wa"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.war"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.wo"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.wuu"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.xal"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.xh"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.xmf"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.yi"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.yo"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.za"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.zea"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.zh"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.zh-classical"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.zh-min-nan"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.zh-yue"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

"wikipedia/20190301.zu"

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string, encoder=None),
    'title': Text(shape=(), dtype=tf.string, encoder=None),
})

Statistics

None computed

Urls

Supervised keys (for as_supervised=True)

None

Citation

@ONLINE {wikidump,
    author = "Wikimedia Foundation",
    title  = "Wikimedia Downloads",
    url    = "https://dumps.wikimedia.org"
}
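
Each language edition above is selected by its config name. A minimal sketch, assuming the dump is downloaded and prepared on first use and exposes a train split:

wiki = tfds.load("wikipedia/20190301.zu", split="train")
for article in tfds.as_numpy(wiki.take(1)):
    title = article["title"].decode("utf-8")  # article title as a plain string
    text = article["text"].decode("utf-8")    # full article body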

"xnli"

XNLI is a subset of a few thousand examples from MNLI that have been translated into 14 different languages (some of them relatively low-resource). As with MNLI, the goal is to predict textual entailment (does sentence A imply, contradict, or neither imply nor contradict sentence B?), making this a classification task: given two sentences, predict one of three labels.

xnli is configured with tfds.text.xnli.BuilderConfig and has the following configurations predefined (defaults to the first one):

  • "plain_text" (v0.0.1) (Size: 17.04 MiB): Plain text import of XNLI

"xnli/plain_text"

FeaturesDict({
    'hypothesis': TranslationVariableLanguages({'language': TensorInfo(shape=(None,), dtype=tf.string), 'translation': TensorInfo(shape=(None,), dtype=tf.string)}),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=3),
    'premise': Translation({
        'ar': Text(shape=(), dtype=tf.string, encoder=None),
        'bg': Text(shape=(), dtype=tf.string, encoder=None),
        'de': Text(shape=(), dtype=tf.string, encoder=None),
        'el': Text(shape=(), dtype=tf.string, encoder=None),
        'en': Text(shape=(), dtype=tf.string, encoder=None),
        'es': Text(shape=(), dtype=tf.string, encoder=None),
        'fr': Text(shape=(), dtype=tf.string, encoder=None),
        'hi': Text(shape=(), dtype=tf.string, encoder=None),
        'ru': Text(shape=(), dtype=tf.string, encoder=None),
        'sw': Text(shape=(), dtype=tf.string, encoder=None),
        'th': Text(shape=(), dtype=tf.string, encoder=None),
        'tr': Text(shape=(), dtype=tf.string, encoder=None),
        'ur': Text(shape=(), dtype=tf.string, encoder=None),
        'vi': Text(shape=(), dtype=tf.string, encoder=None),
        'zh': Text(shape=(), dtype=tf.string, encoder=None),
    }),
})
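
The premise is keyed by language code, while the hypothesis stores its translations as parallel language/translation arrays. A minimal sketch of reading one example, assuming the test split listed in the statistics below:

xnli = tfds.load("xnli/plain_text")
for ex in tfds.as_numpy(xnli["test"].take(1)):
    premise_en = ex["premise"]["en"]                  # one string per language code
    languages = ex["hypothesis"]["language"]          # parallel array of language codes
    translations = ex["hypothesis"]["translation"]    # parallel array of texts
    label = ex["label"]                               # one of three entailment classes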

Statistics

Split Examples
ALL 7,500
TEST 5,010
VALIDATION 2,490

Urls

Supervised keys (for as_supervised=True)

None

Citation

@InProceedings{conneau2018xnli,
  author = "Conneau, Alexis
                 and Rinott, Ruty
                 and Lample, Guillaume
                 and Williams, Adina
                 and Bowman, Samuel R.
                 and Schwenk, Holger
                 and Stoyanov, Veselin",
  title = "XNLI: Evaluating Cross-lingual Sentence Representations",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods
               in Natural Language Processing",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  location = "Brussels, Belgium",
}

translate

"flores"

Evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English.

flores is configured with tfds.translate.flores.FloresConfig and has the following configurations predefined (defaults to the first one):

  • "neen_plain_text" (v0.0.3) (Size: 984.65 KiB): Translation dataset from ne to en, uses encoder plain_text.

  • "sien_plain_text" (v0.0.3) (Size: 984.65 KiB): Translation dataset from si to en, uses encoder plain_text.

"flores/neen_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'ne': Text(shape=(), dtype=tf.string, encoder=None),
})

"flores/sien_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'si': Text(shape=(), dtype=tf.string, encoder=None),
})

Statistics

Split Examples
ALL 5,664
VALIDATION 2,898
TEST 2,766

Urls

Supervised keys (for as_supervised=True)

(u'si', u'en')
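
With as_supervised=True each example is therefore a (si, en) pair of raw text tensors. A minimal sketch, assuming the split names listed in the statistics above:

ds = tfds.load("flores/sien_plain_text", split="validation", as_supervised=True)
for si_text, en_text in tfds.as_numpy(ds.take(3)):
    print(si_text.decode("utf-8"), "->", en_text.decode("utf-8"))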

Citation

@misc{guzmn2019new,
    title={Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English},
    author={Francisco Guzman and Peng-Jen Chen and Myle Ott and Juan Pino and Guillaume Lample and Philipp Koehn and Vishrav Chaudhary and Marc'Aurelio Ranzato},
    year={2019},
    eprint={1902.01382},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

"para_crawl"

Web-scale parallel corpora for official European languages; the supervised keys shown below correspond to the English-Croatian configuration.

para_crawl is configured with tfds.translate.para_crawl.ParaCrawlConfig and has the following configurations predefined (defaults to the first one):

  • "enel_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to el, uses encoder plain_text.

  • "enga_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to ga, uses encoder plain_text.

  • "encs_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to cs, uses encoder plain_text.

  • "enet_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to et, uses encoder plain_text.

  • "enes_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to es, uses encoder plain_text.

  • "ensk_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to sk, uses encoder plain_text.

  • "enpl_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to pl, uses encoder plain_text.

  • "enmt_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to mt, uses encoder plain_text.

  • "enpt_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to pt, uses encoder plain_text.

  • "enro_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to ro, uses encoder plain_text.

  • "enit_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to it, uses encoder plain_text.

  • "enda_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to da, uses encoder plain_text.

  • "ende_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to de, uses encoder plain_text.

  • "enfi_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to fi, uses encoder plain_text.

  • "enbg_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to bg, uses encoder plain_text.

  • "enfr_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to fr, uses encoder plain_text.

  • "enlv_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to lv, uses encoder plain_text.

  • "ensv_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to sv, uses encoder plain_text.

  • "enlt_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to lt, uses encoder plain_text.

  • "ennl_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to nl, uses encoder plain_text.

  • "ensl_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to sl, uses encoder plain_text.

  • "enhu_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to hu, uses encoder plain_text.

  • "enhr_plain_text" (v0.1.0) (Size: ?? GiB): Translation dataset from English to hr, uses encoder plain_text.

"para_crawl/enel_plain_text"

Translation({
    'el': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/enga_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'ga': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/encs_plain_text"

Translation({
    'cs': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/enet_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'et': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/enes_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'es': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/ensk_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'sk': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/enpl_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'pl': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/enmt_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'mt': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/enpt_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'pt': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/enro_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'ro': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/enit_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'it': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/enda_plain_text"

Translation({
    'da': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/ende_plain_text"

Translation({
    'de': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/enfi_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'fi': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/enbg_plain_text"

Translation({
    'bg': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/enfr_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'fr': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/enlv_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'lv': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/ensv_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'sv': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/enlt_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'lt': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/ennl_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'nl': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/ensl_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'sl': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/enhu_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'hu': Text(shape=(), dtype=tf.string, encoder=None),
})

"para_crawl/enhr_plain_text"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'hr': Text(shape=(), dtype=tf.string, encoder=None),
})

Statistics

None computed

Urls

Supervised keys (for as_supervised=True)

(u'en', u'hr')
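
These keys correspond to the English-Croatian configuration; the other pairs are selected by config name in the same way. A minimal sketch, assuming a single train split:

ds = tfds.load("para_crawl/enhr_plain_text", split="train", as_supervised=True)
for en_text, hr_text in tfds.as_numpy(ds.take(1)):
    print(en_text.decode("utf-8"), "->", hr_text.decode("utf-8"))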

Citation

@misc {paracrawl,
    title  = "ParaCrawl",
    year   = "2018",
    url    = "http://paracrawl.eu/download.html."
}

"ted_hrlr_translate"

Datasets derived from TED talk transcripts for comparing similar language pairs where one is high-resource and the other is low-resource.

ted_hrlr_translate is configured with tfds.translate.ted_hrlr.TedHrlrConfig and has the following configurations predefined (defaults to the first one):

  • "az_to_en" (v0.0.1) (Size: 124.94 MiB): Translation dataset from az to en in plain text.

  • "aztr_to_en" (v0.0.1) (Size: 124.94 MiB): Translation dataset from az_tr to en in plain text.

  • "be_to_en" (v0.0.1) (Size: 124.94 MiB): Translation dataset from be to en in plain text.

  • "beru_to_en" (v0.0.1) (Size: 124.94 MiB): Translation dataset from be_ru to en in plain text.

  • "es_to_pt" (v0.0.1) (Size: 124.94 MiB): Translation dataset from es to pt in plain text.

  • "fr_to_pt" (v0.0.1) (Size: 124.94 MiB): Translation dataset from fr to pt in plain text.

  • "gl_to_en" (v0.0.1) (Size: 124.94 MiB): Translation dataset from gl to en in plain text.

  • "glpt_to_en" (v0.0.1) (Size: 124.94 MiB): Translation dataset from gl_pt to en in plain text.

  • "he_to_pt" (v0.0.1) (Size: 124.94 MiB): Translation dataset from he to pt in plain text.

  • "it_to_pt" (v0.0.1) (Size: 124.94 MiB): Translation dataset from it to pt in plain text.

  • "pt_to_en" (v0.0.1) (Size: 124.94 MiB): Translation dataset from pt to en in plain text.

  • "ru_to_en" (v0.0.1) (Size: 124.94 MiB): Translation dataset from ru to en in plain text.

  • "ru_to_pt" (v0.0.1) (Size: 124.94 MiB): Translation dataset from ru to pt in plain text.

  • "tr_to_en" (v0.0.1) (Size: 124.94 MiB): Translation dataset from tr to en in plain text.

"ted_hrlr_translate/az_to_en"

Translation({
    'az': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"ted_hrlr_translate/aztr_to_en"

Translation({
    'az_tr': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"ted_hrlr_translate/be_to_en"

Translation({
    'be': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"ted_hrlr_translate/beru_to_en"

Translation({
    'be_ru': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"ted_hrlr_translate/es_to_pt"

Translation({
    'es': Text(shape=(), dtype=tf.string, encoder=None),
    'pt': Text(shape=(), dtype=tf.string, encoder=None),
})

"ted_hrlr_translate/fr_to_pt"

Translation({
    'fr': Text(shape=(), dtype=tf.string, encoder=None),
    'pt': Text(shape=(), dtype=tf.string, encoder=None),
})

"ted_hrlr_translate/gl_to_en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'gl': Text(shape=(), dtype=tf.string, encoder=None),
})

"ted_hrlr_translate/glpt_to_en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'gl_pt': Text(shape=(), dtype=tf.string, encoder=None),
})

"ted_hrlr_translate/he_to_pt"

Translation({
    'he': Text(shape=(), dtype=tf.string, encoder=None),
    'pt': Text(shape=(), dtype=tf.string, encoder=None),
})

"ted_hrlr_translate/it_to_pt"

Translation({
    'it': Text(shape=(), dtype=tf.string, encoder=None),
    'pt': Text(shape=(), dtype=tf.string, encoder=None),
})

"ted_hrlr_translate/pt_to_en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'pt': Text(shape=(), dtype=tf.string, encoder=None),
})

"ted_hrlr_translate/ru_to_en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'ru': Text(shape=(), dtype=tf.string, encoder=None),
})

"ted_hrlr_translate/ru_to_pt"

Translation({
    'pt': Text(shape=(), dtype=tf.string, encoder=None),
    'ru': Text(shape=(), dtype=tf.string, encoder=None),
})

"ted_hrlr_translate/tr_to_en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'tr': Text(shape=(), dtype=tf.string, encoder=None),
})

Statistics

Split Examples
ALL 191,524
TRAIN 182,450
TEST 5,029
VALIDATION 4,045

Urls

Supervised keys (for as_supervised=True)

(u'tr', u'en')
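
Any of the configurations above can be loaded by name; a minimal sketch for the default supervised pair:

ds = tfds.load("ted_hrlr_translate/tr_to_en", split="train", as_supervised=True)
for tr_text, en_text in tfds.as_numpy(ds.take(2)):
    print(tr_text.decode("utf-8"), "->", en_text.decode("utf-8"))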

Citation

@inproceedings{Ye2018WordEmbeddings,
  author  = {Ye, Qi and Devendra, Sachan and Matthieu, Felix and Sarguna, Padmanabhan and Graham, Neubig},
  title   = {When and Why are pre-trained word embeddings useful for Neural Machine Translation},
  booktitle = {HLT-NAACL},
  year    = {2018},
  }

"ted_multi_translate"

Massively multilingual (60-language) dataset derived from TED Talk transcripts. Each record consists of parallel arrays of language and text; missing and incomplete translations are filtered out.

ted_multi_translate is configured with tfds.translate.ted_multi.BuilderConfig and has the following configurations predefined (defaults to the first one):

  • "plain_text" (v0.0.3) (Size: 335.91 MiB): Plain text import of multilingual TED talk translations

"ted_multi_translate/plain_text"

FeaturesDict({
    'talk_name': Text(shape=(), dtype=tf.string, encoder=None),
    'translations': TranslationVariableLanguages({'language': TensorInfo(shape=(None,), dtype=tf.string), 'translation': TensorInfo(shape=(None,), dtype=tf.string)}),
})
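
Because each record carries parallel language/translation arrays rather than fixed fields, a specific pair has to be picked out per record. A minimal sketch, assuming a train split, that extracts (fr, en) pairs where both languages are present:

ds = tfds.load("ted_multi_translate/plain_text", split="train")
for ex in tfds.as_numpy(ds.take(5)):
    # Decode the parallel language codes, then index into the matching texts.
    langs = [l.decode("utf-8") for l in ex["translations"]["language"]]
    texts = ex["translations"]["translation"]
    if "fr" in langs and "en" in langs:
        fr_text = texts[langs.index("fr")]
        en_text = texts[langs.index("en")]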

Statistics

Split Examples
ALL 271,360
TRAIN 258,098
TEST 7,213
VALIDATION 6,049

Urls

Supervised keys (for as_supervised=True)

None

Citation

@InProceedings{qi-EtAl:2018:N18-2,
  author    = {Qi, Ye  and  Sachan, Devendra  and  Felix, Matthieu  and  Padmanabhan, Sarguna  and  Neubig, Graham},
  title     = {When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?},
  booktitle = {Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)},
  month     = {June},
  year      = {2018},
  address   = {New Orleans, Louisiana},
  publisher = {Association for Computational Linguistics},
  pages     = {529--535},
  abstract  = {The performance of Neural Machine Translation (NMT) systems often suffers in low-resource scenarios where sufficiently large-scale parallel corpora cannot be obtained. Pre-trained word embeddings have proven to be invaluable for improving performance in natural language analysis tasks, which often suffer from paucity of data. However, their utility for NMT has not been extensively explored. In this work, we perform five sets of experiments that analyze when we can expect pre-trained word embeddings to help in NMT tasks. We show that such embeddings can be surprisingly effective in some cases -- providing gains of up to 20 BLEU points in the most favorable setting.},
  url       = {http://www.aclweb.org/anthology/N18-2084}
}

"wmt14_translate"

Translation dataset based on the data from statmt.org.

Versions exist for different years, each using a combination of multiple data sources. The base wmt_translate dataset allows you to create your own config and choose your own data/language pair by creating a custom tfds.translate.wmt.WmtConfig.

config = tfds.translate.wmt.WmtConfig(
    version="0.0.1",
    language_pair=("fr", "de"),
    subsets={
        tfds.Split.TRAIN: ["commoncrawl_frde"],
        tfds.Split.VALIDATION: ["euelections_dev2019"],
    },
)
builder = tfds.builder("wmt_translate", config=config)
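
A custom config like this can also be passed straight through tfds.load, which prepares the data on first use. A minimal sketch, assuming builder_kwargs forwards keyword arguments to tfds.builder:

# Prepare and load the custom French-German config defined above.
datasets = tfds.load("wmt_translate", builder_kwargs={"config": config})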

wmt14_translate is configured with tfds.translate.wmt14.WmtConfig and has the following configurations predefined (defaults to the first one):

  • "cs-en" (v0.0.3) (Size: 1.58 GiB): WMT 2014 cs-en translation task dataset.

  • "de-en" (v0.0.3) (Size: 1.58 GiB): WMT 2014 de-en translation task dataset.

  • "fr-en" (v0.0.3) (Size: 6.20 GiB): WMT 2014 fr-en translation task dataset.

  • "hi-en" (v0.0.3) (Size: 44.65 MiB): WMT 2014 hi-en translation task dataset.

  • "ru-en" (v0.0.3) (Size: 998.38 MiB): WMT 2014 ru-en translation task dataset.

"wmt14_translate/cs-en"

Translation({
    'cs': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt14_translate/de-en"

Translation({
    'de': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt14_translate/fr-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'fr': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt14_translate/hi-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'hi': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt14_translate/ru-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'ru': Text(shape=(), dtype=tf.string, encoder=None),
})

Statistics

Split Examples
ALL 2,492,968
TRAIN 2,486,965
TEST 3,003
VALIDATION 3,000

Urls

Supervised keys (for as_supervised=True)

(u'ru', u'en')

Citation

@InProceedings{bojar-EtAl:2014:W14-33,
  author    = {Bojar, Ondrej  and  Buck, Christian  and  Federmann, Christian  and  Haddow, Barry  and  Koehn, Philipp  and  Leveling, Johannes  and  Monz, Christof  and  Pecina, Pavel  and  Post, Matt  and  Saint-Amand, Herve  and  Soricut, Radu  and  Specia, Lucia  and  Tamchyna, Aleš},
  title     = {Findings of the 2014 Workshop on Statistical Machine Translation},
  booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, Maryland, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {12--58},
  url       = {http://www.aclweb.org/anthology/W/W14/W14-3302}
}

"wmt15_translate"

Translation dataset based on the data from statmt.org.

Versions exist for different years, each using a combination of multiple data sources. The base wmt_translate dataset allows you to create your own config and choose your own data/language pair by creating a custom tfds.translate.wmt.WmtConfig.

config = tfds.translate.wmt.WmtConfig(
    version="0.0.1",
    language_pair=("fr", "de"),
    subsets={
        tfds.Split.TRAIN: ["commoncrawl_frde"],
        tfds.Split.VALIDATION: ["euelections_dev2019"],
    },
)
builder = tfds.builder("wmt_translate", config=config)

wmt15_translate is configured with tfds.translate.wmt15.WmtConfig and has the following configurations predefined (defaults to the first one):

  • "cs-en" (v0.0.3) (Size: 1.62 GiB): WMT 2015 cs-en translation task dataset.

  • "de-en" (v0.0.3) (Size: 1.62 GiB): WMT 2015 de-en translation task dataset.

  • "fi-en" (v0.0.3) (Size: 260.51 MiB): WMT 2015 fi-en translation task dataset.

  • "fr-en" (v0.0.3) (Size: 6.24 GiB): WMT 2015 fr-en translation task dataset.

  • "ru-en" (v0.0.3) (Size: 1.02 GiB): WMT 2015 ru-en translation task dataset.

  • "cs-en.subwords8k" (v0.0.3) (Size: 1.62 GiB): WMT 2015 cs-en translation task dataset with subword encoding.

  • "de-en.subwords8k" (v0.0.3) (Size: 1.62 GiB): WMT 2015 de-en translation task dataset with subword encoding.

  • "fi-en.subwords8k" (v0.0.3) (Size: 260.51 MiB): WMT 2015 fi-en translation task dataset with subword encoding.

  • "fr-en.subwords8k" (v0.0.3) (Size: 6.24 GiB): WMT 2015 fr-en translation task dataset with subword encoding.

  • "ru-en.subwords8k" (v0.0.3) (Size: 1.02 GiB): WMT 2015 ru-en translation task dataset with subword encoding.

"wmt15_translate/cs-en"

Translation({
    'cs': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt15_translate/de-en"

Translation({
    'de': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt15_translate/fi-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'fi': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt15_translate/fr-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'fr': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt15_translate/ru-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'ru': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt15_translate/cs-en.subwords8k"

Translation({
    'cs': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8245>),
    'en': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8198>),
})

"wmt15_translate/de-en.subwords8k"

Translation({
    'de': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8270>),
    'en': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8212>),
})

"wmt15_translate/fi-en.subwords8k"

Translation({
    'en': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8217>),
    'fi': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8113>),
})

"wmt15_translate/fr-en.subwords8k"

Translation({
    'en': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8183>),
    'fr': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8133>),
})

"wmt15_translate/ru-en.subwords8k"

Translation({
    'en': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8194>),
    'ru': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8180>),
})
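
In the subwords8k configurations the text is stored as sequences of subword ids, and the fitted encoder is exposed on the corresponding feature. A minimal sketch, assuming the SubwordTextEncoder encode/decode API:

data, info = tfds.load("wmt15_translate/de-en.subwords8k", with_info=True)
en_encoder = info.features["en"].encoder    # SubwordTextEncoder
ids = en_encoder.encode("Machine translation")
assert en_encoder.decode(ids) == "Machine translation"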

Statistics

Split Examples
ALL 2,500,902
TRAIN 2,495,081
VALIDATION 3,003
TEST 2,818

Urls

Supervised keys (for as_supervised=True)

(u'ru', u'en')

Citation

@InProceedings{bojar-EtAl:2015:WMT,
  author    = {Bojar, Ondřej  and  Chatterjee, Rajen  and  Federmann, Christian  and  Haddow, Barry  and  Huck, Matthias  and  Hokamp, Chris  and  Koehn, Philipp  and  Logacheva, Varvara  and  Monz, Christof  and  Negri, Matteo  and  Post, Matt  and  Scarton, Carolina  and  Specia, Lucia  and  Turchi, Marco},
  title     = {Findings of the 2015 Workshop on Statistical Machine Translation},
  booktitle = {Proceedings of the Tenth Workshop on Statistical Machine Translation},
  month     = {September},
  year      = {2015},
  address   = {Lisbon, Portugal},
  publisher = {Association for Computational Linguistics},
  pages     = {1--46},
  url       = {http://aclweb.org/anthology/W15-3001}
}

"wmt16_translate"

Translation dataset based on the data from statmt.org.

Versions exist for different years, each using a combination of multiple data sources. The base wmt_translate dataset allows you to create your own config and choose your own data/language pair by creating a custom tfds.translate.wmt.WmtConfig.

config = tfds.translate.wmt.WmtConfig(
    version="0.0.1",
    language_pair=("fr", "de"),
    subsets={
        tfds.Split.TRAIN: ["commoncrawl_frde"],
        tfds.Split.VALIDATION: ["euelections_dev2019"],
    },
)
builder = tfds.builder("wmt_translate", config=config)

wmt16_translate is configured with tfds.translate.wmt16.WmtConfig and has the following configurations predefined (defaults to the first one):

  • "cs-en" (v0.0.3) (Size: 1.57 GiB): WMT 2016 cs-en translation task dataset.

  • "de-en" (v0.0.3) (Size: 1.57 GiB): WMT 2016 de-en translation task dataset.

  • "fi-en" (v0.0.3) (Size: 260.51 MiB): WMT 2016 fi-en translation task dataset.

  • "ro-en" (v0.0.3) (Size: 273.83 MiB): WMT 2016 ro-en translation task dataset.

  • "ru-en" (v0.0.3) (Size: 993.38 MiB): WMT 2016 ru-en translation task dataset.

  • "tr-en" (v0.0.3) (Size: 59.32 MiB): WMT 2016 tr-en translation task dataset.

"wmt16_translate/cs-en"

Translation({
    'cs': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt16_translate/de-en"

Translation({
    'de': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt16_translate/fi-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'fi': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt16_translate/ro-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'ro': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt16_translate/ru-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'ru': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt16_translate/tr-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'tr': Text(shape=(), dtype=tf.string, encoder=None),
})

Statistics

Split Examples
ALL 209,757
TRAIN 205,756
TEST 3,000
VALIDATION 1,001

Urls

Supervised keys (for as_supervised=True)

(u'tr', u'en')

Citation

@InProceedings{bojar-EtAl:2016:WMT1,
  author    = {Bojar, Ondřej  and  Chatterjee, Rajen  and  Federmann, Christian  and  Graham, Yvette  and  Haddow, Barry  and  Huck, Matthias  and  Jimeno Yepes, Antonio  and  Koehn, Philipp  and  Logacheva, Varvara  and  Monz, Christof  and  Negri, Matteo  and  Neveol, Aurelie  and  Neves, Mariana  and  Popel, Martin  and  Post, Matt  and  Rubino, Raphael  and  Scarton, Carolina  and  Specia, Lucia  and  Turchi, Marco  and  Verspoor, Karin  and  Zampieri, Marcos},
  title     = {Findings of the 2016 Conference on Machine Translation},
  booktitle = {Proceedings of the First Conference on Machine Translation},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {131--198},
  url       = {http://www.aclweb.org/anthology/W/W16/W16-2301}
}

"wmt17_translate"

Translation dataset based on the data from statmt.org.

Versions exist for different years, each using a combination of multiple data sources. The base wmt_translate dataset allows you to create your own config and choose your own data/language pair by creating a custom tfds.translate.wmt.WmtConfig.

config = tfds.translate.wmt.WmtConfig(
    version="0.0.1",
    language_pair=("fr", "de"),
    subsets={
        tfds.Split.TRAIN: ["commoncrawl_frde"],
        tfds.Split.VALIDATION: ["euelections_dev2019"],
    },
)
builder = tfds.builder("wmt_translate", config=config)

wmt17_translate is configured with tfds.translate.wmt17.WmtConfig and has the following configurations predefined (defaults to the first one):

  • "cs-en" (v0.0.3) (Size: 1.66 GiB): WMT 2017 cs-en translation task dataset.

  • "de-en" (v0.0.3) (Size: 1.81 GiB): WMT 2017 de-en translation task dataset.

  • "fi-en" (v0.0.3) (Size: 414.10 MiB): WMT 2017 fi-en translation task dataset.

  • "lv-en" (v0.0.3) (Size: 161.69 MiB): WMT 2017 lv-en translation task dataset.

  • "ru-en" (v0.0.3) (Size: 3.34 GiB): WMT 2017 ru-en translation task dataset.

  • "tr-en" (v0.0.3) (Size: 59.32 MiB): WMT 2017 tr-en translation task dataset.

  • "zh-en" (v0.0.3) (Size: 2.16 GiB): WMT 2017 zh-en translation task dataset.

"wmt17_translate/cs-en"

Translation({
    'cs': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt17_translate/de-en"

Translation({
    'de': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt17_translate/fi-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'fi': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt17_translate/lv-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'lv': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt17_translate/ru-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'ru': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt17_translate/tr-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'tr': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt17_translate/zh-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'zh': Text(shape=(), dtype=tf.string, encoder=None),
})

Statistics

Split Examples
ALL 25,140,612
TRAIN 25,136,609
VALIDATION 2,002
TEST 2,001

Urls

Supervised keys (for as_supervised=True)

(u'zh', u'en')
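
At roughly 25 million training pairs, this config is best consumed through a streaming tf.data pipeline rather than materialized in memory. A minimal sketch:

ds = tfds.load("wmt17_translate/zh-en", split="train", as_supervised=True)
ds = ds.shuffle(10000).batch(64).prefetch(tf.data.experimental.AUTOTUNE)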

Citation

@InProceedings{bojar-EtAl:2017:WMT1,
  author    = {Bojar, Ondřej  and  Chatterjee, Rajen  and  Federmann, Christian  and  Graham, Yvette  and  Haddow, Barry  and  Huang, Shujian  and  Huck, Matthias  and  Koehn, Philipp  and  Liu, Qun  and  Logacheva, Varvara  and  Monz, Christof  and  Negri, Matteo  and  Post, Matt  and  Rubino, Raphael  and  Specia, Lucia  and  Turchi, Marco},
  title     = {Findings of the 2017 Conference on Machine Translation (WMT17)},
  booktitle = {Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {169--214},
  url       = {http://www.aclweb.org/anthology/W17-4717}
}

"wmt18_translate"

Translation dataset based on the data from statmt.org.

Versions exist for different years, each using a combination of multiple data sources. The base wmt_translate dataset allows you to create your own config and choose your own data/language pair by creating a custom tfds.translate.wmt.WmtConfig.

config = tfds.translate.wmt.WmtConfig(
    version="0.0.1",
    language_pair=("fr", "de"),
    subsets={
        tfds.Split.TRAIN: ["commoncrawl_frde"],
        tfds.Split.VALIDATION: ["euelections_dev2019"],
    },
)
builder = tfds.builder("wmt_translate", config=config)

wmt18_translate is configured with tfds.translate.wmt18.WmtConfig and has the following configurations predefined (defaults to the first one):

  • "cs-en" (v0.0.3) (Size: 1.89 GiB): WMT 2018 cs-en translation task dataset.

  • "de-en" (v0.0.3) (Size: 3.55 GiB): WMT 2018 de-en translation task dataset.

  • "et-en" (v0.0.3) (Size: 499.91 MiB): WMT 2018 et-en translation task dataset.

  • "fi-en" (v0.0.3) (Size: 468.76 MiB): WMT 2018 fi-en translation task dataset.

  • "kk-en" (v0.0.3) (Size: ?? GiB): WMT 2018 kk-en translation task dataset.

  • "ru-en" (v0.0.3) (Size: 3.91 GiB): WMT 2018 ru-en translation task dataset.

  • "tr-en" (v0.0.3) (Size: 59.32 MiB): WMT 2018 tr-en translation task dataset.

  • "zh-en" (v0.0.3) (Size: 2.10 GiB): WMT 2018 zh-en translation task dataset.

"wmt18_translate/cs-en"

Translation({
    'cs': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt18_translate/de-en"

Translation({
    'de': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt18_translate/et-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'et': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt18_translate/fi-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'fi': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt18_translate/kk-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'kk': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt18_translate/ru-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'ru': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt18_translate/tr-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'tr': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt18_translate/zh-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'zh': Text(shape=(), dtype=tf.string, encoder=None),
})

Statistics

Split Examples
ALL 25,168,191
TRAIN 25,162,209
TEST 3,981
VALIDATION 2,001

Urls

Supervised keys (for as_supervised=True)

(u'zh', u'en')

Citation

@InProceedings{bojar-EtAl:2018:WMT1,
  author    = {Bojar, Ond\v{r}ej  and  Federmann, Christian  and  Fishel, Mark
    and Graham, Yvette  and  Haddow, Barry  and  Huck, Matthias  and
    Koehn, Philipp  and  Monz, Christof},
  title     = {Findings of the 2018 Conference on Machine Translation (WMT18)},
  booktitle = {Proceedings of the Third Conference on Machine Translation,
    Volume 2: Shared Task Papers},
  month     = {October},
  year      = {2018},
  address   = {Belgium, Brussels},
  publisher = {Association for Computational Linguistics},
  pages     = {272--307},
  url       = {http://www.aclweb.org/anthology/W18-6401}
}

"wmt19_translate"

Translate dataset based on the data from statmt.org.

Versions exist for different years, each combining multiple data sources. The base wmt_translate dataset lets you choose your own data/language pair by creating a custom tfds.translate.wmt.WmtConfig:

import tensorflow_datasets as tfds

# Choose the language pair and the data sources backing each split.
config = tfds.translate.wmt.WmtConfig(
    version="0.0.1",
    language_pair=("fr", "de"),
    subsets={
        tfds.Split.TRAIN: ["commoncrawl_frde"],
        tfds.Split.VALIDATION: ["euelections_dev2019"],
    },
)
builder = tfds.builder("wmt_translate", config=config)

wmt19_translate is configured with tfds.translate.wmt19.WmtConfig and has the following configurations predefined (defaults to the first one):

  • "cs-en" (v0.0.3) (Size: 1.88 GiB): WMT 2019 cs-en translation task dataset.

  • "de-en" (v0.0.3) (Size: 9.71 GiB): WMT 2019 de-en translation task dataset.

  • "fi-en" (v0.0.3) (Size: 959.46 MiB): WMT 2019 fi-en translation task dataset.

  • "gu-en" (v0.0.3) (Size: 37.03 MiB): WMT 2019 gu-en translation task dataset.

  • "kk-en" (v0.0.3) (Size: 39.58 MiB): WMT 2019 kk-en translation task dataset.

  • "lt-en" (v0.0.3) (Size: 392.20 MiB): WMT 2019 lt-en translation task dataset.

  • "ru-en" (v0.0.3) (Size: 3.86 GiB): WMT 2019 ru-en translation task dataset.

  • "zh-en" (v0.0.3) (Size: 2.04 GiB): WMT 2019 zh-en translation task dataset.

  • "fr-de" (v0.0.3) (Size: 722.20 MiB): WMT 2019 fr-de translation task dataset.

"wmt19_translate/cs-en"

Translation({
    'cs': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt19_translate/de-en"

Translation({
    'de': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt19_translate/fi-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'fi': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt19_translate/gu-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'gu': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt19_translate/kk-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'kk': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt19_translate/lt-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'lt': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt19_translate/ru-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'ru': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt19_translate/zh-en"

Translation({
    'en': Text(shape=(), dtype=tf.string, encoder=None),
    'zh': Text(shape=(), dtype=tf.string, encoder=None),
})

"wmt19_translate/fr-de"

Translation({
    'de': Text(shape=(), dtype=tf.string, encoder=None),
    'fr': Text(shape=(), dtype=tf.string, encoder=None),
})
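
Without as_supervised=True, each example is a dictionary keyed by the language codes of the Translation feature. A minimal sketch on the fr-de validation split:

import tensorflow_datasets as tfds

# Translation features come back as a dict of tf.string tensors.
ds = tfds.load("wmt19_translate/fr-de", split="validation")
for ex in tfds.as_numpy(ds.take(1)):
    print(ex["fr"], ex["de"])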

Statistics

Split Examples
ALL 9,825,988
TRAIN 9,824,476
VALIDATION 1,512

Supervised keys (for as_supervised=True)

(u'fr', u'de')

Citation

@ONLINE {wmt19translate,
    author = "Wikimedia Foundation",
    title  = "ACL 2019 Fourth Conference on Machine Translation (WMT19), Shared Task: Machine Translation of News",
    url    = "http://www.statmt.org/wmt19/translation-task.html"
}

"wmt_t2t_translate"

Translate dataset based on the data from statmt.org.

Versions exist for different years, each combining multiple data sources. The base wmt_translate dataset lets you choose your own data/language pair by creating a custom tfds.translate.wmt.WmtConfig:

import tensorflow_datasets as tfds

# Choose the language pair and the data sources backing each split.
config = tfds.translate.wmt.WmtConfig(
    version="0.0.1",
    language_pair=("fr", "de"),
    subsets={
        tfds.Split.TRAIN: ["commoncrawl_frde"],
        tfds.Split.VALIDATION: ["euelections_dev2019"],
    },
)
builder = tfds.builder("wmt_translate", config=config)

wmt_t2t_translate is configured with tfds.translate.wmt_t2t.WmtConfig and has the following configurations predefined (defaults to the first one):

  • "de-en" (v0.0.1) (Size: 1.61 GiB): WMT T2T EnDe translation task dataset.

"wmt_t2t_translate/de-en"

Translation({
    'de': Text(shape=(), dtype=tf.string, encoder=None),
    'en': Text(shape=(), dtype=tf.string, encoder=None),
})

Statistics

Split Examples
ALL 4,598,292
TRAIN 4,592,289
TEST 3,003
VALIDATION 3,000

Supervised keys (for as_supervised=True)

(u'de', u'en')

Citation

@InProceedings{bojar-EtAl:2014:W14-33,
  author    = {Bojar, Ondrej  and  Buck, Christian  and  Federmann, Christian  and  Haddow, Barry  and  Koehn, Philipp  and  Leveling, Johannes  and  Monz, Christof  and  Pecina, Pavel  and  Post, Matt  and  Saint-Amand, Herve  and  Soricut, Radu  and  Specia, Lucia  and  Tamchyna, Ale\v{s}},
  title     = {Findings of the 2014 Workshop on Statistical Machine Translation},
  booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, Maryland, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {12--58},
  url       = {http://www.aclweb.org/anthology/W/W14/W14-3302}
}

video

"bair_robot_pushing_small"

This dataset contains roughly 44,000 examples of robot pushing motions: one training set (train) and two test sets of previously seen (testseen) and unseen (testnovel) objects. This is the small 64x64 version.

Features

Sequence({
    'action': Tensor(shape=(4,), dtype=tf.float32),
    'endeffector_pos': Tensor(shape=(3,), dtype=tf.float32),
    'image_aux1': Image(shape=(64, 64, 3), dtype=tf.uint8),
    'image_main': Image(shape=(64, 64, 3), dtype=tf.uint8),
})
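
Because the features are wrapped in a Sequence, each per-step feature is returned with a leading time dimension when an episode is read. A minimal sketch, assuming the train split:

import tensorflow_datasets as tfds

ds = tfds.load("bair_robot_pushing_small", split="train")
for ex in tfds.as_numpy(ds.take(1)):
    print(ex["image_main"].shape)  # (num_steps, 64, 64, 3)
    print(ex["action"].shape)      # (num_steps, 4)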

Statistics

Split Examples
ALL 43,520
TRAIN 43,264
TEST 256

Supervised keys (for as_supervised=True)

None

Citation

@misc{1710.05268,
  Author = {Frederik Ebert and Chelsea Finn and Alex X. Lee and Sergey Levine},
  Title = {Self-Supervised Visual Planning with Temporal Skip Connections},
  Year = {2017},
  Eprint = {arXiv:1710.05268},
}

"moving_mnist"

Moving variant of MNIST database of handwritten digits. This is the data used by the authors for reporting model performance. See tfds.video.moving_mnist.image_as_moving_sequence for generating training/validation data from the MNIST dataset.

Features

FeaturesDict({
    'image_sequence': Video(shape=(20, 64, 64, 1), dtype=tf.uint8, feature=Image(shape=(64, 64, 1), dtype=tf.uint8)),
})
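
Only a test split is shipped; training sequences are generated on the fly from MNIST digits with the helper named above. A minimal sketch, assuming the helper's documented interface (sequence_length=20 matches the fixed-length test sequences):

import tensorflow_datasets as tfds

mnist = tfds.load("mnist", split="train", as_supervised=True)

def to_sequence(image, label):
    # image_as_moving_sequence returns a namedtuple whose
    # image_sequence field has shape (sequence_length, 64, 64, 1).
    seq = tfds.video.moving_mnist.image_as_moving_sequence(
        image, sequence_length=20)
    return seq.image_sequence

train_ds = mnist.map(to_sequence).batch(32)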

Statistics

Split Examples
ALL 10,000
TEST 10,000

Supervised keys (for as_supervised=True)

None

Citation

@article{DBLP:journals/corr/SrivastavaMS15,
  author    = {Nitish Srivastava and
               Elman Mansimov and
               Ruslan Salakhutdinov},
  title     = {Unsupervised Learning of Video Representations using LSTMs},
  journal   = {CoRR},
  volume    = {abs/1502.04681},
  year      = {2015},
  url       = {http://arxiv.org/abs/1502.04681},
  archivePrefix = {arXiv},
  eprint    = {1502.04681},
  timestamp = {Mon, 13 Aug 2018 16:47:05 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/SrivastavaMS15},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

"starcraft_video"

This dataset contains videos generated from Starcraft.

starcraft_video is configured with tfds.video.starcraft.StarcraftVideoConfig and has the following configurations predefined (defaults to the first one):

  • "brawl_64" (v0.1.2) (Size: 6.40 GiB): Brawl map with 64x64 resolution.

  • "brawl_128" (v0.1.2) (Size: 20.76 GiB): Brawl map with 128x128 resolution.

  • "collect_mineral_shards_64" (v0.1.2) (Size: 7.83 GiB): CollectMineralShards map with 64x64 resolution.

  • "collect_mineral_shards_128" (v0.1.2) (Size: 24.83 GiB): CollectMineralShards map with 128x128 resolution.

  • "move_unit_to_border_64" (v0.1.2) (Size: 1.77 GiB): MoveUnitToBorder map with 64x64 resolution.

  • "move_unit_to_border_128" (v0.1.2) (Size: 5.75 GiB): MoveUnitToBorder map with 128x128 resolution.

  • "road_trip_with_medivac_64" (v0.1.2) (Size: 2.48 GiB): RoadTripWithMedivac map with 64x64 resolution.

  • "road_trip_with_medivac_128" (v0.1.2) (Size: 7.80 GiB): RoadTripWithMedivac map with 128x128 resolution.

"starcraft_video/brawl_64"

FeaturesDict({
    'rgb_screen': Video(shape=(None, 64, 64, 3), dtype=tf.uint8, feature=Image(shape=(64, 64, 3), dtype=tf.uint8)),
})

"starcraft_video/brawl_128"

FeaturesDict({
    'rgb_screen': Video(shape=(None, 128, 128, 3), dtype=tf.uint8, feature=Image(shape=(128, 128, 3), dtype=tf.uint8)),
})

"starcraft_video/collect_mineral_shards_64"

FeaturesDict({
    'rgb_screen': Video(shape=(None, 64, 64, 3), dtype=tf.uint8, feature=Image(shape=(64, 64, 3), dtype=tf.uint8)),
})

"starcraft_video/collect_mineral_shards_128"

FeaturesDict({
    'rgb_screen': Video(shape=(None, 128, 128, 3), dtype=tf.uint8, feature=Image(shape=(128, 128, 3), dtype=tf.uint8)),
})

"starcraft_video/move_unit_to_border_64"

FeaturesDict({
    'rgb_screen': Video(shape=(None, 64, 64, 3), dtype=tf.uint8, feature=Image(shape=(64, 64, 3), dtype=tf.uint8)),
})

"starcraft_video/move_unit_to_border_128"

FeaturesDict({
    'rgb_screen': Video(shape=(None, 128, 128, 3), dtype=tf.uint8, feature=Image(shape=(128, 128, 3), dtype=tf.uint8)),
})

"starcraft_video/road_trip_with_medivac_64"

FeaturesDict({
    'rgb_screen': Video(shape=(None, 64, 64, 3), dtype=tf.uint8, feature=Image(shape=(64, 64, 3), dtype=tf.uint8)),
})

"starcraft_video/road_trip_with_medivac_128"

FeaturesDict({
    'rgb_screen': Video(shape=(None, 128, 128, 3), dtype=tf.uint8, feature=Image(shape=(128, 128, 3), dtype=tf.uint8)),
})

Statistics

Split Examples
ALL 14,000
TRAIN 10,000
VALIDATION 2,000
TEST 2,000

Supervised keys (for as_supervised=True)

None

Citation

@article{DBLP:journals/corr/abs-1812-01717,
  author    = {Thomas Unterthiner and
               Sjoerd van Steenkiste and
               Karol Kurach and
               Rapha{"{e}}l Marinier and
               Marcin Michalski and
               Sylvain Gelly},
  title     = {Towards Accurate Generative Models of Video: {A} New Metric and
               Challenges},
  journal   = {CoRR},
  volume    = {abs/1812.01717},
  year      = {2018},
  url       = {http://arxiv.org/abs/1812.01717},
  archivePrefix = {arXiv},
  eprint    = {1812.01717},
  timestamp = {Tue, 01 Jan 2019 15:01:25 +0100},
  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1812-01717},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

"ucf101"

A 101-label video classification dataset.

ucf101 is configured with tfds.video.ucf101.Ucf101Config and has the following configurations predefined (defaults to the first one):

  • "ucf101_1_256" (v1.0.0) (Size: 6.48 GiB): 256x256 UCF with the first action recognition split.

"ucf101/ucf101_1_256"

FeaturesDict({
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=101),
    'video': Video(shape=(None, 256, 256, 3), dtype=tf.uint8, feature=Image(shape=(256, 256, 3), dtype=tf.uint8)),
})
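
The label feature is a ClassLabel, so integer ids can be mapped back to the 101 action names through the dataset's DatasetInfo. A minimal sketch:

import tensorflow_datasets as tfds

ds, info = tfds.load("ucf101", split="train", with_info=True)
# Map a class id back to its human-readable action name.
print(info.features["label"].int2str(0))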

Statistics

Split Examples
ALL 13,320
TRAIN 9,537
TEST 3,783

Supervised keys (for as_supervised=True)

None

Citation

@article{DBLP:journals/corr/abs-1212-0402,
  author    = {Khurram Soomro and
               Amir Roshan Zamir and
               Mubarak Shah},
  title     = { {UCF101:} {A} Dataset of 101 Human Actions Classes From Videos in
               The Wild},
  journal   = {CoRR},
  volume    = {abs/1212.0402},
  year      = {2012},
  url       = {http://arxiv.org/abs/1212.0402},
  archivePrefix = {arXiv},
  eprint    = {1212.0402},
  timestamp = {Mon, 13 Aug 2018 16:47:45 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1212-0402},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}