youtube_vis

  • Description:

YouTube-VIS is a video instance segmentation dataset. It contains 2,883 high-resolution YouTube videos, a per-pixel category label set covering 40 common object categories such as people, animals, and vehicles, 4,883 unique video instances, and 131k high-quality manual annotations.

The YouTube-VIS dataset is split into 2,238 training videos, 302 validation videos and 343 test videos.

No files were removed or altered during preprocessing.

  • Additional Documentation: Explore on Papers With Code

  • Homepage: https://youtube-vos.org/dataset/vis/

  • Source code: tfds.video.youtube_vis.YoutubeVis

  • Versions:

    • 1.0.0 (default): Initial release.
  • Download size: Unknown size

  • Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
    Please download all files for the 2019 version of the dataset (test_all_frames.zip, test.json, train_all_frames.zip, train.json, valid_all_frames.zip, valid.json) from the YouTube-VIS website and move them to ~/tensorflow_datasets/downloads/manual/.

Note that the dataset landing page is at https://youtube-vos.org/dataset/vis/; it redirects to a page on https://competitions.codalab.org where you can download the 2019 version of the dataset. You will need a CodaLab account to download the data. At the time of writing, you also need to bypass a "Connection not secure" warning when accessing CodaLab.
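Before building the dataset, it can help to confirm that all six source files named in the instructions above are actually in the manual directory. The following is a minimal sketch using only the Python standard library; the helper name `missing_manual_files` is hypothetical and is not part of TensorFlow Datasets.

```python
from pathlib import Path

# File names taken from the manual download instructions above.
REQUIRED_FILES = (
    "test_all_frames.zip",
    "test.json",
    "train_all_frames.zip",
    "train.json",
    "valid_all_frames.zip",
    "valid.json",
)


def missing_manual_files(manual_dir):
    """Return the required file names that are not present in manual_dir."""
    manual_dir = Path(manual_dir).expanduser()
    return [name for name in REQUIRED_FILES if not (manual_dir / name).is_file()]


if __name__ == "__main__":
    # Default manual directory quoted in the instructions above.
    missing = missing_manual_files("~/tensorflow_datasets/downloads/manual")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All six source files are in place.")
```

If you pass a custom `manual_dir` through `download_config`, give the same path to the helper.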

  • Citation:

@article{DBLP:journals/corr/abs-1905-04804,
  author    = {Linjie Yang and
               Yuchen Fan and
               Ning Xu},
  title     = {Video Instance Segmentation},
  journal   = {CoRR},
  volume    = {abs/1905.04804},
  year      = {2019},
  url       = {http://arxiv.org/abs/1905.04804},
  archivePrefix = {arXiv},
  eprint    = {1905.04804},
  timestamp = {Tue, 28 May 2019 12:48:08 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1905-04804.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

youtube_vis/full (default config)

  • Config description: The full resolution version of the dataset, with all frames, including those without labels, included.

  • Dataset size: 33.31 GiB

  • Splits:

Split         Examples
'test'        343
'train'       2,238
'validation'  302
  • Feature structure:
FeaturesDict({
    'metadata': FeaturesDict({
        'height': int32,
        'num_frames': int32,
        'video_name': string,
        'width': int32,
    }),
    'tracks': Sequence({
        'areas': Sequence(float32),
        'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
        'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
        'frames': Sequence(int32),
        'is_crowd': bool,
        'segmentations': Video(Image(shape=(None, None, 1), dtype=uint8)),
    }),
    'video': Video(Image(shape=(None, None, 3), dtype=uint8)),
})
  • Feature documentation:
Feature               Class                  Shape                  Dtype    Description
                      FeaturesDict
metadata              FeaturesDict
metadata/height       Tensor                                        int32
metadata/num_frames   Tensor                                        int32
metadata/video_name   Tensor                                        string
metadata/width        Tensor                                        int32
tracks                Sequence
tracks/areas          Sequence(Tensor)       (None,)                float32
tracks/bboxes         Sequence(BBoxFeature)  (None, 4)              float32
tracks/category       ClassLabel                                    int64
tracks/frames         Sequence(Tensor)       (None,)                int32
tracks/is_crowd       Tensor                                        bool
tracks/segmentations  Video(Image)           (None, None, None, 1)  uint8
video                 Video(Image)           (None, None, None, 3)  uint8
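The `tracks/bboxes` feature is a `BBoxFeature`, which TensorFlow Datasets stores as normalized `(ymin, xmin, ymax, xmax)` values in the range 0 to 1. The sketch below converts the boxes of one track to pixel coordinates. It assumes that `frames[i]` is the frame index that `bboxes[i]` belongs to, which this page does not state explicitly. The function names are hypothetical, and the caller supplies the height and width of the frames actually being used.

```python
def bbox_to_pixels(bbox, height, width):
    """Convert a normalized (ymin, xmin, ymax, xmax) box to pixel coordinates."""
    ymin, xmin, ymax, xmax = (float(v) for v in bbox)
    return (ymin * height, xmin * width, ymax * height, xmax * width)


def track_boxes_by_frame(frames, bboxes, height, width):
    """Map each labeled frame index of one track to its pixel-space box.

    Assumption: frames[i] is the frame index that bboxes[i] belongs to.
    """
    if len(frames) != len(bboxes):
        raise ValueError("frames and bboxes must have the same length")
    return {
        int(frame): bbox_to_pixels(bbox, height, width)
        for frame, bbox in zip(frames, bboxes)
    }


if __name__ == "__main__":
    # Made-up values for illustration only, not taken from the dataset.
    frames = [0, 5]
    bboxes = [[0.25, 0.5, 0.75, 1.0], [0.0, 0.0, 0.5, 0.25]]
    print(track_boxes_by_frame(frames, bboxes, height=720, width=1280))
    # {0: (180.0, 640.0, 540.0, 1280.0), 5: (0.0, 0.0, 360.0, 320.0)}
```

For the 480_640 configs the frames have a fixed shape, so `height=480` and `width=640` would be the natural arguments there.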

youtube_vis/480_640_full

  • Config description: All frames are included, each bilinearly resized to 480 × 640 (height × width).

  • Dataset size: 130.02 GiB

  • Splits:

Split         Examples
'test'        343
'train'       2,238
'validation'  302
  • Feature structure:
FeaturesDict({
    'metadata': FeaturesDict({
        'height': int32,
        'num_frames': int32,
        'video_name': string,
        'width': int32,
    }),
    'tracks': Sequence({
        'areas': Sequence(float32),
        'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
        'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
        'frames': Sequence(int32),
        'is_crowd': bool,
        'segmentations': Video(Image(shape=(480, 640, 1), dtype=uint8)),
    }),
    'video': Video(Image(shape=(480, 640, 3), dtype=uint8)),
})
  • Feature documentation:
Feature               Class                  Shape                  Dtype    Description
                      FeaturesDict
metadata              FeaturesDict
metadata/height       Tensor                                        int32
metadata/num_frames   Tensor                                        int32
metadata/video_name   Tensor                                        string
metadata/width        Tensor                                        int32
tracks                Sequence
tracks/areas          Sequence(Tensor)       (None,)                float32
tracks/bboxes         Sequence(BBoxFeature)  (None, 4)              float32
tracks/category       ClassLabel                                    int64
tracks/frames         Sequence(Tensor)       (None,)                int32
tracks/is_crowd       Tensor                                        bool
tracks/segmentations  Video(Image)           (None, 480, 640, 1)    uint8
video                 Video(Image)           (None, 480, 640, 3)    uint8

youtube_vis/480_640_only_frames_with_labels

  • Config description: Only frames with labels are included, each bilinearly resized to 480 × 640 (height × width).

  • Dataset size: 26.27 GiB

  • Splits:

Split         Examples
'test'        343
'train'       2,238
'validation'  302
  • Feature structure:
FeaturesDict({
    'metadata': FeaturesDict({
        'height': int32,
        'num_frames': int32,
        'video_name': string,
        'width': int32,
    }),
    'tracks': Sequence({
        'areas': Sequence(float32),
        'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
        'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
        'frames': Sequence(int32),
        'is_crowd': bool,
        'segmentations': Video(Image(shape=(480, 640, 1), dtype=uint8)),
    }),
    'video': Video(Image(shape=(480, 640, 3), dtype=uint8)),
})
  • Feature documentation:
Feature               Class                  Shape                  Dtype    Description
                      FeaturesDict
metadata              FeaturesDict
metadata/height       Tensor                                        int32
metadata/num_frames   Tensor                                        int32
metadata/video_name   Tensor                                        string
metadata/width        Tensor                                        int32
tracks                Sequence
tracks/areas          Sequence(Tensor)       (None,)                float32
tracks/bboxes         Sequence(BBoxFeature)  (None, 4)              float32
tracks/category       ClassLabel                                    int64
tracks/frames         Sequence(Tensor)       (None,)                int32
tracks/is_crowd       Tensor                                        bool
tracks/segmentations  Video(Image)           (None, 480, 640, 1)    uint8
video                 Video(Image)           (None, 480, 640, 3)    uint8

youtube_vis/only_frames_with_labels

  • Config description: Only frames with labels are included, at their native resolution.

  • Dataset size: 6.91 GiB

  • Splits:

Split         Examples
'test'        343
'train'       2,238
'validation'  302
  • Feature structure:
FeaturesDict({
    'metadata': FeaturesDict({
        'height': int32,
        'num_frames': int32,
        'video_name': string,
        'width': int32,
    }),
    'tracks': Sequence({
        'areas': Sequence(float32),
        'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
        'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
        'frames': Sequence(int32),
        'is_crowd': bool,
        'segmentations': Video(Image(shape=(None, None, 1), dtype=uint8)),
    }),
    'video': Video(Image(shape=(None, None, 3), dtype=uint8)),
})
  • Feature documentation:
Feature               Class                  Shape                  Dtype    Description
                      FeaturesDict
metadata              FeaturesDict
metadata/height       Tensor                                        int32
metadata/num_frames   Tensor                                        int32
metadata/video_name   Tensor                                        string
metadata/width        Tensor                                        int32
tracks                Sequence
tracks/areas          Sequence(Tensor)       (None,)                float32
tracks/bboxes         Sequence(BBoxFeature)  (None, 4)              float32
tracks/category       ClassLabel                                    int64
tracks/frames         Sequence(Tensor)       (None,)                int32
tracks/is_crowd       Tensor                                        bool
tracks/segmentations  Video(Image)           (None, None, None, 1)  uint8
video                 Video(Image)           (None, None, None, 3)  uint8

youtube_vis/full_train_split

  • Config description: The full resolution version of the dataset, with all frames, including those without labels, included. The validation and test splits (200 videos each) are carved out of the original 2,238 training videos.

  • Dataset size: 26.09 GiB

  • Splits:

Split         Examples
'test'        200
'train'       1,838
'validation'  200
  • Feature structure:
FeaturesDict({
    'metadata': FeaturesDict({
        'height': int32,
        'num_frames': int32,
        'video_name': string,
        'width': int32,
    }),
    'tracks': Sequence({
        'areas': Sequence(float32),
        'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
        'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
        'frames': Sequence(int32),
        'is_crowd': bool,
        'segmentations': Video(Image(shape=(None, None, 1), dtype=uint8)),
    }),
    'video': Video(Image(shape=(None, None, 3), dtype=uint8)),
})
  • Feature documentation:
Feature               Class                  Shape                  Dtype    Description
                      FeaturesDict
metadata              FeaturesDict
metadata/height       Tensor                                        int32
metadata/num_frames   Tensor                                        int32
metadata/video_name   Tensor                                        string
metadata/width        Tensor                                        int32
tracks                Sequence
tracks/areas          Sequence(Tensor)       (None,)                float32
tracks/bboxes         Sequence(BBoxFeature)  (None, 4)              float32
tracks/category       ClassLabel                                    int64
tracks/frames         Sequence(Tensor)       (None,)                int32
tracks/is_crowd       Tensor                                        bool
tracks/segmentations  Video(Image)           (None, None, None, 1)  uint8
video                 Video(Image)           (None, None, None, 3)  uint8

youtube_vis/480_640_full_train_split

  • Config description: All frames are included, each bilinearly resized to 480 × 640 (height × width). The validation and test splits (200 videos each) are carved out of the original 2,238 training videos.

  • Dataset size: 101.57 GiB

  • Splits:

Split         Examples
'test'        200
'train'       1,838
'validation'  200
  • Feature structure:
FeaturesDict({
    'metadata': FeaturesDict({
        'height': int32,
        'num_frames': int32,
        'video_name': string,
        'width': int32,
    }),
    'tracks': Sequence({
        'areas': Sequence(float32),
        'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
        'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
        'frames': Sequence(int32),
        'is_crowd': bool,
        'segmentations': Video(Image(shape=(480, 640, 1), dtype=uint8)),
    }),
    'video': Video(Image(shape=(480, 640, 3), dtype=uint8)),
})
  • Feature documentation:
Feature               Class                  Shape                  Dtype    Description
                      FeaturesDict
metadata              FeaturesDict
metadata/height       Tensor                                        int32
metadata/num_frames   Tensor                                        int32
metadata/video_name   Tensor                                        string
metadata/width        Tensor                                        int32
tracks                Sequence
tracks/areas          Sequence(Tensor)       (None,)                float32
tracks/bboxes         Sequence(BBoxFeature)  (None, 4)              float32
tracks/category       ClassLabel                                    int64
tracks/frames         Sequence(Tensor)       (None,)                int32
tracks/is_crowd       Tensor                                        bool
tracks/segmentations  Video(Image)           (None, 480, 640, 1)    uint8
video                 Video(Image)           (None, 480, 640, 3)    uint8

youtube_vis/480_640_only_frames_with_labels_train_split

  • Config description: Only frames with labels are included, each bilinearly resized to 480 × 640 (height × width). The validation and test splits (200 videos each) are carved out of the original 2,238 training videos.

  • Dataset size: 20.55 GiB

  • Splits:

Split         Examples
'test'        200
'train'       1,838
'validation'  200
  • Feature structure:
FeaturesDict({
    'metadata': FeaturesDict({
        'height': int32,
        'num_frames': int32,
        'video_name': string,
        'width': int32,
    }),
    'tracks': Sequence({
        'areas': Sequence(float32),
        'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
        'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
        'frames': Sequence(int32),
        'is_crowd': bool,
        'segmentations': Video(Image(shape=(480, 640, 1), dtype=uint8)),
    }),
    'video': Video(Image(shape=(480, 640, 3), dtype=uint8)),
})
  • Feature documentation:
Feature               Class                  Shape                  Dtype    Description
                      FeaturesDict
metadata              FeaturesDict
metadata/height       Tensor                                        int32
metadata/num_frames   Tensor                                        int32
metadata/video_name   Tensor                                        string
metadata/width        Tensor                                        int32
tracks                Sequence
tracks/areas          Sequence(Tensor)       (None,)                float32
tracks/bboxes         Sequence(BBoxFeature)  (None, 4)              float32
tracks/category       ClassLabel                                    int64
tracks/frames         Sequence(Tensor)       (None,)                int32
tracks/is_crowd       Tensor                                        bool
tracks/segmentations  Video(Image)           (None, 480, 640, 1)    uint8
video                 Video(Image)           (None, 480, 640, 3)    uint8

youtube_vis/only_frames_with_labels_train_split

  • Config description: Only frames with labels are included, at their native resolution. The validation and test splits (200 videos each) are carved out of the original 2,238 training videos.

  • Dataset size: 5.46 GiB

  • Splits:

Split         Examples
'test'        200
'train'       1,838
'validation'  200
  • Feature structure:
FeaturesDict({
    'metadata': FeaturesDict({
        'height': int32,
        'num_frames': int32,
        'video_name': string,
        'width': int32,
    }),
    'tracks': Sequence({
        'areas': Sequence(float32),
        'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
        'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
        'frames': Sequence(int32),
        'is_crowd': bool,
        'segmentations': Video(Image(shape=(None, None, 1), dtype=uint8)),
    }),
    'video': Video(Image(shape=(None, None, 3), dtype=uint8)),
})
  • Feature documentation:
Feature               Class                  Shape                  Dtype    Description
                      FeaturesDict
metadata              FeaturesDict
metadata/height       Tensor                                        int32
metadata/num_frames   Tensor                                        int32
metadata/video_name   Tensor                                        string
metadata/width        Tensor                                        int32
tracks                Sequence
tracks/areas          Sequence(Tensor)       (None,)                float32
tracks/bboxes         Sequence(BBoxFeature)  (None, 4)              float32
tracks/category       ClassLabel                                    int64
tracks/frames         Sequence(Tensor)       (None,)                int32
tracks/is_crowd       Tensor                                        bool
tracks/segmentations  Video(Image)           (None, None, None, 1)  uint8
video                 Video(Image)           (None, None, None, 3)  uint8
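The eight configs listed above follow a regular naming pattern: an optional 480_640 prefix, then either full or only_frames_with_labels, then an optional train_split suffix. The sketch below builds a config name from three flags and passes it to `tfds.load`. The helper names `youtube_vis_config` and `load_youtube_vis` are hypothetical conveniences, not part of TensorFlow Datasets, and loading only works once the manually downloaded source files are in place.

```python
def youtube_vis_config(resized=False, only_labeled_frames=False, train_split=False):
    """Build one of the eight config names listed on this page."""
    parts = []
    if resized:
        parts.append("480_640")
    parts.append("only_frames_with_labels" if only_labeled_frames else "full")
    if train_split:
        parts.append("train_split")
    return "youtube_vis/" + "_".join(parts)


def load_youtube_vis(split="train", **config_options):
    """Load one split of the chosen config with tfds.load."""
    # Imported lazily so the name helper works without TensorFlow Datasets installed.
    import tensorflow_datasets as tfds

    return tfds.load(youtube_vis_config(**config_options), split=split)


if __name__ == "__main__":
    print(youtube_vis_config())  # youtube_vis/full
    print(youtube_vis_config(resized=True, only_labeled_frames=True, train_split=True))
    # youtube_vis/480_640_only_frames_with_labels_train_split
```

Given the dataset sizes listed for each config, the only_frames_with_labels variants are the lightest choice when unlabeled frames are not needed.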