wit_kaggle

  • Description:

Wikipedia - Image/Caption Matching Kaggle Competition.

This competition is organized by the Research team at the Wikimedia Foundation in collaboration with Google Research and a few external collaborators. This competition is based on the WIT dataset published by Google Research, as detailed in this SIGIR paper.

In this competition, you’ll build a model that automatically retrieves the text closest to an image. Specifically, you'll train your model to associate given images with article titles or complex captions, in multiple languages. The best models will account for the semantic granularity of Wikipedia images. If successful, you'll be contributing to the accessibility of the largest online encyclopedia. The millions of Wikipedia readers and editors will be able to more easily understand, search, and describe media at scale. As a result, you’ll contribute to an open model to improve learning for all.

  • Homepage: https://www.kaggle.com/c/wikipedia-image-caption/code

  • Source code: tfds.vision_language.wit_kaggle.WitKaggle

  • Versions:

    • 1.0.0: Initial release. It provides the train and test datasets from the Wikipedia - Image/Caption Matching Kaggle competition (https://www.kaggle.com/c/wikipedia-image-caption/data).

      The goal of the competition is to build a model that automatically retrieves the text closest to an image. Specifically, the model should be trained to associate given images with article titles or complex captions, in multiple languages. The best models will account for the semantic granularity of Wikipedia images.

      Note that this release doesn't provide the ground truth for the test set, as it hasn't been provided by the Kaggle competition yet.

      Note that not all of the training observations have corresponding image data. The released images exclude all images containing humans. For samples that are not associated with image data, the following placeholder image features are used: image is a base64-encoded blank image and embedding is a vector of 2048 zeros (see the filtering sketch after the version list).

      The samples released for the competition can be loaded as:

      tfds.load("wit_kaggle/train_with_extended_features")
      tfds.load("wit_kaggle/test_without_gold")

    • 1.0.1: Optimizes the Beam pipeline to avoid stragglers by ignoring rows without an image URL. Also adds more Beam counters.

    • 1.0.2 (default): Fixes parsing of boolean fields.
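
As noted in the 1.0.0 release notes above, samples without released image data carry placeholder image features. The sketch below is a minimal example, assuming the all-zeros embedding reliably marks such rows, of dropping them before training:

import tensorflow as tf
import tensorflow_datasets as tfds

# Load the prepared training split (see the manual download
# instructions below for preparing the data first).
ds = tfds.load("wit_kaggle/train_with_extended_features",
               split="train_with_extended_features")

def has_image_data(example):
    # Rows without released image data use a blank image and an
    # all-zeros 2048-dim embedding as placeholders.
    return tf.reduce_any(tf.not_equal(example["embedding"], 0.0))

ds_with_images = ds.filter(has_image_data)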

  • Download size: Unknown size

  • Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
    Depending on which config is used, manual_dir should contain some of the following subdirectories:

    • train/
      • train-{0000x}-of-00005.tsv.zip
      • image_data_train/
        • image_pixels/
          • train_image_pixels_part-00{000-199}.csv.gz
        • resnet_embeddings/
          • train_resnet_embeddings_part-00{000-214}.csv.gz
    • test/
      • test.tsv.zip
      • image_data_test/
        • image_pixels/
          • test_image_pixels_part-0000{0-4}.csv
        • resnet_embeddings/
          • test_resnet_embeddings_part-0000{0-9}.csv

Registration at https://www.kaggle.com/c/wikipedia-image-caption/data is needed to get the links to download the dataset.
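
A minimal preparation sketch, assuming the Kaggle files sit under the manual directory laid out above (the path shown is the TFDS default and is illustrative):

import tensorflow_datasets as tfds

# Point the builder at the manually downloaded Kaggle files.
download_config = tfds.download.DownloadConfig(
    manual_dir="~/tensorflow_datasets/downloads/manual/")

builder = tfds.builder("wit_kaggle/train_with_extended_features")
builder.download_and_prepare(download_config=download_config)
ds = builder.as_dataset(split="train_with_extended_features")

Note that generation runs an Apache Beam pipeline (see the 1.0.1 release notes above), so preparing the full training config is a heavy job.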

  • Citation:

@article{srinivasan2021wit,
  title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
  author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
  journal={arXiv preprint arXiv:2103.01913},
  year={2021}
}

wit_kaggle/train_with_extended_features (default config)

  • Config description: Training samples for the Wikipedia-Image/Caption Matching competition.

  • Dataset size: 1.16 TiB

  • Splits:

Split                            Examples
'train_with_extended_features'   37,046,386
  • Feature structure:
FeaturesDict({
    'attribution_passes_lang_id': bool,
    'caption_alt_text_description': Text(shape=(), dtype=string),
    'caption_attribution_description': Text(shape=(), dtype=string),
    'caption_reference_description': Text(shape=(), dtype=string),
    'caption_title_and_reference_description': Text(shape=(), dtype=string),
    'context_page_description': Text(shape=(), dtype=string),
    'context_section_description': Text(shape=(), dtype=string),
    'embedding': Tensor(shape=(2048,), dtype=float32),
    'hierarchical_section_title': Text(shape=(), dtype=string),
    'image': Image(shape=(None, None, 3), dtype=uint8),
    'image_url': Text(shape=(), dtype=string),
    'is_main_image': bool,
    'language': Text(shape=(), dtype=string),
    'metadata_url': Text(shape=(), dtype=string),
    'mime_type': Text(shape=(), dtype=string),
    'original_height': int32,
    'original_width': int32,
    'page_changed_recently': bool,
    'page_title': Text(shape=(), dtype=string),
    'page_url': Text(shape=(), dtype=string),
    'section_title': Text(shape=(), dtype=string),
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
attribution_passes_lang_id Tensor bool
caption_alt_text_description Text string
caption_attribution_description Text string
caption_reference_description Text string
caption_title_and_reference_description Text string
context_page_description Text string
context_section_description Text string
embedding Tensor (2048,) float32
hierarchical_section_title Text string
image Image (None, None, 3) uint8
image_url Text string
is_main_image Tensor bool
language Text string
metadata_url Text string
mime_type Text string
original_height Tensor int32
original_width Tensor int32
page_changed_recently Tensor bool
page_title Text string
page_url Text string
section_title Text string
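
For orientation, here is a short sketch that reads a couple of examples and touches the main fields; the feature names are taken from the structure above:

import tensorflow_datasets as tfds

ds = tfds.load("wit_kaggle/train_with_extended_features",
               split="train_with_extended_features")

# Inspect a couple of examples as NumPy values.
for example in tfds.as_numpy(ds.take(2)):
    image = example["image"]          # (height, width, 3) uint8 pixels
    embedding = example["embedding"]  # (2048,) float32 ResNet features
    caption = example["caption_title_and_reference_description"]
    print(example["language"], image.shape, caption[:80])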


wit_kaggle/test_without_gold

  • Config description: Test samples (without gold answers) for the Wikipedia-Image/Caption Matching competition.

  • Dataset size: 3.37 GiB

  • Splits:

Split                 Examples
'test_without_gold'   92,366
  • Feature structure:
FeaturesDict({
    'caption_title_and_reference_description': Text(shape=(), dtype=string),
    'embedding': Tensor(shape=(2048,), dtype=float32),
    'id': Text(shape=(), dtype=string),
    'image': Image(shape=(None, None, 3), dtype=uint8),
    'image_url': Text(shape=(), dtype=string),
    'metadata_url': Text(shape=(), dtype=string),
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
caption_title_and_reference_description Text string
embedding Tensor (2048,) float32
id Text string
image Image (None, None, 3) uint8
image_url Text string
metadata_url Text string
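
Since this config ships without gold answers, inference typically pairs each example id with a model prediction. A schematic loop, where predict_caption is a hypothetical stand-in for a trained retrieval model:

import tensorflow_datasets as tfds

def predict_caption(image, embedding):
    # Hypothetical placeholder: a real model would score candidate
    # captions against the image/embedding and return the best match.
    return ""

test_ds = tfds.load("wit_kaggle/test_without_gold",
                    split="test_without_gold")

rows = []
for example in tfds.as_numpy(test_ds):
    prediction = predict_caption(example["image"], example["embedding"])
    rows.append((example["id"].decode("utf-8"), prediction))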
