laion400m

  • Description:

The LAION-400M dataset is entirely open and freely accessible.

Check https://laion.ai/laion-400-open-dataset/ for the full description of this dataset.

All images and texts in the LAION-400M dataset have been filtered with OpenAI's CLIP by computing the cosine similarity between the text and image embeddings and dropping pairs with a similarity below 0.3. The threshold of 0.3 was determined through human evaluation and proved to be a good heuristic for estimating semantic image-text content matching.
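The filtering rule described above can be sketched in NumPy. This is an illustrative reimplementation, not LAION's actual pipeline; the function names and the toy embeddings are made up for the example:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two batches of embeddings."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def keep_mask(image_emb: np.ndarray, text_emb: np.ndarray,
              threshold: float = 0.3) -> np.ndarray:
    """Boolean mask of image-text pairs whose similarity clears the threshold."""
    return cosine_similarity(image_emb, text_emb) >= threshold

# Toy example: 3 pairs of 512-d embeddings (the CLIP embedding size).
rng = np.random.default_rng(0)
img = rng.normal(size=(3, 512))
# Pair 0 matches exactly, pair 1 is an anti-match, pair 2 is random noise.
txt = np.vstack([img[0], -img[1], rng.normal(size=512)])
mask = keep_mask(img, txt)
print(mask)  # -> [ True False False]
```

Random high-dimensional vectors have cosine similarity near zero, which is why the 0.3 threshold discards the unrelated third pair.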

The image-text pairs were extracted from the Common Crawl web data dump and come from random web pages crawled between 2014 and 2021.

Split Examples

  • Citation:
@article{DBLP:journals/corr/abs-2111-02114,
  author    = {Christoph Schuhmann and
               Richard Vencu and
               Romain Beaumont and
               Robert Kaczmarczyk and
               Clayton Mullis and
               Aarush Katta and
               Theo Coombes and
               Jenia Jitsev and
               Aran Komatsuzaki},
  title     = { {LAION-400M:} Open Dataset of CLIP-Filtered 400 Million Image-Text
               Pairs},
  journal   = {CoRR},
  volume    = {abs/2111.02114},
  year      = {2021},
  url       = {https://arxiv.org/abs/2111.02114},
  eprinttype = {arXiv},
  eprint    = {2111.02114},
  timestamp = {Fri, 05 Nov 2021 15:25:54 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2111-02114.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

laion400m/images (default config)

  • Feature structure:
FeaturesDict({
    'caption': Text(shape=(), dtype=string),
    'image': Image(shape=(None, None, 3), dtype=uint8, description=image),
    'license': Text(shape=(), dtype=string),
    'nsfw': ClassLabel(shape=(), dtype=int64, num_classes=4),
    'original_height': Scalar(shape=(), dtype=int32, description=original height of the image),
    'original_width': Scalar(shape=(), dtype=int32, description=original width of the image),
    'similarity': Scalar(shape=(), dtype=float64, description=cosine similarity score between the text and image embedding. Missing values default to -1.0),
    'url': Text(shape=(), dtype=string),
})
  • Feature documentation:
| Feature         | Class        | Shape           | Dtype   | Description | Value range |
|-----------------|--------------|-----------------|---------|-------------|-------------|
|                 | FeaturesDict |                 |         |             |             |
| caption         | Text         |                 | string  | HTML alt-text attribute |  |
| image           | Image        | (None, None, 3) | uint8   | image |  |
| license         | Text         |                 | string  | type of Creative Commons license (if applicable) |  |
| nsfw            | ClassLabel   |                 | int64   | NSFW tag (detected with CLIP); inconsistent and missing tags are replaced with UNTAGGED |  |
| original_height | Scalar       |                 | int32   | original height of the image |  |
| original_width  | Scalar       |                 | int32   | original width of the image |  |
| similarity      | Scalar       |                 | float64 | cosine similarity score between the text and image embedding; missing values default to -1.0 | [0.0, 1.0] |
| url             | Text         |                 | string  | image URL |  |
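Records with this structure can be post-filtered on the `nsfw` and `similarity` features. The sketch below uses hand-made stand-in dictionaries rather than a real split (the dataset must be prepared manually), and the assumption that label index 0 is the "safe" NSFW tag is illustrative; check the builder's label names before relying on it:

```python
def is_safe_and_matched(example: dict, min_similarity: float = 0.3) -> bool:
    """Keep examples tagged as safe whose similarity score is present and high enough."""
    SAFE = 0  # assumed label index for the "safe" tag -- verify against the builder
    if example["nsfw"] != SAFE:
        return False
    sim = example["similarity"]
    # A similarity of -1.0 marks a missing score, per the feature description.
    return sim != -1.0 and sim >= min_similarity

# Hand-made stand-ins for records from the laion400m/images split.
records = [
    {"caption": "a red bicycle", "nsfw": 0, "similarity": 0.41},
    {"caption": "blurry photo", "nsfw": 0, "similarity": -1.0},  # missing score
    {"caption": "flagged image", "nsfw": 2, "similarity": 0.55},
]
kept = [r["caption"] for r in records if is_safe_and_matched(r)]
print(kept)  # -> ['a red bicycle']
```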

laion400m/embeddings

  • Feature structure:
FeaturesDict({
    'caption': Text(shape=(), dtype=string),
    'image_embedding': Tensor(shape=(512,), dtype=float16, description=CLIP image embedding),
    'license': Text(shape=(), dtype=string),
    'nsfw': ClassLabel(shape=(), dtype=int64, num_classes=4),
    'original_height': Scalar(shape=(), dtype=int32, description=original height of the image),
    'original_width': Scalar(shape=(), dtype=int32, description=original width of the image),
    'similarity': Scalar(shape=(), dtype=float64, description=cosine similarity score between the text and image embedding. Missing values default to -1.0),
    'text_embedding': Tensor(shape=(512,), dtype=float16, description=CLIP text embedding),
    'url': Text(shape=(), dtype=string),
})
  • Feature documentation:
| Feature         | Class        | Shape  | Dtype   | Description | Value range |
|-----------------|--------------|--------|---------|-------------|-------------|
|                 | FeaturesDict |        |         |             |             |
| caption         | Text         |        | string  | HTML alt-text attribute |  |
| image_embedding | Tensor       | (512,) | float16 | CLIP image embedding |  |
| license         | Text         |        | string  | type of Creative Commons license (if applicable) |  |
| nsfw            | ClassLabel   |        | int64   | NSFW tag (detected with CLIP); inconsistent and missing tags are replaced with UNTAGGED |  |
| original_height | Scalar       |        | int32   | original height of the image |  |
| original_width  | Scalar       |        | int32   | original width of the image |  |
| similarity      | Scalar       |        | float64 | cosine similarity score between the text and image embedding; missing values default to -1.0 | [0.0, 1.0] |
| text_embedding  | Tensor       | (512,) | float16 | CLIP text embedding |  |
| url             | Text         |        | string  | image URL |  |
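Since the embeddings config stores the precomputed (512,) float16 CLIP embeddings, the `similarity` score can be recomputed directly from them. A minimal sketch, using synthetic embeddings in place of real records:

```python
import numpy as np

def recompute_similarity(image_embedding: np.ndarray,
                         text_embedding: np.ndarray) -> float:
    """Cosine similarity between one CLIP image embedding and one text embedding.

    Embeddings are cast up from float16 before normalising to limit
    precision loss in the dot product.
    """
    img = image_embedding.astype(np.float32)
    txt = text_embedding.astype(np.float32)
    img /= np.linalg.norm(img)
    txt /= np.linalg.norm(txt)
    return float(img @ txt)

# Synthetic float16 vectors standing in for the (512,) tensors above.
rng = np.random.default_rng(42)
img_emb = rng.normal(size=512).astype(np.float16)
txt_emb = (0.5 * img_emb.astype(np.float32)
           + 0.5 * rng.normal(size=512)).astype(np.float16)
sim = recompute_similarity(img_emb, txt_emb)
assert -1.0 <= sim <= 1.0
```

Because the stored embeddings are float16, a recomputed value may differ slightly from the stored float64 `similarity`, which was computed before quantisation.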