Module google/universal-sentence-encoder-large/2

Encoder of greater-than-word length text trained on a variety of data.

Module URL: https://tfhub.dev/google/universal-sentence-encoder-large/2

Open Colab notebok

Overview

The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.

The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a 512 dimensional vector. The universal-sentence-encoder-large model is trained with a Transformer encoder.

To learn more about text embeddings, refer to the TensorFlow Embeddings documentation. Our encoder differs from word level embedding models in that we train on a number of natural language prediction tasks that require modeling the meaning of word sequences rather than just individual words. Details are available in the paper "Universal Sentence Encoder" [1].

Example use

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/2")
embeddings = embed([
    "The quick brown fox jumps over the lazy dog.",
    "I am a sentence for which I would like to get its embedding"])

print session.run(embeddings)

# The following are example embedding output of 512 dimensions per sentence
# Embedding for: The quick brown fox jumps over the lazy dog.
# [0.0387121252716, -0.00189059402328, -0.00533464271575, ...]
# Embedding for: I am a sentence for which I would like to get its embedding.
# [0.0489358529449, -0.0615833997726, 0.0032471306622, ...]

This module is about 800MB. Depending on your network speed, it might take a while to load the first time you instantiate it. After that, loading the model should be faster as modules are cached by default (learn more about caching). Further, once a module is loaded to memory, inference time should be relatively fast.

Please see Universal Sentence Encoder for details and see this notebook for code examples.

Changelog

Version 1

  • Initial release.

Version 2

  • Exposed internal variables as Trainable.

References

[1] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Céspedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil. Universal Sentence Encoder. arXiv:1803.11175, 2018.