Encoder of greater-than-word length text trained on a variety of data.
The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.
The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a 512 dimensional vector. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder.
To learn more about text embeddings, refer to the TensorFlow Embeddings documentation. Our encoder differs from word level embedding models in that we train on a number of natural language prediction tasks that require modeling the meaning of word sequences rather than just individual words. Details are available in the paper "Universal Sentence Encoder" .
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/1") embeddings = embed([ "The quick brown fox jumps over the lazy dog.", "I am a sentence for which I would like to get its embedding"]) print session.run(embeddings) # The following are example embedding output of 512 dimensions per sentence # Embedding for: The quick brown fox jumps over the lazy dog. # [-0.016987282782793045, -0.008949815295636654, -0.0070627182722091675, ...] # Embedding for: I am a sentence for which I would like to get its embedding. # [0.03531332314014435, -0.025384284555912018, -0.007880025543272495, ...]
This module is about 1GB. Depending on your network speed, it might take a while to load the first time you instantiate it. After that, loading the model should be faster as modules are cached by default (learn more about caching). Further, once a module is loaded to memory, inference time should be relatively fast.
Semantic similarity is a measure of the degree to which two pieces of text carry the same meaning. This is broadly useful in obtaining good coverage over the numerous ways that a thought can be expressed using language without needing to manually enumerate them.
Simple applications include improving the coverage of systems that trigger behaviors on certain keywords, phrases or utterances. This section of the notebook shows how to encode text and compare encoding distances as a proxy for semantic similarity.
This notebook shows how to train a simple binary text classifier on top of any TF-Hub module that can embed sentences. The Universal Sentence Encoder was partly trained with custom text classification tasks in mind. It can be trained to perform a wide variety of classification tasks often with a very small amount of labeled examples.
 Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Céspedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil. Universal Sentence Encoder. arXiv:1803.11175, 2018.