Module google/universal-sentence-encoder-lite/2

Encoder of greater-than-word length text trained on a variety of data.

Module URL: https://tfhub.dev/google/universal-sentence-encoder-lite/2

Overview

The Universal Sentence Encoder Lite module is a lightweight version of the Universal Sentence Encoder. This lite version is good for use cases where computational resources are limited, such as on-device inference. It is small and still gives good performance on various natural language understanding tasks.

The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks, with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable-length English text and the output is a 512-dimensional vector. We apply this model to the STS benchmark for semantic similarity; the results can be seen in the example notebook made available. To learn more about text embeddings, refer to the TensorFlow Embeddings documentation. Our encoder differs from word-level embedding models in that we train on a number of natural language prediction tasks that require modeling the meaning of word sequences rather than just individual words. Details are available in the paper "Universal Sentence Encoder" [1].
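
For instance, semantic similarity between two sentences can be read off their encodings. Here is a minimal sketch (the helper name is ours), assuming 'embedding_1' and 'embedding_2' are 512-dimensional vectors produced as in the example below:

import numpy as np

def semantic_similarity(embedding_1, embedding_2):
  # Cosine similarity of the two embedding vectors; values closer to 1.0
  # indicate more similar meanings.
  return np.inner(embedding_1, embedding_2) / (
      np.linalg.norm(embedding_1) * np.linalg.norm(embedding_2))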

Signatures

This module provides two signatures:

  • "default": the sentence encoding signature.
    • inputs: This signature takes IDs produced by the SentencePiece processor on the input sentences. The IDs should be represented in tf.SparseTensor style by the three named arguments "values", "indices", and "dense_shape". See the 'process_to_IDs_in_sparse_format()' function in the example below.
    • output: A 512-dimensional vector for each sentence.
  • "spm_path": this signature returns the path to the SentencePiece model required for processing the sentences. See the next section for details.

Prerequisites

You need to process all the sentences with the SentencePiece library, using the SentencePiece model published together with the module. On Google Colaboratory, the SentencePiece library can be installed with:

!pip3 install sentencepiece
import sentencepiece as spm

To initialize a SentencePiece processor with the SentencePiece model published together with the module:

import tensorflow as tf
import tensorflow_hub as hub
import sentencepiece as spm

with tf.Session() as sess:
  module = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-lite/2")
  spm_path = sess.run(module(signature="spm_path"))
  # spm_path now contains a path to the SentencePiece model stored inside the
  # TF-Hub module

sp = spm.SentencePieceProcessor()
sp.Load(spm_path)
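
Once loaded, the processor turns raw text into the integer IDs the module consumes. A quick check (the actual IDs depend on the vocabulary bundled with the module):

ids = sp.EncodeAsIds("The quick brown fox jumps over the lazy dog.")
print(ids)  # a list of integer token IDs, one per SentencePiece token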

Example use

import tensorflow as tf
import tensorflow_hub as hub
import sentencepiece as spm

def process_to_IDs_in_sparse_format(sp, sentences):
  # A utility method that processes sentences with the SentencePiece processor
  # 'sp' and returns the results in tf.SparseTensor-similar format:
  # (values, indices, dense_shape)
  ids = [sp.EncodeAsIds(x) for x in sentences]
  max_len = max(len(x) for x in ids)
  dense_shape = (len(ids), max_len)
  values = [item for sublist in ids for item in sublist]
  indices = [[row, col] for row in range(len(ids)) for col in range(len(ids[row]))]
  return (values, indices, dense_shape)

sp = spm.SentencePieceProcessor()
sp.Load("/path/to/sentence_piece/model")

module = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-lite/2")
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "I am a sentence for which I would like to get its embedding"]

input_placeholder = tf.sparse_placeholder(tf.int64, shape=[None, None])
embeddings = module(
    inputs=dict(
        values=input_placeholder.values,
        indices=input_placeholder.indices,
        dense_shape=input_placeholder.dense_shape))

values, indices, dense_shape = process_to_IDs_in_sparse_format(sp, sentences)

with tf.Session() as session:
  session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  message_embeddings = session.run(
      embeddings,
      feed_dict={input_placeholder.values: values,
                 input_placeholder.indices: indices,
                 input_placeholder.dense_shape: dense_shape})

print(message_embeddings)

# The following are example embedding outputs of 512 dimensions per sentence
# Embedding for: The quick brown fox jumps over the lazy dog.
# [0.0560572519898, 0.0534118898213, -0.0112254749984, ...]
# Embedding for: I am a sentence for which I would like to get its embedding.
# [-0.0343746766448, -0.0529498048127, 0.0469399243593, ...]
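
The result is a NumPy array with one 512-dimensional row per input sentence; a quick sanity check of the shape (values will of course differ from the ones shown above):

print(message_embeddings.shape)  # (2, 512): one row per input sentence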

Please see the Universal Sentence Encoder module for details, and see this notebook for code examples.

Changelog

Version 1

  • Initial release.

Version 2

  • Exposed internal variables as trainable.
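
Since version 2 the module's internal variables can therefore be fine-tuned as part of a larger graph; a minimal sketch, using the standard TF-Hub trainable flag and with 'hub' imported as in the examples above:

module = hub.Module(
    "https://tfhub.dev/google/universal-sentence-encoder-lite/2",
    trainable=True)  # gradients can now flow into the module's variables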

References

[1] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Céspedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil. Universal Sentence Encoder. arXiv:1803.11175, 2018.