Module google/nnlm-de-dim128/1

Token-based text embedding trained on the German Google News 30B corpus.

Module URL: https://tfhub.dev/google/nnlm-de-dim128/1

Overview

Text embedding based on feed-forward Neural-Net Language Models [1] with pre-built OOV handling. Maps from text to 128-dimensional embedding vectors.

Example use

import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/nnlm-de-dim128/1")
embeddings = embed(["cat is on the mat", "dog is in the fog"])
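hub.Module builds graph tensors in the TF1 style, so evaluating the embeddings tensor above takes a session in which the module's variables and lookup tables have been initialized. A minimal sketch, continuing from the snippet above:

import tensorflow as tf

with tf.Session() as sess:
    # The module ships with trained variables and a vocabulary lookup table;
    # both need explicit initialization in a TF1 graph before evaluation.
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(embeddings)  # shape (2, 128): one vector per input sentence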

Details

Based on NNLM with two hidden layers.

Input

The module takes a batch of sentences in a 1-D tensor of strings as input.

Preprocessing

The module preprocesses its input by splitting on spaces.
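As an illustration of that tokenization, the sketch below reproduces a plain space split with tf.string_split; whether the module applies any normalization beyond this split is not stated here.

import tensorflow as tf

sentences = tf.constant(["cat is on the mat"])
# A plain split on the space character, one token per whitespace-separated word.
tokens = tf.string_split(sentences, delimiter=" ")
# Evaluating tokens.values yields [b"cat", b"is", b"on", b"the", b"mat"].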

Out of vocabulary tokens

A small fraction of the least frequent tokens and embeddings (~2.5%) is replaced by hash buckets. Each hash bucket is initialized using the remaining embedding vectors that hash to the same bucket.
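The sketch below illustrates the general idea of hash-bucketed OOV tokens: an unseen token is deterministically hashed into one of a fixed number of buckets and picks up that bucket's embedding row. The bucket count and hash function here are placeholders, not the module's actual values.

import tensorflow as tf

num_oov_buckets = 1024  # hypothetical bucket count, for illustration only

# An out-of-vocabulary token is hashed to a bucket id and shares that
# bucket's embedding vector with every other token hashing to it.
oov_token = tf.constant(["blaukraut"])
bucket_id = tf.string_to_hash_bucket_fast(oov_token, num_oov_buckets)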

Sentence embeddings

Word embeddings are combined into a sentence embedding using the sqrtn combiner (see tf.nn.embedding_lookup_sparse).
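With unit weights, the sqrtn combiner sums the word vectors of a sentence and divides by the square root of the number of tokens. A toy sketch with a stand-in 4-dimensional table (the module's real table is 128-dimensional):

import numpy as np
import tensorflow as tf

# Stand-in embedding table: 10 tokens, 4 dimensions.
table = tf.constant(np.random.rand(10, 4), dtype=tf.float32)

# One sentence (row 0 of the sparse batch) made of token ids 2, 5 and 7.
ids = tf.SparseTensor(indices=[[0, 0], [0, 1], [0, 2]],
                      values=tf.constant([2, 5, 7], dtype=tf.int64),
                      dense_shape=[1, 3])

# combiner="sqrtn" with implicit unit weights: sum the three word vectors,
# then divide by sqrt(3) to obtain the sentence embedding.
sentence = tf.nn.embedding_lookup_sparse(table, ids, None, combiner="sqrtn")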

References

[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137-1155, 2003.