Module google/nnlm-de-dim50/1

Token based text embedding trained on German Google News 30B corpus.

Module URL: https://tfhub.dev/google/nnlm-de-dim50/1

Overview

Text embedding based on feed-forward Neural-Net Language Models[1] with pre-built OOV. Maps from text to 50-dimensional embedding vectors.

Example use

embed = hub.Module("https://tfhub.dev/google/nnlm-de-dim50/1")
embeddings = embed(["cat is on the mat", "dog is in the fog"])

Details

Based on NNLM with two hidden layers.

Input

The module takes a batch of sentences in a 1-D tensor of strings as input.

Preprocessing

The module preprocesses its input by splitting on spaces.

Out of vocabulary tokens

Small fraction of the least frequent tokens and embeddings (~2.5%) are replaced by hash buckets. Each hash bucket is initialized using the remaining embedding vectors that hash to the same bucket.

References

[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137-1155, 2003.