Token-based text embedding trained on the Chinese Google News 100B corpus.
Module URL: https://tfhub.dev/google/nnlm-zh-dim128/1
Text embedding based on feed-forward Neural-Net Language Models with pre-built out-of-vocabulary (OOV) handling. Maps text to 128-dimensional embedding vectors.
import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/nnlm-zh-dim128/1")
embeddings = embed(["cat is on the mat", "dog is in the fog"])
Based on NNLM with two hidden layers.
The module takes a batch of sentences in a 1-D tensor of strings as input.
The module preprocesses its input by splitting on spaces.
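The space-splitting step can be illustrated in plain Python. This is a minimal sketch, not the module's actual implementation: `split_on_spaces` is a hypothetical helper name, and dropping empty tokens is an assumption. Under this reading, Chinese text that has not been word-segmented and joined with spaces would reach the embedding lookup as a single token.

```python
def split_on_spaces(sentences):
    """Split each sentence in a batch into tokens on the space character.

    Hypothetical helper for illustration; skipping empty tokens is an assumption.
    """
    return [[tok for tok in s.split(" ") if tok] for s in sentences]


# A batch of sentences, mirroring the 1-D tensor of strings the module expects.
batch = ["cat is on the mat", "dog is in the fog"]
tokens = split_on_spaces(batch)

# Pre-segmented Chinese text yields one token per word;
# unsegmented text stays a single token.
segmented = split_on_spaces(["猫 在 垫子 上"])
unsegmented = split_on_spaces(["猫在垫子上"])
```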
Out-of-vocabulary tokens
A small fraction of the least frequent tokens and embeddings (~2.5%) is replaced by hash buckets. Each hash bucket is initialized using the remaining embedding vectors that hash to the same bucket.
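The hash-bucket scheme described above can be sketched as follows. Everything here is an assumption made for illustration: the module does not document its hash function, bucket count, or how vectors sharing a bucket are combined, so the sketch uses MD5, a caller-supplied `num_buckets`, and a simple mean. The helper names `bucket_for`, `lookup_id`, and `init_buckets` are hypothetical.

```python
import hashlib


def bucket_for(token, num_buckets):
    """Map a token to a stable bucket index (MD5 is an illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def lookup_id(token, vocab, num_buckets):
    """Return the vocabulary id, or an id in the hash-bucket range for OOV tokens."""
    if token in vocab:
        return vocab[token]
    return len(vocab) + bucket_for(token, num_buckets)


def init_buckets(removed, num_buckets, dim):
    """Initialize each bucket as the mean of the removed vectors that hash to it.

    Averaging is an assumption; empty buckets are left as zero vectors.
    """
    sums = [[0.0] * dim for _ in range(num_buckets)]
    counts = [0] * num_buckets
    for token, vec in removed.items():
        b = bucket_for(token, num_buckets)
        counts[b] += 1
        sums[b] = [s + v for s, v in zip(sums[b], vec)]
    return [[s / c for s in row] if c else row for row, c in zip(sums, counts)]


# Toy data: two kept tokens and two rare tokens whose vectors were removed.
vocab = {"cat": 0, "mat": 1}
removed = {"fog": [1.0, 3.0], "bog": [3.0, 5.0]}
buckets = init_buckets(removed, num_buckets=1, dim=2)
```

With a single bucket, both removed vectors land in the same slot, so the bucket starts as their element-wise mean. Any unseen token is then mapped deterministically to an id past the end of the kept vocabulary.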
Word embeddings are combined into sentence embedding using the
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137-1155, 2003.