Module google/elmo/2

Embeddings from a language model trained on the 1 Billion Word Benchmark.

Module URL:


Computes contextualized word representations using character-based word representations and bidirectional LSTMs, as described in the paper "Deep contextualized word representations" [1].

This modules supports inputs both in the form of raw text strings or tokenized text strings.

The module outputs fixed embeddings at each LSTM layer, a learnable aggregation of the 3 layers, and a fixed mean-pooled vector representation of the input.

The complex architecture achieves state of the art results on several benchmarks. Note that this is a very computationally expensive module compared to word embedding modules that only perform embedding lookups. The use of an accelerator is recommended.

Trainable parameters

The module exposes 4 trainable scalar weights for layer aggregation.

Example use

elmo = hub.Module("", trainable=True)
embeddings = elmo(
    ["the cat is on the mat", "dogs are in the fog"],
elmo = hub.Module("", trainable=True)
tokens_input = [["the", "cat", "is", "on", "the", "mat"],
                ["dogs", "are", "in", "the", "fog", ""]]
tokens_length = [6, 5]
embeddings = elmo(
        "tokens": tokens_input,
        "sequence_len": tokens_length

We set the trainable parameter to True when creating the module so that the 4 scalar weights (as described in the paper) can be trained. In this setting, the module still keeps all other parameters fixed.


The module defines two signatures: default, and tokens.

With the default signature, the module takes untokenized sentences as input. The input tensor is a string tensor with shape [batch_size]. The module tokenizes each string by splitting on spaces.

With the tokens signature, the module takes tokenized sentences as input. The input tensor is a string tensor with shape [batch_size, max_length] and an int32 tensor with shape [batch_size] corresponding to the sentence length. The length input is necessary to exclude padding in the case of sentences with varying length.


The output dictionary contains:

  • word_emb: the character-based word representations with shape [batch_size, max_length, 512].
  • lstm_outputs1: the first LSTM hidden state with shape [batch_size, max_length, 1024].
  • lstm_outputs2: the second LSTM hidden state with shape [batch_size, max_length, 1024].
  • elmo: the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024]
  • default: a fixed mean-pooling of all contextualized word representations with shape [batch_size, 1024].


Version 1

  • Initial release.

Version 2

  • Restricted trainable variables to the 4 scalar weights as described in the paper.


[1] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.