Module google/elmo/1

Embeddings from a language model trained on the 1 Billion Word Benchmark.

Module URL:


Computes contextualized word representations using character-based word representations and bidirectional LSTMs, as described in the paper "Deep contextualized word representations" [1].

This modules supports inputs both in the form of raw text strings or tokenized text strings.

The module outputs fixed embeddings at each LSTM layer, a learnable aggregation of the 3 layers, and a fixed mean-pooled vector representation of the input.

The complex architecture achieves state of the art results on several benchmarks. Note that this is a very computationally expensive module compared to word embedding modules that only perform embedding lookups. The use of an accelerator is recommended.

Trainable parameters

The module exposes 4 trainable scalar weights for layer aggregation and all variables of the LSTM cells.

Example use

elmo = hub.Module("", trainable=True)
embeddings = elmo(
    ["the cat is on the mat", "dogs are in the fog"],
elmo = hub.Module("", trainable=True)
tokens_input = [["the", "cat", "is", "on", "the", "mat"],
                ["dogs", "are", "in", "the", "fog", ""]]
tokens_length = [6, 5]
embeddings = elmo(
        "tokens": tokens_input,
        "sequence_len": tokens_length

We set the trainable parameter to True when creating the module so that the 4 scalar weights (as described in the paper) and all LSTM cell variables can be trained. In this setting, the module still keeps all other parameters fixed.


The module defines two signatures: default, and tokens.

With the default signature, the module takes untokenized sentences as input. The input tensor is a string tensor with shape [batch_size]. The module tokenizes each string by splitting on spaces.

With the tokens signature, the module takes tokenized sentences as input. The input tensor is a string tensor with shape [batch_size, max_length] and an int32 tensor with shape [batch_size] corresponding to the sentence length. The length input is necessary to exclude padding in the case of sentences with varying length.


The output dictionary contains:

  • word_emb: the character-based word representations with shape [batch_size, max_length, 512].
  • lstm_outputs1: the first LSTM hidden state with shape [batch_size, max_length, 1024].
  • lstm_outputs2: the second LSTM hidden state with shape [batch_size, max_length, 1024].
  • elmo: the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024]
  • default: a fixed mean-pooling of all contextualized word representations with shape [batch_size, 1024].


[1] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.