

Bi-directional Transformer-based encoder network.

This network implements a bi-directional Transformer-based encoder as described in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (https://arxiv.org/abs/1810.04805). It includes the embedding lookups and transformer layers, but not the masked language model or classification task networks.

The default values for this object are taken from the BERT-Base implementation in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
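As a rough illustration of what BERT-Base sizing implies, the sketch below records the BERT-Base hyperparameters reported in the paper (12 layers, hidden size 768, 12 attention heads, feed-forward size 3072) and derives the per-head width. The dictionary keys reuse this encoder's argument names for readability, and `attention_head_size` is a hypothetical helper, not part of the library.

```python
# BERT-Base hyperparameters as reported in the BERT paper (L=12, H=768, A=12).
# Assumption: keys mirror this encoder's argument names for readability only.
BERT_BASE = {
    "hidden_size": 768,
    "num_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3072,  # feed-forward width, 4 * hidden_size
}


def attention_head_size(hidden_size: int, num_attention_heads: int) -> int:
    """Hypothetical helper: per-head width, or ValueError if the split is uneven."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError(
            f"hidden_size ({hidden_size}) must be divisible by "
            f"num_attention_heads ({num_attention_heads})"
        )
    return hidden_size // num_attention_heads


head_size = attention_head_size(
    BERT_BASE["hidden_size"], BERT_BASE["num_attention_heads"]
)
print(head_size)  # 64
```

Checking divisibility up front mirrors the constraint stated for num_attention_heads and surfaces a bad configuration before any model is built.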

vocab_size The size of the token vocabulary.
hidden_size The size of the transformer hidden layers.
num_layers The number of transformer layers.
num_attention_heads The number of attention heads for each transformer. The hidden size must be divisible by the number of attention heads.
sequence_length [Deprecated]. Kept only because some users still pass it.
max_sequence_length The maximum sequence length that this encoder can consume. If None, max_sequence_length takes the value of sequence_length. This determines the variable shape for positional embeddings.
type_vocab_size The number of types that the 'type_ids' input can take.
intermediate_size The intermediate size for the transformer layers.
activation The activation to use for the transformer layers.
dropout_rate The dropout rate to use for the transformer layers.
attention_dropout_rate The dropout rate to use for the attention layers within the transformer layers.
initializer The initializer to use for all weights in this encoder.
return_all_encoder_outputs Whether to output sequence embedding outputs of all encoder transformer layers. Note: when the following dict_outputs argument is True, all encoder outputs are always returned in the dict, keyed by encoder_outputs.
output_range The sequence output range, [0, output_range), obtained by slicing the target sequence of the last transformer layer. None means the entire target sequence will attend to the source sequence, which yields the full output.
embedding_width The width of the word embeddings. If the embedding width is not equal to the hidden size, the embedding parameters will be factorized into two matrices of shape ['vocab_size', 'embedding_width'] and ['embedding_width', 'hidden_size'].
embedding_layer The word embedding layer. None means a new embedding layer will be created. Otherwise, the given embedding layer is reused. This parameter was originally added for the ELECTRA model, which needs to tie the generator embeddings with the discriminator embeddings.
dict_outputs Whether to use a dictionary as the model outputs.
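Two of the arguments above, embedding_width and output_range, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: both helpers are hypothetical, the count covers only the two matrices named in the description (no bias terms), treating a missing embedding_width as equal to hidden_size is an assumption, and 30522 is used only as an example vocabulary size (that of the uncased English BERT WordPiece vocabulary). The slicing helper shows only the resulting output, not how the last layer restricts its computation.

```python
def embedding_param_count(vocab_size, hidden_size, embedding_width=None):
    """Hypothetical helper: count word-embedding weights.

    Assumption: embedding_width=None is treated as equal to hidden_size.
    """
    if embedding_width is None or embedding_width == hidden_size:
        # Single [vocab_size, hidden_size] lookup table.
        return vocab_size * hidden_size
    # Factorized: a [vocab_size, embedding_width] table plus an
    # [embedding_width, hidden_size] projection matrix.
    return vocab_size * embedding_width + embedding_width * hidden_size


def slice_output_range(sequence_output, output_range=None):
    """Hypothetical helper: keep positions [0, output_range) of each sequence.

    sequence_output is a nested list shaped [batch, tokens, width].
    None keeps the full sequence.
    """
    if output_range is None:
        return sequence_output
    return [tokens[:output_range] for tokens in sequence_output]


full = embedding_param_count(30522, 768)
factorized = embedding_param_count(30522, 768, embedding_width=128)
print(full, factorized)  # 23440896 4005120

# One sequence of three token vectors, each of width 2.
outputs = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]]
print(slice_output_range(outputs, output_range=1))  # [[[0.1, 0.2]]]
```

With a narrow embedding_width the lookup table dominates the count, so factorization shrinks the embedding parameters substantially. An output_range of 1 keeps only the first position of each sequence.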

pooler_layer The pooler dense layer after the transformer layers.
transformer_layers List of Transformer layers in the encoder.



Calls the model on new inputs.

In this case, call just reapplies all ops in the graph to the new inputs (i.e. it builds a new computational graph from the provided inputs).

inputs A tensor or list of tensors.
training Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask A mask or list of masks. A mask can be either a tensor or None (no mask).

A tensor if there is a single output, or a list of tensors if there is more than one output.