tfm.nlp.layers.TransformerEncoderBlock

TransformerEncoderBlock layer.

This layer implements the Transformer Encoder from "Attention Is All You Need" (https://arxiv.org/abs/1706.03762), which combines a tf.keras.layers.MultiHeadAttention layer with a two-layer feedforward network.

References:
Attention Is All You Need (https://arxiv.org/abs/1706.03762)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/abs/1810.04805)
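
A minimal usage sketch, assuming the Model Garden package is installed and imported as tensorflow_models; the head count, inner dimension, and tensor shapes below are illustrative, not defaults:

```python
import tensorflow as tf
import tensorflow_models as tfm  # assumes the Model Garden (tf-models-official) package

# One encoder block: 8 attention heads, 2048-unit inner feedforward layer.
block = tfm.nlp.layers.TransformerEncoderBlock(
    num_attention_heads=8,
    inner_dim=2048,
    inner_activation="relu",
)

# A batch of 2 sequences of length 16 with 512-dimensional embeddings.
embeddings = tf.random.uniform((2, 16, 512))
outputs = block(embeddings)
print(outputs.shape)  # (2, 16, 512) -- same dimensions as the input
```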

Args

num_attention_heads: Number of attention heads.
inner_dim: The output dimension of the first Dense layer in a two-layer feedforward network.
inner_activation: The activation for the first Dense layer in a two-layer feedforward network.
output_range: The sequence output range, [0, output_range), for slicing the target sequence. None means the target sequence is not sliced (see the sketch after this list).
kernel_initializer: Initializer for dense layer kernels.
bias_initializer: Initializer for dense layer biases.
kernel_regularizer: Regularizer for dense layer kernels.
bias_regularizer: Regularizer for dense layer biases.
activity_regularizer: Regularizer for dense layer activity.
kernel_constraint: Constraint for dense layer kernels.
bias_constraint: Constraint for dense layer biases.
use_bias: Whether to use bias in the attention layer. If False, bias is disabled in the attention layer.
norm_first: Whether to normalize the inputs to the attention and intermediate dense layers (pre-layer normalization). If False, the outputs of the attention and intermediate dense layers are normalized instead (post-layer normalization).
norm_epsilon: Epsilon value used to initialize the normalization layers.
output_dropout: Dropout probability for the post-attention and output dropout.
attention_dropout: Dropout probability within the attention layer.
inner_dropout: Dropout probability for the first Dense layer in a two-layer feedforward network.
attention_initializer: Initializer for the kernels of the attention layers. If set to None, the attention layers use kernel_initializer as their kernel initializer.
attention_axes: Axes over which the attention is applied. None means attention over all axes except batch, heads, and features.
use_query_residual: Toggle to execute a residual connection after the attention layer.
key_dim: key_dim for the tf.keras.layers.MultiHeadAttention. If None, the last dimension of the first input shape is used.
value_dim: value_dim for the tf.keras.layers.MultiHeadAttention.
output_last_dim: Final dimension of this module's output, which also dictates the final dimension of the multi-head attention. When None, we use, in order of decreasing precedence, key_dim * num_heads or the last dimension of the first input shape as the output's last dimension.
diff_q_kv_att_layer_norm: If True, create a separate attention layer norm for the query and the key-value inputs when norm_first is True. Invalid to set to True if norm_first is False.
**kwargs: Keyword arguments.
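
For illustration, a pre-layer-norm configuration that also uses output_range to compute outputs only for target positions [0, 1); the specific values are assumptions, not defaults:

```python
import tensorflow as tf
import tensorflow_models as tfm

block = tfm.nlp.layers.TransformerEncoderBlock(
    num_attention_heads=8,
    inner_dim=3072,
    inner_activation="gelu",
    norm_first=True,      # normalize inputs to the attention and feedforward sublayers
    output_range=1,       # slice the target sequence to [0, 1)
    output_dropout=0.1,
    attention_dropout=0.1,
)

x = tf.random.uniform((2, 16, 768))
y = block(x)
print(y.shape)  # (2, 1, 768) -- only the first target position is produced
```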

Methods

call

Transformer self-attention encoder block call.

Args
inputs: A single tensor or a list of tensors. A single input tensor is the sequence of embeddings. [input tensor, attention mask] adds an attention mask. [query tensor, key value tensor, attention mask] provides separate input streams for the query and the key/value to the multi-head attention (see the sketch below).

Returns
An output tensor with the same dimensions as input/query tensor.
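
A sketch of the list-style input forms described above; the mask and shape values are illustrative (mask entries: 1 = attend, 0 = mask out):

```python
import tensorflow as tf
import tensorflow_models as tfm

block = tfm.nlp.layers.TransformerEncoderBlock(
    num_attention_heads=4, inner_dim=1024, inner_activation="relu")

batch, seq_len, hidden = 2, 8, 256
embeddings = tf.random.uniform((batch, seq_len, hidden))

# [input tensor, attention mask]: mask has shape [batch, target_len, source_len].
mask = tf.ones((batch, seq_len, seq_len))
out = block([embeddings, mask])
print(out.shape)  # (2, 8, 256)

# [query tensor, key value tensor, attention mask]: separate query and
# key/value streams; the output keeps the query's sequence length.
query = tf.random.uniform((batch, 4, hidden))
kv_mask = tf.ones((batch, 4, seq_len))
out_q = block([query, embeddings, kv_mask])
print(out_q.shape)  # (2, 4, 256)
```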