Transformer layer.

This layer implements the Transformer from "Attention Is All You Need" (https://arxiv.org/abs/1706.03762).

num_attention_heads Number of attention heads.
intermediate_size Size of the intermediate layer.
intermediate_activation Activation for the intermediate layer.
dropout_rate Dropout probability for the post-attention and output dropout.
attention_dropout_rate Dropout probability within the attention layer.
output_range The range of the sequence output, [0, output_range), obtained by slicing the target sequence. None means the target sequence is not sliced.
kernel_initializer Initializer for dense layer kernels.
bias_initializer Initializer for dense layer biases.
kernel_regularizer Regularizer for dense layer kernels.
bias_regularizer Regularizer for dense layer biases.
activity_regularizer Regularizer for dense layer activity.
kernel_constraint Constraint for dense layer kernels.
bias_constraint Constraint for dense layer biases.
use_bias Whether to use bias in the attention layer. If False, the attention layer has no bias terms.
norm_first Whether to normalize the inputs to the attention and intermediate dense layers. If False, the outputs of the attention and intermediate dense layers are normalized instead.
norm_epsilon Epsilon value to initialize normalization layers.
intermediate_dropout Dropout probability for intermediate_dropout_layer.
attention_initializer Initializer for kernels of attention layers. If None, attention layers use kernel_initializer for their kernels.
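
The effect of norm_first can be illustrated with a minimal pure-Python sketch of one residual sub-block. This is not the layer's actual code: layer_norm, residual_block, and double are hypothetical helpers, the normalization has no learned scale or shift, dropout is omitted, and the eps value is only an illustrative stand-in for norm_epsilon.

```python
import math


def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x, sublayer, norm_first, eps=1e-12):
    """Apply one residual sub-block in pre-norm or post-norm order."""
    if norm_first:
        # Pre-norm: normalize the input, apply the sublayer, then add the residual.
        y = sublayer(layer_norm(x, eps))
        return [a + b for a, b in zip(x, y)]
    # Post-norm: apply the sublayer, add the residual, then normalize the sum.
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)], eps)


def double(v):
    """Hypothetical stand-in for an attention or dense sublayer."""
    return [2.0 * t for t in v]


x = [1.0, 2.0, 3.0, 4.0]
pre = residual_block(x, double, norm_first=True)    # output is not normalized
post = residual_block(x, double, norm_first=False)  # output is normalized
```

With norm_first=True the residual path carries the unnormalized input through, so the block output keeps the input's mean; with norm_first=False the output itself is normalized.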



Transformer self-attention encoder block call.

inputs A single tensor or a list of tensors. Pass a single input tensor for one sequence of embeddings; [input tensor, attention mask] to add an attention mask; or [query tensor, key value tensor, attention mask] to give the multi-head attention separate input streams for the query and the key/value.

An output tensor with the same dimensions as the input/query tensor.
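
The three accepted input forms, and the way output_range trims the target sequence, can be sketched roughly as follows. This is an assumption-laden illustration rather than the layer's real code: unpack_inputs and slice_target are hypothetical helpers, sequences are unbatched, and nested tuples stand in for tensors so that a Python list always means a bundle of inputs.

```python
def unpack_inputs(inputs):
    """Split the call argument into (query, key_value, attention_mask)."""
    if not isinstance(inputs, list):
        # Single tensor: self-attention without a mask.
        return inputs, inputs, None
    if len(inputs) == 2:
        # [input tensor, attention mask]: self-attention with a mask.
        query, mask = inputs
        return query, query, mask
    if len(inputs) == 3:
        # [query tensor, key value tensor, attention mask]: separate streams.
        query, key_value, mask = inputs
        return query, key_value, mask
    raise ValueError(f"Expected 1, 2, or 3 inputs, got {len(inputs)}")


def slice_target(query, mask, output_range):
    """Keep only positions [0, output_range) of the query and its mask rows."""
    if output_range is None:
        return query, mask
    sliced_mask = mask[:output_range] if mask is not None else None
    return query[:output_range], sliced_mask


seq = ((0.1, 0.2), (0.3, 0.4), (0.5, 0.6))  # 3 positions, width 2
mask = ((1, 1, 1), (1, 1, 0), (1, 0, 0))    # indexed [query position][key position]

query, key_value, attn_mask = unpack_inputs([seq, mask])
target, target_mask = slice_target(query, attn_mask, output_range=1)
```

Here only the first query position is kept, while the key/value stream still covers all three positions, so the sliced target can attend to the full sequence.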