tfm.nlp.layers.StridedTransformerEncoderBlock

Transformer layer that strides over inputs for packing optimization.

Inherits From: TransformerEncoderBlock

Args

num_attention_heads: Number of attention heads.
inner_dim: The output dimension of the first Dense layer in a two-layer feedforward network.
inner_activation: The activation for the first Dense layer in a two-layer feedforward network.
output_range: The sequence output range, [0, output_range), for slicing the target sequence. None means the target sequence is not sliced.
kernel_initializer: Initializer for dense layer kernels.
bias_initializer: Initializer for dense layer biases.
kernel_regularizer: Regularizer for dense layer kernels.
bias_regularizer: Regularizer for dense layer biases.
activity_regularizer: Regularizer for dense layer activity.
kernel_constraint: Constraint for dense layer kernels.
bias_constraint: Constraint for dense layer biases.
use_bias: Whether to use a bias term in the attention layer. If set to False, bias is disabled in the attention layer.
norm_first: Whether to normalize inputs to the attention and intermediate dense layers. If set to False, the output of the attention and intermediate dense layers is normalized instead.
norm_epsilon: Epsilon value used to initialize normalization layers.
output_dropout: Dropout probability for the post-attention and output dropout.
attention_dropout: Dropout probability within the attention layer.
inner_dropout: Dropout probability for the first Dense layer in a two-layer feedforward network.
attention_initializer: Initializer for kernels of attention layers. If set to None, attention layers use kernel_initializer as their kernel initializer.
attention_axes: Axes over which the attention is applied. None means attention over all axes except batch, heads, and features.
use_query_residual: Whether to apply a residual connection after attention.
key_dim: key_dim for the tf.keras.layers.MultiHeadAttention. If None, the first input_shape's last dim is used.
value_dim: value_dim for the tf.keras.layers.MultiHeadAttention.
output_last_dim: Final dimension of the output of this module. This also dictates the final dimension of the multi-head attention. When None, we use, in decreasing order of precedence, key_dim * num_heads or the first input_shape's last dim as the output's last dim.
diff_q_kv_att_layer_norm: If True, create a separate attention layer norm for the query and the key-value streams when norm_first is True. Invalid to set to True if norm_first is False.
return_attention_scores: If True, the output of this layer will be a tuple that additionally contains the attention scores, with shape [batch_size, num_attention_heads, seq_dim, seq_dim].
**kwargs: Keyword arguments.
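
As a quick illustration of how these constructor arguments fit together, the sketch below builds a block with arbitrary hyperparameter values; it assumes the tf-models-official package is installed and importable as tensorflow_models. The values shown are examples, not defaults.

```python
import tensorflow_models as tfm

# A minimal sketch: an 8-head block whose feedforward network expands to
# 2048 units. All hyperparameter values here are illustrative.
encoder_block = tfm.nlp.layers.StridedTransformerEncoderBlock(
    num_attention_heads=8,
    inner_dim=2048,
    inner_activation='relu',
    output_dropout=0.1,
    attention_dropout=0.1,
    norm_first=True)
```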

Methods

call

View source

Transformer self-attention encoder block call.

Args
inputs: A single tensor or a list of tensors. Pass an input tensor for a single sequence of embeddings, [input tensor, attention mask] to also provide an attention mask, or [query tensor, key value tensor, attention mask] to use separate input streams for the query and the key/value in the multi-head attention.
output_range: The sequence output range, [0, output_range), for slicing the target sequence. None means the target sequence is not sliced. To leave model training unchanged, it is better to set output_range only for serving.

Returns
An output tensor with the same dimensions as the input/query tensor.
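
For illustration, the sketch below exercises the input formats documented above using the parent TransformerEncoderBlock, from which this class inherits; the shapes, mask values, and hyperparameters are arbitrary assumptions for the example, not requirements.

```python
import tensorflow as tf
import tensorflow_models as tfm

batch_size, seq_len, hidden = 2, 16, 64  # illustrative shapes only
embeddings = tf.random.uniform((batch_size, seq_len, hidden))
# All-ones mask means "attend to every position"; real masks zero out padding.
attention_mask = tf.ones((batch_size, seq_len, seq_len))

block = tfm.nlp.layers.TransformerEncoderBlock(
    num_attention_heads=4, inner_dim=256, inner_activation='relu')

# Single tensor: one sequence of embeddings, no mask.
out = block(embeddings)

# [input tensor, attention mask]: same sequence plus an attention mask.
out = block([embeddings, attention_mask])

print(out.shape)  # (2, 16, 64): same dimensions as the input/query tensor
```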