Transformer layer that strides over inputs, used as a packing optimization.
Inherits From: TransformerEncoderBlock
tfm.nlp.layers.StridedTransformerEncoderBlock(
*args, **kwargs
)
Args

num_attention_heads: Number of attention heads.
inner_dim: The output dimension of the first Dense layer in a two-layer feedforward network.
inner_activation: The activation for the first Dense layer in a two-layer feedforward network.
output_range: The sequence output range, [0, output_range), for slicing the target sequence. None means the target sequence is not sliced.
kernel_initializer: Initializer for dense layer kernels.
bias_initializer: Initializer for dense layer biases.
kernel_regularizer: Regularizer for dense layer kernels.
bias_regularizer: Regularizer for dense layer biases.
activity_regularizer: Regularizer for dense layer activity.
kernel_constraint: Constraint for dense layer kernels.
bias_constraint: Constraint for dense layer biases.
use_bias: Whether to use a bias term in the attention layer. If False, the attention layer is built without bias.
norm_first: Whether to normalize the inputs to the attention and intermediate dense layers. If False, the outputs of the attention and intermediate dense layers are normalized instead.
norm_epsilon: Epsilon value used to initialize normalization layers.
output_dropout: Dropout probability for the post-attention and output dropout.
attention_dropout: Dropout probability within the attention layer.
inner_dropout: Dropout probability for the first Dense layer in a two-layer feedforward network.
attention_initializer: Initializer for kernels of attention layers. If None, attention layers use kernel_initializer as their kernel initializer.
attention_axes: Axes over which attention is applied. None means attention over all axes except batch, heads, and features.
use_query_residual: Toggle to execute a residual connection after attention.
key_dim: key_dim for the tf.keras.layers.MultiHeadAttention. If None, the last dim of the first input_shape is used.
value_dim: value_dim for the tf.keras.layers.MultiHeadAttention.
output_last_dim: Final dimension of the output of this module. This also dictates the final dimension of the multi-head attention output. When None, we use, in order of decreasing precedence, key_dim * num_heads or the last dim of the first input_shape as the output's last dim.
diff_q_kv_att_layer_norm: If True, create a separate attention layer norm for the query and for the key-value streams when norm_first is True. It is invalid to set this to True when norm_first is False.
return_attention_scores: If True, the output of this layer is a tuple that additionally contains the attention scores, with shape [batch_size, num_attention_heads, seq_dim, seq_dim].
**kwargs: Keyword arguments.
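A minimal construction sketch follows. It assumes the tensorflow_models (tf-models-official) package is installed and importable as tfm; the hyperparameter values are illustrative assumptions, not recommendations.

import tensorflow_models as tfm

# Illustrative configuration; values below are assumptions for the sketch.
block = tfm.nlp.layers.StridedTransformerEncoderBlock(
    num_attention_heads=8,
    inner_dim=2048,
    inner_activation="gelu",
    output_dropout=0.1,
    attention_dropout=0.1,
    norm_first=True,
)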
Methods
call
call(
inputs, stride: tf.Tensor
)
Transformer self-attention encoder block call.
Args

inputs: A single tensor or a list of tensors. A single input tensor is treated as the sequence of embeddings. [input tensor, attention mask] adds an attention mask. [query tensor, key value tensor, attention mask] provides separate input streams for the query and for the key/value of the multi-head attention.
output_range: The sequence output range, [0, output_range), for slicing the target sequence. None means the target sequence is not sliced. To leave model training unchanged, it is better to set output_range only for serving.

Returns

An output tensor with the same dimensions as the input/query tensor.
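A minimal call sketch, restating the construction above for completeness; the input shapes and the stride value are illustrative assumptions.

import tensorflow as tf
import tensorflow_models as tfm

block = tfm.nlp.layers.StridedTransformerEncoderBlock(
    num_attention_heads=8, inner_dim=2048, inner_activation="gelu")

# Illustrative packed-input shapes: [batch, seq_len, hidden] for the
# embeddings and [batch, seq_len, seq_len] for the attention mask.
embeddings = tf.random.uniform((2, 128, 512))
attention_mask = tf.ones((2, 128, 128))

# `stride` is passed as a tensor; it controls how the block strides over
# the packed input sequence.
outputs = block([embeddings, attention_mask], stride=tf.constant(2))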