

Transformer scaffold layer.

This layer implements the Transformer block from "Attention Is All You Need" (https://arxiv.org/abs/1706.03762), with customizable attention and feedforward sub-layers. Users can either pass a class to attention_cls/feedforward_cls together with an associated config in attention_cfg/feedforward_cfg, in which case the scaffold instantiates the class with that config, or pass a layer instance directly to attention_cls/feedforward_cls.

Args:
  num_attention_heads: Number of attention heads.
  intermediate_size: Size of the intermediate (feedforward) layer.
  intermediate_activation: Activation for the intermediate layer.
  attention_cls: A class used to instantiate the attention layer, or a layer instance.
  attention_cfg: The config with which to instantiate attention_cls. Ignored if attention_cls is a layer instance or None. If attention_cls is a class but attention_cfg is None, the following kwargs are used to instantiate the attention instance: {"num_heads": num_attention_heads, "key_dim": int(hidden_size // num_attention_heads), "dropout": attention_dropout_rate, "name": "self_attention"}, where hidden_size is the input tensor's last dimension.
  feedforward_cls: A class used to instantiate the feedforward layer, or a layer instance. If None, the standard feedforward layer described in the "Attention Is All You Need" paper is used. If not None, the instantiated feedforward layer is expected to take the attention output as input, and its output becomes this transformer layer's output.
  feedforward_cfg: The config with which to instantiate feedforward_cls. Ignored if feedforward_cls is a layer instance or None. If feedforward_cls is a class but feedforward_cfg is None, the following kwargs are used to instantiate the feedforward instance: {"intermediate_size": intermediate_size, "intermediate_activation": intermediate_activation, "dropout": dropout_rate, "name": "feedforward"}.
  dropout_rate: Dropout probability for the post-attention and output dropout.
  attention_dropout_rate: Dropout probability within the attention layer.
  kernel_initializer: Initializer for dense layer kernels.
  bias_initializer: Initializer for dense layer biases.
  kernel_regularizer: Regularizer for dense layer kernels.
  bias_regularizer: Regularizer for dense layer biases.
  activity_regularizer: Regularizer for dense layer activity.
  kernel_constraint: Constraint for dense layer kernels.
  bias_constraint: Constraint for dense layer biases.
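The class-versus-instance rule for attention_cls/feedforward_cls, and the default kwargs used when the config is None, can be sketched in plain Python. This is an illustrative sketch only: resolve_sublayer is a hypothetical helper written for this example, not the library's actual code.

```python
import inspect


def resolve_sublayer(cls_or_instance, cfg, default_kwargs):
    """Hypothetical helper illustrating the documented dispatch rule.

    A layer instance is used as-is (cfg is ignored); a class is
    instantiated with cfg, or with default_kwargs when cfg is None;
    None means the scaffold falls back to its built-in layer.
    """
    if cls_or_instance is None:
        return None  # caller falls back to the default layer
    if not inspect.isclass(cls_or_instance):
        return cls_or_instance  # already an instance; cfg is ignored
    return cls_or_instance(**(cfg if cfg is not None else default_kwargs))


# Default attention kwargs per the docs, for hidden_size=768 and 12 heads.
hidden_size, num_attention_heads, attention_dropout_rate = 768, 12, 0.1
default_attention_kwargs = {
    "num_heads": num_attention_heads,
    "key_dim": int(hidden_size // num_attention_heads),
    "dropout": attention_dropout_rate,
    "name": "self_attention",
}
```

With hidden_size=768 and 12 heads, key_dim works out to 64, matching the per-head dimension of the original Transformer base model.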




This is where the layer's logic lives.

Note that the call() method in tf.keras differs slightly from the Keras API: in the Keras API, you can pass masking support to layers as additional arguments, whereas tf.keras provides a compute_mask() method to support masking.

Args:
  inputs: Input tensor, or list/tuple of input tensors.
  *args: Additional positional arguments. Currently unused.
  **kwargs: Additional keyword arguments. Currently unused.

Returns:
  A tensor or list/tuple of tensors.