Transformer scaffold layer.

This layer implements the Transformer from "Attention Is All You Need", with a customizable attention layer and feedforward layer option. Users can either pass a class to attention_cls/feedforward_cls together with an associated config in attention_cfg/feedforward_cfg, in which case the scaffold instantiates the class with that config, or pass a layer instance directly to attention_cls/feedforward_cls.
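The class-or-instance convention can be sketched in plain Python. This is an illustrative sketch only, not the library's code: resolve_sublayer and ToyAttention are hypothetical names, and ToyAttention merely records its kwargs instead of computing attention. The default kwargs mirror those listed for attention_cfg, assuming hidden_size is 64 and attention_dropout_rate is 0.1.

```python
def resolve_sublayer(cls_or_instance, cfg, default_cfg):
    """Return a layer instance following the class-or-instance convention.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if isinstance(cls_or_instance, type):
        # A class was passed: instantiate it with cfg, or with the
        # default kwargs when cfg is None.
        kwargs = default_cfg if cfg is None else cfg
        return cls_or_instance(**kwargs)
    # Anything else is treated as a ready-made instance; cfg is ignored.
    return cls_or_instance


class ToyAttention:
    """Stand-in for an attention layer; it only records its kwargs."""

    def __init__(self, num_heads, key_dim, dropout=0.0, name=None):
        self.num_heads = num_heads
        self.key_dim = key_dim
        self.dropout = dropout
        self.name = name


# Assumed values for the sketch.
hidden_size, num_attention_heads = 64, 8
defaults = {
    "num_heads": num_attention_heads,
    "key_dim": int(hidden_size // num_attention_heads),
    "dropout": 0.1,
    "name": "self_attention",
}

# Class with no config: the default kwargs are used.
from_class = resolve_sublayer(ToyAttention, None, defaults)
# Class with a config: the config replaces the defaults.
from_cfg = resolve_sublayer(ToyAttention, {"num_heads": 4, "key_dim": 16}, defaults)
# Instance: returned unchanged, and the config is ignored.
prebuilt = ToyAttention(num_heads=2, key_dim=32)
from_instance = resolve_sublayer(prebuilt, {"num_heads": 99, "key_dim": 1}, defaults)
```

Passing an instance is convenient when the sublayer needs constructor arguments that a flat config dict cannot express; passing a class plus config keeps construction inside the scaffold.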

num_attention_heads Number of attention heads.
inner_dim The output dimension of the first Dense layer in a two-layer feedforward network.
inner_activation The activation for the first Dense layer in a two-layer feedforward network.
attention_cls A class to instantiate attention layer, or a layer instance.
attention_cfg The config with which to instantiate attention_cls. Ignored if attention_cls is a layer instance or None. If attention_cls is a class but attention_cfg is None, the following kwargs are used to instantiate the attention instance: { "num_heads": num_attention_heads, "key_dim": int(hidden_size // num_attention_heads), "dropout": attention_dropout_rate, "name": "self_attention" }, where hidden_size is the input tensor's last dimension.
feedforward_cls A class to instantiate feedforward layer, or a layer instance. If None, will use the standard feedforward layer as described in "Attention Is All You Need" paper. If not None, the instantiated feedforward layer is expected to take the output of attention as input and its output is this transformer layer's output.
feedforward_cfg The config with which to instantiate feedforward_cls. Ignored if feedforward_cls is a layer instance or None. If feedforward_cls is a class but feedforward_cfg is None, the following kwargs are used to instantiate the feedforward instance: { "inner_dim": inner_dim, "inner_activation": inner_activation, "dropout": dropout_rate, "name": "feedforward" }.
dropout_rate Dropout probability for the post-attention and output dropout.
attention_dropout_rate Dropout probability within the attention layer.
norm_first Whether to normalize inputs to attention and intermediate dense layers. If set to False, the outputs of the attention and intermediate dense layers are normalized instead.
norm_epsilon Epsilon value to initialize normalization layers.
kernel_initializer Initializer for dense layer kernels.
bias_initializer Initializer for dense layer biases.
kernel_regularizer Regularizer for dense layer kernels.
bias_regularizer Regularizer for dense layer biases.
activity_regularizer Regularizer for dense layer activity.
kernel_constraint Constraint for dense layer kernels.
bias_constraint Constraint for dense layer biases.
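The effect of norm_first on the order of operations can be illustrated with a small pure-Python sketch. It assumes the usual residual arrangement around each sublayer and simplifies heavily: layer_norm here has no learned scale or offset, dropout is omitted, and residual_block and the doubling sublayer are hypothetical names used only for this example.

```python
import math


def layer_norm(x, epsilon=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned params)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + epsilon) for v in x]


def residual_block(x, sublayer, norm_first):
    """Apply a sublayer with a residual connection in pre- or post-norm order."""
    if norm_first:
        # Pre-norm: normalize the input, apply the sublayer, then add the
        # un-normalized input back as the residual.
        return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]
    # Post-norm: apply the sublayer, add the residual, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def double(v):
    # Toy sublayer standing in for attention or the feedforward network.
    return [2.0 * t for t in v]


x = [1.0, 2.0, 3.0, 4.0]
pre = residual_block(x, double, norm_first=True)
post = residual_block(x, double, norm_first=False)
```

With norm_first=False the block output is itself normalized, so its mean is approximately zero; with norm_first=True the residual path carries the raw input through, so the output keeps the input's mean.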




This is where the layer's logic lives.

The call() method may not create state (except in its first invocation, wrapping the creation of variables or other resources in tf.init_scope()). It is recommended to create state, including tf.Variable instances and nested Layer instances, in __init__(), or in the build() method that is called automatically before call() executes for the first time.
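The build-once-then-call pattern described above can be sketched without any framework. ToyLayer below is a hypothetical stand-in, not a Keras class: it creates its state in build() the first time it is invoked and keeps call() free of state creation. The constant 0.5 kernel is an arbitrary choice for the example.

```python
class ToyLayer:
    """Minimal stand-in for the build-once-then-call pattern (not Keras)."""

    def __init__(self, units):
        # Configuration only; the input size is not known yet.
        self.units = units
        self.built = False
        self.kernel = None
        self.build_calls = 0

    def build(self, input_dim):
        # State is created here, once, when the input size is first seen.
        self.kernel = [[0.5] * self.units for _ in range(input_dim)]
        self.build_calls += 1
        self.built = True

    def call(self, inputs):
        # Pure computation: reads existing state but creates none.
        return [
            sum(x * row[j] for x, row in zip(inputs, self.kernel))
            for j in range(self.units)
        ]

    def __call__(self, inputs):
        # build() runs automatically before the first call() only.
        if not self.built:
            self.build(len(inputs))
        return self.call(inputs)


layer = ToyLayer(units=2)
first = layer([1.0, 2.0, 3.0])
second = layer([4.0, 5.0, 6.0])
```

Deferring state creation to build() lets the layer size its kernel from the first input it sees, while repeated invocations reuse the same state.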

inputs Input tensor, or dict/list/tuple of input tensors. The first positional inputs argument is subject to special rules:

  • inputs must be explicitly passed. A layer cannot have zero arguments, and inputs cannot be provided via the default value of a keyword argument.
  • NumPy array or Python scalar values in inputs get cast as tensors.
  • Keras mask metadata is only collected from inputs.
  • Layers are built (build(input_shape) method) using shape info from inputs only.
  • input_spec compatibility is only checked against inputs.
  • Mixed precision input casting is only applied to inputs. If a layer has tensor arguments in *args or **kwargs, their casting behavior in mixed precision should be handled manually.
  • The SavedModel input specification is generated using inputs only.
  • Integration with various ecosystem packages like TFMOT, TFLite, TF.js, etc. is only supported for inputs and not for tensors in positional and keyword arguments.
*args Additional positional arguments. May contain tensors, although this is not recommended, for the reasons above.
**kwargs Additional keyword arguments. May contain tensors, although this is not recommended, for the reasons above. The following optional keyword arguments are reserved:
  • training: Boolean scalar tensor or Python boolean indicating whether the call is meant for training or inference.
  • mask: Boolean input mask. If the layer's call() method takes a mask argument, its default value will be set to the mask generated for inputs by the previous layer (if input did come from a layer that generated a corresponding mask, i.e. if it came from a Keras layer with masking support).
Returns A tensor or list/tuple of tensors.