Sparse MoE layer plus a FeedForward layer evaluated for all tokens.

Uses Keras add_loss() and add_metric() APIs.

moe Instance of MoeLayer with experts and router.
backbone_d_ff Dimension of feed-forward layer of a lightweight backbone, which is evaluated for all tokens.
inner_dropout The dropout probability to be applied after intermediate activations for the backbone.
output_dropout The dropout probability to be applied after the output of the backbone.
activation (Nonlinear) transform applied in the backbone.
kernel_initializer Initialization scheme for kernels in the backbone.
bias_initializer Initialization scheme for biases in the backbone.
name Layer name.
**kwargs Forwarded to the parent class constructor.




Applies MoeLayerWithBackbone layer.

inputs Batch of input embeddings of shape [batch_size, seq_length, hidden_dim].
training Whether the layer runs in training mode. Dropout and jitter noise are applied only during training. If not provided, the value is taken from tf.keras.backend.

Transformed inputs with the same shape as inputs: [batch_size, seq_length, hidden_dim].
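
The shape contract above can be illustrated with a small NumPy sketch. This is not the library's implementation. It assumes the sparse branch and the backbone are combined by addition, it uses a hypothetical top-1 router with no capacity limits, and it omits dropout, jitter noise, biases, and the auxiliary losses and metrics. All function names (toy_top1_moe, toy_moe_with_backbone, feed_forward) and the sizes are made up for illustration.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU, used here as the example nonlinearity.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def feed_forward(x, w_in, w_out):
    # Position-wise feed-forward block: [..., hidden_dim] -> [..., hidden_dim].
    return gelu(x @ w_in) @ w_out


def toy_top1_moe(x, router_w, experts):
    # Hypothetical stand-in for the MoE branch: each token is sent to the
    # single expert with the highest router logit (no capacity limits).
    choice = (x @ router_w).argmax(axis=-1)  # [batch_size, seq_length]
    out = np.zeros_like(x)
    for index, (w_in, w_out) in enumerate(experts):
        routed = choice == index  # boolean mask over tokens
        out[routed] = feed_forward(x[routed], w_in, w_out)
    return out


def toy_moe_with_backbone(x, router_w, experts, backbone):
    # Assumption: the sparse branch and the dense backbone are summed.
    # The backbone runs on every token, the experts only on routed tokens.
    return toy_top1_moe(x, router_w, experts) + feed_forward(x, *backbone)


rng = np.random.default_rng(0)
batch_size, seq_length, hidden_dim = 2, 5, 8
expert_d_ff, backbone_d_ff, num_experts = 16, 4, 3

x = rng.normal(size=(batch_size, seq_length, hidden_dim))
router_w = rng.normal(size=(hidden_dim, num_experts))
experts = [
    (rng.normal(size=(hidden_dim, expert_d_ff)),
     rng.normal(size=(expert_d_ff, hidden_dim)))
    for _ in range(num_experts)
]
backbone = (
    rng.normal(size=(hidden_dim, backbone_d_ff)),
    rng.normal(size=(backbone_d_ff, hidden_dim)),
)

y = toy_moe_with_backbone(x, router_w, experts, backbone)
print(y.shape)  # (2, 5, 8), the same shape as x
```

In this sketch the backbone is deliberately narrower than the experts (backbone_d_ff smaller than expert_d_ff), mirroring the description of a lightweight backbone that every token passes through regardless of how the router assigns it.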