Feed-forward layer with multiple experts.

Note that call() takes inputs of shape [num_groups, num_experts, expert_capacity, hidden_dim], which differs from the usual [batch_size, seq_len, hidden_dim] used by the FeedForward layer.

The experts are independent FeedForward layers of the same shape, i.e. the kernel has shape [num_experts, hidden_dim, out_dim] rather than [hidden_dim, out_dim].
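As a sketch of how such a batched expert kernel applies (a minimal illustration in plain NumPy; the sizes and variable names are illustrative, not from the source):

```python
import numpy as np

# Illustrative sizes (not from the source).
num_groups, num_experts, expert_capacity = 2, 4, 3
hidden_dim, d_ff = 8, 16

rng = np.random.default_rng(0)
x = rng.standard_normal((num_groups, num_experts, expert_capacity, hidden_dim))

# One kernel per expert: shape [num_experts, hidden_dim, d_ff],
# versus [hidden_dim, d_ff] for a single FeedForward layer.
kernel = rng.standard_normal((num_experts, hidden_dim, d_ff))
bias = np.zeros((num_experts, d_ff))

# Each expert e multiplies only its own tokens by its own kernel:
# out[g, e, c, f] = sum_h x[g, e, c, h] * kernel[e, h, f]
out = np.einsum('gech,ehf->gecf', x, kernel) + bias[None, :, None, :]
assert out.shape == (num_groups, num_experts, expert_capacity, d_ff)
```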

Args:
  num_experts: Number of experts (i.e. the number of independent feed-forward blocks).
  d_ff: Dimension of the feed-forward layer of each expert.
  inner_dropout: Dropout probability applied after the intermediate activations.
  output_dropout: Dropout probability applied after the output layer.
  activation: (Nonlinear) transform applied in the layer.
  kernel_initializer: Initialization scheme for the kernel.
  bias_initializer: Initialization scheme for the bias.
  name: Layer name.
  **kwargs: Forwarded to super.




Applies the layer to inputs.

Args:
  inputs: Inputs of shape [num_groups, num_experts, expert_capacity, hidden_dim].
  training: Whether the layer is in training mode; dropout is only applied during training.

Returns:
  Transformed inputs with the same shape as inputs: [num_groups, num_experts, expert_capacity, hidden_dim].
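A minimal end-to-end sketch of the full expert feed-forward block described above (plain NumPy, with illustrative names and sizes, and ReLU standing in for the configurable activation), showing that the output keeps the input shape:

```python
import numpy as np

def experts_feed_forward(x, w_in, b_in, w_out, b_out):
    """Per-expert feed-forward; x has shape [groups, experts, capacity, hidden]."""
    # Inner projection: each expert uses its own [hidden_dim, d_ff] kernel.
    h = np.einsum('gech,ehf->gecf', x, w_in) + b_in[None, :, None, :]
    h = np.maximum(h, 0.0)  # ReLU stand-in for the configurable activation
    # Output projection back to hidden_dim, again per expert.
    return np.einsum('gecf,efh->gech', h, w_out) + b_out[None, :, None, :]

num_groups, num_experts, expert_capacity, hidden_dim, d_ff = 2, 4, 3, 8, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((num_groups, num_experts, expert_capacity, hidden_dim))
w_in = rng.standard_normal((num_experts, hidden_dim, d_ff))
w_out = rng.standard_normal((num_experts, d_ff, hidden_dim))
y = experts_feed_forward(x, w_in, np.zeros((num_experts, d_ff)),
                         w_out, np.zeros((num_experts, hidden_dim)))
assert y.shape == x.shape  # transformed inputs keep the input shape
```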