Transformer block for MobileBERT.

An implementation of one layer (block) of the Transformer with bottleneck and inverted-bottleneck structures for MobileBERT.

Original paper for MobileBERT:

hidden_size Hidden size for the Transformer input and output tensor.
num_attention_heads Number of attention heads in the Transformer.
intermediate_size The size of the "intermediate" (a.k.a., feed forward) layer.
intermediate_act_fn The non-linear activation function to apply to the output of the intermediate/feed-forward layer.
hidden_dropout_prob Dropout probability for the hidden layers.
attention_probs_dropout_prob Dropout probability of the attention probabilities.
intra_bottleneck_size Size of bottleneck.
use_bottleneck_attention Whether to use the attention inputs from the bottleneck transformation. If True, key_query_shared_bottleneck is ignored.
key_query_shared_bottleneck Whether to share linear transformation for keys and queries.
num_feedforward_networks Number of stacked feed-forward networks.
normalization_type The type of normalization; only no_norm and layer_norm are supported. no_norm is the element-wise linear transformation used for the student model, as suggested by the original MobileBERT paper. layer_norm is used for the teacher model.
initializer The initializer to use for the embedding weights and linear projection weights.
**kwargs Additional keyword arguments.

ValueError A Tensor shape or parameter is invalid.
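
The two supported normalization_type values can be illustrated with a minimal pure-Python sketch on a single feature vector. This is not the layer's actual implementation: the helper names (no_norm, layer_norm, normalize), the eps value, and the use of plain lists instead of tensors are assumptions made for illustration. It shows that no_norm applies only an element-wise scale and bias, layer_norm additionally standardizes the vector first, and any other value raises ValueError.

```python
import math


def no_norm(x, scale, bias):
    # Element-wise linear transformation: no mean or variance is computed.
    return [s * v + b for v, s, b in zip(x, scale, bias)]


def layer_norm(x, scale, bias, eps=1e-12):
    # Standardize across the feature dimension, then scale and shift.
    # eps is an assumed small constant for numerical stability.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [s * (v - mean) * inv_std + b for v, s, b in zip(x, scale, bias)]


def normalize(x, normalization_type, scale, bias):
    # Hypothetical dispatcher mirroring the documented constraint that
    # only "no_norm" and "layer_norm" are supported.
    if normalization_type == "no_norm":
        return no_norm(x, scale, bias)
    if normalization_type == "layer_norm":
        return layer_norm(x, scale, bias)
    raise ValueError(f"Unsupported normalization_type: {normalization_type!r}")
```

Because no_norm skips the mean and variance computation, it is cheaper at inference time, which is the motivation for using it in the student model.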



Implements the forward pass.

input_tensor Float tensor of shape (batch_size, seq_length, hidden_size).
attention_mask (optional) int32 tensor of shape (batch_size, seq_length, seq_length), with 1 for positions that can be attended to and 0 for positions that cannot.
return_attention_scores Whether to return the attention scores.
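
One common way to obtain an attention_mask of the documented shape is to expand a per-token padding mask of shape (batch_size, seq_length). The sketch below uses nested Python lists rather than int32 tensors, and the helper name make_attention_mask is hypothetical; it only illustrates the shape and the 1/0 convention described above.

```python
def make_attention_mask(padding_mask):
    """Expand a (batch_size, seq_length) padding mask to
    (batch_size, seq_length, seq_length).

    padding_mask[b][j] is 1 for a real token and 0 for padding.
    In the result, mask[b][i][j] == padding_mask[b][j], so every query
    position i may attend to key position j only if j is a real token.
    """
    return [[list(row) for _ in row] for row in padding_mask]


# One sequence of length 3 whose last position is padding.
mask = make_attention_mask([[1, 1, 0]])
```

Each row of the resulting seq_length x seq_length matrix is a copy of the padding mask, so padded key positions are masked out for every query position.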

layer_output Float tensor of shape (batch_size, seq_length, hidden_size).
attention_scores (optional) The attention scores, returned only when return_attention_scores is True.

ValueError A Tensor shape or parameter is invalid.