Number of attention heads in the transformer block.
The size of the "intermediate" (a.k.a., feed
The non-linear activation function to apply
to the output of the intermediate/feed-forward layer.
Dropout probability for the hidden layers.
Dropout probability of the attention probabilities.
Size of the bottleneck layer.
The stddev of the truncated_normal_initializer for
initializing all weight matrices.
Whether to use attention inputs from the bottleneck
transformation. If True, the key_query_shared_bottleneck
argument below is ignored.
Whether to share the linear transformation for
keys and queries.
Number of stacked feed-forward networks.
The normalization type; only no_norm and
layer_norm are supported. no_norm represents the element-wise linear
transformation used for the student model, as suggested by the original
MobileBERT paper, while layer_norm is used for the teacher model (a
sketch of the no_norm idea follows the argument list).
Whether to apply the tanh activation to the final
representation of the [CLS] token in fine-tuning.
The dtype of the input_mask tensor, which is one of the
input tensors of this encoder. Defaults to int32. If you want
to use tf.lite quantization, which does not support the Cast op,
set this argument to tf.float32 and feed the input_mask
tensor with float32 values to avoid tf.cast in the computation.
Other keyword arguments.
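The no_norm option above replaces layer normalization with a per-feature affine (element-wise linear) transform, so no mean or variance statistics are computed. A minimal sketch of that idea as a Keras-style custom layer; the NoNorm class name and its weight names are illustrative, not taken from this documentation:

```python
import tensorflow as tf

class NoNorm(tf.keras.layers.Layer):
  """Sketch of an element-wise linear replacement for layer normalization."""

  def build(self, input_shape):
    feature_size = input_shape[-1]
    # One scale and one bias per feature, applied element-wise.
    self.scale = self.add_weight(
        name='scale', shape=(feature_size,), initializer='ones')
    self.bias = self.add_weight(
        name='bias', shape=(feature_size,), initializer='zeros')

  def call(self, x):
    # Unlike layer_norm, no mean/variance statistics are computed.
    return x * self.scale + self.bias
```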
The pooler dense layer after the transformer layers.
List of Transformer layers in the encoder.
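As a rough construction sketch, assuming this encoder is the MobileBERTEncoder network from the TensorFlow Model Garden (official.nlp.modeling.networks). The argument names mirror the descriptions above, but the exact names and the values chosen here are assumptions rather than documented defaults:

```python
import tensorflow as tf
from official.nlp.modeling import networks  # TensorFlow Model Garden

# Hypothetical configuration; argument names follow the descriptions above.
encoder = networks.MobileBERTEncoder(
    num_attention_heads=4,
    intermediate_size=512,
    intermediate_act_fn='relu',
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    intra_bottleneck_size=128,
    initializer_range=0.02,
    use_bottleneck_attention=False,
    key_query_shared_bottleneck=True,
    num_feedforward_networks=4,
    normalization_type='no_norm',  # element-wise linear, per the MobileBERT paper
    classifier_activation=False,   # no tanh on the final [CLS] representation
    input_mask_dtype=tf.int32)     # use tf.float32 to avoid Cast under tf.lite
```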
call(inputs, training=None, mask=None)
Calls the model on new inputs and returns the outputs as tensors.
In this case call() just reapplies
all ops in the graph to the new inputs
(e.g., builds a new computational graph from the provided inputs).
Input tensor, or dict/list/tuple of input tensors.
Boolean or boolean scalar tensor, indicating whether to run
the Network in training mode or inference mode.
A mask or list of masks. A mask can be either a boolean tensor or
None (no mask). For more details, check the Keras guide on masking
and padding.
A tensor if there is a single output, or
a list of tensors if there is more than one output.
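Continuing the construction sketch above, a hedged example of calling the encoder on a batch of token IDs. The three-tensor input convention (word IDs, mask, type IDs) and the vocabulary size are assumptions following common BERT-style encoders:

```python
import tensorflow as tf

batch_size, seq_length = 2, 16
word_ids = tf.random.uniform(
    (batch_size, seq_length), maxval=30522, dtype=tf.int32)  # assumed vocab size
mask = tf.ones((batch_size, seq_length), dtype=tf.int32)     # all tokens valid
type_ids = tf.zeros((batch_size, seq_length), dtype=tf.int32)

# training=False runs the network in inference mode (e.g. dropout disabled).
# Per the Returns section above, outputs is a tensor or a list of tensors.
outputs = encoder([word_ids, mask, type_ids], training=False)
```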