This network allows users to flexibly implement an encoder similar to the one
described in "BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding" (https://arxiv.org/abs/1810.04805).
In this network, users can choose to provide a custom embedding subnetwork
(which will replace the standard embedding logic) and/or a custom hidden layer
class (which will replace the Transformer instantiation in the encoder). For
each of these custom injection points, users can pass either a class or a
class instance. If a class is passed, that class will be instantiated using
the embedding_cfg or hidden_cfg argument, respectively; if an instance
is passed, that instance will be invoked. (In the case of hidden_cls, the
instance will be invoked 'num_hidden_instances' times.)
If the hidden_cls is not overridden, a default transformer layer will be
instantiated.
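The class-vs-instance dispatch described above can be sketched in plain Python. This is an illustrative sketch only: `build_hidden_layers` and `ToyLayer` are hypothetical names, not part of the actual API.

```python
import inspect

def build_hidden_layers(hidden_cls, hidden_cfg, num_hidden_instances):
    """Sketch of the scaffold's injection rule for hidden layers.

    If `hidden_cls` is a class, it is instantiated `num_hidden_instances`
    times using `hidden_cfg` as kwargs; if it is already an instance, that
    same instance is reused (invoked) for every layer.
    """
    layers = []
    for _ in range(num_hidden_instances):
        if inspect.isclass(hidden_cls):
            layers.append(hidden_cls(**(hidden_cfg or {})))
        else:
            layers.append(hidden_cls)  # an instance: used as-is each time
    return layers

# Toy stand-in for a Transformer layer class.
class ToyLayer:
    def __init__(self, units=4):
        self.units = units

# Passing a class: three distinct instances, each built from the cfg.
built = build_hidden_layers(ToyLayer, {"units": 8}, num_hidden_instances=3)
# Passing an instance: the same object is reused three times.
shared = build_hidden_layers(ToyLayer(2), None, num_hidden_instances=3)
```

The same rule applies to the embedding injection point, except that an embedding class or instance is only ever used once.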
The dimension of the pooled output.
The initializer for the classification layer.
The class or instance to use to embed the input data. This
class or instance defines the inputs to this encoder and outputs (1) an
embeddings tensor with shape (batch_size, seq_length, hidden_size) and
(2) an attention mask tensor with shape (batch_size, seq_length, seq_length).
If embedding_cls is not set, a default embedding network (from the
original BERT paper) will be created.
A dict of kwargs to pass to the embedding_cls, if it needs to
be instantiated. If embedding_cls is not set, a config dict must be
passed to embedding_cfg with the following values:
vocab_size: The size of the token vocabulary.
type_vocab_size: The size of the type vocabulary.
hidden_size: The hidden size for this encoder.
max_seq_length: The maximum sequence length for this encoder.
seq_length: The sequence length for this encoder.
initializer: The initializer for the embedding portion of this encoder.
dropout_rate: The dropout rate to apply before the encoding layers.
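As a concrete illustration, an embedding_cfg covering the keys listed above might look like the following. All values here are examples chosen for illustration, not required defaults, and the string initializer spec is a placeholder for whatever initializer object the encoder expects.

```python
# Illustrative embedding_cfg; values are examples, not required defaults.
embedding_cfg = {
    "vocab_size": 30522,        # size of the token vocabulary
    "type_vocab_size": 2,       # size of the type (segment) vocabulary
    "hidden_size": 768,         # hidden size for this encoder
    "max_seq_length": 512,      # maximum sequence length for this encoder
    "seq_length": 128,          # sequence length for this encoder
    "initializer": "truncated_normal",  # placeholder initializer spec
    "dropout_rate": 0.1,        # dropout applied before the encoding layers
}
```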
A reference to the embedding weights that will be used to
train the masked language model, if necessary. This is optional, and only
needed if (1) you are overriding embedding_cls and (2) you are doing
masked language model pretraining.
The number of times to instantiate and/or invoke the hidden_cls.
Three types of input are supported for encoding the input data:
(1) a class, (2) an instance, or (3) a list of classes or instances. If
hidden_cls is not set, a KerasBERT transformer layer will be used as the
encoder class. If hidden_cls is a list of classes or instances, these
classes (instances) are sequentially instantiated (invoked) on top of the
embedding layer. Mixing classes and instances in the list is allowed.
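The list form can be sketched as follows; `apply_stack` and `AddN` are hypothetical illustration names, with a toy callable standing in for a transformer block. Each class in the list is instantiated with hidden_cfg, each instance is used as-is, and the resulting layers are invoked in order on top of the embedding output.

```python
import inspect

class AddN:
    """Toy stand-in for a hidden layer: adds a constant to its input."""
    def __init__(self, n=1):
        self.n = n
    def __call__(self, x):
        return x + self.n

def apply_stack(items, hidden_cfg, embedding_output):
    """Instantiate classes (with hidden_cfg), then invoke everything in order."""
    x = embedding_output
    for item in items:
        layer = item(**(hidden_cfg or {})) if inspect.isclass(item) else item
        x = layer(x)
    return x

# Mixing a class and a premade instance in one list is allowed:
# the class is built with n=2, the instance keeps its own n=10.
out = apply_stack([AddN, AddN(10)], {"n": 2}, embedding_output=0)
# out == 12: 0 + 2 (class instantiated from cfg) + 10 (premade instance)
```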
A dict of kwargs to pass to the hidden_cls, if it needs to be
instantiated. If hidden_cls is not set, a config dict must be passed to
hidden_cfg with the following values:
num_attention_heads: The number of attention heads. The hidden size
must be divisible by num_attention_heads.
intermediate_size: The intermediate size of the transformer.
intermediate_activation: The activation to apply in the transformer.
dropout_rate: The overall dropout rate for the transformer layers.
attention_dropout_rate: The dropout rate for the attention layers.
kernel_initializer: The initializer for the transformer layers.
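A hidden_cfg covering these keys might look like the sketch below; the values are illustrative, not required defaults, and the initializer string is a placeholder. It also checks the stated constraint that the hidden size must be divisible by num_attention_heads, which is what makes the per-head size a whole number.

```python
# Illustrative hidden_cfg; values are examples, not required defaults.
hidden_cfg = {
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "intermediate_activation": "gelu",
    "dropout_rate": 0.1,
    "attention_dropout_rate": 0.1,
    "kernel_initializer": "truncated_normal",  # placeholder initializer spec
}

hidden_size = 768  # must be divisible by num_attention_heads
assert hidden_size % hidden_cfg["num_attention_heads"] == 0
head_size = hidden_size // hidden_cfg["num_attention_heads"]  # size per head
```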
The class used to generate the masks passed into hidden_cls() from the inputs
and the 2D mask indicating the positions we can attend to. It is the caller's
job to make sure the output of the mask layer can be consumed by the hidden
layer. A mask_cls is usually paired with a specific hidden_cls.
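The shape contract described earlier can be illustrated with a minimal sketch: expanding a (batch_size, seq_length) padding mask into the (batch_size, seq_length, seq_length) attention mask the hidden layers consume. `make_attention_mask` is a hypothetical name, and a real mask layer may compute this differently; plain nested lists stand in for tensors.

```python
def make_attention_mask(padding_mask):
    """Expand a [batch, seq] padding mask (1 = real token, 0 = padding)
    into a [batch, seq, seq] attention mask, where query position i may
    attend to key position j only if token j is a real token.
    """
    return [
        [[row[j] for j in range(len(row))] for _ in range(len(row))]
        for row in padding_mask
    ]

# One sequence of length 3 whose last token is padding: no query
# position may attend to position 2.
mask = make_attention_mask([[1, 1, 0]])
# mask == [[[1, 1, 0], [1, 1, 0], [1, 1, 0]]]
```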
A dict of kwargs to pass to the mask_cls, if it needs to be instantiated.
Whether to add a layer norm before the pooling
layer. You probably want to turn this on if you set norm_first=True in
the transformer layers.
Whether to output sequence embedding outputs of
all encoder transformer layers.
Whether to use a dictionary as the model outputs.
Whether to include the layer index (as layer_idx) in the
attention_cfg within hidden_cfg.
Whether the scaffold should feed the layer index to hidden_cls.
Whether to pass the second return value of the hidden layer as the last
element among the inputs. None will be passed as the initial state.
List of hidden layers in the encoder.
The pooler dense layer after the transformer layers.
inputs, training=None, mask=None
Calls the model on new inputs and returns the outputs as tensors.
In this case, call() just reapplies
all ops in the graph to the new inputs
(i.e., builds a new computational graph from the provided inputs).
Input tensor, or dict/list/tuple of input tensors.
Boolean or boolean scalar tensor, indicating whether to
run the Network in training mode or inference mode.
A mask or list of masks. A mask can be either a boolean tensor
or None (no mask). For more details, see the Keras guide on masking
and padding.
A tensor if there is a single output, or
a list of tensors if there is more than one output.