This network allows users to flexibly implement an encoder similar to the one
described in "BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding" (https://arxiv.org/abs/1810.04805).
In this network, users can choose to provide a custom embedding subnetwork
(which will replace the standard embedding logic) and/or a custom hidden layer
class (which will replace the Transformer instantiation in the encoder). For
each of these custom injection points, users can pass either a class or a
class instance. If a class is passed, that class will be instantiated using
the embedding_cfg or hidden_cfg argument, respectively; if an instance
is passed, that instance will be invoked. (In the case of hidden_cls, the
instance will be invoked 'num_hidden_instances' times.)
If the hidden_cls is not overridden, a default transformer layer will be
instantiated.
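The class-vs-instance dispatch described above can be sketched in plain Python. This is an illustrative sketch only: `build_hidden_layers` and `ToyLayer` are hypothetical names, not part of the actual API.

```python
import inspect

def build_hidden_layers(hidden_cls, hidden_cfg, num_hidden_instances):
    """Sketch of the scaffold's injection rule for hidden layers.

    If `hidden_cls` is a class, it is instantiated `num_hidden_instances`
    times using `hidden_cfg` as kwargs; if it is already an instance, that
    same instance is reused (invoked) for every layer.
    """
    layers = []
    for _ in range(num_hidden_instances):
        if inspect.isclass(hidden_cls):
            layers.append(hidden_cls(**(hidden_cfg or {})))
        else:
            layers.append(hidden_cls)  # an instance: used as-is each time
    return layers

# Toy stand-in for a Transformer layer class.
class ToyLayer:
    def __init__(self, units=4):
        self.units = units

# Passing a class: three distinct instances, each built from the cfg.
built = build_hidden_layers(ToyLayer, {"units": 8}, num_hidden_instances=3)
# Passing an instance: the same object is reused three times.
shared = build_hidden_layers(ToyLayer(2), None, num_hidden_instances=3)
```

The same rule applies to the embedding injection point, except that an embedding class or instance is only ever used once.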
The dimension of the pooled output.
The initializer for the classification layer.
The class or instance to use to embed the input data. This
class or instance defines the inputs to this encoder and outputs (1) an
embeddings tensor with shape (batch_size, seq_length, hidden_size) and
(2) an attention mask tensor with shape (batch_size, seq_length, seq_length).
If embedding_cls is not set, a default embedding network (from the
original BERT paper) will be created.
A dict of kwargs to pass to the embedding_cls, if it needs to
be instantiated. If embedding_cls is not set, a config dict must be
passed to embedding_cfg with the following values:
vocab_size: The size of the token vocabulary.
type_vocab_size: The size of the type vocabulary.
hidden_size: The hidden size for this encoder.
max_seq_length: The maximum sequence length for this encoder.
seq_length: The sequence length for this encoder.
initializer: The initializer for the embedding portion of this encoder.
dropout_rate: The dropout rate to apply before the encoding layers.
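As a concrete illustration, an embedding_cfg covering the keys listed above might look like the following. All values here are examples chosen for illustration, not required defaults, and the string initializer spec is a placeholder for whatever initializer object the encoder expects.

```python
# Illustrative embedding_cfg; values are examples, not required defaults.
embedding_cfg = {
    "vocab_size": 30522,        # size of the token vocabulary
    "type_vocab_size": 2,       # size of the type (segment) vocabulary
    "hidden_size": 768,         # hidden size for this encoder
    "max_seq_length": 512,      # maximum sequence length for this encoder
    "seq_length": 128,          # sequence length for this encoder
    "initializer": "truncated_normal",  # placeholder initializer spec
    "dropout_rate": 0.1,        # dropout applied before the encoding layers
}
```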
A reference to the embedding weights that will be used to
train the masked language model, if necessary. This is optional, and only
needed if (1) you are overriding embedding_cls and (2) you are doing
masked language model pretraining.
The number of times to instantiate and/or invoke the hidden_cls.
Three types of input are supported for encoding the input data:
(1) a class, (2) an instance, or (3) a list of classes or instances. If
hidden_cls is not set, a KerasBERT transformer layer will be used as the
encoder class. If hidden_cls is a list of classes or instances, these
classes (instances) are sequentially instantiated (invoked) on top of the
embedding layer. Mixing classes and instances in the list is allowed.
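The list form can be sketched as follows; `apply_stack` and `AddN` are hypothetical illustration names, with a toy callable standing in for a transformer block. Each class in the list is instantiated with hidden_cfg, each instance is used as-is, and the resulting layers are invoked in order on top of the embedding output.

```python
import inspect

class AddN:
    """Toy stand-in for a hidden layer: adds a constant to its input."""
    def __init__(self, n=1):
        self.n = n
    def __call__(self, x):
        return x + self.n

def apply_stack(items, hidden_cfg, embedding_output):
    """Instantiate classes (with hidden_cfg), then invoke everything in order."""
    x = embedding_output
    for item in items:
        layer = item(**(hidden_cfg or {})) if inspect.isclass(item) else item
        x = layer(x)
    return x

# Mixing a class and a premade instance in one list is allowed:
# the class is built with n=2, the instance keeps its own n=10.
out = apply_stack([AddN, AddN(10)], {"n": 2}, embedding_output=0)
# out == 12: 0 + 2 (class instantiated from cfg) + 10 (premade instance)
```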
A dict of kwargs to pass to the hidden_cls, if it needs to be
instantiated. If hidden_cls is not set, a config dict must be passed to
hidden_cfg with the following values:
num_attention_heads: The number of attention heads. The hidden size
must be divisible by num_attention_heads.
intermediate_size: The intermediate size of the transformer.
intermediate_activation: The activation to apply in the transformer.
dropout_rate: The overall dropout rate for the transformer layers.
attention_dropout_rate: The dropout rate for the attention layers.
kernel_initializer: The initializer for the transformer layers.
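A hidden_cfg covering these keys might look like the sketch below; the values are illustrative, not required defaults, and the initializer string is a placeholder. It also checks the stated constraint that the hidden size must be divisible by num_attention_heads, which is what makes the per-head size a whole number.

```python
# Illustrative hidden_cfg; values are examples, not required defaults.
hidden_cfg = {
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "intermediate_activation": "gelu",
    "dropout_rate": 0.1,
    "attention_dropout_rate": 0.1,
    "kernel_initializer": "truncated_normal",  # placeholder initializer spec
}

hidden_size = 768  # must be divisible by num_attention_heads
assert hidden_size % hidden_cfg["num_attention_heads"] == 0
head_size = hidden_size // hidden_cfg["num_attention_heads"]  # size per head
```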
The class used to generate the masks passed into hidden_cls() from the inputs
and the 2D mask indicating the positions we can attend to. It is the caller's
job to make sure the output of the mask layer can be consumed by the hidden
layer. A mask_cls is usually paired with a specific hidden_cls.
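The shape contract described earlier can be illustrated with a minimal sketch: expanding a (batch_size, seq_length) padding mask into the (batch_size, seq_length, seq_length) attention mask the hidden layers consume. `make_attention_mask` is a hypothetical name, and a real mask layer may compute this differently; plain nested lists stand in for tensors.

```python
def make_attention_mask(padding_mask):
    """Expand a [batch, seq] padding mask (1 = real token, 0 = padding)
    into a [batch, seq, seq] attention mask, where query position i may
    attend to key position j only if token j is a real token.
    """
    return [
        [[row[j] for j in range(len(row))] for _ in range(len(row))]
        for row in padding_mask
    ]

# One sequence of length 3 whose last token is padding: no query
# position may attend to position 2.
mask = make_attention_mask([[1, 1, 0]])
# mask == [[[1, 1, 0], [1, 1, 0], [1, 1, 0]]]
```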
A dict of kwargs to pass to the mask_cls, if it needs to be instantiated.
Whether to add a layer norm before the pooling
layer. You probably want to turn this on if you set norm_first=True in
the transformer layers.
Whether to output sequence embedding outputs of
all encoder transformer layers.
Whether to use a dictionary as the model outputs.
Whether to include the layer index (as layer_idx) in the
attention_cfg within hidden_cfg.
Whether the scaffold should feed the layer index to hidden_cls.
Whether to pass the second return value of the hidden layer as the last
element among the inputs. None will be passed as the initial state.
List of hidden layers in the encoder.
The pooler dense layer after the transformer layers.
inputs, training=None, mask=None
Calls the model on new inputs and returns the outputs as tensors.
In this case, call() just reapplies
all ops in the graph to the new inputs
(i.e., builds a new computational graph from the provided inputs).
Input tensor, or dict/list/tuple of input tensors.
Boolean or boolean scalar tensor, indicating whether to
run the Network in training mode or inference mode.
A mask or list of masks. A mask can be either a boolean tensor
or None (no mask). For more details, see the Keras guide on masking
and padding.
A tensor if there is a single output, or
a list of tensors if there is more than one output.