Bi-directional Transformer-based encoder network.
tfm.nlp.networks.BertEncoder(
vocab_size,
hidden_size=768,
num_layers=12,
num_attention_heads=12,
max_sequence_length=512,
type_vocab_size=16,
inner_dim=3072,
inner_activation=(lambda x: tf.keras.activations.gelu(x, approximate=True)),
output_dropout=0.1,
attention_dropout=0.1,
initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02),
output_range=None,
embedding_width=None,
embedding_layer=None,
norm_first=False,
dict_outputs=False,
return_all_encoder_outputs=False,
return_attention_scores: bool = False,
**kwargs
)
This network implements a bi-directional Transformer-based encoder as described in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (https://arxiv.org/abs/1810.04805). It includes the embedding lookups and transformer layers, but not the masked language model or classification task networks.
The default values for this object are taken from the BERT-Base implementation in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
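A minimal construction sketch (the hyperparameters below are illustrative, not prescriptive; only `vocab_size` is required, and 30522 is the standard BERT-Base WordPiece vocabulary size):

```python
import tensorflow_models as tfm

# Build a small BERT-style encoder. Any argument that is omitted falls back to
# the BERT-Base defaults shown in the signature above.
encoder = tfm.nlp.networks.BertEncoder(
    vocab_size=30522,        # standard BERT WordPiece vocabulary size
    hidden_size=256,         # reduced from the 768 default for a lighter model
    num_layers=4,
    num_attention_heads=4,   # hidden_size must be divisible by this (256 / 4 = 64 per head)
    inner_dim=1024,
    max_sequence_length=128)
```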
Args | Description
---|---
`vocab_size` | The size of the token vocabulary.
`hidden_size` | The size of the transformer hidden layers.
`num_layers` | The number of transformer layers.
`num_attention_heads` | The number of attention heads for each transformer. The hidden size must be divisible by the number of attention heads.
`max_sequence_length` | The maximum sequence length that this encoder can consume. If None, the maximum sequence length is taken from the input sequence length. This determines the variable shape for positional embeddings.
`type_vocab_size` | The number of types that the `type_ids` input can take.
`inner_dim` | The output dimension of the first Dense layer in a two-layer feedforward network for each transformer.
`inner_activation` | The activation for the first Dense layer in a two-layer feedforward network for each transformer.
`output_dropout` | Dropout probability for the post-attention and output dropout.
`attention_dropout` | The dropout rate to use for the attention layers within the transformer layers.
`initializer` | The initializer to use for all weights in this encoder.
`output_range` | The sequence output range, [0, output_range), obtained by slicing the target sequence of the last transformer layer. None means the entire target sequence attends to the source sequence, which yields the full output.
`embedding_width` | The width of the word embeddings. If the embedding width is not equal to the hidden size, the embedding parameters are factorized into two matrices with shapes [vocab_size, embedding_width] and [embedding_width, hidden_size] (see the sketch after this table).
`embedding_layer` | An optional Layer instance which will be called to generate embeddings for the input word IDs.
`norm_first` | Whether to normalize the inputs to the attention and intermediate dense layers. If set to False, the outputs of the attention and intermediate dense layers are normalized instead.
`dict_outputs` | Whether to use a dictionary as the model outputs.
`return_all_encoder_outputs` | Whether to output the sequence embeddings of all encoder transformer layers. Note: when `dict_outputs` is True, all encoder outputs are always returned in the dict, keyed by `encoder_outputs`.
`return_attention_scores` | Whether to add an additional output containing the attention scores of all transformer layers. This will be a list of length `num_layers`, and each element will have shape [batch_size, num_attention_heads, seq_dim, seq_dim].
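As an illustration of the `embedding_width` factorization described above, a sketch with illustrative values (ALBERT-style small embedding width, not a documented recipe):

```python
import tensorflow_models as tfm

# When embedding_width != hidden_size, the embedding parameters are factorized
# into a [vocab_size, embedding_width] lookup table followed by a projection of
# shape [embedding_width, hidden_size].
factorized_encoder = tfm.nlp.networks.BertEncoder(
    vocab_size=30000,
    hidden_size=768,
    embedding_width=128)
```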
Attributes | Description
---|---
`pooler_layer` | The pooler dense layer after the transformer layers.
`transformer_layers` | List of Transformer layers in the encoder.
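For example, these sub-layers can be inspected after construction (a sketch using the `encoder` built in the construction example above):

```python
# pooler_layer is the dense pooling layer applied after the transformer stack;
# transformer_layers is a Python list with one entry per encoder layer.
print(encoder.pooler_layer)
print(len(encoder.transformer_layers))   # 4 for the encoder built above
```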
Methods
call
call(
inputs, training=None, mask=None
)
Calls the model on new inputs and returns the outputs as tensors.
In this case `call()` just reapplies all ops in the graph to the new inputs (e.g. builds a new computational graph from the provided inputs). A usage sketch follows the tables below.
Args | Description
---|---
`inputs` | Input tensor, or dict/list/tuple of input tensors.
`training` | Boolean or boolean scalar tensor, indicating whether to run the network in training mode or inference mode.
`mask` | A mask or list of masks. A mask can be either a boolean tensor or None (no mask). For more details, see the Keras masking guide.
Returns |
---|
A tensor if there is a single output, or a list of tensors if there is more than one output. |
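A usage sketch for `call()`, assuming the common three-tensor BERT input ordering (word IDs, input mask, type IDs) and the `encoder` built in the construction example above; the shapes in the comments follow from those illustrative hyperparameters:

```python
import tensorflow as tf

batch_size, seq_length = 2, 128
word_ids = tf.zeros((batch_size, seq_length), dtype=tf.int32)   # token IDs
mask = tf.ones((batch_size, seq_length), dtype=tf.int32)        # 1 for real tokens, 0 for padding
type_ids = tf.zeros((batch_size, seq_length), dtype=tf.int32)   # segment IDs

# With the default dict_outputs=False the encoder is expected to return the
# per-token sequence output and the pooled [CLS] output; with dict_outputs=True
# the same tensors come back in a dict ('sequence_output', 'pooled_output', ...).
sequence_output, pooled_output = encoder([word_ids, mask, type_ids], training=False)
print(sequence_output.shape)   # (2, 128, 256) for the encoder built above
print(pooled_output.shape)     # (2, 256)
```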
get_embedding_layer
get_embedding_layer()
get_embedding_table
get_embedding_table()
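A brief sketch of these accessors (one common use, not the only one, is reusing the embedding table to tie the input embeddings to the output projection of a masked-language-model head):

```python
# get_embedding_table() returns the [vocab_size, embedding_width] word-embedding
# matrix; get_embedding_layer() returns the Keras layer that owns it.
embedding_table = encoder.get_embedding_table()
embedding_layer = encoder.get_embedding_layer()
print(embedding_table.shape)   # e.g. (30522, 256) for the encoder built above
```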