Bi-directional Transformer-based encoder network.
tfm.nlp.networks.BertEncoderV2(
vocab_size: int,
hidden_size: int = 768,
num_layers: int = 12,
num_attention_heads: int = 12,
max_sequence_length: int = 512,
type_vocab_size: int = 16,
inner_dim: int = 3072,
inner_activation: _Activation = _approx_gelu,
output_dropout: float = 0.1,
attention_dropout: float = 0.1,
initializer: _Initializer = tf.keras.initializers.TruncatedNormal(stddev=0.02),
output_range: Optional[int] = None,
embedding_width: Optional[int] = None,
embedding_layer: Optional[tf.keras.layers.Layer] = None,
norm_first: bool = False,
with_dense_inputs: bool = False,
return_attention_scores: bool = False,
**kwargs
)
This network implements a bi-directional Transformer-based encoder as described in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (https://arxiv.org/abs/1810.04805). It includes the embedding lookups and transformer layers, but not the masked language model or classification task networks.
The default values for this object are taken from the BERT-Base implementation in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
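For orientation, the sketch below builds a deliberately small encoder and runs it on dummy inputs. It is a minimal usage sketch, not part of this reference: it assumes the tensorflow_models pip package (imported here as tfm) and the input keys ('input_word_ids', 'input_mask', 'input_type_ids') and output keys ('sequence_output', 'pooled_output') conventionally used by the TF-NLP BERT encoders.

```python
import tensorflow as tf
import tensorflow_models as tfm

# A tiny configuration for illustration; BERT-Base corresponds to the defaults above.
encoder = tfm.nlp.networks.BertEncoderV2(
    vocab_size=30522,
    hidden_size=128,
    num_layers=2,
    num_attention_heads=2,
    inner_dim=512,
    max_sequence_length=64)

batch_size, seq_len = 2, 16
inputs = dict(
    input_word_ids=tf.ones((batch_size, seq_len), dtype=tf.int32),    # token IDs
    input_mask=tf.ones((batch_size, seq_len), dtype=tf.int32),        # 1 = real token, 0 = padding
    input_type_ids=tf.zeros((batch_size, seq_len), dtype=tf.int32))   # segment IDs

outputs = encoder(inputs)
# outputs['sequence_output'] -> [batch_size, seq_len, hidden_size]
# outputs['pooled_output']   -> [batch_size, hidden_size]
```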
Args | |
---|---|
vocab_size | The size of the token vocabulary. |
hidden_size | The size of the transformer hidden layers. |
num_layers | The number of transformer layers. |
num_attention_heads | The number of attention heads for each transformer. The hidden size must be divisible by the number of attention heads. |
max_sequence_length | The maximum sequence length that this encoder can consume. This determines the variable shape for positional embeddings. |
type_vocab_size | The number of types that the 'type_ids' input can take. |
inner_dim | The output dimension of the first Dense layer in a two-layer feedforward network for each transformer. |
inner_activation | The activation for the first Dense layer in a two-layer feedforward network for each transformer. |
output_dropout | Dropout probability for the post-attention and output dropout. |
attention_dropout | The dropout rate to use for the attention layers within the transformer layers. |
initializer | The initializer to use for all weights in this encoder. |
output_range | The sequence output range, [0, output_range), obtained by slicing the target sequence of the last transformer layer. None means the entire target sequence attends to the source sequence, which yields the full output. |
embedding_width | The width of the word embeddings. If the embedding width is not equal to the hidden size, the embedding parameters are factorized into two matrices of shape ['vocab_size', 'embedding_width'] and ['embedding_width', 'hidden_size'] (see the sketch after this table). |
embedding_layer | An optional Layer instance which will be called to generate embeddings for the input word IDs. |
norm_first | Whether to normalize inputs to the attention and intermediate dense layers. If set to False, the outputs of the attention and intermediate dense layers are normalized instead. |
with_dense_inputs | Whether to accept dense embeddings as the input. |
return_attention_scores | Whether to add an additional output containing the attention scores of all transformer layers. This will be a list of length num_layers, and each element will have shape [batch_size, num_attention_heads, seq_dim, seq_dim]. |
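As an illustration of the embedding_width factorization described above, the following sketch (an assumption-level example, not taken from this page) builds an encoder whose word embeddings are 128-dimensional and are projected up to the 768-dimensional hidden size, replacing one [vocab_size, hidden_size] lookup table with a [vocab_size, 128] and a [128, 768] matrix:

```python
# ALBERT-style factorized embeddings: fewer embedding parameters when
# embedding_width is much smaller than hidden_size.
factorized_encoder = tfm.nlp.networks.BertEncoderV2(
    vocab_size=30522,
    hidden_size=768,
    num_layers=12,
    num_attention_heads=12,
    embedding_width=128)
```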
Attributes | |
---|---|
pooler_layer | The pooler dense layer after the transformer layers. |
transformer_layers | List of Transformer layers in the encoder. |
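A short sketch of how these attributes can be inspected, reusing the small encoder from the first example above:

```python
# One entry per transformer block.
assert len(encoder.transformer_layers) == 2  # num_layers in the small example

# Dense layer used to produce pooled_output from the first token's representation.
print(encoder.pooler_layer)
```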
Methods
call
call(
inputs
)
This is where the layer's logic lives.
The call() method may not create state (except in its first invocation, wrapping the creation of variables or other resources in tf.init_scope()). It is recommended to create state, including tf.Variable instances and nested Layer instances, in __init__(), or in the build() method that is called automatically before call() executes for the first time.
Args | |
---|---|
inputs | Input tensor, or dict/list/tuple of input tensors. The first positional inputs argument is subject to special rules. |
*args | Additional positional arguments. May contain tensors, although this is not recommended, for the reasons above. |
**kwargs | Additional keyword arguments. May contain tensors, although this is not recommended, for the reasons above. The following optional keyword arguments are reserved: training: Boolean scalar tensor or Python boolean indicating whether the call is meant for training or inference. mask: Boolean input mask. If the layer's call() method takes a mask argument, its default value will be set to the mask generated for inputs by the previous layer (if the input did come from a layer that generated a corresponding mask, i.e. if it came from a Keras layer with masking support). |
Returns | |
---|---|
A tensor or list/tuple of tensors. |
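As with any Keras layer, the reserved training keyword argument can be passed at call time. The sketch below (assuming the encoder and inputs from the first example) requests an inference-mode forward pass:

```python
# training=False propagates to the dropout sub-layers inside the encoder,
# so attention and output dropout are disabled for this forward pass.
eval_outputs = encoder(inputs, training=False)
```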
get_embedding_layer
get_embedding_layer()
get_embedding_table
get_embedding_table()
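A common use of get_embedding_table() is to tie a masked-language-model output projection to the input word embeddings. The sketch below assumes the unfactorized default (embedding_width equal to hidden_size), so the table has shape [vocab_size, hidden_size]; with factorized embeddings the hidden states would first need to be projected down to embedding_width.

```python
# Reuse the word-embedding table as the MLM output projection (weight tying).
table = encoder.get_embedding_table()                 # [vocab_size, hidden_size]
hidden = outputs['sequence_output']                   # [batch, seq_len, hidden_size]
logits = tf.matmul(hidden, table, transpose_b=True)   # [batch, seq_len, vocab_size]
```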