Base XLNet model.

vocab_size: int, the number of tokens in the vocabulary.
num_layers: int, the number of layers.
hidden_size: int, the hidden size.
num_attention_heads: int, the number of attention heads.
head_size: int, the dimension of each attention head.
inner_size: int, the hidden size of the feed-forward layers.
dropout_rate: float, the dropout rate.
attention_dropout_rate: float, the dropout rate applied to the attention probabilities.
attention_type: str, "uni" or "bi".
bi_data: bool, whether to use the bidirectional input pipeline. Usually set to True during pretraining and False during finetuning.
initializer: a TensorFlow initializer.
two_stream: bool, whether to use TwoStreamRelativeAttention, as used in the XLNet pretrainer. If False, MultiHeadRelativeAttention is used instead, as in Transformer-XL.
tie_attention_biases: bool, whether to tie the attention biases together. Usually set to True; kept for backwards compatibility.
memory_length: int, the number of tokens to cache.
same_length: bool, whether to use the same attention length for each token.
clamp_length: int, clamp all relative distances larger than clamp_length; -1 means no clamping.
reuse_length: int, the number of tokens in the current batch to be cached and reused in the future.
inner_activation: str, "relu" or "gelu".
use_cls_mask: bool, whether a CLS mask is included in the input sequences.
embedding_width: the width of the word embeddings. If the embedding width is not equal to the hidden size, the embedding parameters are factorized into two matrices of shape [vocab_size, embedding_width] and [embedding_width, hidden_size].
embedding_layer: the word embedding layer. If None, a new embedding layer is created; otherwise the given embedding layer is reused. This parameter was originally added for the ELECTRA model, which needs to tie the generator embeddings to the discriminator embeddings.
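The embedding factorization described for embedding_width can be illustrated with a quick parameter count. This is a minimal sketch with hypothetical sizes (vocab_size=32000, hidden_size=1024, embedding_width=128 are illustrative, not defaults of the library):

```python
# Hypothetical sizes for illustration only.
vocab_size = 32000
hidden_size = 1024
embedding_width = 128

# Unfactorized lookup table: a single [vocab_size, hidden_size] matrix.
full_params = vocab_size * hidden_size

# Factorized: a [vocab_size, embedding_width] lookup table plus a
# [embedding_width, hidden_size] projection back to the model width.
factorized_params = vocab_size * embedding_width + embedding_width * hidden_size

print(full_params)        # 32768000
print(factorized_params)  # 4227072
```

With these numbers the factorized form needs roughly an eighth of the embedding parameters, which is why the split is only applied when embedding_width differs from hidden_size.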



Implements call() for the layer.


Returns the embedding layer weights.