This is an implementation of the network structure surrounding a
Transformer-XL encoder as described in "XLNet: Generalized Autoregressive
Pretraining for Language Understanding" (https://arxiv.org/abs/1906.08237).
An XLNet/Transformer-XL based network. This network outputs a
sequence output and a list of state (memory) tensors.
The activation (if any) to use in the masked LM network. If
None, no activation is used.
The initializer (if any) to use in the masked LM. Defaults
to a Glorot uniform initializer.
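The output contract described above (a sequence output plus a list of per-layer memory state tensors, in the Transformer-XL style) can be sketched with a toy stand-in. This is a minimal illustration of the interface only, not the actual XLNet implementation: the class name `ToyXLNetEncoder`, the single-matrix "attention" stand-in, and the memory-caching scheme are all simplifications. The `glorot_uniform` helper shows the default initializer mentioned for the masked LM.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    # Glorot/Xavier uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

class ToyXLNetEncoder:
    """Toy stand-in for the XLNet/Transformer-XL encoder interface:
    call it to get (sequence_output, list_of_state_tensors)."""

    def __init__(self, hidden_size, num_layers, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = [glorot_uniform(hidden_size, hidden_size, rng)
                        for _ in range(num_layers)]

    def __call__(self, inputs, mems=None):
        # inputs: [batch, seq_len, hidden]; mems: optional list of
        # [batch, mem_len, hidden] tensors cached from the previous segment.
        new_mems = []
        hidden = inputs
        for i, w in enumerate(self.weights):
            # Cache this layer's input as the next segment's memory.
            new_mems.append(hidden)
            context = (hidden if mems is None
                       else np.concatenate([mems[i], hidden], axis=1))
            # Stand-in for attention over [memory; current segment]:
            # keep only the positions of the current segment.
            hidden = np.tanh(context @ w)[:, -hidden.shape[1]:, :]
        return hidden, new_mems

enc = ToyXLNetEncoder(hidden_size=8, num_layers=2)
x = np.ones((2, 4, 8))
seq_out, states = enc(x)          # seq_out: (2, 4, 8); states: 2 tensors
seq_out2, _ = enc(x, mems=states) # next segment attends over cached states
```

The second call shows why the state list is part of the output: Transformer-XL style models feed the previous segment's cached hidden states back in as memory, extending the effective context beyond one segment.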