This is an implementation of the network structure surrounding a
Transformer-XL encoder as described in "XLNet: Generalized Autoregressive
Pretraining for Language Understanding" (https://arxiv.org/abs/1906.08237).
An XLNet/Transformer-XL-based network. This network should return a
sequence output and a list of state tensors.
Number of classes to predict from the classification network.
The initializer (if any) to use in the classification networks.
Defaults to a RandomNormal initializer.
Method used to summarize a sequence into a compact vector.
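
As an illustration of what summarizing a sequence into a compact vector can mean, here is a minimal sketch in plain Python. The function name `summarize_sequence` and the option names `"last"`, `"first"`, and `"mean"` are assumptions made for this example only. They are not confirmed values of the parameter described above, and the actual model operates on tensors rather than Python lists.

```python
from typing import List, Sequence


def summarize_sequence(
    sequence_output: Sequence[Sequence[float]],
    summary_type: str = "last",
) -> List[float]:
    """Collapse a [seq_len, hidden] sequence into a single [hidden] vector.

    Hypothetical helper: the option names are illustrative assumptions.
    """
    if not sequence_output:
        raise ValueError("sequence_output must contain at least one position")
    if summary_type == "last":
        # Use the hidden state of the final position.
        return list(sequence_output[-1])
    if summary_type == "first":
        # Use the hidden state of the first position.
        return list(sequence_output[0])
    if summary_type == "mean":
        # Average each hidden dimension over all positions.
        seq_len = len(sequence_output)
        return [sum(column) / seq_len for column in zip(*sequence_output)]
    raise ValueError(f"Unsupported summary_type: {summary_type!r}")


# Three positions, hidden size 2.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
last_vec = summarize_sequence(tokens, "last")
mean_vec = summarize_sequence(tokens, "mean")
```

In a classifier of this kind, the resulting vector would then typically be passed through a dense layer that produces one logit per class.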