This is an implementation of the network structure surrounding a
Transformer-XL encoder as described in "XLNet: Generalized Autoregressive
Pretraining for Language Understanding" (https://arxiv.org/abs/1906.08237).
A transformer network. It should return both a sequence output and a
classification output, and it should expose its embedding table via a
"get_embedding_table" method.
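Beam size for span start: the number of top-scoring start positions to keep.
Beam size for span end: the number of top-scoring end positions to keep for
each retained start position.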
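
To illustrate how the two beam sizes interact, here is a minimal pure-Python sketch of a two-stage beam over span boundaries. It is not the model's actual implementation: the names `top_n_indices`, `beam_search_spans`, and `end_scores_for_start` are hypothetical, and scoring a span as the sum of its start and end scores is an assumption made for the example.

```python
import heapq


def top_n_indices(scores, n):
    """Return the indices of the n largest scores, best first."""
    return heapq.nlargest(n, range(len(scores)), key=lambda i: scores[i])


def beam_search_spans(start_scores, end_scores_for_start, start_n_top, end_n_top):
    """Enumerate (start, end, score) span candidates with a two-stage beam.

    start_scores: one score per token position.
    end_scores_for_start: hypothetical callable that maps a start index to a
        list of end scores conditioned on that start.
    """
    candidates = []
    # Stage 1: keep only the start_n_top best start positions.
    for start in top_n_indices(start_scores, start_n_top):
        end_scores = end_scores_for_start(start)
        # Stage 2: for each kept start, keep the end_n_top best end positions.
        for end in top_n_indices(end_scores, end_n_top):
            # Assumption: a span's score is the sum of its two boundary scores.
            candidates.append((start, end, start_scores[start] + end_scores[end]))
    # Best-scoring span first.
    return sorted(candidates, key=lambda c: c[2], reverse=True)
```

Under these assumptions the search produces at most `start_n_top * end_n_top` candidate spans, so larger beams trade extra computation for a wider set of candidates.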
The dropout rate for the span labeling layer.
The activation for the span labeling head.
The initializer (if any) to use in the span labeling network.
Defaults to a Glorot uniform initializer.
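
For reference, a Glorot uniform initializer draws each weight uniformly from [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out)). The sketch below shows that rule in plain Python. The function name `glorot_uniform` and its `seed` parameter are hypothetical and stand in for whatever initializer object the framework provides.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, seed=None):
    """Return a fan_in x fan_out weight matrix drawn with the Glorot uniform rule."""
    rng = random.Random(seed)
    # The bound shrinks as the layer gets wider, which keeps activation
    # variance roughly constant from layer to layer.
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [
        [rng.uniform(-limit, limit) for _ in range(fan_out)]
        for _ in range(fan_in)
    ]
```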