Compute multi-head relative attention over inputs.
Number of heads (H): the number of attention heads.
Value size (V): the size of each value embedding per head.
Key size (K): the size of each key embedding per head. Equivalently, the size
of each query embedding per head. Typically K <= V.
Batch dimensions (B).
Query (target) attention axes shape (T).
Value (source) attention axes shape (S); its rank must match the target's.
Encoding length (L): The relative positional encoding length.
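Under these conventions, the per-call tensor shapes line up as in this minimal sketch (the names `query`, `key`, `value`, and `pos_encoding` are illustrative, not the layer's actual argument names):

```python
import numpy as np

# Illustrative sizes for the glossary above (all values are assumptions).
B, T, S, L = 2, 4, 6, 10   # batch, target length, source length, encoding length
H, K, V = 8, 16, 32        # heads, key/query size per head, value size per head

query = np.zeros((B, T, H, K))       # per-head query embeddings
key = np.zeros((B, S, H, K))         # per-head key embeddings
value = np.zeros((B, S, H, V))       # per-head value embeddings
pos_encoding = np.zeros((L, H, K))   # relative positional encoding

# Content-based attention logits: one score per head per (target, source) pair.
logits = np.einsum('bthk,bshk->bhts', query, key)
print(logits.shape)  # (2, 8, 4, 6), i.e. [B, H, T, S]
```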
A trainable bias parameter added to the query head
when calculating the content-based attention score.
A trainable bias parameter added to the query
head when calculating the position-based attention score.
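In the Transformer-XL formulation these two biases (often written u and v) enter the attention logits additively. The sketch below shows that decomposition; it omits the relative-shift step that realigns the positional scores, and every name is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, S, H, K = 4, 4, 2, 8            # target len, source len, heads, key size

q = rng.standard_normal((T, H, K))   # query heads
k = rng.standard_normal((S, H, K))   # key heads
r = rng.standard_normal((S, H, K))   # projected relative encodings (simplified)

u = rng.standard_normal((H, K))      # content attention bias
v = rng.standard_normal((H, K))      # positional attention bias

# Content-based score: (q + u) . k
content = np.einsum('thk,shk->hts', q + u, k)
# Position-based score: (q + v) . r (relative-shift realignment omitted here)
position = np.einsum('thk,shk->hts', q + v, r)

logits = content + position
print(logits.shape)  # (2, 4, 4), i.e. [H, T, S]
```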
Relative positional encoding for key and value.
Optional Tensor representing segmentation IDs used in XLNet.
Optional Tensor representing the segmentation encoding
as used in XLNet.
Optional trainable bias parameter added to the
query head when calculating the segment-based attention score used in
XLNet.
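In XLNet the segment-based score only distinguishes whether a query position and a key position fall in the same segment, choosing between a "same-segment" and a "different-segment" encoding. A hedged sketch of that selection (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, S, H, K = 3, 3, 2, 4

q = rng.standard_normal((T, H, K))             # query heads
s_bias = rng.standard_normal((H, K))           # segment attention bias
seg_encoding = rng.standard_normal((2, H, K))  # [same, different] encodings

# 1 where query position i and key position j lie in different segments.
segment_matrix = np.array([[0, 0, 1],
                           [0, 0, 1],
                           [1, 1, 0]])

# Score the biased query against both encodings, then pick per pair.
scores = np.einsum('thk,chk->cht', q + s_bias, seg_encoding)  # [2, H, T]
same, diff = scores[0], scores[1]                             # each [H, T]
segment_logits = np.where(segment_matrix[None, :, :] == 0,
                          same[:, :, None], diff[:, :, None])  # [H, T, S]
print(segment_logits.shape)  # (2, 3, 3)
```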
(default None) Optional state. If passed, this is also attended
over, as in Transformer-XL.
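When state is passed, a Transformer-XL-style layer prepends it to the current input before projecting keys and values, so the effective source length grows while queries still come only from the current segment. A minimal sketch (names are assumptions):

```python
import numpy as np

B, M, S, E = 2, 5, 3, 16   # batch, state (memory) length, current source length, embed dim

state = np.zeros((B, M, E))    # cached activations from previous segments
inputs = np.zeros((B, S, E))   # current segment

# Keys/values are projected from the concatenation; queries only from `inputs`.
kv_input = np.concatenate([state, inputs], axis=1)
print(kv_input.shape)  # (2, 8, 16): the source length becomes M + S
```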
(default None) Optional mask that is added to attention
logits. If state is not None, the mask's source sequence dimension should
be extended to cover the state length as well.
The result of the computation, of shape [B, T, E],
where T is the target sequence shape and E is the query input's last
dimension when output_shape is None. Otherwise, the multi-head outputs
are projected to the shape specified by output_shape.
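The final step combines the per-head outputs of shape [B, T, H, V] back into [B, T, E] (or into output_shape when it is given) via an output projection; a sketch under assumed names:

```python
import numpy as np

B, T, H, V, E = 2, 4, 8, 32, 64

per_head = np.zeros((B, T, H, V))   # attention output per head
out_kernel = np.zeros((H, V, E))    # output projection (E = query's last dim)

output = np.einsum('bthv,hve->bte', per_head, out_kernel)
print(output.shape)  # (2, 4, 64), i.e. [B, T, E]
```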