BigBird, a sparse attention mechanism.
tfm.nlp.layers.BigBirdAttention(num_rand_blocks=3, from_block_size=64, to_block_size=64, max_rand_mask_length=MAX_SEQ_LEN, seed=None, **kwargs)
This layer follows the paper "Big Bird: Transformers for Longer Sequences" (https://arxiv.org/abs/2007.14062). It reduces the quadratic dependency of attention computation on sequence length to linear.
Arguments are the same as the tf.keras.layers.MultiHeadAttention layer.
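A minimal usage sketch, not taken from the documentation: it assumes the standard MultiHeadAttention constructor arguments (num_heads, key_dim) and builds the block/band masks with the companion tfm.nlp.layers.BigBirdMasks layer before passing them as attention_mask. The shapes and the exact mask wiring are illustrative assumptions, modeled on how the library's BigBird encoder pairs these two layers.

```python
import tensorflow as tf
import tensorflow_models as tfm

# Illustrative sizes (assumptions). The sequence length should be a multiple
# of the block sizes so the blocked sparse attention tiles cleanly.
batch_size, seq_len, hidden_size = 2, 1024, 768

# num_heads and key_dim are inherited from tf.keras.layers.MultiHeadAttention.
attention = tfm.nlp.layers.BigBirdAttention(
    num_heads=12,
    key_dim=64,
    num_rand_blocks=3,
    from_block_size=64,
    to_block_size=64,
)

embeddings = tf.random.uniform((batch_size, seq_len, hidden_size))
padding_mask = tf.ones((batch_size, seq_len), dtype=tf.int32)

# Assumption: attention_mask is the set of band/blocked masks produced by the
# companion tfm.nlp.layers.BigBirdMasks layer (block_size matching the
# attention block size), not a dense [batch, from_len, to_len] mask.
bigbird_masks = tfm.nlp.layers.BigBirdMasks(block_size=64)(
    embeddings, padding_mask)

# Self-attention: the same tensor is used as query and value.
output = attention(query=embeddings, value=embeddings,
                   attention_mask=bigbird_masks)
print(output.shape)  # (batch_size, seq_len, hidden_size)
```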
call(query, value, key=None, attention_mask=None, **kwargs)
This is where the layer's logic lives.
The call() method may not create state (except in its first invocation, wrapping the creation of variables or other resources in tf.init_scope()). It is recommended to create state, including tf.Variable instances and nested Layer instances, in __init__(), or in the build() method that is called automatically before call() executes for the first time.
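A minimal sketch of this recommendation, using a hypothetical custom layer that is not part of this library: state is created in build(), and call() only performs computation.

```python
import tensorflow as tf

class ScaleLayer(tf.keras.layers.Layer):
  """Hypothetical example layer: a learnable per-feature scale."""

  def build(self, input_shape):
    # State is created once here, before the first call() invocation.
    self.scale = self.add_weight(
        name="scale", shape=(input_shape[-1],), initializer="ones")

  def call(self, inputs):
    # call() only computes; it does not create new variables.
    return inputs * self.scale

layer = ScaleLayer()
print(layer(tf.ones((2, 4))).shape)  # (2, 4)
```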
Args

|inputs|Input tensor, or dict/list/tuple of input tensors. The first positional inputs argument is subject to special rules.|
|*args|Additional positional arguments. May contain tensors, although this is not recommended, for the reasons above.|
|**kwargs|Additional keyword arguments. May contain tensors, although this is not recommended, for the reasons above. The following optional keyword arguments are reserved: training (a boolean indicating whether the call is meant for training or inference) and mask (a boolean input mask).|

Returns

|A tensor or list/tuple of tensors.|