tf.data.experimental.bucket_by_sequence_length( element_length_func, bucket_boundaries, bucket_batch_sizes, padded_shapes=None, padding_values=None, pad_to_bucket_boundary=False, no_padding=False )
A transformation that buckets elements in a Dataset by length.
Elements of the Dataset are grouped together by length and then are padded and batched.
This is useful for sequence tasks in which the elements have variable length. Grouping together elements that have similar lengths reduces the total fraction of padding in a batch, which increases training step efficiency.
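As a minimal usage sketch, the transformation can be applied to a small dataset of variable-length integer sequences. The sample data, the boundaries [3, 5], and the batch sizes [2, 2, 2] are illustrative choices, not values prescribed by the API, and `output_signature` assumes a reasonably recent TensorFlow 2.x release.

```python
import tensorflow as tf

# Illustrative variable-length sequences (lengths 1, 4, 3, 5, 6, 2).
elements = [
    [0], [1, 2, 3, 4], [5, 6, 7],
    [7, 8, 9, 10, 11], [13, 14, 15, 16, 19, 20], [21, 22],
]

dataset = tf.data.Dataset.from_generator(
    lambda: elements,
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.int64),
)

# Two boundaries define three buckets, so three batch sizes are required.
dataset = dataset.apply(
    tf.data.experimental.bucket_by_sequence_length(
        element_length_func=lambda elem: tf.shape(elem)[0],
        bucket_boundaries=[3, 5],
        bucket_batch_sizes=[2, 2, 2],
    )
)

# Each batch holds sequences from one bucket, zero-padded to the
# longest sequence in that batch.
batches = [batch.tolist() for batch in dataset.as_numpy_iterator()]
for batch in batches:
    print(batch)
```

Because padded_shapes and pad_to_bucket_boundary are left at their defaults here, each batch is padded only to the longest sequence it contains.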
Args:
element_length_func: function from element in Dataset to tf.int32, determines the length of the element, which will determine the bucket it goes into.
bucket_boundaries: list<int>, upper length boundaries of the buckets.
bucket_batch_sizes: list<int>, batch size per bucket. Length should be len(bucket_boundaries) + 1.
padded_shapes: Nested structure of tf.TensorShape to pass to tf.data.Dataset.padded_batch. If not provided, will use dataset.output_shapes, which will result in variable length dimensions being padded out to the maximum length in each batch.
padding_values: Values to pad with, passed to tf.data.Dataset.padded_batch. Defaults to padding with 0.
pad_to_bucket_boundary: bool, if False, will pad dimensions with unknown size to maximum length in batch. If True, will pad dimensions with unknown size to bucket boundary minus 1 (i.e., the maximum length in each bucket), and caller must ensure that the source Dataset does not contain any elements with length longer than max(bucket_boundaries).
no_padding: bool, indicates whether to pad the batch features (features need to be either of type tf.SparseTensor or of same shape).
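The relationship between bucket_boundaries, bucket_batch_sizes, and pad_to_bucket_boundary can be illustrated with a small pure-Python model. This is a sketch of the rule described above, not the library's implementation: `bucket_index` and `boundary_padded_length` are hypothetical helper names, and the model assumes each boundary is an exclusive upper bound, which is what "bucket boundary minus 1 (i.e., the maximum length in each bucket)" implies.

```python
from bisect import bisect_right


def bucket_index(length, bucket_boundaries):
    """Return the bucket an element of the given length falls into.

    Assumes boundaries are sorted and each one is an exclusive upper
    bound, so bucket i covers boundaries[i - 1] <= length < boundaries[i].
    The final bucket (index len(bucket_boundaries)) has no upper bound.
    """
    return bisect_right(bucket_boundaries, length)


def boundary_padded_length(length, bucket_boundaries):
    """Padded length under pad_to_bucket_boundary=True, in this model.

    Returns boundary - 1 for bounded buckets. The final open-ended
    bucket has no boundary to pad to, so this sketch returns None there.
    """
    idx = bucket_index(length, bucket_boundaries)
    if idx == len(bucket_boundaries):
        return None
    return bucket_boundaries[idx] - 1


# Illustrative values: two boundaries give three buckets,
# hence three batch sizes.
bucket_boundaries = [3, 5]
bucket_batch_sizes = [8, 4, 2]

for length in [1, 3, 4]:
    idx = bucket_index(length, bucket_boundaries)
    print(
        "length", length,
        "bucket", idx,
        "batch size", bucket_batch_sizes[idx],
        "padded to", boundary_padded_length(length, bucket_boundaries),
    )
```

The extra entry in bucket_batch_sizes exists because the last bucket collects every element whose length is at or above the largest boundary.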
Returns: A Dataset transformation function, which can be passed to tf.data.Dataset.apply.
Raises: ValueError if len(bucket_batch_sizes) != len(bucket_boundaries) + 1.
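The length check can be observed with a deliberately mismatched argument pair. In the sketch below the dataset contents are arbitrary, and both the construction of the transformation and the call to apply sit inside the try block, since the point at which the check runs may differ between TensorFlow versions.

```python
import tensorflow as tf

# Arbitrary small dataset; its contents do not matter for this check.
dataset = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4]])

error_message = None
try:
    dataset.apply(
        tf.data.experimental.bucket_by_sequence_length(
            element_length_func=lambda elem: tf.shape(elem)[0],
            bucket_boundaries=[3, 5],
            # Two boundaries need three batch sizes; only two are given.
            bucket_batch_sizes=[2, 2],
        )
    )
except ValueError as err:
    error_message = str(err)

print(error_message)
```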