text.ByteSplitter

Splits a string tensor into bytes.

Inherits From: SplitterWithOffsets, Splitter

Methods

split

Splits a string tensor into bytes.

The strings are split into individual bytes. As a result, a Unicode character that is encoded with more than one byte is split across multiple output elements.

Example:

from tensorflow_text import ByteSplitter
ByteSplitter().split("hello")
<tf.Tensor: shape=(5,), dtype=uint8, numpy=array([104, 101, 108, 108, 111],
dtype=uint8)>
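
To illustrate why a Unicode character can occupy several bytes, the plain-Python sketch below mirrors the byte-level split with UTF-8 encoding. It does not call ByteSplitter; the helper name utf8_byte_values is hypothetical, and it assumes the input strings are UTF-8 encoded.

```python
def utf8_byte_values(s):
    """Return the UTF-8 byte values of s, one integer per byte."""
    return list(s.encode("utf-8"))

# ASCII text: one byte per character.
ascii_bytes = utf8_byte_values("hello")

# "\u00e9" (e with acute accent) is encoded as two bytes, 0xC3 0xA9.
accent_bytes = utf8_byte_values("h\u00e9llo")

print(ascii_bytes)   # [104, 101, 108, 108, 111]
print(accent_bytes)  # [104, 195, 169, 108, 108, 111]
```

The second string has five characters but six bytes, so a byte-level split of it would yield six elements.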

Args
input: A RaggedTensor or Tensor of strings with any shape.

Returns
A RaggedTensor of bytes. The returned shape is the shape of the input tensor with an added ragged dimension for the bytes that make up each string.
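
The added ragged dimension can be pictured with nested Python lists: each input string becomes a row whose length equals its byte count. This is only a model of the shape described above, not a call into the library; batch, rows, and row_lengths are illustrative names.

```python
# A rank-1 "batch" of strings with different byte lengths.
batch = ["hi", "hello", ""]

# One row of byte values per string; rows differ in length (ragged).
rows = [list(s.encode("utf-8")) for s in batch]
row_lengths = [len(r) for r in rows]

print(rows)         # [[104, 105], [104, 101, 108, 108, 111], []]
print(row_lengths)  # [2, 5, 0]
```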

split_by_offsets

Splits a string tensor into sub-strings.

The strings are split based upon the provided byte offsets.

Example:

splitter = ByteSplitter()
substrings = splitter.split_by_offsets("hello", [0, 4], [4, 5])
print(substrings.numpy())
[b'hell' b'o']
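
The offset semantics (inclusive start, exclusive end, measured in bytes rather than characters) can be sketched in plain Python with byte slicing. The helper slice_by_byte_offsets is hypothetical and does not use ByteSplitter; it assumes UTF-8 encoded input.

```python
def slice_by_byte_offsets(s, starts, ends):
    """Slice the UTF-8 bytes of s into [start, end) pieces."""
    data = s.encode("utf-8")
    return [data[b:e] for b, e in zip(starts, ends)]

# Same offsets as the example above.
pieces = slice_by_byte_offsets("hello", [0, 4], [4, 5])
print(pieces)  # [b'hell', b'o']

# Offsets count bytes: "\u00e9" spans bytes 1 and 2, so its slice is [1, 3).
accent_pieces = slice_by_byte_offsets("h\u00e9llo", [0, 1], [1, 3])
print(accent_pieces)  # [b'h', b'\xc3\xa9']
```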

Args
input: Tensor or RaggedTensor of strings of any shape to split.
start_offsets: Tensor or RaggedTensor of byte offsets to start splits on (inclusive). Its rank should be one more than the rank of input.
end_offsets: Tensor or RaggedTensor of byte offsets to end splits on (exclusive). Its rank should be one more than the rank of input.

Returns
A RaggedTensor or Tensor of substrings. The returned shape is the shape of the offsets.

split_with_offsets

Splits a string tensor into bytes.

The strings are split into individual bytes. As a result, a Unicode character that is encoded with more than one byte is split across multiple output elements.

Example:

splitter = ByteSplitter()
bytes, starts, ends = splitter.split_with_offsets("hello")
print(bytes.numpy(), starts.numpy(), ends.numpy())
[104 101 108 108 111] [0 1 2 3 4] [1 2 3 4 5]
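
Because every piece is exactly one byte, the offsets follow a simple pattern: byte i starts at offset i and ends at offset i + 1. The plain-Python sketch below reproduces the values from the example under that reading; bytes_with_offsets is a hypothetical helper, not part of the library.

```python
def bytes_with_offsets(s):
    """Return (byte values, start offsets, end offsets) for the UTF-8 bytes of s."""
    data = s.encode("utf-8")
    values = list(data)
    starts = list(range(len(data)))   # inclusive start of each byte
    ends = [i + 1 for i in starts]    # exclusive end of each byte
    return values, starts, ends

values, starts, ends = bytes_with_offsets("hello")
print(values, starts, ends)
# [104, 101, 108, 108, 111] [0, 1, 2, 3, 4] [1, 2, 3, 4, 5]
```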

Args
input: A RaggedTensor or Tensor of strings with any shape.

Returns
A tuple (bytes, start_offsets, end_offsets) where:

  • bytes: A RaggedTensor of bytes. Its shape is the shape of the input tensor with an added ragged dimension for the bytes that make up each string.
  • start_offsets: A RaggedTensor of each byte's starting byte offset (inclusive).
  • end_offsets: A RaggedTensor of each byte's ending byte offset (exclusive).