Base class for tokenizer implementations that return offsets.
Inherits From: Tokenizer, SplitterWithOffsets, Splitter

text.TokenizerWithOffsets(name=None)
The offsets indicate which substring from the input string was used to generate each token. E.g., if input is a single string, then each token token[i] was generated from the substring input[starts[i]:ends[i]].
Each TokenizerWithOffsets subclass must implement the tokenize_with_offsets method, which returns a tuple containing both the pieces and the start and end offsets where those pieces occurred in the input string. I.e., if tokens, starts, ends = tokenize_with_offsets(s), then each token tokens[i] corresponds with tf.strings.substr(s, starts[i], ends[i] - starts[i]).
If the tokenizer encodes tokens as strings (and not token ids), then it will usually be the case that these corresponding strings are equal; but that is not technically required. For example, a tokenizer might choose to downcase strings, in which case a token string would not match the original input substring exactly (the sketch after the example below illustrates this).
Example:
class CharTokenizer(TokenizerWithOffsets):
  def tokenize_with_offsets(self, input):
    chars, starts = tf.strings.unicode_split_with_offsets(input, 'UTF-8')
    lengths = tf.expand_dims(tf.strings.length(input), -1)
    ends = tf.concat([starts[..., 1:], tf.cast(lengths, tf.int64)], -1)
    return chars, starts, ends

  def tokenize(self, input):
    return self.tokenize_with_offsets(input)[0]
pieces, starts, ends = CharTokenizer().split_with_offsets("a😊c")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'\xf0\x9f\x98\x8a' b'c'] [0 1 5] [1 5 6]
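To illustrate the point above about token strings not matching the input substrings, here is a minimal sketch; the LowerCharTokenizer class is hypothetical (not part of tensorflow_text) and simply reuses the character offsets from the example above while lower-casing the pieces, so the returned tokens no longer equal the substrings that the offsets point at.

import tensorflow as tf
import tensorflow_text as tf_text

# Hypothetical subclass: same offsets as CharTokenizer above, but the pieces
# are lower-cased, so tokens[i] need not equal the substring at the offsets.
class LowerCharTokenizer(tf_text.TokenizerWithOffsets):
  def tokenize_with_offsets(self, input):
    chars, starts = tf.strings.unicode_split_with_offsets(input, 'UTF-8')
    lengths = tf.expand_dims(tf.strings.length(input), -1)
    ends = tf.concat([starts[..., 1:], tf.cast(lengths, tf.int64)], -1)
    return tf.strings.lower(chars), starts, ends

  def tokenize(self, input):
    return self.tokenize_with_offsets(input)[0]

tokens, starts, ends = LowerCharTokenizer().tokenize_with_offsets("AbC")
print(tokens.numpy())                                   # [b'a' b'b' b'c']
print(tf.strings.substr("AbC", starts, ends - starts))  # [b'A' b'b' b'C']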
Methods
split
split(input)
Alias for Tokenizer.tokenize.
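A quick usage sketch, assuming the concrete tf_text.WhitespaceTokenizer subclass used elsewhere on this page: calling split produces the same result as calling tokenize.

import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
# split simply forwards to tokenize.
print(tokenizer.split("small medium large"))
# tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)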
split_with_offsets
split_with_offsets(input)
Alias for TokenizerWithOffsets.tokenize_with_offsets.
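Similarly, a minimal sketch with tf_text.WhitespaceTokenizer: split_with_offsets returns the same (tokens, start_offsets, end_offsets) triple as tokenize_with_offsets.

import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
# split_with_offsets forwards to tokenize_with_offsets.
tokens, starts, ends = tokenizer.split_with_offsets("a bb ccc")
print(tokens.numpy(), starts.numpy(), ends.numpy())
# [b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]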
tokenize
@abc.abstractmethod
tokenize(input)
Tokenizes the input tensor.
Splits each string in the input tensor into a sequence of tokens. Tokens generally correspond to short substrings of the source string. Tokens can be encoded using either strings or integer ids.
Example:
print(tf_text.WhitespaceTokenizer().tokenize("small medium large"))
tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)
Args
  input: An N-dimensional UTF-8 string (or optionally integer) Tensor or RaggedTensor.

Returns
  An N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor. For each string from the input tensor, the final, extra dimension contains the tokens that string was split into.
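A short sketch of that shape contract, again assuming tf_text.WhitespaceTokenizer: a 1-dimensional batch of strings yields a 2-dimensional RaggedTensor whose extra, ragged dimension holds each string's tokens.

import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
# 1-D input (shape [2]) -> 2-D ragged output; rows may have different lengths.
print(tokenizer.tokenize(["small medium large", "a b"]))
# <tf.RaggedTensor [[b'small', b'medium', b'large'], [b'a', b'b']]>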
tokenize_with_offsets
@abc.abstractmethod
tokenize_with_offsets(input)
Tokenizes the input tensor and returns the result with byte-offsets.
The offsets indicate which substring from the input string was used to generate each token. E.g., if input is a tf.string tensor, then each token token[i] was generated from the substring tf.strings.substr(input, starts[i], len=ends[i]-starts[i]).
Example:
splitter = tf_text.WhitespaceTokenizer()
pieces, starts, ends = splitter.tokenize_with_offsets("a bb ccc")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]
print(tf.strings.substr("a bb ccc", starts, ends-starts))
tf.Tensor([b'a' b'bb' b'ccc'], shape=(3,), dtype=string)
Args
  input: An N-dimensional UTF-8 string (or optionally integer) Tensor or RaggedTensor.

Returns
  A tuple (tokens, start_offsets, end_offsets) where:
    tokens: An N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor containing the tokens.
    start_offsets: An N+1-dimensional integer Tensor or RaggedTensor containing the starting byte offset of each token in the corresponding input string.
    end_offsets: An N+1-dimensional integer Tensor or RaggedTensor containing the exclusive ending byte offset of each token in the corresponding input string.
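And a sketch of the batched case with tf_text.WhitespaceTokenizer: for a 1-dimensional input, all three results are 2-dimensional RaggedTensors that line up token for token.

import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
tokens, starts, ends = tokenizer.tokenize_with_offsets(["a bb", "ccc"])
print(tokens)  # <tf.RaggedTensor [[b'a', b'bb'], [b'ccc']]>
print(starts)  # <tf.RaggedTensor [[0, 2], [0]]>
print(ends)    # <tf.RaggedTensor [[1, 4], [3]]>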