
text.TokenizerWithOffsets

Base class for tokenizer implementations that return offsets.

Inherits From: Tokenizer, SplitterWithOffsets, Splitter

The offsets indicate which substring from the input string was used to generate each token. E.g., if input is a single string, then each token token[i] was generated from the substring input[starts[i]:ends[i]].

Each TokenizerWithOffsets subclass must implement the tokenize_with_offsets method, which returns a tuple containing both the pieces and the start and end offsets where those pieces occurred in the input string. I.e., if tokens, starts, ends = tokenize_with_offsets(s), then each token tokens[i] corresponds with tf.strings.substr(s, starts[i], ends[i] - starts[i]).

If the tokenizer encodes tokens as strings (rather than integer ids), then it will usually be the case that each token string equals the corresponding input substring; but that is not strictly required. For example, a tokenizer might choose to downcase strings.

Example:

class CharTokenizer(TokenizerWithOffsets):
  def tokenize_with_offsets(self, input):
    # Split into unicode characters and the starting byte offset of each.
    chars, starts = tf.strings.unicode_split_with_offsets(input, 'UTF-8')
    # Each character ends where the next one starts; the last character
    # ends at the total byte length of the input string.
    lengths = tf.expand_dims(tf.strings.length(input), -1)
    ends = tf.concat([starts[..., 1:], tf.cast(lengths, tf.int64)], -1)
    return chars, starts, ends

  def tokenize(self, input):
    # Tokens only, dropping the offsets.
    return self.tokenize_with_offsets(input)[0]
pieces, starts, ends = CharTokenizer().split_with_offsets("a😊c")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'\xf0\x9f\x98\x8a' b'c'] [0 1 5] [1 5 6]

Methods

split

View source

Alias for Tokenizer.tokenize.

split_with_offsets

View source

Alias for TokenizerWithOffsets.tokenize_with_offsets.
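Because these are aliases, a TokenizerWithOffsets can be passed anywhere a Splitter or SplitterWithOffsets is expected. The following sketch (not part of this class; it assumes tensorflow_text is installed and uses tf_text.WhitespaceTokenizer purely for illustration) shows the aliases behaving identically to the methods they delegate to:

import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
# split delegates to tokenize, so both produce the same tokens.
print(tokenizer.split("small medium large"))
tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)
# split_with_offsets delegates to tokenize_with_offsets.
tokens, starts, ends = tokenizer.split_with_offsets("small medium large")
print(tokens.numpy(), starts.numpy(), ends.numpy())
[b'small' b'medium' b'large'] [0 6 13] [5 12 18]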

tokenize

View source

Tokenizes the input tensor.

Splits each string in the input tensor into a sequence of tokens. Tokens generally correspond to short substrings of the source string. Tokens can be encoded using either strings or integer ids.

Example:

print(tf_text.WhitespaceTokenizer().tokenize("small medium large"))
tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)

Args
input: An N-dimensional UTF-8 string (or optionally integer) Tensor or RaggedTensor.

Returns
An N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor. For each string from the input tensor, the final, extra dimension contains the tokens that string was split into.
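For example (a sketch using tf_text.WhitespaceTokenizer for illustration, assuming tensorflow_text is installed), a rank-1 batch of strings yields a rank-2 RaggedTensor with one row of tokens per input string:

import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
# Rank-1 input (a batch of two strings) -> rank-2 ragged output.
print(tokenizer.tokenize(["small medium large", "tiny"]))
<tf.RaggedTensor [[b'small', b'medium', b'large'], [b'tiny']]>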

tokenize_with_offsets

View source

Tokenizes the input tensor and returns the result with offsets.

The offsets indicate which substring from the input string was used to generate each token. E.g., if input is a single string, then each token token[i] was generated from the substring input[starts[i]:ends[i]].

Example:

splitter = tf_text.WhitespaceTokenizer()
pieces, starts, ends = splitter.tokenize_with_offsets("a bb ccc")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]
print(tf.strings.substr("a bb ccc", starts, ends-starts))
tf.Tensor([b'a' b'bb' b'ccc'], shape=(3,), dtype=string)

Args
input: An N-dimensional UTF-8 string (or optionally integer) Tensor or RaggedTensor.

Returns
A tuple (tokens, start_offsets, end_offsets) where:

  • tokens is an N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor.
  • start_offsets is an N+1-dimensional integer Tensor or RaggedTensor containing the starting indices of each token (byte indices for input strings).
  • end_offsets is an N+1-dimensional integer Tensor or RaggedTensor containing the exclusive ending indices of each token (byte indices for input strings).
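For a batched input, the offsets are returned as RaggedTensors that line up with the tokens. A minimal sketch (again using tf_text.WhitespaceTokenizer for illustration; offsets are byte positions within each input string):

import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
tokens, starts, ends = tokenizer.tokenize_with_offsets(["a bb ccc", "x yz"])
print(tokens)
<tf.RaggedTensor [[b'a', b'bb', b'ccc'], [b'x', b'yz']]>
print(starts)
<tf.RaggedTensor [[0, 2, 5], [0, 2]]>
print(ends)
<tf.RaggedTensor [[1, 4, 8], [1, 4]]>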