text.TokenizerWithOffsets

Base class for tokenizer implementations that return offsets.

Inherits From: Tokenizer, SplitterWithOffsets, Splitter

text.TokenizerWithOffsets(
    name=None
)

The offsets indicate which substring of the input string was used to generate each token. E.g., if input is a single string, then each token tokens[i] was generated from the substring input[starts[i]:ends[i]], where the offsets are byte indices into the UTF-8 encoded string.

Each TokenizerWithOffsets subclass must implement the tokenize_with_offsets method, which returns a tuple containing the pieces together with the start and end offsets where those pieces occurred in the input string. I.e., if tokens, starts, ends = tokenize_with_offsets(s), then each token tokens[i] corresponds to tf.strings.substr(s, starts[i], ends[i] - starts[i]).
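The offset contract described above can be illustrated without TensorFlow. The following is a minimal plain-Python sketch, not the library's implementation; the helper name `whitespace_tokenize_with_offsets` is hypothetical, and it assumes tokens are separated by ASCII whitespace. It shows that slicing the UTF-8 bytes of the input with `starts[i]` and `ends[i]` recovers each token.

```python
import re


def whitespace_tokenize_with_offsets(text):
    """Hypothetical helper: split on ASCII whitespace, reporting byte offsets."""
    data = text.encode("utf-8")
    tokens, starts, ends = [], [], []
    # On a bytes pattern, \S+ matches maximal runs of non-whitespace bytes.
    for match in re.finditer(rb"\S+", data):
        tokens.append(match.group())
        starts.append(match.start())  # inclusive start, in bytes
        ends.append(match.end())      # exclusive end, in bytes
    return tokens, starts, ends


text = "a bb ccc"
tokens, starts, ends = whitespace_tokenize_with_offsets(text)

# Slicing the encoded input with the offsets recovers every token.
data = text.encode("utf-8")
recovered = [data[s:e] for s, e in zip(starts, ends)]
print(tokens, starts, ends)
print(recovered == tokens)
```

The slice `data[s:e]` plays the same role here as `tf.strings.substr(s, start, end - start)` does for tensors: a start position plus a length of `end - start` bytes.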

If the tokenizer encodes tokens as strings (and not token ids), then these corresponding strings will usually be equal, but that is not technically required. For example, a tokenizer might choose to downcase strings.
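To make the downcasing case concrete, here is a small plain-Python sketch (again not the library's code; `lowercase_tokenize_with_offsets` is a hypothetical helper that assumes ASCII whitespace separators). The returned tokens are lowercased, so they differ from the source substrings, yet the offsets still point at the original text.

```python
import re


def lowercase_tokenize_with_offsets(text):
    """Hypothetical helper: whitespace tokens, lowercased, with byte offsets."""
    data = text.encode("utf-8")
    tokens, starts, ends = [], [], []
    for match in re.finditer(rb"\S+", data):
        # bytes.lower() only lowercases ASCII letters; enough for this sketch.
        tokens.append(match.group().lower())
        starts.append(match.start())
        ends.append(match.end())
    return tokens, starts, ends


text = "Hello World"
tokens, starts, ends = lowercase_tokenize_with_offsets(text)

# The offsets index the original input, not the normalized tokens.
data = text.encode("utf-8")
source_spans = [data[s:e] for s, e in zip(starts, ends)]
print(tokens)        # normalized tokens
print(source_spans)  # original substrings the tokens came from
```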

Example:

import tensorflow as tf
import tensorflow_text as tf_text
class CharTokenizer(tf_text.TokenizerWithOffsets):
  def tokenize_with_offsets(self, input):
    chars, starts = tf.strings.unicode_split_with_offsets(input, 'UTF-8')
    lengths = tf.expand_dims(tf.strings.length(input), -1)
    ends = tf.concat([starts[..., 1:], tf.cast(lengths, tf.int64)], -1)
    return chars, starts, ends
  def tokenize(self, input):
    return self.tokenize_with_offsets(input)[0]
pieces, starts, ends = CharTokenizer().split_with_offsets("a😊c")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'\xf0\x9f\x98\x8a' b'c'] [0 1 5] [1 5 6]
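The offset arithmetic in the example above can be traced in plain Python. This sketch is an illustration only, and `char_offsets` is a hypothetical name. It mirrors the same idea: each character's end offset is the next character's start, and the last character ends at the total byte length of the string.

```python
def char_offsets(text):
    """Hypothetical helper: per-character UTF-8 pieces with byte offsets."""
    chars = [ch.encode("utf-8") for ch in text]
    starts = []
    position = 0
    for piece in chars:
        starts.append(position)
        position += len(piece)  # multi-byte characters advance by more than 1
    # Each piece ends where the next one starts; the last ends at the total
    # byte length. The guard keeps the three lists the same length for "".
    ends = starts[1:] + [position] if chars else []
    return chars, starts, ends


chars, starts, ends = char_offsets("a😊c")
print(chars, starts, ends)
```

The emoji occupies four bytes in UTF-8, which is why its offsets span 1 to 5 rather than 1 to 2.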

Methods

`split`

split(
    input
)

Alias for Tokenizer.tokenize.

`split_with_offsets`

split_with_offsets(
    input
)

Alias for TokenizerWithOffsets.tokenize_with_offsets.
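One way to picture the alias relationship is a base class whose `split` methods simply delegate to the `tokenize` methods. The sketch below is an assumption about the general pattern, not the library's source; `SketchTokenizerWithOffsets` and `SketchWhitespaceTokenizer` are hypothetical stand-ins written in plain Python.

```python
import abc
import re


class SketchTokenizerWithOffsets(abc.ABC):
    """Hypothetical stand-in for the base class; not the library code."""

    @abc.abstractmethod
    def tokenize(self, text):
        ...

    @abc.abstractmethod
    def tokenize_with_offsets(self, text):
        ...

    def split(self, text):
        # Alias: forwards to tokenize.
        return self.tokenize(text)

    def split_with_offsets(self, text):
        # Alias: forwards to tokenize_with_offsets.
        return self.tokenize_with_offsets(text)


class SketchWhitespaceTokenizer(SketchTokenizerWithOffsets):
    def tokenize_with_offsets(self, text):
        matches = list(re.finditer(rb"\S+", text.encode("utf-8")))
        tokens = [m.group() for m in matches]
        starts = [m.start() for m in matches]
        ends = [m.end() for m in matches]
        return tokens, starts, ends

    def tokenize(self, text):
        return self.tokenize_with_offsets(text)[0]


tokenizer = SketchWhitespaceTokenizer()
print(tokenizer.split("a bb"))
print(tokenizer.split_with_offsets("a bb"))
```

Under this pattern a subclass only implements the two `tokenize` methods and gets both `split` methods for free.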

`tokenize`

@abc.abstractmethod
tokenize(
    input
)

Tokenizes the input tensor.

Splits each string in the input tensor into a sequence of tokens. Tokens generally correspond to short substrings of the source string. Tokens can be encoded using either strings or integer ids.

Example:

print(tf_text.WhitespaceTokenizer().tokenize("small medium large"))
tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)

Args
`input`	An N-dimensional UTF-8 string (or optionally integer) `Tensor` or `RaggedTensor`.

Returns
An N+1-dimensional UTF-8 string or integer `Tensor` or `RaggedTensor`. For each string from the input tensor, the final, extra dimension contains the tokens that string was split into.
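The "N-dimensional in, N+1-dimensional out" rule can be pictured with nested Python lists standing in for tensors. This is a rough analogy under that assumption, not library code; `tokenize_nested` is a hypothetical helper that splits on ASCII whitespace.

```python
def tokenize_nested(value):
    """Hypothetical helper: whitespace-tokenize every string in a nested list."""
    if isinstance(value, str):
        # A single string becomes a list of tokens: one extra dimension.
        return value.encode("utf-8").split()
    # Recurse into lists, preserving the outer structure.
    return [tokenize_nested(item) for item in value]


scalar_result = tokenize_nested("small medium large")       # 0-D -> 1-D
batch_result = tokenize_nested(["small medium", "large"])   # 1-D -> 2-D
print(scalar_result)
print(batch_result)
```

The inner lists of `batch_result` have different lengths, which is the situation a `RaggedTensor` is meant to represent.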

`tokenize_with_offsets`

@abc.abstractmethod
tokenize_with_offsets(
    input
)

Tokenizes the input tensor and returns the result with byte offsets.

The offsets indicate which substring of the input string was used to generate each token. E.g., if input is a tf.string tensor, then each token tokens[i] was generated from the substring tf.strings.substr(input, starts[i], len=ends[i]-starts[i]).

Example:

splitter = tf_text.WhitespaceTokenizer()
pieces, starts, ends = splitter.tokenize_with_offsets("a bb ccc")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]
print(tf.strings.substr("a bb ccc", starts, ends-starts))
tf.Tensor([b'a' b'bb' b'ccc'], shape=(3,), dtype=string)

Args
`input`	An N-dimensional UTF-8 string (or optionally integer) `Tensor` or `RaggedTensor`.

Returns
A tuple `(tokens, start_offsets, end_offsets)` where:

`tokens` is an N+1-dimensional UTF-8 string or integer `Tensor` or `RaggedTensor`.
`start_offsets` is an N+1-dimensional integer `Tensor` or `RaggedTensor` containing the starting indices of each token (byte indices for input strings).
`end_offsets` is an N+1-dimensional integer `Tensor` or `RaggedTensor` containing the exclusive ending indices of each token (byte indices for input strings).
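Because the offsets are byte indices, they should be applied to the UTF-8 encoded bytes rather than to a Python `str`, which is indexed by code point. The following plain-Python sketch (an illustration, with the offsets for "a😊c" written out by hand) shows the difference, along with one possible way to convert byte offsets to character indices when they fall on character boundaries.

```python
text = "a😊c"
starts, ends = [0, 1, 5], [1, 5, 6]  # byte offsets, exclusive ends

data = text.encode("utf-8")

# Correct: slice the encoded bytes, then decode each piece.
by_bytes = [data[s:e].decode("utf-8") for s, e in zip(starts, ends)]

# Incorrect: slicing the str treats the byte offsets as character indices.
by_chars = [text[s:e] for s, e in zip(starts, ends)]

# Converting a byte offset to a character index: count the characters in the
# prefix. This assumes every offset lies on a character boundary.
char_starts = [len(data[:s].decode("utf-8")) for s in starts]
char_ends = [len(data[:e].decode("utf-8")) for e in ends]

print(by_bytes)
print(by_chars)
print(char_starts, char_ends)
```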
