Base class for tokenizer implementations that return offsets.
Inherits From: Tokenizer, SplitterWithOffsets, Splitter

text.TokenizerWithOffsets(name=None)
The offsets indicate which substring from the input string was used to generate each token. E.g., if input is a single string, then each token token[i] was generated from the substring input[starts[i]:ends[i]].
Each TokenizerWithOffsets subclass must implement the tokenize_with_offsets method, which returns a tuple containing both the pieces and the start and end offsets where those pieces occurred in the input string. I.e., if tokens, starts, ends = tokenize_with_offsets(s), then each token tokens[i] corresponds with tf.strings.substr(s, starts[i], ends[i] - starts[i]).
If the tokenizer encodes tokens as strings (and not token ids), then it will usually be the case that these corresponding strings are equal; but that is not technically required. For example, a tokenizer might choose to downcase strings, in which case a token string would not match the original input substring exactly (the sketch after the example below illustrates this).
Example:
class CharTokenizer(TokenizerWithOffsets):
  def tokenize_with_offsets(self, input):
    chars, starts = tf.strings.unicode_split_with_offsets(input, 'UTF-8')
    lengths = tf.expand_dims(tf.strings.length(input), -1)
    ends = tf.concat([starts[..., 1:], tf.cast(lengths, tf.int64)], -1)
    return chars, starts, ends

  def tokenize(self, input):
    return self.tokenize_with_offsets(input)[0]
pieces, starts, ends = CharTokenizer().split_with_offsets("a😊c")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'\xf0\x9f\x98\x8a' b'c'] [0 1 5] [1 5 6]
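To illustrate the point above about token strings not matching the input substrings, here is a minimal sketch; the LowerCharTokenizer class is hypothetical (not part of tensorflow_text) and simply reuses the character offsets from the example above while lower-casing the pieces, so the returned tokens no longer equal the substrings that the offsets point at.

import tensorflow as tf
import tensorflow_text as tf_text

# Hypothetical subclass: same offsets as CharTokenizer above, but the pieces
# are lower-cased, so tokens[i] need not equal the substring at the offsets.
class LowerCharTokenizer(tf_text.TokenizerWithOffsets):
  def tokenize_with_offsets(self, input):
    chars, starts = tf.strings.unicode_split_with_offsets(input, 'UTF-8')
    lengths = tf.expand_dims(tf.strings.length(input), -1)
    ends = tf.concat([starts[..., 1:], tf.cast(lengths, tf.int64)], -1)
    return tf.strings.lower(chars), starts, ends

  def tokenize(self, input):
    return self.tokenize_with_offsets(input)[0]

tokens, starts, ends = LowerCharTokenizer().tokenize_with_offsets("AbC")
print(tokens.numpy())                                   # [b'a' b'b' b'c']
print(tf.strings.substr("AbC", starts, ends - starts))  # [b'A' b'b' b'C']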
Methods
split
split(input)
Alias for Tokenizer.tokenize.
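A quick usage sketch, assuming the concrete tf_text.WhitespaceTokenizer subclass used elsewhere on this page: calling split produces the same result as calling tokenize.

import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
# split simply forwards to tokenize.
print(tokenizer.split("small medium large"))
# tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)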
split_with_offsets
split_with_offsets(input)
Alias for TokenizerWithOffsets.tokenize_with_offsets.
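Similarly, a minimal sketch with tf_text.WhitespaceTokenizer: split_with_offsets returns the same (tokens, start_offsets, end_offsets) triple as tokenize_with_offsets.

import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
# split_with_offsets forwards to tokenize_with_offsets.
tokens, starts, ends = tokenizer.split_with_offsets("a bb ccc")
print(tokens.numpy(), starts.numpy(), ends.numpy())
# [b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]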
tokenize
@abc.abstractmethod
tokenize(input)
Tokenizes the input tensor.
Splits each string in the input tensor into a sequence of tokens. Tokens generally correspond to short substrings of the source string. Tokens can be encoded using either strings or integer ids.
Example:
print(tf_text.WhitespaceTokenizer().tokenize("small medium large"))
tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)
Args
  input: An N-dimensional UTF-8 string (or optionally integer) Tensor or RaggedTensor.

Returns
  An N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor. For each string from the input tensor, the final, extra dimension contains the tokens that string was split into.
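A short sketch of that shape contract, again assuming tf_text.WhitespaceTokenizer: a 1-dimensional batch of strings yields a 2-dimensional RaggedTensor whose extra, ragged dimension holds each string's tokens.

import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
# 1-D input (shape [2]) -> 2-D ragged output; rows may have different lengths.
print(tokenizer.tokenize(["small medium large", "a b"]))
# <tf.RaggedTensor [[b'small', b'medium', b'large'], [b'a', b'b']]>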
tokenize_with_offsets
@abc.abstractmethod
tokenize_with_offsets(input)
Tokenizes the input tensor and returns the result with byte-offsets.
The offsets indicate which substring from the input string was used to generate each token. E.g., if input is a tf.string tensor, then each token token[i] was generated from the substring tf.strings.substr(input, starts[i], len=ends[i]-starts[i]).
Example:
splitter = tf_text.WhitespaceTokenizer()
pieces, starts, ends = splitter.tokenize_with_offsets("a bb ccc")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]
print(tf.strings.substr("a bb ccc", starts, ends-starts))
tf.Tensor([b'a' b'bb' b'ccc'], shape=(3,), dtype=string)
Args
  input: An N-dimensional UTF-8 string (or optionally integer) Tensor or RaggedTensor.

Returns
  A tuple (tokens, start_offsets, end_offsets) where:
    tokens: An N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor containing the tokens.
    start_offsets: An N+1-dimensional integer Tensor or RaggedTensor containing the starting byte offset of each token in the corresponding input string.
    end_offsets: An N+1-dimensional integer Tensor or RaggedTensor containing the exclusive ending byte offset of each token in the corresponding input string.
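And a sketch of the batched case with tf_text.WhitespaceTokenizer: for a 1-dimensional input, all three results are 2-dimensional RaggedTensors that line up token for token.

import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
tokens, starts, ends = tokenizer.tokenize_with_offsets(["a bb", "ccc"])
print(tokens)  # <tf.RaggedTensor [[b'a', b'bb'], [b'ccc']]>
print(starts)  # <tf.RaggedTensor [[0, 2], [0]]>
print(ends)    # <tf.RaggedTensor [[1, 4], [3]]>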