An abstract base class for splitters that return offsets.
Inherits From: Splitter
text.SplitterWithOffsets(
    name=None
)
Each SplitterWithOffsets subclass must implement the split_with_offsets
method, which returns a tuple containing the pieces together with the start
and end offsets at which those pieces occur in the input string. For example:
import tensorflow as tf
from tensorflow_text import SplitterWithOffsets

class CharSplitter(SplitterWithOffsets):
  def split_with_offsets(self, input):
    # Split into Unicode characters; `starts` holds the byte offset of each one.
    chars, starts = tf.strings.unicode_split_with_offsets(input, 'UTF-8')
    # The last piece ends at the byte length of the input string.
    lengths = tf.expand_dims(tf.strings.length(input), -1)
    ends = tf.concat([starts[..., 1:], tf.cast(lengths, tf.int64)], -1)
    return chars, starts, ends
  def split(self, input):
    return self.split_with_offsets(input)[0]

pieces, starts, ends = CharSplitter().split_with_offsets("a😊c")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'\xf0\x9f\x98\x8a' b'c'] [0 1 5] [1 5 6]
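The offsets in the output above are byte positions in the UTF-8 encoding, not character indices, which is why the emoji spans offsets 1 to 5. As a rough illustration of that convention, here is a plain-Python analogue that does not use TensorFlow; the helper name `char_split_with_offsets` is hypothetical and is not part of the library.

```python
def char_split_with_offsets(text):
    """Split `text` into characters and report UTF-8 byte offsets.

    Hypothetical helper for illustration only; it mimics the offset
    convention of the CharSplitter example, not its implementation.
    """
    pieces, starts, ends = [], [], []
    offset = 0
    for ch in text:
        encoded = ch.encode("utf-8")
        pieces.append(encoded)
        starts.append(offset)   # inclusive start, in bytes
        offset += len(encoded)
        ends.append(offset)     # exclusive end, in bytes
    return pieces, starts, ends


pieces, starts, ends = char_split_with_offsets("a😊c")
print(pieces, starts, ends)
# [b'a', b'\xf0\x9f\x98\x8a', b'c'] [0, 1, 5] [1, 5, 6]
```

Each end offset equals the next start offset because the characters are contiguous; a splitter that drops delimiters would leave gaps between them.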
Methods
split
@abc.abstractmethod
split( input )
Splits the input tensor into pieces.
Generally, the pieces returned by a splitter correspond to substrings of the original string, and can be encoded using either strings or integer ids.
Example:
print(tf_text.WhitespaceTokenizer().split("small medium large"))
tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)
Args | |
---|---|
input | An N-dimensional UTF-8 string (or optionally integer) Tensor or RaggedTensor. |
Returns | |
---|---|
An N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor. For each string from the input tensor, the final, extra dimension contains the pieces that string was split into. |
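The N to N+1 dimension rule can be pictured with nested Python lists standing in for tensors. The sketch below is only an analogy under that assumption: `split_nested` is a hypothetical helper, it splits on whitespace with `str.split`, and it is not how any TensorFlow Text splitter is implemented.

```python
def split_nested(value):
    """Split every string in a nested list, adding one innermost level.

    Hypothetical helper: a scalar string (N=0) becomes a flat list (1-D),
    a list of strings (N=1) becomes a list of lists (2-D), and so on.
    """
    if isinstance(value, str):
        return value.split()
    return [split_nested(item) for item in value]


print(split_nested("small medium large"))
# ['small', 'medium', 'large']
print(split_nested(["a bb", "ccc"]))
# [['a', 'bb'], ['ccc']]
```

The second result has rows of different lengths, which is the situation a RaggedTensor is meant to represent.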
split_with_offsets
@abc.abstractmethod
split_with_offsets( input )
Splits the input tensor, and returns the resulting pieces with offsets.
Example:
splitter = tf_text.WhitespaceTokenizer()
pieces, starts, ends = splitter.split_with_offsets("a bb ccc")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]
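In the example above, each `[start, end)` pair is a half-open byte range into the input, so slicing the encoded input with those offsets gives back the piece. The following plain-Python sketch illustrates that relationship without TensorFlow. The helper `whitespace_split_with_offsets` is hypothetical, and its `\S+` byte pattern only treats ASCII whitespace as a separator, which is an assumption of this sketch and may differ from how `WhitespaceTokenizer` handles other Unicode whitespace.

```python
import re


def whitespace_split_with_offsets(text):
    """Return (pieces, starts, ends) with byte offsets into UTF-8 `text`.

    Hypothetical helper for illustration; separators are ASCII whitespace.
    """
    data = text.encode("utf-8")
    pieces, starts, ends = [], [], []
    for match in re.finditer(rb"\S+", data):
        pieces.append(match.group())
        starts.append(match.start())  # inclusive start, in bytes
        ends.append(match.end())      # exclusive end, in bytes
    return pieces, starts, ends


text = "a bb ccc"
pieces, starts, ends = whitespace_split_with_offsets(text)
print(pieces, starts, ends)
# [b'a', b'bb', b'ccc'] [0, 2, 5] [1, 4, 8]

# Slicing the encoded input with the offsets recovers each piece.
data = text.encode("utf-8")
recovered = [data[s:e] for s, e in zip(starts, ends)]
print(recovered == pieces)
# True
```

Keeping the offsets alongside the pieces makes it possible to map a token back to its position in the original text, for example to highlight it.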
Args | |
---|---|
input | An N-dimensional UTF-8 string (or optionally integer) Tensor or RaggedTensor. |
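Returns | |
---|---|
A tuple (pieces, start_offsets, end_offsets) where: pieces is an N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor; start_offsets is an N+1-dimensional integer Tensor or RaggedTensor containing the starting index of each piece (a byte index for string inputs); and end_offsets is an N+1-dimensional integer Tensor or RaggedTensor containing the exclusive ending index of each piece (a byte index for string inputs). |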