Base class for detokenizer implementations.
```python
text.Detokenizer(
    name=None
)
```
A `Detokenizer` is a module that combines tokens to form strings. Generally, subclasses of `Detokenizer` will also be subclasses of `Tokenizer`, and the `detokenize` method will be the inverse of the `tokenize` method; i.e., `tokenizer.detokenize(tokenizer.tokenize(s)) == s`.
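To make the round-trip property concrete, here is a minimal sketch in plain Python (deliberately independent of TensorFlow, so it only illustrates the contract, not the real API): a toy whitespace tokenizer whose `detokenize` exactly inverts its `tokenize`.

```python
class ToyWhitespaceTokenizer:
    """Toy tokenizer: splits on single spaces; detokenize is its inverse.

    Illustrative only -- not a tf_text class. Real tokenizers operate on
    Tensors/RaggedTensors and may not be exactly invertible.
    """

    def tokenize(self, s):
        # "hello world" -> ["hello", "world"]
        return s.split(" ")

    def detokenize(self, tokens):
        # ["hello", "world"] -> "hello world"
        return " ".join(tokens)


tokenizer = ToyWhitespaceTokenizer()
s = "hello world"
# The inverse property described above:
assert tokenizer.detokenize(tokenizer.tokenize(s)) == s
```

Note that the inverse property can be approximate in practice (e.g., a whitespace tokenizer cannot distinguish one space from two), which is why the text says `detokenize` is "generally" the inverse of `tokenize`.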
Each `Detokenizer` subclass must implement a `detokenize` method, which combines tokens together to form strings. E.g.:

```python
class SimpleDetokenizer(tf_text.Detokenizer):
  def detokenize(self, input):
    return tf.strings.reduce_join(input, axis=-1, separator=" ")

text = tf.ragged.constant([["hello", "world"], ["a", "b", "c"]])
print(SimpleDetokenizer().detokenize(text))
# tf.Tensor([b'hello world' b'a b c'], shape=(2,), dtype=string)
```
Methods

detokenize

```python
@abc.abstractmethod
detokenize(
    input
)
```
Assembles the tokens in the input tensor into a string.

Generally, `detokenize` is the inverse of the `tokenize` method, and can be used to reconstruct a string from a set of tokens. This is especially helpful in cases where the tokens are integer ids, such as indexes into a vocabulary table: in that case, the tokenized encoding is not very human-readable (since it is just a list of integers), so the `detokenize` method can be used to turn it back into something more readable.
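The integer-id case can be sketched in a few lines of plain Python (the vocabulary and helper below are hypothetical, for illustration only; a real implementation would use a vocabulary lookup table over Tensors):

```python
# Hypothetical toy vocabulary: index -> token string.
vocab = ["hello", "world", "a", "b", "c"]

def detokenize_ids(ids):
    """Map each integer id back to its vocabulary entry and join with spaces."""
    return " ".join(vocab[i] for i in ids)

# The encoded form [0, 1] is opaque; detokenizing recovers readable text.
print(detokenize_ids([0, 1]))  # prints "hello world"
```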
| Args | |
|---|---|
| `input` | An N-dimensional UTF-8 string or integer `Tensor` or `RaggedTensor`. |
| Returns | |
|---|---|
| An (N-1)-dimensional UTF-8 string `Tensor` or `RaggedTensor`. |