Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer

text.UnicodeCharTokenizer()
Resulting tokens are integers (Unicode codepoints). Scalar input will produce a Tensor output containing the codepoints. Tensor inputs will produce RaggedTensor outputs.
Example:
tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize("abc")
print(tokens)
tf.Tensor([97 98 99], shape=(3,), dtype=int32)
tokens = tokenizer.tokenize(["abc", "de"])
print(tokens)
<tf.RaggedTensor [[97, 98, 99], [100, 101]]>
t = ["abc" + chr(0xfffe) + chr(0x1fffe) ]
tokens = tokenizer.tokenize(t)
print(tokens.to_list())
[[97, 98, 99, 65534, 131070]]
Passing malformed UTF-8 will result in unpredictable behavior. Make sure inputs conform to UTF-8.
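If inputs might contain invalid byte sequences, one option (a sketch, assuming lossy cleanup is acceptable rather than an error) is to normalize them first with tf.strings.unicode_transcode, which can substitute U+FFFD for malformed sequences:

import tensorflow as tf
import tensorflow_text as tf_text

raw = tf.constant([b"abc\xff", b"de"])  # b"\xff" is not valid UTF-8
# Re-encode as UTF-8, replacing malformed sequences with U+FFFD (65533).
clean = tf.strings.unicode_transcode(
    raw, input_encoding="UTF-8", output_encoding="UTF-8", errors="replace")
tokenizer = tf_text.UnicodeCharTokenizer()
print(tokenizer.tokenize(clean))
<tf.RaggedTensor [[97, 98, 99, 65533], [100, 101]]>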
Methods
detokenize
detokenize(input, name=None)
Detokenizes input codepoints (integers) to UTF-8 strings.
Example:
tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize(["abc", "de"])
s = tokenizer.detokenize(tokens)
print(s)
tf.Tensor([b'abc' b'de'], shape=(2,), dtype=string)
Args:
input: A RaggedTensor or Tensor of codepoints (ints) with a rank of at least 1.
name: The name argument that is passed to the op function.
Returns:
An N-1 dimensional string tensor of the text corresponding to the UTF-8 codepoints in the input.
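Note that detokenize accepts any integer tensor of codepoints with rank of at least 1, not only the output of tokenize. A minimal sketch with a hand-built RaggedTensor:

import tensorflow as tf
import tensorflow_text as tf_text

tokenizer = tf_text.UnicodeCharTokenizer()
# Codepoints assembled by hand: "hi" and the euro sign (U+20AC).
codepoints = tf.ragged.constant([[104, 105], [8364]])
print(tokenizer.detokenize(codepoints))
tf.Tensor([b'hi' b'\xe2\x82\xac'], shape=(2,), dtype=string)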
split
split(input)

Alias for Tokenizer.tokenize.
split_with_offsets
split_with_offsets(input)

Alias for TokenizerWithOffsets.tokenize_with_offsets.
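Because both methods are plain aliases, the tokenizer can be dropped into any pipeline that expects a Splitter; a brief sketch:

import tensorflow_text as tf_text

tokenizer = tf_text.UnicodeCharTokenizer()
# split() returns the same values as tokenize()...
print(tokenizer.split(["hi"]))
<tf.RaggedTensor [[104, 105]]>
# ...and split_with_offsets() mirrors tokenize_with_offsets().
tokens, starts, ends = tokenizer.split_with_offsets(["hi"])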
tokenize
tokenize(input)
Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.
Input strings are split on character boundaries using unicode_decode_with_offsets.
Args:
input: A RaggedTensor or Tensor of UTF-8 strings with any shape.
Returns:
A RaggedTensor of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for the tokens (characters) of each string.
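For example (an illustrative sketch), a rank-2 input keeps its (2, 2) shape and gains one ragged dimension holding each string's codepoints:

import tensorflow as tf
import tensorflow_text as tf_text

tokenizer = tf_text.UnicodeCharTokenizer()
batch = tf.constant([["ab", "c"], ["de", "f"]])  # shape (2, 2)
tokens = tokenizer.tokenize(batch)               # shape (2, 2, None)
print(tokens.to_list())
[[[97, 98], [99]], [[100, 101], [102]]]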
tokenize_with_offsets
tokenize_with_offsets(input)
Tokenizes a tensor of UTF-8 strings to Unicode characters.
Example:
tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize_with_offsets("a"+chr(8364)+chr(10340))
print(tokens[0])
tf.Tensor([ 97 8364 10340], shape=(3,), dtype=int32)
print(tokens[1])
tf.Tensor([0 1 4], shape=(3,), dtype=int64)
print(tokens[2])
tf.Tensor([1 4 7], shape=(3,), dtype=int64)
The start_offsets and end_offsets are byte indices into the original string. When calling with multiple string inputs, the offset indices are relative to the individual source strings:
toks = tokenizer.tokenize_with_offsets(["a" + chr(8364), "b" + chr(10300)])
print(toks[0])
<tf.RaggedTensor [[97, 8364], [98, 10300]]>
print(toks[1])
<tf.RaggedTensor [[0, 1], [0, 1]]>
print(toks[2])
<tf.RaggedTensor [[1, 4], [1, 4]]>
Args:
input: A RaggedTensor or Tensor of UTF-8 strings with any shape.
Returns:
A tuple (tokens, start_offsets, end_offsets) where tokens is a RaggedTensor of codepoints (ints), start_offsets is a RaggedTensor of the tokens' starting byte offsets, and end_offsets is a RaggedTensor of the tokens' ending byte offsets (exclusive).
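Since the offsets are byte positions into the source string, each token's original bytes can be sliced back out; a small sketch using tf.strings.substr with unit="BYTE":

import tensorflow as tf
import tensorflow_text as tf_text

tokenizer = tf_text.UnicodeCharTokenizer()
s = "a" + chr(8364) + chr(10340)
tokens, starts, ends = tokenizer.tokenize_with_offsets(s)
# Recover each character's UTF-8 bytes from the original string;
# a scalar string broadcasts against the vectors of positions/lengths.
chars = tf.strings.substr(s, pos=starts, len=ends - starts, unit="BYTE")
print(chars)
tf.Tensor([b'a' b'\xe2\x82\xac' b'\xe2\xa1\xa4'], shape=(3,), dtype=string)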