text.UnicodeCharTokenizer

Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer

Resulting tokens are integers (Unicode code points). A scalar input produces a Tensor of code points; tensor inputs produce RaggedTensor outputs.

Example:

import tensorflow_text as tf_text
tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize("abc")
print(tokens)
tf.Tensor([97 98 99], shape=(3,), dtype=int32)
tokens = tokenizer.tokenize(["abc", "de"])
print(tokens)
<tf.RaggedTensor [[97, 98, 99], [100, 101]]>
t = ["abc" + chr(0xfffe) + chr(0x1fffe) ]
tokens = tokenizer.tokenize(t)
print(tokens.to_list())
[[97, 98, 99, 65534, 131070]]

Passing malformed UTF-8 will result in unpredictable behavior. Make sure inputs conform to UTF-8.
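
One way to guard against malformed input is to validate the bytes before they reach the tokenizer. The sketch below uses only the Python standard library; the helper name `is_valid_utf8` is hypothetical and is not part of tf_text.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

# Two well-formed inputs, one invalid lead byte, one truncated sequence.
samples = [b"abc", "a\u20ac".encode("utf-8"), b"\xff\xfe", b"\xe2\x82"]

# Keep only the inputs that are safe to pass on to a tokenizer.
valid = [s for s in samples if is_valid_utf8(s)]
```

Whether to drop, repair, or reject invalid inputs depends on the application; filtering is shown here only as the simplest option.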

Methods

detokenize

Detokenizes input codepoints (integers) to UTF-8 strings.

Example:

tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize(["abc", "de"])
s = tokenizer.detokenize(tokens)
print(s)
tf.Tensor([b'abc' b'de'], shape=(2,), dtype=string)

Args
input A RaggedTensor or Tensor of codepoints (ints) with a rank of at least 1.
name The name argument that is passed to the op function.

Returns
An (N-1)-dimensional string tensor containing the UTF-8 encoded text for the code points in the input, where N is the rank of the input.
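
The rank reduction can be illustrated without TensorFlow. The following is a plain-Python sketch of the same idea, not the library's implementation; `detokenize_row` is a hypothetical helper, and nested lists stand in for a RaggedTensor.

```python
def detokenize_row(codepoints):
    """Join a list of code points into one UTF-8 encoded byte string."""
    return "".join(chr(cp) for cp in codepoints).encode("utf-8")

# A rank-2 "ragged" input: one row of code points per string.
ragged = [[97, 98, 99], [100, 101]]

# Each innermost row collapses to a single string, giving a rank-1 result.
strings = [detokenize_row(row) for row in ragged]
```

Here `strings` holds one byte string per input row, mirroring the `[b'abc' b'de']` result in the example above.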

split

Alias for Tokenizer.tokenize.

split_with_offsets

Alias for TokenizerWithOffsets.tokenize_with_offsets.

tokenize

Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.

Input strings are split on character boundaries using unicode_decode_with_offsets.

Args
input A RaggedTensor or Tensor of UTF-8 strings with any shape.

Returns
A RaggedTensor of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for tokens (characters) of each string.

tokenize_with_offsets

Tokenizes a tensor of UTF-8 strings to Unicode characters.

Example:

tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize_with_offsets("a"+chr(8364)+chr(10340))
print(tokens[0])
tf.Tensor([   97  8364 10340], shape=(3,), dtype=int32)
print(tokens[1])
tf.Tensor([0 1 4], shape=(3,), dtype=int64)
print(tokens[2])
tf.Tensor([1 4 7], shape=(3,), dtype=int64)

The start_offsets and end_offsets are byte indices into the original string. When called with multiple input strings, the offsets are relative to each individual source string:

toks = tokenizer.tokenize_with_offsets(["a"+chr(8364), "b"+chr(10300) ])
print(toks[0])
<tf.RaggedTensor [[97, 8364], [98, 10300]]>
print(toks[1])
<tf.RaggedTensor [[0, 1], [0, 1]]>
print(toks[2])
<tf.RaggedTensor [[1, 4], [1, 4]]>
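
The offsets in these examples follow from the UTF-8 byte width of each character. The pure-Python sketch below reproduces them for a single string; it is an illustration of the arithmetic, not how the library computes them, and `char_tokens_with_offsets` is a hypothetical name.

```python
def char_tokens_with_offsets(text: str):
    """Return (code points, start byte offsets, end byte offsets) for `text`."""
    tokens, starts, ends = [], [], []
    pos = 0
    for ch in text:
        width = len(ch.encode("utf-8"))  # 1 to 4 bytes per character
        tokens.append(ord(ch))
        starts.append(pos)
        pos += width
        ends.append(pos)  # end offset is the start of the next character
    return tokens, starts, ends

# Same input as the single-string example: one 1-byte and two 3-byte characters.
tokens, starts, ends = char_tokens_with_offsets("a" + chr(8364) + chr(10340))
```

For this input, `starts` is `[0, 1, 4]` and `ends` is `[1, 4, 7]`, matching the tensors printed above.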

Args
input A RaggedTensor or Tensor of UTF-8 strings with any shape.

Returns
A tuple (tokens, start_offsets, end_offsets) where:

  • tokens: A RaggedTensor of code points (integer type).
  • start_offsets: A RaggedTensor of each token's starting byte offset (inclusive).
  • end_offsets: A RaggedTensor of each token's ending byte offset (exclusive).
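
Because the offsets index bytes rather than characters, they should be applied to the encoded string. A minimal sketch, using the offset values from the single-string example above and plain Python lists in place of tensors:

```python
# The source string, encoded to the bytes that the offsets refer to.
source = ("a" + chr(8364) + chr(10340)).encode("utf-8")

# Offsets copied from the tokenize_with_offsets example output.
start_offsets = [0, 1, 4]
end_offsets = [1, 4, 7]

# Slice the byte string with each (start, end) pair, then decode each piece.
pieces = [source[s:e] for s, e in zip(start_offsets, end_offsets)]
chars = [p.decode("utf-8") for p in pieces]
```

Each slice recovers exactly one character, which is a quick way to check that a set of offsets lines up with the source text.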