Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer

text.UnicodeCharTokenizer()
Resulting tokens are integers (Unicode codepoints). Scalar input will produce a Tensor output containing the codepoints. Tensor inputs will produce RaggedTensor outputs.
Example:
tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize("abc")
print(tokens)
tf.Tensor([97 98 99], shape=(3,), dtype=int32)
tokens = tokenizer.tokenize(["abc", "de"])
print(tokens)
<tf.RaggedTensor [[97, 98, 99], [100, 101]]>
t = ["abc" + chr(0xfffe) + chr(0x1fffe) ]
tokens = tokenizer.tokenize(t)
print(tokens.to_list())
[[97, 98, 99, 65534, 131070]]
Passing malformed UTF-8 will result in unpredictable behavior. Make sure inputs conform to UTF-8.
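If inputs might contain invalid byte sequences, one option (a sketch, assuming lossy cleanup is acceptable rather than an error) is to normalize them first with tf.strings.unicode_transcode, which can substitute U+FFFD for malformed sequences:

import tensorflow as tf
import tensorflow_text as tf_text

raw = tf.constant([b"abc\xff", b"de"])  # b"\xff" is not valid UTF-8
# Re-encode as UTF-8, replacing malformed sequences with U+FFFD (65533).
clean = tf.strings.unicode_transcode(
    raw, input_encoding="UTF-8", output_encoding="UTF-8", errors="replace")
tokenizer = tf_text.UnicodeCharTokenizer()
print(tokenizer.tokenize(clean))
<tf.RaggedTensor [[97, 98, 99, 65533], [100, 101]]>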
Methods
detokenize
detokenize(input, name=None)
Detokenizes input codepoints (integers) to UTF-8 strings.
Example:
tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize(["abc", "de"])
s = tokenizer.detokenize(tokens)
print(s)
tf.Tensor([b'abc' b'de'], shape=(2,), dtype=string)
Args:
input: A RaggedTensor or Tensor of codepoints (ints) with a rank of at least 1.
name: The name argument that is passed to the op function.
Returns:
An N-1 dimensional string tensor of the text corresponding to the UTF-8 codepoints in the input.
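Note that detokenize accepts any integer tensor of codepoints with rank of at least 1, not only the output of tokenize. A minimal sketch with a hand-built RaggedTensor:

import tensorflow as tf
import tensorflow_text as tf_text

tokenizer = tf_text.UnicodeCharTokenizer()
# Codepoints assembled by hand: "hi" and the euro sign (U+20AC).
codepoints = tf.ragged.constant([[104, 105], [8364]])
print(tokenizer.detokenize(codepoints))
tf.Tensor([b'hi' b'\xe2\x82\xac'], shape=(2,), dtype=string)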
split
split(input)

Alias for Tokenizer.tokenize.
split_with_offsets
split_with_offsets(input)

Alias for TokenizerWithOffsets.tokenize_with_offsets.
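Because both methods are plain aliases, the tokenizer can be dropped into any pipeline that expects a Splitter; a brief sketch:

import tensorflow_text as tf_text

tokenizer = tf_text.UnicodeCharTokenizer()
# split() returns the same values as tokenize()...
print(tokenizer.split(["hi"]))
<tf.RaggedTensor [[104, 105]]>
# ...and split_with_offsets() mirrors tokenize_with_offsets().
tokens, starts, ends = tokenizer.split_with_offsets(["hi"])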
tokenize
tokenize(input)
Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.
Input strings are split on character boundaries using unicode_decode_with_offsets.
Args:
input: A RaggedTensor or Tensor of UTF-8 strings with any shape.
Returns:
A RaggedTensor of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for the tokens (characters) of each string.
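For example (an illustrative sketch), a rank-2 input keeps its (2, 2) shape and gains one ragged dimension holding each string's codepoints:

import tensorflow as tf
import tensorflow_text as tf_text

tokenizer = tf_text.UnicodeCharTokenizer()
batch = tf.constant([["ab", "c"], ["de", "f"]])  # shape (2, 2)
tokens = tokenizer.tokenize(batch)               # shape (2, 2, None)
print(tokens.to_list())
[[[97, 98], [99]], [[100, 101], [102]]]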
tokenize_with_offsets
tokenize_with_offsets(input)
Tokenizes a tensor of UTF-8 strings to Unicode characters.
Example:
tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize_with_offsets("a"+chr(8364)+chr(10340))
print(tokens[0])
tf.Tensor([ 97 8364 10340], shape=(3,), dtype=int32)
print(tokens[1])
tf.Tensor([0 1 4], shape=(3,), dtype=int64)
print(tokens[2])
tf.Tensor([1 4 7], shape=(3,), dtype=int64)
The start_offsets and end_offsets are byte indices into the original string. When calling with multiple string inputs, the offset indices are relative to the individual source strings:
toks = tokenizer.tokenize_with_offsets(["a" + chr(8364), "b" + chr(10300)])
print(toks[0])
<tf.RaggedTensor [[97, 8364], [98, 10300]]>
print(toks[1])
<tf.RaggedTensor [[0, 1], [0, 1]]>
print(toks[2])
<tf.RaggedTensor [[1, 4], [1, 4]]>
Args:
input: A RaggedTensor or Tensor of UTF-8 strings with any shape.
Returns:
A tuple (tokens, start_offsets, end_offsets) where tokens is a RaggedTensor of codepoints (ints), start_offsets is a RaggedTensor of the tokens' starting byte offsets, and end_offsets is a RaggedTensor of the tokens' ending byte offsets (exclusive).
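Since the offsets are byte positions into the source string, each token's original bytes can be sliced back out; a small sketch using tf.strings.substr with unit="BYTE":

import tensorflow as tf
import tensorflow_text as tf_text

tokenizer = tf_text.UnicodeCharTokenizer()
s = "a" + chr(8364) + chr(10340)
tokens, starts, ends = tokenizer.tokenize_with_offsets(s)
# Recover each character's UTF-8 bytes from the original string;
# a scalar string broadcasts against the vectors of positions/lengths.
chars = tf.strings.substr(s, pos=starts, len=ends - starts, unit="BYTE")
print(chars)
tf.Tensor([b'a' b'\xe2\x82\xac' b'\xe2\xa1\xa4'], shape=(3,), dtype=string)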