text.UnicodeScriptTokenizer

Tokenizes UTF-8 by splitting when there is a change in Unicode script.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter

text.UnicodeScriptTokenizer(
    keep_whitespace=False
)

Used in the notebooks

Used in the guide	Used in the tutorials
Tokenizing with TF Text	Load text

By default, this tokenizer drops characters that have the Unicode whitespace property (pass `keep_whitespace=True` to keep them), so in that case the results are similar to those of the WhitespaceTokenizer. A run of punctuation gets its own token (since punctuation is in a different script from the surrounding letters), and every script change in the input string is the location of a split.

Example:

>>> import tensorflow_text as tf_text
>>> tokenizer = tf_text.UnicodeScriptTokenizer()
>>> tokens = tokenizer.tokenize(["xy.,z de", "fg?h", "abαβ"])
>>> print(tokens.to_list())
[[b'xy', b'.,', b'z', b'de'], [b'fg', b'?', b'h'],
 [b'ab', b'\xce\xb1\xce\xb2']]

>>> tokens = tokenizer.tokenize(u"累計7239人")
>>> print(tokens)
tf.Tensor([b'\xe7\xb4\xaf\xe8\xa8\x88' b'7239' b'\xe4\xba\xba'], shape=(3,),
          dtype=string)

Both the punctuation and the whitespace in the first string cause splits, but the punctuation run is emitted as a token while the whitespace is not (by default). The third string, "abαβ", shows a script change without any whitespace: the tokenizer splits at the boundary between the Latin and Greek letters.
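The tokens in the output above are UTF-8 byte strings, which is why the Greek letters appear as `b'\xce\xb1\xce\xb2'`. As a minimal sketch in plain Python (no TensorFlow needed), the values copied from that output can be decoded back to text for display. The variable names here are illustrative only.

```python
# Token values copied from the example output above (as returned by to_list()).
tokens = [
    [b'xy', b'.,', b'z', b'de'],
    [b'fg', b'?', b'h'],
    [b'ab', b'\xce\xb1\xce\xb2'],
]

# Each token is a bytes object holding UTF-8; decode it to get a Python str.
decoded = [[tok.decode("utf-8") for tok in row] for row in tokens]

print(decoded)
# [['xy', '.,', 'z', 'de'], ['fg', '?', 'h'], ['ab', 'αβ']]
```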

Args
`keep_whitespace`	A boolean that specifies whether to emit whitespace tokens (default `False`).

Methods

`split`

split(
    input
)

Alias for Tokenizer.tokenize.

`split_with_offsets`

split_with_offsets(
    input
)

Alias for TokenizerWithOffsets.tokenize_with_offsets.

`tokenize`

tokenize(
    input
)

Tokenizes UTF-8 by splitting when there is a change in Unicode script.

The strings are split wherever successive characters differ in their Unicode script or in whether they are whitespace. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html

ICU-defined whitespace characters are dropped, unless the keep_whitespace option was specified at construction time.
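To make the splitting rule concrete, here is a rough pure-Python sketch of the idea: group consecutive characters that share a script and a whitespace flag, then drop the whitespace runs unless asked to keep them. This is not the library's implementation. The helpers `rough_script` and `sketch_tokenize` are hypothetical, and guessing the script from the first word of a character's Unicode name is a simplifying assumption that only stands in for ICU's UScriptCode lookup.

```python
import itertools
import unicodedata


def rough_script(ch):
    """Hypothetical stand-in for ICU's script lookup (an assumption)."""
    # For letters, the first word of the Unicode name usually names the
    # script, e.g. "LATIN SMALL LETTER A" or "GREEK SMALL LETTER ALPHA".
    if unicodedata.category(ch).startswith("L"):
        return unicodedata.name(ch, "UNKNOWN").split()[0]
    # Digits, punctuation, and whitespace are lumped together as "COMMON".
    return "COMMON"


def sketch_tokenize(text, keep_whitespace=False):
    """Split text where the (script, is-whitespace) pair changes."""
    def key(ch):
        return (rough_script(ch), ch.isspace())

    tokens = []
    for (_, is_space), run in itertools.groupby(text, key=key):
        if is_space and not keep_whitespace:
            continue  # whitespace runs are dropped by default
        tokens.append("".join(run))
    return tokens


print(sketch_tokenize("xy.,z de"))
print(sketch_tokenize("abαβ"))
print(sketch_tokenize("累計7239人"))
print(sketch_tokenize("xy.,z de", keep_whitespace=True))
```

On these inputs the sketch yields the same splits as the documented examples, and the last call additionally emits the single space as its own token. It returns `str` values rather than UTF-8 byte strings, and it should not be expected to match the real tokenizer on arbitrary input.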

Args
`input`	A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape.

Returns
A `RaggedTensor` of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for tokens of each string.

`tokenize_with_offsets`

tokenize_with_offsets(
    input
)

Tokenizes UTF-8 by splitting when there is a change in Unicode script.

The strings are split when a change in the Unicode script is detected between consecutive characters. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html

ICU-defined whitespace characters are dropped, unless the keep_whitespace option was specified at construction time.

Example:

>>> import tensorflow_text as tf_text
>>> tokenizer = tf_text.UnicodeScriptTokenizer()
>>> tokens = tokenizer.tokenize_with_offsets(["xy.,z de", "abαβ"])
>>> print(tokens[0].to_list())
[[b'xy', b'.,', b'z', b'de'], [b'ab', b'\xce\xb1\xce\xb2']]
>>> print(tokens[1].to_list())
[[0, 2, 4, 6], [0, 2]]
>>> print(tokens[2].to_list())
[[2, 4, 5, 8], [2, 6]]

>>> tokens = tokenizer.tokenize_with_offsets(u"累計7239人")
>>> print(tokens[0])
tf.Tensor([b'\xe7\xb4\xaf\xe8\xa8\x88' b'7239' b'\xe4\xba\xba'],
    shape=(3,), dtype=string)
>>> print(tokens[1])
tf.Tensor([ 0  6 10], shape=(3,), dtype=int64)
>>> print(tokens[2])
tf.Tensor([ 6 10 13], shape=(3,), dtype=int64)

The start_offsets and end_offsets are byte indices into the original UTF-8 string, not character indices. When called with multiple input strings, the offsets are relative to each individual source string.
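Because the offsets count bytes, they have to be applied to the UTF-8 encoded form of the input rather than to a Python `str`. The following plain-Python sketch reuses the offsets printed in the example above for "累計7239人"; the variable names are illustrative, and converting to character positions by decoding the prefix is just one possible approach.

```python
text = "累計7239人"
raw = text.encode("utf-8")  # 13 bytes: each CJK character takes 3 bytes

# Offsets copied from the tokenize_with_offsets example output above.
starts = [0, 6, 10]
ends = [6, 10, 13]

# Slice the encoded bytes, then decode each slice back to text.
pieces = [raw[s:e].decode("utf-8") for s, e in zip(starts, ends)]
print(pieces)  # ['累計', '7239', '人']

# Using the same numbers on the str would index characters, not bytes.
print(text[0:6])  # '累計7239', which is not the first token

# One way to get character offsets: count characters in the byte prefix.
char_starts = [len(raw[:s].decode("utf-8")) for s in starts]
print(char_starts)  # [0, 2, 6]
```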

Args
`input`	A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape.

Returns
A tuple `(tokens, start_offsets, end_offsets)` where: `tokens`: A `RaggedTensor` of tokenized text. `start_offsets`: A `RaggedTensor` of the tokens' starting byte offset. `end_offsets`: A `RaggedTensor` of the tokens' ending byte offset.
