text.normalize_utf8_with_offsets_map

Normalizes each UTF-8 string in the input tensor using the specified Unicode normalization form.

text.normalize_utf8_with_offsets_map(
    input, normalization_form='NFKC', name=None
)

Returns the normalized strings together with an offsets map, which the `find_source_offsets` op uses to map offsets in the post-normalized strings back to offsets in the pre-normalized strings.

See http://unicode.org/reports/tr15/

Examples:

# input: <string>[num_strings]
normalize_utf8_with_offsets_map(["株式会社", "ＫＡＤＯＫＡＷＡ"])
# output: <string>[num_strings], <variant>[num_strings]
NormalizeUTF8WithOffsetsMap(output=<tf.Tensor: shape=(2,), dtype=string,
numpy=
array([b'\xe6\xa0\xaa\xe5\xbc\x8f\xe4\xbc\x9a\xe7\xa4\xbe', b'KADOKAWA'],
      dtype=object)>, offsets_map=<tf.Tensor: shape=(2,), dtype=variant,
      numpy=<unprintable>>)
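
Because the offsets map is returned as an opaque `variant` tensor, it can help to see what such a map conceptually holds. The following is only a rough sketch built on Python's standard `unicodedata` module, not the tf.text implementation. The helper name `normalize_with_offsets` is hypothetical, the offsets are Python code point indices (which may differ from the units the op uses), and the sketch normalizes one code point at a time, so it does not handle combining sequences that interact across code points.

```python
import unicodedata


def normalize_with_offsets(s, form="NFKC"):
    """Hypothetical helper: normalize `s` one code point at a time.

    Returns the normalized string and a list that gives, for every code
    point of the output, the index of the input code point it came from.
    """
    pieces = []
    offsets = []
    for src_index, ch in enumerate(s):
        normalized = unicodedata.normalize(form, ch)
        pieces.append(normalized)
        # One input code point may expand to several output code points;
        # all of them map back to the same source index.
        offsets.extend([src_index] * len(normalized))
    return "".join(pieces), offsets


# U+337F (SQUARE CORPORATION) followed by full-width "K" and "A".
source = "\u337f\uff2b\uff21"
normalized, offsets = normalize_with_offsets(source)
print(normalized)  # four CJK characters followed by ASCII "KA"
print(offsets)     # [0, 0, 0, 0, 1, 2]
```

Here the single source character U+337F expands to four characters under NFKC, so the first four output positions all map back to source index 0. This is the kind of shift that an offsets map has to record.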

Args
`input`	A `Tensor` or `RaggedTensor` of type string. (Must be UTF-8.)
`normalization_form`	One of the string values 'NFC', 'NFKC', 'NFD', or 'NFKD'. Default is 'NFKC'. NOTE: `NFD` and `NFKD` will not be available for `normalize_utf8_with_offsets_map` until the tf.text release that ships with ICU 69 (scheduled for after April 2021).
`name`	The name for this op (optional).

Returns
A tuple of (results, offsets_map) where:
`results`	A `Tensor` or `RaggedTensor` of type string, with normalized contents.
`offsets_map`	A `Tensor` or `RaggedTensor` of type `variant`, used to map post-normalized string offsets to pre-normalized string offsets. It has the same shape as the `results` tensor and is an input to the `find_source_offsets` op.
