
text.normalize_utf8_with_offsets_map

Normalizes each UTF-8 string in the input tensor using the specified rule.

Returns the normalized strings together with an offsets map, which the find_source_offsets op uses to map post-normalized string offsets back to pre-normalized string offsets.

See http://unicode.org/reports/tr15/
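To make the idea of an offsets map concrete, here is a minimal pure-Python sketch that uses only the standard library's `unicodedata` module. It is not how the op is implemented: the helper name `normalize_with_byte_offsets` is hypothetical, and it normalizes one code point at a time, so sequences that compose across code points are not handled. It only illustrates how an offset in the normalized bytes can point back to an offset in the original bytes.

```python
import unicodedata


def normalize_with_byte_offsets(text, form="NFKC"):
    """Illustrative helper (hypothetical, not part of tf.text).

    Returns (normalized_bytes, offsets), where offsets[i] is the byte offset
    in the original UTF-8 string of the character that produced byte i of the
    normalized string. A final entry marks the end of the original string.

    Simplification: each code point is normalized on its own.
    """
    out = bytearray()
    offsets = []
    src_pos = 0
    for ch in text:
        piece = unicodedata.normalize(form, ch).encode("utf-8")
        # Every byte produced from this character maps back to its start.
        offsets.extend([src_pos] * len(piece))
        out.extend(piece)
        src_pos += len(ch.encode("utf-8"))
    offsets.append(src_pos)
    return bytes(out), offsets


# Full-width Latin letters take 3 bytes each in UTF-8; NFKC folds them to ASCII.
normalized, offsets = normalize_with_byte_offsets("ＫＡＤＯＫＡＷＡ")
print(normalized)    # b'KADOKAWA'
print(offsets[:3])   # [0, 3, 6]
```

In this sketch, byte 1 of the normalized string maps back to byte 3 of the original, because the first full-width letter occupied three bytes before normalization.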

Examples:

# input: <string>[num_strings]
normalize_utf8_with_offsets_map(["株式会社", "KADOKAWA"])
# output: <string>[num_strings], <variant>[num_strings]
NormalizeUTF8WithOffsetsMap(output=<tf.Tensor: shape=(2,), dtype=string,
numpy=
array([b'\xe6\xa0\xaa\xe5\xbc\x8f\xe4\xbc\x9a\xe7\xa4\xbe', b'KADOKAWA'],
      dtype=object)>, offsets_map=<tf.Tensor: shape=(2,), dtype=variant,
      numpy=<unprintable>>)

Args:
input: A Tensor or RaggedTensor of type string. (Must be UTF-8.)
normalization_form: One of the following string values ('NFC', 'NFKC', 'NFD', 'NFKD'). Default is 'NFKC'. NOTE: NFD and NFKD for normalize_utf8_with_offsets_map will not be available until the tf.text release with ICU 69 (scheduled after 4/2021).
name: The name for this op (optional).

Returns:
A tuple of (results, offsets_map) where:
results: A Tensor or RaggedTensor of type string, with normalized contents.
offsets_map: A Tensor or RaggedTensor of type variant, used to map the post-normalized string offsets to pre-normalized string offsets. It has the same shape as the results tensor. offsets_map is an input to the find_source_offsets op.