Have a question? Connect with the community at the TensorFlow Forum Visit Forum

text.coerce_to_structurally_valid_utf8

Coerce UTF-8 input strings to structurally valid UTF-8.

Any bytes which cause the input string to be invalid UTF-8 are substituted with the provided replacement character codepoint (default 65533). If you plan on overriding the default, use a single byte replacement character codepoint to preserve alignment to the source input string.

In this example, the character \xDEB2 is an invalid UTF-8 bit sequence; the call to coerce_to_structurally_valid_utf8 replaces it with \xef\xbf\xbd, which is the default replacement character encoding.

>>> input_data = ["A", b"\xDEB2", "C"]
>>> coerce_to_structurally_valid_utf8(input_data)
<tf.Tensor: shape=(3,), dtype=string,
            numpy=array([b'A', b'\xef\xbf\xbdB2', b'C'], dtype=object)>

input UTF-8 string tensor to coerce to valid UTF-8.
replacement_char The replacement character to be used in place of any invalid byte in the input. Any valid Unicode character may be used. The default value is the default Unicode replacement character which is 0xFFFD (or U+65533). Note that passing a replacement character expressible in 1 byte, such as ' ' or '?', will preserve string alignment to the source since individual invalid bytes will be replaced with a 1-byte replacement. (optional)
name A name for the operation (optional).

A tensor of type string with the same shape as the input.