text.coerce_to_structurally_valid_utf8
Coerce UTF-8 input strings to structurally valid UTF-8.
text.coerce_to_structurally_valid_utf8(
input, replacement_char=_unichr(65533), name=None
)
Any bytes that make the input string invalid UTF-8 are replaced with the
provided replacement character (default: code point 65533, i.e. U+FFFD). If you
override the default, use a replacement character that encodes to a single byte
so that byte offsets stay aligned with the source input string.
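In this example, the byte \xDE in b"\xDEB2" is a UTF-8 lead byte that is not
followed by a continuation byte, so it is invalid. The call to
coerce_to_structurally_valid_utf8 replaces it with \xef\xbf\xbd, the UTF-8
encoding of the default replacement character, and leaves the ASCII
characters B2 unchanged.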
>>> input_data = ["A", b"\xDEB2", "C"]
>>> coerce_to_structurally_valid_utf8(input_data)
<tf.Tensor: shape=(3,), dtype=string,
numpy=array([b'A', b'\xef\xbf\xbdB2', b'C'], dtype=object)>
Args
input: UTF-8 string tensor to coerce to valid UTF-8.
replacement_char: (Optional) The replacement character used in place of any
invalid byte in the input. Any valid Unicode character may be used. The
default is the Unicode replacement character, U+FFFD (decimal 65533). Passing
a replacement character that encodes to 1 byte, such as ' ' or '?', preserves
string alignment with the source, because each invalid byte is replaced by
exactly one byte.
name: (Optional) A name for the operation.

Returns
A tensor of type string with the same shape as the input.
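The alignment note for replacement_char can be illustrated without TensorFlow. The sketch below uses Python's built-in UTF-8 decoder with errors="replace" as a stand-in for the op. `coerce_bytes` is a hypothetical helper, not part of TensorFlow Text, and its handling of edge cases may differ from the real op.

```python
# Stand-in for the op using Python's built-in decoder.
# This is NOT the TensorFlow Text implementation, only an illustration.

def coerce_bytes(raw: bytes, replacement_char: str = "\ufffd") -> bytes:
    """Hypothetical helper: replace undecodable bytes, return UTF-8 bytes."""
    decoded = raw.decode("utf-8", errors="replace")
    # Python always substitutes U+FFFD; swap in the requested character.
    # Caveat: this also rewrites any genuine U+FFFD already in the input.
    return decoded.replace("\ufffd", replacement_char).encode("utf-8")


raw = b"\xDEB2"                  # invalid lead byte followed by ASCII "B2"
default = coerce_bytes(raw)      # U+FFFD encodes to 3 bytes
single = coerce_bytes(raw, "?")  # "?" encodes to 1 byte

# Byte lengths: the default replacement grows the string.
print(len(raw), len(default), len(single))

# Byte offset of "B": shifted by the default, preserved by "?".
print(raw.index(b"B"), default.index(b"B"), single.index(b"B"))
```

With the 3-byte default, every byte after the replaced one moves two positions to the right. With a 1-byte replacement, offsets computed on the coerced string still point to the same positions in the source.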
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2024-12-20 UTC.