UnicodeDecode

public final class UnicodeDecode

Decodes each string in `input` into a sequence of Unicode code points.

The character codepoints for all strings are returned using a single vector `char_values`, with strings expanded to characters in row-major order.

The `row_splits` tensor indicates where the codepoints for each input string begin and end within the `char_values` tensor. In particular, the values for the `i`th string (in row-major order) are stored in the slice `[row_splits[i]:row_splits[i+1]]`. Thus:

  • `char_values[row_splits[i]+j]` is the Unicode codepoint for the `j`th character in the `i`th string (in row-major order).
  • `row_splits[i+1] - row_splits[i]` is the number of characters in the `i`th string (in row-major order).
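
The layout described above can be sketched in plain Java without calling TensorFlow. This is only an illustration of how `char_values` and `row_splits` relate to each other. The class `RowSplitsSketch` and its methods are hypothetical names, not part of the API, and Java's own `String.codePoints()` stands in for the operation's decoding.

```java
import java.util.Arrays;

// Hypothetical illustration of the char_values / row_splits layout.
// It does not use TensorFlow; it mimics the layout with plain arrays.
final class RowSplitsSketch {
    final int[] charValues;  // code points of all strings, concatenated
    final long[] rowSplits;  // string i occupies [rowSplits[i], rowSplits[i+1])

    private RowSplitsSketch(int[] charValues, long[] rowSplits) {
        this.charValues = charValues;
        this.rowSplits = rowSplits;
    }

    static RowSplitsSketch decode(String[] input) {
        int n = input.length;
        int[][] perString = new int[n][];
        long[] splits = new long[n + 1];
        for (int i = 0; i < n; i++) {
            perString[i] = input[i].codePoints().toArray();
            splits[i + 1] = splits[i] + perString[i].length;
        }
        int[] flat = new int[(int) splits[n]];
        for (int i = 0; i < n; i++) {
            System.arraycopy(perString[i], 0, flat, (int) splits[i], perString[i].length);
        }
        return new RowSplitsSketch(flat, splits);
    }

    // Code point of the j-th character in the i-th string.
    int codePoint(int i, int j) {
        return charValues[(int) rowSplits[i] + j];
    }

    // Number of characters in the i-th string.
    long length(int i) {
        return rowSplits[i + 1] - rowSplits[i];
    }

    // Rebuilds the i-th string from its slice of charValues.
    String row(int i) {
        return new String(charValues, (int) rowSplits[i], (int) length(i));
    }

    public static void main(String[] args) {
        // The last string is "a" followed by U+1F600, written as a surrogate pair.
        RowSplitsSketch d = decode(new String[] {"hi", "", "a\uD83D\uDE00"});
        System.out.println(Arrays.toString(d.charValues)); // [104, 105, 97, 128512]
        System.out.println(Arrays.toString(d.rowSplits));  // [0, 2, 2, 4]
    }
}
```

Note that the empty string contributes no code points, so its two neighbouring entries in `rowSplits` are equal.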

Nested Classes

class UnicodeDecode.Options Optional attributes for UnicodeDecode

Public Methods

Output <Integer>
charValues ()
A 1D int32 Tensor containing the decoded codepoints.
static <T extends Number> UnicodeDecode <T>
create ( Scope scope, Operand <String> input, String inputEncoding, Class<T> Tsplits, Options... options)
Factory method to create a class wrapping a new UnicodeDecode operation.
static UnicodeDecode <Long>
create ( Scope scope, Operand <String> input, String inputEncoding, Options... options)
Factory method to create a class wrapping a new UnicodeDecode operation using default output types.
static UnicodeDecode.Options
errors (String errors)
static UnicodeDecode.Options
replaceControlCharacters (Boolean replaceControlCharacters)
static UnicodeDecode.Options
replacementChar (Long replacementChar)
Output <T>
rowSplits ()
A 1D tensor of type `T` (the `Tsplits` type, `Long` by default) containing the row splits.

Public Methods

public Output <Integer> charValues ()

A 1D int32 Tensor containing the decoded codepoints.

public static UnicodeDecode <T> create ( Scope scope, Operand <String> input, String inputEncoding, Class<T> Tsplits, Options... options)

Factory method to create a class wrapping a new UnicodeDecode operation.

Parameters
scope current scope
input The text to be decoded. Can have any shape. Note that the output is flattened to a vector of char values.
inputEncoding Text encoding of the input strings. May be any of the encodings supported by ICU ucnv algorithmic converters. Examples: `"UTF-16", "US ASCII", "UTF-8"`.
Tsplits The class of the element type `T` used for the row splits output.
options carries optional attribute values
Returns
  • a new instance of UnicodeDecode
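
Because `input` can have any shape while the output is flattened, it may help to see what row-major order means for a 2D input. The following is a minimal plain-Java sketch, assuming a rectangular array. `RowMajorSketch` is a hypothetical helper, not part of the API.

```java
import java.util.Arrays;

// Hypothetical illustration of row-major flattening for a rectangular 2-D input.
final class RowMajorSketch {
    static String[] flatten(String[][] input) {
        int rows = input.length;
        int cols = rows == 0 ? 0 : input[0].length;
        String[] flat = new String[rows * cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                // The string at [r][c] becomes string number r * cols + c.
                flat[r * cols + c] = input[r][c];
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        String[][] input = {{"ab", "c"}, {"", "de"}};
        System.out.println(Arrays.toString(flatten(input))); // [ab, c, , de]
    }
}
```

In this sketch the string at position `[1][1]` is the one indexed `i = 3` in the flattened order.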

public static UnicodeDecode <Long> create ( Scope scope, Operand <String> input, String inputEncoding, Options... options)

Factory method to create a class wrapping a new UnicodeDecode operation using default output types.

Parameters
scope current scope
input The text to be decoded. Can have any shape. Note that the output is flattened to a vector of char values.
inputEncoding Text encoding of the input strings. May be any of the encodings supported by ICU ucnv algorithmic converters. Examples: `"UTF-16", "US ASCII", "UTF-8"`.
options carries optional attribute values
Returns
  • a new instance of UnicodeDecode

public static UnicodeDecode.Options errors (String errors)

Parameters
errors Error handling policy when invalid formatting is found in the input. A value of 'strict' will cause the operation to produce an InvalidArgument error on any invalid input formatting. A value of 'replace' (the default) will cause the operation to replace any invalid formatting in the input with the `replacement_char` codepoint. A value of 'ignore' will cause the operation to skip any invalid formatting in the input and produce no corresponding output character.
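
The three policies have rough analogues in the JDK's `CharsetDecoder`, which can make their effect easier to picture. The sketch below is an analogy only: it uses `java.nio.charset` rather than the ICU-based operation, so the exact number of replacement characters produced for a given malformed sequence may differ. `ErrorPolicySketch` and `decodeUtf8` are hypothetical names, and `IllegalArgumentException` stands in for the InvalidArgument error.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical analogy for the 'strict' / 'replace' / 'ignore' policies
// using the JDK decoder; this is not the TensorFlow implementation.
final class ErrorPolicySketch {
    static int[] decodeUtf8(byte[] bytes, String errors) {
        CodingErrorAction action;
        switch (errors) {
            case "strict":  action = CodingErrorAction.REPORT;  break;
            case "replace": action = CodingErrorAction.REPLACE; break; // JDK default replacement is U+FFFD
            case "ignore":  action = CodingErrorAction.IGNORE;  break;
            default: throw new IllegalArgumentException("unknown errors policy: " + errors);
        }
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(action)
                    .onUnmappableCharacter(action)
                    .decode(ByteBuffer.wrap(bytes))
                    .codePoints()
                    .toArray();
        } catch (CharacterCodingException e) {
            // Stand-in for the InvalidArgument error raised under 'strict'.
            throw new IllegalArgumentException("invalid input formatting", e);
        }
    }

    public static void main(String[] args) {
        byte[] bad = {'a', (byte) 0xFF, 'b'}; // 0xFF is never valid in UTF-8
        System.out.println(Arrays.toString(decodeUtf8(bad, "replace"))); // [97, 65533, 98]
        System.out.println(Arrays.toString(decodeUtf8(bad, "ignore")));  // [97, 98]
    }
}
```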

public static UnicodeDecode.Options replaceControlCharacters (Boolean replaceControlCharacters)

Parameters
replaceControlCharacters Whether to replace the C0 control characters (00-1F) with the `replacement_char`. Default is false.
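
As a rough picture of what this option does to the decoded output, the following plain-Java sketch replaces every code point in the stated range 00-1F with a replacement code point. It is an assumption-laden illustration, not the operation itself. `ControlCharSketch` and `replaceC0` are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical illustration of replacing C0 control characters (00-1F)
// in a sequence of decoded code points.
final class ControlCharSketch {
    static int[] replaceC0(int[] codePoints, int replacementChar) {
        int[] out = codePoints.clone();
        for (int k = 0; k < out.length; k++) {
            if (out[k] >= 0x00 && out[k] <= 0x1F) {
                out[k] = replacementChar;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] decoded = "a\tb".codePoints().toArray(); // tab is U+0009
        System.out.println(Arrays.toString(replaceC0(decoded, 0xFFFD))); // [97, 65533, 98]
    }
}
```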

public static UnicodeDecode.Options replacementChar (Long replacementChar)

Parameters
replacementChar The replacement character codepoint to be used in place of any invalid formatting in the input when `errors='replace'`. Any valid Unicode codepoint may be used. The default value is the Unicode replacement character, U+FFFD (0xFFFD, decimal 65533).

public Output <T> rowSplits ()

A 1D tensor of type `T` (the `Tsplits` type, `Long` by default) containing the row splits.