
UnicodeDecodeWithOffsets

public final class UnicodeDecodeWithOffsets

Decodes each string in `input` into a sequence of Unicode code points.

The character codepoints for all strings are returned using a single vector `char_values`, with strings expanded to characters in row-major order. Similarly, the character start byte offsets are returned using a single vector `char_to_byte_starts`, with strings expanded in row-major order.

The `row_splits` tensor indicates where the codepoints and start offsets for each input string begin and end within the `char_values` and `char_to_byte_starts` tensors. In particular, the values for the `i`th string (in row-major order) are stored in the slice `[row_splits[i]:row_splits[i+1]]`. Thus:

  • `char_values[row_splits[i]+j]` is the Unicode codepoint for the `j`th character in the `i`th string (in row-major order).
  • `char_to_byte_starts[row_splits[i]+j]` is the start byte offset for the `j`th character in the `i`th string (in row-major order).
  • `row_splits[i+1] - row_splits[i]` is the number of characters in the `i`th string (in row-major order).
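
The indexing rules above can be illustrated with a small plain-Java emulation. This is only a sketch: it does not call TensorFlow, it assumes well-formed input strings and UTF-8 byte offsets, and the class `RowSplitsSketch` and its methods are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java emulation of the three flattened outputs for UTF-8 input.
// Hypothetical helper, not part of the TensorFlow API.
final class RowSplitsSketch {
    final int[] charValues;        // all codepoints, strings concatenated in order
    final long[] charToByteStarts; // byte offset of each codepoint within its own string
    final long[] rowSplits;        // string i occupies [rowSplits[i], rowSplits[i+1])

    private RowSplitsSketch(int[] charValues, long[] charToByteStarts, long[] rowSplits) {
        this.charValues = charValues;
        this.charToByteStarts = charToByteStarts;
        this.rowSplits = rowSplits;
    }

    // Number of bytes a codepoint occupies when encoded as UTF-8.
    static int utf8Length(int cp) {
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }

    static RowSplitsSketch decode(String[] inputs) {
        List<Integer> values = new ArrayList<>();
        List<Long> starts = new ArrayList<>();
        long[] splits = new long[inputs.length + 1];
        for (int i = 0; i < inputs.length; i++) {
            long byteOffset = 0;
            int k = 0;
            while (k < inputs[i].length()) {
                int cp = inputs[i].codePointAt(k);
                values.add(cp);
                starts.add(byteOffset);
                byteOffset += utf8Length(cp);
                k += Character.charCount(cp);
            }
            splits[i + 1] = values.size();
        }
        int[] v = values.stream().mapToInt(Integer::intValue).toArray();
        long[] s = starts.stream().mapToLong(Long::longValue).toArray();
        return new RowSplitsSketch(v, s, splits);
    }

    // Codepoint of the j-th character in the i-th string.
    int codePointAt(int i, int j) {
        return charValues[(int) rowSplits[i] + j];
    }

    // Start byte offset of the j-th character in the i-th string.
    long byteStartAt(int i, int j) {
        return charToByteStarts[(int) rowSplits[i] + j];
    }

    // Number of characters in the i-th string.
    long length(int i) {
        return rowSplits[i + 1] - rowSplits[i];
    }

    public static void main(String[] args) {
        // "hé", the empty string, and an emoji followed by "a".
        RowSplitsSketch r = decode(new String[] {"h\u00e9", "", "\uD83D\uDE00a"});
        System.out.println(Arrays.toString(r.charValues));       // [104, 233, 128512, 97]
        System.out.println(Arrays.toString(r.charToByteStarts)); // [0, 1, 0, 4]
        System.out.println(Arrays.toString(r.rowSplits));        // [0, 2, 2, 4]
    }
}
```

Note that the empty string contributes no entries to `charValues` but still has its own pair of equal row splits, so its length evaluates to 0.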

Nested Classes

class UnicodeDecodeWithOffsets.Options Optional attributes for UnicodeDecodeWithOffsets

Constants

String OP_NAME The name of this op, as known by TensorFlow core engine

Public Methods

Output < TInt64 >
charToByteStarts ()
A 1D int64 Tensor containing the byte index in the input string where each character in `char_values` starts.
Output < TInt32 >
charValues ()
A 1D int32 Tensor containing the decoded codepoints.
static UnicodeDecodeWithOffsets < TInt64 >
create ( Scope scope, Operand < TString > input, String inputEncoding, Options... options)
Factory method to create a class wrapping a new UnicodeDecodeWithOffsets operation using default output types.
static <T extends TNumber > UnicodeDecodeWithOffsets <T>
create ( Scope scope, Operand < TString > input, String inputEncoding, Class<T> Tsplits, Options... options)
Factory method to create a class wrapping a new UnicodeDecodeWithOffsets operation.
static UnicodeDecodeWithOffsets.Options
errors (String errors)
static UnicodeDecodeWithOffsets.Options
replaceControlCharacters (Boolean replaceControlCharacters)
static UnicodeDecodeWithOffsets.Options
replacementChar (Long replacementChar)
Output <T>
rowSplits ()
A 1D tensor of type `T` (`TInt64` by default) containing the row splits.

Constants

public static final String OP_NAME

The name of this op, as known by TensorFlow core engine

Constant Value: "UnicodeDecodeWithOffsets"

Public Methods

public Output < TInt64 > charToByteStarts ()

A 1D int64 Tensor containing the byte index in the input string where each character in `char_values` starts.

public Output < TInt32 > charValues ()

A 1D int32 Tensor containing the decoded codepoints.

public static UnicodeDecodeWithOffsets < TInt64 > create ( Scope scope, Operand < TString > input, String inputEncoding, Options... options)

Factory method to create a class wrapping a new UnicodeDecodeWithOffsets operation using default output types.

Parameters
scope current scope
input The text to be decoded. Can have any shape. Note that the output is flattened to a vector of char values.
inputEncoding Text encoding of the input strings. This is any of the encodings supported by ICU ucnv algorithmic converters. Examples: `"UTF-16", "US ASCII", "UTF-8"`.
options carries optional attribute values
Returns
  • a new instance of UnicodeDecodeWithOffsets

public static <T extends TNumber > UnicodeDecodeWithOffsets <T> create ( Scope scope, Operand < TString > input, String inputEncoding, Class<T> Tsplits, Options... options)

Factory method to create a class wrapping a new UnicodeDecodeWithOffsets operation.

Parameters
scope current scope
input The text to be decoded. Can have any shape. Note that the output is flattened to a vector of char values.
inputEncoding Text encoding of the input strings. This is any of the encodings supported by ICU ucnv algorithmic converters. Examples: `"UTF-16", "US ASCII", "UTF-8"`.
Tsplits the element type `T` of the `row_splits` output
options carries optional attribute values
Returns
  • a new instance of UnicodeDecodeWithOffsets

public static UnicodeDecodeWithOffsets.Options errors (String errors)

Parameters
errors Error handling policy when invalid formatting is found in the input. A value of 'strict' will cause the operation to produce an InvalidArgument error on any invalid input formatting. A value of 'replace' (the default) will cause the operation to replace any invalid formatting in the input with the `replacement_char` codepoint. A value of 'ignore' will cause the operation to skip any invalid formatting in the input and produce no corresponding output character.
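
The three policies have a close analogue in the Java standard library's `CharsetDecoder`, which may help build intuition. The sketch below is only an analogy: it uses `java.nio.charset` rather than TensorFlow or ICU, it relies on the decoder's default replacement (U+FFFD), and the class `ErrorPolicySketch` is a hypothetical name. The exact number of replacement characters emitted for a longer malformed sequence may differ from what the op produces.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Analogy only: maps the documented policy names onto CodingErrorAction.
// Hypothetical helper, not part of the TensorFlow API.
final class ErrorPolicySketch {
    static int[] decodeUtf8(byte[] bytes, String errors) {
        CodingErrorAction action;
        switch (errors) {
            case "strict":  action = CodingErrorAction.REPORT;  break; // fail on bad input
            case "replace": action = CodingErrorAction.REPLACE; break; // substitute U+FFFD
            case "ignore":  action = CodingErrorAction.IGNORE;  break; // drop bad input
            default: throw new IllegalArgumentException("unknown policy: " + errors);
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString().codePoints().toArray();
        } catch (CharacterCodingException e) {
            // Stand-in for the op's InvalidArgument error under 'strict'.
            throw new IllegalArgumentException("invalid input formatting", e);
        }
    }
}
```

For the bytes `0x61 0xFF 0x62`, this sketch yields the codepoints `0x61, 0xFFFD, 0x62` under 'replace', `0x61, 0x62` under 'ignore', and throws under 'strict'.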

public static UnicodeDecodeWithOffsets.Options replaceControlCharacters (Boolean replaceControlCharacters)

Parameters
replaceControlCharacters Whether to replace the C0 control characters (00-1F) with the `replacement_char`. Default is false.
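
As a rough illustration of the documented range, the following plain-Java sketch substitutes a replacement codepoint for every value in 0x00 to 0x1F. It operates on an already-decoded array and does not call TensorFlow; `ControlCharSketch` and `replaceC0` are hypothetical names.

```java
// Illustrative only: applies the documented C0 range (0x00-0x1F) to decoded codepoints.
// Hypothetical helper, not part of the TensorFlow API.
final class ControlCharSketch {
    static int[] replaceC0(int[] codePoints, int replacementChar) {
        int[] out = codePoints.clone(); // leave the caller's array untouched
        for (int i = 0; i < out.length; i++) {
            if (out[i] >= 0x00 && out[i] <= 0x1F) {
                out[i] = replacementChar;
            }
        }
        return out;
    }
}
```

With a replacement of 0xFFFD, the input `0x41, 0x09, 0x1F, 0x20` becomes `0x41, 0xFFFD, 0xFFFD, 0x20`: the tab and the unit separator are replaced, while the space (0x20) lies outside the range and is kept.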

public static UnicodeDecodeWithOffsets.Options replacementChar (Long replacementChar)

Parameters
replacementChar The replacement character codepoint to be used in place of any invalid formatting in the input when `errors='replace'`. Any valid Unicode codepoint may be used. The default value is the Unicode replacement character, U+FFFD (decimal 65533).

public Output <T> rowSplits ()

A 1D tensor of type `T` (`TInt64` by default) containing the row splits.