tf.strings.unicode_decode_with_offsets

Decodes each string into a sequence of code points with start offsets.

This op is similar to tf.strings.unicode_decode(...), but it also returns the start byte offset of each character within its string. These offsets can be used to align the decoded characters with the original byte sequence.

Returns a tuple (codepoints, start_offsets) where:

  • codepoints[i1...iN, j] is the Unicode codepoint for the jth character in input[i1...iN], when decoded using input_encoding.
  • start_offsets[i1...iN, j] is the start byte offset for the jth character in input[i1...iN], when decoded using input_encoding.
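
As a rough illustration of the two outputs described above, the following sketch decodes two UTF-8 strings and then uses the offsets to recover the bytes behind each character. The helper `char_byte_slices` is hypothetical and not part of TensorFlow; it assumes each character ends where the next one starts (or at the end of the string).

```python
import tensorflow as tf

# Two UTF-8 byte strings: "ö" occupies 2 bytes and the emoji occupies 4.
inputs = [s.encode("utf-8") for s in ("Göödnight", "\U0001f60a")]

codepoints, start_offsets = tf.strings.unicode_decode_with_offsets(
    inputs, input_encoding="UTF-8")

print(codepoints.to_list())
# [[71, 246, 246, 100, 110, 105, 103, 104, 116], [128522]]
print(start_offsets.to_list())
# [[0, 1, 3, 5, 6, 7, 8, 9, 10], [0]]


def char_byte_slices(raw, offsets):
    """Hypothetical helper: split `raw` bytes into per-character slices.

    Assumes a character spans from its start offset to the next
    character's start offset, or to the end of `raw` for the last one.
    """
    ends = offsets[1:] + [len(raw)]
    return [raw[start:end] for start, end in zip(offsets, ends)]


# Align the characters of the first string with its original bytes.
slices = char_byte_slices(inputs[0], start_offsets.to_list()[0])
print(slices[1])  # the two bytes that encode the first "ö"
```

Because the input here is a Python list of strings with different lengths, both results come back as ragged tensors, which is why `to_list()` is used to inspect them.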

input: An N-dimensional, potentially ragged string tensor with shape [D1...DN]. N must be s