text.regex_split_with_offsets

Split input by delimiters that match a regex pattern; returns offsets.

regex_split_with_offsets splits input using delimiters that match the regex pattern given in delim_regex_pattern. It returns three tensors: the split substrings ('result' in the examples below), the offset at which each substring begins ('begin' in the examples below), and the offset at which each substring ends ('end' in the examples below). As the examples show, each end offset points one position past the last character of its substring.

Here is an example:

from tensorflow_text import regex_split_with_offsets

text_input = ["hello there"]
# split by whitespace
result, begin, end = regex_split_with_offsets(input=text_input,
                                              delim_regex_pattern=r"\s")
print("result: %s\nbegin: %s\nend: %s" % (result, begin, end))
result: <tf.RaggedTensor [[b'hello', b'there']]>
begin: <tf.RaggedTensor [[0, 6]]>
end: <tf.RaggedTensor [[5, 11]]>
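
One way to read the offsets, judging from the output above: each begin/end pair behaves like a half-open Python slice into the input string. The sketch below is plain Python with the values copied from that output. It does not call the library, and it assumes the offsets index bytes, since the results are byte strings.

```python
# Values copied from the example output above (a single input string).
text = b"hello there"
begin = [0, 6]
end = [5, 11]

# Treat each (begin, end) pair as a half-open slice into the original bytes.
tokens = [text[b:e] for b, e in zip(begin, end)]
print(tokens)  # [b'hello', b'there']
```

Slicing this way recovers the same substrings that appear in 'result', which is useful when tokens need to be mapped back to positions in the original text.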

By default, delimiters are not included in the split results. To include them, pass keep_delim_regex_pattern, a regex pattern that selects which delimiters to keep. For example:

text_input = ["hello there"]
# split by whitespace and keep the whitespace delimiters
result, begin, end = regex_split_with_offsets(input=text_input,
                                              delim_regex_pattern=r"\s",
                                              keep_delim_regex_pattern=r"\s")
print("result: %s\nbegin: %s\nend: %s" % (result, begin, end))
result: <tf.RaggedTensor [[b'hello', b' ', b'there']]>
begin: <tf.RaggedTensor [[0, 5, 6]]>
end: <tf.RaggedTensor [[5, 6, 11]]>

Consecutive delimiters do not produce empty splits. For example:

text_input = ["hello  there"]  # Note the two spaces between the words.
# split by whitespace
result, begin, end = regex_split_with_offsets(input=text_input,
                                              delim_regex_pattern=r"\s")
print("result: %s\nbegin: %s\nend: %s" % (result, begin, end))
result: <tf.RaggedTensor [[b'hello', b'there']]>
begin: <tf.RaggedTensor [[0, 7]]>
end: <tf.RaggedTensor [[5, 12]]>
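
The three behaviors described above (splitting on delimiters, optionally keeping them, and skipping empty splits) can be approximated in plain Python. The following is a minimal sketch, not the library's implementation: the function name split_with_offsets is hypothetical, it uses Python's re module rather than RE2, and it works on a single str, so it reports character offsets that may differ from the library's offsets for non-ASCII input.

```python
import re


def split_with_offsets(text, delim_pattern, keep_delim_pattern=None):
    """Hypothetical helper: approximate the documented splitting rules."""
    tokens, begins, ends = [], [], []

    def emit(start, stop):
        # Record one token together with its half-open [start, stop) span.
        tokens.append(text[start:stop])
        begins.append(start)
        ends.append(stop)

    last = 0  # Position just after the previous delimiter.
    for match in re.finditer(delim_pattern, text):
        # Emit the text between delimiters only if it is non-empty.
        if match.start() > last:
            emit(last, match.start())
        # Keep the delimiter itself only if it matches the keep pattern.
        if (keep_delim_pattern is not None
                and match.end() > match.start()
                and re.fullmatch(keep_delim_pattern, match.group())):
            emit(match.start(), match.end())
        last = match.end()
    # Emit whatever follows the final delimiter.
    if last < len(text):
        emit(last, len(text))
    return tokens, begins, ends


print(split_with_offsets("hello there", r"\s"))
# (['hello', 'there'], [0, 6], [5, 11])
print(split_with_offsets("hello there", r"\s", r"\s"))
# (['hello', ' ', 'there'], [0, 5, 6], [5, 6, 11])
print(split_with_offsets("hello  there", r"\s"))
# (['hello', 'there'], [0, 7], [5, 12])
```

For these three ASCII inputs the sketch reproduces the offsets shown in the examples. Behavior on other inputs, such as several kept delimiters in a row, is a property of the sketch and should not be read as a statement about the library.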

See https://github.com/google/re2/wiki/Syntax for the full list of supported expressions.

Args:
  input: A Tensor or RaggedTensor of string input.
  delim_regex_pattern: A string containing the regex pattern of a delimiter.
  keep_delim_regex_pattern: (optional) Regex pattern of delimiters that should be kept in the result.
  name: (optional) Name of the op.

Returns a tuple of RaggedTensors (split_results, begin_offsets, end_offsets), where split_results is of type string and begin_offsets and end_offsets are of type int64.