text.RegexSplitter

RegexSplitter splits text on the given regular expression.

Inherits From: SplitterWithOffsets, Splitter

By default, this splitter breaks on newlines, ignoring any trailing ones. It can also return the beginning and ending byte offsets of each piece.

>>> splitter = RegexSplitter()
>>> text_input = [
...       b"Hi there.\nWhat time is it?\nIt is gametime.",
...       b"Who let the dogs out?\nWho?\nWho?\nWho?\n\n",
...   ]
>>> splitter.split(text_input)
<tf.RaggedTensor [[b'Hi there.', b'What time is it?', b'It is gametime.'],
                  [b'Who let the dogs out?', b'Who?', b'Who?', b'Who?']]>

The splitter can also be passed a custom split pattern. The pattern is interpreted as a regular expression; this example uses a single character (tab).

>>> splitter = RegexSplitter(split_regex='\t')
>>> text_input = [
...       b"Hi there.\tWhat time is it?\tIt is gametime.",
...       b"Who let the dogs out?\tWho?\tWho?\tWho?\t\t",
...   ]
>>> splitter.split(text_input)
<tf.RaggedTensor [[b'Hi there.', b'What time is it?', b'It is gametime.'],
                  [b'Who let the dogs out?', b'Who?', b'Who?', b'Who?']]>

Args

split_regex (optional) A string containing the regex pattern of a delimiter to split on. Default is '\r?\n'.
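Because split_regex is a full regular expression, the delimiter need not be a single literal character. The following sketch is illustrative rather than taken from the official docs; it assumes tensorflow_text is available and splits on sentence-ending punctuation followed by optional whitespace (the matched delimiter is consumed, so it does not appear in the pieces):

import tensorflow_text as tf_text

splitter = tf_text.RegexSplitter(split_regex=r'[.?!]\s*')
print(splitter.split([b"Hi there. What time is it? It is gametime."]))
# Expect a RaggedTensor along the lines of:
# <tf.RaggedTensor [[b'Hi there', b'What time is it', b'It is gametime']]>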

Methods

split

Splits the input tensor into pieces.

Generally, the pieces returned by a splitter correspond to substrings of the original string, and can be encoded using either strings or integer ids.

Example:

print(tf_text.WhitespaceTokenizer().split("small medium large"))
tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)

Args
input An N-dimensional UTF-8 string (or optionally integer) Tensor or RaggedTensor.

Returns
An N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor. For each string from the input tensor, the final, extra dimension contains the pieces that string was split into.
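For example, a rank-1 batch of strings yields a rank-2 RaggedTensor. A minimal illustrative sketch (not from the official docs), assuming tensorflow_text is imported as tf_text:

import tensorflow_text as tf_text

splitter = tf_text.RegexSplitter()
pieces = splitter.split([b"a\nb\nc", b"d"])
# The input is rank 1, so the result is rank 2; each row holds the pieces
# of the corresponding input string.
print(pieces)
# Expect: <tf.RaggedTensor [[b'a', b'b', b'c'], [b'd']]>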

split_with_offsets

Splits the input tensor, and returns the resulting pieces with offsets.

Example:

splitter = tf_text.WhitespaceTokenizer()
pieces, starts, ends = splitter.split_with_offsets("a bb ccc")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]

Args
input An N-dimensional UTF-8 string (or optionally integer) Tensor or RaggedTensor.

Returns
A tuple (pieces, start_offsets, end_offsets) where:

  • pieces is an N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor.

  • start_offsets is an N+1-dimensional integer Tensor or RaggedTensor containing the starting indices of each piece (byte indices for input strings).

  • end_offsets is an N+1-dimensional integer Tensor or RaggedTensor containing the exclusive ending indices of each piece (byte indices for input strings).
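A short illustrative check (not from the official docs), assuming tensorflow_text is imported as tf_text, showing that the offsets are byte indices into the original strings and that the end offsets are exclusive:

import tensorflow_text as tf_text

splitter = tf_text.RegexSplitter()
text_input = [b"Hi there.\nWhat time is it?"]
pieces, starts, ends = splitter.split_with_offsets(text_input)
# Each piece should equal the byte slice [start, end) of its source string.
for piece, start, end in zip(pieces[0].numpy(), starts[0].numpy(),
                             ends[0].numpy()):
    assert text_input[0][start:end] == piece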