text.WhitespaceTokenizer

Tokenizes a tensor of UTF-8 strings on whitespace.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter

Methods

split

Alias for Tokenizer.tokenize.
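
Since split delegates to tokenize, the two calls produce identical results. A minimal sketch (assuming WhitespaceTokenizer is in scope, as in the examples below):

WhitespaceTokenizer().split("small medium large")
<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'small', b'medium',
b'large'], dtype=object)>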

split_with_offsets

Alias for TokenizerWithOffsets.tokenize_with_offsets.
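
Because it is an alias, the call mirrors the tokenize_with_offsets example below (a sketch, assuming WhitespaceTokenizer is in scope):

splitter = WhitespaceTokenizer()
pieces, starts, ends = splitter.split_with_offsets("a bb ccc")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]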

tokenize

Tokenizes a tensor of UTF-8 strings on whitespace.

The strings are split on ICU-defined whitespace characters, and the whitespace characters themselves are dropped from the output.

Example:

WhitespaceTokenizer().tokenize("small medium large")
<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'small', b'medium',
b'large'], dtype=object)>

Args
input: A RaggedTensor or Tensor of UTF-8 strings with any shape.

Returns
A RaggedTensor of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for tokens of each string.
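
For example, a rank-1 batch of strings gains one ragged dimension, producing a rank-2 RaggedTensor with one row of tokens per input string (a sketch following the example above):

WhitespaceTokenizer().tokenize(["small medium large", "tiny"])
<tf.RaggedTensor [[b'small', b'medium', b'large'], [b'tiny']]>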

tokenize_with_offsets

Tokenizes a tensor of UTF-8 strings on whitespace, and returns the byte offsets of each token as well.

The strings are split on ICU-defined whitespace characters, and the whitespace characters themselves are dropped from the output.

Example:

splitter = WhitespaceTokenizer()
pieces, starts, ends = splitter.tokenize_with_offsets("a bb ccc")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]

Args
input: A RaggedTensor or Tensor of UTF-8 strings with any shape.

Returns
A tuple (tokens, start_offsets, end_offsets) where:

  • tokens: A RaggedTensor of tokenized text.
  • start_offsets: A RaggedTensor of the tokens' starting byte offsets.
  • end_offsets: A RaggedTensor of the tokens' ending byte offsets (exclusive).
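
The offsets are byte indices into the UTF-8 encoded input, with end offsets exclusive, so they can be used to slice each token back out of the original string. A minimal sketch, using the example above:

sentence = "a bb ccc"
tokens, starts, ends = WhitespaceTokenizer().tokenize_with_offsets(sentence)
for token, start, end in zip(tokens.numpy(), starts.numpy(), ends.numpy()):
    assert sentence.encode("utf-8")[start:end] == token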