Tokenizes a tensor of UTF-8 strings on whitespaces.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter
text.WhitespaceTokenizer()
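A minimal usage sketch (not part of the original reference; it assumes `tensorflow` and `tensorflow_text` are installed, with `tensorflow_text` imported under its conventional alias `text`):

```python
# Sketch: construct the tokenizer once and reuse it.
import tensorflow_text as text

tokenizer = text.WhitespaceTokenizer()

# A scalar string yields a 1-D Tensor of tokens; a batch of strings yields
# a RaggedTensor with one row of tokens per input string.
print(tokenizer.tokenize("small medium large"))
# tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)
print(tokenizer.tokenize(["small medium large", "a bb"]))
# <tf.RaggedTensor [[b'small', b'medium', b'large'], [b'a', b'bb']]>
```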
Methods
split
split(input)

Alias for Tokenizer.tokenize.
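As a quick sketch of the alias (the input string is illustrative), `split` returns exactly what `tokenize` would return for the same input:

```python
# Sketch: split is just another name for tokenize on this class.
import tensorflow_text as text

tokenizer = text.WhitespaceTokenizer()
print(tokenizer.split(["never odd or even"]))
# <tf.RaggedTensor [[b'never', b'odd', b'or', b'even']]>
print(tokenizer.tokenize(["never odd or even"]))
# <tf.RaggedTensor [[b'never', b'odd', b'or', b'even']]>
```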
split_with_offsets
split_with_offsets(input)

Alias for TokenizerWithOffsets.tokenize_with_offsets.
tokenize
tokenize(input)
Tokenizes a tensor of UTF-8 strings on whitespaces.
The strings are split on ICU-defined whitespace characters; the whitespace characters themselves are dropped.
Example:
WhitespaceTokenizer().tokenize("small medium large")
<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'small', b'medium',
b'large'], dtype=object)>
Args

- `input`: A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape.
Returns

A `RaggedTensor` of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for the tokens of each string.
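To illustrate the added ragged dimension, here is a hedged sketch (the input values are illustrative): a string `Tensor` of shape `[2]` comes back as a `RaggedTensor` of shape `[2, None]`.

```python
# Sketch: one ragged token dimension is appended to the input shape.
import tensorflow as tf
import tensorflow_text as text

tokenizer = text.WhitespaceTokenizer()
batch = tf.constant(["never odd or even", "hello world"])  # shape: (2,)
tokens = tokenizer.tokenize(batch)                          # shape: (2, None)
print(tokens.shape)          # (2, None)
print(tokens.row_lengths())  # tf.Tensor([4 2], shape=(2,), dtype=int64)
```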
tokenize_with_offsets
tokenize_with_offsets(input)
Tokenizes a tensor of UTF-8 strings on whitespaces.
The strings are split on ICU-defined whitespace characters; the whitespace characters themselves are dropped.
Example:
splitter = WhitespaceTokenizer()
pieces, starts, ends = splitter.tokenize_with_offsets("a bb ccc")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]
Args

- `input`: A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape.
Returns

A tuple `(tokens, start_offsets, end_offsets)` where:

- `tokens`: A `RaggedTensor` of tokenized text.
- `start_offsets`: A `RaggedTensor` of the tokens' starting byte offsets within the input strings.
- `end_offsets`: A `RaggedTensor` of the tokens' ending byte offsets within the input strings (exclusive, i.e. the first byte after each token).
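As a hedged sketch of how the offsets can be used (the `tf.strings.substr` call and the variable names are illustrative, not part of this API): the offsets are byte positions into the original strings, so each token can be sliced back out of its source string.

```python
# Sketch: recover a token from the original string via its byte offsets.
import tensorflow as tf
import tensorflow_text as text

tokenizer = text.WhitespaceTokenizer()
sentences = tf.constant(["a bb ccc"])
tokens, starts, ends = tokenizer.tokenize_with_offsets(sentences)

# Slice the second token of the first sentence out of the input string.
piece = tf.strings.substr(sentences[0], starts[0][1], ends[0][1] - starts[0][1])
print(piece.numpy())  # b'bb'
```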