text.SentencepieceTokenizer

Tokenizes a tensor of UTF-8 strings.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer

SentencePiece is an unsupervised text tokenizer and detokenizer. It is mainly used for neural-network-based text generation systems where the vocabulary size is fixed before model training. SentencePiece implements subword units and can train them directly from raw sentences.

Before using the tokenizer, you will need to train a vocabulary and build a model configuration for it. Please visit the SentencePiece repository for the most up-to-date instructions on this process.

Args
model The SentencePiece model as a serialized proto.
out_type The output type: tf.int32 or tf.string (default = tf.int32). With tf.int32, each string is encoded directly into an id sequence.
nbest_size A scalar for sampling.

  • nbest_size = {0,1}: No sampling is performed. (default)
  • nbest_size > 1: Samples from the nbest_size best results.
  • nbest_size < 0: Assumes nbest_size is infinite and samples from all hypotheses (the lattice) using the forward-filtering-and-backward-sampling algorithm.
alpha A scalar smoothing parameter: the inverse temperature for probability rescaling.
reverse Reverses the tokenized sequence (default = false).
add_bos Adds a beginning-of-sentence token to the result (default = false).
add_eos Adds an end-of-sentence token to the result (default = false). When reverse=True, the beginning/end-of-sentence tokens are added after reversing.
return_nbest If True, requires that nbest_size is a scalar greater than 1, and returns the nbest_size best tokenizations for each sentence instead of a single one. The returned tensor has shape [batch * nbest, (tokens)].
name The name argument that is passed to the op function.
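
A minimal construction sketch follows; the model path sp.model is a hypothetical placeholder for a proto you have trained and serialized with the SentencePiece library beforehand:

```python
import tensorflow as tf
import tensorflow_text as tf_text

# Read the serialized SentencePiece model proto from disk.
# 'sp.model' is a hypothetical path; train the model first
# with the sentencepiece library.
with open('sp.model', 'rb') as f:
    model = f.read()

tokenizer = tf_text.SentencepieceTokenizer(
    model=model,
    out_type=tf.int32,  # emit token ids; use tf.string for subword pieces
    add_bos=False,
    add_eos=False)
```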

Methods

detokenize

Detokenizes tokens into preprocessed text.

This function accepts tokenized text and reassembles it back into sentences.

Args
input A RaggedTensor or Tensor of UTF-8 string tokens with a rank of at least 1.
name The name argument that is passed to the op function.

Returns
An (N-1)-dimensional string Tensor or RaggedTensor of the detokenized text.
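
For example, continuing with the tokenizer sketched above (the exact ids depend on the trained model):

```python
ids = tokenizer.tokenize(["Hello world."])  # RaggedTensor of int32 ids
sentences = tokenizer.detokenize(ids)       # 1-D string Tensor
# sentences[0] recovers the original input, e.g. b'Hello world.'
```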

id_to_string

Converts vocabulary ids into tokens.

Args
input An arbitrary tensor of int32 representing the token IDs.
name The name argument that is passed to the op function.

Returns
A tensor of strings with the same shape as the input.
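
For example (the ids 0, 1, 2 are placeholders; the pieces they map to depend on the trained model):

```python
pieces = tokenizer.id_to_string(tf.constant([0, 1, 2]))
# pieces is a string Tensor of shape [3], one piece per input id
```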

split

Alias for Tokenizer.tokenize.

split_with_offsets

Alias for TokenizerWithOffsets.tokenize_with_offsets.

string_to_id

Converts tokens into vocabulary ids.

This function is particularly helpful for determining the IDs of any special tokens whose IDs cannot be determined through normal tokenization.

Args
input An arbitrary tensor of string tokens.
name The name argument that is passed to the op function.

Returns
A tensor of int32 representing the IDs with the same shape as input.
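
For example, assuming the trained model defines the conventional <unk> piece:

```python
unk_id = tokenizer.string_to_id("<unk>")  # scalar int32 id for the piece
```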

tokenize

Tokenizes a tensor of UTF-8 strings.

Args
input A RaggedTensor or Tensor of UTF-8 strings with any shape.
name The name argument that is passed to the op function.

Returns
A RaggedTensor of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for tokens of each string.
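
For example, with the tokenizer sketched above:

```python
tokens = tokenizer.tokenize(["Hello world.", "Goodbye."])
# tokens is a RaggedTensor of shape [2, (ragged)] holding int32 ids,
# or string pieces if the tokenizer was built with out_type=tf.string.
```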

tokenize_with_offsets

Tokenizes a tensor of UTF-8 strings.

This function returns a tuple containing the tokens along with start and end byte offsets that mark where in the original string each token was located.

Args
input A RaggedTensor or Tensor of UTF-8 strings with any shape.
name The name argument that is passed to the op function.

Returns
A tuple (tokens, start_offsets, end_offsets) where:
tokens is an N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor.
start_offsets is an N+1-dimensional integer Tensor or RaggedTensor containing the starting indices of each token (byte indices for input strings).
end_offsets is an N+1-dimensional integer Tensor or RaggedTensor containing the exclusive ending indices of each token (byte indices for input strings).
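
For example, the byte offsets can be used to slice each token's span out of the original string:

```python
tokens, starts, ends = tokenizer.tokenize_with_offsets(["Hello world."])
# For token j of string i, the source bytes of the token are
# input[i][starts[i][j]:ends[i][j]] (offsets are byte indices).
```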

vocab_size

Returns the vocabulary size.

The number of tokens in the SentencePiece vocabulary that was provided at initialization.

Args
name The name argument that is passed to the op function.

Returns
A scalar representing the vocabulary size.
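
For example:

```python
size = tokenizer.vocab_size()  # scalar int32 vocabulary size of the loaded model
```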