
tfnlp.layers.SentencepieceTokenizer

Wraps tf_text.SentencepieceTokenizer as a Keras Layer.

Args
lower_case A Python boolean indicating whether to lowercase the string before tokenization. NOTE: New models are encouraged to build *_cf (case folding) normalization into the Sentencepiece model itself and avoid this extra step.
model_file_path A Python string with the path of the sentencepiece model. Exactly one of model_file_path and model_serialized_proto can be specified. In either case, the Keras model config for this layer will store the actual proto (not a filename passed here).
model_serialized_proto The sentencepiece model serialized proto string.
tokenize_with_offsets A Python boolean. If true, this layer calls SentencepieceTokenizer.tokenize_with_offsets() instead of plain .tokenize() and outputs a triple of (tokens, start_offsets, limit_offsets) instead of just tokens. Note that returning offsets is currently not supported when strip_diacritics (below) is set to True.
nbest_size A scalar for sampling:
  • nbest_size = {0,1}: No sampling is performed (default).
  • nbest_size > 1: Samples from the nbest_size best tokenizations.
  • nbest_size < 0: Assumes that nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.
alpha A scalar for a smoothing parameter. Inverse temperature for probability rescaling.
strip_diacritics Whether to strip diacritics or not. Note that stripping diacritics requires additional text normalization and dropping bytes, which currently makes it impossible to keep track of offsets. Hence tokenize_with_offsets is not yet supported when strip_diacritics is set to True. NOTE: New models are encouraged to put this into custom normalization rules for the Sentencepiece model itself to avoid this extra step and the limitation regarding offsets.
**kwargs Standard arguments to Layer().
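The mutual exclusion of model_file_path and model_serialized_proto described above can be sketched in pure Python. This is a hypothetical illustration of the documented behavior, not the layer's actual code; the function name resolve_model_proto is invented for this example.

```python
def resolve_model_proto(model_file_path=None, model_serialized_proto=None):
    """Sketch of the "exactly one of" check: returns the serialized proto,
    reading it from disk if a file path was given. Hypothetical helper, not
    part of the tfnlp API."""
    if (model_file_path is None) == (model_serialized_proto is None):
        raise ValueError(
            "Exactly one of model_file_path and model_serialized_proto "
            "must be specified.")
    if model_file_path is not None:
        # The layer's config stores the proto itself, not the filename.
        with open(model_file_path, "rb") as f:
            return f.read()
    return model_serialized_proto
```

Either way the proto bytes end up in the layer's config, so a saved Keras model does not depend on the original file path.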

Raises
ImportError if importing tensorflow_text failed.

Attributes
tokenize_with_offsets If true, calls SentencepieceTokenizer.tokenize_with_offsets() instead of plain .tokenize() and outputs a triple of (tokens, start_offsets, limit_offsets).
vocab_size The size of the sentencepiece vocabulary.

Methods

call

View source

Calls text.SentencepieceTokenizer on inputs.

Args
inputs A string Tensor of shape [batch_size].

Returns
One or three RaggedTensors, depending on whether tokenize_with_offsets is False or True, respectively. These are:
tokens A RaggedTensor of shape [batch_size, (pieces)] and type int32. tokens[i,j] contains the j-th piece in the i-th input.
start_offsets, limit_offsets If tokenize_with_offsets is True, RaggedTensors of type int64 with the same indices as tokens. Element [i,j] contains the byte offset at the start, or past the end, respectively, of the j-th piece in the i-th input.
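The meaning of start_offsets and limit_offsets can be illustrated with a small pure-Python sketch (not the layer itself): for each piece, start is the byte index where the piece begins in the UTF-8 encoded input, and limit is one past its last byte. The function name byte_offsets and the example split are invented for illustration.

```python
def byte_offsets(text, pieces):
    """Return (start, limit) byte offsets for each piece, scanning the
    UTF-8 encoded text left to right. Illustrative sketch only."""
    data = text.encode("utf-8")
    offsets = []
    pos = 0
    for piece in pieces:
        encoded = piece.encode("utf-8")
        start = data.index(encoded, pos)   # byte offset of piece start
        limit = start + len(encoded)       # one past the last byte
        offsets.append((start, limit))
        pos = limit
    return offsets

# Hypothetical split of "hello world" into two pieces:
# byte_offsets("hello world", ["hello", " world"]) -> [(0, 5), (5, 11)]
```

Note that offsets are in bytes, not characters, so data[start:limit] recovers each piece even for multi-byte UTF-8 input.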

get_special_tokens_dict

View source

Returns dict of token ids, keyed by standard names for their purpose.

Returns
A dict from Python strings to Python integers. Each key is a standard name for a special token describing its use. (For example, "padding_id" is what Sentencepiece calls "<pad>" but others may call "[PAD]".) The corresponding value is the integer token id. If a special token is not found, its entry is omitted from the dict.

The supported keys and tokens are:

  • start_of_sequence_id: looked up from "[CLS]"
  • end_of_segment_id: looked up from "[SEP]"
  • padding_id: looked up from "<pad>"
  • mask_id: looked up from "[MASK]"
  • vocab_size: one past the largest token id used
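The lookup behavior described above can be sketched in pure Python. This is a hypothetical illustration of the documented contract (missing tokens are omitted; vocab_size is always included), not the layer's actual implementation; special_tokens_dict and its arguments are invented for this example.

```python
def special_tokens_dict(string_to_id, vocab_size):
    """Sketch of get_special_tokens_dict(): maps each standard name to the
    id of its token, omitting entries whose token is not in the vocabulary.
    string_to_id stands in for the model's token-to-id lookup."""
    candidates = {
        "start_of_sequence_id": "[CLS]",
        "end_of_segment_id": "[SEP]",
        "padding_id": "<pad>",
        "mask_id": "[MASK]",
    }
    result = {name: string_to_id[token]
              for name, token in candidates.items()
              if token in string_to_id}
    result["vocab_size"] = vocab_size
    return result
```

For a vocabulary that defines "[CLS]" and "[SEP]" but no padding or mask token, the returned dict contains only start_of_sequence_id, end_of_segment_id, and vocab_size.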