Wraps tf_text.SentencepieceTokenizer as a Keras Layer.
tfm.nlp.layers.SentencepieceTokenizer(
*,
lower_case: bool,
model_file_path: Optional[str] = None,
model_serialized_proto: Optional[str] = None,
tokenize_with_offsets: bool = False,
nbest_size: int = 0,
alpha: float = 1.0,
strip_diacritics: bool = False,
**kwargs
)
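
A minimal usage sketch (assumes the tensorflow_models pip package for the tfm import; the model path is hypothetical and stands in for any trained SentencePiece model):

import tensorflow as tf
import tensorflow_models as tfm

# Hypothetical path to a trained SentencePiece model file.
tokenizer = tfm.nlp.layers.SentencepieceTokenizer(
    model_file_path="/path/to/spm.model",
    lower_case=True)

# Maps a batch of strings to a RaggedTensor of int32 piece ids.
token_ids = tokenizer(tf.constant(["Hello world.", "A second sentence."]))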
Args:
  lower_case: A Python boolean indicating whether to lowercase the string
    before tokenization. NOTE: New models are encouraged to build *_cf
    (case folding) normalization into the Sentencepiece model itself and
    avoid this extra step.
  model_file_path: A Python string with the path of the sentencepiece model.
    Exactly one of model_file_path and model_serialized_proto can be
    specified. In either case, the Keras model config for this layer will
    store the actual proto (not a filename passed here).
  model_serialized_proto: The sentencepiece model serialized proto string.
  tokenize_with_offsets: A Python boolean. If true, this layer calls
    SentencepieceTokenizer.tokenize_with_offsets() instead of plain
    .tokenize() and outputs a triple of (tokens, start_offsets,
    limit_offsets) instead of just tokens. Note that returning offsets is
    not currently supported when strip_diacritics is set to True.
  nbest_size: A scalar for sampling:
    nbest_size = {0,1}: no sampling is performed (default).
    nbest_size > 1: samples from the nbest_size results.
    nbest_size < 0: assumes that nbest_size is infinite and samples from
      all hypotheses (the lattice) using the
      forward-filtering-and-backward-sampling algorithm.
  alpha: A scalar smoothing parameter, the inverse temperature for
    probability rescaling. See the sampling sketch after this section.
  strip_diacritics: Whether to strip diacritics or not. Note that stripping
    diacritics requires additional text normalization and dropping bytes,
    which makes it impossible to keep track of offsets; hence
    tokenize_with_offsets is not yet supported when strip_diacritics is
    set to True. NOTE: New models are encouraged to put this into custom
    normalization rules for the Sentencepiece model itself to avoid this
    extra step and the limitation regarding offsets.
  **kwargs: Standard arguments to Layer().
Raises:
  ImportError: If importing tensorflow_text fails.
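
As referenced in the alpha entry above, a sketch of enabling subword-regularization sampling through nbest_size and alpha (hypothetical path; the particular values are illustrative, not recommendations):

# Sample segmentations at training time for subword regularization.
sampling_tokenizer = tfm.nlp.layers.SentencepieceTokenizer(
    model_file_path="/path/to/spm.model",  # hypothetical path
    lower_case=False,
    nbest_size=-1,  # nbest_size < 0: sample from the whole lattice
    alpha=0.1)      # lower alpha flattens the sampling distribution

# Repeated calls on the same string may return different segmentations.
ids_a = sampling_tokenizer(tf.constant(["unsegmented input text"]))
ids_b = sampling_tokenizer(tf.constant(["unsegmented input text"]))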
Methods
call
call(
inputs: tf.Tensor
)
Calls text.SentencepieceTokenizer on inputs.
Args:
  inputs: A string Tensor of shape (batch_size,).
Returns:
  One or three RaggedTensors if tokenize_with_offsets is False or True,
  respectively:
  tokens: A RaggedTensor of shape [batch_size, (pieces)] and type int32.
    tokens[i,j] contains the j-th piece in the i-th input.
  start_offsets, limit_offsets: If tokenize_with_offsets is True,
    RaggedTensors of type int64 with the same indices as tokens.
    Element [i,j] contains the byte offset at the start, or past the end,
    respectively, for the j-th piece in the i-th input.
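
A sketch of the three-output form (hypothetical model path; strip_diacritics must be left at False for offsets to be available):

offset_tokenizer = tfm.nlp.layers.SentencepieceTokenizer(
    model_file_path="/path/to/spm.model",  # hypothetical path
    lower_case=False,
    tokenize_with_offsets=True)

tokens, start_offsets, limit_offsets = offset_tokenizer(
    tf.constant(["Hello world."]))
# For piece j of input i, the source bytes are
# input[i][start_offsets[i, j]:limit_offsets[i, j]].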
get_special_tokens_dict
get_special_tokens_dict()
Returns dict of token ids, keyed by standard names for their purpose.
Returns:
  A dict from Python strings to Python integers. Each key is a standard
  name for a special token describing its use. (For example, "padding_id"
  is what Sentencepiece calls "<pad>" but others may call "[PAD]".) The
  corresponding value is the integer token id. If a special token is not
  found, its entry is omitted from the dict.
  The supported keys and tokens are:
  - start_of_sequence_id: looked up from "[CLS]"
  - end_of_segment_id: looked up from "[SEP]"
  - padding_id: looked up from "<pad>"
  - mask_id: looked up from "[MASK]"
  - vocab_size: one past the largest token id used
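
A sketch of reading the special token ids from an instantiated layer, e.g. to pad or wrap tokenized segments (the ids shown in the comment are illustrative; actual values depend on the model's vocabulary):

special_tokens = tokenizer.get_special_tokens_dict()
# Keys appear only if the model defines the token, e.g.:
# {'start_of_sequence_id': 2, 'end_of_segment_id': 3,
#  'padding_id': 0, 'vocab_size': 32000}
pad_id = special_tokens.get("padding_id")            # from "<pad>"
cls_id = special_tokens.get("start_of_sequence_id")  # from "[CLS]"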