A BERT tokenizer Keras layer using text.FastWordpieceTokenizer.
```python
tfm.nlp.layers.FastWordpieceBertTokenizer(
    *,
    vocab_file: str,
    lower_case: bool,
    tokenize_with_offsets: bool = False,
    **kwargs
)
```
See details: "Fast WordPiece Tokenization" (https://arxiv.org/abs/2012.15524)
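A minimal construction sketch, not from this page: it assumes the tensorflow-models-official package (imported as tensorflow_models) is installed, and "vocab.txt" is a placeholder path to a WordPiece vocabulary file.

```python
import tensorflow as tf
import tensorflow_models as tfm

# "vocab.txt" is a placeholder path to a BERT WordPiece vocabulary file
# (one token per line); substitute a real vocabulary before running.
tokenizer = tfm.nlp.layers.FastWordpieceBertTokenizer(
    vocab_file="vocab.txt",
    lower_case=True)

# Tokenize a batch of raw strings into wordpiece ids.
tokens = tokenizer(tf.constant(["hello world", "fast wordpiece"]))
# tokens is a tf.RaggedTensor of shape [2, (words), (pieces_per_word)],
# dtype int32.
```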
Attributes | |
---|---|
vocab_size | |
Methods
call
```python
call(
    inputs: tf.Tensor
)
```
Calls text.BertTokenizer on inputs.
Args | |
---|---|
inputs | A string Tensor of shape [batch_size]. |
Returns | |
---|---|
One RaggedTensor if tokenize_with_offsets is False, or three if it is True. These are: | |
tokens | A RaggedTensor of shape [batch_size, (words), (pieces_per_word)] and type int32. tokens[i,j,k] contains the k-th wordpiece of the j-th word in the i-th input. |
start_offsets, limit_offsets | Present only if tokenize_with_offsets is True: RaggedTensors of type int64 with the same indices as tokens. Element [i,j,k] contains the byte offset at the start, or one past the end, respectively, of the k-th wordpiece of the j-th word in the i-th input. |
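A short sketch of the two return forms, under the same assumptions as above ("vocab.txt" is a placeholder vocabulary path):

```python
import tensorflow as tf
import tensorflow_models as tfm

# Same placeholder vocabulary path as in the construction sketch above.
tokenizer = tfm.nlp.layers.FastWordpieceBertTokenizer(
    vocab_file="vocab.txt",
    lower_case=True,
    tokenize_with_offsets=True)

tokens, start_offsets, limit_offsets = tokenizer(
    tf.constant(["hello world"]))
# The three RaggedTensors share indices: the k-th wordpiece of the j-th
# word of the i-th input covers the bytes
# [start_offsets[i, j, k], limit_offsets[i, j, k]) of that input string.
```

With tokenize_with_offsets left at its default of False, the same call returns only the tokens RaggedTensor.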
get_special_tokens_dict
```python
get_special_tokens_dict()
```
Returns dict of token ids, keyed by standard names for their purpose.
Returns | |
---|---|
A dict from Python strings to Python integers. Each key is a standard name for a special token describing its use. (For example, "padding_id" is what BERT traditionally calls "[PAD]" but others may call "<pad>".) The supported keys and tokens are: padding_id (looked up from "[PAD]"), start_of_sequence_id (looked up from "[CLS]"), end_of_segment_id (looked up from "[SEP]"), mask_id (looked up from "[MASK]"), and vocab_size (one past the largest token id used). |
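A sketch of reading the dict; the key names follow the list above, while the integer values depend entirely on the vocabulary file ("vocab.txt" remains a placeholder path).

```python
import tensorflow_models as tfm

tokenizer = tfm.nlp.layers.FastWordpieceBertTokenizer(
    vocab_file="vocab.txt",  # placeholder vocabulary path
    lower_case=True)

special = tokenizer.get_special_tokens_dict()
# Keys include "padding_id", "start_of_sequence_id", "end_of_segment_id",
# "mask_id", and "vocab_size"; the integer values depend on the vocabulary.
cls_id = special["start_of_sequence_id"]
sep_id = special["end_of_segment_id"]
```

In practice this dict is typically forwarded to an input-packing layer; for example, tfm.nlp.layers.BertPackInputs accepts it as its special_tokens_dict argument.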