A BERT tokenizer Keras layer using text.FastWordpieceTokenizer.
tfm.nlp.layers.FastWordpieceBertTokenizer(
    *,
    vocab_file: str,
    lower_case: bool,
    tokenize_with_offsets: bool = False,
    **kwargs
)
See details: "Fast WordPiece Tokenization" (https://arxiv.org/abs/2012.15524)
Args

vocab_file
  A Python string with the path of the vocabulary file. This is a text
  file with newline-separated wordpiece tokens. This layer loads a list
  of tokens from it to create text.FastWordpieceTokenizer.

lower_case
  A Python boolean forwarded to text.BasicTokenizer. If true, input text
  is converted to lower case (where applicable) before tokenization. This
  must be set to match the way in which the vocab_file was created.

tokenize_with_offsets
  A Python boolean. If true, this layer calls
  FastWordpieceTokenizer.tokenize_with_offsets() instead of plain
  .tokenize() and outputs a triple of (tokens, start_offsets,
  limit_offsets) instead of just tokens.

**kwargs
  Standard arguments to Layer().
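For example, a minimal construction sketch, assuming the tf-models-official package (which provides the tensorflow_models / tfm namespace) and a placeholder vocabulary path:

import tensorflow_models as tfm

# "path/to/vocab.txt" is a placeholder for a newline-separated wordpiece
# vocabulary file; lower_case must match how that vocabulary was created.
tokenizer = tfm.nlp.layers.FastWordpieceBertTokenizer(
    vocab_file="path/to/vocab.txt",
    lower_case=True)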
Methods
call
call(
    inputs: tf.Tensor
)
Calls text.FastWordpieceTokenizer on inputs.
Args

inputs
  A string Tensor of shape [batch_size].
Returns

One or three RaggedTensors, depending on whether tokenize_with_offsets is
False or True, respectively. These are:

tokens
  A RaggedTensor of shape [batch_size, (words), (pieces_per_word)] and
  type int32. tokens[i,j,k] contains the k-th wordpiece of the j-th word
  in the i-th input.

start_offsets, limit_offsets
  If tokenize_with_offsets is True, RaggedTensors of type int64 with the
  same indices as tokens. Element [i,j,k] contains the byte offset at the
  start, or past the end, respectively, for the k-th wordpiece of the
  j-th word in the i-th input.
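A usage sketch, assuming a tokenizer constructed as in the example above:

import tensorflow as tf

inputs = tf.constant(["hello world", "greetings"])  # string Tensor of shape [batch_size]
tokens = tokenizer(inputs)  # RaggedTensor of shape [batch_size, (words), (pieces_per_word)]

# Had the layer been built with tokenize_with_offsets=True, the same call
# would return a triple instead:
# tokens, start_offsets, limit_offsets = tokenizer(inputs)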
get_special_tokens_dict
get_special_tokens_dict()
Returns a dict of token ids, keyed by standard names for their purpose.

Returns

A dict from Python strings to Python integers. Each key is a standard
name for a special token describing its use. (For example, "padding_id"
is what BERT traditionally calls "[PAD]" but others may call "<pad>".)
The corresponding value is the integer token id. If a special token
is not found, its entry is omitted from the dict.

The supported keys and tokens are:
- start_of_sequence_id: looked up from "[CLS]"
- end_of_segment_id: looked up from "[SEP]"
- padding_id: looked up from "[PAD]"
- mask_id: looked up from "[MASK]"
- vocab_size: one past the largest token id used
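A sketch of consuming the returned dict, e.g. when padding and packing model inputs; each .get() returns None if that special token is absent from the vocabulary:

special = tokenizer.get_special_tokens_dict()
cls_id = special.get("start_of_sequence_id")  # id of "[CLS]", if present
sep_id = special.get("end_of_segment_id")     # id of "[SEP]", if present
pad_id = special.get("padding_id")            # id of "[PAD]", if present
vocab_size = special.get("vocab_size")        # one past the largest token id used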