Wraps TF.Text's BertTokenizer with pre-defined vocab as a Keras Layer.
tfm.nlp.layers.BertTokenizer(
    *,
    vocab_file: str,
    lower_case: Optional[bool] = None,
    tokenize_with_offsets: bool = False,
    tokenizer_kwargs: Optional[Mapping[Text, Any]] = None,
    **kwargs
)
Args

vocab_file
    A Python string with the path of the vocabulary file. This is a text file
    with newline-separated wordpiece tokens. This layer initializes a lookup
    table from it that gets used with text.BertTokenizer.

lower_case
    Optional boolean, forwarded to text.BertTokenizer. If true, input text is
    converted to lower case (where applicable) before tokenization. This must
    be set to match the way in which the vocab_file was created. If passed,
    this overrides whatever value may have been passed in tokenizer_kwargs.

tokenize_with_offsets
    A Python boolean. If true, this layer calls
    text.BertTokenizer.tokenize_with_offsets() instead of plain
    text.BertTokenizer.tokenize() and outputs a triple of
    (tokens, start_offsets, limit_offsets) instead of just tokens.

tokenizer_kwargs
    Optional mapping with keyword arguments to forward to
    text.BertTokenizer's constructor.

**kwargs
    Standard arguments to Layer().
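For illustration, a minimal construction sketch (the vocab path below is a placeholder, not part of this API; a real run needs a newline-separated wordpiece vocabulary file, with lower_case set to match how that vocabulary was built):

import tensorflow_models as tfm

# "./vocab.txt" is a placeholder path to a newline-separated wordpiece vocab.
tokenizer = tfm.nlp.layers.BertTokenizer(
    vocab_file="./vocab.txt",
    lower_case=True,   # must match how the vocabulary was created
)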
Raises

ImportError
    If importing tensorflow_text failed.
Attributes

tokenize_with_offsets
    If true, calls text.BertTokenizer.tokenize_with_offsets() instead of plain
    text.BertTokenizer.tokenize() and outputs a triple of
    (tokens, start_offsets, limit_offsets).

raw_table_access
    An object with methods .lookup(keys) and .size() that operate on the raw
    lookup table of tokens. It can be used to look up special token symbols
    like "[MASK]"; a short usage sketch follows the attribute list.

vocab_size
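Continuing from the construction sketch above, raw_table_access could be used roughly as follows (passing a string tensor to .lookup() is an assumption here, not something documented above):

import tensorflow as tf

# Look up the id of the "[MASK]" token directly in the vocabulary table.
mask_id = tokenizer.raw_table_access.lookup(tf.constant(["[MASK]"]))
table_size = tokenizer.raw_table_access.size()  # number of entries in the table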
Methods
call
call(
    inputs: tf.Tensor
)
Calls text.BertTokenizer
on inputs.
Args

inputs
    A string Tensor of shape (batch_size,).

Returns

One or three RaggedTensors, depending on whether tokenize_with_offsets is
False or True, respectively. These are:

tokens
    A RaggedTensor of shape [batch_size, (words), (pieces_per_word)] and type
    int32. tokens[i,j,k] contains the k-th wordpiece of the j-th word in the
    i-th input.

start_offsets, limit_offsets
    If tokenize_with_offsets is True, RaggedTensors of type int64 with the
    same indices as tokens. Element [i,j,k] contains the byte offset at the
    start, or past the end, respectively, for the k-th wordpiece of the j-th
    word in the i-th input.
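A usage sketch, again assuming the tokenizer constructed in the earlier sketch (the input strings are illustrative):

import tensorflow as tf

inputs = tf.constant(["hello world!", "the quick brown fox"])
tokens = tokenizer(inputs)
# tokens is an int32 RaggedTensor of shape [2, (words), (pieces_per_word)].
# Had the layer been built with tokenize_with_offsets=True, the same call
# would return (tokens, start_offsets, limit_offsets) instead.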
get_special_tokens_dict
get_special_tokens_dict()
Returns dict of token ids, keyed by standard names for their purpose.
Returns

A dict from Python strings to Python integers. Each key is a standard name
for a special token describing its use. (For example, "padding_id" is what
BERT traditionally calls "[PAD]" but others may call "<pad>".) The
corresponding value is the integer token id. If a special token is not found,
its entry is omitted from the dict.

The supported keys and tokens are:

- start_of_sequence_id: looked up from "[CLS]"
- end_of_segment_id: looked up from "[SEP]"
- padding_id: looked up from "[PAD]"
- mask_id: looked up from "[MASK]"
- vocab_size: one past the largest token id used
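As a sketch, again assuming the tokenizer constructed in the earlier example, the returned dict can be consumed like this (the actual ids depend on the vocabulary file):

special = tokenizer.get_special_tokens_dict()
cls_id = special["start_of_sequence_id"]   # id looked up from "[CLS]"
sep_id = special["end_of_segment_id"]      # id looked up from "[SEP]"
pad_id = special.get("padding_id")         # None if "[PAD]" is not in the vocab
vocab_size = special["vocab_size"]         # one past the largest token id used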