tfm.nlp.layers.BertTokenizer

Wraps TF.Text's BertTokenizer with pre-defined vocab as a Keras Layer.

Args
vocab_file A Python string with the path of the vocabulary file. This is a text file with newline-separated wordpiece tokens. This layer initializes a lookup table from it that gets used with text.BertTokenizer.
lower_case Optional boolean forwarded to text.BertTokenizer. If true, input text is converted to lower case (where applicable) before tokenization. This must be set to match the way in which the vocab_file was created. If passed, this overrides whatever value may have been passed in tokenizer_kwargs.
tokenize_with_offsets A Python boolean. If true, this layer calls text.BertTokenizer.tokenize_with_offsets() instead of plain text.BertTokenizer.tokenize() and outputs a triple of (tokens, start_offsets, limit_offsets) instead of just tokens.
tokenizer_kwargs Optional mapping with keyword arguments to forward to text.BertTokenizer's constructor.
**kwargs Standard arguments to Layer().

Raises
ImportError If importing tensorflow_text failed.

Attributes
tokenize_with_offsets If true, calls text.BertTokenizer.tokenize_with_offsets() instead of plain text.BertTokenizer.tokenize() and outputs a triple of (tokens, start_offsets, limit_offsets).
raw_table_access An object with methods .lookup(keys) and .size() that operate on the raw lookup table of tokens. It can be used to look up special token symbols like [MASK].
vocab_size
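For illustration, a minimal sketch of constructing the layer and reading these attributes. The vocabulary path "vocab.txt" is a placeholder for a real newline-separated wordpiece vocabulary file (for example, the one shipped with a pretrained BERT checkpoint):

```python
import tensorflow as tf
import tensorflow_models as tfm

# "vocab.txt" is a placeholder path; it must point to a newline-separated
# wordpiece vocabulary that matches the lower_case setting.
tokenizer = tfm.nlp.layers.BertTokenizer(
    vocab_file="vocab.txt",
    lower_case=True)

print(tokenizer.vocab_size)  # Vocabulary size (see the vocab_size attribute above).
# The raw lookup table maps token strings to integer ids, e.g. for "[MASK]".
print(tokenizer.raw_table_access.lookup(tf.constant(["[MASK]"])))
```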

Methods

call

Calls text.BertTokenizer on inputs.

Args
inputs A string Tensor of shape (batch_size,).

Returns
One or three RaggedTensors, depending on whether tokenize_with_offsets is False or True, respectively. These are:

  • tokens: A RaggedTensor of shape [batch_size, (words), (pieces_per_word)] and type int32. tokens[i,j,k] contains the k-th wordpiece of the j-th word in the i-th input.
  • start_offsets, limit_offsets: If tokenize_with_offsets is True, RaggedTensors of type int64 with the same indices as tokens. Element [i,j,k] contains the byte offset at the start, or past the end, respectively, for the k-th wordpiece of the j-th word in the i-th input.
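As a hedged sketch of calling the layer on a small batch, where "vocab.txt" and the input strings are placeholders:

```python
import tensorflow as tf
import tensorflow_models as tfm

tokenizer = tfm.nlp.layers.BertTokenizer(vocab_file="vocab.txt", lower_case=True)
tokens = tokenizer(tf.constant(["hello world", "greetings"]))
# tokens: RaggedTensor of shape [2, (words), (pieces_per_word)], dtype int32.

# With tokenize_with_offsets=True, the layer returns a triple instead.
with_offsets = tfm.nlp.layers.BertTokenizer(
    vocab_file="vocab.txt", lower_case=True, tokenize_with_offsets=True)
tokens, start_offsets, limit_offsets = with_offsets(
    tf.constant(["hello world", "greetings"]))
# start_offsets / limit_offsets: int64 RaggedTensors with the same indices as tokens.
```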

get_special_tokens_dict

Returns dict of token ids, keyed by standard names for their purpose.

Returns
A dict from Python strings to Python integers. Each key is a standard name for a special token describing its use. (For example, "padding_id" is what BERT traditionally calls "[PAD]" but others may call "<pad>".) The corresponding value is the integer token id. If a special token is not found, its entry is omitted from the dict.

The supported keys and tokens are:

  • start_of_sequence_id: looked up from "[CLS]"
  • end_of_segment_id: looked up from "[SEP]"
  • padding_id: looked up from "[PAD]"
  • mask_id: looked up from "[MASK]"
  • vocab_size: one past the largest token id used
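A short sketch of how these ids might be retrieved; the vocabulary path is again a placeholder, and keys for tokens absent from the vocabulary simply do not appear in the dict:

```python
import tensorflow_models as tfm

tokenizer = tfm.nlp.layers.BertTokenizer(vocab_file="vocab.txt", lower_case=True)
special = tokenizer.get_special_tokens_dict()

cls_id = special["start_of_sequence_id"]  # id looked up from "[CLS]"
sep_id = special["end_of_segment_id"]     # id looked up from "[SEP]"
pad_id = special.get("padding_id")        # id of "[PAD]"; omitted if not in the vocab
vocab_size = special["vocab_size"]        # one past the largest token id used
```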