
tfnlp.layers.BertTokenizer

Wraps BertTokenizer with pre-defined vocab as a Keras Layer.

Args
vocab_file A Python string with the path of the vocabulary file. This is a text file with newline-separated wordpiece tokens. This layer initializes a lookup table from it that gets used with text.BertTokenizer.
lower_case A Python boolean forwarded to text.BertTokenizer. If true, input text is converted to lower case (where applicable) before tokenization. This must be set to match the way in which the vocab_file was created.
tokenize_with_offsets A Python boolean. If true, this layer calls BertTokenizer.tokenize_with_offsets() instead of plain .tokenize() and outputs a triple of (tokens, start_offsets, limit_offsets) instead of just tokens.
**kwargs Standard arguments to Layer().

Raises
ImportError if importing tensorflow_text failed.

Attributes
tokenize_with_offsets If true, calls BertTokenizer.tokenize_with_offsets() instead of plain .tokenize() and outputs a triple of (tokens, start_offsets, limit_offsets).
raw_table_access An object with methods .lookup(keys) and .size() that operate on the raw lookup table of tokens. It can be used to look up special token symbols like [MASK].
vocab_size The size of the vocabulary (one past the largest token id used).
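For illustration, a minimal construction sketch follows. The import path and the tiny vocabulary file are assumptions made for this example; tfnlp is taken here to be the NLP sub-package of the TensorFlow Model Garden (tensorflow_models).

  # Minimal sketch, assuming tfnlp comes from the TensorFlow Model Garden:
  # import tensorflow_models as tfm; tfnlp = tfm.nlp
  import tensorflow as tf
  import tensorflow_models as tfm
  tfnlp = tfm.nlp

  # A made-up newline-separated wordpiece vocabulary, for illustration only.
  vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "world", "##s"]
  with open("/tmp/tiny_vocab.txt", "w") as f:
      f.write("\n".join(vocab))

  tokenizer = tfnlp.layers.BertTokenizer(
      vocab_file="/tmp/tiny_vocab.txt",
      lower_case=True)  # must match how the vocabulary was built

  # raw_table_access exposes the underlying lookup table, e.g. to look up
  # the id of the "[MASK]" token.
  mask_id = tokenizer.raw_table_access.lookup(tf.constant(["[MASK]"]))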

Methods

call


Calls text.BertTokenizer on inputs.

Args
inputs A string Tensor of shape [batch_size].

Returns
One or three RaggedTensors, depending on whether tokenize_with_offsets is False or True, respectively. These are:
tokens A RaggedTensor of shape [batch_size, (words), (pieces_per_word)] and type int32. tokens[i,j,k] contains the k-th wordpiece of the j-th word in the i-th input.
start_offsets, limit_offsets If tokenize_with_offsets is True, RaggedTensors of type int64 with the same indices as tokens. Element [i,j,k] contains the byte offset at the start, or past the end, respectively, for the k-th wordpiece of the j-th word in the i-th input.
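Continuing the sketch above, a hedged example of a call; the exact ids depend on the made-up vocabulary.

  # Tokenize a batch of two strings. With tokenize_with_offsets=False the
  # layer returns a single RaggedTensor of wordpiece ids with shape
  # [batch_size, (words), (pieces_per_word)].
  tokens = tokenizer(tf.constant(["hello worlds", "hello world"]))
  print(tokens.to_list())

  # If the layer had been built with tokenize_with_offsets=True, the same
  # call would instead return (tokens, start_offsets, limit_offsets).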

get_special_tokens_dict


Returns a dict of token ids, keyed by standard names for their purpose.

Returns
A dict from Python strings to Python integers. Each key is a standard name for a special token describing its use. (For example, "padding_id" is what BERT traditionally calls "[PAD]" but others may call "<pad>".) The corresponding value is the integer token id. If a special token is not found, its entry is omitted from the dict.

The supported keys and tokens are:

  • start_of_sequence_id: looked up from "[CLS]"
  • end_of_segment_id: looked up from "[SEP]"
  • padding_id: looked up from "[PAD]"
  • mask_id: looked up from "[MASK]"
  • vocab_size: one past the largest token id used
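A short usage sketch, again continuing the example above:

  # Look up the ids needed to pack tokenized segments for BERT. Keys whose
  # tokens are not found in the vocabulary are simply absent from the dict.
  special = tokenizer.get_special_tokens_dict()
  cls_id = special["start_of_sequence_id"]   # id of "[CLS]"
  sep_id = special["end_of_segment_id"]      # id of "[SEP]"
  pad_id = special.get("padding_id")         # None if "[PAD]" is absent
  vocab_size = special["vocab_size"]         # one past the largest token id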