tfm.nlp.layers.FastWordpieceBertTokenizer

A BERT tokenizer Keras layer using text.FastWordpieceTokenizer.

See details: "Fast WordPiece Tokenization" (https://arxiv.org/abs/2012.15524)

Args

vocab_file A Python string with the path of the vocabulary file. This is a text file with newline-separated wordpiece tokens. This layer loads a list of tokens from it to create text.FastWordpieceTokenizer.
lower_case A Python boolean forwarded to text.BasicTokenizer. If true, input text is converted to lower case (where applicable) before tokenization. This must be set to match the way in which the vocab_file was created.
tokenize_with_offsets A Python boolean. If true, this layer calls FastWordpieceTokenizer.tokenize_with_offsets() instead of plain .tokenize() and outputs a triple of (tokens, start_offsets, limit_offsets) instead of just tokens.
**kwargs standard arguments to Layer().
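
The snippet below is a minimal, hedged usage sketch of constructing the layer and tokenizing a batch of strings; the vocabulary path and input text are illustrative placeholders, not values defined by this API.

  import tensorflow as tf
  import tensorflow_models as tfm

  # "./vocab.txt" stands in for a real newline-separated wordpiece vocabulary file.
  tokenizer = tfm.nlp.layers.FastWordpieceBertTokenizer(
      vocab_file="./vocab.txt",
      lower_case=True)  # must match how the vocabulary was built

  token_ids = tokenizer(tf.constant(["Hello TensorFlow!"]))  # RaggedTensor of int32 wordpiece ids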

Attributes

vocab_size

Methods

call

Calls the wrapped text.FastWordpieceTokenizer on inputs.

Args
inputs A string Tensor of shape [batch_size].

Returns
One or three RaggedTensors, depending on whether tokenize_with_offsets is False or True, respectively. These are:

tokens A RaggedTensor of shape [batch_size, (words), (pieces_per_word)] and type int32. tokens[i,j,k] contains the k-th wordpiece of the j-th word in the i-th input.
start_offsets, limit_offsets If tokenize_with_offsets is True, RaggedTensors of type int64 with the same indices as tokens. Element [i,j,k] contains the byte offset at the start, or past the end, respectively, of the k-th wordpiece of the j-th word in the i-th input.
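
As an illustration of the tokenize_with_offsets=True case (the vocabulary path and input text are placeholders), the call then yields a triple rather than a single RaggedTensor:

  import tensorflow as tf
  import tensorflow_models as tfm

  tokenizer = tfm.nlp.layers.FastWordpieceBertTokenizer(
      vocab_file="./vocab.txt",  # placeholder vocabulary path
      lower_case=True,
      tokenize_with_offsets=True)

  inputs = tf.constant(["Hello TensorFlow!"])
  tokens, starts, limits = tokenizer(inputs)
  # starts and limits are byte offsets into the corresponding input string, so the
  # raw text of each wordpiece can be recovered with tf.strings.substr if needed.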

get_special_tokens_dict

Returns a dict of token ids, keyed by standard names for their purpose.

Returns
A dict from Python strings to Python integers. Each key is a standard name for a special token describing its use. (For example, "padding_id" is what BERT traditionally calls "[PAD]" but others may call "<pad>" or even "".) The corresponding value is the integer token id. If a special token is not found, its entry is omitted from the dict.

The supported keys and tokens are:

  • start_of_sequence_id: looked up from "[CLS]"
  • end_of_segment_id: looked up from "[SEP]"
  • padding_id: looked up from "[PAD]"
  • mask_id: looked up from "[MASK]"
  • vocab_size: one past the largest token id used
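
A short, illustrative sketch of reading these ids follows; the vocabulary path is a placeholder, and the packing layer mentioned in the comment is just one typical consumer of the returned dict.

  import tensorflow_models as tfm

  tokenizer = tfm.nlp.layers.FastWordpieceBertTokenizer(
      vocab_file="./vocab.txt", lower_case=True)  # placeholder vocabulary path

  special = tokenizer.get_special_tokens_dict()
  cls_id = special["start_of_sequence_id"]  # id of "[CLS]"
  sep_id = special["end_of_segment_id"]     # id of "[SEP]"
  pad_id = special.get("padding_id")        # id of "[PAD]", if present in the vocab
  # For example, tfm.nlp.layers.BertPackInputs(seq_length=128,
  # special_tokens_dict=special) consumes this dict when packing model inputs.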