text.FastBertTokenizer

Tokenizer used for BERT, a faster version with TFLite support.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer

This tokenizer performs end-to-end tokenization, taking raw text strings to wordpieces. It is equivalent to BertTokenizer for most common scenarios while running faster and supporting TFLite. It does not support certain special settings (see the docs below).

See WordpieceTokenizer for details on the subword tokenization.

For an example of use, see https://www.tensorflow.org/text/guide/bert_preprocessing_guide
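The snippet below is a minimal sketch of that end-to-end flow, assuming tensorflow_text is installed and imported as text (matching the text.FastBertTokenizer name above); the toy vocabulary is illustrative only and far smaller than a real BERT vocabulary.

import tensorflow as tf
import tensorflow_text as text

# Toy vocabulary; a real BERT vocabulary typically holds tens of thousands of wordpieces.
vocab = ['[UNK]', 'they', "##'", '##re', 'the', 'great', '##est']
tokenizer = text.FastBertTokenizer(vocab=vocab)

# Raw UTF-8 strings go in; a RaggedTensor of wordpiece IDs comes out.
sentences = tf.constant(["they're the greatest"])
token_ids = tokenizer.tokenize(sentences)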

Args
vocab (optional) The list of tokens in the vocabulary.
suffix_indicator (optional) The characters prepended to a wordpiece to indicate that it is a suffix to another subword.
max_bytes_per_word (optional) The maximum size of an input token.
token_out_type (optional) The type of the tokens to return. This can be tf.int64 or tf.int32 IDs, or tf.string subwords.
unknown_token (optional) The string value to substitute for an unknown token. It must be included in vocab.
no_pretokenization (optional) By default, the input is split on whitespace and punctuation before applying the wordpiece tokenization. When true, the input is assumed to be pretokenized already.
support_detokenization (optional) Whether the tokenizer should support detokenization. Setting it to true expands the size of the model flatbuffer. As a reference, when using the 120k multilingual BERT WordPiece vocab, the flatbuffer's size increases from ~5MB to ~6MB.
fast_wordpiece_model_buffer (optional) Bytes object (or a uint8 tf.Tensor) that contains the wordpiece model in flatbuffer format (see fast_wordpiece_tokenizer_model.fbs). If not None, all other arguments related to FastWordPieceTokenizer (except token_out_type) are ignored.
lower_case_nfd_strip_accents (optional)

  • If true, it first lowercases the text, applies NFD normalization, strips accent characters, and then replaces control characters with whitespaces.
  • If false, it only replaces control characters with whitespaces.
fast_bert_normalizer_model_buffer (optional) Bytes object (or a uint8 tf.Tensor) that contains the fast BERT normalizer model in flatbuffer format (see fast_bert_normalizer_model.fbs). If not None, lower_case_nfd_strip_accents is ignored.
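As a sketch of how these constructor options combine (the particular flags shown are illustrative, not required), the tokenizer below lowercases and accent-strips its input, emits string subwords instead of IDs, and keeps detokenization support:

import tensorflow as tf
import tensorflow_text as text

vocab = ['they', "##'", '##re', 'the', 'great', '##est', '[UNK]']
tokenizer = text.FastBertTokenizer(
    vocab=vocab,
    token_out_type=tf.string,           # return subword strings, not IDs
    lower_case_nfd_strip_accents=True,  # lowercase, NFD-normalize, strip accents
    support_detokenization=True,        # allow detokenize(); larger flatbuffer
)
subwords = tokenizer.tokenize(tf.constant(['Greatest']))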

Methods

detokenize

View source

Converts a Tensor or RaggedTensor of wordpiece IDs back into words (strings).

See WordpieceTokenizer.detokenize for details.

Example:

vocab = ['they', "##'", '##re', 'the', 'great', '##est', '[UNK]']
tokenizer = FastBertTokenizer(vocab=vocab, support_detokenization=True)
tokenizer.detokenize([[4, 5]])
<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'greatest'], dtype=object)>

Args
token_ids A RaggedTensor or Tensor with an int dtype.

Returns
A RaggedTensor with dtype string and the same rank as the input token_ids.
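A round trip through tokenize and detokenize looks like the following sketch (assuming tensorflow_text is imported as text); the tokenizer is built with support_detokenization=True, as in the example above.

import tensorflow as tf
import tensorflow_text as text

vocab = ['they', "##'", '##re', 'the', 'great', '##est', '[UNK]']
tokenizer = text.FastBertTokenizer(vocab=vocab, support_detokenization=True)

# Wordpieces produced by tokenize are merged back into whole words.
ids = tokenizer.tokenize(tf.constant(["they're the greatest"]))
words = tokenizer.detokenize(ids)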

split

View source

Alias for Tokenizer.tokenize.

split_with_offsets

View source

Alias for TokenizerWithOffsets.tokenize_with_offsets.
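Since both methods are pure aliases, the calls in the sketch below are interchangeable with tokenize and tokenize_with_offsets; it reuses the toy vocabulary from the examples above.

import tensorflow as tf
import tensorflow_text as text

vocab = ['they', "##'", '##re', 'the', 'great', '##est', '[UNK]']
tokenizer = text.FastBertTokenizer(vocab=vocab)
text_inputs = tf.constant(['greatest'])

tokens = tokenizer.split(text_inputs)                              # same as tokenize
tokens, starts, ends = tokenizer.split_with_offsets(text_inputs)   # same as tokenize_with_offsets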

tokenize

View source

Tokenizes a tensor of UTF-8 strings into subword tokens for BERT.

Example:

vocab = ['they', "##'", '##re', 'the', 'great', '##est', '[UNK]']
tokenizer = FastBertTokenizer(vocab=vocab)
text_inputs = tf.constant(['greatest'.encode('utf-8')])
tokenizer.tokenize(text_inputs)
<tf.RaggedTensor [[4, 5]]>

Args
text_input A Tensor or RaggedTensor of untokenized UTF-8 strings.

Returns
A RaggedTensor of tokens where tokens[i1...iN, j] is the string content (or its ID in the vocabulary) of the jth token in input[i1...iN].
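To get the string contents directly rather than vocabulary IDs, the tokenizer can be built with token_out_type=tf.string; the sketch below reuses the toy vocabulary from the example above.

import tensorflow as tf
import tensorflow_text as text

vocab = ['they', "##'", '##re', 'the', 'great', '##est', '[UNK]']

# With token_out_type=tf.string the ragged output holds subword strings
# (e.g. b'great', b'##est') instead of their integer vocabulary IDs.
str_tokenizer = text.FastBertTokenizer(vocab=vocab, token_out_type=tf.string)
subwords = str_tokenizer.tokenize(tf.constant(['greatest']))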

tokenize_with_offsets

View source

Tokenizes a tensor of UTF-8 strings into subword tokens for BERT.

Example:

vocab = ['they', "##'", '##re', 'the', 'great', '##est', '[UNK]']
tokenizer = FastBertTokenizer(vocab=vocab)
text_inputs = tf.constant(['greatest'.encode('utf-8')])
tokenizer.tokenize_with_offsets(text_inputs)
(<tf.RaggedTensor [[4, 5]]>,
 <tf.RaggedTensor [[0, 5]]>,
 <tf.RaggedTensor [[5, 8]]>)

Args
text_input A Tensor or RaggedTensor of untokenized UTF-8 strings.

Returns
A tuple (tokens, start_offsets, end_offsets) of RaggedTensors, where tokens[i1...iN, j] is the jth token of input[i1...iN] (see tokenize for details), start_offsets[i1...iN, j] is the byte offset in the input string where that token starts, and end_offsets[i1...iN, j] is the byte offset immediately after where it ends.
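Because the offsets are byte positions into the input strings, they can be used to recover the surface text behind each wordpiece. A minimal sketch, assuming eager execution and the toy vocabulary from the example above:

import tensorflow as tf
import tensorflow_text as text

vocab = ['they', "##'", '##re', 'the', 'great', '##est', '[UNK]']
tokenizer = text.FastBertTokenizer(vocab=vocab)

sentence = 'greatest'
tokens, starts, ends = tokenizer.tokenize_with_offsets(tf.constant([sentence]))

# Slice the original string with the byte offsets of each wordpiece.
for s, e in zip(starts[0].numpy(), ends[0].numpy()):
    print(sentence[s:e])  # prints 'great' then 'est'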