text.BertTokenizer

Tokenizer used for BERT.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer

This tokenizer performs end-to-end tokenization, turning raw text strings into wordpieces. It first applies basic tokenization, followed by wordpiece tokenization.

See WordpieceTokenizer for details on the subword tokenization.

For an example of use, see https://www.tensorflow.org/text/guide/bert_preprocessing_guide

Args
vocab_lookup_table A lookup table implementing the LookupInterface containing the vocabulary of subwords, or a string which is the file path to the vocab.txt file.
suffix_indicator (optional) The characters prepended to a wordpiece to indicate that it is a suffix to another subword. Default is '##'.
max_bytes_per_word (optional) Max size of input token. Default is 100.
max_chars_per_token (optional) Max size of subwords, excluding suffix indicator. If known, providing this improves the efficiency of decoding long words.
token_out_type (optional) The type of the token to return. This can be tf.int64 IDs, or tf.string subwords. The default is tf.int64.
unknown_token (optional) The value to use when an unknown token is found. Default is "[UNK]". If this is set to a string, and token_out_type is tf.int64, the vocab_lookup_table is used to convert the unknown_token to an integer. If this is set to None, out-of-vocabulary tokens are left as is.
split_unknown_characters (optional) Whether to split out single unknown characters as subtokens. If False (default), words containing unknown characters will be treated as single unknown tokens.
lower_case bool - If true, a preprocessing step is added to lowercase the text, apply NFD normalization, and strip accent characters.
keep_whitespace bool - If true, preserves whitespace characters instead of stripping them away.
normalization_form If set to a valid value and lower_case=False, the input text will be normalized to normalization_form. See normalize_utf8() op for a list of valid values.
preserve_unused_token If true, text matching the regex \[unused\d+\] will be treated as a token and thus preserved as is to be looked up in the vocabulary.
basic_tokenizer_class If set, the class to use instead of BasicTokenizer.
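
For illustration, here is a minimal sketch of constructing a tokenizer with several of these options; the vocabulary file path, its contents, and the input text are placeholders invented for this example, and the expected output is shown as a comment.

import pathlib
import tensorflow as tf
from tensorflow_text import BertTokenizer

# Write a tiny placeholder vocabulary, one subword per line.
pathlib.Path('/tmp/tok_vocab.txt').write_text(
    "the great ##est [UNK]".replace(' ', '\n'))
tokenizer = BertTokenizer(
    vocab_lookup_table='/tmp/tok_vocab.txt',
    token_out_type=tf.string,   # return subword strings rather than int64 IDs
    lower_case=True,            # lowercase, NFD-normalize, and strip accents
    unknown_token='[UNK]')      # out-of-vocabulary words map to [UNK]
tokenizer.tokenize(tf.constant(['The Greatest']))
# One ragged row per input string, one row per word, wordpieces within:
# <tf.RaggedTensor [[[b'the'], [b'great', b'##est']]]>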

Methods

detokenize

Convert a Tensor or RaggedTensor of wordpiece IDs to string-words.

See WordpieceTokenizer.detokenize for details.

Example:

import pathlib
import tensorflow as tf
from tensorflow_text import BertTokenizer

pathlib.Path('/tmp/tok_vocab.txt').write_text(
    "they ##' ##re the great ##est".replace(' ', '\n'))
tokenizer = BertTokenizer(
    vocab_lookup_table='/tmp/tok_vocab.txt')
tokenizer.detokenize([[4, 5]])
<tf.RaggedTensor [[b'greatest']]>

Args
token_ids A RaggedTensor or Tensor with an int dtype.

Returns
A RaggedTensor with dtype string and the same rank as the input token_ids.
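
A common round trip pairs this with tokenize. Because tokenize nests wordpieces per word, the word and wordpiece dimensions are merged before detokenizing; the sketch below reuses the same placeholder vocabulary as the example above.

import pathlib
import tensorflow as tf
from tensorflow_text import BertTokenizer

pathlib.Path('/tmp/tok_vocab.txt').write_text(
    "they ##' ##re the great ##est".replace(' ', '\n'))
tokenizer = BertTokenizer(vocab_lookup_table='/tmp/tok_vocab.txt')
word_ids = tokenizer.tokenize(tf.constant(['greatest'.encode('utf-8')]))
# tokenize returns [[[4, 5]]]; collapse the per-word nesting first.
tokenizer.detokenize(word_ids.merge_dims(-2, -1))
# -> <tf.RaggedTensor [[b'greatest']]>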

split

Alias for Tokenizer.tokenize.

split_with_offsets

Alias for TokenizerWithOffsets.tokenize_with_offsets.
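
Both aliases behave exactly like the methods they point to. A quick sketch, using the same placeholder vocabulary as the other examples on this page:

import pathlib
import tensorflow as tf
from tensorflow_text import BertTokenizer

pathlib.Path('/tmp/tok_vocab.txt').write_text(
    "they ##' ##re the great ##est".replace(' ', '\n'))
tokenizer = BertTokenizer(vocab_lookup_table='/tmp/tok_vocab.txt')
text_inputs = tf.constant(['greatest'.encode('utf-8')])
tokenizer.split(text_inputs)                 # same result as tokenize
# -> <tf.RaggedTensor [[[4, 5]]]>
tokenizer.split_with_offsets(text_inputs)    # same as tokenize_with_offsets
# -> (<tf.RaggedTensor [[[4, 5]]]>,
#     <tf.RaggedTensor [[[0, 5]]]>,
#     <tf.RaggedTensor [[[5, 8]]]>)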

tokenize

Tokenizes a tensor of string tokens into subword tokens for BERT.

Example:

import pathlib
import tensorflow as tf
from tensorflow_text import BertTokenizer

pathlib.Path('/tmp/tok_vocab.txt').write_text(
    "they ##' ##re the great ##est".replace(' ', '\n'))
tokenizer = BertTokenizer(
    vocab_lookup_table='/tmp/tok_vocab.txt')
text_inputs = tf.constant(['greatest'.encode('utf-8')])
tokenizer.tokenize(text_inputs)
<tf.RaggedTensor [[[4, 5]]]>

Args
text_input A Tensor or RaggedTensor of untokenized UTF-8 strings.

Returns
A RaggedTensor of tokens where tokens[i1...iN, j] is the string contents (or the ID in the vocab_lookup_table representing that string) of the jth token in text_input[i1...iN].

tokenize_with_offsets

Tokenizes a tensor of string tokens into subword tokens for BERT.

Example:

import pathlib
import tensorflow as tf
from tensorflow_text import BertTokenizer

pathlib.Path('/tmp/tok_vocab.txt').write_text(
    "they ##' ##re the great ##est".replace(' ', '\n'))
tokenizer = BertTokenizer(
    vocab_lookup_table='/tmp/tok_vocab.txt')
text_inputs = tf.constant(['greatest'.encode('utf-8')])
tokenizer.tokenize_with_offsets(text_inputs)
(<tf.RaggedTensor [[[4, 5]]]>,
 <tf.RaggedTensor [[[0, 5]]]>,
 <tf.RaggedTensor [[[5, 8]]]>)

Args
text_input A Tensor or RaggedTensor of untokenized UTF-8 strings.

Returns
A tuple (tokens, start_offsets, end_offsets) of RaggedTensors, where tokens[i1...iN, j] is as described in tokenize, start_offsets[i1...iN, j] is the byte offset in the input string at which the jth token starts, and end_offsets[i1...iN, j] is the byte offset just past where it ends.
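
The start and end values are byte offsets into the input strings, so each token's source span can be sliced back out of the original text. Below is a minimal sketch of that, assuming eager execution and a single input string, with the same placeholder vocabulary as the example above.

import pathlib
import tensorflow as tf
from tensorflow_text import BertTokenizer

pathlib.Path('/tmp/tok_vocab.txt').write_text(
    "they ##' ##re the great ##est".replace(' ', '\n'))
tokenizer = BertTokenizer(vocab_lookup_table='/tmp/tok_vocab.txt')
text_inputs = tf.constant(['greatest'.encode('utf-8')])
tokens, starts, ends = tokenizer.tokenize_with_offsets(text_inputs)
# With a single input string, the innermost flat values line up with its bytes.
original = text_inputs.numpy()[0]                    # b'greatest'
for start, end in zip(starts.flat_values.numpy(), ends.flat_values.numpy()):
    print(original[start:end])                       # b'great', then b'est'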