text.BertTokenizer

Tokenizer used for BERT.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer

text.BertTokenizer(
    vocab_lookup_table,
    suffix_indicator='##',
    max_bytes_per_word=100,
    max_chars_per_token=None,
    token_out_type=dtypes.int64,
    unknown_token='[UNK]',
    split_unknown_characters=False,
    lower_case=False,
    keep_whitespace=False,
    normalization_form=None,
    preserve_unused_token=False,
    basic_tokenizer_class=BasicTokenizer
)

Used in the notebooks

Used in the guide	Subword tokenizers; BERT Preprocessing with TF Text; Tokenizing with TF Text
Used in the tutorials	TensorFlow Ranking Keras pipeline for distributed training

This tokenizer performs end-to-end tokenization from text strings to wordpieces. It first applies basic tokenization, then wordpiece tokenization to each resulting word.

See WordpieceTokenizer for details on the subword tokenization.

For an example of use, see https://www.tensorflow.org/text/guide/bert_preprocessing_guide
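
The two stages described above can be sketched in plain Python. This is only an illustration under simplifying assumptions, not the library's implementation: the helper names (`basic_tokenize`, `wordpiece_tokenize`, `bert_tokenize`) are hypothetical, stage 1 is reduced to a whitespace and punctuation split, and stage 2 assumes a greedy longest-match-first lookup against the toy vocabulary used in the examples on this page.

```python
import re

# Toy vocabulary copied from the examples on this page; a token's ID is its line index.
VOCAB = ["they", "##'", "##re", "the", "great", "##est"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def basic_tokenize(text):
    """Stage 1 (simplified): split on whitespace and isolate punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)


def wordpiece_tokenize(word, suffix_indicator="##", unknown_token="[UNK]"):
    """Stage 2 (simplified): greedy longest-match-first lookup in VOCAB."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the suffix indicator.
                candidate = suffix_indicator + candidate
            if candidate in TOKEN_TO_ID:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unknown_token]
        pieces.append(match)
        start = end
    return pieces


def bert_tokenize(text):
    """Run both stages and return one list of wordpieces per word."""
    return [wordpiece_tokenize(word) for word in basic_tokenize(text)]


print(bert_tokenize("the greatest"))  # [['the'], ['great', '##est']]
```

Mapping each piece through `TOKEN_TO_ID` gives `[[3], [4, 5]]` for this input, which matches the IDs 4 and 5 that the examples on this page report for 'greatest'.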

Attributes
`vocab_lookup_table`	A lookup table implementing the LookupInterface containing the vocabulary of subwords or a string which is the file path to the vocab.txt file.
`suffix_indicator`	(optional) The characters prepended to a wordpiece to indicate that it is a suffix to another subword. Default is '##'.
`max_bytes_per_word`	(optional) Max size of input token. Default is 100.
`max_chars_per_token`	(optional) Max size of subwords, excluding suffix indicator. If known, providing this improves the efficiency of decoding long words.
`token_out_type`	(optional) The type of the token to return. This can be `tf.int64` IDs, or `tf.string` subwords. The default is `tf.int64`.
`unknown_token`	(optional) The value to use when an unknown token is found. Default is "[UNK]". If this is set to a string, and `token_out_type` is `tf.int64`, the `vocab_lookup_table` is used to convert the `unknown_token` to an integer. If this is set to `None`, out-of-vocabulary tokens are left as is.
`split_unknown_characters`	(optional) Whether to split out single unknown characters as subtokens. If False (default), words containing unknown characters will be treated as single unknown tokens.
`lower_case`	bool - If true, a preprocessing step is added to lowercase the text, apply NFD normalization, and strip accent characters.
`keep_whitespace`	bool - If true, preserves whitespace characters instead of stripping them away.
`normalization_form`	If set to a valid value and lower_case=False, the input text will be normalized to `normalization_form`. See normalize_utf8() op for a list of valid values.
`preserve_unused_token`	If true, text matching the regex `\[unused\d+\]` is treated as a single token and preserved as is, so that it can be looked up in the vocabulary.
`basic_tokenizer_class`	If set, the class to use instead of BasicTokenizer
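
As a rough illustration of two of the attributes above, the sketch below mimics the `lower_case` preprocessing (lowercase, NFD normalization, accent stripping) and the pattern matched by `preserve_unused_token`, using only the Python standard library. The function names are hypothetical. Treating "accent characters" as Unicode nonspacing marks and taking the pattern to be `\[unused\d+\]` are both assumptions about how the attribute descriptions are meant to be read. The library's own ops may differ in detail.

```python
import re
import unicodedata


def lower_case_preprocess(text):
    """Lowercase, apply NFD normalization, then drop combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Assumption: "accent characters" means Unicode category Mn (nonspacing mark).
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Assumption: the reserved-token pattern is read as \[unused\d+\].
UNUSED_TOKEN = re.compile(r"\[unused\d+\]")


def is_unused_token(text):
    """Return True if the whole string looks like a reserved [unusedN] token."""
    return UNUSED_TOKEN.fullmatch(text) is not None


print(lower_case_preprocess("Café Noël"))  # cafe noel
print(is_unused_token("[unused12]"))       # True
print(is_unused_token("[UNK]"))            # False
```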

Methods

`detokenize`


detokenize(
    token_ids
)

Convert a Tensor or RaggedTensor of wordpiece IDs to string-words.

See WordpieceTokenizer.detokenize for details.

Example:

import pathlib
import tensorflow_text as text

# Write a six-entry vocabulary, one wordpiece per line; a token's ID is its line index.
pathlib.Path('/tmp/tok_vocab.txt').write_text(
    "they ##' ##re the great ##est".replace(' ', '\n'))
tokenizer = text.BertTokenizer(
    vocab_lookup_table='/tmp/tok_vocab.txt')
# IDs 4 and 5 are 'great' and '##est'.
tokenizer.detokenize([[4, 5]])
<tf.RaggedTensor [[b'greatest']]>

Args
`token_ids`	A `RaggedTensor` or `Tensor` with an int dtype.

Returns
A `RaggedTensor` with dtype `string` and the same rank as the input `token_ids`.
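
To make the ID-to-word step concrete, here is a minimal pure-Python sketch of how wordpiece IDs could be merged back into words. It assumes the toy vocabulary from the example above and a hypothetical helper `detokenize_ids`. It is not the library's implementation, which operates on tensors and is described under WordpieceTokenizer.detokenize.

```python
# Toy vocabulary copied from the example above; a token's ID is its list index.
VOCAB = ["they", "##'", "##re", "the", "great", "##est"]


def detokenize_ids(token_ids, suffix_indicator="##"):
    """Look up each ID and glue suffix pieces onto the preceding word."""
    words = []
    for token_id in token_ids:
        piece = VOCAB[token_id]
        if piece.startswith(suffix_indicator) and words:
            # A suffix piece continues the current word.
            words[-1] += piece[len(suffix_indicator):]
        else:
            # Any other piece starts a new word.
            words.append(piece)
    return words


print(detokenize_ids([4, 5]))              # ['greatest']
print(detokenize_ids([0, 1, 2, 3, 4, 5]))  # ["they're", 'the', 'greatest']
```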

`split`


split(
    input
)

Alias for Tokenizer.tokenize.

`split_with_offsets`


split_with_offsets(
    input
)

Alias for TokenizerWithOffsets.tokenize_with_offsets.

`tokenize`


tokenize(
    text_input
)

Tokenizes a tensor of UTF-8 strings into subword tokens for BERT.

Example:

import pathlib
import tensorflow as tf
import tensorflow_text as text

# Write a six-entry vocabulary, one wordpiece per line; a token's ID is its line index.
pathlib.Path('/tmp/tok_vocab.txt').write_text(
    "they ##' ##re the great ##est".replace(' ', '\n'))
tokenizer = text.BertTokenizer(
    vocab_lookup_table='/tmp/tok_vocab.txt')
text_inputs = tf.constant(['greatest'])
tokenizer.tokenize(text_inputs)
<tf.RaggedTensor [[[4, 5]]]>

Args
`text_input`	A `Tensor` or `RaggedTensor` of untokenized UTF-8 strings.

Returns
A `RaggedTensor` of tokens where `tokens[i1...iN, j]` is the string contents (or the ID in the `vocab_lookup_table` representing that string) of the `jth` token in `input[i1...iN]`.
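
The example output above, `[[[4, 5]]]`, is nested three levels deep: one entry per input string, then per word, then per wordpiece. The sketch below mirrors that nesting with plain Python lists and shows one way to collapse the word level so each input string maps to a flat list of IDs. The helper `flatten_words` is hypothetical. On an actual `RaggedTensor`, `merge_dims(-2, -1)` is the usual way to get the same effect.

```python
# Output copied from the tokenize example: one input string, one word, two wordpieces.
tokens = [[[4, 5]]]


def flatten_words(batch):
    """Merge the word and wordpiece levels into a single list per input string."""
    return [[piece for word in sentence for piece in word] for sentence in batch]


print(tokens[0][0][1])        # 5, the second wordpiece of the first word
print(flatten_words(tokens))  # [[4, 5]]

# A hypothetical two-word input such as 'the greatest' would nest as [[[3], [4, 5]]].
print(flatten_words([[[3], [4, 5]]]))  # [[3, 4, 5]]
```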

`tokenize_with_offsets`


tokenize_with_offsets(
    text_input
)

Tokenizes a tensor of UTF-8 strings into subword tokens for BERT and returns the start and end offsets of each token.

Example:

import pathlib
import tensorflow as tf
import tensorflow_text as text

# Write a six-entry vocabulary, one wordpiece per line; a token's ID is its line index.
pathlib.Path('/tmp/tok_vocab.txt').write_text(
    "they ##' ##re the great ##est".replace(' ', '\n'))
tokenizer = text.BertTokenizer(
    vocab_lookup_table='/tmp/tok_vocab.txt')
text_inputs = tf.constant(['greatest'])
# Returns (tokens, start offsets, end offsets).
tokenizer.tokenize_with_offsets(text_inputs)
(<tf.RaggedTensor [[[4, 5]]]>,
 <tf.RaggedTensor [[[0, 5]]]>,
 <tf.RaggedTensor [[[5, 8]]]>)

Args
`text_input`	A `Tensor` or `RaggedTensor` of untokenized UTF-8 strings.

Returns
A tuple of three `RaggedTensor`s: the first holds the tokens (see `tokenize` for details), the second holds the start offset of each token, and the third holds the end offset of each token.
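
The offsets in the example above (starts `[0, 5]`, ends `[5, 8]`) can be reproduced for a single word with a short pure-Python sketch. The helper `wordpiece_offsets` is hypothetical, and the sketch assumes that offsets count UTF-8 bytes, that the word starts at offset 0, and that its wordpieces are contiguous. It is not the library's implementation.

```python
def wordpiece_offsets(pieces, suffix_indicator="##"):
    """Return (starts, ends) for the wordpieces of one word that begins at offset 0."""
    starts, ends, position = [], [], 0
    for piece in pieces:
        # The suffix indicator is not part of the original text, so skip it.
        if piece.startswith(suffix_indicator):
            surface = piece[len(suffix_indicator):]
        else:
            surface = piece
        starts.append(position)
        # Assumption: offsets are measured in UTF-8 bytes.
        position += len(surface.encode("utf-8"))
        ends.append(position)
    return starts, ends


print(wordpiece_offsets(["great", "##est"]))  # ([0, 5], [5, 8])
```

Each end offset is exclusive, so slicing the original bytes with `[start:end]` recovers the surface text of that wordpiece.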
