Tokenizer used for BERT.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer
text.BertTokenizer(
    vocab_lookup_table,
    suffix_indicator='##',
    max_bytes_per_word=100,
    max_chars_per_token=None,
    token_out_type=dtypes.int64,
    unknown_token='[UNK]',
    split_unknown_characters=False,
    lower_case=False,
    keep_whitespace=False,
    normalization_form=None,
    preserve_unused_token=False,
    basic_tokenizer_class=BasicTokenizer
)
This tokenizer performs end-to-end tokenization from text strings to wordpieces. It first applies basic tokenization, followed by wordpiece tokenization. See WordpieceTokenizer for details on the subword tokenization.
For an example of use, see https://www.tensorflow.org/text/guide/bert_preprocessing_guide
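As a quick illustration (a minimal sketch, not taken from the guide above; the vocabulary contents and file path are assumptions made for this example), the tokenizer can be built directly from a wordpiece vocabulary file:

import pathlib
import tensorflow as tf
import tensorflow_text as text

# Write a tiny illustrative wordpiece vocabulary, one entry per line.
pathlib.Path('/tmp/bert_vocab.txt').write_text(
    '\n'.join(['[UNK]', 'the', 'great', '##est']))

bert_tokenizer = text.BertTokenizer(
    vocab_lookup_table='/tmp/bert_vocab.txt',
    token_out_type=tf.string,  # return subword strings instead of int64 IDs
    lower_case=True)           # lowercase, NFD-normalize, and strip accents

bert_tokenizer.tokenize(tf.constant(['The greatest']))
# <tf.RaggedTensor [[[b'the'], [b'great', b'##est']]]>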
Attributes | |
---|---|
vocab_lookup_table | A lookup table implementing the LookupInterface containing the vocabulary of subwords, or a string giving the file path to the vocab.txt file. |
suffix_indicator | (optional) The characters prepended to a wordpiece to indicate that it is a suffix to another subword. Default is '##'. |
max_bytes_per_word | (optional) Max size of an input token. Default is 100. |
max_chars_per_token | (optional) Max size of subwords, excluding the suffix indicator. If known, providing this improves the efficiency of decoding long words. |
token_out_type | (optional) The type of the tokens to return. This can be tf.int64 IDs or tf.string subwords. The default is tf.int64. |
unknown_token | (optional) The value to use when an unknown token is found. Default is "[UNK]". If this is set to a string and token_out_type is tf.int64, the vocab_lookup_table is used to convert the unknown_token to an integer. If this is set to None, out-of-vocabulary tokens are left as is. |
split_unknown_characters | (optional) Whether to split out single unknown characters as subtokens. If False (default), words containing unknown characters are treated as single unknown tokens. |
lower_case | bool - If true, a preprocessing step is added to lowercase the text, apply NFD normalization, and strip accent characters. |
keep_whitespace | bool - If true, preserves whitespace characters instead of stripping them away. |
normalization_form | If set to a valid value and lower_case=False, the input text will be normalized to normalization_form. See the normalize_utf8() op for a list of valid values. |
preserve_unused_token | If true, text matching the regex \\[unused\\d+\\] will be treated as a single token and preserved as is so it can be looked up in the vocabulary. |
basic_tokenizer_class | If set, the class to use instead of BasicTokenizer. |
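The vocab_lookup_table does not have to be a path string. A sketch of passing an already-constructed lookup table (reusing the illustrative /tmp/bert_vocab.txt file from the example above) might look like this:

import tensorflow as tf
import tensorflow_text as text

# Map each line of the vocab file to its line number.
vocab_table = tf.lookup.StaticVocabularyTable(
    tf.lookup.TextFileInitializer(
        '/tmp/bert_vocab.txt',
        key_dtype=tf.string,
        key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
        value_dtype=tf.int64,
        value_index=tf.lookup.TextFileIndex.LINE_NUMBER),
    num_oov_buckets=1)

tokenizer = text.BertTokenizer(vocab_lookup_table=vocab_table)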
Methods
detokenize
detokenize(
    token_ids
)
Convert a Tensor or RaggedTensor of wordpiece IDs to string-words. See WordpieceTokenizer.detokenize for details.
Example:
import pathlib
pathlib.Path('/tmp/tok_vocab.txt').write_text(
    "they ##' ##re the great ##est".replace(' ', '\n'))
tokenizer = BertTokenizer(
    vocab_lookup_table='/tmp/tok_vocab.txt')
text_inputs = tf.constant(['greatest'.encode('utf-8')])
tokenizer.detokenize([[4, 5]])
<tf.RaggedTensor [[b'greatest']]>
Args | |
---|---|
token_ids | A RaggedTensor or Tensor with an int dtype. |
Returns | |
---|---|
A RaggedTensor with dtype string and the same rank as the input token_ids. |
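Because detokenize reverses the wordpiece step, tokenize followed by detokenize round-trips wordpieces back to whole words. A sketch, assuming the /tmp/tok_vocab.txt file from the example above has been written:

import tensorflow as tf
import tensorflow_text as text

tokenizer = text.BertTokenizer(vocab_lookup_table='/tmp/tok_vocab.txt')
# tokenize returns shape [batch, words, wordpieces]; collapse the last two
# dimensions so detokenize sees [batch, wordpieces], as in the example above.
ids = tokenizer.tokenize(tf.constant([b'they greatest']))
tokenizer.detokenize(ids.merge_dims(-2, -1))
# <tf.RaggedTensor [[b'they', b'greatest']]>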
split
split(
    input
)
Alias for Tokenizer.tokenize.
split_with_offsets
split_with_offsets(
    input
)
Alias for TokenizerWithOffsets.tokenize_with_offsets.
tokenize
tokenize(
    text_input
)
Tokenizes a tensor of string tokens into subword tokens for BERT.
Example:
import pathlib
pathlib.Path('/tmp/tok_vocab.txt').write_text(
    "they ##' ##re the great ##est".replace(' ', '\n'))
tokenizer = BertTokenizer(
    vocab_lookup_table='/tmp/tok_vocab.txt')
text_inputs = tf.constant(['greatest'.encode('utf-8')])
tokenizer.tokenize(text_inputs)
<tf.RaggedTensor [[[4, 5]]]>
Args | |
---|---|
text_input | A Tensor or RaggedTensor of untokenized UTF-8 strings. |
Returns | |
---|---|
A RaggedTensor of tokens where tokens[i1...iN, j] is the string contents (or ID in the vocab_lookup_table representing that string) of the jth token in input[i1...iN]. |
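The extra ragged dimension groups wordpieces by the word they came from. When feeding a BERT model, that per-word grouping is usually flattened away; a sketch, again assuming the /tmp/tok_vocab.txt file and tokenizer from the example above:

token_ids = tokenizer.tokenize(tf.constant([b'the greatest']))
# <tf.RaggedTensor [[[3], [4, 5]]]>
flat_ids = token_ids.merge_dims(-2, -1)  # drop the per-word grouping
# <tf.RaggedTensor [[3, 4, 5]]>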
tokenize_with_offsets
tokenize_with_offsets(
    text_input
)
Tokenizes a tensor of string tokens into subword tokens for BERT.
Example:
import pathlib
pathlib.Path('/tmp/tok_vocab.txt').write_text(
    "they ##' ##re the great ##est".replace(' ', '\n'))
tokenizer = BertTokenizer(
    vocab_lookup_table='/tmp/tok_vocab.txt')
text_inputs = tf.constant(['greatest'.encode('utf-8')])
tokenizer.tokenize_with_offsets(text_inputs)
(<tf.RaggedTensor [[[4, 5]]]>,
<tf.RaggedTensor [[[0, 5]]]>,
<tf.RaggedTensor [[[5, 8]]]>)
Args | |
---|---|
text_input | A Tensor or RaggedTensor of untokenized UTF-8 strings. |
Returns | |
---|---|
A tuple (tokens, start_offsets, end_offsets) of RaggedTensors, where tokens[i1...iN, j] is the jth token of input[i1...iN] (see tokenize for details on tokens), start_offsets gives the byte offset in the input string at which each token starts, and end_offsets gives the byte offset just past where each token ends. |
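Because the offsets are byte positions into the corresponding input string, they can be used to slice the original text back out. A sketch continuing the example above:

tokens, starts, ends = tokenizer.tokenize_with_offsets(text_inputs)
# Recover the source span of the first wordpiece of the first word.
tf.strings.substr(text_inputs[0], starts[0][0][0],
                  ends[0][0][0] - starts[0][0][0])
# <tf.Tensor: shape=(), dtype=string, numpy=b'great'>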