Tokenizes a tensor of UTF-8 string tokens into subword pieces.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer
text.WordpieceTokenizer(
    vocab_lookup_table,
    suffix_indicator='##',
    max_bytes_per_word=100,
    max_chars_per_token=None,
    token_out_type=dtypes.int64,
    unknown_token='[UNK]',
    split_unknown_characters=False
)
Each UTF-8 string token in the input is split into its corresponding wordpieces, drawing from the list in the file vocab_lookup_table.
Algorithm summary: For each token, the longest token prefix that is in the vocabulary is split off. Any part of the token that remains is prefixed using the suffix_indicator, and the process of removing the longest token prefix continues. The unknown_token (UNK) is used when what remains of the token is not in the vocabulary, or if the token is too long.
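As a rough illustration of this greedy longest-match-first procedure, the pure-Python sketch below splits a single token the same way (the helper name wordpiece_split and the toy vocabulary are illustrative only, not part of the API):
def wordpiece_split(token, vocab, suffix_indicator='##', unknown_token='[UNK]',
                    max_bytes_per_word=100):
    # Overly long tokens map straight to the unknown token.
    if len(token.encode('utf-8')) > max_bytes_per_word:
        return [unknown_token]
    pieces, start = [], 0
    while start < len(token):
        end, current = len(token), None
        while start < end:
            piece = token[start:end]
            if start > 0:
                piece = suffix_indicator + piece  # non-initial pieces carry the suffix marker
            if piece in vocab:
                current = piece
                break
            end -= 1  # shrink the candidate prefix until it is in the vocabulary
        if current is None:
            return [unknown_token]  # no prefix matched: the whole token becomes UNK
        pieces.append(current)
        start = end
    return pieces

vocab = {"they", "##'", "##re", "the", "great", "##est"}
wordpiece_split("they're", vocab)  # ['they', "##'", '##re']
wordpiece_split("are", vocab)      # ['[UNK]']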
When token_out_type is tf.string, the output tensor contains strings in the vocabulary (or UNK). When it is an integer type, the output tensor contains indices into the vocabulary list (with UNK being after the last entry).
Example:
import pathlib
pathlib.Path('/tmp/tok_vocab.txt').write_text(
"they ##' ##re the great ##est".replace(' ', '\n'))
tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt',
token_out_type=tf.string)
tokenizer.tokenize(["they're", "the", "greatest"])
<tf.RaggedTensor [[b'they', b"##'", b'##re'], [b'the'], [b'great', b'##est']]>
tokenizer.tokenize(["they", "are", "great"])
<tf.RaggedTensor [[b'they'], [b'[UNK]'], [b'great']]>
int_tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt',
token_out_type=tf.int32)
int_tokenizer.tokenize(["the", "greatest"])
<tf.RaggedTensor [[3], [4, 5]]>
int_tokenizer.tokenize(["really", "the", "greatest"])
<tf.RaggedTensor [[6], [3], [4, 5]]>
Tensor or ragged tensor inputs result in ragged tensor outputs. Scalar inputs (which are just a single token) result in tensor outputs.
tokenizer.tokenize("they're")
<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'they', b"##'", b'##re'],
dtype=object)>
tokenizer.tokenize(["they're"])
<tf.RaggedTensor [[b'they', b"##'", b'##re']]>
tokenizer.tokenize(tf.ragged.constant([["they're"]]))
<tf.RaggedTensor [[[b'they', b"##'", b'##re']]]>
Empty strings are tokenized into empty (ragged) tensors.
tokenizer.tokenize([""])
<tf.RaggedTensor [[]]>
Args | |
---|---|
vocab_lookup_table | A lookup table implementing the LookupInterface containing the vocabulary of subwords, or a string which is the file path to the vocab.txt file. |
suffix_indicator | (optional) The characters prepended to a wordpiece to indicate that it is a suffix to another subword. Default is '##'. |
max_bytes_per_word | (optional) Max size of input token. Default is 100. |
max_chars_per_token | (optional) Max size of subwords, excluding suffix indicator. If known, providing this improves the efficiency of decoding long words. |
token_out_type | (optional) The type of the token to return. This can be tf.int64 or tf.int32 IDs, or tf.string subwords. The default is tf.int64. |
unknown_token | (optional) The string value to substitute for an unknown token. Default is "[UNK]". If set to None, no substitution occurs. If token_out_type is tf.int32/tf.int64, the vocab_lookup_table is used (after substitution) to convert the unknown token to an integer. |
split_unknown_characters | (optional) Whether to split out single unknown characters as subtokens. If False (default), words containing unknown characters will be treated as single unknown tokens. |
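The examples on this page pass a file path, but vocab_lookup_table can equally be an in-memory table implementing LookupInterface. A minimal sketch, assuming tensorflow is imported as tf and tensorflow_text as text, and reusing the toy vocabulary from the examples above:
import tensorflow as tf
import tensorflow_text as text

vocab = ["they", "##'", "##re", "the", "great", "##est"]
init = tf.lookup.KeyValueTensorInitializer(
    keys=vocab,
    values=tf.range(len(vocab), dtype=tf.int64),
    key_dtype=tf.string,
    value_dtype=tf.int64)
# One OOV bucket so out-of-vocabulary lookups still map to an ID.
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets=1)
tokenizer = text.WordpieceTokenizer(table, token_out_type=tf.string)
tokenizer.tokenize(["they're", "greatest"])
# <tf.RaggedTensor [[b'they', b"##'", b'##re'], [b'great', b'##est']]>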
Methods
detokenize
detokenize(
    token_ids
)
Convert a Tensor or RaggedTensor of wordpiece IDs to string-words.
import pathlib
pathlib.Path('/tmp/detok_vocab.txt').write_text(
'a b c ##a ##b ##c'.replace(' ', '\n'))
wordpiece = WordpieceTokenizer('/tmp/detok_vocab.txt')
token_ids = [[0, 4, 5, 2, 5, 5, 5]]
wordpiece.detokenize(token_ids)
<tf.RaggedTensor [[b'abc', b'cccc']]>
The word pieces are joined along the innermost axis to make words. So the result has the same rank as the input, but the innermost axis of the result indexes words instead of word pieces.
The shape transformation is: [..., wordpieces] => [..., words]
When the input shape is [..., words, wordpieces] (like the output of WordpieceTokenizer.tokenize), the result's shape is [..., words, 1]. The additional ragged axis can be removed using words.merge_dims(-2, -1).
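For example, a round trip through tokenize and detokenize followed by merge_dims (a minimal sketch, assuming tensorflow_text is imported as text and the '/tmp/tok_vocab.txt' file written in the examples above exists):
tokenizer = text.WordpieceTokenizer('/tmp/tok_vocab.txt')  # token_out_type defaults to tf.int64
ids = tokenizer.tokenize(["they're", "the", "greatest"])   # shape [words, wordpieces]
words = tokenizer.detokenize(ids)                          # shape [words, 1]
words.merge_dims(-2, -1)
# [b"they're", b'the', b'greatest']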
Args | |
---|---|
token_ids | A RaggedTensor or Tensor with an int dtype. Must have ndims >= 2. |
Returns | |
---|---|
A RaggedTensor with dtype string and the same rank as the input token_ids. |
split
split(
    input
)
Alias for Tokenizer.tokenize.
split_with_offsets
split_with_offsets(
    input
)
Alias for TokenizerWithOffsets.tokenize_with_offsets.
tokenize
tokenize(
    input
)
Tokenizes a tensor of UTF-8 string tokens further into subword tokens.
Example:
import pathlib
pathlib.Path('/tmp/tok_vocab.txt').write_text(
"they ##' ##re the great ##est".replace(' ', '\n'))
tokens = [["they're", 'the', 'greatest']]
tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt',
token_out_type=tf.string)
tokenizer.tokenize(tokens)
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
[b'great', b'##est']]]>
Args | |
---|---|
input | An N-dimensional Tensor or RaggedTensor of UTF-8 strings. |
Returns | |
---|---|
A RaggedTensor of tokens where tokens[i1...iN, j] is the string contents (or ID in the vocab_lookup_table representing that string) of the jth token in input[i1...iN]. |
tokenize_with_offsets
tokenize_with_offsets(
    input
)
Tokenizes a tensor of UTF-8 string tokens further into subword tokens.
Example:
import pathlib
pathlib.Path('/tmp/tok_vocab.txt').write_text(
"they ##' ##re the great ##est".replace(' ', '\n'))
tokens = [["they're", 'the', 'greatest']]
tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt',
token_out_type=tf.string)
subtokens, starts, ends = tokenizer.tokenize_with_offsets(tokens)
subtokens
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
[b'great', b'##est']]]>
starts
<tf.RaggedTensor [[[0, 4, 5], [0], [0, 5]]]>
ends
<tf.RaggedTensor [[[4, 5, 7], [3], [5, 8]]]>
Args | |
---|---|
input | An N-dimensional Tensor or RaggedTensor of UTF-8 strings. |
Returns | |
---|---|
A tuple (tokens, start_offsets, end_offsets) where: tokens[i1...iN, j] is the string contents (or ID in the vocab_lookup_table representing that string) of the jth token in input[i1...iN]; start_offsets[i1...iN, j] is the byte offset for the inclusive start of the jth token in input[i1...iN]; and end_offsets[i1...iN, j] is the byte offset for the exclusive end of the jth token in input[i1...iN] (i.e., the first byte after the end of the token). |
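Since the offsets are byte positions within each input token, they can be combined with tf.strings.substr to recover the piece text without the suffix marker. A minimal sketch, reusing the tokenizer and vocab file from the example above:
words = tf.ragged.constant([["they're", "greatest"]])
pieces, starts, ends = tokenizer.tokenize_with_offsets(words)
first_word = words[0][0]                     # b"they're"
tf.strings.substr(first_word, starts[0][0], ends[0][0] - starts[0][0])
# [b'they', b"'", b're']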
vocab_size
vocab_size(
    name=None
)
Returns the vocabulary size.
Args | |
---|---|
name | The name argument that is passed to the op function. |
Returns | |
---|---|
A scalar representing the vocabulary size. |
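For instance, with the tokenizer built from the vocab file in the examples above (the exact count depends on the lookup table the tokenizer was constructed with):
tokenizer = text.WordpieceTokenizer('/tmp/tok_vocab.txt')
tokenizer.vocab_size()  # scalar tensor: number of entries in the lookup table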