text.FastWordpieceTokenizer

Tokenizes a tensor of UTF-8 string tokens into subword pieces.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer

It employs the linear-time (as opposed to quadratic-time) WordPiece algorithm described in the paper "Fast WordPiece Tokenization" (EMNLP 2021).

Differences compared to the classic WordpieceTokenizer are as follows (as of 11/2021):

  • unknown_token cannot be None or empty. That means if a word is too long or cannot be tokenized, FastWordpieceTokenizer always returns unknown_token. In contrast, the original WordpieceTokenizer would return the original word if unknown_token is empty or None. (A short sketch after this list illustrates this behavior.)

  • unknown_token must be included in the vocabulary.

  • When unknown_token is returned, in tokenize_with_offsets(), the result end_offset is set to be the length of the original input word. In contrast, when unknown_token is returned by the original WordpieceTokenizer, the end_offset is set to be the length of the unknown_token string.

  • split_unknown_characters is not supported.

  • max_chars_per_token is not used or needed.

  • By default the input is assumed to be general text (i.e., sentences), and FastWordpieceTokenizer first splits it on whitespace and punctuation and then applies WordPiece tokenization (see the parameter no_pretokenization). If the input already contains single words only, set no_pretokenization=True to be consistent with the classic WordpieceTokenizer.
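For instance, here is a minimal sketch of the unknown_token behavior described above. It assumes the same setup as the examples below (tf available and FastWordpieceTokenizer imported from the tensorflow_text package), uses a toy vocabulary, and notes the expected results in comments rather than as exact printed output:

>>> vocab = ["they", "##'", "##re", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string,
...                                    no_pretokenization=True)
>>> tokens, starts, ends = tokenizer.tokenize_with_offsets(["greatest"])
>>> # "greatest" cannot be tokenized with this vocabulary, so the whole word
>>> # maps to [UNK], and its end offset is the byte length of the word, i.e. 8.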

Args

vocab (optional) The list of tokens in the vocabulary.
suffix_indicator (optional) The characters prepended to a wordpiece to indicate that it is a suffix of another subword.
max_bytes_per_word (optional) The maximum size of an input token, in bytes.
token_out_type (optional) The type of the tokens to return. This can be tf.int64 or tf.int32 IDs, or tf.string subwords.
unknown_token (optional) The string value to substitute for an unknown token. It must be included in vocab.
no_pretokenization (optional) By default, the input is split on whitespace and punctuation before applying WordPiece tokenization. When true, the input is assumed to be pretokenized already.
support_detokenization (optional) Whether to make the tokenizer support detokenization. Setting it to true expands the size of the model flatbuffer. As a reference, when using the 120k multilingual BERT WordPiece vocab, the flatbuffer's size increases from ~5MB to ~6MB.
model_buffer (optional) Bytes object (or a uint8 tf.Tensor) that contains the WordPiece model in flatbuffer format (see fast_wordpiece_tokenizer_model.fbs). If not None, all other arguments (except token_out_type) are ignored.
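A hedged sketch of typical construction from a vocabulary file (the file name vocab.txt is hypothetical and stands for a plain-text file with one token per line that includes unknown_token):

>>> with open("vocab.txt") as f:  # hypothetical vocab file, one token per line
...   vocab = f.read().splitlines()
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.int64,
...                                    support_detokenization=True)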

Methods

detokenize

Detokenizes a tensor of int64 or int32 subword ids into sentences.

Detokenizing the output of tokenize recovers the original input string when the input is normalized and the tokenized wordpieces don't contain the unknown token.

Example:

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
...          "'", "re", "ok"]
>>> tokenizer = FastWordpieceTokenizer(vocab, support_detokenization=True)
>>> ids = tf.ragged.constant([[0, 1, 2, 3, 4, 5], [9]])
>>> tokenizer.detokenize(ids)
<tf.Tensor: shape=(2,), dtype=string, numpy=array([b"they're the greatest", b'ok'], dtype=object)>
>>> ragged_ids = tf.ragged.constant([[[0, 1, 2, 3, 4, 5], [9]], [[4, 5]]])
>>> tokenizer.detokenize(ragged_ids)
<tf.RaggedTensor [[b"they're the greatest", b'ok'], [b'greatest']]>
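A plain dense Tensor of ids is accepted as well (a hedged sketch reusing the tokenizer above; the expected result is noted in a comment rather than shown as exact printed output):

>>> dense_ids = tf.constant([[0, 1, 2, 3, 4, 5]], dtype=tf.int64)
>>> sentences = tokenizer.detokenize(dense_ids)
>>> # Expected: a 1-D result containing [b"they're the greatest"].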

Args
input An N-dimensional Tensor or RaggedTensor of int64 or int32.

Returns
A RaggedTensor of sentences that has N - 1 dimensions when N > 1. Otherwise, a string tensor.

split

Alias for Tokenizer.tokenize.

split_with_offsets

Alias for TokenizerWithOffsets.tokenize_with_offsets.

tokenize

Tokenizes a tensor of UTF-8 string tokens further into subword tokens.

Example 1, single word tokenization:

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string,
...                                    no_pretokenization=True)
>>> tokens = [["they're", "the", "greatest"]]
>>> tokenizer.tokenize(tokens)
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
                   [b'great', b'##est']]]>

Example 2, general text tokenization (pre-tokenization on punctuation and whitespace followed by WordPiece tokenization):

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
...          "'", "re"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string)
>>> tokens = [["they're the greatest", "the greatest"]]
>>> tokenizer.tokenize(tokens)
<tf.RaggedTensor [[[b'they', b"'", b're', b'the', b'great', b'##est'],
                   [b'the', b'great', b'##est']]]>

Args
input An N-dimensional Tensor or RaggedTensor of UTF-8 strings.

Returns
A RaggedTensor of tokens where tokens[i, j] is the j-th token (i.e., wordpiece) for input[i] (i.e., the i-th input word). This token is either the actual token string content, or the corresponding integer id, i.e., the index of that token string in the vocabulary. This choice is controlled by the token_out_type parameter passed to the initializer method.
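As a hedged illustration of the token_out_type choice, constructing the tokenizer with the default tf.int64 output returns vocabulary indices instead of strings (reusing the vocab from Example 2 above; the ids in the comment follow from that vocabulary's order):

>>> id_tokenizer = FastWordpieceTokenizer(vocab)  # default token_out_type=tf.int64
>>> ids = id_tokenizer.tokenize(["they're"])
>>> # Expected wordpieces "they", "'", "re", i.e. vocabulary ids [[0, 7, 8]].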

tokenize_with_offsets

Tokenizes a tensor of UTF-8 string tokens further into subword tokens.

Example 1, single word tokenization:

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string,
...                                    no_pretokenization=True)
>>> tokens = [["they're", "the", "greatest"]]
>>> subtokens, starts, ends = tokenizer.tokenize_with_offsets(tokens)
>>> subtokens
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
                   [b'great', b'##est']]]>
>>> starts
<tf.RaggedTensor [[[0, 4, 5], [0], [0, 5]]]>
>>> ends
<tf.RaggedTensor [[[4, 5, 7], [3], [5, 8]]]>

Example 2, general text tokenization (pre-tokenization on punctuation and whitespace followed by WordPiece tokenization):

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
...          "'", "re"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string)
>>> tokens = [["they're the greatest", "the greatest"]]
>>> subtokens, starts, ends = tokenizer.tokenize_with_offsets(tokens)
>>> subtokens
<tf.RaggedTensor [[[b'they', b"'", b're', b'the', b'great', b'##est'],
                   [b'the', b'great', b'##est']]]>
>>> starts
<tf.RaggedTensor [[[0, 4, 5, 8, 12, 17], [0, 4, 9]]]>
>>> ends
<tf.RaggedTensor [[[4, 5, 7, 11, 17, 20], [3, 9, 12]]]>

Args
input An N-dimensional Tensor or RaggedTensor of UTF-8 strings.

Returns
A tuple (tokens, start_offsets, end_offsets) where:
tokens is a RaggedTensor, where tokens[i, j] is the j-th token (i.e., wordpiece) for input[i] (i.e., the i-th input word). This token is either the actual token string content, or the corresponding integer id, i.e., the index of that token string in the vocabulary. This choice is controlled by the token_out_type parameter passed to the initializer method.
start_offsets[i1...iN, j] is a RaggedTensor of the byte offsets for the inclusive start of the j-th token in input[i1...iN].
end_offsets[i1...iN, j] is a RaggedTensor of the byte offsets for the exclusive end of the j-th token in input[i1...iN] (i.e., the first byte after the end of the token).
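The offsets are byte offsets into the original input string (here they coincide with character offsets because the text is ASCII), so slicing the input with them recovers the text covered by each wordpiece, without the suffix indicator. For example, using the first string from Example 2 above, where the piece ##est is reported with start 17 and end 20:

>>> "they're the greatest"[17:20]
'est'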