text.WordpieceTokenizer

Tokenizes a tensor of UTF-8 string tokens into subword pieces.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer

Each UTF-8 string token in the input is split into its corresponding wordpieces, drawing from the vocabulary given by vocab_lookup_table.

Algorithm summary: For each token, the longest prefix of the token that is in the vocabulary is split off. Any remaining part of the token is prefixed with the suffix_indicator, and the longest-prefix matching repeats on the remainder. The unknown_token (UNK) is used when the remainder cannot be matched against the vocabulary, or when the token exceeds max_bytes_per_word.
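For reference, here is a rough pure-Python sketch of this greedy longest-match procedure. It is illustrative only: the real op works on UTF-8 bytes through the lookup table, and the function name and plain-set vocabulary below are assumptions made for the example.

def wordpiece(token, vocab, suffix_indicator='##',
              unknown_token='[UNK]', max_bytes_per_word=100):
  # Overlong tokens map straight to the unknown token.
  if len(token.encode('utf-8')) > max_bytes_per_word:
    return [unknown_token]
  pieces, start = [], 0
  while start < len(token):
    # Find the longest vocabulary entry that matches the remainder.
    end = len(token)
    while end > start:
      piece = token[start:end]
      if start > 0:
        piece = suffix_indicator + piece  # mark non-initial pieces as suffixes
      if piece in vocab:
        break
      end -= 1
    if end == start:  # no prefix of the remainder is in the vocabulary
      return [unknown_token]
    pieces.append(piece)
    start = end
  return pieces

wordpiece("they're", {"they", "##'", "##re", "the", "great", "##est"})
['they', "##'", '##re']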

When token_out_type is tf.string, the output tensor contains strings in the vocabulary (or UNK). When it is an integer type, the output tensor contains indices into the vocabulary list (with UNK being after the last entry).

Example:

import pathlib
import tensorflow as tf
from tensorflow_text import WordpieceTokenizer
pathlib.Path('/tmp/tok_vocab.txt').write_text(
  "they ##' ##re the great ##est".replace(' ', '\n'))
tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt',
  token_out_type=tf.string)
tokenizer.tokenize(["they're", "the", "greatest"])
<tf.RaggedTensor [[b'they', b"##'", b'##re'], [b'the'], [b'great', b'##est']]>
tokenizer.tokenize(["they", "are", "great"])
<tf.RaggedTensor [[b'they'], [b'[UNK]'], [b'great']]>
int_tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt',
  token_out_type=tf.int32)
int_tokenizer.tokenize(["the", "greatest"])
<tf.RaggedTensor [[3], [4, 5]]>
int_tokenizer.tokenize(["really", "the", "greatest"])
<tf.RaggedTensor [[6], [3], [4, 5]]>

Tensor or ragged tensor inputs result in ragged tensor outputs. Scalar inputs (which are just a single token) result in tensor outputs.

tokenizer.tokenize("they're")
<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'they', b"##'", b'##re'],
dtype=object)>
tokenizer.tokenize(["they're"])
<tf.RaggedTensor [[b'they', b"##'", b'##re']]>
tokenizer.tokenize(tf.ragged.constant([["they're"]]))
<tf.RaggedTensor [[[b'they', b"##'", b'##re']]]>

Empty strings are tokenized into empty (ragged) tensors.

tokenizer.tokenize([""])
<tf.RaggedTensor [[]]>

Args

vocab_lookup_table A lookup table implementing the LookupInterface containing the vocabulary of subwords, or a string giving the file path to the vocab.txt file (see the lookup-table sketch after this list).
suffix_indicator (optional) The characters prepended to a wordpiece to indicate that it is a suffix to another subword. Default is '##'.
max_bytes_per_word (optional) Max size of input token. Default is 100.
max_chars_per_token (optional) Max size of subwords, excluding suffix indicator. If known, providing this improves the efficiency of decoding long words.
token_out_type (optional) The type of the token to return. This can be tf.int64 or tf.int32 IDs, or tf.string subwords. The default is tf.int64.
unknown_token (optional) The string value to substitute for an unknown token. Default is "[UNK]". If set to None, no substitution occurs. If token_out_type is tf.int32/tf.int64, the vocab_lookup_table is used (after substitution) to convert the unknown token to an integer.
split_unknown_characters (optional) Whether to split out single unknown characters as subtokens. If False (default), words containing unknown characters will be treated as single unknown tokens.
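The vocab_lookup_table argument can also be a lookup table built in memory rather than a file path. The following is a minimal sketch under that assumption; the vocabulary contents and the num_oov_buckets=1 choice (which maps unknown subwords to the id after the last vocabulary entry) are illustrative, not requirements.

import tensorflow as tf
import tensorflow_text as tf_text

vocab = ["they", "##'", "##re", "the", "great", "##est"]
init = tf.lookup.KeyValueTensorInitializer(
    keys=vocab,
    values=tf.range(len(vocab), dtype=tf.int64),
    key_dtype=tf.string,
    value_dtype=tf.int64)
# One OOV bucket maps unknown subwords to the id after the last vocab entry.
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets=1)
tokenizer = tf_text.WordpieceTokenizer(table, token_out_type=tf.int64)
tokenizer.tokenize(["they're", "greatest"])
# Expected: <tf.RaggedTensor [[0, 1, 2], [4, 5]]>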

Methods

detokenize

Convert a Tensor or RaggedTensor of wordpiece IDs to string-words.

import pathlib
pathlib.Path('/tmp/detok_vocab.txt').write_text(
    'a b c ##a ##b ##c'.replace(' ', '\n'))
wordpiece = WordpieceTokenizer('/tmp/detok_vocab.txt')
token_ids = [[0, 4, 5, 2, 5, 5, 5]]
wordpiece.detokenize(token_ids)
<tf.RaggedTensor [[b'abc', b'cccc']]>

The word pieces are joined along the innermost axis to make words. So the result has the same rank as the input, but the innermost axis of the result indexes words instead of word pieces.

The shape transformation is: [..., wordpieces] => [..., words]

When the input shape is [..., words, wordpieces] (like the output of WordpieceTokenizer.tokenize) the result's shape is [..., words, 1]. The additional ragged axis can be removed using words.merge_dims(-2, -1).
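For example, a sketch of this round trip, reusing the /tmp/tok_vocab.txt vocabulary created in the class-level example (the expected output is indicative, not verified doctest output):

tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt')
ids = tokenizer.tokenize([["they're", "the", "greatest"]])  # [batch, words, wordpieces]
words = tokenizer.detokenize(ids)                           # [batch, words, 1]
words.merge_dims(-2, -1)                                    # [batch, words]
# Expected: <tf.RaggedTensor [[b"they're", b'the', b'greatest']]>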

Args
token_ids A RaggedTensor or Tensor with an int dtype. Must have ndims >= 2.

Returns
A RaggedTensor with dtype string and the same rank as the input token_ids.

split

Alias for Tokenizer.tokenize.

split_with_offsets

Alias for TokenizerWithOffsets.tokenize_with_offsets.

tokenize

Tokenizes a tensor of UTF-8 string tokens further into subword tokens.

Example:

import pathlib
pathlib.Path('/tmp/tok_vocab.txt').write_text(
    "they ##' ##re the great ##est".replace(' ', '\n'))
tokens = [["they're", 'the', 'greatest']]
tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt',
                               token_out_type=tf.string)
tokenizer.tokenize(tokens)
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
                   [b'great', b'##est']]]>

Args
input An N-dimensional Tensor or RaggedTensor of UTF-8 strings.

Returns
A RaggedTensor of tokens where tokens[i1...iN, j] is the string contents (or ID in the vocab_lookup_table representing that string) of the jth token in input[i1...iN].

tokenize_with_offsets

Tokenizes a tensor of UTF-8 string tokens further into subword tokens.

Example:

import pathlib
pathlib.Path('/tmp/tok_vocab.txt').write_text(
    "they ##' ##re the great ##est".replace(' ', '\n'))
tokens = [["they're", 'the', 'greatest']]
tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt',
                               token_out_type=tf.string)
subtokens, starts, ends = tokenizer.tokenize_with_offsets(tokens)
subtokens
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
                   [b'great', b'##est']]]>
starts
<tf.RaggedTensor [[[0, 4, 5], [0], [0, 5]]]>
ends
<tf.RaggedTensor [[[4, 5, 7], [3], [5, 8]]]>

Args
input An N-dimensional Tensor or RaggedTensor of UTF-8 strings.

Returns
A tuple (tokens, start_offsets, end_offsets) where:

tokens[i1...iN, j] is a RaggedTensor of the string contents (or ID in the vocab_lookup_table representing that string) of the jth token in input[i1...iN].
start_offsets[i1...iN, j] is a RaggedTensor of the byte offsets for the inclusive start of the jth token in input[i1...iN].
end_offsets[i1...iN, j] is a RaggedTensor of the byte offsets for the exclusive end of the jth token in input[i1...iN] (i.e., the first byte after the end of the token).
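As an illustrative sketch (not part of the API surface), the byte offsets from a scalar input can be passed to tf.strings.substr to recover the raw byte span of each subtoken, reusing the tokenizer from the example above:

word = "they're"
pieces, starts, ends = tokenizer.tokenize_with_offsets(word)
tf.strings.substr(word, starts, ends - starts)
# Expected: the raw spans b'they', b"'", b're', i.e. without the ## suffix
# indicator that appears in `pieces`.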

vocab_size

Returns the vocabulary size.

Args
name The name argument that is passed to the op function.

Returns
A scalar representing the vocabulary size.
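
A minimal usage sketch, reusing the /tmp/detok_vocab.txt file from the detokenize example (the exact value depends on the lookup table backing the tokenizer, e.g. whether OOV buckets are counted):

wordpiece = WordpieceTokenizer('/tmp/detok_vocab.txt')
wordpiece.vocab_size()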