Tokenizes a tensor of UTF-8 string tokens into phrases.

Inherits From: Tokenizer, Splitter, Detokenizer

vocab (optional) The list of tokens in the vocabulary.
token_out_type (optional) The type of the token to return. This can be tf.int64 or tf.int32 IDs, or tf.string subwords.
unknown_token (optional) The string value to substitute for an unknown token. It must be included in vocab.
support_detokenization (optional) Whether the tokenizer should support detokenization. Setting it to True increases the size of the model flatbuffer.
prob Probability of emitting a phrase when there is a match.
split_end_punctuation Whether to split the end punctuation.
model_buffer (optional) Bytes object (or a uint8 tf.Tensor) that contains the phrase model in flatbuffer format (see phrase_tokenizer_model.fbs). If not None, all other arguments (except token_out_type) are ignored.




Detokenizes a tensor of int64 or int32 phrase ids into sentences.

Tokenizing an input string and then detokenizing the result returns the original string, provided the input is normalized and the tokenized phrases do not contain <unk>.


>>> vocab = ["I", "have", "a", "dream", "a dream", "I have a", "<UNK>"]
>>> tokenizer = PhraseTokenizer(vocab, support_detokenization=True)
>>> ids = tf.ragged.constant([[0, 1, 2], [5, 3]])
>>> tokenizer.detokenize(ids)
<tf.Tensor: shape=(2,), dtype=string,
...       numpy=array([b'I have a', b'I have a dream'], dtype=object)>

input_t An N-dimensional Tensor or RaggedTensor of int64 or int32.

A RaggedTensor of sentences that has N - 1 dimensions when N > 1. Otherwise, a string tensor.
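
The mapping from phrase ids back to sentences can be pictured with a small pure-Python sketch. It is not the library's implementation: the helper name detokenize_ids is hypothetical, and joining phrases with a single space is an assumption that happens to be consistent with the example output above.

```python
from typing import List, Sequence

# Same vocabulary as the doctest example above.
VOCAB = ["I", "have", "a", "dream", "a dream", "I have a", "<UNK>"]


def detokenize_ids(ids: Sequence[Sequence[int]], vocab: Sequence[str]) -> List[str]:
    """Hypothetical helper: look up each id in vocab and join each row.

    Assumption: phrases in a row are joined with a single space.
    """
    return [" ".join(vocab[i] for i in row) for row in ids]


sentences = detokenize_ids([[0, 1, 2], [5, 3]], VOCAB)
print(sentences)  # ['I have a', 'I have a dream']
```

Each row of ids collapses into one sentence, which is why the result has one dimension fewer than the input.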



Alias for Tokenizer.tokenize.



Tokenizes a tensor of UTF-8 string tokens further into phrase tokens.

Example, single string tokenization:

>>> vocab = ["I", "have", "a", "dream", "a dream", "I have a", "<UNK>"]
>>> tokenizer = PhraseTokenizer(vocab, token_out_type=tf.string)
>>> tokens = [["I have a dream"]]
>>> phrases = tokenizer.tokenize(tokens)
>>> phrases
<tf.RaggedTensor [[[b'I have a', b'dream']]]>

input An N-dimensional Tensor or RaggedTensor of UTF-8 strings.

tokens is a RaggedTensor, where tokens[i, j] is the j-th token (i.e., phrase) for input[i] (i.e., the i-th input word). This token is either the actual token string content, or the corresponding integer id, i.e., the index of that token string in the vocabulary. This choice is controlled by the token_out_type parameter passed to the initializer method.
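
To make the difference between string output and id output concrete, here is a minimal pure-Python sketch. It assumes a greedy longest-match strategy over whitespace-separated words, which reproduces the example above but is not confirmed to be the algorithm the library uses. The function tokenize_phrases and its as_ids flag are hypothetical stand-ins for the tokenizer and its token_out_type parameter.

```python
from typing import List, Sequence, Union

VOCAB = ["I", "have", "a", "dream", "a dream", "I have a", "<UNK>"]


def tokenize_phrases(
    text: str,
    vocab: Sequence[str],
    unknown_token: str = "<UNK>",
    as_ids: bool = False,
) -> List[Union[str, int]]:
    """Hypothetical greedy longest-match phrase tokenizer (illustration only).

    Assumptions: words are whitespace-separated, vocab is non-empty,
    and unknown_token is present in vocab.
    """
    words = text.split()
    index = {phrase: i for i, phrase in enumerate(vocab)}
    max_len = max(len(phrase.split()) for phrase in vocab)
    phrases: List[str] = []
    pos = 0
    while pos < len(words):
        # Try the longest candidate phrase first, then shrink.
        for n in range(min(max_len, len(words) - pos), 0, -1):
            candidate = " ".join(words[pos:pos + n])
            if candidate in index:
                phrases.append(candidate)
                pos += n
                break
        else:
            # No phrase starts at this word: emit the unknown token.
            phrases.append(unknown_token)
            pos += 1
    # An id is the index of the phrase string in the vocabulary.
    return [index[p] for p in phrases] if as_ids else phrases


print(tokenize_phrases("I have a dream", VOCAB))               # ['I have a', 'dream']
print(tokenize_phrases("I have a dream", VOCAB, as_ids=True))  # [5, 3]
```

In this sketch the ids [5, 3] are simply the vocabulary positions of 'I have a' and 'dream', mirroring the relationship between string tokens and integer ids described above.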