text.PhraseTokenizer

Tokenizes a tensor of UTF-8 string tokens into phrases.

Inherits From: Tokenizer, Splitter, Detokenizer

text.PhraseTokenizer(
    vocab=None,
    token_out_type=dtypes.int32,
    unknown_token='<UNK>',
    support_detokenization=True,
    prob=0,
    split_end_punctuation=False,
    model_buffer=None
)

Args
`vocab`	(optional) The list of tokens in the vocabulary.
`token_out_type`	(optional) The type of the token to return. This can be `tf.int64` or `tf.int32` IDs, or `tf.string` subwords.
`unknown_token`	(optional) The string value to substitute for an unknown token. It must be included in `vocab`.
`support_detokenization`	(optional) Whether to make the tokenizer support doing detokenization. Setting it to true expands the size of the model flatbuffer.
`prob`	(optional) Probability of emitting a phrase when there is a match.
`split_end_punctuation`	(optional) Whether to split the end punctuation.
`model_buffer`	(optional) Bytes object (or a uint8 `tf.Tensor`) that contains the phrase model in flatbuffer format (see phrase_tokenizer_model.fbs). If not `None`, all other arguments (except `token_out_type`) are ignored.

Methods

`detokenize`

detokenize(
    input_t
)

Detokenizes a tensor of int64 or int32 phrase ids into sentences.

Tokenizing an input string and then detokenizing the result returns the original string, provided the input string is normalized and the tokenized phrases don't contain `<UNK>`.

Example:

>>> vocab = ["I", "have", "a", "dream", "a dream", "I have a", "<UNK>"]
>>> tokenizer = PhraseTokenizer(vocab, support_detokenization=True)
>>> ids = tf.ragged.constant([[0, 1, 2], [5, 3]])
>>> tokenizer.detokenize(ids)
<tf.Tensor: shape=(2,), dtype=string,
...       numpy=array([b'I have a', b'I have a dream'], dtype=object)>

Args
`input_t`	An N-dimensional `Tensor` or `RaggedTensor` of int64 or int32.

Returns
A `RaggedTensor` of sentences with N - 1 dimensions when N > 1; otherwise, a string tensor.
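
The id-to-sentence mapping described above can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, not the library's implementation: `detokenize_sketch` is a hypothetical helper, and it assumes that detokenization amounts to looking up each id in `vocab` and joining the phrases of a row with single spaces, which reproduces the example output above.

```python
from typing import List, Sequence


def detokenize_sketch(ids: Sequence[Sequence[int]], vocab: Sequence[str]) -> List[str]:
    """Hypothetical helper: map rows of phrase ids back to sentences.

    Assumption: phrases in a row are joined with a single space.
    """
    return [" ".join(vocab[i] for i in row) for row in ids]


vocab = ["I", "have", "a", "dream", "a dream", "I have a", "<UNK>"]

# A 2-dimensional (ragged) input yields a 1-dimensional list of sentences.
sentences = detokenize_sketch([[0, 1, 2], [5, 3]], vocab)
print(sentences)  # ['I have a', 'I have a dream']
```

Note that the input has two dimensions and the result has one, mirroring the N - 1 rule stated in Returns.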

`split`

split(
    input
)

Alias for Tokenizer.tokenize.

`tokenize`

tokenize(
    input
)

Tokenizes a tensor of UTF-8 string tokens further into phrase tokens.

Example, single string tokenization:

>>> vocab = ["I", "have", "a", "dream", "a dream", "I have a", "<UNK>"]
>>> tokenizer = PhraseTokenizer(vocab, token_out_type=tf.string)
>>> tokens = [["I have a dream"]]
>>> phrases = tokenizer.tokenize(tokens)
>>> phrases
<tf.RaggedTensor [[[b'I have a', b'dream']]]>

Args
`input`	An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings.

Returns
`tokens`	A `RaggedTensor`, where `tokens[i, j]` is the j-th token (i.e., phrase) for `input[i]` (i.e., the i-th input string). Each token is either the token's string content or its integer id, i.e., the index of that token string in the vocabulary. The choice is controlled by the `token_out_type` argument passed to the constructor.
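
To make the relationship between string output and id output concrete, here is a minimal pure-Python sketch. It is not the library's algorithm: `phrase_tokenize_sketch` and its `as_ids` flag are hypothetical, and the sketch assumes greedy longest-match over whitespace-separated words, which happens to reproduce the example output above. It ignores `prob` and `split_end_punctuation`.

```python
from typing import List, Sequence, Union


def phrase_tokenize_sketch(
    sentence: str,
    vocab: Sequence[str],
    unknown_token: str = "<UNK>",
    as_ids: bool = False,
) -> List[Union[str, int]]:
    """Hypothetical helper: greedy longest-match phrase lookup.

    Assumption: `unknown_token` is present in `vocab`, as the constructor requires.
    """
    index = {phrase: i for i, phrase in enumerate(vocab)}
    max_words = max(len(phrase.split()) for phrase in vocab)
    words = sentence.split()
    phrases: List[str] = []
    pos = 0
    while pos < len(words):
        # Try the longest candidate phrase first, then shrink the window.
        for n in range(min(max_words, len(words) - pos), 0, -1):
            candidate = " ".join(words[pos:pos + n])
            if candidate in index:
                phrases.append(candidate)
                pos += n
                break
        else:
            # No vocabulary phrase starts at this word: emit the unknown token.
            phrases.append(unknown_token)
            pos += 1
    # Ids are simply the positions of the phrases in the vocabulary.
    return [index[p] for p in phrases] if as_ids else phrases


vocab = ["I", "have", "a", "dream", "a dream", "I have a", "<UNK>"]

as_strings = phrase_tokenize_sketch("I have a dream", vocab)
as_int_ids = phrase_tokenize_sketch("I have a dream", vocab, as_ids=True)
print(as_strings)  # ['I have a', 'dream']
print(as_int_ids)  # [5, 3]
```

The two results carry the same information: each id is the index of the corresponding phrase string in `vocab`, which is the distinction `token_out_type` selects between.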
