text.HubModuleTokenizer

Tokenizer that uses a Hub module.

Inherits From: TokenizerWithOffsets, Tokenizer, Splitter

This class is a thin wrapper around an internal HubModuleSplitter. It offers the same functionality, but with 'token'-based method names: e.g., one can use tokenize() instead of the more general, and less informatively named, split().

Example:

>>> HUB_MODULE = "https://tfhub.dev/google/zh_segmentation/1"
>>> segmenter = HubModuleTokenizer(HUB_MODULE)
>>> segmenter.tokenize(["新华社北京"])
<tf.RaggedTensor [[b'\xe6\x96\xb0\xe5\x8d\x8e\xe7\xa4\xbe',
                   b'\xe5\x8c\x97\xe4\xba\xac']]>
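The repr above shows the tokens as raw UTF-8 byte strings. As a plain-Python sanity check (no TensorFlow required), decoding those bytes recovers the two segmented words:

```python
# Byte strings copied from the RaggedTensor repr above.
pieces = [b'\xe6\x96\xb0\xe5\x8d\x8e\xe7\xa4\xbe', b'\xe5\x8c\x97\xe4\xba\xac']

# Decoding each token's bytes yields the readable segments.
decoded = [p.decode("utf-8") for p in pieces]
print(decoded)  # ['新华社', '北京']
```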

You can also use this tokenizer to return the split strings and their offsets:

>>> HUB_MODULE = "https://tfhub.dev/google/zh_segmentation/1"
>>> segmenter = HubModuleTokenizer(HUB_MODULE)
>>> pieces, starts, ends = segmenter.tokenize_with_offsets(["新华社北京"])
>>> print("pieces: %s starts: %s ends: %s" % (pieces, starts, ends))
pieces: <tf.RaggedTensor [[b'\xe6\x96\xb0\xe5\x8d\x8e\xe7\xa4\xbe',
                           b'\xe5\x8c\x97\xe4\xba\xac']]>
starts: <tf.RaggedTensor [[0, 9]]>
ends: <tf.RaggedTensor [[9, 15]]>

Args

hub_module_handle A string handle accepted by hub.load(). Supported cases include (1) a local path to a directory containing a module, and (2) a handle to a module uploaded to a repository such as https://tfhub.dev.

Methods

split

Alias for Tokenizer.tokenize.

split_with_offsets

Alias for TokenizerWithOffsets.tokenize_with_offsets.

tokenize

Tokenizes a tensor of UTF-8 strings into words.

Args
input_strs An N-dimensional Tensor or RaggedTensor of UTF-8 strings.

Returns
A RaggedTensor of segmented text. The returned shape is the shape of the input tensor with an added ragged dimension for tokens of each string.
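A minimal sketch of this shape contract, using a plain-Python whitespace split as a hypothetical stand-in for the hub module (illustration of the ragged dimension only, not of the segmenter's actual behavior):

```python
# A rank-1 input of N strings yields a ragged result of shape [N, None]:
# one variable-length token list per input string.
batch = ["a b c", "d e"]

# Stand-in tokenizer: whitespace split (the real class delegates to the
# hub module instead).
tokens = [s.split() for s in batch]

# The outer dimension matches the input; the inner dimension is ragged.
print(len(tokens))            # 2
print([len(t) for t in tokens])  # [3, 2]
```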

tokenize_with_offsets

Tokenizes a tensor of UTF-8 strings into words with [start,end) offsets.

Args
input_strs An N-dimensional Tensor or RaggedTensor of UTF-8 strings.

Returns
A tuple (tokens, start_offsets, end_offsets) where:

  • tokens is a RaggedTensor of strings where tokens[i1...iN, j] is the string content of the j-th token in input_strs[i1...iN].
  • start_offsets is a RaggedTensor of int64s where start_offsets[i1...iN, j] is the byte offset for the start of the j-th token in input_strs[i1...iN].
  • end_offsets is a RaggedTensor of int64s where end_offsets[i1...iN, j] is the byte offset immediately after the end of the j-th token in input_strs[i1...iN].
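A plain-Python sketch (no TensorFlow required) of how these [start, end) byte offsets index into the UTF-8 bytes of the input. The offsets used here are the ones from the example above; each of the five CJK characters occupies 3 bytes in UTF-8, so "新华社" spans bytes [0, 9) and "北京" spans bytes [9, 15):

```python
# Input string and the offsets returned by tokenize_with_offsets above.
text_in = "新华社北京"
starts = [0, 9]
ends = [9, 15]

# Offsets index into the UTF-8 *bytes* of the input, not its characters.
raw = text_in.encode("utf-8")
tokens = [raw[s:e] for s, e in zip(starts, ends)]

print([t.decode("utf-8") for t in tokens])  # ['新华社', '北京']
```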