text.HubModuleTokenizer

Tokenizer that uses a Hub module.

Inherits From: TokenizerWithOffsets, Tokenizer, Splitter

This class is a thin wrapper around an internal HubModuleSplitter. It offers the same functionality, but with 'token'-based method names: e.g., one can use tokenize() instead of the more general, and less informatively named, split().

Example:

>>> HUB_MODULE = "https://tfhub.dev/google/zh_segmentation/1"
>>> segmenter = HubModuleTokenizer(HUB_MODULE)
>>> segmenter.tokenize(["新华社北京"])
<tf.RaggedTensor [[b'\xe6\x96\xb0\xe5\x8d\x8e\xe7\xa4\xbe',
                   b'\xe5\x8c\x97\xe4\xba\xac']]>
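The repr above shows the tokens as raw UTF-8 byte strings. As a plain-Python sanity check (no TensorFlow required), decoding those bytes recovers the two segmented words:

```python
# Byte strings copied from the RaggedTensor repr above.
pieces = [b'\xe6\x96\xb0\xe5\x8d\x8e\xe7\xa4\xbe', b'\xe5\x8c\x97\xe4\xba\xac']

# Decoding each token's bytes yields the readable segments.
decoded = [p.decode("utf-8") for p in pieces]
print(decoded)  # ['新华社', '北京']
```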

You can also use this tokenizer to return the split strings and their offsets:

>>> HUB_MODULE = "https://tfhub.dev/google/zh_segmentation/1"
>>> segmenter = HubModuleTokenizer(HUB_MODULE)
>>> pieces, starts, ends = segmenter.tokenize_with_offsets(["新华社北京"])
>>> print("pieces: %s starts: %s ends: %s" % (pieces, starts, ends))
pieces: <tf.RaggedTensor [[b'\xe6\x96\xb0\xe5\x8d\x8e\xe7\xa4\xbe',
                           b'\xe5\x8c\x97\xe4\xba\xac']]>
starts: <tf.RaggedTensor [[0, 9]]>
ends: <tf.RaggedTensor [[9, 15]]>

Args

hub_module_handle A string handle accepted by hub.load(). Supported cases include (1) a local path to a directory containing a module, and (2) a handle to a module uploaded to a repository such as https://tfhub.dev.

Methods

split

Alias for Tokenizer.tokenize.

split_with_offsets

Alias for TokenizerWithOffsets.tokenize_with_offsets.

tokenize

Tokenizes a tensor of UTF-8 strings into words.

Args
input_strs An N-dimensional Tensor or RaggedTensor of UTF-8 strings.

Returns
A RaggedTensor of segmented text. The returned shape is the shape of the input tensor with an added ragged dimension for tokens of each string.
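A minimal sketch of this shape contract, using a plain-Python whitespace split as a hypothetical stand-in for the hub module (illustration of the ragged dimension only, not of the segmenter's actual behavior):

```python
# A rank-1 input of N strings yields a ragged result of shape [N, None]:
# one variable-length token list per input string.
batch = ["a b c", "d e"]

# Stand-in tokenizer: whitespace split (the real class delegates to the
# hub module instead).
tokens = [s.split() for s in batch]

# The outer dimension matches the input; the inner dimension is ragged.
print(len(tokens))            # 2
print([len(t) for t in tokens])  # [3, 2]
```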

tokenize_with_offsets

Tokenizes a tensor of UTF-8 strings into words with [start,end) offsets.

Args
input_strs An N-dimensional Tensor or RaggedTensor of UTF-8 strings.

Returns
A tuple (tokens, start_offsets, end_offsets) where:

  • tokens is a RaggedTensor of strings where tokens[i1...iN, j] is the string content of the j-th token in input_strs[i1...iN].
  • start_offsets is a RaggedTensor of int64s where start_offsets[i1...iN, j] is the byte offset for the start of the j-th token in input_strs[i1...iN].
  • end_offsets is a RaggedTensor of int64s where end_offsets[i1...iN, j] is the byte offset immediately after the end of the j-th token in input_strs[i1...iN].
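A plain-Python sketch (no TensorFlow required) of how these [start, end) byte offsets index into the UTF-8 bytes of the input. The offsets used here are the ones from the example above; each of the five CJK characters occupies 3 bytes in UTF-8, so "新华社" spans bytes [0, 9) and "北京" spans bytes [9, 15):

```python
# Input string and the offsets returned by tokenize_with_offsets above.
text_in = "新华社北京"
starts = [0, 9]
ends = [9, 15]

# Offsets index into the UTF-8 *bytes* of the input, not its characters.
raw = text_in.encode("utf-8")
tokens = [raw[s:e] for s, e in zip(starts, ends)]

print([t.decode("utf-8") for t in tokens])  # ['新华社', '北京']
```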