Tokenizer that uses a Hub module.

Inherits From: `TokenizerWithOffsets`, `Tokenizer`, `SplitterWithOffsets`, `Splitter`

```python
text.HubModuleTokenizer(
    hub_module_handle
)
```
This class is just a wrapper around an internal HubModuleSplitter. It offers the same functionality, but with 'token'-based method names: e.g., one can use tokenize() instead of the more general and less informatively named split().
Example:

```python
>>> HUB_MODULE = "https://tfhub.dev/google/zh_segmentation/1"
>>> segmenter = HubModuleTokenizer(HUB_MODULE)
>>> segmenter.tokenize(["新华社北京"])
<tf.RaggedTensor [[b'\xe6\x96\xb0\xe5\x8d\x8e\xe7\xa4\xbe', b'\xe5\x8c\x97\xe4\xba\xac']]>
```
You can also use this tokenizer to return the split strings and their offsets:
```python
>>> HUB_MODULE = "https://tfhub.dev/google/zh_segmentation/1"
>>> segmenter = HubModuleTokenizer(HUB_MODULE)
>>> pieces, starts, ends = segmenter.tokenize_with_offsets(["新华社北京"])
>>> print("pieces: %s starts: %s ends: %s" % (pieces, starts, ends))
pieces: <tf.RaggedTensor [[b'\xe6\x96\xb0\xe5\x8d\x8e\xe7\xa4\xbe', b'\xe5\x8c\x97\xe4\xba\xac']]>
starts: <tf.RaggedTensor [[0, 9]]>
ends: <tf.RaggedTensor [[9, 15]]>
```
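Since `starts` and `ends` are byte offsets into the input string, the tokens can be recovered from the original text with `tf.strings.substr`. This is a minimal sketch, not part of the original docs, and it assumes the `starts`, `ends`, and `pieces` values from the example above:

```python
import tensorflow as tf

# Sketch: recover each token from its byte offsets.
flat_starts = starts.flat_values                       # [0, 9]
flat_ends = ends.flat_values                           # [9, 15]
source = tf.fill(tf.shape(flat_starts), "新华社北京")  # one copy per token
recovered = tf.strings.substr(source, flat_starts, flat_ends - flat_starts)
# recovered == [b'新华社', b'北京'], i.e. pieces.flat_values
```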
Methods
split

```python
split(
    input
)
```

Alias for `Tokenizer.tokenize`.
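Because `split` simply forwards to `tokenize`, the two calls are interchangeable. A quick sketch, reusing the `segmenter` from the examples above:

```python
# split() is an alias for tokenize(), so both return the same RaggedTensor.
tokens_a = segmenter.tokenize(["新华社北京"])
tokens_b = segmenter.split(["新华社北京"])
assert tokens_a.to_list() == tokens_b.to_list()
```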
split_with_offsets

```python
split_with_offsets(
    input
)
```

Alias for `TokenizerWithOffsets.tokenize_with_offsets`.
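Likewise for the offset variant, a one-line sketch reusing the same `segmenter`:

```python
# split_with_offsets() forwards to tokenize_with_offsets(); both return
# the same (tokens, start_offsets, end_offsets) tuple.
tokens, starts, ends = segmenter.split_with_offsets(["新华社北京"])
```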
tokenize

```python
tokenize(
    input_strs
)
```
Tokenizes a tensor of UTF-8 strings into words.
| Args | |
|---|---|
| `input_strs` | An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings. |
| Returns | |
|---|---|
| A `RaggedTensor` of segmented text. The returned shape is the shape of the input tensor with an added ragged dimension for the tokens of each string. |
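For example (a sketch reusing the `segmenter` from above; the exact tokens depend on the Hub module), a rank-1 input of shape `[2]` produces a result of shape `[2, None]`:

```python
# Two input strings -> one extra ragged token dimension in the output.
tokens = segmenter.tokenize(["新华社北京", "北京"])
print(tokens.shape)  # (2, None)
```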
tokenize_with_offsets

```python
tokenize_with_offsets(
    input_strs
)
```
Tokenizes a tensor of UTF-8 strings into words with [start,end) offsets.
| Args | |
|---|---|
| `input_strs` | An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings. |
| Returns | |
|---|---|
| A tuple `(tokens, start_offsets, end_offsets)` where `tokens` is a `RaggedTensor` of token strings, `start_offsets` is a `RaggedTensor` of int64 byte offsets marking where each token starts in the corresponding input string, and `end_offsets` is a `RaggedTensor` of int64 byte offsets marking just past where each token ends. |
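A short sketch (reusing the `segmenter` from the examples above) tying the returned offsets back to the UTF-8 bytes of the input:

```python
# With half-open [start, end) byte offsets, end - start is each token's
# length in bytes; each of these Chinese characters occupies 3 UTF-8 bytes,
# so '新华社' spans bytes [0, 9) and '北京' spans bytes [9, 15).
pieces, starts, ends = segmenter.tokenize_with_offsets(["新华社北京"])
token_byte_lengths = ends - starts  # <tf.RaggedTensor [[9, 6]]>
```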