Tokenizer that uses a Hub module.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter
text.HubModuleTokenizer(
    hub_module_handle
)
This class is just a wrapper around an internal HubModuleSplitter. It offers the same functionality, but with 'token'-based method names: e.g., one can use tokenize() instead of the more general and less informatively named split().
Example:
import tensorflow_hub as hub
import tensorflow_text as text

HUB_MODULE = "https://tfhub.dev/google/zh_segmentation/1"
segmenter = text.HubModuleTokenizer(hub.resolve(HUB_MODULE))
segmenter.tokenize(["新华社北京"])
You can also use this tokenizer to return the split strings and their offsets:
import tensorflow_hub as hub
import tensorflow_text as text

HUB_MODULE = "https://tfhub.dev/google/zh_segmentation/1"
segmenter = text.HubModuleTokenizer(hub.resolve(HUB_MODULE))
pieces, starts, ends = segmenter.tokenize_with_offsets(["新华社北京"])
print("pieces: %s starts: %s ends: %s" % (pieces, starts, ends))
pieces: <tf.RaggedTensor [[b'新华社', b'北京']]> starts: <tf.RaggedTensor [[0, 9]]> ends: <tf.RaggedTensor [[9, 15]]>
Methods
split
split(
    input
)
Alias for Tokenizer.tokenize.
split_with_offsets
split_with_offsets(
    input
)
Alias for TokenizerWithOffsets.tokenize_with_offsets.
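For illustration, a minimal sketch (assuming a segmenter built as in the examples above) showing that the alias methods return the same values as their token-named counterparts:
# Minimal sketch; assumes `segmenter` was constructed as in the examples above.
tokens = segmenter.tokenize(["新华社北京"])
same_tokens = segmenter.split(["新华社北京"])  # identical RaggedTensor
pieces, starts, ends = segmenter.split_with_offsets(["新华社北京"])  # same as tokenize_with_offsets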
tokenize
tokenize(
    input_strs
)
Tokenizes a tensor of UTF-8 strings into words.
Args | |
---|---|
input_strs | An N-dimensional Tensor or RaggedTensor of UTF-8 strings.
Returns | |
---|---|
A RaggedTensor of segmented text. The returned shape is the shape of the input tensor with an added ragged dimension for the tokens of each string.
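For example, a short sketch of this shape rule (assuming the segmenter from the examples above; the printed shape follows from the rule described here):
# Sketch of the shape rule; assumes `segmenter` from the examples above.
tokens = segmenter.tokenize(["新华社北京", "北京"])  # input shape: [2]
print(tokens.shape)  # (2, None): one ragged token dimension is appended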
tokenize_with_offsets
tokenize_with_offsets(
    input_strs
)
Tokenizes a tensor of UTF-8 strings into words with [start,end) offsets.
Args | |
---|---|
input_strs | An N-dimensional Tensor or RaggedTensor of UTF-8 strings.
Returns | |
---|---|
A tuple (tokens, start_offsets, end_offsets) where tokens is a RaggedTensor of segmented text, start_offsets is a RaggedTensor of each token's starting byte offset within its input string, and end_offsets is a RaggedTensor of the byte offset just past the end of each token.
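A short sketch of how the byte offsets can be used, here with tf.strings.substr to recover each token from the original string (assumes the segmenter from the examples above; variable names are illustrative):
import tensorflow as tf

# Illustrative sketch; assumes `segmenter` from the examples above.
text_input = tf.constant(["新华社北京"])
tokens, starts, ends = segmenter.tokenize_with_offsets(text_input)

# Offsets are byte positions within the input string, so substr with
# unit="BYTE" recovers each token from the original text.
num_tokens = tf.size(starts[0])
repeated = tf.repeat(text_input, num_tokens)
recovered = tf.strings.substr(repeated, starts[0], ends[0] - starts[0], unit="BYTE")
# `recovered` should match tokens[0].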