Splitter that uses a Hub module.
text.HubModuleSplitter( hub_module_handle )
The TensorFlow graph from the module performs the real work. The Python code from this class handles the details of interfacing with that module, as well as the support for ragged tensors and high-rank tensors.
The Hub module should be supported by hub.load() (https://www.tensorflow.org/hub/api_docs/python/hub/load). If it is a v1 module, it should have a graph variant with an empty set of tags; we consider that graph variant to be the module and ignore everything else. The module should have a default signature that takes a text input (a rank-1 tensor of strings to split into pieces) and returns a dictionary of tensors, let's say output_dict, such that:
output_dict['num_pieces'] is a rank-1 tensor of integers, where num_pieces[i] is the number of pieces that text[i] was split into.
output_dict['starts'] is a rank-1 tensor of integers with the byte offsets where the pieces start (relative to the beginning of the corresponding input string).
output_dict['end'] is a rank-1 tensor of integers with the byte offsets right after the end of the tokens (relative to the beginning of the corresponding input string).
The output dictionary may contain other tensors (e.g., for debugging), but this class does not use them.
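The relationship between the three outputs can be sketched in plain Python, without TensorFlow. This is only an illustration: the helper group_pieces is hypothetical, and it assumes that the rank-1 start and end tensors are the per-string offsets concatenated across the batch, so that num_pieces gives the row lengths needed to regroup them.

```python
def group_pieces(texts, num_pieces, starts, ends):
    """Regroup flat module-style outputs into one list of pieces per input string.

    Hypothetical helper for illustration; assumes starts/ends are the
    per-string byte offsets concatenated in batch order.
    """
    grouped = []
    cursor = 0  # position in the flat starts/ends lists
    for text, count in zip(texts, num_pieces):
        raw = text.encode("utf-8")  # offsets are in bytes, not characters
        row = [raw[starts[k]:ends[k]] for k in range(cursor, cursor + count)]
        grouped.append(row)
        cursor += count
    return grouped


# Made-up inputs: two strings, each split into two pieces.
texts = ["新华社北京", "ab cd"]
num_pieces = [2, 2]
starts = [0, 9, 0, 3]
ends = [9, 15, 2, 5]

pieces = group_pieces(texts, num_pieces, starts, ends)
```

Each offset is relative to its own input string, which is why the second string's offsets restart at 0.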
>>> HUB_MODULE = "https://tfhub.dev/google/zh_segmentation/1"
>>> segmenter = HubModuleSplitter(HUB_MODULE)
You can also use this tokenizer to return the split strings and their offsets:
>>> HUB_MODULE = "https://tfhub.dev/google/zh_segmentation/1"
>>> segmenter = HubModuleSplitter(HUB_MODULE)
>>> pieces, starts, ends = segmenter.split_with_offsets(["新华社北京"])
>>> print("pieces: %s starts: %s ends: %s" % (pieces, starts, ends))
pieces: <tf.RaggedTensor [[b'\xe6\x96\xb0\xe5\x8d\x8e\xe7\xa4\xbe', b'\xe5\x8c\x97\xe4\xba\xac']]> starts: <tf.RaggedTensor [[0, 9]]> ends: <tf.RaggedTensor [[9, 15]]>
Currently, this class also supports an older API, which uses slightly different key names for the output dictionary. For new Hub modules, please use the API described above.
|hub_module_handle|A string handle accepted by hub.load(). Supported cases include (1) a local path to a directory containing a module, and (2) a handle to a module uploaded to, e.g., https://tfhub.dev. The module should implement the signature described in the docstring for this class.|
split( input_strs )
Splits a tensor of UTF-8 strings into pieces.
split_with_offsets( input_strs )
Splits a tensor of UTF-8 strings into pieces with [start,end) offsets.
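As a small sketch of what the [start,end) convention means, the offsets from the split_with_offsets example above can be applied to the UTF-8 bytes of the input in plain Python. The offset values are copied from that example output; nothing here calls the Hub module itself.

```python
text = "新华社北京"
raw = text.encode("utf-8")  # 5 characters, 3 bytes each, 15 bytes total

# Offsets as printed in the split_with_offsets example.
starts = [0, 9]
ends = [9, 15]

# Half-open intervals map directly onto Python slices: raw[start:end].
decoded = [raw[s:e].decode("utf-8") for s, e in zip(starts, ends)]

# Because the end is exclusive, end - start is the piece length in bytes.
byte_lengths = [e - s for s, e in zip(starts, ends)]

print(decoded)       # ['新华社', '北京']
print(byte_lengths)  # [9, 6]
```

Since the offsets count bytes rather than characters, they should be applied to the encoded bytes, not to the Python str.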