text.HubModuleSplitter

Splitter that uses a Hub module.

Inherits From: Splitter

The TensorFlow graph from the module performs the actual splitting. The Python code in this class handles the details of interfacing with that module, and adds support for ragged tensors and high-rank tensors.

The Hub module should be supported by hub.load() (https://www.tensorflow.org/hub/api_docs/python/hub/load). If it is a v1 module, it should have a graph variant with an empty set of tags; we consider that graph variant to be the module and ignore everything else. The module should have a signature named default that takes a text input (a rank-1 tensor of strings to split into pieces) and returns a dictionary of tensors, let's say output_dict, such that:

  • output_dict['num_pieces'] is a rank-1 tensor of integers, where num_pieces[i] is the number of pieces that text[i] was split into.

  • output_dict['pieces'] is a rank-1 tensor of strings containing all pieces for text[0], followed by all pieces for text[1], and so on.

  • output_dict['starts'] is a rank-1 tensor of integers with the byte offsets where the pieces start (relative to the beginning of the corresponding input string).

  • output_dict['end'] is a rank-1 tensor of integers with the byte offsets right after the end of the pieces (relative to the beginning of the corresponding input string).

The output dictionary may contain other tensors (e.g., for debugging), but this class does not use them.
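
The flat layout described above can be illustrated with a small pure-Python sketch. It is not this class's implementation: the helper name regroup and the sample values are hypothetical, and plain lists stand in for tensors. It only shows how num_pieces determines which flat entries belong to which input string. In TensorFlow, tf.RaggedTensor.from_row_lengths performs the analogous grouping on tensors.

```python
from itertools import accumulate


def regroup(flat_values, num_pieces):
    """Group a flat list into one sublist per input string (hypothetical helper)."""
    # Row boundaries are the running sums of the per-string piece counts.
    bounds = [0] + list(accumulate(num_pieces))
    if bounds[-1] != len(flat_values):
        raise ValueError("num_pieces must sum to len(flat_values)")
    return [flat_values[lo:hi] for lo, hi in zip(bounds, bounds[1:])]


# Made-up module output for text = ["ab cd", "", "xyz"]; lists stand in for tensors.
output_dict = {
    "num_pieces": [2, 0, 1],
    "pieces": [b"ab", b"cd", b"xyz"],
    "starts": [0, 3, 0],
}

pieces_per_text = regroup(output_dict["pieces"], output_dict["num_pieces"])
starts_per_text = regroup(output_dict["starts"], output_dict["num_pieces"])
print(pieces_per_text)  # [[b'ab', b'cd'], [], [b'xyz']]
print(starts_per_text)  # [[0, 3], [], [0]]
```

Note that an input string with no pieces (the empty string here) contributes a count of 0 and an empty row, so the rows stay aligned with the inputs.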

Example:

>>> from tensorflow_text import HubModuleSplitter
>>> HUB_MODULE = "https://tfhub.dev/google/zh_segmentation/1"
>>> segmenter = HubModuleSplitter(HUB_MODULE)
>>> segmenter.split(["新华社北京"])
<tf.RaggedTensor [[b'\xe6\x96\xb0\xe5\x8d\x8e\xe7\xa4\xbe',
                   b'\xe5\x8c\x97\xe4\xba\xac']]>

You can also use this splitter to return the pieces together with their byte offsets:

>>> HUB_MODULE = "https://tfhub.dev/google/zh_segmentation/1"
>>> segmenter = HubModuleSplitter(HUB_MODULE)
>>> pieces, starts, ends = segmenter.split_with_offsets(["新华社北京"])
>>> print("pieces: %s\nstarts: %s\nends: %s" % (pieces, starts, ends))
pieces: <tf.RaggedTensor [[b'\xe6\x96\xb0\xe5\x8d\x8e\xe7\xa4\xbe',
                           b'\xe5\x8c\x97\xe4\xba\xac']]>
starts: <tf.RaggedTensor [[0, 9]]>
ends: <tf.RaggedTensor [[9, 15]]>
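
Because the offsets count bytes rather than characters, they index into the UTF-8 encoding of the input, not into the Python str. The following sketch needs no TensorFlow; it copies the starts and ends values from the output above and shows that slicing the encoded bytes recovers the pieces.

```python
text = "新华社北京"
encoded = text.encode("utf-8")  # each of these five characters encodes to 3 bytes

# Offsets copied from the example output above.
starts = [0, 9]
ends = [9, 15]

# Each [start, end) pair is a half-open byte range into the encoded string.
recovered = [encoded[s:e] for s, e in zip(starts, ends)]
decoded = [piece.decode("utf-8") for piece in recovered]
print(decoded)  # ['新华社', '北京']
```

Slicing the str directly with these offsets would give wrong results, since the string has 5 characters but 15 bytes.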

Currently, this class also supports an older API, which uses slightly different key names for the output dictionary. For new Hub modules, please use the API described above.

Args
hub_module_handle A string handle accepted by hub.load(). Supported cases include (1) a local path to a directory containing a module, and (2) a handle to a module hosted at, e.g., https://tfhub.dev. The module should implement the signature described in the docstring for this class.

Methods

split

Splits a tensor of UTF-8 strings into pieces.

Args
input_strs An N-dimensional Tensor or RaggedTensor of UTF-8 strings.

Returns
A RaggedTensor of segmented text. The returned shape is the shape of the input tensor with an added ragged dimension for the pieces of each string.
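
To make the shape rule concrete, here is a pure-Python sketch with nested lists standing in for tensors. It is only an illustration under stated assumptions, not the class's actual code: split_nested and whitespace_split are hypothetical names, and whitespace splitting stands in for the Hub module. The idea shown is to flatten the input to a rank-1 batch, split it, and then restore the original nesting, so that every string is replaced by its list of pieces.

```python
def split_nested(strings, split_rank1):
    """Apply a rank-1 splitter to an arbitrarily nested list of strings."""
    # Flatten: collect the leaf strings in traversal order.
    flat = []

    def collect(node):
        if isinstance(node, str):
            flat.append(node)
        else:
            for child in node:
                collect(child)

    collect(strings)

    # One call on the rank-1 batch, as a module with a rank-1 input would need.
    results = iter(split_rank1(flat))

    # Rebuild: walk the same structure, replacing each string by its pieces.
    def rebuild(node):
        if isinstance(node, str):
            return next(results)
        return [rebuild(child) for child in node]

    return rebuild(strings)


def whitespace_split(batch):
    """Stand-in for the Hub module: split each string on whitespace."""
    return [s.split() for s in batch]


nested = [["a b", "c"], ["d e f"]]
print(split_nested(nested, whitespace_split))
# [[['a', 'b'], ['c']], [['d', 'e', 'f']]]
```

The rank-2 input yields a rank-3 result: the outer two levels mirror the input, and the innermost level is the added ragged dimension.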

split_with_offsets

Splits a tensor of UTF-8 strings into pieces with half-open [start, end) byte offsets.

Args
input_strs An N-dimensional Tensor or RaggedTensor of UTF-8 strings.

Returns
A tuple (pieces, start_offsets, end_offsets) where:

  • pieces is a RaggedTensor of strings where pieces[i1...iN, j] is the string content of the j-th piece in input_strs[i1...iN].
  • start_offsets is a RaggedTensor of int64s where start_offsets[i1...iN, j] is the byte offset for the start of the j-th piece in input_strs[i1...iN].
  • end_offsets is a RaggedTensor of int64s where end_offsets[i1...iN, j] is the byte offset immediately after the end of the j-th piece in input_strs[i1...iN].