Tokenizes a tensor of UTF-8 strings into words according to logits.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter
text.SplitMergeFromLogitsTokenizer(
    force_split_at_break_character=True
)
Methods
split
split(input)
Alias for Tokenizer.tokenize.
split_with_offsets
split_with_offsets(input)
Alias for TokenizerWithOffsets.tokenize_with_offsets.
tokenize
tokenize(strings, logits)
Tokenizes a tensor of UTF-8 strings according to logits.
The logits refer to the split / merge action we should take for each character. For more info, see the doc for the logits argument below.
Example:
strings = ['IloveFlume!', 'and tensorflow']
logits = [
[
# 'I'
[5.0, -3.2], # I: split
# 'love'
[2.2, -1.0], # l: split
[0.2, 12.0], # o: merge
[0.0, 11.0], # v: merge
[-3.0, 3.0], # e: merge
# 'Flume'
[10.0, 0.0], # F: split
[0.0, 11.0], # l: merge
[0.0, 11.0], # u: merge
[0.0, 12.0], # m: merge
[0.0, 12.0], # e: merge
# '!'
[5.2, -7.0], # !: split
# padding:
[1.0, 0.0], [1.0, 1.0], [1.0, 0.0],
], [
# 'and'
[2.0, 0.7], # a: split
[0.2, 1.5], # n: merge
[0.5, 2.3], # d: merge
# ' '
[1.7, 7.0], # <space>: merge
# 'tensorflow'
[2.2, 0.1], # t: split
[0.2, 3.1], # e: merge
[1.1, 2.5], # n: merge
[0.7, 0.9], # s: merge
[0.6, 1.0], # o: merge
[0.3, 1.0], # r: merge
[0.2, 2.2], # f: merge
[0.7, 3.1], # l: merge
[0.4, 5.0], # o: merge
[0.8, 6.0], # w: merge
]]
tokenizer = SplitMergeFromLogitsTokenizer()
tokenizer.tokenize(strings, logits)
<tf.RaggedTensor [[b'I', b'love', b'Flume', b'!'], [b'and', b'tensorflow']]>
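The decision rule driving the example above (pick split when the split logit exceeds the merge logit, otherwise merge) can be sketched in plain Python without TensorFlow. This is an illustrative reference, not the op's implementation; it assumes padding logit pairs beyond the string length are simply ignored, and that the first character always starts a new word:

```python
def split_merge_decode(string, char_logits):
    """Sketch of the split/merge decoding rule (not the real op).

    string: a str; char_logits: a list of [split_logit, merge_logit]
    pairs, one per character. Extra padding pairs are ignored because
    zip stops at the shorter sequence.
    """
    tokens = []
    for ch, (split_logit, merge_logit) in zip(string, char_logits):
        if split_logit > merge_logit or not tokens:
            tokens.append(ch)   # split: start a new token here
        else:
            tokens[-1] += ch    # merge: append to the previous token
    return tokens

print(split_merge_decode(
    'IloveFlume!',
    [[5.0, -3.2], [2.2, -1.0], [0.2, 12.0], [0.0, 11.0], [-3.0, 3.0],
     [10.0, 0.0], [0.0, 11.0], [0.0, 11.0], [0.0, 12.0], [0.0, 12.0],
     [5.2, -7.0]]))
# → ['I', 'love', 'Flume', '!']
```

Note that this sketch does not model `force_split_at_break_character`, which additionally controls how the real tokenizer treats break characters such as spaces.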
Args
strings: a 1D Tensor of UTF-8 strings.
logits: a 3D Tensor; logits[i,j,0] is the logit for the split action for the j-th character of strings[i], and logits[i,j,1] is the logit for the merge action for that same character. For each character, we pick the action with the greater logit. Split starts a new word at this character and merge adds this character to the previous word. The shape of this tensor should be (n, m, 2), where n is the number of strings and m is greater than or equal to the number of characters in each strings[i]. As the elements of the strings tensor may have different lengths (in UTF-8 characters), padding may be required to get a dense tensor; for each row, the extra (padding) pairs of logits are ignored.
Returns
A RaggedTensor of strings where tokens[i, k] is the string content of the k-th token in strings[i].
Raises
InvalidArgumentError: if one of the input Tensors has the wrong shape, e.g., if the logits tensor does not have enough elements for one of the strings.
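Since the strings in a batch usually differ in length, the per-string logits must be padded to a common length m before they can form the dense (n, m, 2) tensor described above. A minimal padding sketch (the helper name and padding value are illustrative; the padded pairs are ignored per row, so their values do not matter):

```python
def pad_logits(per_string_logits, pad_pair=(0.0, 0.0)):
    """Pad each string's list of [split, merge] pairs to equal length.

    per_string_logits: a list (one entry per string) of lists of
    [split_logit, merge_logit] pairs. Returns a rectangular list of
    lists suitable for conversion to a dense (n, m, 2) tensor.
    """
    m = max(len(rows) for rows in per_string_logits)
    return [rows + [list(pad_pair)] * (m - len(rows))
            for rows in per_string_logits]

padded = pad_logits([[[5.0, -3.2]],                  # one character
                     [[2.0, 0.7], [0.2, 1.5]]])      # two characters
# both rows now have length 2; row 0 gained one ignored padding pair
```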
tokenize_with_offsets
tokenize_with_offsets(strings, logits)
Tokenizes a tensor of UTF-8 strings into tokens with [start,end) offsets.
Example:
strings = ['IloveFlume!', 'and tensorflow']
logits = [
[
# 'I'
[5.0, -3.2], # I: split
# 'love'
[2.2, -1.0], # l: split
[0.2, 12.0], # o: merge
[0.0, 11.0], # v: merge
[-3.0, 3.0], # e: merge
# 'Flume'
[10.0, 0.0], # F: split
[0.0, 11.0], # l: merge
[0.0, 11.0], # u: merge
[0.0, 12.0], # m: merge
[0.0, 12.0], # e: merge
# '!'
[5.2, -7.0], # !: split
# padding:
[1.0, 0.0], [1.0, 1.0], [1.0, 0.0],
], [
# 'and'
[2.0, 0.7], # a: split
[0.2, 1.5], # n: merge
[0.5, 2.3], # d: merge
# ' '
[1.7, 7.0], # <space>: merge
# 'tensorflow'
[2.2, 0.1], # t: split
[0.2, 3.1], # e: merge
[1.1, 2.5], # n: merge
[0.7, 0.9], # s: merge
[0.6, 1.0], # o: merge
[0.3, 1.0], # r: merge
[0.2, 2.2], # f: merge
[0.7, 3.1], # l: merge
[0.4, 5.0], # o: merge
[0.8, 6.0], # w: merge
]]
tokenizer = SplitMergeFromLogitsTokenizer()
tokens, starts, ends = tokenizer.tokenize_with_offsets(strings, logits)
tokens
<tf.RaggedTensor [[b'I', b'love', b'Flume', b'!'], [b'and', b'tensorflow']]>
starts
<tf.RaggedTensor [[0, 1, 5, 10], [0, 4]]>
ends
<tf.RaggedTensor [[1, 5, 10, 11], [3, 14]]>
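The returned offsets index bytes of the original strings, so each token can be recovered by slicing the UTF-8 encoding of its source string with its [start, end) pair. A quick pure-Python consistency check against the example values above (the variable names here are illustrative, not part of the API):

```python
strings = ['IloveFlume!', 'and tensorflow']
tokens = [[b'I', b'love', b'Flume', b'!'], [b'and', b'tensorflow']]
starts = [[0, 1, 5, 10], [0, 4]]
ends = [[1, 5, 10, 11], [3, 14]]

for s, toks, row_starts, row_ends in zip(strings, tokens, starts, ends):
    raw = s.encode('utf-8')
    for tok, b, e in zip(toks, row_starts, row_ends):
        # each [start, end) byte range slices out exactly its token
        assert raw[b:e] == tok
print('all offsets consistent')
```

Note how the space in 'and tensorflow' (byte 3) falls between end_offsets[1, 0] == 3 and start_offsets[1, 1] == 4: it belongs to no token.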
Args
strings: a 1D Tensor of UTF-8 strings.
logits: a 3D Tensor; logits[i,j,0] is the logit for the split action for the j-th character of strings[i], and logits[i,j,1] is the logit for the merge action for that same character. For each character, we pick the action with the greater logit. Split starts a new word at this character and merge adds this character to the previous word. The shape of this tensor should be (n, m, 2), where n is the number of strings and m is greater than or equal to the number of characters in each strings[i]. As the elements of the strings tensor may have different lengths (in UTF-8 characters), padding may be required to get a dense tensor; for each row, the extra (padding) pairs of logits are ignored.
Returns
A tuple (tokens, start_offsets, end_offsets) where:
tokens: a RaggedTensor of strings where tokens[i, k] is the string content of the k-th token in strings[i].
start_offsets: a RaggedTensor of int64s where start_offsets[i, k] is the byte offset for the start of the k-th token in strings[i].
end_offsets: a RaggedTensor of int64s where end_offsets[i, k] is the byte offset immediately after the end of the k-th token in strings[i].
Raises
InvalidArgumentError: if one of the input Tensors has the wrong shape, e.g., if the logits tensor does not have enough elements for one of the strings.