Tokenizes a tensor of UTF-8 strings into words according to logits.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter
text.SplitMergeFromLogitsTokenizer(
    force_split_at_break_character=True
)
Args

force_split_at_break_character: A bool indicating whether to force-start a new word after an ICU-defined whitespace character. Regardless of this parameter, we never include a whitespace character in a token, and we always ignore the split/merge action for the whitespace character itself; the parameter controls only what happens after a whitespace. A runnable contrast between the two settings follows this section.

If force_split_at_break_character is true, we create a new word starting at the first non-space character, regardless of the 0/1 label for that character. For instance:

s = [2.0, 1.0]  # sample pair of logits indicating a split action
m = [1.0, 3.0]  # sample pair of logits indicating a merge action

strings = ["New York"]
logits = [[s, m, m, s, m, m, m, m]]
output tokens = [["New", "York"]]

strings = ["New York"]
logits = [[s, m, m, m, m, m, m, m]]
output tokens = [["New", "York"]]

strings = ["New York"]
logits = [[s, m, m, m, s, m, m, m]]
output tokens = [["New", "York"]]

Otherwise, we create a new word or continue the current one depending on the action for the first non-whitespace character:

s = [2.0, 1.0]  # sample pair of logits indicating a split action
m = [1.0, 3.0]  # sample pair of logits indicating a merge action

strings = ["New York"]
logits = [[s, m, m, s, m, m, m, m]]
output tokens = [["NewYork"]]

strings = ["New York"]
logits = [[s, m, m, m, m, m, m, m]]
output tokens = [["NewYork"]]

strings = ["New York"]
logits = [[s, m, m, m, s, m, m, m]]
output tokens = [["New", "York"]]
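A minimal runnable sketch of the contrast above, assuming tensorflow_text is installed and imported as text:

import tensorflow_text as text

s, m = [2.0, 1.0], [1.0, 3.0]        # split / merge logit pairs
strings = ['New York']
logits = [[s, m, m, s, m, m, m, m]]  # merge label on 'Y'

# Default (True): a new word is forced after the whitespace.
print(text.SplitMergeFromLogitsTokenizer().tokenize(strings, logits))
# <tf.RaggedTensor [[b'New', b'York']]>

# False: 'Y' carries a merge label, so the two words join.
tokenizer = text.SplitMergeFromLogitsTokenizer(
    force_split_at_break_character=False)
print(tokenizer.tokenize(strings, logits))
# <tf.RaggedTensor [[b'NewYork']]>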
Methods
split
split(
    input
)
Alias for Tokenizer.tokenize.
split_with_offsets
split_with_offsets(
    input
)
Alias for TokenizerWithOffsets.tokenize_with_offsets.
tokenize
tokenize(
    strings, logits
)
Tokenizes a tensor of UTF-8 strings according to logits.
The logits refer to the split / merge action we should take for each
character. For more info, see the doc for the logits argument below.
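As a plain illustration of that decision rule (a sketch of the described behavior, not the library internals), each character's action is the argmax over its (split, merge) logit pair; the pairs below are taken from the example that follows:

import numpy as np

pairs = np.array([[5.0, -3.2],   # 'I': split logit wins
                  [0.2, 12.0]])  # 'o': merge logit wins
actions = pairs.argmax(axis=-1)  # 0 = split, 1 = merge
print(actions)                   # [0 1]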
Example:
strings = ['IloveFlume!', 'and tensorflow']
logits = [
    [
        # 'I'
        [5.0, -3.2],  # I: split
        # 'love'
        [2.2, -1.0],  # l: split
        [0.2, 12.0],  # o: merge
        [0.0, 11.0],  # v: merge
        [-3.0, 3.0],  # e: merge
        # 'Flume'
        [10.0, 0.0],  # F: split
        [0.0, 11.0],  # l: merge
        [0.0, 11.0],  # u: merge
        [0.0, 12.0],  # m: merge
        [0.0, 12.0],  # e: merge
        # '!'
        [5.2, -7.0],  # !: split
        # padding:
        [1.0, 0.0], [1.0, 1.0], [1.0, 0.0],
    ], [
        # 'and'
        [2.0, 0.7],  # a: split
        [0.2, 1.5],  # n: merge
        [0.5, 2.3],  # d: merge
        # ' '
        [1.7, 7.0],  # <space>: merge
        # 'tensorflow'
        [2.2, 0.1],  # t: split
        [0.2, 3.1],  # e: merge
        [1.1, 2.5],  # n: merge
        [0.7, 0.9],  # s: merge
        [0.6, 1.0],  # o: merge
        [0.3, 1.0],  # r: merge
        [0.2, 2.2],  # f: merge
        [0.7, 3.1],  # l: merge
        [0.4, 5.0],  # o: merge
        [0.8, 6.0],  # w: merge
    ]]
tokenizer = SplitMergeFromLogitsTokenizer()
tokenizer.tokenize(strings, logits)
<tf.RaggedTensor [[b'I', b'love', b'Flume', b'!'], [b'and', b'tensorflow']]>
Args

strings: A 1D Tensor of UTF-8 strings.

logits: A 3D Tensor; logits[i,j,0] is the logit for the split action for the j-th character of strings[i], and logits[i,j,1] is the logit for the merge action for that same character. For each character, we pick the action with the greater logit. Split starts a new word at this character and merge adds this character to the previous word. The shape of this tensor should be (n, m, 2), where n is the number of strings and m is greater than or equal to the number of characters in each strings[i]. As the elements of the strings tensor may have different lengths (in UTF-8 characters), padding may be required to get a dense tensor; for each row, the extra (padding) pairs of logits are ignored. One way to build such a padded tensor is sketched after this method's Raises section.
Returns

A RaggedTensor of strings where tokens[i, k] is the string content of the k-th token in strings[i].
Raises

InvalidArgumentError: If one of the input Tensors has the wrong shape, e.g., if the logits tensor does not have enough elements for one of the strings.
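When the input strings differ in length, one way to build the dense (n, m, 2) logits tensor is to pad per-string logit lists via a ragged tensor; a minimal sketch with made-up logit values:

import tensorflow as tf

# Hypothetical per-character (split, merge) logits for strings of length 2 and 3.
ragged_logits = tf.ragged.constant([
    [[5.0, -3.2], [0.2, 12.0]],
    [[2.0, 0.7], [0.2, 1.5], [0.5, 2.3]],
], ragged_rank=1)

# Pad to a dense (2, 3, 2) tensor; per the docs, the extra padding pairs are ignored.
dense_logits = ragged_logits.to_tensor()
print(dense_logits.shape)  # (2, 3, 2)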
tokenize_with_offsets
tokenize_with_offsets(
    strings, logits
)
Tokenizes a tensor of UTF-8 strings into tokens with [start,end) offsets.
Example:
strings = ['IloveFlume!', 'and tensorflow']
logits = [
    [
        # 'I'
        [5.0, -3.2],  # I: split
        # 'love'
        [2.2, -1.0],  # l: split
        [0.2, 12.0],  # o: merge
        [0.0, 11.0],  # v: merge
        [-3.0, 3.0],  # e: merge
        # 'Flume'
        [10.0, 0.0],  # F: split
        [0.0, 11.0],  # l: merge
        [0.0, 11.0],  # u: merge
        [0.0, 12.0],  # m: merge
        [0.0, 12.0],  # e: merge
        # '!'
        [5.2, -7.0],  # !: split
        # padding:
        [1.0, 0.0], [1.0, 1.0], [1.0, 0.0],
    ], [
        # 'and'
        [2.0, 0.7],  # a: split
        [0.2, 1.5],  # n: merge
        [0.5, 2.3],  # d: merge
        # ' '
        [1.7, 7.0],  # <space>: merge
        # 'tensorflow'
        [2.2, 0.1],  # t: split
        [0.2, 3.1],  # e: merge
        [1.1, 2.5],  # n: merge
        [0.7, 0.9],  # s: merge
        [0.6, 1.0],  # o: merge
        [0.3, 1.0],  # r: merge
        [0.2, 2.2],  # f: merge
        [0.7, 3.1],  # l: merge
        [0.4, 5.0],  # o: merge
        [0.8, 6.0],  # w: merge
    ]]
tokenizer = SplitMergeFromLogitsTokenizer()
tokens, starts, ends = tokenizer.tokenize_with_offsets(strings, logits)
tokens
<tf.RaggedTensor [[b'I', b'love', b'Flume', b'!'], [b'and', b'tensorflow']]>
starts
<tf.RaggedTensor [[0, 1, 5, 10], [0, 4]]>
ends
<tf.RaggedTensor [[1, 5, 10, 11], [3, 14]]>
Args

strings: A 1D Tensor of UTF-8 strings.

logits: A 3D Tensor; logits[i,j,0] is the logit for the split action for the j-th character of strings[i], and logits[i,j,1] is the logit for the merge action for that same character. For each character, we pick the action with the greater logit. Split starts a new word at this character and merge adds this character to the previous word. The shape of this tensor should be (n, m, 2), where n is the number of strings and m is greater than or equal to the number of characters in each strings[i]. As the elements of the strings tensor may have different lengths (in UTF-8 characters), padding may be required to get a dense tensor; for each row, the extra (padding) pairs of logits are ignored.
Returns

A tuple (tokens, start_offsets, end_offsets) where:

tokens is a RaggedTensor of strings where tokens[i, k] is the string content of the k-th token in strings[i].

start_offsets is a RaggedTensor of int64s where start_offsets[i, k] is the byte offset for the start of the k-th token in strings[i].

end_offsets is a RaggedTensor of int64s where end_offsets[i, k] is the byte offset immediately after the end of the k-th token in strings[i].
Raises

InvalidArgumentError: If one of the input Tensors has the wrong shape, e.g., if the logits tensor does not have enough elements for one of the strings.
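Since the returned offsets are byte positions into the original string, a token can be recovered by byte-slicing; a small sketch using the start/end offsets of the second token in the second string from the example above:

s = 'and tensorflow'.encode('utf-8')
start, end = 4, 14   # starts[1][1] and ends[1][1] from the example above
print(s[start:end])  # b'tensorflow'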