text.SplitMergeFromLogitsTokenizer

Tokenizes a tensor of UTF-8 string into words according to logits.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter

text.SplitMergeFromLogitsTokenizer(
    force_split_at_break_character=True
)

Used in the notebooks

Used in the guide
Tokenizing with TF Text

Args
`force_split_at_break_character`	a bool that indicates whether to force start a new word after an ICU-defined whitespace character. Regardless of this parameter, we never include a whitespace into a token, and we always ignore the split/merge action for the whitespace character itself. This parameter indicates what happens after a whitespace. if force_split_at_break_character is true, create a new word starting at the first non-space character, regardless of the 0/1 label for that character, for instance: `s = [2.0, 1.0] # sample pair of logits indicating a split action m = [1.0, 3.0] # sample pair of logits indicating a merge action strings=["New York"] logits=[[s, m, m, s, m, m, m, m]] output tokens=[["New", "York"]] strings=["New York"] logits=[[s, m, m, m, m, m, m, m]] output tokens=[["New", "York"]] strings=["New York"], logits=[[s, m, m, m, s, m, m, m]] output tokens=[["New", "York"]]` otherwise, create a new word / continue the current one depending on the action for the first non-whitespace character. `s = [2.0, 1.0] # sample pair of logits indicating a split action m = [1.0, 3.0] # sample pair of logits indicating a merge action strings=["New York"], logits=[[s, m, m, s, m, m, m, m]] output tokens=[["NewYork"]] strings=["New York"], logits=[[s, m, m, m, m, m, m, m]] output tokens=[["NewYork"]] strings=["New York"], logits=[[s, m, m, m, s, m, m, m]] output tokens=[["New", "York"]]`

Args

force_split_at_break_character

a bool that indicates whether to force start a new word after an ICU-defined whitespace character. Regardless of this parameter, we never include a whitespace into a token, and we always ignore the split/merge action for the whitespace character itself. This parameter indicates what happens after a whitespace.

if force_split_at_break_character is true, create a new word starting at the first non-space character, regardless of the 0/1 label for that character, for instance:

s = [2.0, 1.0]  # sample pair of logits indicating a split action
m = [1.0, 3.0]  # sample pair of logits indicating a merge action

strings=["New York"]
logits=[[s, m, m, s, m, m, m, m]]
output tokens=[["New", "York"]]

strings=["New York"]
logits=[[s, m, m, m, m, m, m, m]]
output tokens=[["New", "York"]]

strings=["New York"],
logits=[[s, m, m, m, s, m, m, m]]
output tokens=[["New", "York"]]

otherwise, create a new word / continue the current one depending on the action for the first non-whitespace character.

s = [2.0, 1.0]  # sample pair of logits indicating a split action
m = [1.0, 3.0]  # sample pair of logits indicating a merge action

strings=["New York"],
logits=[[s, m, m, s, m, m, m, m]]
output tokens=[["NewYork"]]

strings=["New York"],
logits=[[s, m, m, m, m, m, m, m]]
output tokens=[["NewYork"]]

strings=["New York"],
logits=[[s, m, m, m, s, m, m, m]]
output tokens=[["New", "York"]]

Methods

`split`

View source

split(
    input
)

Alias for Tokenizer.tokenize.

`split_with_offsets`

View source

split_with_offsets(
    input
)

Alias for TokenizerWithOffsets.tokenize_with_offsets.

`tokenize`

View source

tokenize(
    strings, logits
)

Tokenizes a tensor of UTF-8 strings according to logits.

The logits refer to the split / merge action we should take for each character. For more info, see the doc for the logits argument below.

Example:

strings = ['IloveFlume!', 'and tensorflow']
logits = [
[
    # 'I'
    [5.0, -3.2],  # I: split
    # 'love'
    [2.2, -1.0],  # l: split
    [0.2, 12.0],  # o: merge
    [0.0, 11.0],  # v: merge
    [-3.0, 3.0],  # e: merge
    # 'Flume'
    [10.0, 0.0],  # F: split
    [0.0, 11.0],  # l: merge
    [0.0, 11.0],  # u: merge
    [0.0, 12.0],  # m: merge
    [0.0, 12.0],  # e: merge
    # '!'
    [5.2, -7.0],  # !: split
    # padding:
    [1.0, 0.0], [1.0, 1.0], [1.0, 0.0],
], [
    # 'and'
    [2.0, 0.7],  # a: split
    [0.2, 1.5],  # n: merge
    [0.5, 2.3],  # d: merge
    # ' '
    [1.7, 7.0],  # <space>: merge
    # 'tensorflow'
    [2.2, 0.1],  # t: split
    [0.2, 3.1],  # e: merge
    [1.1, 2.5],  # n: merge
    [0.7, 0.9],  # s: merge
    [0.6, 1.0],  # o: merge
    [0.3, 1.0],  # r: merge
    [0.2, 2.2],  # f: merge
    [0.7, 3.1],  # l: merge
    [0.4, 5.0],  # o: merge
    [0.8, 6.0],  # w: merge
]]
tokenizer = SplitMergeFromLogitsTokenizer()
tokenizer.tokenize(strings, logits)
<tf.RaggedTensor [[b'I', b'love', b'Flume', b'!'], [b'and', b'tensorflow']]>

Args

strings a 1D Tensor of UTF-8 strings.

logits 3D Tensor; logits[i,j,0] is the logit for the split action for j-th character of strings[i]. logits[i,j,1] is the logit for the merge action for that same character. For each character, we pick the action with the greatest logit. Split starts a new word at this character and merge adds this character to the previous word. The shape of this tensor should be (n, m, 2) where n is the number of strings, and m is greater or equal with the number of characters from each strings[i]. As the elements of the strings tensor may have different lengths (in UTF-8 chars), padding may be required to get a dense vector; for each row, the extra (padding) pairs of logits are ignored.

Args
`strings`	a 1D `Tensor` of UTF-8 strings.
`logits`	3D Tensor; logits[i,j,0] is the logit for the split action for j-th character of strings[i]. logits[i,j,1] is the logit for the merge action for that same character. For each character, we pick the action with the greatest logit. Split starts a new word at this character and merge adds this character to the previous word. The shape of this tensor should be (n, m, 2) where n is the number of strings, and m is greater or equal with the number of characters from each strings[i]. As the elements of the strings tensor may have different lengths (in UTF-8 chars), padding may be required to get a dense vector; for each row, the extra (padding) pairs of logits are ignored.

Returns
A `RaggedTensor` of strings where `tokens[i, k]` is the string content of the `k-th` token in `strings[i]`

Raises
`InvalidArgumentError`	if one of the input Tensors has the wrong shape. E.g., if the logits tensor does not have enough elements for one of the strings.

`tokenize_with_offsets`

View source

tokenize_with_offsets(
    strings, logits
)

Tokenizes a tensor of UTF-8 strings into tokens with [start,end) offsets.

Example:

strings = ['IloveFlume!', 'and tensorflow']
logits = [
[
    # 'I'
    [5.0, -3.2],  # I: split
    # 'love'
    [2.2, -1.0],  # l: split
    [0.2, 12.0],  # o: merge
    [0.0, 11.0],  # v: merge
    [-3.0, 3.0],  # e: merge
    # 'Flume'
    [10.0, 0.0],  # F: split
    [0.0, 11.0],  # l: merge
    [0.0, 11.0],  # u: merge
    [0.0, 12.0],  # m: merge
    [0.0, 12.0],  # e: merge
    # '!'
    [5.2, -7.0],  # !: split
    # padding:
    [1.0, 0.0], [1.0, 1.0], [1.0, 0.0],
], [
    # 'and'
    [2.0, 0.7],  # a: split
    [0.2, 1.5],  # n: merge
    [0.5, 2.3],  # d: merge
    # ' '
    [1.7, 7.0],  # <space>: merge
    # 'tensorflow'
    [2.2, 0.1],  # t: split
    [0.2, 3.1],  # e: merge
    [1.1, 2.5],  # n: merge
    [0.7, 0.9],  # s: merge
    [0.6, 1.0],  # o: merge
    [0.3, 1.0],  # r: merge
    [0.2, 2.2],  # f: merge
    [0.7, 3.1],  # l: merge
    [0.4, 5.0],  # o: merge
    [0.8, 6.0],  # w: merge
]]
tokenizer = SplitMergeFromLogitsTokenizer()
tokens, starts, ends = tokenizer.tokenize_with_offsets(strings, logits)
tokens
<tf.RaggedTensor [[b'I', b'love', b'Flume', b'!'], [b'and', b'tensorflow']]>
starts
<tf.RaggedTensor [[0, 1, 5, 10], [0, 4]]>
ends
<tf.RaggedTensor [[1, 5, 10, 11], [3, 14]]>

Args

strings A 1D Tensor of UTF-8 strings.

Args
`strings`	A 1D `Tensor` of UTF-8 strings.
`logits`	3D Tensor; logits[i,j,0] is the logit for the split action for j-th character of strings[i]. logits[i,j,1] is the logit for the merge action for that same character. For each character, we pick the action with the greatest logit. Split starts a new word at this character and merge adds this character to the previous word. The shape of this tensor should be (n, m, 2) where n is the number of strings, and m is greater or equal with the number of characters from each strings[i]. As the elements of the strings tensor may have different lengths (in UTF-8 chars), padding may be required to get a dense vector; for each row, the extra (padding) pairs of logits are ignored.

Returns

Returns
A tuple `(tokens, start_offsets, end_offsets)` where: `tokens` is a `RaggedTensor` of strings where `tokens[i, k]` is the string content of the `k-th` token in `strings[i]` `start_offsets` is a `RaggedTensor` of int64s where `start_offsets[i, k]` is the byte offset for the start of the `k-th` token in `strings[i]`. `end_offsets` is a `RaggedTensor` of int64s where `end_offsets[i, k]` is the byte offset immediately after the end of the `k-th` token in `strings[i]`.

A tuple (tokens, start_offsets, end_offsets) where:

tokens is a RaggedTensor of strings where tokens[i, k] is the string content of the k-th token in strings[i]
start_offsets is a RaggedTensor of int64s where start_offsets[i, k] is the byte offset for the start of the k-th token in strings[i].
end_offsets is a RaggedTensor of int64s where end_offsets[i, k] is the byte offset immediately after the end of the k-th token in strings[i].

Raises
`InvalidArgumentError`	if one of the input Tensors has the wrong shape. E.g., if the tensor logits does not have enough elements for one of the strings.

text.SplitMergeFromLogitsTokenizer Stay organized with collections Save and categorize content based on your preferences.

Used in the notebooks

Args

Methods

split

split_with_offsets

tokenize

Example:

tokenize_with_offsets

Example:

text.SplitMergeFromLogitsTokenizer

`split`

`split_with_offsets`

`tokenize`

`tokenize_with_offsets`