text.SplitMergeFromLogitsTokenizer

Tokenizes a tensor of UTF-8 strings into words according to logits.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter

Args
force_split_at_break_character a bool that indicates whether to force the start of a new word after an ICU-defined whitespace character. Regardless of this parameter, whitespace is never included in a token, and the split/merge action for the whitespace character itself is always ignored. This parameter only determines what happens after a whitespace character.

  • if force_split_at_break_character is true, create a new word starting at the first non-whitespace character, regardless of the split/merge action for that character, for instance:

    s = [2.0, 1.0]  # sample pair of logits indicating a split action
    m = [1.0, 3.0]  # sample pair of logits indicating a merge action
    
    strings=["New York"]
    logits=[[s, m, m, s, m, m, m, m]]
    output tokens=[["New", "York"]]
    
    strings=["New York"]
    logits=[[s, m, m, m, m, m, m, m]]
    output tokens=[["New", "York"]]
    
    strings=["New York"]
    logits=[[s, m, m, m, s, m, m, m]]
    output tokens=[["New", "York"]]
    
  • otherwise, start a new word or continue the current one depending on the split/merge action for the first non-whitespace character, for instance:

    s = [2.0, 1.0]  # sample pair of logits indicating a split action
    m = [1.0, 3.0]  # sample pair of logits indicating a merge action
    
    strings=["New York"]
    logits=[[s, m, m, s, m, m, m, m]]
    output tokens=[["NewYork"]]
    
    strings=["New York"]
    logits=[[s, m, m, m, m, m, m, m]]
    output tokens=[["NewYork"]]
    
    strings=["New York"]
    logits=[[s, m, m, m, s, m, m, m]]
    output tokens=[["New", "York"]]
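
The two cases above can be reproduced with a small pure-Python sketch. This is not the library's implementation: the function name split_merge_sketch is hypothetical, str.isspace() stands in for the ICU whitespace test, and ties between the two logits are resolved as merge, which the description above does not specify.

```python
def split_merge_sketch(text, logits, force_split_at_break_character=True):
    """Illustrative decoding of split/merge logits for a single string."""
    tokens = []
    current = []         # characters of the word being built
    after_break = False  # True if whitespace was seen since the last kept character
    # zip stops at the shorter input, so trailing padding logits are ignored.
    for ch, (split_logit, merge_logit) in zip(text, logits):
        if ch.isspace():  # assumption: approximates the ICU whitespace check
            after_break = True  # the action for the whitespace itself is ignored
            continue
        # Assumption: a tie counts as merge.
        is_split = split_logit > merge_logit
        forced = after_break and force_split_at_break_character
        if not current or is_split or forced:
            if current:
                tokens.append("".join(current))
            current = [ch]
        else:
            current.append(ch)
        after_break = False
    if current:
        tokens.append("".join(current))
    return tokens


s = [2.0, 1.0]  # split
m = [1.0, 3.0]  # merge
logits = [s, m, m, s, m, m, m, m]
print(split_merge_sketch("New York", logits, True))   # ['New', 'York']
print(split_merge_sketch("New York", logits, False))  # ['NewYork']
```

With the flag off, the merge action on "Y" joins it to "New" and the space is dropped, which is why the result is a single token.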
    

Methods

split

Alias for Tokenizer.tokenize.

split_with_offsets

Alias for TokenizerWithOffsets.tokenize_with_offsets.

tokenize

Tokenizes a tensor of UTF-8 strings according to logits.

The logits refer to the split / merge action we should take for each character. For more info, see the doc for the logits argument below.

Example:

from tensorflow_text import SplitMergeFromLogitsTokenizer

strings = ['IloveFlume!', 'and tensorflow']
logits = [
[
    # 'I'
    [5.0, -3.2],  # I: split
    # 'love'
    [2.2, -1.0],  # l: split
    [0.2, 12.0],  # o: merge
    [0.0, 11.0],  # v: merge
    [-3.0, 3.0],  # e: merge
    # 'Flume'
    [10.0, 0.0],  # F: split
    [0.0, 11.0],  # l: merge
    [0.0, 11.0],  # u: merge
    [0.0, 12.0],  # m: merge
    [0.0, 12.0],  # e: merge
    # '!'
    [5.2, -7.0],  # !: split
    # padding:
    [1.0, 0.0], [1.0, 1.0], [1.0, 0.0],
], [
    # 'and'
    [2.0, 0.7],  # a: split
    [0.2, 1.5],  # n: merge
    [0.5, 2.3],  # d: merge
    # ' '
    [1.7, 7.0],  # <space>: merge
    # 'tensorflow'
    [2.2, 0.1],  # t: split
    [0.2, 3.1],  # e: merge
    [1.1, 2.5],  # n: merge
    [0.7, 0.9],  # s: merge
    [0.6, 1.0],  # o: merge
    [0.3, 1.0],  # r: merge
    [0.2, 2.2],  # f: merge
    [0.7, 3.1],  # l: merge
    [0.4, 5.0],  # o: merge
    [0.8, 6.0],  # w: merge
]]
tokenizer = SplitMergeFromLogitsTokenizer()
tokenizer.tokenize(strings, logits)
<tf.RaggedTensor [[b'I', b'love', b'Flume', b'!'], [b'and', b'tensorflow']]>

Args
strings A 1D Tensor of UTF-8 strings.
logits 3D Tensor; logits[i,j,0] is the logit for the split action for the j-th character of strings[i]. logits[i,j,1] is the logit for the merge action for that same character. For each character, we pick the action with the greatest logit. Split starts a new word at this character and merge adds this character to the previous word. The shape of this tensor should be (n, m, 2) where n is the number of strings, and m is greater than or equal to the number of characters in every strings[i]. As the elements of the strings tensor may have different lengths (in UTF-8 characters), padding may be required to get a dense tensor; for each row, the extra (padding) pairs of logits are ignored.

Returns
A RaggedTensor of strings where tokens[i, k] is the string content of the k-th token in strings[i].

Raises
InvalidArgumentError if one of the input Tensors has the wrong shape. E.g., if the logits tensor does not have enough elements for one of the strings.
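
Because logits must be a dense (n, m, 2) tensor, per-string lists of logit pairs usually need padding before the call. The following is a minimal sketch of that step in plain Python; the helper name pad_logits and the pad value (0.0, 0.0) are assumptions for illustration. Any pad value should work, since padding pairs are ignored.

```python
def pad_logits(per_string_logits, pad_pair=(0.0, 0.0)):
    """Pad each row of [split, merge] pairs to the length of the longest row."""
    max_len = max((len(row) for row in per_string_logits), default=0)
    padded = []
    for row in per_string_logits:
        # Copy the real pairs, then append a fresh pad pair for each missing slot.
        new_row = [list(pair) for pair in row]
        new_row += [list(pad_pair) for _ in range(max_len - len(row))]
        padded.append(new_row)
    return padded


ragged = [
    [[5.0, -3.2]],              # logits for a 1-character string
    [[2.0, 0.7], [0.2, 1.5]],   # logits for a 2-character string
]
dense = pad_logits(ragged)
print(dense)  # [[[5.0, -3.2], [0.0, 0.0]], [[2.0, 0.7], [0.2, 1.5]]]
```

The result has shape (2, 2, 2) and could be passed as the logits argument alongside the two corresponding strings.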

tokenize_with_offsets

Tokenizes a tensor of UTF-8 strings into tokens with [start,end) offsets.

Example:

from tensorflow_text import SplitMergeFromLogitsTokenizer

strings = ['IloveFlume!', 'and tensorflow']
logits = [
[
    # 'I'
    [5.0, -3.2],  # I: split
    # 'love'
    [2.2, -1.0],  # l: split
    [0.2, 12.0],  # o: merge
    [0.0, 11.0],  # v: merge
    [-3.0, 3.0],  # e: merge
    # 'Flume'
    [10.0, 0.0],  # F: split
    [0.0, 11.0],  # l: merge
    [0.0, 11.0],  # u: merge
    [0.0, 12.0],  # m: merge
    [0.0, 12.0],  # e: merge
    # '!'
    [5.2, -7.0],  # !: split
    # padding:
    [1.0, 0.0], [1.0, 1.0], [1.0, 0.0],
], [
    # 'and'
    [2.0, 0.7],  # a: split
    [0.2, 1.5],  # n: merge
    [0.5, 2.3],  # d: merge
    # ' '
    [1.7, 7.0],  # <space>: merge
    # 'tensorflow'
    [2.2, 0.1],  # t: split
    [0.2, 3.1],  # e: merge
    [1.1, 2.5],  # n: merge
    [0.7, 0.9],  # s: merge
    [0.6, 1.0],  # o: merge
    [0.3, 1.0],  # r: merge
    [0.2, 2.2],  # f: merge
    [0.7, 3.1],  # l: merge
    [0.4, 5.0],  # o: merge
    [0.8, 6.0],  # w: merge
]]
tokenizer = SplitMergeFromLogitsTokenizer()
tokens, starts, ends = tokenizer.tokenize_with_offsets(strings, logits)
tokens
<tf.RaggedTensor [[b'I', b'love', b'Flume', b'!'], [b'and', b'tensorflow']]>
starts
<tf.RaggedTensor [[0, 1, 5, 10], [0, 4]]>
ends
<tf.RaggedTensor [[1, 5, 10, 11], [3, 14]]>

Args
strings A 1D Tensor of UTF-8 strings.
logits 3D Tensor; logits[i,j,0] is the logit for the split action for the j-th character of strings[i]. logits[i,j,1] is the logit for the merge action for that same character. For each character, we pick the action with the greatest logit. Split starts a new word at this character and merge adds this character to the previous word. The shape of this tensor should be (n, m, 2) where n is the number of strings, and m is greater than or equal to the number of characters in every strings[i]. As the elements of the strings tensor may have different lengths (in UTF-8 characters), padding may be required to get a dense tensor; for each row, the extra (padding) pairs of logits are ignored.

Returns
A tuple (tokens, start_offsets, end_offsets) where:

  • tokens is a RaggedTensor of strings where tokens[i, k] is the string content of the k-th token in strings[i].
  • start_offsets is a RaggedTensor of int64s where start_offsets[i, k] is the byte offset for the start of the k-th token in strings[i].
  • end_offsets is a RaggedTensor of int64s where end_offsets[i, k] is the byte offset immediately after the end of the k-th token in strings[i].

Raises
InvalidArgumentError if one of the input Tensors has the wrong shape. E.g., if the logits tensor does not have enough elements for one of the strings.
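
Since the offsets are byte offsets with an exclusive end, they index into the UTF-8 encoding of each string rather than into its characters. The sketch below, using only plain Python, shows how the offsets from the example above recover the tokens; the helper names tokens_from_offsets and char_to_byte_offset are hypothetical.

```python
def tokens_from_offsets(text, starts, ends):
    """Slice the UTF-8 bytes of text with [start, end) byte offsets."""
    data = text.encode("utf-8")
    return [data[start:end] for start, end in zip(starts, ends)]


def char_to_byte_offset(text, char_index):
    """Byte offset of the character at char_index in the UTF-8 encoding."""
    return len(text[:char_index].encode("utf-8"))


# Offsets copied from the example output above.
print(tokens_from_offsets("IloveFlume!", [0, 1, 5, 10], [1, 5, 10, 11]))
# [b'I', b'love', b'Flume', b'!']
print(tokens_from_offsets("and tensorflow", [0, 4], [3, 14]))
# [b'and', b'tensorflow']

# For non-ASCII text, byte and character offsets diverge:
print(char_to_byte_offset("héllo", 2))  # 3, because 'é' occupies two bytes
```

For ASCII-only input the two kinds of offset coincide, which is why the example offsets above look like character positions.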