text.SplitMergeTokenizer

Tokenizes a tensor of UTF-8 strings into words according to labels.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter

Methods

split

Alias for Tokenizer.tokenize.

split_with_offsets

Alias for TokenizerWithOffsets.tokenize_with_offsets.

tokenize

Tokenizes a tensor of UTF-8 strings according to labels.

Example:

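import tensorflow_text as text

strings = ["HelloMonday", "DearFriday"]
# One label per character: 0 starts a new word, 1 merges into the previous word.
# "DearFriday" has 10 characters, so the 11th label in its row matches no character.
labels = [[0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
          [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]]
tokenizer = text.SplitMergeTokenizer()
tokenizer.tokenize(strings, labels)
# <tf.RaggedTensor [[b'Hello', b'Monday'], [b'Dear', b'Friday']]>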

Args
input An N-dimensional Tensor or RaggedTensor of UTF-8 strings.
labels An (N+1)-dimensional Tensor or RaggedTensor of int32, with labels[i1...iN, j] being the split (0) / merge (1) label of the j-th character of input[i1...iN]. Here, split means starting a new word with this character, and merge means appending this character to the previous word.
force_split_at_break_character A bool indicating whether to force the start of a new word after an ICU-defined whitespace character. After one or more ICU-defined whitespace characters:

  • if force_split_at_break_character is True, a new word starts at the first non-space character, regardless of the label of that character. For instance:

    input="New York"
    labels=[0, 1, 1, 0, 1, 1, 1, 1]
    output tokens=["New", "York"]
    
    input="New York"
    labels=[0, 1, 1, 1, 1, 1, 1, 1]
    output tokens=["New", "York"]
    
    input="New York"
    labels=[0, 1, 1, 1, 0, 1, 1, 1]
    output tokens=["New", "York"]
    
  • otherwise, whether the first non-space character starts a new word depends on the label of that character. For instance:

    input="New York"
    labels=[0, 1, 1, 0, 1, 1, 1, 1]
    output tokens=["NewYork"]
    
    input="New York"
    labels=[0, 1, 1, 1, 1, 1, 1, 1]
    output tokens=["NewYork"]
    
    input="New York"
    labels=[0, 1, 1, 1, 0, 1, 1, 1]
    output tokens=["New", "York"]
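
The split/merge rules above can be illustrated with a small pure-Python sketch. This is not the library's implementation: the helper name split_merge is hypothetical, str.isspace() stands in for the ICU whitespace definition, and treating the very first character as the start of a word regardless of its label is an assumption. Whitespace is dropped from the tokens, as in the "NewYork" examples.

```python
def split_merge(text, labels, force_split_at_break_character):
    """Hypothetical reference sketch of split (0) / merge (1) tokenization."""
    tokens = []
    after_space = False  # True once whitespace has been seen since the last word character
    # zip stops at the shorter sequence, so surplus trailing labels are ignored.
    for ch, label in zip(text, labels):
        if ch.isspace():  # approximation of "ICU-defined whitespace"
            after_space = True
            continue
        start_new_word = (
            not tokens  # assumption: the first character always opens a word
            or label == 0
            or (after_space and force_split_at_break_character)
        )
        if start_new_word:
            tokens.append(ch)
        else:
            tokens[-1] += ch
        after_space = False
    return tokens


# The space is labeled 0 and "Y" is labeled 1; only the flag decides the outcome.
labels = [0, 1, 1, 0, 1, 1, 1, 1]
print(split_merge("New York", labels, force_split_at_break_character=True))   # ['New', 'York']
print(split_merge("New York", labels, force_split_at_break_character=False))  # ['NewYork']
```

In this sketch the label attached to a whitespace character never matters; only the label of the first non-space character after it does, and only when the flag is False.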
    

Returns
A RaggedTensor of strings where tokens[i1...iN, j] is the string content of the j-th token in input[i1...iN].

tokenize_with_offsets

Tokenizes a tensor of UTF-8 strings into tokens with [start,end) offsets.

Example:

import tensorflow_text as text

strings = ["HelloMonday", "DearFriday"]
labels = [[0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
          [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]]
tokenizer = text.SplitMergeTokenizer()
tokens, starts, ends = tokenizer.tokenize_with_offsets(strings, labels)
tokens
# <tf.RaggedTensor [[b'Hello', b'Monday'], [b'Dear', b'Friday']]>
starts  # byte offset where each token begins
# <tf.RaggedTensor [[0, 5], [0, 4]]>
ends    # byte offset just past the end of each token
# <tf.RaggedTensor [[5, 11], [4, 10]]>

Args
input An N-dimensional Tensor or RaggedTensor of UTF-8 strings.
labels An (N+1)-dimensional Tensor or RaggedTensor of int32, with labels[i1...iN, j] being the split (0) / merge (1) label of the j-th character of input[i1...iN]. Here, split means starting a new word with this character, and merge means appending this character to the previous word.
force_split_at_break_character A bool indicating whether to force the start of a new word after an ICU-defined whitespace character. After one or more ICU-defined whitespace characters:

  • if force_split_at_break_character is True, a new word starts at the first non-space character, regardless of the label of that character. For instance:

    input="New York"
    labels=[0, 1, 1, 0, 1, 1, 1, 1]
    output tokens=["New", "York"]
    
    input="New York"
    labels=[0, 1, 1, 1, 1, 1, 1, 1]
    output tokens=["New", "York"]
    
    input="New York"
    labels=[0, 1, 1, 1, 0, 1, 1, 1]
    output tokens=["New", "York"]
    
  • otherwise, whether the first non-space character starts a new word depends on the label of that character. For instance:

    input="New York"
    labels=[0, 1, 1, 0, 1, 1, 1, 1]
    output tokens=["NewYork"]
    
    input="New York"
    labels=[0, 1, 1, 1, 1, 1, 1, 1]
    output tokens=["NewYork"]
    
    input="New York"
    labels=[0, 1, 1, 1, 0, 1, 1, 1]
    output tokens=["New", "York"]
    

Returns
A tuple (tokens, start_offsets, end_offsets) where:
tokens is a RaggedTensor of strings where tokens[i1...iN, j] is the string content of the j-th token in input[i1...iN].
start_offsets is a RaggedTensor of int64s where start_offsets[i1...iN, j] is the byte offset for the start of the j-th token in input[i1...iN].
end_offsets is a RaggedTensor of int64s where end_offsets[i1...iN, j] is the byte offset immediately after the end of the j-th token in input[i1...iN].
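
Labels are given per character while the returned offsets are in bytes, so the two only coincide for ASCII input. The pure-Python sketch below shows the difference. The helper byte_offsets is hypothetical and not part of the library, and the non-ASCII result illustrates what a UTF-8 byte offset means rather than reproducing verified library output.

```python
def byte_offsets(text, char_spans):
    """Convert [start, end) character spans into [start, end) UTF-8 byte spans."""
    # prefix[i] is the number of UTF-8 bytes taken by the first i characters.
    prefix = [0]
    for ch in text:
        prefix.append(prefix[-1] + len(ch.encode("utf-8")))
    return [(prefix[start], prefix[end]) for start, end in char_spans]


# ASCII input: byte offsets equal character positions.
print(byte_offsets("HelloMonday", [(0, 5), (5, 11)]))  # [(0, 5), (5, 11)]

# U+00E9 (e with acute accent) takes two bytes in UTF-8, so later offsets shift by one.
text = "Caf\u00e9Latte"
spans = byte_offsets(text, [(0, 4), (4, 9)])
print(spans)  # [(0, 5), (5, 10)]

# Slicing the encoded bytes with [start, end) recovers each token.
encoded = text.encode("utf-8")
recovered = [encoded[s:e].decode("utf-8") for s, e in spans]
print(recovered == ["Caf\u00e9", "Latte"])  # True
```

Because the end offset is exclusive, encoded[start:end] is exactly the token's bytes, and the end of one token can equal the start of the next when no whitespace separates them.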