text.SplitMergeTokenizer

Tokenizes a tensor of UTF-8 strings into words according to labels.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter

text.SplitMergeTokenizer()

Used in the guide: Tokenizing with TF Text

Methods

`split`

split(
    input
)

Alias for Tokenizer.tokenize.

`split_with_offsets`

split_with_offsets(
    input
)

Alias for TokenizerWithOffsets.tokenize_with_offsets.

`tokenize`

tokenize(
    input, labels, force_split_at_break_character=True
)

Tokenizes a tensor of UTF-8 strings according to labels.

Example:

import tensorflow_text as text

strings = ["HelloMonday", "DearFriday"]
# One label per character: 0 starts a new word, 1 merges into the previous word.
labels = [[0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
          [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]]
tokenizer = text.SplitMergeTokenizer()
tokenizer.tokenize(strings, labels)
# <tf.RaggedTensor [[b'Hello', b'Monday'], [b'Dear', b'Friday']]>
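
The labels in the example above can be derived mechanically when the word boundaries are already known. The following is a minimal pure-Python sketch of that idea; `labels_from_tokens` is a hypothetical helper written for illustration and is not part of TF Text.

```python
def labels_from_tokens(tokens):
    """Build split(0)/merge(1) labels for the concatenation of `tokens`.

    Hypothetical helper, not a TF Text API. It assumes one label per
    Unicode character, as described for the `labels` argument.
    """
    labels = []
    for token in tokens:
        if not token:
            continue  # An empty token contributes no characters.
        # The first character starts a new word (0); the rest merge (1).
        labels.extend([0] + [1] * (len(token) - 1))
    return labels


print(labels_from_tokens(["Hello", "Monday"]))
# [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
```

The printed list matches the first row of `labels` in the example.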

Args

`input`: An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings.

`labels`: An (N+1)-dimensional `Tensor` or `RaggedTensor` of `int32`, where `labels[i1...iN, j]` is the split (0) or merge (1) label of the j-th character of `input[i1...iN]`. A split label starts a new word at this character; a merge label appends this character to the previous word.

`force_split_at_break_character`: A bool indicating whether to force the start of a new word after an ICU-defined whitespace character. After one or more ICU-defined whitespace characters:

If `force_split_at_break_character` is true, a new word starts at the first non-space character, regardless of that character's label. For instance:

input="New York"
labels=[0, 1, 1, 0, 1, 1, 1, 1]
output tokens=["New", "York"]

input="New York"
labels=[0, 1, 1, 1, 1, 1, 1, 1]
output tokens=["New", "York"]

input="New York"
labels=[0, 1, 1, 1, 0, 1, 1, 1]
output tokens=["New", "York"]

Otherwise, whether a new word starts at the first non-space character depends on that character's label. For instance:

input="New York"
labels=[0, 1, 1, 0, 1, 1, 1, 1]
output tokens=["NewYork"]

input="New York"
labels=[0, 1, 1, 1, 1, 1, 1, 1]
output tokens=["NewYork"]

input="New York"
labels=[0, 1, 1, 1, 0, 1, 1, 1]
output tokens=["New", "York"]
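
The split/merge rules and the effect of `force_split_at_break_character` can be sketched in plain Python. This is an illustrative model of the behavior documented above, not the library's implementation. It makes three assumptions: Python's `str.isspace()` stands in for ICU-defined whitespace, whitespace is dropped from the output (as in the documented `"NewYork"` result), and labels beyond the end of the string are ignored. The name `split_merge` is hypothetical.

```python
def split_merge(text, labels, force_split_at_break_character=True):
    """Pure-Python sketch of the documented split(0)/merge(1) semantics."""
    tokens = []
    after_space = False  # True if whitespace was seen since the last kept character.
    # zip() stops at the shorter input, so surplus labels are ignored (assumption).
    for ch, label in zip(text, labels):
        if ch.isspace():  # Assumption: stand-in for ICU-defined whitespace.
            after_space = True
            continue  # Whitespace never becomes part of a token.
        starts_new_word = (
            not tokens  # Nothing to merge into yet.
            or label == 0  # Explicit split label.
            or (after_space and force_split_at_break_character)
        )
        if starts_new_word:
            tokens.append(ch)
        else:
            tokens[-1] += ch
        after_space = False
    return tokens


print(split_merge("New York", [0, 1, 1, 1, 1, 1, 1, 1]))
# ['New', 'York']
print(split_merge("New York", [0, 1, 1, 1, 1, 1, 1, 1],
                  force_split_at_break_character=False))
# ['NewYork']
```

With the flag set to false, only the label of the first non-space character decides whether `York` becomes its own word.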

Returns
A `RaggedTensor` of strings where `tokens[i1...iN, j]` is the string content of the `j-th` token in `input[i1...iN]`.

`tokenize_with_offsets`

tokenize_with_offsets(
    input, labels, force_split_at_break_character=True
)

Tokenizes a tensor of UTF-8 strings into tokens with [start,end) offsets.

Example:

import tensorflow_text as text

strings = ["HelloMonday", "DearFriday"]
labels = [[0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
          [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]]
tokenizer = text.SplitMergeTokenizer()
tokens, starts, ends = tokenizer.tokenize_with_offsets(strings, labels)
tokens  # <tf.RaggedTensor [[b'Hello', b'Monday'], [b'Dear', b'Friday']]>
starts  # <tf.RaggedTensor [[0, 5], [0, 4]]>
ends    # <tf.RaggedTensor [[5, 11], [4, 10]]>
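
Because the offsets are half-open `[start, end)` byte ranges, slicing the UTF-8 bytes of each input string with them should give back the tokens. The sketch below checks this in plain Python using the values from the example; `slice_by_offsets` is a hypothetical helper, and the offsets are copied in as lists rather than read from a `RaggedTensor`.

```python
strings = ["HelloMonday", "DearFriday"]
starts = [[0, 5], [0, 4]]   # Values copied from the example above.
ends = [[5, 11], [4, 10]]


def slice_by_offsets(string, start_offsets, end_offsets):
    """Recover tokens by slicing the UTF-8 bytes with [start, end) offsets."""
    data = string.encode("utf-8")  # Offsets index bytes, not characters.
    return [data[s:e] for s, e in zip(start_offsets, end_offsets)]


recovered = [slice_by_offsets(s, b, e) for s, b, e in zip(strings, starts, ends)]
print(recovered)
# [[b'Hello', b'Monday'], [b'Dear', b'Friday']]
```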

Args

`input`: An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings.

`labels`: An (N+1)-dimensional `Tensor` or `RaggedTensor` of `int32`, where `labels[i1...iN, j]` is the split (0) or merge (1) label of the j-th character of `input[i1...iN]`. A split label starts a new word at this character; a merge label appends this character to the previous word.

`force_split_at_break_character`: A bool indicating whether to force the start of a new word after an ICU-defined whitespace character. After one or more ICU-defined whitespace characters:

If `force_split_at_break_character` is true, a new word starts at the first non-space character, regardless of that character's label. For instance:

input="New York"
labels=[0, 1, 1, 0, 1, 1, 1, 1]
output tokens=["New", "York"]

input="New York"
labels=[0, 1, 1, 1, 1, 1, 1, 1]
output tokens=["New", "York"]

input="New York"
labels=[0, 1, 1, 1, 0, 1, 1, 1]
output tokens=["New", "York"]

Otherwise, whether a new word starts at the first non-space character depends on that character's label. For instance:

input="New York"
labels=[0, 1, 1, 0, 1, 1, 1, 1]
output tokens=["NewYork"]

input="New York"
labels=[0, 1, 1, 1, 1, 1, 1, 1]
output tokens=["NewYork"]

input="New York"
labels=[0, 1, 1, 1, 0, 1, 1, 1]
output tokens=["New", "York"]

Returns
A tuple `(tokens, start_offsets, end_offsets)` where:
`tokens`	is a `RaggedTensor` of strings where `tokens[i1...iN, j]` is the string content of the `j-th` token in `input[i1...iN]`.
`start_offsets`	is a `RaggedTensor` of int64s where `start_offsets[i1...iN, j]` is the byte offset for the start of the `j-th` token in `input[i1...iN]`.
`end_offsets`	is a `RaggedTensor` of int64s where `end_offsets[i1...iN, j]` is the byte offset immediately after the end of the `j-th` token in `input[i1...iN]`.
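
Labels are given per character while offsets are given in bytes, so the two diverge once the input contains multi-byte UTF-8 characters. The following pure-Python sketch illustrates that distinction under the semantics described above. It is not the library's implementation, it ignores whitespace handling for brevity, and it returns Python `str` tokens rather than byte strings. The name `split_merge_with_offsets` is hypothetical, and the non-ASCII result is what this sketch computes, not captured library output.

```python
def split_merge_with_offsets(text, labels):
    """Return (tokens, start_offsets, end_offsets) with byte-based offsets."""
    tokens, starts, ends = [], [], []
    byte_pos = 0  # Byte offset of the current character in the UTF-8 encoding.
    for ch, label in zip(text, labels):
        width = len(ch.encode("utf-8"))  # 1 to 4 bytes per character.
        if label == 0 or not tokens:
            # Split: open a new token at the current byte position.
            tokens.append(ch)
            starts.append(byte_pos)
            ends.append(byte_pos + width)
        else:
            # Merge: extend the last token and move its end offset forward.
            tokens[-1] += ch
            ends[-1] = byte_pos + width
        byte_pos += width
    return tokens, starts, ends


print(split_merge_with_offsets("HelloMonday", [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]))
# (['Hello', 'Monday'], [0, 5], [5, 11])
print(split_merge_with_offsets("caféau", [0, 1, 1, 1, 0, 1]))
# (['café', 'au'], [0, 5], [5, 7])
```

In the second call, the token `au` begins at character index 4 but at byte offset 5, because `é` occupies two bytes in UTF-8.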