text.boise_tags_to_offsets

Converts token offsets and BOISE tags into span offsets and span types.

The BOISE scheme uses five labels. Every label except O is combined with a span type, as in S-animal:

  • (B)egin: the token begins a span.
  • (O)utside: the token is outside any span.
  • (I)nside: the token is inside a span.
  • (S)ingleton: the token is the entire span.
  • (E)nd: the token ends a span.
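The labeling rule above can be sketched in plain Python. This is only an illustration of the scheme, not part of the library: `tags_for_span` is a hypothetical helper that returns the tags for one span given how many tokens it covers.

```python
def tags_for_span(num_tokens, span_type):
    """Return the BOISE tags for one span covering `num_tokens` tokens.

    Hypothetical helper for illustration; it is not a tensorflow_text API.
    """
    if num_tokens < 1:
        raise ValueError("a span must cover at least one token")
    if num_tokens == 1:
        # A one-token span is a Singleton.
        return [f"S-{span_type}"]
    # Longer spans start with Begin, end with End, and use Inside in between.
    inside = [f"I-{span_type}"] * (num_tokens - 2)
    return [f"B-{span_type}", *inside, f"E-{span_type}"]


print(tags_for_span(1, "animal"))  # ['S-animal']
print(tags_for_span(3, "animal"))  # ['B-animal', 'I-animal', 'E-animal']
```

Tokens that belong to no span receive the bare tag "O", which has no type suffix.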

For example, given the following string and entity:

  content = "Who let the dogs out"
  entity = "dogs"
  tokens = ["Who", "let", "the", "dogs", "out"]
  token_begin_offsets = [0, 4, 8, 12, 17]
  token_end_offsets = [3, 7, 11, 16, 20]
  span_begin_offsets = [12]
  span_end_offsets = [16]
  span_type = ["animal"]

BOISE tags are:

  ["O", "O", "O", "S-animal", "O"]
    |    |    |        |       |
   Who  let  the      dogs    out

When given the token begin/end offsets and BOISE tags for an input text sequence, this function translates them into entity span begin/end offsets and span types.
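To make the translation concrete, here is a minimal plain-Python sketch that decodes a single sequence. It is not the library's implementation: `boise_to_spans` is a hypothetical name, and the handling of malformed sequences (for example an E tag with no preceding B) is an assumption that may differ from what `boise_tags_to_offsets` does.

```python
def boise_to_spans(token_begins, token_ends, tags):
    """Decode one sequence of BOISE tags into (begins, ends, types).

    Hypothetical reference sketch; not the tensorflow_text implementation.
    """
    span_begins, span_ends, span_types = [], [], []
    open_begin = None  # begin offset of the span currently being built
    for begin, end, tag in zip(token_begins, token_ends, tags):
        # "S-animal" -> ("S", "-", "animal"); "O" -> ("O", "", "")
        label, _, span_type = tag.partition("-")
        if label == "S":
            # A singleton token is a complete span by itself.
            span_begins.append(begin)
            span_ends.append(end)
            span_types.append(span_type)
            open_begin = None
        elif label == "B":
            open_begin = begin
        elif label == "E" and open_begin is not None:
            # Close the open span at this token's end offset.
            span_begins.append(open_begin)
            span_ends.append(end)
            span_types.append(span_type)
            open_begin = None
        elif label == "O":
            # Assumption: an unfinished span is dropped when O is seen.
            open_begin = None
        # "I" tokens simply extend the open span.
    return span_begins, span_ends, span_types


print(boise_to_spans(
    [0, 4, 8, 12, 17], [3, 7, 11, 16, 20],
    ["O", "O", "O", "S-animal", "O"]))
# ([12], [16], ['animal'])
```

The span begins at the begin offset of its first token and ends at the end offset of its last token, so a B/I/E run over the tokens at offsets 4 through 16 yields the single span (4, 16).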

Example:

>>> token_begin_offsets = tf.ragged.constant(
...   [[0, 4, 8, 12, 17], [0, 4, 8, 12]])
>>> token_end_offsets = tf.ragged.constant(
...   [[3, 7, 11, 16, 20], [3, 7, 11, 16]])
>>> boise_tags = tf.ragged.constant(
...   [['O', 'B-animal', 'I-animal', 'E-animal', 'O'],
...    ['O', 'O', 'O', 'S-loc']])
>>> (span_begin_offsets, span_end_offsets, span_type) = (
...   tf_text.boise_tags_to_offsets(token_begin_offsets, token_end_offsets,
...     boise_tags))
>>> span_begin_offsets
<tf.RaggedTensor [[4], [12]]>
>>> span_end_offsets
<tf.RaggedTensor [[16], [16]]>
>>> span_type
<tf.RaggedTensor [[b'animal'], [b'loc']]>

Args:

  token_begin_offsets: A RaggedTensor or Tensor of token begin byte offsets, of type int32 or int64.
  token_end_offsets: A RaggedTensor or Tensor of token end byte offsets, of type int32 or int64.
  boise_tags: A RaggedTensor of BOISE tag strings with the same dimensions as the token begin and end offsets.

Returns:

A tuple containing span_begin_offsets, span_end_offsets and span_type:

  • span_begin_offsets: A RaggedTensor or Tensor of span begin byte offsets, of type int32 or int64.
  • span_end_offsets: A RaggedTensor or Tensor of span end byte offsets, of type int32 or int64.
  • span_type: A RaggedTensor or Tensor of span type strings.
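Because the offsets are byte offsets, recovering the span text from the original string means slicing its encoded bytes, not its characters. The sketch below assumes UTF-8 input and exclusive end offsets (consistent with "dogs" spanning 12 to 16 in the example above); `span_texts` is a hypothetical helper, not a library function.

```python
def span_texts(content, span_begins, span_ends):
    """Slice `content` at byte offsets and return the span strings.

    Hypothetical helper; assumes UTF-8 and exclusive end offsets.
    """
    data = content.encode("utf-8")
    return [data[begin:end].decode("utf-8")
            for begin, end in zip(span_begins, span_ends)]


print(span_texts("Who let the dogs out", [12], [16]))  # ['dogs']
```

For ASCII-only text, byte offsets and character indices coincide. They diverge as soon as the text contains multi-byte characters.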