text.offsets_to_boise_tags

Converts the given tokens and spans in offsets format into BOISE tags.

The BOISE scheme uses five labels. Every label except O is combined with a span type, for example B-animal:

  • (B)egin: the token is the first token of a multi-token span.
  • (O)utside: the token is outside of any span.
  • (I)nside: the token is inside a span, between its begin and end tokens.
  • (S)ingleton: the entire span consists of this single token.
  • (E)nd: the token is the last token of a multi-token span.
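
As an illustration of the label set above, the following minimal sketch shows how a single span would be tagged depending on how many consecutive tokens it covers. The helper name boise_tags_for_span is hypothetical and is not part of the library.

```python
def boise_tags_for_span(num_tokens, span_type):
    """Return BOISE tags for one span covering `num_tokens` consecutive tokens.

    Hypothetical helper for illustration only; not a library function.
    """
    if num_tokens <= 0:
        return []
    if num_tokens == 1:
        # A span made of a single token is a singleton.
        return [f"S-{span_type}"]
    # First token begins the span, last token ends it, the rest are inside.
    inside = [f"I-{span_type}"] * (num_tokens - 2)
    return [f"B-{span_type}"] + inside + [f"E-{span_type}"]


print(boise_tags_for_span(1, "animal"))
print(boise_tags_for_span(3, "animal"))
```

Tokens that belong to no span simply keep the untyped label O.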

Given the span begin and end offsets along with a set of token begin and end offsets, this function translates each token into one of the 5 labels.

For example, given the following example string and entity:

    content = "Who let the dogs out"
    entity = "dogs"
    tokens = ["Who", "let", "the", "dogs", "out"]
    token_begin_offsets = [0, 4, 8, 12, 17]
    token_end_offsets = [3, 7, 11, 16, 20]
    span_begin_offsets = [12]
    span_end_offsets = [16]
    span_type = ["animal"]

The function will produce the following labels:

    ["O", "O", "O", "S-animal", "O"]
      |    |    |       |        |
     Who  let  the     dogs     out
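
The offsets in the example above follow the usual convention of an inclusive begin and an exclusive end, counted in bytes. A small sketch of how such offsets could be derived with plain whitespace splitting is shown here; whitespace_token_offsets is a hypothetical helper, and a real pipeline would normally take the offsets from its tokenizer instead.

```python
import re


def whitespace_token_offsets(content):
    """Return (tokens, begin_offsets, end_offsets) as UTF-8 byte offsets.

    Hypothetical helper: splits on whitespace only.
    """
    data = content.encode("utf-8")
    tokens, begins, ends = [], [], []
    for match in re.finditer(rb"\S+", data):
        tokens.append(match.group().decode("utf-8"))
        begins.append(match.start())  # inclusive begin
        ends.append(match.end())      # exclusive end
    return tokens, begins, ends


content = "Who let the dogs out"
tokens, token_begin_offsets, token_end_offsets = whitespace_token_offsets(content)

# Locate the entity in the same byte-offset space.
entity = "dogs"
span_begin = content.encode("utf-8").find(entity.encode("utf-8"))
span_end = span_begin + len(entity.encode("utf-8"))

print(tokens, token_begin_offsets, token_end_offsets)
print(span_begin, span_end)
```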

Special Case 1: Loose or Strict Boundary Criteria: By default, loose boundary criteria are used to decide which tokens start and end an entity span. In the above example, suppose we have

span_begin_offsets = [13]; span_end_offsets = [16];

we still get ["O", "O", "O", "S-animal", "O"], even though the span begin offset (13) is not exactly aligned with the token begin offset (12). A partial overlap between a token and a span still qualifies the token to be labeled with that span's tag.

You can choose to use strict boundary criteria by passing use_strict_boundary_mode=True, in which case the function will produce ["O", "O", "O", "O", "O"] for the case described above.
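
The difference between the two criteria can be sketched in pure Python. This is an illustrative reference for a single unbatched sequence, not the library's implementation: the name offsets_to_boise and its strict parameter are hypothetical, strict mode is assumed to mean that a token must lie entirely inside the span, and spans are assumed not to share tokens.

```python
def offsets_to_boise(token_begins, token_ends, span_begins, span_ends,
                     span_types, strict=False):
    """Illustrative BOISE tagger for one sequence (hypothetical helper)."""
    tags = ["O"] * len(token_begins)
    for sb, se, st in zip(span_begins, span_ends, span_types):
        covered = []
        for i, (tb, te) in enumerate(zip(token_begins, token_ends)):
            if strict:
                # Assumed strict rule: token fully contained in the span.
                hit = tb >= sb and te <= se
            else:
                # Loose rule: any overlap between token and span.
                hit = tb < se and te > sb
            if hit:
                covered.append(i)
        if not covered:
            continue
        if len(covered) == 1:
            tags[covered[0]] = f"S-{st}"
        else:
            tags[covered[0]] = f"B-{st}"
            tags[covered[-1]] = f"E-{st}"
            for i in covered[1:-1]:
                tags[i] = f"I-{st}"
    return tags


begins = [0, 4, 8, 12, 17]
ends = [3, 7, 11, 16, 20]

loose = offsets_to_boise(begins, ends, [13], [16], ["animal"])
strict = offsets_to_boise(begins, ends, [13], [16], ["animal"], strict=True)
print(loose)
print(strict)
```

With the misaligned span (13, 16), the loose rule still tags "dogs" while the assumed strict rule leaves every token as O, matching the behavior described above.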

Special Case 2: One Token Mapped to Multiple BOISE Tags: When a token overlaps multiple spans and would therefore receive multiple BOISE tags, we label the token with the last tag. For example, given the following inputs:

    std::string content = "Getty Center";
    std::vector tokens = { "Getty Center" };
    std::vector token_begin_offsets = { 0 };
    std::vector token_end_offsets = { 12 };
    std::vector span_begin_offsets = { 0, 6 };
    std::vector span_end_offsets = { 5, 12 };
    std::vector span_type = { "per", "loc" };

The function will produce the following labels: ["B-loc"]

Example:

>>> import tensorflow as tf
>>> import tensorflow_text as tf_text
>>> token_begin_offsets = tf.ragged.constant(
...   [[0, 4, 8, 12, 17], [0, 4, 8, 12]])
>>> token_end_offsets = tf.ragged.constant(
...   [[3, 7, 11, 16, 20], [3, 7, 11, 16]])
>>> span_begin_offsets = tf.ragged.constant([[4], [12]])
>>> span_end_offsets = tf.ragged.constant([[16], [16]])
>>> span_type = tf.ragged.constant([['animal'], ['loc']])
>>> boise_tags = tf_text.offsets_to_boise_tags(token_begin_offsets,
...   token_end_offsets, span_begin_offsets, span_end_offsets, span_type)
>>> boise_tags
<tf.RaggedTensor [[b'O', b'B-animal', b'I-animal', b'E-animal', b'O'],
[b'O', b'O', b'O', b'S-loc']]>

Args:

  • token_begin_offsets: A RaggedTensor or Tensor of token begin byte offsets, of type int32 or int64.
  • token_end_offsets: A RaggedTensor or Tensor of token end byte offsets, of type int32 or int64.
  • span_begin_offsets: A RaggedTensor or Tensor of span begin byte offsets, of type int32 or int64.
  • span_end_offsets: A RaggedTensor or Tensor of span end byte offsets, of type int32 or int64.
  • span_type: A RaggedTensor or Tensor of span type strings.
  • use_strict_boundary_mode: A bool indicating whether to use the strict boundary mode, which excludes a token from a span label when the token's begin/end byte range only partially overlaps the span range.

Returns:

  A RaggedTensor of BOISE tag strings with the same shape as the input token begin and end offsets.