Converts the given tokens and spans in offsets format into BOISE tags.
text.offsets_to_boise_tags(
token_begin_offsets,
token_end_offsets,
span_begin_offsets,
span_end_offsets,
span_type,
use_strict_boundary_mode=False
)
In the BOISE scheme there is a set of 5 labels for each type:
- (B)egin: the token is the beginning of a span.
- (O)utside: the token is outside of any span.
- (I)nside: the token is inside a span.
- (S)ingleton: the entire span consists of this single token.
- (E)nd: the token is the end of a span.
Given the span begin and end offsets along with a set of token begin and end offsets, this function translates each token into one of the 5 labels.
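The label set above can be illustrated with a minimal pure-Python sketch. It shows only how the labels are distributed over a span once the number of tokens it covers is known. The helper name `boise_labels` is hypothetical and not part of the library.

```python
def boise_labels(num_tokens, span_type):
    """Return BOISE labels for a span covering `num_tokens` consecutive tokens.

    Hypothetical helper for illustration only; not part of tensorflow_text.
    """
    if num_tokens <= 0:
        return []
    if num_tokens == 1:
        # A span made of a single token is a Singleton.
        return [f"S-{span_type}"]
    # Longer spans start with Begin, end with End, and are Inside in between.
    inside = [f"I-{span_type}"] * (num_tokens - 2)
    return [f"B-{span_type}", *inside, f"E-{span_type}"]


print(boise_labels(1, "animal"))  # ['S-animal']
print(boise_labels(3, "animal"))  # ['B-animal', 'I-animal', 'E-animal']
```

Tokens not covered by any span would receive the plain "O" label, which carries no type suffix.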
For example, given the following example string and entity:
content = "Who let the dogs out"
entity = "dogs"
tokens = ["Who", "let", "the", "dogs", "out"]
token_begin_offsets = [0, 4, 8, 12, 17]
token_end_offsets = [3, 7, 11, 16, 20]
span_begin_offsets = [12]
span_end_offsets = [16]
span_type = ["animal"]
The function will produce the following labels:
["O", "O", "O", "S-animal", "O"]
  |    |    |       |        |
 Who  let  the    dogs      out
Special Case 1: Loose or Strict Boundary Criteria: By default, loose boundary criteria are used to decide which tokens start and end an entity span. In the above example, suppose we instead have
span_begin_offsets = [13]; span_end_offsets = [16];
we still get ["O", "O", "O", "S-animal", "O"], even though the span begin offset (13) is not exactly aligned with the token begin offset (12). Partial overlap between a token and a span is enough for the token to be labeled with that span's tag.
You can choose to use strict boundary criteria by passing the argument use_strict_boundary_mode=True, with which the function will produce ["O", "O", "O", "O", "O"] for the case described above.
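The difference between the two criteria can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: loose mode is modeled as "the token overlaps the span at all", and strict mode is assumed to mean "the token lies entirely within the span", which is consistent with the example above but is not confirmed by this page. The helper name `covered_tokens` is hypothetical.

```python
def covered_tokens(token_begins, token_ends, span_begin, span_end, strict=False):
    """Return indices of tokens considered part of the span.

    Hypothetical helper. End offsets are treated as exclusive.
    """
    covered = []
    for i, (tb, te) in enumerate(zip(token_begins, token_ends)):
        if strict:
            # Assumption: strict mode requires the token to sit fully inside the span.
            hit = tb >= span_begin and te <= span_end
        else:
            # Loose mode: any partial overlap is enough.
            hit = tb < span_end and te > span_begin
        if hit:
            covered.append(i)
    return covered


token_begin_offsets = [0, 4, 8, 12, 17]
token_end_offsets = [3, 7, 11, 16, 20]

# Span [13, 16) starts one character into the token "dogs" at [12, 16).
loose = covered_tokens(token_begin_offsets, token_end_offsets, 13, 16)
strict = covered_tokens(token_begin_offsets, token_end_offsets, 13, 16, strict=True)
print(loose)   # [3]
print(strict)  # []
```

Under these assumptions, the loose result covers the single token "dogs" (hence "S-animal"), while the strict result covers no token, so every label stays "O".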
Special Case 2: One Token Mapped to Multiple BOISE Tags: In cases where a token overlaps multiple spans, we label the token with the last tag. For example, given the following example inputs:
std::string content = "Getty Center";
std::vector
The function will produce the following labels: ["B-loc"]
Example:
>>> import tensorflow as tf
>>> import tensorflow_text as tf_text
>>> token_begin_offsets = tf.ragged.constant(
... [[0, 4, 8, 12, 17], [0, 4, 8, 12]])
>>> token_end_offsets = tf.ragged.constant(
... [[3, 7, 11, 16, 20], [3, 7, 11, 16]])
>>> span_begin_offsets = tf.ragged.constant([[4], [12]])
>>> span_end_offsets = tf.ragged.constant([[16], [16]])
>>> span_type = tf.ragged.constant([['animal'], ['loc']])
>>> boise_tags = tf_text.offsets_to_boise_tags(token_begin_offsets,
... token_end_offsets, span_begin_offsets, span_end_offsets, span_type)
>>> boise_tags
<tf.RaggedTensor [[b'O', b'B-animal', b'I-animal', b'E-animal', b'O'],
[b'O', b'O', b'O', b'S-loc']]>
| Returns |
|---|
| A RaggedTensor of BOISE tag strings with the same dimensions as the input token begin and end offsets. |