text.StateBasedSentenceBreaker

A Splitter that uses a state machine to determine sentence breaks.

StateBasedSentenceBreaker splits text into sentences by using a state machine to determine when a sequence of characters indicates a potential sentence break.

The state machine starts in an initial state and transitions to the collecting terminal punctuation state once it encounters an acronym, an emoticon, or terminal punctuation (ellipsis, question mark, exclamation point, etc.).

It transitions to the collecting close punctuation state when it finds close punctuation (close bracket, end quote, etc.).

If non-punctuation is encountered in the collecting terminal punctuation or collecting close punctuation states, then the state machine exits, returning false, indicating it has moved past the end of a potential sentence fragment.
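The three states described above can be illustrated with a small pure-Python sketch. This is a simplified toy, not the library's implementation. It ignores acronyms and emoticons entirely, so it would split inside "U.S.". The character sets, the `State` enum, and the `toy_break_sentences` name are assumptions made for illustration. It also assumes that any further punctuation keeps the machine collecting.

```python
from enum import Enum, auto

# Assumed character classes; the real breaker recognizes far more.
TERMINAL = set(".?!")
CLOSE = set(")]}\"'")


class State(Enum):
    INITIAL = auto()
    COLLECTING_TERMINAL = auto()
    COLLECTING_CLOSE = auto()


def toy_break_sentences(text):
    """Split text into fragments with a three-state machine (toy version)."""
    fragments = []
    state = State.INITIAL
    start = 0  # index where the current fragment begins
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if state is State.INITIAL:
            if ch in TERMINAL:
                state = State.COLLECTING_TERMINAL
        elif ch in CLOSE:
            state = State.COLLECTING_CLOSE
        elif ch in TERMINAL:
            pass  # assumption: more terminal punctuation keeps the current state
        else:
            # Non-punctuation: the fragment ended just before index i.
            fragments.append(text[start:i])
            while i < n and text[i].isspace():
                i += 1  # skip whitespace between fragments
            start = i
            state = State.INITIAL
            continue  # re-examine text[i] from the initial state
        i += 1
    if start < n:
        fragments.append(text[start:])
    return fragments


print(toy_break_sentences("Hello...foo bar"))
print(toy_break_sentences('He said "Stop!" Then left.'))
```

The `else` branch plays the role of the exit described above: once a non-punctuation character follows collected punctuation, the machine has moved past the end of the fragment and starts over.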

Methods

break_sentences

Splits doc into sentence fragments and returns the fragments' text.

Args
doc A string Tensor of shape [batch] with a batch of documents.

Returns
results A string RaggedTensor of shape [batch, (num_sentences)] with each input broken up into its constituent sentence fragments.
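
A minimal usage sketch, assuming the class is available as `tensorflow_text.StateBasedSentenceBreaker` and can be constructed without arguments. The `split_documents` helper is hypothetical and exists only for this example.

```python
def split_documents(docs):
    """Split each document into sentence fragments (hypothetical helper)."""
    # Imported inside the function so that defining the helper does not
    # itself require the packages to be installed.
    import tensorflow as tf
    import tensorflow_text as tf_text

    breaker = tf_text.StateBasedSentenceBreaker()
    # String RaggedTensor of shape [batch, (num_sentences)].
    fragments = breaker.break_sentences(tf.constant(docs))
    # to_list() yields nested lists of bytes; decode them for display.
    return [[s.decode("utf-8") for s in row] for row in fragments.to_list()]
```

Calling `split_documents(["Hello...foo bar"])` returns one inner list per input document. Because the result is ragged, documents may yield different numbers of fragments.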

break_sentences_with_offsets

Splits doc into sentence fragments and returns their text along with their start and end offsets.

Example:

                1                  1         2         3
      012345678901234    01234567890123456789012345678901234567
doc: 'Hello...foo bar', 'Welcome to the U.S. don't be surprised'

fragment_text: [
  ['Hello...', 'foo bar'],
  ['Welcome to the U.S.' , 'don't be surprised']
]
start: [[0, 8],[0, 20]]
end: [[8, 15],[19, 38]]

Args
doc A string Tensor of shape [batch] or [batch, 1].

Returns
A tuple of (fragment_text, start, end) where:
fragment_text A string RaggedTensor of shape [batch, (num_sentences)] with each input broken up into its constituent sentence fragments.
start An int64 RaggedTensor of shape [batch, (num_sentences)] where each entry is the inclusive beginning byte offset of a sentence.
end An int64 RaggedTensor of shape [batch, (num_sentences)] where each entry is the exclusive ending byte offset of a sentence.
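
To make the offset convention concrete, the following pure-Python sketch slices the documents from the example above using the listed `start` (inclusive) and `end` (exclusive) values and recovers the fragment text. It does not call the library. The `slice_fragments` helper is a hypothetical name, and the values are copied from the example as plain lists.

```python
docs = ["Hello...foo bar", "Welcome to the U.S. don't be surprised"]
fragment_text = [["Hello...", "foo bar"],
                 ["Welcome to the U.S.", "don't be surprised"]]
start = [[0, 8], [0, 20]]
end = [[8, 15], [19, 38]]


def slice_fragments(doc, starts, ends):
    """Recover fragments from byte offsets: start inclusive, end exclusive."""
    data = doc.encode("utf-8")  # offsets index bytes, not characters
    return [data[s:e].decode("utf-8") for s, e in zip(starts, ends)]


recovered = [slice_fragments(d, s, e) for d, s, e in zip(docs, start, end)]
print(recovered == fragment_text)
```

Because the offsets count bytes, they can differ from character indices when the text contains non-ASCII characters. For instance, "é" occupies two bytes in UTF-8.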