text.sentence_fragments

Find the sentence fragments in a given text. (deprecated)

A sentence fragment is a potential next sentence determined using deterministic heuristics based on punctuation, capitalization, and similar text attributes.

token_word A Tensor (w/ rank=2) or a RaggedTensor (w/ ragged_rank=1) containing the token strings.
token_starts A Tensor (w/ rank=2) or a RaggedTensor (w/ ragged_rank=1) containing offsets where the token starts.
token_ends A Tensor (w/ rank=2) or a RaggedTensor (w/ ragged_rank=1) containing offsets where the token ends.
token_properties A Tensor (w/ rank=2) or a RaggedTensor (w/ ragged_rank=1) containing a bitmask.

The values of the bitmask are:

  • 0x01 (ILL_FORMED) - Text is ill-formed: typically applies to all tokens of a paragraph that is too short or lacks terminal punctuation.
  • 0x02 (HEADING)
  • 0x04 (BOLD)
  • 0x10 (UNDERLINED)
  • 0x20 (LIST)
  • 0x40 (TITLE)
  • 0x80 (EMOTICON)
  • 0x100 (ACRONYM) - Token was identified as an acronym. Covers period-, hyphen-, and space-separated acronyms such as "U.S.", "U-S", and "U S".
  • 0x200 (HYPERLINK) - Indicates that the token (or part of the token) is covered by at least one hyperlink.
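
As a rough illustration of how such a bitmask can be built and inspected, here is a minimal pure-Python sketch. The constant names mirror the list above, but the module-level constants and the has_property helper are hypothetical and are not part of the TensorFlow Text API.

```python
# Hypothetical constants mirroring the documented token_properties values.
ILL_FORMED = 0x01
HEADING = 0x02
BOLD = 0x04
UNDERLINED = 0x10
LIST = 0x20
TITLE = 0x40
EMOTICON = 0x80
ACRONYM = 0x100
HYPERLINK = 0x200


def has_property(mask: int, flag: int) -> bool:
    """Return True if the given flag bit is set in the mask."""
    return (mask & flag) != 0


# A bold acronym that is also covered by a hyperlink: combine flags with bitwise OR.
mask_value = BOLD | ACRONYM | HYPERLINK

print(hex(mask_value))                     # combined mask
print(has_property(mask_value, ACRONYM))   # flag is set
print(has_property(mask_value, HEADING))   # flag is not set
```

Each token carries one such integer, so a rank-2 token_properties input would hold one combined mask per token.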

input_encoding String name for the unicode encoding that should be used to decode each string.
errors Specifies the response when an input string can't be converted using the indicated encoding. One of:

  • 'strict': Raise an exception for any illegal substrings.

  • 'replace': Replace illegal substrings with replacement_char.

  • 'ignore': Skip illegal substrings.

replacement_char The replacement codepoint to be used in place of invalid substrings in input when errors='replace', and in place of C0 control characters in input when replace_control_characters=True.
replace_control_characters Whether to replace the C0 control characters (U+0000 - U+001F) with the replacement_char.
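
The three error policies behave much like the identically named error handlers of Python's built-in bytes.decode, so a pure-Python analogy may help. This is only a sketch of the described behavior, not the TensorFlow implementation; the helper names decode_with_policy and replace_c0_controls are hypothetical, and the default replacement codepoint 0xFFFD is an assumption.

```python
def decode_with_policy(data: bytes, input_encoding: str = "utf-8",
                       errors: str = "replace") -> str:
    """Decode bytes using one of 'strict', 'replace', or 'ignore'."""
    return data.decode(input_encoding, errors=errors)


def replace_c0_controls(text: str, replacement_char: int = 0xFFFD) -> str:
    """Replace C0 control characters (U+0000 - U+001F) with replacement_char."""
    return "".join(
        chr(replacement_char) if ord(ch) <= 0x1F else ch for ch in text
    )


raw = b"Hello \xff world"  # the byte 0xFF is never valid in UTF-8

replaced = decode_with_policy(raw, errors="replace")  # invalid byte becomes U+FFFD
ignored = decode_with_policy(raw, errors="ignore")    # invalid byte is dropped

try:
    decode_with_policy(raw, errors="strict")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True  # 'strict' raises on the illegal substring

# Control characters such as U+0007 (bell) are swapped for the replacement codepoint.
cleaned = replace_c0_controls("a\x07b")
```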

Returns a RaggedTensor of fragment_start, fragment_end, fragment_properties and terminal_punc_token.

fragment_properties is an int32 bitmask whose values may contain:

  • 1 = fragment ends with terminal punctuation
  • 2 = fragment ends with multiple terminal punctuations (e.g. "She said what?!")
  • 3 = Has close parenthesis (e.g. "Mushrooms (they're fungi).")
  • 4 = Has sentential close parenthesis (e.g. "(Mushrooms are fungi!)")

terminal_punc_token is a RaggedTensor containing the index of the terminal punctuation token immediately following the last word in the fragment, or the index of the last word itself if it is an acronym (since acronyms include the terminal punctuation).