text.sentence_fragments

Find the sentence fragments in a given text. (deprecated)

text.sentence_fragments(
    token_word,
    token_starts,
    token_ends,
    token_properties,
    input_encoding='UTF-8',
    errors='replace',
    replacement_char=65533,
    replace_control_characters=False
)

A sentence fragment is a potential next sentence determined using deterministic heuristics based on punctuation, capitalization, and similar text attributes.

Args
`token_word`	A Tensor (w/ rank=2) or a RaggedTensor (w/ ragged_rank=1) containing the token strings.
`token_starts`	A Tensor (w/ rank=2) or a RaggedTensor (w/ ragged_rank=1) containing offsets where the token starts.
`token_ends`	A Tensor (w/ rank=2) or a RaggedTensor (w/ ragged_rank=1) containing offsets where the token ends.
`token_properties`	A Tensor (w/ rank=2) or a RaggedTensor (w/ ragged_rank=1) containing a bitmask. The values of the bitmask are:
0x01 (ILL_FORMED) - Text is ill-formed: typically applies to all tokens of a paragraph that is too short or lacks terminal punctuation.
0x02 (HEADING)
0x04 (BOLD)
0x10 (UNDERLINED)
0x20 (LIST)
0x40 (TITLE)
0x80 (EMOTICON)
0x100 (ACRONYM) - Token was identified as an acronym. Covers period-, hyphen-, and space-separated acronyms: "U.S.", "U-S", and "U S".
0x200 (HYPERLINK) - Indicates that the token (or part of the token) is covered by at least one hyperlink.
`input_encoding`	String name for the Unicode encoding that should be used to decode each string.
`errors`	Specifies the response when an input string can't be converted using the indicated encoding. One of:
`'strict'`: Raise an exception for any illegal substrings.
`'replace'`: Replace illegal substrings with `replacement_char`.
`'ignore'`: Skip illegal substrings.
`replacement_char`	The replacement codepoint to be used in place of invalid substrings in the input strings when `errors='replace'`, and in place of C0 control characters when `replace_control_characters=True`.
`replace_control_characters`	Whether to replace the C0 control characters `(U+0000 - U+001F)` with the `replacement_char`.
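
The `token_properties` bits listed above can be combined with bitwise OR and tested with bitwise AND. The following is a minimal pure-Python sketch of that bookkeeping. The `TokenProperty` class and the `describe` helper are hypothetical names introduced for illustration and are not part of the library; only the numeric values and flag names come from the table.

```python
from enum import IntFlag
from typing import List


class TokenProperty(IntFlag):
    """Hypothetical wrapper for the token_properties bits (not a library type)."""
    ILL_FORMED = 0x01
    HEADING = 0x02
    BOLD = 0x04
    UNDERLINED = 0x10
    LIST = 0x20
    TITLE = 0x40
    EMOTICON = 0x80
    ACRONYM = 0x100
    HYPERLINK = 0x200


def describe(mask: int) -> List[str]:
    """Return the names of the flags set in `mask`, in definition order."""
    return [flag.name for flag in TokenProperty if mask & flag]


# A bold acronym token: combine flags with bitwise OR.
bold_acronym = TokenProperty.BOLD | TokenProperty.ACRONYM
print(int(bold_acronym))   # 260, i.e. 0x104
print(describe(0x104))     # ['BOLD', 'ACRONYM']
```

The integer value of each token's mask is what would go into the `token_properties` tensor, one entry per token.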

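The decoding options (`input_encoding`, `errors`, `replacement_char`, `replace_control_characters`) can be approximated with Python's standard codecs. This is a rough sketch of the described semantics only, not the library's implementation: `decode_sketch` is a hypothetical helper, it raises `UnicodeDecodeError` in `'strict'` mode rather than a TensorFlow error, and the number of replacement characters emitted for a malformed byte sequence may differ from what the op produces.

```python
import codecs


def decode_sketch(data, input_encoding="UTF-8", errors="replace",
                  replacement_char=65533, replace_control_characters=False):
    """Approximate the documented decoding options (hypothetical helper)."""
    replacement = chr(replacement_char)
    if errors == "replace":
        # Register a handler that substitutes the requested codepoint for
        # each undecodable span and resumes decoding right after it.
        handler_name = f"sketch_replace_{replacement_char}"
        codecs.register_error(handler_name,
                              lambda exc: (replacement, exc.end))
        text = data.decode(input_encoding, errors=handler_name)
    elif errors in ("strict", "ignore"):
        text = data.decode(input_encoding, errors=errors)
    else:
        raise ValueError(f"unsupported errors mode: {errors!r}")
    if replace_control_characters:
        # C0 control characters are U+0000 through U+001F.
        text = "".join(replacement if ord(ch) <= 0x1F else ch for ch in text)
    return text


print(repr(decode_sketch(b"caf\xff")))                   # 'caf\ufffd'
print(repr(decode_sketch(b"caf\xff", errors="ignore")))  # 'caf'
print(repr(decode_sketch(b"a\tb", replacement_char=ord("?"),
                         replace_control_characters=True)))  # 'a?b'
```

The default `replacement_char` of 65533 is U+FFFD, the Unicode replacement character.
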
Returns
A RaggedTensor of `fragment_start`, `fragment_end`, `fragment_properties` and `terminal_punc_token`.

`fragment_properties` is an int32 bitmask whose values may contain:

1 = fragment ends with terminal punctuation
2 = fragment ends with multiple terminal punctuations (e.g. "She said what?!")
3 = Has close parenthesis (e.g. "Mushrooms (they're fungi).")
4 = Has sentential close parenthesis (e.g. "(Mushrooms are fungi!)")

`terminal_punc_token` is a RaggedTensor containing the index of the terminal punctuation token immediately following the last word in the fragment, or the index of the last word itself if it is an acronym (since acronyms include the terminal punctuation).
