Module: text

Various TensorFlow ops related to text processing.

Modules

metrics module: TensorFlow text-processing metrics.

tflite_registrar module: A module with a Python wrapper for TFLite TFText ops.

Classes

class BertTokenizer: Tokenizer used for BERT.

class ByteSplitter: Splits a string tensor into bytes.

class Detokenizer: Base class for detokenizer implementations.

class FastBertNormalizer: Normalizes a tensor of UTF-8 strings.

class FastBertTokenizer: Tokenizer used for BERT, a faster version with TFLite support.

class FastSentencepieceTokenizer: Sentencepiece tokenizer with tf.text interface.

class FastWordpieceTokenizer: Tokenizes a tensor of UTF-8 string tokens into subword pieces.

class FirstNItemSelector: An ItemSelector that selects the first n items in the batch.

class HubModuleSplitter: Splitter that uses a Hub module.

class HubModuleTokenizer: Tokenizer that uses a Hub module.

class LastNItemSelector: An ItemSelector that selects the last n items in the batch.

class MaskValuesChooser: Assigns values to the items chosen for masking.

class PhraseTokenizer: Tokenizes a tensor of UTF-8 string tokens into phrases.

class RandomItemSelector: An ItemSelector implementation that randomly selects items in a batch.

class Reduction: Type of reduction to be done by the n-gram op.

class RegexSplitter: RegexSplitter splits text on the given regular expression.

class RoundRobinTrimmer: A Trimmer that allocates a length budget to segments via round robin.

class SentencepieceTokenizer: Tokenizes a tensor of UTF-8 strings.

class ShrinkLongestTrimmer: A Trimmer that truncates the longest segment.

class SplitMergeFromLogitsTokenizer: Tokenizes a tensor of UTF-8 strings into words according to logits.

class SplitMergeTokenizer: Tokenizes a tensor of UTF-8 strings into words according to labels.

class Splitter: An abstract base class for splitting text.

class SplitterWithOffsets: An abstract base class for splitters that return offsets.

class StateBasedSentenceBreaker: A Splitter that uses a state machine to determine sentence breaks.

class Tokenizer: Base class for tokenizer implementations.

class TokenizerWithOffsets: Base class for tokenizer implementations that return offsets.

class Trimmer: Truncates a list of segments using a pre-determined truncation strategy.

class UnicodeCharTokenizer: Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.

class UnicodeScriptTokenizer: Tokenizes UTF-8 by splitting when there is a change in Unicode script.

class WaterfallTrimmer: A Trimmer that allocates a length budget to segments in order.

class WhitespaceTokenizer: Tokenizes a tensor of UTF-8 strings on whitespace.

class WordShape: Values for the 'pattern' arg of the wordshape op.

class WordpieceTokenizer: Tokenizes a tensor of UTF-8 string tokens into subword pieces.
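
The two budget-based trimmers listed above differ only in how they divide a length budget among segments. The following plain-Python sketch illustrates that difference; it does not call the library, and the helper names waterfall_budget and round_robin_budget are hypothetical, chosen for illustration only. It assumes "in order" means earlier segments are filled completely before later ones receive anything, and "round robin" means one slot is handed to each segment in turn.

```python
def waterfall_budget(lengths, budget):
    """Fill segments in order; later segments get whatever budget is left."""
    kept = []
    for n in lengths:
        take = min(n, budget)
        kept.append(take)
        budget -= take
    return kept


def round_robin_budget(lengths, budget):
    """Hand out one slot at a time, cycling over segments that still have items."""
    kept = [0] * len(lengths)
    while budget > 0:
        progressed = False
        for i, n in enumerate(lengths):
            if budget > 0 and kept[i] < n:
                kept[i] += 1
                budget -= 1
                progressed = True
        if not progressed:
            # Every segment is already fully kept; the rest of the budget is unused.
            break
    return kept


if __name__ == "__main__":
    segment_lengths = [5, 3, 4]  # three segments, 12 items in total
    print(waterfall_budget(segment_lengths, 7))    # [5, 2, 0]
    print(round_robin_budget(segment_lengths, 7))  # [3, 2, 2]
```

With a budget of 7, the in-order strategy keeps the first segment whole and starves the last, while the round-robin strategy spreads the budget across all three segments.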

Functions

boise_tags_to_offsets(...): Converts the token offsets and BOISE tags into span offsets and span type.

build_fast_bert_normalizer_model(...): build_fast_bert_normalizer_model(arg0: bool) -> bytes

build_fast_wordpiece_model(...): build_fast_wordpiece_model(arg0: list[str], arg1: int, arg2: str, arg3: str, arg4: bool, arg5: bool) -> bytes

case_fold_utf8(...): Applies case folding to every UTF-8 string in the input.

coerce_to_structurally_valid_utf8(...): Coerce UTF-8 input strings to structurally valid UTF-8.

combine_segments(...): Combine one or more input segments for a model's input sequence.

concatenate_segments(...): Concatenate input segments for a model's input sequence.

find_source_offsets(...): Maps the input post-normalized string offsets to pre-normalized offsets.

gather_with_default(...): Gather slices with indices=-1 mapped to default.

greedy_constrained_sequence(...): Performs greedy constrained sequence decoding on a batch of examples.

mask_language_model(...): Applies dynamic language model masking.

max_spanning_tree(...): Finds the maximum directed spanning tree of a digraph.

max_spanning_tree_gradient(...): Returns a subgradient of the MaximumSpanningTree op.

ngrams(...): Create a tensor of n-grams based on the input data.

normalize_utf8(...): Normalizes each UTF-8 string in the input tensor using the specified rule.

normalize_utf8_with_offsets_map(...): Normalizes each UTF-8 string in the input tensor using the specified rule; also returns an offsets map.

offsets_to_boise_tags(...): Converts the given tokens and spans in offsets format into BOISE tags.

pad_along_dimension(...): Add padding to the beginning and end of data in a specific dimension.

pad_model_inputs(...): Pad model input and generate corresponding input masks.

regex_split(...): Split input by delimiters that match a regex pattern.

regex_split_with_offsets(...): Split input by delimiters that match a regex pattern; returns offsets.

sentence_fragments(...): Find the sentence fragments in a given text. (deprecated)

sliding_window(...): Builds a sliding window for data with a specified width.

span_alignment(...): Return an alignment from a set of source spans to a set of target spans.

span_overlaps(...): Returns a boolean tensor indicating which source and target spans overlap.

utf8_binarize(...): Decode UTF8 tokens into code points and return their bits.

viterbi_constrained_sequence(...): Performs Viterbi constrained sequence decoding on a batch of examples.

wordshape(...): Determine wordshape features for each input string.
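
As an illustration of how sliding_window(...) and ngrams(...) relate, the plain-Python sketch below builds fixed-width windows over a token list and then joins each window into a string n-gram. It does not call the library; the helper names, the list-based inputs, and the separator argument are assumptions made for this example, not the library's signatures.

```python
def sliding_window(items, width):
    """Return every contiguous run of `width` items, in order."""
    return [items[i:i + width] for i in range(len(items) - width + 1)]


def ngrams_string_join(tokens, width, separator=" "):
    """Build n-grams by joining each window of tokens with `separator`."""
    return [separator.join(window) for window in sliding_window(tokens, width)]


if __name__ == "__main__":
    tokens = ["the", "quick", "brown", "fox"]
    print(sliding_window(tokens, 2))
    # [['the', 'quick'], ['quick', 'brown'], ['brown', 'fox']]
    print(ngrams_string_join(tokens, 2))
    # ['the quick', 'quick brown', 'brown fox']
```

An input shorter than the window width yields an empty result in this sketch.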

version '2.18.0'