
Module: text

Various TensorFlow ops related to text processing.

Modules

metrics module: TensorFlow text-processing metrics.

Classes

class BertTokenizer: Tokenizer used for BERT.

class Detokenizer: Base class for detokenizer implementations.

class FirstNItemSelector: An ItemSelector that selects the first n items in the batch.

class HubModuleSplitter: Splitter that uses a Hub module.

class HubModuleTokenizer: Tokenizer that uses a Hub module.

class MaskValuesChooser: Assigns values to the items chosen for masking.

class RandomItemSelector: An ItemSelector implementation that randomly selects items in a batch.

class Reduction: Type of reduction to be done by the n-gram op.

class RegexSplitter: RegexSplitter splits text on the given regular expression.

class RoundRobinTrimmer: A Trimmer that allocates a length budget to segments via round robin.

class SentencepieceTokenizer: Tokenizes a tensor of UTF-8 strings.

class SplitMergeFromLogitsTokenizer: Tokenizes a tensor of UTF-8 strings into words according to logits.

class SplitMergeTokenizer: Tokenizes a tensor of UTF-8 strings into words according to labels.

class Splitter: An abstract base class for splitting text.

class SplitterWithOffsets: An abstract base class for splitters that return offsets.

class StateBasedSentenceBreaker: A Splitter that uses a state machine to determine sentence breaks.

class Tokenizer: Base class for tokenizer implementations.

class TokenizerWithOffsets: Base class for tokenizer implementations that return offsets.

class UnicodeCharTokenizer: Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.

class UnicodeScriptTokenizer: Tokenizes UTF-8 by splitting when there is a change in Unicode script.

class WaterfallTrimmer: A Trimmer that allocates a length budget to segments in order.

class WhitespaceTokenizer: Tokenizes a tensor of UTF-8 strings on whitespaces.

class WordShape: Values for the 'pattern' arg of the wordshape op.

class WordpieceTokenizer: Tokenizes a tensor of UTF-8 string tokens into subword pieces.
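
The two trimmers above differ only in how they spend a shared length budget across segments. The following is a minimal plain-Python sketch of that difference, not the library's implementation: the helper names `waterfall_budget` and `round_robin_budget` are hypothetical, and the sketch works on segment lengths rather than on tensors.

```python
def waterfall_budget(lengths, max_len):
    """Fill segments in order: each takes as much budget as it can use."""
    remaining = max_len
    kept = []
    for n in lengths:
        take = min(n, remaining)
        kept.append(take)
        remaining -= take
    return kept


def round_robin_budget(lengths, max_len):
    """Hand out the budget one item at a time, cycling over the segments."""
    kept = [0] * len(lengths)
    remaining = max_len
    progressed = True
    # Stop when the budget is spent or no segment can accept another item.
    while remaining > 0 and progressed:
        progressed = False
        for i, n in enumerate(lengths):
            if remaining == 0:
                break
            if kept[i] < n:
                kept[i] += 1
                remaining -= 1
                progressed = True
    return kept


segment_lengths = [5, 3, 4]  # hypothetical token counts per segment
waterfall_kept = waterfall_budget(segment_lengths, 6)
round_robin_kept = round_robin_budget(segment_lengths, 6)
```

In this sketch the in-order strategy favors early segments, while the round-robin strategy spreads the budget evenly until a segment runs out of items.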

Functions

case_fold_utf8(...): Applies case folding to every UTF-8 string in the input.

coerce_to_structurally_valid_utf8(...): Coerce UTF-8 input strings to structurally valid UTF-8.

combine_segments(...): Combine one or more input segments for a model's input sequence.

find_source_offsets(...): Maps the input post-normalized string offsets to pre-normalized offsets.

gather_with_default(...): Gather slices with indices=-1 mapped to default.

greedy_constrained_sequence(...): Performs greedy constrained sequence decoding on a batch of examples.

mask_language_model(...): Applies dynamic language model masking.

max_spanning_tree(...): Finds the maximum directed spanning tree of a digraph.

max_spanning_tree_gradient(...): Returns a subgradient of the MaximumSpanningTree op.

ngrams(...): Create a tensor of n-grams based on the input data.

normalize_utf8(...): Normalizes each UTF-8 string in the input tensor using the specified rule.

normalize_utf8_with_offsets_map(...): Normalizes each UTF-8 string in the input tensor using the specified rule; also returns an offsets map.

pad_along_dimension(...): Add padding to the beginning and end of data in a specific dimension.

pad_model_inputs(...): Pad model input and generate corresponding input masks.

regex_split(...): Split input by delimiters that match a regex pattern.

regex_split_with_offsets(...): Split input by delimiters that match a regex pattern; returns offsets.

sentence_fragments(...): Find the sentence fragments in a given text. (deprecated)

sliding_window(...): Builds a sliding window for data with a specified width.

span_alignment(...): Return an alignment from a set of source spans to a set of target spans.

span_overlaps(...): Returns a boolean tensor indicating which source and target spans overlap.

viterbi_constrained_sequence(...): Performs Viterbi constrained sequence decoding on a batch of examples.

wordshape(...): Determine wordshape features for each input string.
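
As an illustration of how the sliding-window and n-gram entries above relate, here is a minimal plain-Python sketch on ordinary lists. It is not the library's tensor implementation: the names `sliding_window_sketch` and `ngrams_sketch` are hypothetical, and joining each window with a separator is only one possible reduction.

```python
def sliding_window_sketch(items, width):
    """Return every contiguous run of `width` items, in order."""
    return [items[i:i + width] for i in range(len(items) - width + 1)]


def ngrams_sketch(tokens, width, separator=" "):
    """Build string n-grams by joining each window of tokens."""
    return [separator.join(window)
            for window in sliding_window_sketch(tokens, width)]


tokens = ["the", "quick", "brown", "fox"]  # hypothetical input tokens
windows = sliding_window_sketch(tokens, 3)
bigrams = ngrams_sketch(tokens, 2)
```

An input shorter than the window width yields an empty list in this sketch.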

__version__: '2.6.0'