Various TensorFlow ops related to text processing.
Modules
metrics
module: TensorFlow text-processing metrics.
tflite_registrar
module: A module with a Python wrapper for TFLite TFText ops.
Classes
class BertTokenizer
: Tokenizer used for BERT.
class ByteSplitter
: Splits a string tensor into bytes.
class Detokenizer
: Base class for detokenizer implementations.
class FastBertNormalizer
: Normalizes a tensor of UTF-8 strings.
class FastBertTokenizer
: Tokenizer used for BERT, a faster version with TFLite support.
class FastSentencepieceTokenizer
: Sentencepiece tokenizer with tf.text interface.
class FastWordpieceTokenizer
: Tokenizes a tensor of UTF-8 string tokens into subword pieces.
class FirstNItemSelector
: An `ItemSelector` that selects the first `n` items in the batch.
class HubModuleSplitter
: Splitter that uses a Hub module.
class HubModuleTokenizer
: Tokenizer that uses a Hub module.
class LastNItemSelector
: An `ItemSelector` that selects the last `n` items in the batch.
class MaskValuesChooser
: Assigns values to the items chosen for masking.
class PhraseTokenizer
: Tokenizes a tensor of UTF-8 string tokens into phrases.
class RandomItemSelector
: An `ItemSelector` implementation that randomly selects items in a batch.
class Reduction
: Type of reduction to be done by the n-gram op.
class RegexSplitter
: `RegexSplitter` splits text on the given regular expression.
class RoundRobinTrimmer
: A `Trimmer` that allocates a length budget to segments via round robin.
class SentencepieceTokenizer
: Tokenizes a tensor of UTF-8 strings.
class ShrinkLongestTrimmer
: A `Trimmer` that truncates the longest segment.
class SplitMergeFromLogitsTokenizer
: Tokenizes a tensor of UTF-8 strings into words according to logits.
class SplitMergeTokenizer
: Tokenizes a tensor of UTF-8 strings into words according to labels.
class Splitter
: An abstract base class for splitting text.
class SplitterWithOffsets
: An abstract base class for splitters that return offsets.
class StateBasedSentenceBreaker
: A `Splitter` that uses a state machine to determine sentence breaks.
class Tokenizer
: Base class for tokenizer implementations.
class TokenizerWithOffsets
: Base class for tokenizer implementations that return offsets.
class Trimmer
: Truncates a list of segments using a pre-determined truncation strategy.
class UnicodeCharTokenizer
: Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.
class UnicodeScriptTokenizer
: Tokenizes UTF-8 by splitting when there is a change in Unicode script.
class WaterfallTrimmer
: A `Trimmer` that allocates a length budget to segments in order.
class WhitespaceTokenizer
: Tokenizes a tensor of UTF-8 strings on whitespaces.
class WordShape
: Values for the 'pattern' arg of the wordshape op.
class WordpieceTokenizer
: Tokenizes a tensor of UTF-8 string tokens into subword pieces.
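The trimmer descriptions above differ only in how they divide a length budget among segments. The following pure-Python sketch illustrates the two allocation strategies named for `WaterfallTrimmer` ("in order") and `RoundRobinTrimmer` ("via round robin"). It is not the library's implementation: the function names `waterfall_lengths` and `round_robin_lengths` are hypothetical, and the sketch assumes the budget is counted in items per segment and that exhausted segments are skipped during round robin.

```python
def waterfall_lengths(segment_lengths, max_seq_length):
    """Allocate the budget to segments in order (hypothetical helper).

    Each segment keeps as many items as it has, limited by whatever
    budget the earlier segments left over.
    """
    remaining = max_seq_length
    kept = []
    for length in segment_lengths:
        take = min(length, remaining)
        kept.append(take)
        remaining -= take
    return kept


def round_robin_lengths(segment_lengths, max_seq_length):
    """Allocate the budget one item at a time, cycling over segments.

    Assumption: a segment that has no more items is skipped, so its
    share goes to the segments that still have items.
    """
    kept = [0] * len(segment_lengths)
    remaining = max_seq_length
    while remaining > 0:
        progressed = False
        for i, length in enumerate(segment_lengths):
            if remaining == 0:
                break
            if kept[i] < length:
                kept[i] += 1
                remaining -= 1
                progressed = True
        if not progressed:  # every segment is already fully kept
            break
    return kept


# Three segments of lengths 5, 3 and 4 with a total budget of 6 items.
print(waterfall_lengths([5, 3, 4], 6))    # [5, 1, 0]
print(round_robin_lengths([5, 3, 4], 6))  # [2, 2, 2]
```

In this sketch the waterfall strategy favors earlier segments, while round robin spreads the budget evenly; which one suits a model depends on whether the first segment is more important than the rest.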
Functions
boise_tags_to_offsets(...)
: Converts the token offsets and BOISE tags into span offsets and span type.
build_fast_bert_normalizer_model(...)
: build_fast_bert_normalizer_model(arg0: bool) -> bytes
build_fast_wordpiece_model(...)
: build_fast_wordpiece_model(arg0: list[str], arg1: int, arg2: str, arg3: str, arg4: bool, arg5: bool) -> bytes
case_fold_utf8(...)
: Applies case folding to every UTF-8 string in the input.
coerce_to_structurally_valid_utf8(...)
: Coerce UTF-8 input strings to structurally valid UTF-8.
combine_segments(...)
: Combine one or more input segments for a model's input sequence.
concatenate_segments(...)
: Concatenate input segments for a model's input sequence.
find_source_offsets(...)
: Maps the input post-normalized string offsets to pre-normalized offsets.
gather_with_default(...)
: Gather slices with `indices=-1` mapped to `default`.
greedy_constrained_sequence(...)
: Performs greedy constrained sequence on a batch of examples.
mask_language_model(...)
: Applies dynamic language model masking.
max_spanning_tree(...)
: Finds the maximum directed spanning tree of a digraph.
max_spanning_tree_gradient(...)
: Returns a subgradient of the MaximumSpanningTree op.
ngrams(...)
: Create a tensor of n-grams based on the input data `data`.
normalize_utf8(...)
: Normalizes each UTF-8 string in the input tensor using the specified rule.
normalize_utf8_with_offsets_map(...)
: Normalizes each UTF-8 string in the input tensor using the specified rule.
offsets_to_boise_tags(...)
: Converts the given tokens and spans in offsets format into BOISE tags.
pad_along_dimension(...)
: Add padding to the beginning and end of data in a specific dimension.
pad_model_inputs(...)
: Pad model input and generate corresponding input masks.
regex_split(...)
: Split `input` by delimiters that match a regex pattern.
regex_split_with_offsets(...)
: Split `input` by delimiters that match a regex pattern; returns offsets.
sentence_fragments(...)
: Find the sentence fragments in a given text. (deprecated)
sliding_window(...)
: Builds a sliding window for `data` with a specified width.
span_alignment(...)
: Return an alignment from a set of source spans to a set of target spans.
span_overlaps(...)
: Returns a boolean tensor indicating which source and target spans overlap.
utf8_binarize(...)
: Decode UTF8 tokens into code points and return their bits.
viterbi_constrained_sequence(...)
: Performs Viterbi constrained sequence decoding on a batch of examples.
wordshape(...)
: Determine wordshape features for each input string.
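As a rough illustration of what the `sliding_window(...)` and `ngrams(...)` entries describe, here is a pure-Python sketch on plain lists. It is not the library's implementation and does not use its signatures: the helper names `sliding_window_list` and `string_ngrams` and the `separator` parameter are hypothetical, and the sketch assumes that n-grams are formed by joining each window of tokens into one string.

```python
def sliding_window_list(data, width):
    """Return every contiguous run of `width` items from `data`.

    Hypothetical helper: yields an empty list when `data` is shorter
    than `width`.
    """
    return [data[i:i + width] for i in range(len(data) - width + 1)]


def string_ngrams(tokens, width, separator=" "):
    """Join each window of `width` tokens into a single n-gram string."""
    return [separator.join(window)
            for window in sliding_window_list(tokens, width)]


tokens = ["the", "quick", "brown", "fox"]
print(sliding_window_list([1, 2, 3, 4], 2))  # [[1, 2], [2, 3], [3, 4]]
print(string_ngrams(tokens, 2))  # ['the quick', 'quick brown', 'brown fox']
```

Building n-grams on top of a sliding window keeps the two ideas separate: the window decides which items are grouped, and the reduction (here, string joining) decides how each group is combined.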
Other Members | |
---|---|
version | '2.18.0' |