tft.experimental.compute_and_apply_approximate_vocabulary


Generates an approximate vocabulary for x and maps each value in x to an integer ID.

x A Tensor, SparseTensor, or RaggedTensor of type tf.string or tf.int[8|16|32|64].
default_value The value to use for out-of-vocabulary values, unless 'num_oov_buckets' is greater than zero.
top_k Limit the generated vocabulary to the first top_k elements. If set to None, the full vocabulary is generated.
num_oov_buckets If num_oov_buckets is greater than zero, any lookup of an out-of-vocabulary token returns a bucket ID based on the token's hash. Otherwise the token is assigned the default_value.
vocab_filename The file name for the vocabulary file. If None, a name based on the scope name in the context of this graph is used as the file name. If not None, it should be unique within a given preprocessing function. NOTE: To make your pipelines resilient to implementation details, set vocab_filename explicitly whenever a downstream component uses the vocabulary file.
weights (Optional) Weights Tensor for the vocabulary. It must have the same shape as x.
file_format (Optional) A str. The format of the resulting vocabulary file. Accepted formats are: 'tfrecord_gzip', 'text'. 'tfrecord_gzip' requires tensorflow>=2.4. The default value is 'text'.
name (Optional) A name for this operation.
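
To illustrate how top_k and weights could shape the vocabulary, here is a minimal pure-Python sketch. It is not the library's implementation: it counts exactly rather than approximately, and it assumes the vocabulary is ordered by descending (weighted) count with ties broken alphabetically. The helper name build_vocabulary is hypothetical.

```python
from collections import Counter


def build_vocabulary(values, top_k=None, weights=None):
    """Return tokens ordered by descending (weighted) count.

    Illustrative only: exact counting stands in for the approximate
    counting the real transform performs, and the ordering and
    tie-breaking rules are assumptions.
    """
    if top_k is not None and top_k < 0:
        raise ValueError("top_k must be non-negative")
    if weights is None:
        # Without weights, every occurrence counts once.
        weights = [1.0] * len(values)
    if len(weights) != len(values):
        raise ValueError("weights must have the same shape as values")

    totals = Counter()
    for value, weight in zip(values, weights):
        totals[value] += weight

    # Highest total first; ties broken by token so the result is deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    tokens = [token for token, _ in ranked]
    # top_k=None keeps the full vocabulary.
    return tokens if top_k is None else tokens[:top_k]
```

With the input ["a", "b", "a", "c", "b", "a"], this sketch keeps "a" and "b" when top_k is 2. A large weight on the single "c" occurrence moves "c" to the front.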

A Tensor, SparseTensor, or RaggedTensor where each string value is mapped to an integer. Each unique string value that appears in the vocabulary is mapped to a different integer, and the integers are consecutive starting from zero. A string value that is not in the vocabulary is assigned default_value. Alternatively, if num_oov_buckets is greater than zero, out-of-vocabulary strings are hashed to values in [vocab_size, vocab_size + num_oov_buckets), for an overall range of [0, vocab_size + num_oov_buckets).
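
The mapping described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code. The helper apply_vocabulary is hypothetical, its default_value of -1 is chosen only for the example, and zlib.crc32 stands in for whatever hash function the real lookup uses.

```python
import zlib


def apply_vocabulary(values, vocabulary, default_value=-1, num_oov_buckets=0):
    """Map each string to an integer ID following the rules described above."""
    # In-vocabulary tokens get consecutive IDs starting from zero.
    token_to_id = {token: index for index, token in enumerate(vocabulary)}
    vocab_size = len(vocabulary)

    ids = []
    for value in values:
        if value in token_to_id:
            ids.append(token_to_id[value])
        elif num_oov_buckets > 0:
            # Assumption: crc32 is a stand-in hash, chosen because it is
            # deterministic across runs. The result lands in
            # [vocab_size, vocab_size + num_oov_buckets).
            bucket = zlib.crc32(value.encode("utf-8")) % num_oov_buckets
            ids.append(vocab_size + bucket)
        else:
            # No OOV buckets: fall back to default_value.
            ids.append(default_value)
    return ids
```

With a two-token vocabulary and num_oov_buckets set to 3, every returned ID falls in [0, 5). With num_oov_buckets left at 0, unknown tokens all map to default_value.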

ValueError If top_k is negative, if file_format is not in the list of allowed formats, or if x.dtype is not string or integral.