tft.experimental.compute_and_apply_approximate_vocabulary

Generates an approximate vocabulary for `x` and maps each value of `x` to an integer.

tft.experimental.compute_and_apply_approximate_vocabulary(
    x: common_types.ConsistentTensorType,
    *,
    default_value: Any = -1,
    top_k: Optional[int] = None,
    num_oov_buckets: int = 0,
    vocab_filename: Optional[str] = None,
    weights: Optional[tf.Tensor] = None,
    file_format: common_types.VocabularyFileFormatType = analyzers.DEFAULT_VOCABULARY_FILE_FORMAT,
    store_frequency: Optional[bool] = False,
    reserved_tokens: Optional[Union[Sequence[str], tf.Tensor]] = None,
    name: Optional[str] = None
) -> common_types.ConsistentTensorType

Args
`x`	A `Tensor`, `SparseTensor`, or `RaggedTensor` of type `tf.string` or `tf.int[8|16|32|64]`.
`default_value`	The value to use for out-of-vocabulary values, unless `num_oov_buckets` is greater than zero.
`top_k`	Limit the generated vocabulary to the first `top_k` elements. If set to None, the full vocabulary is generated.
`num_oov_buckets`	Any lookup of an out-of-vocabulary token will return a bucket ID based on its hash if `num_oov_buckets` is greater than zero. Otherwise it is assigned the `default_value`.
`vocab_filename`	The file name for the vocabulary file. If None, a name based on the scope name in the context of this graph is used as the file name. If not None, it should be unique within a given preprocessing function. NOTE: To make your pipelines resilient to implementation details, set `vocab_filename` explicitly whenever a downstream component uses the vocabulary file name.
`weights`	(Optional) Weights `Tensor` for the vocabulary. It must have the same shape as `x`.
`file_format`	(Optional) A str. The format of the resulting vocabulary file. Accepted formats are: 'tfrecord_gzip', 'text'. 'tfrecord_gzip' requires tensorflow>=2.4. The default value is 'text'.
`store_frequency`	If True, the frequency of each word is stored in the vocabulary file. If labels are provided, the mutual information is stored in the file instead. Each line in the file will be of the form 'frequency word'. NOTE: if True and `file_format` is 'text', spaces will be replaced to avoid information loss.
`reserved_tokens`	(Optional) A list of tokens that should appear in the vocabulary regardless of their appearance in the input. These tokens maintain their order and have a reserved spot at the beginning of the vocabulary. Note: this field has no effect on cache.
`name`	(Optional) A name for this operation.
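
As a rough illustration of how these arguments fit together, the following sketch calls the function from inside a preprocessing function. The feature name `"query"`, the output key `"query_id"`, the file name `"query_vocab"`, and the numeric values are all hypothetical and chosen only for the example.

```python
def preprocessing_fn(inputs):
    """Hypothetical preprocessing function; 'query' is an illustrative feature name."""
    # Imported inside the function so this sketch can be loaded even where
    # tensorflow_transform is not installed.
    import tensorflow_transform as tft

    query_ids = tft.experimental.compute_and_apply_approximate_vocabulary(
        inputs["query"],
        top_k=1000,                    # keep at most 1000 vocabulary entries
        num_oov_buckets=10,            # hash unseen tokens into 10 extra IDs
        vocab_filename="query_vocab",  # explicit name for downstream components
    )
    return {"query_id": query_ids}
```

Because `num_oov_buckets` is greater than zero here, `default_value` is not used for out-of-vocabulary tokens.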

Returns
A `Tensor`, `SparseTensor`, or `RaggedTensor` where each string value is mapped to an integer. Each unique string value that appears in the vocabulary is mapped to a different integer, and the integers are consecutive starting from zero. A string value that is not in the vocabulary is assigned `default_value`. Alternatively, if `num_oov_buckets` is specified, out-of-vocabulary strings are hashed to values in `[vocab_size, vocab_size + num_oov_buckets)` for an overall range of `[0, vocab_size + num_oov_buckets)`.
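
The mapping rules above can be sketched in plain Python. This is only a model of the described behavior, not the library's implementation: the helper `apply_vocabulary` is hypothetical, and CRC32 stands in for whatever hash the real lookup uses, so actual bucket assignments will differ.

```python
import zlib


def apply_vocabulary(tokens, vocab, default_value=-1, num_oov_buckets=0):
    """Map tokens to integer IDs following the rules described above."""
    index = {token: i for i, token in enumerate(vocab)}
    vocab_size = len(vocab)
    ids = []
    for token in tokens:
        if token in index:
            # In-vocabulary tokens get consecutive IDs starting from zero.
            ids.append(index[token])
        elif num_oov_buckets > 0:
            # Stand-in hash (assumption): CRC32 modulo the bucket count.
            bucket = zlib.crc32(token.encode("utf-8")) % num_oov_buckets
            ids.append(vocab_size + bucket)
        else:
            # Without OOV buckets, unseen tokens receive default_value.
            ids.append(default_value)
    return ids


vocab = ["the", "cat", "sat"]
without_buckets = apply_vocabulary(["cat", "dog", "the"], vocab)
with_buckets = apply_vocabulary(["cat", "dog"], vocab, num_oov_buckets=2)
print(without_buckets)  # [1, -1, 0]
print(with_buckets)     # "dog" lands in one of the OOV IDs 3 or 4
```

With three vocabulary entries and two buckets, every returned ID falls in `[0, 5)`, matching the overall range `[0, vocab_size + num_oov_buckets)`.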

Raises
`ValueError`	If `top_k` is negative. If `file_format` is not in the list of allowed formats. If `x.dtype` is not string or integral.
