Generates an approximate vocabulary for x
and maps each value of x to an integer.
tft.experimental.compute_and_apply_approximate_vocabulary(
    x: common_types.ConsistentTensorType,
    default_value: Any = -1,
    top_k: Optional[int] = None,
    num_oov_buckets: int = 0,
    vocab_filename: Optional[str] = None,
    weights: Optional[tf.Tensor] = None,
    file_format: common_types.VocabularyFileFormatType = analyzers.DEFAULT_VOCABULARY_FILE_FORMAT,
    name: Optional[str] = None
) -> common_types.ConsistentTensorType
Args |
x
|
A Tensor, SparseTensor, or RaggedTensor of type tf.string or
tf.int[8|16|32|64].
|
default_value
|
The value to use for out-of-vocabulary values, unless
'num_oov_buckets' is greater than zero.
|
top_k
|
Limit the generated vocabulary to the first top_k elements. If set
to None, the full vocabulary is generated.
|
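The effect of top_k can be illustrated with a minimal pure-Python sketch. It uses exact counting as a stand-in for the approximate counting the analyzer performs, and it assumes the vocabulary is ordered by descending frequency with a lexicographic tie-break. The helper name build_vocabulary is hypothetical and is not part of tf.Transform.

```python
from collections import Counter


def build_vocabulary(tokens, top_k=None):
    """Return unique tokens by descending count, truncated to top_k.

    Exact counting stands in for the approximate analyzer; the
    lexicographic tie-break is an assumption made for determinism.
    """
    if top_k is not None and top_k < 0:
        raise ValueError("top_k must not be negative")
    counts = Counter(tokens)
    # Most frequent first; ties broken alphabetically (assumed).
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return ordered if top_k is None else ordered[:top_k]


tokens = ["a", "b", "a", "c", "b", "a"]
full_vocab = build_vocabulary(tokens)              # ['a', 'b', 'c']
limited_vocab = build_vocabulary(tokens, top_k=2)  # ['a', 'b']
```

With top_k=None every unique token is kept; with top_k=2 only the two most frequent tokens remain, so "c" would become out-of-vocabulary.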
num_oov_buckets
|
If num_oov_buckets is greater than zero, any lookup of an
out-of-vocabulary token returns a bucket ID based on its hash.
Otherwise the token is assigned default_value.
|
vocab_filename
|
The file name for the vocabulary file. If None, a name based
on the scope name in the context of this graph will be used as the file
name. If not None, should be unique within a given preprocessing function.
NOTE: To keep your pipelines resilient to implementation details,
set vocab_filename explicitly whenever a downstream component uses
the vocab_filename.
|
weights
|
(Optional) Weights Tensor for the vocabulary. It must have the
same shape as x.
|
file_format
|
(Optional) A str. The format of the resulting vocabulary file.
Accepted formats are: 'tfrecord_gzip', 'text'. 'tfrecord_gzip' requires
tensorflow>=2.4. The default value is 'text'.
|
name
|
(Optional) A name for this operation.
|
Returns |
A Tensor, SparseTensor, or RaggedTensor where each string value is
mapped to an integer. Each unique string value that appears in the
vocabulary is mapped to a different integer, and the integers are
consecutive starting from zero. A string value not in the vocabulary is
assigned default_value. Alternatively, if num_oov_buckets is greater than
zero, out-of-vocabulary strings are hashed to values in
[vocab_size, vocab_size + num_oov_buckets), for an overall range of
[0, vocab_size + num_oov_buckets).
|
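The mapping described under Returns can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: the helper apply_vocabulary is hypothetical, and zlib.crc32 is only a stand-in for whatever hash the real lookup uses, so the bucket assigned to a given token will generally differ from what tf.Transform produces.

```python
import zlib


def apply_vocabulary(tokens, vocab, default_value=-1, num_oov_buckets=0):
    """Map tokens to integer IDs following the documented output ranges."""
    index = {token: i for i, token in enumerate(vocab)}
    vocab_size = len(vocab)
    ids = []
    for token in tokens:
        if token in index:
            # In-vocabulary tokens get consecutive IDs starting at zero.
            ids.append(index[token])
        elif num_oov_buckets > 0:
            # Stand-in hash (assumption): crc32 of the UTF-8 bytes.
            bucket = zlib.crc32(token.encode("utf-8")) % num_oov_buckets
            # OOV IDs fall in [vocab_size, vocab_size + num_oov_buckets).
            ids.append(vocab_size + bucket)
        else:
            ids.append(default_value)
    return ids


vocab = ["a", "b"]
without_buckets = apply_vocabulary(["a", "z", "b"], vocab)
with_buckets = apply_vocabulary(["a", "z", "b"], vocab, num_oov_buckets=3)
```

Without OOV buckets, the unknown token "z" receives default_value (-1 here). With num_oov_buckets=3, it instead receives some ID in [2, 5), and default_value is not used.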
Raises |
ValueError
|
If top_k is negative.
If file_format is not in the list of allowed formats.
If x.dtype is not string or integral.
|
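A possible way to call this function from a preprocessing function is sketched below. The feature name "query", the output key "query_ids", the file name "query_vocab", and the chosen top_k and num_oov_buckets values are all assumptions made for illustration; only the function name and its keyword arguments come from the signature above.

```python
def preprocessing_fn(inputs):
    """Hypothetical preprocessing_fn; the 'query' feature is assumed."""
    # Imported inside the function so this sketch can be loaded in an
    # environment where tensorflow_transform is not installed.
    import tensorflow_transform as tft

    query_ids = tft.experimental.compute_and_apply_approximate_vocabulary(
        inputs["query"],
        top_k=10000,                   # keep at most 10,000 vocabulary entries
        num_oov_buckets=10,            # hash unknown tokens into 10 buckets
        vocab_filename="query_vocab",  # explicit name for downstream use
    )
    return {"query_ids": query_ids}
```

Setting vocab_filename explicitly follows the note in the Args section, since a downstream component can then refer to the vocabulary by a stable name.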