Generates a vocabulary for x and maps each value in x to an integer using that vocabulary.

If a token contains the '\n' or '\r' character, or is empty, it is discarded, because vocabularies are currently written as text files. This behavior will likely be fixed or improved in the future.

Note that this function will cause a vocabulary to be computed. For large datasets it is highly recommended to either set frequency_threshold or top_k to control the size of the vocabulary, and also the run time of this operation.

x A Tensor, SparseTensor, or RaggedTensor of type tf.string or tf.int[8|16|32|64].
default_value The value to use for out-of-vocabulary values, unless 'num_oov_buckets' is greater than zero.
top_k Limit the generated vocabulary to the first top_k elements. If set to None, the full vocabulary is generated.
frequency_threshold Limit the generated vocabulary to elements whose absolute frequency is >= the supplied threshold. If set to None, the full vocabulary is generated. Absolute frequency means the number of occurrences of the element in the dataset, as opposed to the proportion of instances that contain that element. If labels are provided and the vocab is computed using mutual information, tokens are filtered out if their mutual information with the label is < the supplied threshold.
num_oov_buckets Any lookup of an out-of-vocabulary token will return a bucket ID based on its hash if num_oov_buckets is greater than zero. Otherwise it is assigned the default_value.
vocab_filename The file name for the vocabulary file. If None, a name based on the scope name in the context of this graph will be used as the file name. If not None, it should be unique within a given preprocessing function. NOTE: to make your pipelines resilient to implementation details, set vocab_filename explicitly whenever a downstream component uses the vocabulary file.
weights (Optional) Weights Tensor for the vocabulary. It must have the same shape as x.
labels (Optional) A Tensor of labels for the vocabulary. If provided, the vocabulary is calculated based on mutual information with the label, rather than frequency. The labels must have the same batch dimension as x. If x is sparse, labels should be a 1D tensor reflecting row-wise labels. If x is dense, labels can either be a 1D tensor of row-wise labels, or a dense tensor of the identical shape as x (i.e. element-wise labels). Labels should be a discrete integerized tensor (if the label is numeric, it should first be bucketized; if the label is a string, an integer vocabulary should first be applied). Note: CompositeTensor labels are not yet supported (b/134931826). WARNING: when labels are provided, the frequency_threshold argument functions as a mutual information threshold, which is a float.
use_adjusted_mutual_info If true, use adjusted mutual information.
min_diff_from_avg Mutual information of a feature will be adjusted to zero whenever the difference between the count of the feature with any label and its expected count is lower than min_diff_from_avg.
coverage_top_k (Optional), (Experimental) The minimum number of elements per key to be included in the vocabulary.
coverage_frequency_threshold (Optional), (Experimental) Limit the coverage arm of the vocabulary only to elements whose absolute frequency is >= this threshold for a given key.
key_fn (Optional), (Experimental) A fn that takes in a single entry of x and returns the corresponding key for coverage calculation. If this is None, no coverage arm is added to the vocabulary.
fingerprint_shuffle (Optional), (Experimental) Whether to sort the vocabularies by fingerprint instead of counts. This is useful for load balancing on the training parameter servers. Shuffle only happens while writing the files, so all the filters above will still take effect.
file_format (Optional) A str. The format of the resulting vocabulary file. Accepted formats are: 'tfrecord_gzip', 'text'. 'tfrecord_gzip' requires tensorflow>=2.4. The default value is 'text'.
store_frequency If True, the frequency of each word is stored in the vocabulary file. If labels are provided, the mutual information is stored in the file instead. Each line in the file will be of the form 'frequency word'. NOTE: if True and file_format is 'text', spaces will be replaced to avoid information loss.
reserved_tokens (Optional) A list of tokens that should appear in the vocabulary regardless of their appearance in the input. These tokens maintain their order and have a reserved spot at the beginning of the vocabulary. Note: this field has no effect on cache.
name (Optional) A name for this operation.
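
The effect of frequency_threshold and top_k described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name build_vocab is hypothetical, the alphabetical tie-breaking is an assumption, and mutual information, weights, coverage and reserved tokens are left out.

```python
from collections import Counter


def build_vocab(tokens, top_k=None, frequency_threshold=None):
    """Hypothetical helper: builds a frequency-ordered vocabulary list."""
    if top_k is not None and top_k < 0:
        raise ValueError("top_k must be non-negative")
    if frequency_threshold is not None and frequency_threshold < 0:
        raise ValueError("frequency_threshold must be non-negative")

    # Discard empty tokens and tokens containing '\n' or '\r', mirroring
    # the restriction stated for text vocabulary files.
    counts = Counter(
        t for t in tokens if t and "\n" not in t and "\r" not in t
    )

    # Order by descending absolute frequency. Breaking ties alphabetically
    # is an assumption made only to keep this sketch deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

    # Keep only tokens whose absolute frequency is >= the threshold.
    if frequency_threshold is not None:
        ranked = [(t, c) for t, c in ranked if c >= frequency_threshold]

    # Keep only the first top_k tokens of the ranked list.
    if top_k is not None:
        ranked = ranked[:top_k]

    return [t for t, _ in ranked]


tokens = ["a", "b", "a", "c", "b", "a", "", "d\n"]
full_vocab = build_vocab(tokens)
thresholded_vocab = build_vocab(tokens, frequency_threshold=2)
top1_vocab = build_vocab(tokens, top_k=1)
```

Here the empty token and "d\n" never enter the vocabulary, and setting either limit shrinks the result, which is why the note above recommends setting one of them for large datasets.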

A Tensor, SparseTensor, or RaggedTensor where each string value is mapped to an integer. Each unique string value that appears in the vocabulary is mapped to a different integer, and the integers are consecutive starting from zero. A string value not in the vocabulary is assigned default_value. Alternatively, if num_oov_buckets is greater than zero, out-of-vocabulary strings are hashed to values in [vocab_size, vocab_size + num_oov_buckets), for an overall range of [0, vocab_size + num_oov_buckets).
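
The mapping of in-vocabulary and out-of-vocabulary values can be sketched as follows. This is only an illustration of the ranges described above: the helper name apply_vocab is hypothetical, and the SHA-256 based bucket assignment is an assumption standing in for whatever hash the library actually uses, so the specific bucket a token lands in will differ.

```python
import hashlib


def apply_vocab(tokens, vocab, default_value=-1, num_oov_buckets=0):
    """Hypothetical helper: maps tokens to integer ids using a vocab list."""
    index = {tok: i for i, tok in enumerate(vocab)}
    vocab_size = len(vocab)
    ids = []
    for tok in tokens:
        if tok in index:
            # In-vocabulary tokens get consecutive ids starting from zero.
            ids.append(index[tok])
        elif num_oov_buckets > 0:
            # Out-of-vocabulary tokens are hashed into
            # [vocab_size, vocab_size + num_oov_buckets).
            # The hash function here is an assumption, not the library's.
            digest = hashlib.sha256(tok.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:8], "big") % num_oov_buckets
            ids.append(vocab_size + bucket)
        else:
            # Without OOV buckets, default_value is used.
            ids.append(default_value)
    return ids


vocab = ["a", "b", "c"]
ids_default = apply_vocab(["a", "c", "zzz"], vocab)
ids_bucketed = apply_vocab(["a", "c", "zzz"], vocab, num_oov_buckets=2)
```

With the defaults, the unknown token "zzz" receives default_value. With num_oov_buckets=2, it receives either 3 or 4, so every output id falls in [0, vocab_size + num_oov_buckets).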

ValueError If top_k or frequency_threshold is negative. If coverage_top_k or coverage_frequency_threshold is negative.