tf.contrib.lookup.index_table_from_file

View source on GitHub

Returns a lookup table that converts a string tensor into int64 IDs.

This operation constructs a lookup table to convert tensor of strings into int64 IDs. The mapping can be initialized from a vocabulary file specified in vocabulary_file, where the whole line is the key and the zero-based line number is the ID.

Any lookup of an out-of-vocabulary token will return a bucket ID based on its hash if num_oov_buckets is greater than zero. Otherwise it is assigned the default_value. The bucket ID range is [vocabulary size, vocabulary size + num_oov_buckets - 1].

The underlying table must be initialized by calling session.run(tf.compat.v1.tables_initializer()) or session.run(table.init()) once.

To specify multi-column vocabulary files, use key_column_index and value_column_index and delimiter.

  • TextFileIndex.LINE_NUMBER means use the line number starting from zero, expects data type int64.
  • TextFileIndex.WHOLE_LINE means use the whole line content, expects data type string.
  • A value >=0 means use the index (starting at zero) of the split line based on delimiter.

Sample Usages:

If we have a vocabulary file "test.txt" with the following content:

emerson
lake
palmer
features = tf.constant(["emerson", "lake", "and", "palmer"])
table = tf.lookup.index_table_from_file(
    vocabulary_file="test.txt", num_oov_buckets=1)
ids = table.lookup(features)
...
tf.compat.v1.tables_initializer().run()

ids.eval()  ==> [0, 1, 3, 2]  # where 3 is the out-of-vocabulary bucket

vocabulary_file The vocabulary filename, may be a constant scalar Tensor.
num_oov_buckets The number of out-of-vocabulary buckets.
vocab_size Number of the elements in the vocabulary, if known.
default_value The value to use for out-of-vocabulary feature values. Defaults to -1.
hasher_spec A HasherSpec to specify the hash function to use for assignation of out-of-vocabulary buckets.
key_dtype The key data type.
name A name for this op (optional).
key_column_index The column index from the text file to get the key values from. The default is to use the whole line content.
value_column_index The column index from the text file to get the value values from. The default is to use the line number, starting from zero.
delimiter The delimiter to separate fields in a line.

The lookup table to map a key_dtype Tensor to index int64 Tensor.

ValueError If vocabulary_file is not set.
ValueError If num_oov_buckets is negative or vocab_size is not greater than zero.