tf.keras.layers.experimental.preprocessing.StringLookup

Maps strings from a vocabulary to integer indices.

Inherits From: PreprocessingLayer, Layer, Module

This layer translates a set of arbitrary strings into an integer output via a table-based lookup, with optional out-of-vocabulary handling.

If desired, the user can call this layer's adapt() method on a data set, which will analyze the data set, determine the frequency of individual string values, and create a vocabulary from them. This vocabulary can have unlimited size or be capped, depending on the configuration options for this layer; if there are more unique values in the input than the maximum vocabulary size, the most frequent terms will be used to create the vocabulary.

Examples:

Creating a lookup layer with a known vocabulary

This example creates a lookup layer with a pre-existing vocabulary.

vocab = ["a", "b", "c", "d"]
data = tf.constant([["a", "c", "d"], ["d", "z", "b"]])
layer = StringLookup(vocabulary=vocab)
layer(data)
<tf.Tensor: shape=(2, 3), dtype=int64, numpy=
array([[2, 4, 5],
       [5, 1, 3]])>

Creating a lookup layer with an adapted vocabulary

This example creates a lookup layer and generates the vocabulary by analyzing the dataset.

data = tf.constant([["a", "c", "d"], ["d", "z", "b"]])
layer = StringLookup()
layer.adapt(data)
layer.get_vocabulary()
['', '[UNK]', 'd', 'z', 'c', 'b', 'a']

Note how the mask token '' and the OOV token [UNK] have been added to the vocabulary. The remaining tokens are sorted by frequency ('d', which has 2 occurrences, is first) then by inverse sort order.

data = tf.constant([["a", "c", "d"], ["d", "z", "b"]])
layer = StringLookup()
layer.adapt(data)
layer(data)
<tf.Tensor: shape=(2, 3), dtype=int64, numpy=
array([[6, 4, 2],
       [2, 3, 5]])>

Lookups with multiple OOV tokens.

This example demonstrates how to use a lookup layer with multiple OOV tokens. When a layer is created with more than one OOV token, any OOV values are hashed into the number of OOV buckets, distributing OOV values in a deterministic fashion across the set.

vocab = ["a", "b", "c", "d"]
data = tf.constant([["a", "c", "d"], ["m", "z", "b"]])
layer = StringLookup(vocabulary=vocab, num_oov_indices=2)
layer(data)
<tf.Tensor: shape=(2, 3), dtype=int64, numpy=
array([[3, 5, 6],
       [1, 2, 4]])>

Note that the output for OOV value 'm' is 1, while the output for OOV value 'z' is 2. The in-vocab terms have their output index increased by 1 from earlier examples (a maps to 3, etc) in order to make space for the extra OOV value.

Inverse lookup

This example demonstrates how to map indices to strings using this layer. (You can also use adapt() with inverse=True, but for simplicity we'll pass the vocab in this example.)

vocab = ["a", "b", "c", "d"]
data = tf.constant([[1, 3, 4], [4, 5, 2]])
layer = StringLookup(vocabulary=vocab, invert=True)
layer(data)
<tf.Tensor: shape=(2, 3), dtype=string, numpy=
array([[b'a', b'c', b'd'],
       [b'd', b'[UNK]', b'b']], dtype=object)>

Note that the integer 5, which is out of the vocabulary space, returns an OOV token.

Forward and inverse lookup pairs

This example demonstrates how to use the vocabulary of a standard lookup layer to create an inverse lookup layer.

vocab = ["a", "b", "c", "d"]
data = tf.constant([["a", "c", "d"], ["d", "z", "b"]])
layer = StringLookup(vocabulary=vocab)
i_layer = StringLookup(vocabulary=layer.get_vocabulary(), invert=True)
int_data = layer(data)
i_layer(int_data)
<tf.Tensor: shape=(2, 3), dtype=string, numpy=
array([[b'a', b'c', b'd'],
       [b'd', b'[UNK]', b'b']], dtype=object)>

In this example, the input value 'z' resulted in an output of '[UNK]', since 1000 was not in the vocabulary - it got represented as an OOV, and all OOV values are returned as '[OOV}' in the inverse layer. Also, note that for the inverse to work, you must have already set the forward layer vocabulary either directly or via fit() before calling get_vocabulary().

max_tokens The maximum size of the vocabulary for this layer. If None, there is no cap on the size of the vocabulary. Note that this vocabulary includes the OOV and mask tokens, so the effective number of tokens is (max_tokens - num_oov_indices - (1 if mask_token else 0))
num_oov_indices The number of out-of-vocabulary tokens to use; defaults to

  1. If this value is more than 1, OOV inputs are hashed to determine their OOV value; if this value is 0, passing an OOV input will result in a '-1' being returned for that value in the output tensor. (Note that, because the value is -1 and not 0, this will allow you to effectively drop OOV values from categorical encodings.)
mask_token A token that represents masked values, and which is mapped to index 0. Defaults to the empty string "". If set to None, no mask term will be added and the OOV tokens, if any, will be indexed from (0...num_oov_indices) instead of (1...num_oov_indices+1).
oov_token The token representing an out-of-vocabulary value. Defaults to "[UNK]".
vocabulary An optional list of vocabulary terms, or a path to a text file containing a vocabulary to load into this layer. The file should contain one token per line. If the list or file contains the same token multiple times, an error will be thrown.
encoding The Python string encoding to use. Defaults to 'utf-8'.
invert If true, this layer will map indices to vocabulary items instead of mapping vocabulary items to indices.

Methods

adapt

View source

Fits the state of the preprocessing layer to the dataset.

Overrides the default adapt method to apply relevant preprocessing to the inputs before passing to the combiner.

Arguments
data The data to train on. It can be passed either as a tf.data Dataset, or as a numpy array.
reset_state Optional argument specifying whether to clear the state of the layer at the start of the call to adapt. This must be True for this layer, which does not support repeated calls to adapt.

get_vocabulary

View source

set_vocabulary

View source

Sets vocabulary data for this layer with inverse=False.

This method sets the vocabulary for this layer directly, instead of analyzing a dataset through 'adapt'. It should be used whenever the vocab information is already known. If vocabulary data is already present in the layer, this method will either replace it

Arguments
vocab An array of string tokens.

Raises
ValueError If there are too many inputs, the inputs do not match, or input data is missing.

vocab_size

View source