Encodes the strings into an IBLT data structure.

The IBLT is a numpy array of shape [repetitions, table_size, num_chunks+2]. For repetition r, its value at index (r, h, c) is: the sum of chunk c over the keys hashing to h in r if c < num_chunks; the sum of the counts of the keys hashing to h in r if c = num_chunks; and the sum of the checks of the keys hashing to h in r if c = num_chunks + 1.
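The layout above can be illustrated with a small pure-Python sketch. It is not the library's implementation: the function names (`toy_iblt_encode`, `_toy_hash`), the SHA-256 based hashing, the 3-byte chunk width, and the use of nested lists instead of a numpy array are all assumptions made for illustration. Each occurrence of a string adds its chunks, a count of 1, and a check value to one cell per repetition, modulo the field size.

```python
import hashlib


def _toy_hash(data: bytes, salt: str, modulus: int) -> int:
    """Hypothetical stand-in hash: SHA-256 of salt + data, reduced mod `modulus`."""
    digest = hashlib.sha256(salt.encode("utf-8") + data).digest()
    return int.from_bytes(digest[:8], "big") % modulus


def toy_iblt_encode(strings, repetitions=3, table_size=8, num_chunks=2,
                    chunk_bytes=3, field_size=2**31 - 1, seed=0):
    """Builds a [repetitions][table_size][num_chunks + 2] table of integers."""
    max_bytes = num_chunks * chunk_bytes
    table = [[[0] * (num_chunks + 2) for _ in range(table_size)]
             for _ in range(repetitions)]
    for s in strings:
        data = s.encode("utf-8")
        if len(data) > max_bytes:
            raise ValueError(f"{s!r} exceeds {max_bytes} bytes")
        # Split the zero-padded bytes into fixed-width integer chunks.
        padded = data.ljust(max_bytes, b"\0")
        chunks = [
            int.from_bytes(padded[i * chunk_bytes:(i + 1) * chunk_bytes], "big")
            for i in range(num_chunks)
        ]
        check = _toy_hash(data, f"check-{seed}", field_size)
        for r in range(repetitions):
            h = _toy_hash(data, f"rep-{seed}-{r}", table_size)
            cell = table[r][h]
            for c in range(num_chunks):
                # c < num_chunks: sum of chunk c of keys hashing to h in r.
                cell[c] = (cell[c] + chunks[c]) % field_size
            # c = num_chunks: sum of counts of keys hashing to h in r.
            cell[num_chunks] = (cell[num_chunks] + 1) % field_size
            # c = num_chunks + 1: sum of checks of keys hashing to h in r.
            cell[num_chunks + 1] = (cell[num_chunks + 1] + check) % field_size
    return table


table = toy_iblt_encode(["a", "b", "a"])
# Every string lands in exactly one cell per repetition, so the count
# column of each repetition sums to the number of inserted strings.
print([sum(cell[2] for cell in table[r]) for r in range(3)])
```

Because every entry is a sum, tables built from different inputs with the same parameters can be combined by element-wise addition modulo the field size.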

capacity Number of distinct strings that we expect to be inserted.
string_max_bytes Maximum length in bytes of a string that can be inserted.
encoding The character encoding of the string data to encode. For non-character binary data or strings with unknown encoding, specify CharacterEncoding.UNKNOWN. Defaults to CharacterEncoding.UTF8.
drop_strings_above_max_length If True, strings above string_max_bytes will be dropped when constructing the IBLT. Defaults to False.
seed Integer seed for hash functions. Defaults to 0.
repetitions Number of repetitions in IBLT data structure (must be >= 3). Defaults to 3.
hash_family String specifying the hash family used to construct the IBLT. Options include coupled or random; the default is chosen based on capacity.
hash_family_params A dict of parameters that the hash family hasher expects. Defaults are chosen based on capacity.
field_size The field size for all values in IBLT. Defaults to 2**31 - 1.



Returns a Tensor containing integer chunks for the input strings.

input_strings A tensor of strings.

A 2D tensor in which each row holds the integer chunks of the string at that row index, together with a trimmed input_strings that can fit in the IBLT.
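As a rough illustration of that return value, the sketch below converts strings to rows of integer chunks using plain Python lists rather than tensors. The function name `toy_compute_chunks`, the 3-byte chunk width, and the choice to truncate over-long strings to string_max_bytes when they are not dropped are assumptions for illustration, not confirmed behavior of the library.

```python
def toy_compute_chunks(input_strings, string_max_bytes=6, chunk_bytes=3,
                       drop_strings_above_max_length=False):
    """Returns (chunks, trimmed): one row of integer chunks per kept string."""
    # Number of chunks needed to cover string_max_bytes (ceiling division).
    num_chunks = -(-string_max_bytes // chunk_bytes)
    chunks, trimmed = [], []
    for s in input_strings:
        data = s.encode("utf-8")
        if len(data) > string_max_bytes:
            if drop_strings_above_max_length:
                continue  # the string is left out entirely
            # Assumed behavior: keep only the first string_max_bytes bytes.
            # Note this can cut a multi-byte UTF-8 character in half.
            data = data[:string_max_bytes]
        padded = data.ljust(num_chunks * chunk_bytes, b"\0")
        chunks.append([
            int.from_bytes(padded[i * chunk_bytes:(i + 1) * chunk_bytes], "big")
            for i in range(num_chunks)
        ])
        trimmed.append(data)
    return chunks, trimmed


chunks, trimmed = toy_compute_chunks(["hi", "toolongstring"])
print(trimmed)  # the second string is cut to its first 6 bytes
```

Row i of `chunks` corresponds to `trimmed[i]`, so the two outputs stay aligned even when some inputs are dropped.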


Returns a Tensor containing the values of the IBLT data structure.

input_strings A 1D tensor of strings.
input_counts A 1D tensor of tf.int64 representing the count of each string.

A tensor of shape [repetitions, table_size, num_chunks+2] whose value at index (r, h, c) corresponds to chunk c of the keys if c < num_chunks, to the counts if c = num_chunks, and to the checks if c = num_chunks + 1.