
Builds a graph based on dense embeddings and persists it in TSV format.

This function reads input instances from one or more TFRecord files, each containing tf.train.Example protos. Each input example is expected to contain at least the following 2 features:

  • id: A singleton bytes_list feature that identifies each example.
  • embedding: A float_list feature that contains the (dense) embedding of each example.

id and embedding are not necessarily the literal feature names; if your features have different names, you can specify them using the id_feature_name and embedding_feature_name arguments, respectively.
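As an illustration of the expected input, the following is a minimal sketch of writing such examples to a TFRecord file. The helper names make_example and write_embeddings are hypothetical and not part of this library; the sketch assumes IDs are given as Python strings and encodes them as UTF-8 bytes.

```python
import tensorflow as tf


def make_example(example_id, embedding, id_feature_name="id",
                 embedding_feature_name="embedding"):
    """Builds a tf.train.Example with a singleton bytes ID and a float embedding.

    Hypothetical helper for illustration; the feature names default to the
    ones described above and can be overridden.
    """
    features = {
        # Singleton bytes_list feature identifying the example.
        id_feature_name: tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[example_id.encode("utf-8")])),
        # float_list feature holding the dense embedding.
        embedding_feature_name: tf.train.Feature(
            float_list=tf.train.FloatList(value=embedding)),
    }
    return tf.train.Example(features=tf.train.Features(feature=features))


def write_embeddings(path, embeddings):
    """Writes a dict mapping ID -> embedding to a TFRecord file at `path`."""
    with tf.io.TFRecordWriter(path) as writer:
        for example_id, embedding in embeddings.items():
            writer.write(make_example(example_id, embedding).SerializeToString())
```

If the features are written under other names, the same names would need to be passed as id_feature_name and embedding_feature_name when building the graph.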

This function then computes the cosine similarity between all pairs of input examples based on their associated embeddings. For each pair whose similarity is greater than or equal to similarity_threshold, an edge is written to the TSV file named by output_graph_path. Each output edge is represented by a TSV line of the following form:

    source_id<TAB>target_id<TAB>edge_weight

All edges in the output will be symmetric (i.e., if edge A--w-->B exists in the output, then so will edge B--w-->A).
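The thresholding and symmetry rules can be sketched in plain Python as follows. This is a brute-force illustration, not the library's implementation: the function names are hypothetical, embeddings are assumed to be given as a dict mapping ID to a list of floats, a zero-norm embedding is assumed to have similarity 0.0, and each line is assumed to hold the source ID, target ID, and weight separated by tabs.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either norm is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Assumption: treat degenerate embeddings as dissimilar.
    return dot / (norm_u * norm_v)


def build_edges(embeddings, similarity_threshold):
    """Returns symmetric (source, target, weight) edges for a dict of ID -> vector."""
    edges = []
    # Visit every unordered pair once, then emit the edge in both directions.
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        weight = cosine_similarity(vec_a, vec_b)
        if weight >= similarity_threshold:
            edges.append((id_a, id_b, weight))
            edges.append((id_b, id_a, weight))
    return edges


def write_tsv(edges, output_graph_path):
    """Writes one tab-separated line per edge."""
    with open(output_graph_path, "w") as f:
        for source, target, weight in edges:
            f.write(f"{source}\t{target}\t{weight}\n")
```

Because every pair is compared, the sketch takes time quadratic in the number of examples, which is only practical for small inputs.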

Note that this function can also be invoked as a binary from a shell. Sample usage:

python -m neural_structured_learning.tools.build_graph [flags] embedding_file.tfr... output_graph.tsv

For details about this program's flags, run:

python -m neural_structured_learning.tools.build_graph --help

  • embedding_files: A list of names of TFRecord files containing tf.train.Example objects, which in turn contain dense embeddings.
  • output_graph_path: Name of the file to which the output graph in TSV format should be written.
  • similarity_threshold: Minimum cosine similarity that a pair of examples must have for the edge between them to be retained in the resulting graph.
  • id_feature_name: The name of the feature in the input tf.train.Example objects representing the ID of examples.
  • embedding_feature_name: The name of the feature in the input tf.train.Example objects representing the embedding of examples.
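
The file written to output_graph_path can be read back for inspection with a few lines of standard-library Python. The sketch below is illustrative only: read_graph and is_symmetric are hypothetical helpers, and each non-empty line is assumed to contain exactly three tab-separated fields (source ID, target ID, edge weight).

```python
from collections import defaultdict


def read_graph(graph_path):
    """Loads a TSV edge list into a dict of source -> {target: weight}."""
    graph = defaultdict(dict)
    with open(graph_path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # Skip blank lines.
            source, target, weight = line.split("\t")
            graph[source][target] = float(weight)
    return dict(graph)


def is_symmetric(graph):
    """True if every edge has a reverse edge with the same weight."""
    return all(
        graph.get(target, {}).get(source) == weight
        for source, neighbors in graph.items()
        for target, weight in neighbors.items()
    )
```

Checking is_symmetric(read_graph(path)) on an output file is a quick sanity test of the symmetry property described above.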