Prepares input for graph-based Neural Structured Learning and persists it.

In particular, this function merges into each labeled training example the features from its out-edge neighbor examples according to a supplied similarity graph, and persists the resulting (augmented) training data.

Each tf.train.Example read from the files identified by labeled_examples_path and unlabeled_examples_path is expected to have a feature that contains its ID (represented as a singleton bytes_list value); the name of this feature is specified by the value of id_feature_name.

Each edge in the graph specified by graph_path is identified by a source instance ID, a target instance ID, and an optional edge weight. These edges are specified by TSV lines of the following form:

    source_id<TAB>target_id[<TAB>edge_weight]

If no edge_weight is specified, it defaults to 1.0. If the input graph is not symmetric and if add_undirected_edges is True, then all edges will be treated as bi-directional. To build a graph based on the similarity of instances' dense embeddings, see nsl.tools.build_graph.
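As an illustration of the edge format described above (a source ID, a target ID, and an optional weight that defaults to 1.0), here is a minimal pure-Python sketch. The helper name read_graph and the nested-dict representation are hypothetical and not part of the library. Where the same directed pair appears with different weights, this sketch keeps the larger one; that choice is an assumption, since the text above does not say how such conflicts are resolved.

```python
from collections import defaultdict


def read_graph(tsv_lines, add_undirected_edges=False):
    """Parses TSV edge lines into {source_id: {target_id: weight}}.

    Hypothetical helper for illustration; not the library's own reader.
    """
    graph = defaultdict(dict)

    def add_edge(source_id, target_id, weight):
        # Assumption of this sketch: on conflicting weights, keep the larger.
        graph[source_id][target_id] = max(
            weight, graph[source_id].get(target_id, weight))

    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) not in (2, 3):
            raise ValueError(f"Expected 2 or 3 tab-separated fields: {line!r}")
        source_id, target_id = fields[0], fields[1]
        # A missing third column means the default weight of 1.0.
        weight = float(fields[2]) if len(fields) == 3 else 1.0
        add_edge(source_id, target_id, weight)
        if add_undirected_edges:
            # Mirror the edge so it can be followed in both directions.
            add_edge(target_id, source_id, weight)
    return dict(graph)


edges = ["a\tb\t0.9", "a\tc", "b\tc\t0.5"]
directed = read_graph(edges)
undirected = read_graph(edges, add_undirected_edges=True)
```

With add_undirected_edges left off, only "a" and "b" have out-edges in this toy graph; with it on, "c" gains out-edges back to both "a" and "b".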

This function merges into each labeled example the features of that example's out-edge neighbors, as given by that example's out-edges in the graph. If a value is specified for max_nbrs, then at most that many neighbors' features are merged into each labeled instance (based on which neighbors have the largest edge weights, with ties broken using instance IDs).

Here's how the merging process works. For each labeled example, the features of its i'th out-edge neighbor will be prefixed by NL_nbr_<i>_, with indexes i in the half-open interval [0, K), where K is the minimum of max_nbrs and the number of the labeled example's out-edges in the graph. A feature named NL_nbr_<i>_weight, whose value is the weight of the edge to that neighbor, is also merged into the labeled example. The top neighbors to use in this process are selected by consulting the input graph and selecting the labeled example's out-edge neighbors with the largest edge weight; ties are broken by preferring neighbor IDs with larger lexicographic order. In addition, a feature named NL_num_nbrs (a singleton int64_list) is set on the result, denoting the number of neighbors K merged into the labeled example.
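The naming and selection rules above can be sketched in pure Python, using plain dicts in place of tf.train.Example objects. The function merge_neighbors and its arguments are hypothetical names chosen for this illustration, and the sketch assumes that features are available for every neighbor ID that appears in the graph.

```python
def merge_neighbors(seed_id, seed_features, all_features, graph, max_nbrs=None):
    """Returns a copy of seed_features with neighbor features merged in.

    Hypothetical illustration: dicts stand in for tf.train.Example objects,
    and graph maps source_id -> {target_id: edge_weight}.
    """
    out_edges = graph.get(seed_id, {})
    # Sort by (weight, neighbor ID) in descending order: largest weight first,
    # ties broken in favor of the lexicographically larger neighbor ID.
    ranked = sorted(
        ((weight, nbr_id) for nbr_id, weight in out_edges.items()),
        reverse=True,
    )
    if max_nbrs is not None:
        ranked = ranked[:max_nbrs]

    merged = dict(seed_features)
    for i, (weight, nbr_id) in enumerate(ranked):
        # Assumes all_features has an entry for every neighbor ID.
        for name, value in all_features[nbr_id].items():
            merged[f"NL_nbr_{i}_{name}"] = value
        merged[f"NL_nbr_{i}_weight"] = weight
    # K: the number of neighbors actually merged.
    merged["NL_num_nbrs"] = len(ranked)
    return merged


graph = {"a": {"b": 0.9, "c": 0.9, "d": 0.2}}
features = {
    "a": {"id": "a", "words": [1, 2], "label": 1},
    "b": {"id": "b", "words": [3]},
    "c": {"id": "c", "words": [4]},
    "d": {"id": "d", "words": [5]},
}
merged = merge_neighbors("a", features["a"], features, graph, max_nbrs=2)
```

In this toy input, neighbors "b" and "c" tie on weight 0.9, so "c" takes index 0 and "b" index 1, while "d" is dropped because max_nbrs is 2.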

Finally, the merged examples are written to a TFRecord file named by output_training_data_path.

labeled_examples_path: Names a TFRecord file containing labeled tf.train.Example instances.
unlabeled_examples_path: Names a TFRecord file containing unlabeled tf.train.Example instances. This can be an empty string if there are no unlabeled examples.
graph_path: Names a TSV file that specifies a graph as a set of edges representing similarity relationships.
output_training_data_path: Path to a file where the resulting augmented training data, in the form of tf.train.Example instances, will be persisted in the TFRecord format.
add_undirected_edges: Boolean indicating whether or not to treat edges as bi-directional.
max_nbrs: The maximum number of neighbors to use to generate the augmented training data for downstream training.
id_feature_name: The name of the feature in the input labeled and unlabeled tf.train.Example objects representing the ID of examples.
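A call using the arguments listed above might look like the following sketch. It assumes this page documents nsl.tools.pack_nbrs from the neural_structured_learning package; the file paths, the chosen values, and the wrapper run_pack_nbrs are hypothetical. Every argument is passed by keyword so the sketch does not rely on any default values.

```python
# Hypothetical paths and values; substitute your own.
PACK_NBRS_ARGS = {
    "labeled_examples_path": "/tmp/labeled.tfr",
    "unlabeled_examples_path": "",  # empty string: no unlabeled examples
    "graph_path": "/tmp/graph.tsv",
    "output_training_data_path": "/tmp/nsl_train.tfr",
    "add_undirected_edges": True,
    "max_nbrs": 3,
    "id_feature_name": "id",
}


def run_pack_nbrs(args=PACK_NBRS_ARGS):
    """Hypothetical wrapper that forwards the keyword arguments."""
    # Imported lazily so the argument dict can be inspected without the package.
    import neural_structured_learning as nsl

    nsl.tools.pack_nbrs(**args)
```

Calling run_pack_nbrs() requires the package to be installed and the input files to exist; the augmented examples are then written to the path given by output_training_data_path.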