Graph actions

Perform various training, evaluation, and inference actions on a graph.

class tf.contrib.learn.NanLossDuringTrainingError


tf.contrib.learn.NanLossDuringTrainingError.__str__()


class tf.contrib.learn.RunConfig

This class specifies the configuration for a run.

If you're a Google-internal user using command line flags with learn_runner.py (for instance, to do distributed training or to use parameter servers), you probably want to use learn_runner.EstimatorConfig instead.


tf.contrib.learn.RunConfig.__init__(master=None, task=None, num_ps_replicas=None, num_cores=0, log_device_placement=False, gpu_memory_fraction=1, cluster_spec=None, tf_random_seed=None, save_summary_steps=100, save_checkpoints_secs=600, keep_checkpoint_max=5, keep_checkpoint_every_n_hours=10000, job_name=None, is_chief=None, evaluation_master='')

Constructor.

For each of master, task, num_ps_replicas, cluster_spec, job_name, and is_chief that is set to None, the value is derived from the TF_CONFIG environment variable if the pertinent information is present; otherwise, the defaults listed in the Args section apply.

The TF_CONFIG environment variable is a JSON object with two relevant attributes: task and cluster_spec. cluster_spec is a JSON serialized version of the Python dict described in server_lib.py. task has two attributes: type and index, where type can be any of the task types in the cluster_spec. When TF_CONFIG contains said information, the following properties are set on this class:

  • job_name is set to TF_CONFIG['task']['type'].
  • task is set to TF_CONFIG['task']['index'].
  • cluster_spec is parsed from TF_CONFIG['cluster'].
  • master is determined by looking up job_name and task in the cluster_spec.
  • num_ps_replicas is set by counting the number of nodes listed in the ps job of cluster_spec.
  • is_chief is set to True when job_name == 'master' and task == 0.

Example:

  import json
  import os

  cluster = {'ps': ['host1:2222', 'host2:2222'],
             'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
  os.environ['TF_CONFIG'] = json.dumps(
      {'cluster': cluster,
       'task': {'type': 'worker', 'index': 1}})
  config = RunConfig()
  assert config.master == 'host4:2222'
  assert config.task == 1
  assert config.num_ps_replicas == 2
  assert config.cluster_spec == server_lib.ClusterSpec(cluster)
  assert config.job_name == 'worker'
  assert not config.is_chief
Args:
  • master: TensorFlow master. Defaults to empty string for local.
  • task: Task id of the replica running the training (default: 0).
  • num_ps_replicas: Number of parameter server tasks to use (default: 0).
  • num_cores: Number of cores to be used. If 0, the system picks an appropriate number (default: 0).
  • log_device_placement: Log the op placement to devices (default: False).
  • gpu_memory_fraction: Fraction of GPU memory used by the process on each GPU uniformly on the same machine.
  • cluster_spec: a tf.train.ClusterSpec object that describes the cluster in the case of distributed computation. If missing, reasonable assumptions are made for the addresses of jobs.
  • tf_random_seed: Random seed for TensorFlow initializers. Setting this value allows consistency between reruns.
  • save_summary_steps: Save summaries every this many steps.
  • save_checkpoints_secs: Save checkpoints every this many seconds.
  • keep_checkpoint_max: The maximum number of recent checkpoint files to keep. As new files are created, older files are deleted. If None or 0, all checkpoint files are kept. Defaults to 5 (that is, the 5 most recent checkpoint files are kept.)
  • keep_checkpoint_every_n_hours: Number of hours between each checkpoint to be saved. The default value of 10,000 hours effectively disables the feature.
  • job_name: the type of task, e.g., 'ps', 'worker', etc. The job_name must exist in the cluster_spec.jobs.
  • is_chief: whether or not this task (as identified by the other parameters) should be the chief task.
  • evaluation_master: the master on which to perform evaluation.
Raises:
  • ValueError: if num_ps_replicas and cluster_spec are set (cluster_spec may come from the TF_CONFIG environment variable).
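
For runs that do not rely on TF_CONFIG, the configuration can be built directly from keyword arguments. A minimal local sketch (the argument values below are illustrative, not recommendations):

  import tensorflow as tf

  # Local, single-process configuration; the cluster-related arguments are
  # left at their defaults, so no cluster information is required here.
  config = tf.contrib.learn.RunConfig(
      tf_random_seed=42,           # make reruns reproducible
      save_summary_steps=50,       # write summaries every 50 steps
      save_checkpoints_secs=300,   # checkpoint every 5 minutes
      keep_checkpoint_max=3)       # keep only the 3 most recent checkpoints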

tf.contrib.learn.RunConfig.is_chief


tf.contrib.learn.RunConfig.job_name


tf.contrib.learn.evaluate(graph, output_dir, checkpoint_path, eval_dict, update_op=None, global_step_tensor=None, supervisor_master='', log_every_steps=10, feed_fn=None, max_steps=None)

Evaluate a model loaded from a checkpoint.

Given graph, a directory to write summaries to (output_dir), a checkpoint to restore variables from, and a dict of Tensors to evaluate, run an eval loop for max_steps steps, or until an exception (generally, an end-of-input signal from a reader operation) is raised from running eval_dict.

In each step of evaluation, all tensors in the eval_dict are evaluated, and every log_every_steps steps, they are logged. At the very end of evaluation, a summary is evaluated (finding the summary ops using Supervisor's logic) and written to output_dir.

Args:
  • graph: A Graph to train. It is expected that this graph is not in use elsewhere.
  • output_dir: A string containing the directory to write a summary to.
  • checkpoint_path: A string containing the path to a checkpoint to restore. Can be None if the graph doesn't require loading any variables.
  • eval_dict: A dict mapping string names to tensors to evaluate. It is evaluated in every logging step. The result of the final evaluation is returned. If update_op is None, then it's evaluated in every step. If max_steps is None, this should depend on a reader that will raise an end-of-input exception when the inputs are exhausted.
  • update_op: A Tensor which is run in every step.
  • global_step_tensor: A Variable containing the global step. If None, one is extracted from the graph using the same logic as in Supervisor. Used to place eval summaries on training curves.
  • supervisor_master: The master string to use when preparing the session.
  • log_every_steps: Integer. Output logs every log_every_steps evaluation steps. The logs contain the eval_dict and timing information.
  • feed_fn: A function that is called every iteration to produce a feed_dict passed to session.run calls. Optional.
  • max_steps: Integer. Evaluate eval_dict this many times.
Returns:

A tuple (eval_results, global_step):

  • eval_results: A dict mapping string to numeric values (int, float) that are the result of running eval_dict in the last step. None if no eval steps were run.
  • global_step: The global step this evaluation corresponds to.
Raises:
  • ValueError: if output_dir is empty.
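
A minimal sketch of calling evaluate, assuming a hypothetical build_eval_graph() helper that adds the input pipeline, model, and metric tensors to the graph, and assuming checkpoints were written to /tmp/train_dir by a prior training run (names and paths are illustrative):

  import tensorflow as tf

  g = tf.Graph()
  with g.as_default():
    # build_eval_graph() is a stand-in for your own graph construction; it is
    # assumed to return a dict such as {'accuracy': ..., 'loss': ...}.
    eval_dict = build_eval_graph()

  eval_results, global_step = tf.contrib.learn.evaluate(
      graph=g,
      output_dir='/tmp/eval_dir',   # summaries are written here
      checkpoint_path=tf.train.latest_checkpoint('/tmp/train_dir'),
      eval_dict=eval_dict,
      log_every_steps=10,
      max_steps=100)                # evaluate 100 batches, then stop
  print(eval_results, global_step)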

tf.contrib.learn.infer(restore_checkpoint_path, output_dict, feed_dict=None)

Restore graph from restore_checkpoint_path and run output_dict tensors.

If restore_checkpoint_path is supplied, restore from the checkpoint. Otherwise, initialize all variables.

Args:
  • restore_checkpoint_path: A string containing the path to a checkpoint to restore.
  • output_dict: A dict mapping string names to Tensor objects to run. Tensors must all be from the same graph.
  • feed_dict: dict object mapping Tensor objects to input values to feed.
Returns:

Dict of values read from output_dict tensors. Keys are the same as output_dict, values are the results read from the corresponding Tensor in output_dict.

Raises:
  • ValueError: if output_dict or feed_dict is None or empty.
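
A minimal sketch, assuming the default graph already defines a placeholder x and a predictions tensor (both names are illustrative, as is the checkpoint path):

  import numpy as np
  import tensorflow as tf

  # x and predictions are assumed to exist in the current default graph.
  results = tf.contrib.learn.infer(
      restore_checkpoint_path='/tmp/train_dir/model.ckpt-1000',
      output_dict={'predictions': predictions},
      feed_dict={x: np.zeros((1, 10), dtype=np.float32)})
  print(results['predictions'])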

tf.contrib.learn.run_feeds(*args, **kwargs)

See run_feeds_iter(). Returns a list instead of an iterator.


tf.contrib.learn.run_n(output_dict, feed_dict=None, restore_checkpoint_path=None, n=1)

Run output_dict tensors n times, with the same feed_dict each run.

Args:
  • output_dict: A dict mapping string names to tensors to run. Must all be from the same graph.
  • feed_dict: dict of input values to feed each run.
  • restore_checkpoint_path: A string containing the path to a checkpoint to restore.
  • n: Number of times to repeat.
Returns:

A list of n dict objects, each containing values read from output_dict tensors.
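
For example, to read a stochastic tensor several times from a freshly created session (no checkpoint is restored in this sketch):

  import tensorflow as tf

  with tf.Graph().as_default():
    noise = tf.random_normal([2, 2])       # re-sampled on every run
    outputs = tf.contrib.learn.run_n(
        output_dict={'noise': noise},
        n=3)                                # returns 3 dicts, one per run
    for out in outputs:
      print(out['noise'])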


tf.contrib.learn.train(graph, output_dir, train_op, loss_op, global_step_tensor=None, init_op=None, init_feed_dict=None, init_fn=None, log_every_steps=10, supervisor_is_chief=True, supervisor_master='', supervisor_save_model_secs=600, keep_checkpoint_max=5, supervisor_save_summaries_steps=100, feed_fn=None, steps=None, fail_on_nan_loss=True, monitors=None, max_steps=None)

Train a model.

Given graph, a directory to write outputs to (output_dir), and some ops, run a training loop. The given train_op performs one step of training on the model. The loss_op represents the objective function of the training. The train_op is expected to increment the global_step_tensor, a scalar integer tensor counting training steps. This function uses Supervisor to initialize the graph (from a checkpoint if one is available in output_dir), write summaries defined in the graph, and write regular checkpoints as defined by supervisor_save_model_secs.

Training continues until global_step_tensor evaluates to max_steps or, if fail_on_nan_loss is set, until loss_op evaluates to NaN; in the latter case the program is terminated with exit code 1.

Args:
  • graph: A graph to train. It is expected that this graph is not in use elsewhere.
  • output_dir: A directory to write outputs to.
  • train_op: An op that performs one training step when run.
  • loss_op: A scalar loss tensor.
  • global_step_tensor: A tensor representing the global step. If none is given, one is extracted from the graph using the same logic as in Supervisor.
  • init_op: An op that initializes the graph. If None, use Supervisor's default.
  • init_feed_dict: A dictionary that maps Tensor objects to feed values. This feed dictionary will be used when init_op is evaluated.
  • init_fn: Optional callable passed to Supervisor to initialize the model.
  • log_every_steps: Output logs every log_every_steps steps. The logs contain timing data and the current loss.
  • supervisor_is_chief: Whether the current process is the chief supervisor in charge of restoring the model and running standard services.
  • supervisor_master: The master string to use when preparing the session.
  • supervisor_save_model_secs: Save a checkpoint every supervisor_save_model_secs seconds when training.
  • keep_checkpoint_max: The maximum number of recent checkpoint files to keep. As new files are created, older files are deleted. If None or 0, all checkpoint files are kept. This is simply passed as the max_to_keep arg to the tf.train.Saver constructor.
  • supervisor_save_summaries_steps: Save summaries every supervisor_save_summaries_steps steps when training.
  • feed_fn: A function that is called every iteration to produce a feed_dict passed to session.run calls. Optional.
  • steps: Trains for this many steps (e.g. current global step + steps).
  • fail_on_nan_loss: If true, raise NanLossDuringTrainingError if loss_op evaluates to NaN. If false, continue training as if nothing happened.
  • monitors: List of BaseMonitor subclass instances. Used for callbacks inside the training loop.
  • max_steps: Number of total steps for which to train the model. If None, train forever. Two calls to fit(steps=100) mean 200 training iterations. By contrast, two calls to fit(max_steps=100) mean the second call performs no iterations, since the first call already did all 100 steps.
Returns:

The final loss value.

Raises:
  • ValueError: If output_dir, train_op, loss_op, or global_step_tensor is not provided. See tf.contrib.framework.get_global_step for how we look up the latter if not provided explicitly.
  • NanLossDuringTrainingError: If fail_on_nan_loss is True, and loss ever evaluates to NaN.
  • ValueError: If both steps and max_steps are not None.
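
A minimal sketch of driving the training loop directly, assuming a hypothetical build_model() helper that adds the input pipeline and model to the graph and returns a scalar loss tensor (the optimizer, learning rate, and directory are illustrative):

  import tensorflow as tf

  g = tf.Graph()
  with g.as_default():
    loss = build_model()   # stand-in for your own graph construction
    global_step = tf.contrib.framework.get_or_create_global_step()
    # minimize() increments the global step each time train_op runs.
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=global_step)

  final_loss = tf.contrib.learn.train(
      graph=g,
      output_dir='/tmp/train_dir',   # checkpoints and summaries go here
      train_op=train_op,
      loss_op=loss,
      log_every_steps=100,
      steps=1000)                    # run 1000 additional training steps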