Online data resampling

To resample data with replacement on a per-example basis, use 'rejection_sample' or 'resample_at_rate'. For 'rejection_sample', provide a function that returns the probability of accepting each example; the resulting batch sizes are always the same. For 'resample_at_rate', provide the desired rate for each example; the resulting batch sizes may vary. To specify relative rates rather than absolute ones, use 'weighted_resample' (which also returns the actual resampling rate used for each output example).

Use 'stratified_sample' to resample without replacement from the data to achieve a desired mix of class proportions that the TensorFlow graph sees. For instance, if you have a binary classification dataset that is 99.9% class 1, a common approach is to resample from the data so that the classes are more balanced.

tf.contrib.training.rejection_sample(tensors, accept_prob_fn, batch_size, queue_threads=1, enqueue_many=False, prebatch_capacity=16, prebatch_threads=1, runtime_checks=False, name=None)

Stochastically creates batches by rejection sampling.

Each list of non-batched tensors is evaluated by accept_prob_fn, to produce a scalar tensor between 0 and 1. This tensor corresponds to the probability of being accepted. When batch_size tensor groups have been accepted, the batch queue will return a mini-batch.

Args:
  • tensors: List of tensors for data. All tensors are either one item or a batch, according to enqueue_many.
  • accept_prob_fn: A Python function (for example, a lambda) that takes a non-batch tensor from each item in tensors, and produces a scalar tensor.
  • batch_size: Size of batch to be returned.
  • queue_threads: The number of threads for the queue that will hold the final batch.
  • enqueue_many: Bool. If true, interpret input tensors as having a batch dimension.
  • prebatch_capacity: Capacity for the large queue that is used to convert batched tensors to single examples.
  • prebatch_threads: Number of threads for the large queue that is used to convert batched tensors to single examples.
  • runtime_checks: Bool. If true, insert runtime checks on the output of accept_prob_fn. Using True might have a performance impact.
  • name: Optional prefix for ops created by this function.
Raises:
  • ValueError: enqueue_many is True and labels doesn't have a batch dimension, or if enqueue_many is False and labels isn't a scalar.
  • ValueError: enqueue_many is True, and batch dimension on data and labels don't match.
  • ValueError: if a zero initial probability class has a nonzero target probability.
Returns:

A list of tensors of the same length as tensors, with batch dimension batch_size.

Example:

# Get tensor for a single data and label example.
data, label = data_provider.Get(['data', 'label'])

# Get a rejection-sampled batch according to the data tensor.
accept_prob_fn = lambda x: (tf.tanh(x[0]) + 1) / 2
data_batch = tf.contrib.training.rejection_sample(
    [data, label], accept_prob_fn, 16)

# Run batch through network.
...
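To make the accept/reject mechanics concrete, here is a minimal pure-Python sketch of rejection sampling into fixed-size batches. It is not the library's queue-based implementation: the helper name rejection_sample_batches, the synthetic stream, and the use of the standard random module are all assumptions made for illustration. Only the acceptance rule mirrors the example above.

```python
import math
import random


def rejection_sample_batches(examples, accept_prob_fn, batch_size, rng):
    """Yield lists of exactly batch_size accepted examples (hypothetical helper)."""
    batch = []
    for example in examples:
        # Keep the example with the probability returned by accept_prob_fn.
        if rng.random() < accept_prob_fn(example):
            batch.append(example)
        # Emit a batch only once batch_size examples have been accepted.
        if len(batch) == batch_size:
            yield batch
            batch = []


# Same acceptance rule as the example above: squash the first field into (0, 1).
accept_prob_fn = lambda x: (math.tanh(x[0]) + 1) / 2

rng = random.Random(0)
# Synthetic (data, label) pairs standing in for a real data provider.
stream = ((rng.uniform(-3, 3), i % 2) for i in range(1000))
batches = list(rejection_sample_batches(stream, accept_prob_fn, 16, rng))
```

Every emitted batch has the same size, and examples with a larger first field are over-represented because they are accepted more often. Leftover accepted examples that do not fill a batch are simply dropped in this sketch.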


tf.contrib.training.resample_at_rate(inputs, rates, scope=None, seed=None, back_prop=False)

Given input tensors, stochastically resamples each at a given rate.

For example, if the inputs are [[a1, a2], [b1, b2]] and the rates tensor contains [3, 1], then the return value may look like [[a1, a2, a1, a1], [b1, b2, b1, b1]]. However, many other outputs are possible, since this is stochastic: averaged over many repeated calls, the number of times each set of inputs appears in the output should approach its rate times the number of invocations.

Uses Knuth's method to generate samples from the Poisson distribution (but instead of just incrementing a count, actually emits the input); this is described at https://en.wikipedia.org/wiki/Poisson_distribution in the section on generating Poisson-distributed random variables.

Note that this method is not appropriate for large rate values: with float16 it will stop performing correctly for rates above 9.17; float32, 87; and float64, 708. (These are the base-e versions of the minimum representable exponent for each type.)
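The Knuth-style procedure described above can be sketched in plain Python as follows. This is an illustration under assumptions, not the TensorFlow op: the function name resample_at_rate_sketch and the rng argument are hypothetical, inputs are ordinary Python lists, and the outputs are grouped per example rather than interleaved.

```python
import math
import random


def resample_at_rate_sketch(inputs, rates, rng):
    """Emit example i a Poisson(rates[i])-distributed number of times."""
    outputs = [[] for _ in inputs]
    for i, rate in enumerate(rates):
        # Knuth's threshold; exp(-rate) loses precision and eventually
        # underflows for large rates, which is why the method breaks down there.
        threshold = math.exp(-rate)
        product = rng.random()
        # Multiply uniforms until the product drops to the threshold or below,
        # emitting the example once per iteration instead of counting.
        while product > threshold:
            for out, column in zip(outputs, inputs):
                out.append(column[i])
            product *= rng.random()
    return outputs


rng = random.Random(0)
inputs = [["a1", "a2"], ["b1", "b2"]]
rates = [3.0, 1.0]
trials = 2000
counts = {"a1": 0, "a2": 0}
for _ in range(trials):
    a_out, b_out = resample_at_rate_sketch(inputs, rates, rng)
    for a in a_out:
        counts[a] += 1
mean_a1 = counts["a1"] / trials  # should be close to 3
mean_a2 = counts["a2"] / trials  # should be close to 1
```

Because every input tensor is indexed with the same i inside the loop, corresponding entries (a1 with b1, a2 with b2) stay aligned across the output lists.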

Args:
  • inputs: A list of tensors, each of which has a shape of [batch_size, ...]
  • rates: A tensor of shape [batch_size] containing the resampling rates for each input.
  • scope: Scope for the op.
  • seed: Random seed to use.
  • back_prop: Whether to allow back-propagation through this op.
Returns:

Selections from the input tensors.


tf.contrib.training.stratified_sample(tensors, labels, target_probs, batch_size, init_probs=None, enqueue_many=False, queue_capacity=16, threads_per_queue=1, name=None)

Stochastically creates batches based on per-class probabilities.

This method discards examples. Internally, it creates one queue to amortize the cost of disk reads, and one queue to hold the properly-proportioned batch.

Args:
  • tensors: List of tensors for data. All tensors are either one item or a batch, according to enqueue_many.
  • labels: Tensor for label of data. Label is a single integer or a batch, depending on enqueue_many. It is not a one-hot vector.
  • target_probs: Target class proportions in batch. An object whose type has a registered Tensor conversion function.
  • batch_size: Size of batch to be returned.
  • init_probs: Class proportions in the data. An object whose type has a registered Tensor conversion function, or None for estimating the initial distribution.
  • enqueue_many: Bool. If true, interpret input tensors as having a batch dimension.
  • queue_capacity: Capacity of the large queue that holds input examples.
  • threads_per_queue: Number of threads for the large queue that holds input examples and for the final queue with the proper class proportions.
  • name: Optional prefix for ops created by this function.
Raises:
  • ValueError: enqueue_many is True and labels doesn't have a batch dimension, or if enqueue_many is False and labels isn't a scalar.
  • ValueError: enqueue_many is True, and batch dimension on data and labels don't match.
  • ValueError: if probs don't sum to one.
  • ValueError: if a zero initial probability class has a nonzero target probability.
  • TFAssertion: if labels aren't integers in [0, num classes).
Returns:

(data_batch, label_batch), where data_batch is a list of tensors of the same length as tensors.

Example:

# Get tensor for a single data and label example.
data, label = data_provider.Get(['data', 'label'])

# Get stratified batch according to per-class probabilities.
target_probs = [...distribution you want...]
[data_batch], labels = tf.contrib.training.stratified_sample(
    [data], label, target_probs, batch_size=16)

# Run batch through network.
...
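One way to see how discarding examples can produce the target class mix is to compute a per-class acceptance probability from the initial and target proportions. The sketch below is an assumption about the general technique, not a copy of the library's internals: the helper name acceptance_probs is hypothetical, and the simulation uses the standard random module in place of input queues.

```python
import random


def acceptance_probs(init_probs, target_probs):
    """Per-class acceptance probabilities that turn init_probs into target_probs."""
    ratios = []
    for init, target in zip(init_probs, target_probs):
        if init == 0:
            if target > 0:
                raise ValueError(
                    "a zero initial probability class has a nonzero target probability")
            ratios.append(0.0)
        else:
            ratios.append(target / init)
    # Scale so the most under-represented class is always accepted.
    max_ratio = max(ratios)
    return [ratio / max_ratio for ratio in ratios]


# A binary dataset that is 99.9% class 1, rebalanced to 50/50.
init_probs = [0.001, 0.999]
target_probs = [0.5, 0.5]
probs = acceptance_probs(init_probs, target_probs)

# Simulate the stream: draw a label, then keep it with its class probability.
rng = random.Random(0)
kept = [0, 0]
for _ in range(200000):
    label = 0 if rng.random() < init_probs[0] else 1
    if rng.random() < probs[label]:
        kept[label] += 1
fraction_class_0 = kept[0] / sum(kept)  # should be close to 0.5
```

With these numbers the rare class is always kept and the common class is kept roughly once in a thousand times, so almost all of the input is discarded. That is the cost of rebalancing a heavily skewed dataset this way.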


tf.contrib.training.weighted_resample(inputs, weights, overall_rate, scope=None, mean_decay=0.999, seed=None)

Performs an approximate weighted resampling of inputs.

This method chooses elements from inputs where each item's rate of selection is proportional to its value in weights, and the average rate of selection across all inputs (and many invocations!) is overall_rate.

Args:
  • inputs: A list of tensors whose first dimension is batch_size.
  • weights: A [batch_size]-shaped tensor with each batch member's weight.
  • overall_rate: Desired overall rate of resampling.
  • scope: Scope to use for the op.
  • mean_decay: How quickly to decay the running estimate of the mean weight.
  • seed: Random seed.
Returns:

A list of tensors exactly like inputs, but with an unknown (and possibly zero) first dimension, and a tensor containing the effective resampling rate used for each output.
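The relationship between weights, overall_rate, and the per-example rates can be sketched as below. This is a simplification under a stated assumption: it normalizes by the mean weight of the current batch, whereas the function described above keeps a running estimate of the mean controlled by mean_decay. The helper name per_example_rates is hypothetical.

```python
def per_example_rates(weights, overall_rate):
    """Scale weights so that the mean resampling rate equals overall_rate."""
    # Assumption: batch mean instead of a decayed running mean.
    mean_weight = sum(weights) / len(weights)
    return [overall_rate * w / mean_weight for w in weights]


weights = [1.0, 1.0, 2.0, 4.0]
rates = per_example_rates(weights, overall_rate=0.5)
print(rates)  # [0.25, 0.25, 0.5, 1.0]
```

Each rate is proportional to its weight, and the rates average to overall_rate. Such rates could then drive a Poisson-based resampler like the one described for resample_at_rate.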