Convolution

The convolution ops sweep a 2-D filter over a batch of images, applying the filter to each appropriately sized window of each image. The different ops trade off between generic and specific filters:

• conv2d: Arbitrary filters that can mix channels together.
• depthwise_conv2d: Filters that operate on each channel independently.
• separable_conv2d: A depthwise spatial filter followed by a pointwise filter.

Note that although these ops are called "convolution", they are, strictly speaking, "cross-correlation", since the filter is combined with an input window without being reversed. For details, see the properties of cross-correlation.
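For example, here is a minimal 1-D illustration of the difference using NumPy (NumPy is not part of these ops; it is used here only to make the kernel flip explicit):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
v = np.array([0.0, 1.0, 2.0])

# np.convolve reverses the kernel before sliding it (true convolution);
# np.correlate slides it as-is (cross-correlation, which is what these ops compute).
print(np.convolve(a, v, mode='full'))   # [0. 1. 4. 7. 6.]
print(np.correlate(a, v, mode='full'))  # [2. 5. 8. 3. 0.]

The two results differ unless the kernel is symmetric.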

The filter is applied to image patches of the same size as the filter and strided according to the strides argument. strides = [1, 1, 1, 1] applies the filter to a patch at every offset, strides = [1, 2, 2, 1] applies the filter to every other image patch in each dimension, etc.

Ignoring channels for the moment, assume that the 4-D input has shape [batch, in_height, in_width, ...] and the 4-D filter has shape [filter_height, filter_width, ...]. The spatial semantics of the convolution ops are then as follows: first, according to the chosen padding scheme ('SAME' or 'VALID'), the output size and the padding pixels are computed. For 'SAME' padding, the output height and width are computed as:

out_height = ceil(float(in_height) / float(strides[1]))
out_width  = ceil(float(in_width) / float(strides[2]))

and the total padding along the height and width is computed as:

pad_along_height = max((out_height - 1) * strides[1] +
                       filter_height - in_height, 0)
pad_along_width  = max((out_width - 1) * strides[2] +
                       filter_width - in_width, 0)

which is then split between the two sides of each dimension:

pad_top    = pad_along_height // 2
pad_bottom = pad_along_height - pad_top
pad_left   = pad_along_width // 2
pad_right  = pad_along_width - pad_left

Note that the division by 2 means that there might be cases when the padding on the two sides (top vs. bottom, left vs. right) differs by one. In this case, the bottom and right sides always get the one additional padded pixel. For example, when pad_along_height is 5, we pad 2 pixels at the top and 3 pixels at the bottom. Note that this is different from existing libraries such as cuDNN and Caffe, which explicitly specify the number of padded pixels and always pad the same number of pixels on both sides.
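As an illustration, the 'SAME' computation above can be transcribed directly into Python (same_padding_1d is a helper name chosen here; it is not part of the TensorFlow API):

import math

def same_padding_1d(in_size, filter_size, stride):
    """Output size and (before, after) padding for one spatial dimension
    under 'SAME' padding, following the formulas above."""
    out_size = math.ceil(in_size / stride)
    pad_along = max((out_size - 1) * stride + filter_size - in_size, 0)
    pad_before = pad_along // 2          # the extra pixel, if any, goes after
    pad_after = pad_along - pad_before
    return out_size, pad_before, pad_after

# pad_along = 5 splits as 2 before (top) and 3 after (bottom):
print(same_padding_1d(in_size=4, filter_size=6, stride=1))  # (4, 2, 3)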

For the 'VALID' padding, the output height and width are computed as:

out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width  = ceil(float(in_width - filter_width + 1) / float(strides[2]))

and the padding values are always zero. For example, with in_height = 10, filter_height = 3, and strides[1] = 2, 'VALID' padding gives out_height = ceil((10 - 3 + 1) / 2) = ceil(8 / 2) = 4. The output is then computed as:

output[b, i, j, :] =
    sum_{di, dj} input[b, strides[1] * i + di - pad_top,
                          strides[2] * j + dj - pad_left, ...] *
                 filter[di, dj, ...]

where any values outside the original input image region are considered zero (i.e., we pad zero values around the border of the image).

Since input is 4-D, each input[b, i, j, :] is a vector. For conv2d, these vectors are multiplied by the filter[di, dj, :, :] matrices to produce new vectors. For depthwise_conv2d, each scalar component input[b, i, j, k] is multiplied by a vector filter[di, dj, k], and all the vectors are concatenated.
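To make these semantics concrete, here is a naive NumPy transcription of the 'SAME'-padded output formula for conv2d (naive_conv2d_same is an illustrative name; the loops are written for clarity, and this is not how TensorFlow implements the op):

import math
import numpy as np

def naive_conv2d_same(inp, filt, stride):
    """Direct evaluation of the output formula with 'SAME' padding.
    inp: [batch, in_h, in_w, in_c]; filt: [f_h, f_w, in_c, out_c]."""
    batch, in_h, in_w, in_c = inp.shape
    f_h, f_w, _, out_c = filt.shape
    out_h = math.ceil(in_h / stride)
    out_w = math.ceil(in_w / stride)
    pad_top = max((out_h - 1) * stride + f_h - in_h, 0) // 2
    pad_left = max((out_w - 1) * stride + f_w - in_w, 0) // 2
    out = np.zeros((batch, out_h, out_w, out_c))
    for b in range(batch):
        for i in range(out_h):
            for j in range(out_w):
                for di in range(f_h):
                    for dj in range(f_w):
                        y = stride * i + di - pad_top
                        x = stride * j + dj - pad_left
                        if 0 <= y < in_h and 0 <= x < in_w:  # out-of-range reads are zero
                            # input[b, y, x, :] is a vector; filter[di, dj, :, :] is a
                            # matrix that mixes the input channels together.
                            out[b, i, j, :] += inp[b, y, x, :] @ filt[di, dj, :, :]
    return out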

tf.nn.conv2d(input, filter, strides, padding, use_cudnn_on_gpu=None, data_format=None, name=None)

Computes a 2-D convolution given 4-D input and filter tensors.

Given an input tensor of shape [batch, in_height, in_width, in_channels] and a filter / kernel tensor of shape [filter_height, filter_width, in_channels, out_channels], this op performs the following:

1. Flattens the filter to a 2-D matrix with shape [filter_height * filter_width * in_channels, out_channels].
2. Extracts image patches from the input tensor to form a virtual tensor of shape [batch, out_height, out_width, filter_height * filter_width * in_channels].
3. For each patch, multiplies the image patch vector by the filter matrix (patch vector on the left, filter matrix on the right).

In detail, with the default NHWC format,

output[b, i, j, k] =
    sum_{di, dj, q} input[b, strides[1] * i + di, strides[2] * j + dj, q] *
                    filter[di, dj, q, k]

Must have strides[0] = strides[3] = 1. For the most common case of the same horizontal and vertical strides, strides = [1, stride, stride, 1].
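A minimal usage sketch (the shapes here are arbitrary and chosen only for illustration):

import numpy as np
import tensorflow as tf

# One 5x5 RGB image and a 3x3 filter mapping 3 input channels to 8 output channels.
x = tf.constant(np.random.rand(1, 5, 5, 3), dtype=tf.float32)
w = tf.constant(np.random.rand(3, 3, 3, 8), dtype=tf.float32)

y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
# 'SAME' padding with stride 1 preserves the spatial dimensions: [1, 5, 5, 8].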

Args:
• input: A Tensor. Must be one of the following types: half, float32, float64.
• filter: A Tensor. Must have the same type as input.
• strides: A list of ints. 1-D of length 4. The stride of the sliding window for each dimension of input. Must be in the same order as the dimensions specified with data_format.
• padding: A string from: "SAME", "VALID". The type of padding algorithm to use.
• use_cudnn_on_gpu: An optional bool. Defaults to True.
• data_format: An optional string from: "NHWC", "NCHW". Defaults to "NHWC". Specify the data format of the input and output data. With the default format "NHWC", the data is stored in the order of: [batch, in_height, in_width, in_channels]. Alternatively, the format could be "NCHW", the data storage order of: [batch, in_channels, in_height, in_width].
• name: A name for the operation (optional).
Returns:

A Tensor. Has the same type as input.

tf.nn.depthwise_conv2d(input, filter, strides, padding, name=None)

Depthwise 2-D convolution.

Given an input tensor of shape [batch, in_height, in_width, in_channels] and a filter tensor of shape [filter_height, filter_width, in_channels, channel_multiplier] containing in_channels convolutional filters of depth 1, depthwise_conv2d applies a different filter to each input channel (expanding from 1 channel to channel_multiplier channels each) and then concatenates the results together. The output has in_channels * channel_multiplier channels.

In detail,

output[b, i, j, k * channel_multiplier + q] =
    sum_{di, dj} input[b, strides[1] * i + di, strides[2] * j + dj, k] *
                 filter[di, dj, k, q]

Must have strides[0] = strides[3] = 1. For the most common case of the same horizontal and vertical strides, strides = [1, stride, stride, 1].
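For example (shapes are illustrative):

import numpy as np
import tensorflow as tf

x = tf.constant(np.random.rand(1, 5, 5, 3), dtype=tf.float32)
# channel_multiplier = 2: each of the 3 input channels expands to 2 output channels.
w = tf.constant(np.random.rand(3, 3, 3, 2), dtype=tf.float32)

y = tf.nn.depthwise_conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
# Output has in_channels * channel_multiplier = 3 * 2 = 6 channels: [1, 5, 5, 6].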

Args:
• input: 4-D with shape [batch, in_height, in_width, in_channels].
• filter: 4-D with shape [filter_height, filter_width, in_channels, channel_multiplier].
• strides: 1-D of size 4. The stride of the sliding window for each dimension of input.
• padding: A string, either 'VALID' or 'SAME'. The padding algorithm. See the "Convolution" section above for details.
• name: A name for this operation (optional).
Returns:

A 4-D Tensor of shape [batch, out_height, out_width, in_channels * channel_multiplier].

tf.nn.separable_conv2d(input, depthwise_filter, pointwise_filter, strides, padding, name=None)

2-D convolution with separable filters.

Performs a depthwise convolution that acts separately on channels followed by a pointwise convolution that mixes channels. Note that this is separability between dimensions [1, 2] and 3, not spatial separability between dimensions 1 and 2.

In detail,

output[b, i, j, k] =
    sum_{di, dj, q, r} input[b, strides[1] * i + di, strides[2] * j + dj, q] *
                       depthwise_filter[di, dj, q, r] *
                       pointwise_filter[0, 0, q * channel_multiplier + r, k]

strides controls the strides for the depthwise convolution only, since the pointwise convolution has implicit strides of [1, 1, 1, 1]. Must have strides[0] = strides[3] = 1. For the most common case of the same horizontal and vertical strides, strides = [1, stride, stride, 1].
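A short usage sketch (shapes chosen for illustration):

import numpy as np
import tensorflow as tf

x = tf.constant(np.random.rand(1, 5, 5, 3), dtype=tf.float32)
# Depthwise stage: 3 input channels, channel_multiplier = 2.
depthwise = tf.constant(np.random.rand(3, 3, 3, 2), dtype=tf.float32)
# Pointwise stage: mixes the 3 * 2 = 6 intermediate channels into 16 outputs.
pointwise = tf.constant(np.random.rand(1, 1, 6, 16), dtype=tf.float32)

y = tf.nn.separable_conv2d(x, depthwise, pointwise,
                           strides=[1, 1, 1, 1], padding='SAME')
# [1, 5, 5, 16]: per-channel spatial filtering, then a 1x1 channel mix.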

Args:
• input: 4-D Tensor with shape [batch, in_height, in_width, in_channels].
• depthwise_filter: 4-D Tensor with shape [filter_height, filter_width, in_channels, channel_multiplier]. Contains in_channels convolutional filters of depth 1.
• pointwise_filter: 4-D Tensor with shape [1, 1, channel_multiplier * in_channels, out_channels]. Pointwise filter to mix channels after depthwise_filter has convolved spatially.
• strides: 1-D of size 4. The strides for the depthwise convolution for each dimension of input.
• padding: A string, either 'VALID' or 'SAME'. The padding algorithm. See the "Convolution" section above for details.
• name: A name for this operation (optional).
Returns:

A 4-D Tensor of shape [batch, out_height, out_width, out_channels].

Raises:
• ValueError: If channel_multiplier * in_channels > out_channels, which means that the separable convolution is overparameterized.

tf.nn.atrous_conv2d(value, filters, rate, padding, name=None)

Atrous convolution (a.k.a. convolution with holes or dilated convolution).

Computes a 2-D atrous convolution, also known as convolution with holes or dilated convolution, given 4-D value and filters tensors. If the rate parameter is equal to one, it performs regular 2-D convolution. If the rate parameter is greater than one, it performs convolution with holes, sampling the input values every rate pixels in the height and width dimensions. This is equivalent to convolving the input with a set of upsampled filters, produced by inserting rate - 1 zeros between two consecutive values of the filters along the height and width dimensions, hence the name atrous convolution or convolution with holes (the French word trous means holes in English).

More specifically:

output[b, i, j, k] =
    sum_{di, dj, q} filters[di, dj, q, k] *
                    value[b, i + rate * di, j + rate * dj, q]

Atrous convolution allows us to explicitly control how densely to compute feature responses in fully convolutional networks. Used in conjunction with bilinear interpolation, it offers an alternative to conv2d_transpose in dense prediction tasks such as semantic image segmentation, optical flow computation, or depth estimation. It also allows us to effectively enlarge the field of view of filters without increasing the number of parameters or the amount of computation.
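For instance (an illustrative sketch):

import numpy as np
import tensorflow as tf

x = tf.constant(np.random.rand(1, 10, 10, 3), dtype=tf.float32)
w = tf.constant(np.random.rand(3, 3, 3, 8), dtype=tf.float32)

# rate = 2 gives a 3x3 filter an effective 5x5 field of view
# (filter_height + (filter_height - 1) * (rate - 1) = 3 + 2 * 1 = 5)
# without adding any parameters.
y = tf.nn.atrous_conv2d(x, w, rate=2, padding='SAME')  # [1, 10, 10, 8]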

For a description of atrous convolution and how it can be used for dense feature extraction, please see: Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. The same operation is investigated further in Multi-Scale Context Aggregation by Dilated Convolutions. Previous works that effectively use atrous convolution in different ways are, among others, OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks and Fast Image Scanning with Deep Max-Pooling Convolutional Neural Networks (http://arxiv.org/abs/1302.1700). Atrous convolution is also closely related to the so-called noble identities in multi-rate signal processing.

There are many different ways to implement atrous convolution (see the refs above). The implementation here reduces to the following three operations:

paddings = ...
net = space_to_batch(value, paddings, block_size=rate)
net = conv2d(net, filters, strides=[1, 1, 1, 1], padding="VALID")
crops = ...
net = batch_to_space(net, crops, block_size=rate)

Advanced usage. Note the following optimization: a sequence of atrous_conv2d operations with identical rate parameters, 'SAME' padding, and filters with odd heights/widths:

net = atrous_conv2d(net, filters1, rate, padding="SAME")
net = atrous_conv2d(net, filters2, rate, padding="SAME")
...
net = atrous_conv2d(net, filtersK, rate, padding="SAME")

can be performed equivalently, and more cheaply in terms of computation and memory, as:

pad = ...  # padding so that the input dims are multiples of rate
net = space_to_batch(net, paddings=pad, block_size=rate)
net = conv2d(net, filters1, strides=[1, 1, 1, 1], padding="SAME")
net = conv2d(net, filters2, strides=[1, 1, 1, 1], padding="SAME")
...
net = conv2d(net, filtersK, strides=[1, 1, 1, 1], padding="SAME")
net = batch_to_space(net, crops=pad, block_size=rate)

because a pair of consecutive space_to_batch and batch_to_space ops with the same block_size cancel out when their respective paddings and crops inputs are identical.

Args:
• value: A 4-D Tensor of type float. It needs to be in the default "NHWC" format. Its shape is [batch, in_height, in_width, in_channels].
• filters: A 4-D Tensor with the same type as value and shape [filter_height, filter_width, in_channels, out_channels]. filters' in_channels dimension must match that of value. Atrous convolution is equivalent to standard convolution with upsampled filters with effective height filter_height + (filter_height - 1) * (rate - 1) and effective width filter_width + (filter_width - 1) * (rate - 1), produced by inserting rate - 1 zeros along consecutive elements across the filters' spatial dimensions.
• rate: A positive int32. The stride with which we sample input values across the height and width dimensions. Equivalently, the rate by which we upsample the filter values by inserting zeros across the height and width dimensions. In the literature, the same parameter is sometimes called input stride or dilation.
• padding: A string, either 'VALID' or 'SAME'. The padding algorithm.
• name: Optional name for the returned tensor.
Returns:

A Tensor with the same type as value.

Raises:
• ValueError: If input/output depth does not match filters' shape, or if padding is other than 'VALID' or 'SAME'.

tf.nn.conv2d_transpose(value, filter, output_shape, strides, padding='SAME', name=None)

The transpose of conv2d.

This operation is sometimes called "deconvolution" after Deconvolutional Networks, but is actually the transpose (gradient) of conv2d rather than an actual deconvolution.
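A minimal sketch (shapes are illustrative; note that the filter's channel order is the reverse of conv2d's):

import numpy as np
import tensorflow as tf

x = tf.constant(np.random.rand(1, 4, 4, 8), dtype=tf.float32)
# Filter shape is [height, width, output_channels, in_channels].
w = tf.constant(np.random.rand(3, 3, 16, 8), dtype=tf.float32)

# Stride 2 upsamples the 4x4 input to 8x8 under 'SAME' padding.
y = tf.nn.conv2d_transpose(x, w, output_shape=[1, 8, 8, 16],
                           strides=[1, 2, 2, 1], padding='SAME')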

Args:
• value: A 4-D Tensor of type float and shape [batch, height, width, in_channels].
• filter: A 4-D Tensor with the same type as value and shape [height, width, output_channels, in_channels]. filter's in_channels dimension must match that of value.
• output_shape: A 1-D Tensor representing the output shape of the deconvolution op.
• strides: A list of ints. The stride of the sliding window for each dimension of the input tensor.
• padding: A string, either 'VALID' or 'SAME'. The padding algorithm. See the "Convolution" section above for details.
• name: Optional name for the returned tensor.
Returns:

A Tensor with the same type as value.

Raises:
• ValueError: If input/output depth does not match filter's shape, or if padding is other than 'VALID' or 'SAME'.

tf.nn.conv3d(input, filter, strides, padding, name=None)

Computes a 3-D convolution given 5-D input and filter tensors.

In signal processing, cross-correlation is a measure of similarity of two waveforms as a function of a time-lag applied to one of them. This is also known as a sliding dot product or sliding inner-product.

Our Conv3D implements a form of cross-correlation.
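A minimal usage sketch (shapes are illustrative):

import numpy as np
import tensorflow as tf

# A batch of one 8-frame 16x16 RGB volume and a 3x3x3 filter to 8 output channels.
x = tf.constant(np.random.rand(1, 8, 16, 16, 3), dtype=tf.float32)
w = tf.constant(np.random.rand(3, 3, 3, 3, 8), dtype=tf.float32)

y = tf.nn.conv3d(x, w, strides=[1, 1, 1, 1, 1], padding='SAME')
# 'SAME' padding with unit strides preserves the spatial dims: [1, 8, 16, 16, 8].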

Args:
• input: A Tensor. Must be one of the following types: float32, float64, int64, int32, uint8, uint16, int16, int8, complex64, complex128, qint8, quint8, qint32, half. Shape [batch, in_depth, in_height, in_width, in_channels].
• filter: A Tensor. Must have the same type as input. Shape [filter_depth, filter_height, filter_width, in_channels, out_channels]. in_channels must match between input and filter.
• strides: A list of 5 ints. 1-D tensor of length 5. The stride of the sliding window for each dimension of input. Must have strides[0] = strides[4] = 1.
• padding: A string from: "SAME", "VALID". The type of padding algorithm to use.
• name: A name for the operation (optional).
Returns:

A Tensor. Has the same type as input.