Layer normalization layer (Ba et al., 2016).
Compat aliases for migration
See Migration guide for more details.
tf.keras.layers.LayerNormalization( axis=-1, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', beta_regularizer=None, gamma_regularizer=None, beta_constraint=None, gamma_constraint=None, trainable=True, name=None, **kwargs )
Normalize the activations of the previous layer for each given example in a batch independently, rather than across a batch like Batch Normalization. That is, the layer applies a transformation that keeps the mean activation within each example close to 0 and the activation standard deviation close to 1.
Given a tensor inputs, moments are calculated and normalization
is performed across the axes specified in axis.
data = tf.constant(np.arange(10).reshape(5, 2) * 10, dtype=tf.float32)
print(data)
tf.Tensor(
[[ 0. 10.]
 [20. 30.]
 [40. 50.]
 [60. 70.]
 [80. 90.]], shape=(5, 2), dtype=float32)
layer = tf.keras.layers.LayerNormalization(axis=1)
output = layer(data)
print(output)
tf.Tensor(
[[-1. 1.]
 [-1. 1.]
 [-1. 1.]
 [-1. 1.]
 [-1. 1.]], shape=(5, 2), dtype=float32)
Notice that with Layer Normalization the normalization happens across the axes within each example, rather than across different examples in the batch.
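The difference between the two normalization directions can be sketched in plain Python. This is only an illustration: the helper normalize is hypothetical, and the per-feature variant is a simplified stand-in for batch normalization, not the Keras implementation.

```python
import math

def normalize(values, epsilon=1e-3):
    """Shift and scale a list of floats to roughly zero mean and unit variance."""
    k = len(values)
    mean = sum(values) / k
    var = sum((v - mean) ** 2 for v in values) / k
    return [(v - mean) / math.sqrt(var + epsilon) for v in values]

# Same values as the example above: 5 examples with 2 features each.
data = [[0.0, 10.0], [20.0, 30.0], [40.0, 50.0], [60.0, 70.0], [80.0, 90.0]]

# Layer-normalization style: statistics are computed within each example (row).
per_example = [normalize(row) for row in data]

# Batch-normalization style: statistics are computed per feature (column),
# across all examples in the batch.
columns = [list(col) for col in zip(*data)]
per_feature = [list(row) for row in zip(*(normalize(col) for col in columns))]

for row in per_example:
    print([round(v, 3) for v in row])
for row in per_feature:
    print([round(v, 3) for v in row])
```

In the per-example result every row comes out close to [-1, 1], regardless of the other rows. In the per-feature result each row's values depend on where that example sits relative to the rest of the batch.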
If scale or center are enabled, the layer will scale the normalized
outputs by broadcasting them with a trainable variable
gamma, and center
the outputs by broadcasting with a trainable variable
beta. gamma will default to a ones tensor and
beta will default to a zeros tensor, so that
centering and scaling are no-ops before training has begun.
So, with scaling and centering enabled the normalization equations
are as follows:
Let the intermediate activations for a mini-batch be the inputs.
For each sample x_i in inputs with
k features, we compute the mean and
variance of the sample:
mean_i = sum(x_i[j] for j in range(k)) / k
var_i = sum((x_i[j] - mean_i) ** 2 for j in range(k)) / k
and then compute a normalized
x_i_normalized, including a small factor
epsilon for numerical stability:
x_i_normalized = (x_i - mean_i) / sqrt(var_i + epsilon)
Finally, x_i_normalized is linearly transformed by gamma and beta,
which are learned parameters:
output_i = x_i_normalized * gamma + beta
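The equations above can be traced end to end for a single sample. The following is a minimal pure-Python sketch; the function name layer_norm_sample is hypothetical, and gamma and beta are assumed to hold one value per feature.

```python
import math

def layer_norm_sample(x_i, gamma, beta, epsilon=1e-3):
    """Apply the mean, variance, normalize, and affine steps to one sample."""
    k = len(x_i)
    mean_i = sum(x_i[j] for j in range(k)) / k
    var_i = sum((x_i[j] - mean_i) ** 2 for j in range(k)) / k
    x_i_normalized = [
        (x_i[j] - mean_i) / math.sqrt(var_i + epsilon) for j in range(k)
    ]
    # gamma scales and beta shifts each feature of the normalized sample.
    return [x_i_normalized[j] * gamma[j] + beta[j] for j in range(k)]

x_i = [1.0, 2.0, 3.0, 4.0]

# With gamma = 1 and beta = 0 the affine step is a no-op.
identity = layer_norm_sample(x_i, gamma=[1.0] * 4, beta=[0.0] * 4)

# Non-default gamma and beta rescale and shift the normalized values.
shifted = layer_norm_sample(x_i, gamma=[2.0] * 4, beta=[0.5] * 4)

print(identity)
print(shifted)
```

With the default-style parameters the output has a mean of about 0 and a variance slightly below 1, since epsilon enlarges the denominator.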
gamma and beta will span the axes of
inputs specified in axis, and
this part of the inputs' shape must be fully defined.
layer = tf.keras.layers.LayerNormalization(axis=[1, 2, 3])
layer.build([5, 20, 30, 40])
print(layer.beta.shape)
(20, 30, 40)
print(layer.gamma.shape)
(20, 30, 40)
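The shape rule can be expressed as a small helper that picks the input dimensions named by axis. This is an illustrative sketch only: param_shape is a hypothetical function, not part of the Keras API, and it assumes unknown dimensions are written as None.

```python
def param_shape(input_shape, axis):
    """Return the shape gamma and beta would take for the given axes."""
    axes = [axis] if isinstance(axis, int) else list(axis)
    ndim = len(input_shape)
    # Map negative axes such as -1 onto their non-negative equivalents.
    axes = [a + ndim if a < 0 else a for a in axes]
    shape = tuple(input_shape[a] for a in axes)
    # The normalized part of the input shape must be fully defined.
    if any(dim is None for dim in shape):
        raise ValueError("Normalized axes must have fully defined sizes.")
    return shape

print(param_shape([5, 20, 30, 40], axis=[1, 2, 3]))  # (20, 30, 40)
print(param_shape([None, 128], axis=-1))             # (128,)
```

An unknown batch dimension is fine as long as it is not one of the normalized axes; asking for axis=0 on the second input would raise the ValueError.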
Note that other implementations of layer normalization may choose to define
gamma and beta over a separate set of axes from the axes being
normalized across. For example, Group Normalization
(Wu et al. 2018) with group size of 1
corresponds to a Layer Normalization that normalizes across height, width,
and channel and has
gamma and beta span only the channel dimension.
So, this Layer Normalization implementation will not match a Group
Normalization layer with group size set to 1.
|axis|Integer or List/Tuple. The axis or axes to normalize across. Typically this is the features axis/axes. The left-out axes are typically the batch axis/axes. This argument defaults to -1, the last dimension in the input.|
|epsilon|Small float added to variance to avoid dividing by zero. Defaults to 1e-3.|
|center|If True, add offset of beta to the normalized tensor. If False, beta is ignored. Defaults to True.|
|scale|If True, multiply by gamma. If False, gamma is not used. Defaults to True.|
|beta_initializer|Initializer for the beta weight. Defaults to zeros.|
|gamma_initializer|Initializer for the gamma weight. Defaults to ones.|
|beta_regularizer|Optional regularizer for the beta weight. None by default.|
|gamma_regularizer|Optional regularizer for the gamma weight. None by default.|
|beta_constraint|Optional constraint for the beta weight. None by default.|
|gamma_constraint|Optional constraint for the gamma weight. None by default.|
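To see why epsilon is added to the variance, consider a sample whose features are all equal, so its variance is exactly zero. The sketch below uses a hypothetical pure-Python helper named normalize. In plain Python the unprotected division raises an exception, whereas floating-point tensor arithmetic would typically produce NaN instead.

```python
import math

def normalize(x_i, epsilon):
    """Normalize one sample, adding epsilon to the variance before the sqrt."""
    k = len(x_i)
    mean_i = sum(x_i) / k
    var_i = sum((v - mean_i) ** 2 for v in x_i) / k
    return [(v - mean_i) / math.sqrt(var_i + epsilon) for v in x_i]

constant = [3.0, 3.0, 3.0]  # zero variance

# With the default epsilon the result is well defined.
print(normalize(constant, epsilon=1e-3))  # [0.0, 0.0, 0.0]

# With epsilon = 0 the denominator is zero.
try:
    normalize(constant, epsilon=0.0)
except ZeroDivisionError:
    print("division by zero without epsilon")
```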
Input shape: Arbitrary. Use the keyword argument
input_shape (tuple of
integers, does not include the samples axis) when using this layer as the
first layer in a model.
Output shape: Same shape as input.