
Overfit and underfit


As always, the code in this example will use the tf.keras API, which you can learn more about in the TensorFlow Keras guide.

In both of the previous examples (classifying text and predicting fuel efficiency), we saw that the accuracy of our model on the validation data would peak after training for a number of epochs, and would then stagnate or start decreasing.

In other words, our model would overfit the training data. Learning how to deal with overfitting is important. Although it's often possible to achieve high accuracy on the training set, what we really want is to develop models that generalize well to a test set (or data they haven't seen before).

The opposite of overfitting is underfitting. Underfitting occurs when there is still room for improvement on the test data. This can happen for a number of reasons: if the model is not powerful enough, is over-regularized, or has simply not been trained long enough. It means the network has not learned the relevant patterns in the training data.

If you train for too long though, the model will start to overfit and learn patterns from the training data that don't generalize to the test data. We need to strike a balance. Understanding how to train for an appropriate number of epochs, as we'll explore below, is a useful skill.

To prevent overfitting, the best solution is to use more complete training data. The dataset should cover the full range of inputs that the model is expected to handle. Additional data may only be useful if it covers new and interesting cases.

A model trained on more complete data will naturally generalize better. When that is no longer possible, the next best solution is to use techniques like regularization. These place constraints on the quantity and type of information your model can store. If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most prominent patterns, which have a better chance of generalizing well.

In this notebook, we'll explore several common regularization techniques and use them to improve on a classification model.

Setup

Before getting started, import the necessary packages:

 import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import regularizers

print(tf.__version__)
 
2.2.0

 !pip install -q git+https://github.com/tensorflow/docs

import tensorflow_docs as tfdocs
import tensorflow_docs.modeling
import tensorflow_docs.plots
 
from IPython import display
from matplotlib import pyplot as plt

import numpy as np

import pathlib
import shutil
import tempfile

 
 logdir = pathlib.Path(tempfile.mkdtemp())/"tensorboard_logs"
shutil.rmtree(logdir, ignore_errors=True)
 

The Higgs dataset

The goal of this tutorial is not to do particle physics, so don't dwell on the details of the dataset. It contains 11,000,000 examples, each with 28 features and a binary class label.

 gz = tf.keras.utils.get_file('HIGGS.csv.gz', 'http://mlphysics.ics.uci.edu/data/higgs/HIGGS.csv.gz')
 
Downloading data from http://mlphysics.ics.uci.edu/data/higgs/HIGGS.csv.gz
2816409600/2816407858 [==============================] - 259s 0us/step

 FEATURES = 28
 

The tf.data.experimental.CsvDataset class can be used to read csv records directly from a gzip file with no intermediate decompression step.

 ds = tf.data.experimental.CsvDataset(gz,[float(),]*(FEATURES+1), compression_type="GZIP")
 

That csv reader class returns a list of scalars for each record. The following function repacks that list of scalars into a (feature_vector, label) pair.

 def pack_row(*row):
  label = row[0]
  features = tf.stack(row[1:],1)
  return features, label
 

TensorFlow is most efficient when operating on large batches of data.

So instead of repacking each row individually, create a new Dataset that takes batches of 10,000 examples, applies the pack_row function to each batch, and then splits the batches back into individual records:

 packed_ds = ds.batch(10000).map(pack_row).unbatch()
 

Have a look at some of the records from this new packed_ds.

The features are not perfectly normalized, but this is sufficient for this tutorial.

 for features,label in packed_ds.batch(1000).take(1):
  print(features[0])
  plt.hist(features.numpy().flatten(), bins = 101)
 
tf.Tensor(
[ 0.8692932  -0.6350818   0.22569026  0.32747006 -0.6899932   0.75420225
 -0.24857314 -1.0920639   0.          1.3749921  -0.6536742   0.9303491
  1.1074361   1.1389043  -1.5781983  -1.0469854   0.          0.65792954
 -0.01045457 -0.04576717  3.1019614   1.35376     0.9795631   0.97807616
  0.92000484  0.72165745  0.98875093  0.87667835], shape=(28,), dtype=float32)


To keep this tutorial relatively short, use just the first 1,000 samples for validation, and the next 10,000 for training:

 N_VALIDATION = int(1e3)
N_TRAIN = int(1e4)
BUFFER_SIZE = int(1e4)
BATCH_SIZE = 500
STEPS_PER_EPOCH = N_TRAIN//BATCH_SIZE
 

The Dataset.skip and Dataset.take methods make this easy.

At the same time, use the Dataset.cache method to ensure that the loader doesn't need to re-read the data from the file on each epoch:

 validate_ds = packed_ds.take(N_VALIDATION).cache()
train_ds = packed_ds.skip(N_VALIDATION).take(N_TRAIN).cache()
 
 train_ds
 
<CacheDataset shapes: ((28,), ()), types: (tf.float32, tf.float32)>

These datasets return individual examples. Use the .batch method to create batches of an appropriate size for training. Before batching, also remember to .shuffle and .repeat the training set.

 validate_ds = validate_ds.batch(BATCH_SIZE)
train_ds = train_ds.shuffle(BUFFER_SIZE).repeat().batch(BATCH_SIZE)
 

Demonstrate overfitting

The simplest way to prevent overfitting is to start with a small model: a model with a small number of learnable parameters (which is determined by the number of layers and the number of units per layer). In deep learning, the number of learnable parameters in a model is often referred to as the model's "capacity".

Intuitively, a model with more parameters will have more "memorization capacity" and will therefore be able to easily learn a perfect dictionary-like mapping between training samples and their targets, a mapping without any generalization power, but this would be useless when making predictions on previously unseen data.

Always keep this in mind: deep learning models tend to be good at fitting the training data, but the real challenge is generalization, not fitting.

On the other hand, if the network has limited memorization resources, it will not be able to learn the mapping as easily. To minimize its loss, it will have to learn compressed representations that have more predictive power. At the same time, if you make your model too small, it will have difficulty fitting the training data. There is a balance between "too much capacity" and "not enough capacity".

Unfortunately, there is no magical formula to determine the right size or architecture of your model (in terms of the number of layers, or the right size for each layer). You will have to experiment using a series of different architectures.

To find an appropriate model size, it's best to start with relatively few layers and parameters, then begin increasing the size of the layers or adding new layers until you see diminishing returns on the validation loss.

Start with a simple model using only layers.Dense as a baseline, then create larger versions and compare them.

Training procedure

Many models train better if you gradually reduce the learning rate during training. Use optimizers.schedules to reduce the learning rate over time:

 lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
  0.001,
  decay_steps=STEPS_PER_EPOCH*1000,
  decay_rate=1,
  staircase=False)

def get_optimizer():
  return tf.keras.optimizers.Adam(lr_schedule)
 

The code above sets a schedules.InverseTimeDecay to hyperbolically decrease the learning rate to 1/2 of the base rate at 1000 epochs, 1/3 at 2000 epochs, and so on.
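
As a quick sanity check (this snippet is not part of the original notebook), you can evaluate the schedule directly at the step counts that correspond to those epochs:

print(lr_schedule(STEPS_PER_EPOCH * 1000).numpy())  # -> 0.0005, half of the base rate
print(lr_schedule(STEPS_PER_EPOCH * 2000).numpy())  # -> ~0.00033, a third of the base rate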

 step = np.linspace(0,100000)
lr = lr_schedule(step)
plt.figure(figsize = (8,6))
plt.plot(step/STEPS_PER_EPOCH, lr)
plt.ylim([0,max(plt.ylim())])
plt.xlabel('Epoch')
_ = plt.ylabel('Learning Rate')

 


Each model in this tutorial will use the same training configuration, so set these up in a reusable way, starting with the list of callbacks.

The training for this tutorial runs for many short epochs. To reduce the logging noise, use tfdocs.EpochDots, which simply prints a . for each epoch and a full set of metrics every 100 epochs.
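
If you would rather not depend on tensorflow_docs, the following rough sketch (an illustration only, not the actual tfdocs implementation) shows how an EpochDots-style callback could be written by hand:

class SimpleEpochDots(tf.keras.callbacks.Callback):
  """Prints a '.' per epoch and a full line of metrics every 100 epochs."""
  def on_epoch_end(self, epoch, logs=None):
    if epoch % 100 == 0:
      print()
      print('Epoch: {}, '.format(epoch) +
            ',  '.join('{}:{:0.4f}'.format(k, v) for k, v in (logs or {}).items()))
    print('.', end='')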

Next include callbacks.EarlyStopping to avoid long and unnecessary training times. Note that this callback is set to monitor val_binary_crossentropy, not val_loss. This difference will be important later on.

Use callbacks.TensorBoard to generate TensorBoard logs for the training.

 def get_callbacks(name):
  return [
    tfdocs.modeling.EpochDots(),
    tf.keras.callbacks.EarlyStopping(monitor='val_binary_crossentropy', patience=200),
    tf.keras.callbacks.TensorBoard(logdir/name),
  ]
 

Similarly, each model will use the same Model.compile and Model.fit settings:

 def compile_and_fit(model, name, optimizer=None, max_epochs=10000):
  if optimizer is None:
    optimizer = get_optimizer()
  model.compile(optimizer=optimizer,
                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                metrics=[
                  tf.keras.losses.BinaryCrossentropy(
                      from_logits=True, name='binary_crossentropy'),
                  'accuracy'])

  model.summary()

  history = model.fit(
    train_ds,
    steps_per_epoch = STEPS_PER_EPOCH,
    epochs=max_epochs,
    validation_data=validate_ds,
    callbacks=get_callbacks(name),
    verbose=0)
  return history
 

Tiny model

Start by training a model:

 tiny_model = tf.keras.Sequential([
    layers.Dense(16, activation='elu', input_shape=(FEATURES,)),
    layers.Dense(1)
])
 
 size_histories = {}
 
 size_histories['Tiny'] = compile_and_fit(tiny_model, 'sizes/Tiny')
 
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 16)                464       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
=================================================================
Total params: 481
Trainable params: 481
Non-trainable params: 0
_________________________________________________________________

Epoch: 0, accuracy:0.4995,  binary_crossentropy:0.7955,  loss:0.7955,  val_accuracy:0.5140,  val_binary_crossentropy:0.7285,  val_loss:0.7285,  
....................................................................................................
Epoch: 100, accuracy:0.5907,  binary_crossentropy:0.6291,  loss:0.6291,  val_accuracy:0.5790,  val_binary_crossentropy:0.6281,  val_loss:0.6281,  
....................................................................................................
Epoch: 200, accuracy:0.6161,  binary_crossentropy:0.6164,  loss:0.6164,  val_accuracy:0.5890,  val_binary_crossentropy:0.6189,  val_loss:0.6189,  
....................................................................................................
Epoch: 300, accuracy:0.6319,  binary_crossentropy:0.6059,  loss:0.6059,  val_accuracy:0.6250,  val_binary_crossentropy:0.6072,  val_loss:0.6072,  
....................................................................................................
Epoch: 400, accuracy:0.6423,  binary_crossentropy:0.5992,  loss:0.5992,  val_accuracy:0.6240,  val_binary_crossentropy:0.6027,  val_loss:0.6027,  
....................................................................................................
Epoch: 500, accuracy:0.6610,  binary_crossentropy:0.5921,  loss:0.5921,  val_accuracy:0.6210,  val_binary_crossentropy:0.6000,  val_loss:0.6000,  
....................................................................................................
Epoch: 600, accuracy:0.6651,  binary_crossentropy:0.5882,  loss:0.5882,  val_accuracy:0.6330,  val_binary_crossentropy:0.5962,  val_loss:0.5962,  
....................................................................................................
Epoch: 700, accuracy:0.6654,  binary_crossentropy:0.5858,  loss:0.5858,  val_accuracy:0.6630,  val_binary_crossentropy:0.5916,  val_loss:0.5916,  
....................................................................................................
Epoch: 800, accuracy:0.6681,  binary_crossentropy:0.5829,  loss:0.5829,  val_accuracy:0.6620,  val_binary_crossentropy:0.5911,  val_loss:0.5911,  
....................................................................................................
Epoch: 900, accuracy:0.6735,  binary_crossentropy:0.5813,  loss:0.5813,  val_accuracy:0.6580,  val_binary_crossentropy:0.5906,  val_loss:0.5906,  
....................................................................................................
Epoch: 1000, accuracy:0.6744,  binary_crossentropy:0.5794,  loss:0.5794,  val_accuracy:0.6590,  val_binary_crossentropy:0.5896,  val_loss:0.5896,  
....................................................................................................
Epoch: 1100, accuracy:0.6791,  binary_crossentropy:0.5782,  loss:0.5782,  val_accuracy:0.6470,  val_binary_crossentropy:0.5913,  val_loss:0.5913,  
....................................................................................................
Epoch: 1200, accuracy:0.6770,  binary_crossentropy:0.5764,  loss:0.5764,  val_accuracy:0.6690,  val_binary_crossentropy:0.5879,  val_loss:0.5879,  
....................................................................................................
Epoch: 1300, accuracy:0.6773,  binary_crossentropy:0.5759,  loss:0.5759,  val_accuracy:0.6760,  val_binary_crossentropy:0.5873,  val_loss:0.5873,  
....................................................................................................
Epoch: 1400, accuracy:0.6818,  binary_crossentropy:0.5743,  loss:0.5743,  val_accuracy:0.6610,  val_binary_crossentropy:0.5879,  val_loss:0.5879,  
....................................................................................................
Epoch: 1500, accuracy:0.6835,  binary_crossentropy:0.5735,  loss:0.5735,  val_accuracy:0.6520,  val_binary_crossentropy:0.5916,  val_loss:0.5916,  
....................................................................................................
Epoch: 1600, accuracy:0.6869,  binary_crossentropy:0.5722,  loss:0.5722,  val_accuracy:0.6650,  val_binary_crossentropy:0.5881,  val_loss:0.5881,  
....................................................................................................
Epoch: 1700, accuracy:0.6823,  binary_crossentropy:0.5716,  loss:0.5716,  val_accuracy:0.6710,  val_binary_crossentropy:0.5865,  val_loss:0.5865,  
.......................................................................

Now check how the model did:

 plotter = tfdocs.plots.HistoryPlotter(metric = 'binary_crossentropy', smoothing_std=10)
plotter.plot(size_histories)
plt.ylim([0.5, 0.7])
 
(0.5, 0.7)


Small model

To see if you can beat the performance of the small model, progressively train some larger models.

Try two hidden layers with 16 units each:

 small_model = tf.keras.Sequential([
    # `input_shape` is only required here so that `.summary` works.
    layers.Dense(16, activation='elu', input_shape=(FEATURES,)),
    layers.Dense(16, activation='elu'),
    layers.Dense(1)
])
 
 size_histories['Small'] = compile_and_fit(small_model, 'sizes/Small')
 
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_2 (Dense)              (None, 16)                464       
_________________________________________________________________
dense_3 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 17        
=================================================================
Total params: 753
Trainable params: 753
Non-trainable params: 0
_________________________________________________________________

Epoch: 0, accuracy:0.5282,  binary_crossentropy:0.8122,  loss:0.8122,  val_accuracy:0.5050,  val_binary_crossentropy:0.7110,  val_loss:0.7110,  
....................................................................................................
Epoch: 100, accuracy:0.6249,  binary_crossentropy:0.6134,  loss:0.6134,  val_accuracy:0.6370,  val_binary_crossentropy:0.6107,  val_loss:0.6107,  
....................................................................................................
Epoch: 200, accuracy:0.6576,  binary_crossentropy:0.5936,  loss:0.5936,  val_accuracy:0.6490,  val_binary_crossentropy:0.5916,  val_loss:0.5916,  
....................................................................................................
Epoch: 300, accuracy:0.6760,  binary_crossentropy:0.5808,  loss:0.5808,  val_accuracy:0.6490,  val_binary_crossentropy:0.5890,  val_loss:0.5890,  
....................................................................................................
Epoch: 400, accuracy:0.6842,  binary_crossentropy:0.5739,  loss:0.5739,  val_accuracy:0.6580,  val_binary_crossentropy:0.5877,  val_loss:0.5877,  
....................................................................................................
Epoch: 500, accuracy:0.6919,  binary_crossentropy:0.5694,  loss:0.5694,  val_accuracy:0.6680,  val_binary_crossentropy:0.5875,  val_loss:0.5875,  
...........................................

Medium model

Now try 3 hidden layers with 64 units each:

 medium_model = tf.keras.Sequential([
    layers.Dense(64, activation='elu', input_shape=(FEATURES,)),
    layers.Dense(64, activation='elu'),
    layers.Dense(64, activation='elu'),
    layers.Dense(1)
])
 

And train the model using the same data:

 size_histories['Medium']  = compile_and_fit(medium_model, "sizes/Medium")
 
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_5 (Dense)              (None, 64)                1856      
_________________________________________________________________
dense_6 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_7 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 65        
=================================================================
Total params: 10,241
Trainable params: 10,241
Non-trainable params: 0
_________________________________________________________________

Epoch: 0, accuracy:0.4730,  binary_crossentropy:0.7116,  loss:0.7116,  val_accuracy:0.4760,  val_binary_crossentropy:0.6788,  val_loss:0.6788,  
....................................................................................................
Epoch: 100, accuracy:0.7094,  binary_crossentropy:0.5330,  loss:0.5330,  val_accuracy:0.6720,  val_binary_crossentropy:0.5980,  val_loss:0.5980,  
....................................................................................................
Epoch: 200, accuracy:0.7866,  binary_crossentropy:0.4317,  loss:0.4317,  val_accuracy:0.6510,  val_binary_crossentropy:0.6785,  val_loss:0.6785,  
............................................................................

Large model

As an exercise, you can create an even larger model and check how quickly it begins overfitting. Next, let's add to this benchmark a network with much more capacity, far more than the problem would warrant:

 large_model = tf.keras.Sequential([
    layers.Dense(512, activation='elu', input_shape=(FEATURES,)),
    layers.Dense(512, activation='elu'),
    layers.Dense(512, activation='elu'),
    layers.Dense(512, activation='elu'),
    layers.Dense(1)
])
 

And, again, train the model using the same data:

 size_histories['large'] = compile_and_fit(large_model, "sizes/large")
 
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_9 (Dense)              (None, 512)               14848     
_________________________________________________________________
dense_10 (Dense)             (None, 512)               262656    
_________________________________________________________________
dense_11 (Dense)             (None, 512)               262656    
_________________________________________________________________
dense_12 (Dense)             (None, 512)               262656    
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 513       
=================================================================
Total params: 803,329
Trainable params: 803,329
Non-trainable params: 0
_________________________________________________________________

Epoch: 0, accuracy:0.5132,  binary_crossentropy:0.7929,  loss:0.7929,  val_accuracy:0.4760,  val_binary_crossentropy:0.6944,  val_loss:0.6944,  
....................................................................................................
Epoch: 100, accuracy:1.0000,  binary_crossentropy:0.0021,  loss:0.0021,  val_accuracy:0.6700,  val_binary_crossentropy:1.7392,  val_loss:1.7392,  
....................................................................................................
Epoch: 200, accuracy:1.0000,  binary_crossentropy:0.0001,  loss:0.0001,  val_accuracy:0.6700,  val_binary_crossentropy:2.3772,  val_loss:2.3772,  
.......................

Plot the training and validation losses

The solid lines show the training loss, and the dashed lines show the validation loss (remember: a lower validation loss indicates a better model).

While building a larger model gives it more power, if this power is not constrained somehow it can easily overfit to the training set.

In this example, typically only the "Tiny" model manages to avoid overfitting altogether, and each of the larger models overfits the data more quickly. This becomes so severe for the "large" model that you need to switch the plot to a log scale to really see what's happening.

This is apparent if you plot and compare the validation metrics to the training metrics.

  • It's normal for there to be a small difference.
  • If both metrics are moving in the same direction, everything is fine.
  • If the validation metric begins to stagnate while the training metric continues to improve, you are probably close to overfitting.
  • If the validation metric is going in the wrong direction, the model is clearly overfitting.
 plotter.plot(size_histories)
a = plt.xscale('log')
plt.xlim([5, max(plt.xlim())])
plt.ylim([0.5, 0.7])
plt.xlabel("Epochs [Log Scale]")
 
Text(0.5, 0, 'Epochs [Log Scale]')


View in TensorBoard

These models all wrote TensorBoard logs during training.

Open an embedded TensorBoard viewer inside the notebook:

 
# Load the TensorBoard notebook extension
%load_ext tensorboard

# Open an embedded TensorBoard viewer
%tensorboard --logdir {logdir}/sizes
 

You can view the results of a previous run of this notebook on TensorBoard.dev.

TensorBoard.dev is a managed experience for hosting, tracking, and sharing ML experiments with everyone.

It's also included in an <iframe> for convenience:

 display.IFrame(
    src="https://tensorboard.dev/experiment/vW7jmmF9TmKmy3rbheMQpw/#scalars&_smoothingWeight=0.97",
    width="100%", height="800px")
 

If you want to share TensorBoard results, you can upload the logs to TensorBoard.dev by copying the following into a code cell.

tensorboard dev upload --logdir  {logdir}/sizes

Strategies to prevent overfitting

Before getting into the content of this section, copy the training logs from the "Tiny" model above to use as a baseline for comparison.

 shutil.rmtree(logdir/'regularizers/Tiny', ignore_errors=True)
shutil.copytree(logdir/'sizes/Tiny', logdir/'regularizers/Tiny')
 
PosixPath('/tmp/tmpxq_r4ocw/tensorboard_logs/regularizers/Tiny')
 regularizer_histories = {}
regularizer_histories['Tiny'] = size_histories['Tiny']
 

Add weight regularization

You may be familiar with Occam's Razor principle: given two explanations for something, the explanation most likely to be correct is the "simplest" one, the one that makes the fewest assumptions. This also applies to the models learned by neural networks: given some training data and a network architecture, there are multiple sets of weight values (multiple models) that could explain the data, and simpler models are less likely to overfit than complex ones.

A "simple model" in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters altogether, as we saw in the section above). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more "regular". This is called "weight regularization", and it is done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:

  • L1 regularization, where the cost added is proportional to the absolute value of the weight coefficients (i.e. to what is called the "L1 norm" of the weights).

  • L2 regularization, where the cost added is proportional to the square of the value of the weight coefficients (i.e. to what is called the squared "L2 norm" of the weights). L2 regularization is also called weight decay in the context of neural networks. Don't let the different name confuse you: weight decay is mathematically the exact same as L2 regularization.

L1 regularization pushes weights towards exactly zero, encouraging a sparse model. L2 regularization will penalize the weight parameters without making them sparse, since the penalty goes to zero for small weights, which is one reason why L2 is more common.
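
For reference, here is a minimal sketch (not part of the original tutorial) of the three regularizer variants exposed by tf.keras.regularizers; this tutorial only uses the L2 variant below:

l1_reg = regularizers.l1(0.001)                     # cost proportional to |w|
l2_reg = regularizers.l2(0.001)                     # cost proportional to w**2
l1_l2_reg = regularizers.l1_l2(l1=0.001, l2=0.001)  # both penalties combined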

In tf.keras, weight regularization is added by passing weight regularizer instances to layers as keyword arguments. Let's add L2 weight regularization now.

 l2_model = tf.keras.Sequential([
    layers.Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001),
                 input_shape=(FEATURES,)),
    layers.Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(1)
])

regularizer_histories['l2'] = compile_and_fit(l2_model, "regularizers/l2")
 
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_14 (Dense)             (None, 512)               14848     
_________________________________________________________________
dense_15 (Dense)             (None, 512)               262656    
_________________________________________________________________
dense_16 (Dense)             (None, 512)               262656    
_________________________________________________________________
dense_17 (Dense)             (None, 512)               262656    
_________________________________________________________________
dense_18 (Dense)             (None, 1)                 513       
=================================================================
Total params: 803,329
Trainable params: 803,329
Non-trainable params: 0
_________________________________________________________________

Epoch: 0, accuracy:0.5123,  binary_crossentropy:0.8077,  loss:2.3264,  val_accuracy:0.4860,  val_binary_crossentropy:0.6764,  val_loss:2.1163,  
....................................................................................................
Epoch: 100, accuracy:0.6615,  binary_crossentropy:0.5958,  loss:0.6192,  val_accuracy:0.6500,  val_binary_crossentropy:0.5956,  val_loss:0.6191,  
....................................................................................................
Epoch: 200, accuracy:0.6715,  binary_crossentropy:0.5825,  loss:0.6049,  val_accuracy:0.6760,  val_binary_crossentropy:0.5766,  val_loss:0.5991,  
....................................................................................................
Epoch: 300, accuracy:0.6760,  binary_crossentropy:0.5783,  loss:0.6018,  val_accuracy:0.6900,  val_binary_crossentropy:0.5792,  val_loss:0.6027,  
....................................................................................................
Epoch: 400, accuracy:0.6869,  binary_crossentropy:0.5723,  loss:0.5963,  val_accuracy:0.6990,  val_binary_crossentropy:0.5761,  val_loss:0.6012,  
....................................................................................................
Epoch: 500, accuracy:0.6877,  binary_crossentropy:0.5656,  loss:0.5900,  val_accuracy:0.6850,  val_binary_crossentropy:0.5768,  val_loss:0.6012,  
.........................

l2(0.001) means that every coefficient in the weight matrix of the layer will add 0.001 * weight_coefficient_value**2 to the total loss of the network.

That is why we're monitoring the binary_crossentropy directly: it doesn't have this regularization component mixed in.
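
To see that this is really what l2(0.001) adds, here is a rough check (not in the original tutorial) that rebuilds the penalty by hand from the kernels of the four regularized layers and compares it with the losses Keras tracks on the model:

# Only the first four Dense layers carry a kernel_regularizer; the output layer does not.
manual_penalty = tf.add_n(
    [0.001 * tf.reduce_sum(tf.square(layer.kernel)) for layer in l2_model.layers[:-1]])
tracked_penalty = tf.add_n(l2_model.losses)
print(manual_penalty.numpy(), tracked_penalty.numpy())  # the two values should match closely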

So, that same "Large" model with an L2 regularization penalty performs much better:

 plotter.plot(regularizer_histories)
plt.ylim([0.5, 0.7])
 
(0.5, 0.7)


As you can see, the "L2" regularized model is now much more competitive with the "Tiny" model. This "L2" model is also much more resistant to overfitting than the "Large" model it was based on, despite having the same number of parameters.

More info

There are two important things to note about this sort of regularization.

First: if you are writing your own training loop, then you need to be sure to ask the model for its regularization losses.

 result = l2_model(features)
regularization_loss=tf.add_n(l2_model.losses)
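# For context (a minimal sketch, not part of the original tutorial): a custom
# training step that folds those tracked regularization losses into the total loss.
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)
custom_optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(batch_features, batch_labels):
  with tf.GradientTape() as tape:
    logits = l2_model(batch_features, training=True)
    # Data loss plus the weight penalties collected on the model.
    total_loss = loss_fn(batch_labels, logits) + tf.add_n(l2_model.losses)
  grads = tape.gradient(total_loss, l2_model.trainable_variables)
  custom_optimizer.apply_gradients(zip(grads, l2_model.trainable_variables))
  return total_loss

# Example usage on a single batch from the training set:
for batch_features, batch_labels in train_ds.take(1):
  train_step(batch_features, batch_labels)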
 

Second: this implementation works by adding the weight penalties to the model's loss, and then applying a standard optimization procedure after that.

There is a second approach that instead only runs the optimizer on the raw loss, and then, while applying the calculated step, the optimizer also applies some weight decay. This "decoupled weight decay" is seen in optimizers like optimizers.FTRL and optimizers.AdamW.
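
As a minimal sketch of that second approach (not part of this tutorial, and assuming the separate tensorflow_addons package is installed via pip install tensorflow-addons), a decoupled-weight-decay optimizer can be built and passed to compile_and_fit in place of the default Adam optimizer:

import tensorflow_addons as tfa

# Here the decay is applied directly to the weights at each optimizer step,
# instead of being added to the loss as a penalty term.
adamw_optimizer = tfa.optimizers.AdamW(weight_decay=1e-4, learning_rate=lr_schedule)
# e.g. compile_and_fit(your_model, "regularizers/adamw", optimizer=adamw_optimizer)
# ("your_model" above is a placeholder for whichever model you want to train this way.)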

Add dropout

Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Hinton and his students at the University of Toronto.

The intuitive explanation for dropout is that because individual nodes in the network cannot rely on the output of the others, each node must output features that are useful on their own.

Dropout, applied to a layer, consists of randomly "dropping out" (i.e. setting to zero) a number of output features of the layer during training. Say a given layer would normally have returned a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training; after applying dropout, this vector will have a few zero entries distributed at random, e.g. [0, 0.5, 1.3, 0, 1.1].

The "dropout rate" is the fraction of the features that are being zeroed out; it is usually set between 0.2 and 0.5. At test time, no units are dropped out, and instead the layer's output values are scaled down by a factor equal to the dropout rate, to balance for the fact that more units are active than at training time.

In tf.keras you can introduce dropout in a network via the Dropout layer, which gets applied to the output of the layer right before it.
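
As a quick illustration (not from the original tutorial) of what that layer does, apply it to a small tensor of ones. Note that tf.keras implements the equivalent "inverted" form of dropout: the surviving activations are scaled up by 1/(1 - rate) during training, so no rescaling is needed at inference time.

demo_dropout = tf.keras.layers.Dropout(0.5)
x = tf.ones((1, 8))
print(demo_dropout(x, training=True))   # roughly half the entries zeroed, the rest scaled to 2.0
print(demo_dropout(x, training=False))  # unchanged: dropout is inactive at inference time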

Let's add two dropout layers to our network to see how well they do at reducing overfitting:

 dropout_model = tf.keras.Sequential([
    layers.Dense(512, activation='elu', input_shape=(FEATURES,)),
    layers.Dropout(0.5),
    layers.Dense(512, activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(512, activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(512, activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(1)
])

regularizer_histories['dropout'] = compile_and_fit(dropout_model, "regularizers/dropout")
 
Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_19 (Dense)             (None, 512)               14848     
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_20 (Dense)             (None, 512)               262656    
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_21 (Dense)             (None, 512)               262656    
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_22 (Dense)             (None, 512)               262656    
_________________________________________________________________
dropout_3 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_23 (Dense)             (None, 1)                 513       
=================================================================
Total params: 803,329
Trainable params: 803,329
Non-trainable params: 0
_________________________________________________________________

Epoch: 0, accuracy:0.5072,  binary_crossentropy:0.7935,  loss:0.7935,  val_accuracy:0.5700,  val_binary_crossentropy:0.6825,  val_loss:0.6825,  
....................................................................................................
Epoch: 100, accuracy:0.6652,  binary_crossentropy:0.5943,  loss:0.5943,  val_accuracy:0.6690,  val_binary_crossentropy:0.5810,  val_loss:0.5810,  
....................................................................................................
Epoch: 200, accuracy:0.6849,  binary_crossentropy:0.5543,  loss:0.5543,  val_accuracy:0.6800,  val_binary_crossentropy:0.5859,  val_loss:0.5859,  
....................................................................................................
Epoch: 300, accuracy:0.7211,  binary_crossentropy:0.5089,  loss:0.5089,  val_accuracy:0.6790,  val_binary_crossentropy:0.6065,  val_loss:0.6065,  
..........................
 plotter.plot(regularizer_histories)
plt.ylim([0.5, 0.7])
 
(0.5, 0.7)


It's clear from this plot that both of these regularization approaches improve the behavior of the "Large" model. But this still doesn't beat even the "Tiny" baseline.

Next, try them both together, and see if that does better.

Combined L2 + dropout

 combined_model = tf.keras.Sequential([
    layers.Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu', input_shape=(FEATURES,)),
    layers.Dropout(0.5),
    layers.Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(1)
])

regularizer_histories['combined'] = compile_and_fit(combined_model, "regularizers/combined")
 
Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_24 (Dense)             (None, 512)               14848     
_________________________________________________________________
dropout_4 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_25 (Dense)             (None, 512)               262656    
_________________________________________________________________
dropout_5 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_26 (Dense)             (None, 512)               262656    
_________________________________________________________________
dropout_6 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_27 (Dense)             (None, 512)               262656    
_________________________________________________________________
dropout_7 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_28 (Dense)             (None, 1)                 513       
=================================================================
Total params: 803,329
Trainable params: 803,329
Non-trainable params: 0
_________________________________________________________________

Epoch: 0, accuracy:0.5074,  binary_crossentropy:0.8059,  loss:0.9643,  val_accuracy:0.4720,  val_binary_crossentropy:0.6770,  val_loss:0.8347,  
....................................................................................................
Epoch: 100, accuracy:0.6461,  binary_crossentropy:0.6069,  loss:0.6370,  val_accuracy:0.6610,  val_binary_crossentropy:0.5833,  val_loss:0.6132,  
....................................................................................................
Epoch: 200, accuracy:0.6613,  binary_crossentropy:0.5935,  loss:0.6191,  val_accuracy:0.6800,  val_binary_crossentropy:0.5722,  val_loss:0.5978,  
....................................................................................................
Epoch: 300, accuracy:0.6704,  binary_crossentropy:0.5882,  loss:0.6158,  val_accuracy:0.6720,  val_binary_crossentropy:0.5614,  val_loss:0.5889,  
....................................................................................................
Epoch: 400, accuracy:0.6715,  binary_crossentropy:0.5796,  loss:0.6090,  val_accuracy:0.6930,  val_binary_crossentropy:0.5630,  val_loss:0.5923,  
....................................................................................................
Epoch: 500, accuracy:0.6768,  binary_crossentropy:0.5732,  loss:0.6043,  val_accuracy:0.6850,  val_binary_crossentropy:0.5603,  val_loss:0.5915,  
....................................................................................................
Epoch: 600, accuracy:0.6794,  binary_crossentropy:0.5709,  loss:0.6034,  val_accuracy:0.6860,  val_binary_crossentropy:0.5521,  val_loss:0.5847,  
....................................................................................................
Epoch: 700, accuracy:0.6851,  binary_crossentropy:0.5633,  loss:0.5972,  val_accuracy:0.6940,  val_binary_crossentropy:0.5366,  val_loss:0.5705,  
....................................................................................................
Epoch: 800, accuracy:0.6825,  binary_crossentropy:0.5624,  loss:0.5976,  val_accuracy:0.7000,  val_binary_crossentropy:0.5465,  val_loss:0.5816,  
....................................................................................................
Epoch: 900, accuracy:0.6931,  binary_crossentropy:0.5578,  loss:0.5940,  val_accuracy:0.6970,  val_binary_crossentropy:0.5375,  val_loss:0.5737,  
.
 plotter.plot(regularizer_histories)
plt.ylim([0.5, 0.7])
 
(0.5, 0.7)


This model with the "Combined" regularization is obviously the best one so far.

View in TensorBoard

These models also recorded TensorBoard logs.

To open an embedded TensorBoard viewer inside the notebook, copy the following into a code cell:

 %tensorboard --logdir {logdir}/regularizers
 

You can view the results of a previous run of this notebook on TensorBoard.dev.

It's also included in an <iframe> for convenience:

 display.IFrame(
    src="https://tensorboard.dev/experiment/fGInKDo8TXes1z7HQku9mw/#scalars&_smoothingWeight=0.97",
    width = "100%",
    height="800px")

 

It was uploaded with:

tensorboard dev upload --logdir  {logdir}/regularizers

Conclusions

To recap, here are the most common ways to prevent overfitting in neural networks:

  • Get more training data.
  • Reduce the capacity of the network.
  • Add weight regularization.
  • Add dropout.

Two important approaches not covered in this guide are:

  • Data augmentation
  • Batch normalization

Remember: each of these methods can help on its own, but often combining them can be even more effective.

 
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.