آموزش توزیع شده با Keras

مشاهده در TensorFlow.org در Google Colab اجرا کنید مشاهده منبع در GitHub دانلود دفترچه یادداشت

بررسی اجمالی

tf.distribute.Strategy API یک تجرید برای توزیع های آموزشی خود را در سراسر واحد پردازشی را فراهم می کند. این به شما امکان می دهد با استفاده از مدل های موجود و کد آموزشی با حداقل تغییرات ، آموزش های توزیع شده را انجام دهید.

این آموزش نشان می دهد که چگونه به استفاده از tf.distribute.MirroredStrategy به انجام در نمودار تکرار با آموزش همزمان در بسیاری از پردازنده بر روی یک دستگاه. این استراتژی اساساً همه متغیرهای مدل را در هر پردازنده کپی می کند. سپس، آن را با استفاده از همه کاهش به ترکیب شیب از تمام پردازنده ها، و ارزش ترکیبی برای تمام نسخه از مدل اعمال می شود.

شما را به استفاده از tf.keras رابط های برنامه کاربردی برای ساخت مدل و Model.fit برای آموزش آن. (برای کسب اطلاعات در مورد آموزش توزیع شده با یک حلقه آموزش سفارشی و MirroredStrategy ، چک کردن این آموزش .)

MirroredStrategy مدل خود را بر روی GPU های متعدد در یک دستگاه واحد آموزش می دهد. برای آموزش همزمان در بسیاری از پردازنده بر روی کارگران متعدد، استفاده از tf.distribute.MultiWorkerMirroredStrategy با Keras Model.fit یا یک حلقه آموزش سفارشی . برای گزینه های دیگر، به مراجعه راهنمای آموزش توزیع شده .

برای یادگیری در مورد استراتژی های مختلف دیگر، است که وجود دارد آموزش توزیع شده با TensorFlow راهنمای.

برپایی

import tensorflow_datasets as tfds
import tensorflow as tf

import os

# Load the TensorBoard notebook extension.
%load_ext tensorboard
2021-08-04 01:24:55.165631: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
print(tf.__version__)
2.5.0

مجموعه داده را بارگیری کنید

بار مجموعه داده MNIST از TensorFlow مجموعه داده . این یک مجموعه داده در بر می گرداند tf.data فرمت.

تنظیم with_info آرگومان به True شامل ابرداده برای کل مجموعه داده است که در اینجا به ذخیره info . از جمله موارد دیگر ، این شیء فراداده شامل تعداد نمونه قطار و آزمایش است.

datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)

mnist_train, mnist_test = datasets['train'], datasets['test']
2021-08-04 01:25:00.048530: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-08-04 01:25:00.691099: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:00.691993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:00:05.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-08-04 01:25:00.692033: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-04 01:25:00.695439: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-08-04 01:25:00.695536: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-08-04 01:25:00.696685: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-08-04 01:25:00.697009: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-08-04 01:25:00.698067: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-08-04 01:25:00.698998: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-08-04 01:25:00.699164: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-04 01:25:00.699264: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:00.700264: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:00.701157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-08-04 01:25:00.701928: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-04 01:25:00.702642: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:00.703535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:00:05.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-08-04 01:25:00.703621: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:00.704507: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:00.705349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-08-04 01:25:00.705388: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-04 01:25:01.356483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-04 01:25:01.356521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 
2021-08-04 01:25:01.356530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N 
2021-08-04 01:25:01.356777: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:01.357792: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:01.358756: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:01.359641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14646 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0)

استراتژی توزیع را مشخص کنید

درست MirroredStrategy شی. این توزیع رسیدگی و ارائه یک مدیر زمینه ( MirroredStrategy.scope ) برای ساخت در داخل مدل خود را.

strategy = tf.distribute.MirroredStrategy()
WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
Number of devices: 1

خط لوله ورودی را تنظیم کنید

هنگام آموزش یک مدل با GPU های متعدد ، می توانید با افزایش اندازه دسته ، از قدرت محاسباتی اضافی به طور مثر استفاده کنید. به طور کلی ، از بزرگترین اندازه دسته ای متناسب با حافظه GPU استفاده کنید و بر این اساس میزان یادگیری را تنظیم کنید.

# You can also do info.splits.total_num_examples to get the total
# number of examples in the dataset.

num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples

BUFFER_SIZE = 10000

BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

تعریف یک تابع است که عادی مقادیر پیکسل تصویر از [0, 255] وسیعی به [0, 1] محدوده ( پوسته پوسته شدن ویژگی ):

def scale(image, label):
  image = tf.cast(image, tf.float32)
  image /= 255

  return image, label

درخواست این scale تابع برای آموزش و آزمون داده ها، و سپس با استفاده از tf.data.Dataset رابط های برنامه کاربردی به زدن داده های آموزشی ( Dataset.shuffle )، و دسته ای آن را ( Dataset.batch ). توجه داشته باشید که شما نیز نگه داشتن کش در حافظه از داده های آموزشی به منظور بهبود عملکرد ( Dataset.cache ).

train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)

مدل را ایجاد کنید

ایجاد و کامپایل مدل Keras در زمینه Strategy.scope :

with strategy.scope():
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10)
  ])

  model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

تماس های تلفنی را تعریف کنید

تعریف زیر tf.keras.callbacks :

برای مقاصد گویا، اضافه کردن مخاطبین سفارشی به نام PrintLR برای نمایش نرخ آموزش در نوت بوک.

# Define the checkpoint directory to store the checkpoints.
checkpoint_dir = './training_checkpoints'
# Define the name of the checkpoint files.
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
# Define a function for decaying the learning rate.
# You can define any decay function you need.
def decay(epoch):
  if epoch < 3:
    return 1e-3
  elif epoch >= 3 and epoch < 7:
    return 1e-4
  else:
    return 1e-5
# Define a callback for printing the learning rate at the end of each epoch.
class PrintLR(tf.keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs=None):
    print('\nLearning rate for epoch {} is {}'.format(epoch + 1,
                                                      model.optimizer.lr.numpy()))
# Put all the callbacks together.
callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir='./logs'),
    tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
                                       save_weights_only=True),
    tf.keras.callbacks.LearningRateScheduler(decay),
    PrintLR()
]
2021-08-04 01:25:02.054144: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session initializing.
2021-08-04 01:25:02.054179: I tensorflow/core/profiler/lib/profiler_session.cc:141] Profiler session started.
2021-08-04 01:25:02.054232: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1611] Profiler found 1 GPUs
2021-08-04 01:25:02.098001: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcupti.so.11.2
2021-08-04 01:25:02.288095: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session tear down.
2021-08-04 01:25:02.292220: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1743] CUPTI activity buffer flushed

آموزش و ارزیابی کنید

در حال حاضر، آموزش مدل به روش معمول با تماس با Model.fit به مدل و عبور در مجموعه داده های ایجاد شده در آغاز آموزش. این مرحله یکسان است آیا شما آموزش را توزیع می کنید یا خیر.

EPOCHS = 12

model.fit(train_dataset, epochs=EPOCHS, callbacks=callbacks)
2021-08-04 01:25:02.342811: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:461] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2021-08-04 01:25:02.389307: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-08-04 01:25:02.389734: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2000179999 Hz
Epoch 1/12
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2021-08-04 01:25:05.851687: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-04 01:25:07.965516: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100
2021-08-04 01:25:13.166255: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-08-04 01:25:13.566160: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
1/938 [..............................] - ETA: 3:09:47 - loss: 2.2850 - accuracy: 0.1094
2021-08-04 01:25:14.615346: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session initializing.
2021-08-04 01:25:14.615388: I tensorflow/core/profiler/lib/profiler_session.cc:141] Profiler session started.
3/938 [..............................] - ETA: 4:21 - loss: 2.1694 - accuracy: 0.3333WARNING:tensorflow:Callback method `on_train_batch_begin` is slow compared to the batch time (batch time: 0.0045s vs `on_train_batch_begin` time: 0.0762s). Check your callbacks.
2021-08-04 01:25:15.082713: I tensorflow/core/profiler/lib/profiler_session.cc:66] Profiler session collecting data.
2021-08-04 01:25:15.085886: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1743] CUPTI activity buffer flushed
2021-08-04 01:25:15.122453: I tensorflow/core/profiler/internal/gpu/cupti_collector.cc:673]  GpuTracer has collected 96 callback api events and 93 activity events. 
2021-08-04 01:25:15.126946: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session tear down.
2021-08-04 01:25:15.138108: I tensorflow/core/profiler/rpc/client/save_profile.cc:137] Creating directory: ./logs/train/plugins/profile/2021_08_04_01_25_15
2021-08-04 01:25:15.146767: I tensorflow/core/profiler/rpc/client/save_profile.cc:143] Dumped gzipped tool data for trace.json.gz to ./logs/train/plugins/profile/2021_08_04_01_25_15/kokoro-gcp-ubuntu-prod-1251741625.trace.json.gz
2021-08-04 01:25:15.154434: I tensorflow/core/profiler/rpc/client/save_profile.cc:137] Creating directory: ./logs/train/plugins/profile/2021_08_04_01_25_15
2021-08-04 01:25:15.155169: I tensorflow/core/profiler/rpc/client/save_profile.cc:143] Dumped gzipped tool data for memory_profile.json.gz to ./logs/train/plugins/profile/2021_08_04_01_25_15/kokoro-gcp-ubuntu-prod-1251741625.memory_profile.json.gz
2021-08-04 01:25:15.155597: I tensorflow/core/profiler/rpc/client/capture_profile.cc:251] Creating directory: ./logs/train/plugins/profile/2021_08_04_01_25_15Dumped tool data for xplane.pb to ./logs/train/plugins/profile/2021_08_04_01_25_15/kokoro-gcp-ubuntu-prod-1251741625.xplane.pb
Dumped tool data for overview_page.pb to ./logs/train/plugins/profile/2021_08_04_01_25_15/kokoro-gcp-ubuntu-prod-1251741625.overview_page.pb
Dumped tool data for input_pipeline.pb to ./logs/train/plugins/profile/2021_08_04_01_25_15/kokoro-gcp-ubuntu-prod-1251741625.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to ./logs/train/plugins/profile/2021_08_04_01_25_15/kokoro-gcp-ubuntu-prod-1251741625.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to ./logs/train/plugins/profile/2021_08_04_01_25_15/kokoro-gcp-ubuntu-prod-1251741625.kernel_stats.pb

WARNING:tensorflow:Callback method `on_train_batch_begin` is slow compared to the batch time (batch time: 0.0045s vs `on_train_batch_begin` time: 0.0762s). Check your callbacks.
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0045s vs `on_train_batch_end` time: 0.0155s). Check your callbacks.
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0045s vs `on_train_batch_end` time: 0.0155s). Check your callbacks.
938/938 [==============================] - 16s 4ms/step - loss: 0.1997 - accuracy: 0.9421

Learning rate for epoch 1 is 0.0010000000474974513
Epoch 2/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0656 - accuracy: 0.9805

Learning rate for epoch 2 is 0.0010000000474974513
Epoch 3/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0461 - accuracy: 0.9857

Learning rate for epoch 3 is 0.0010000000474974513
Epoch 4/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0244 - accuracy: 0.9935

Learning rate for epoch 4 is 9.999999747378752e-05
Epoch 5/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0217 - accuracy: 0.9943

Learning rate for epoch 5 is 9.999999747378752e-05
Epoch 6/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0199 - accuracy: 0.9948

Learning rate for epoch 6 is 9.999999747378752e-05
Epoch 7/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0182 - accuracy: 0.9955

Learning rate for epoch 7 is 9.999999747378752e-05
Epoch 8/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0156 - accuracy: 0.9963

Learning rate for epoch 8 is 9.999999747378752e-06
Epoch 9/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0154 - accuracy: 0.9964

Learning rate for epoch 9 is 9.999999747378752e-06
Epoch 10/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0152 - accuracy: 0.9965

Learning rate for epoch 10 is 9.999999747378752e-06
Epoch 11/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0150 - accuracy: 0.9966

Learning rate for epoch 11 is 9.999999747378752e-06
Epoch 12/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0149 - accuracy: 0.9967

Learning rate for epoch 12 is 9.999999747378752e-06
<tensorflow.python.keras.callbacks.History at 0x7f4e5c176dd0>

بررسی نقاط بازرسی ذخیره شده:

# Check the checkpoint directory.
ls {checkpoint_dir}
checkpoint           ckpt_4.data-00000-of-00001
ckpt_1.data-00000-of-00001   ckpt_4.index
ckpt_1.index             ckpt_5.data-00000-of-00001
ckpt_10.data-00000-of-00001  ckpt_5.index
ckpt_10.index            ckpt_6.data-00000-of-00001
ckpt_11.data-00000-of-00001  ckpt_6.index
ckpt_11.index            ckpt_7.data-00000-of-00001
ckpt_12.data-00000-of-00001  ckpt_7.index
ckpt_12.index            ckpt_8.data-00000-of-00001
ckpt_2.data-00000-of-00001   ckpt_8.index
ckpt_2.index             ckpt_9.data-00000-of-00001
ckpt_3.data-00000-of-00001   ckpt_9.index
ckpt_3.index

برای بررسی چگونگی به خوبی انجام مدل، بار شدن پاسگاه ایست و بازرسی و پاسخ Model.evaluate در داده ها از آزمون:

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

eval_loss, eval_acc = model.evaluate(eval_dataset)

print('Eval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))
2021-08-04 01:25:49.277864: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:461] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
157/157 [==============================] - 2s 4ms/step - loss: 0.0371 - accuracy: 0.9875
Eval loss: 0.03712465986609459, Eval accuracy: 0.987500011920929

برای تجسم خروجی ، TensorBoard را راه اندازی کرده و گزارش ها را مشاهده کنید:

%tensorboard --logdir=logs

ls -sh ./logs
total 4.0K
4.0K train

صادرات به SavedModel

صادرات نمودار و متغیرهای به فرمت SavedModel پلت فرم اگنوستیک با استفاده از Model.save . پس از مدل شما ذخیره شده است، شما می توانید آن را با یا بدون بار Strategy.scope .

path = 'saved_model/'
model.save(path, save_format='tf')
2021-08-04 01:25:51.983973: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
INFO:tensorflow:Assets written to: saved_model/assets
INFO:tensorflow:Assets written to: saved_model/assets

حالا، بار مدل بدون Strategy.scope :

unreplicated_model = tf.keras.models.load_model(path)

unreplicated_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy'])

eval_loss, eval_acc = unreplicated_model.evaluate(eval_dataset)

print('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))
157/157 [==============================] - 0s 2ms/step - loss: 0.0371 - accuracy: 0.9875
Eval loss: 0.03712465986609459, Eval Accuracy: 0.987500011920929

بار این مدل را با Strategy.scope :

with strategy.scope():
  replicated_model = tf.keras.models.load_model(path)
  replicated_model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                           optimizer=tf.keras.optimizers.Adam(),
                           metrics=['accuracy'])

  eval_loss, eval_acc = replicated_model.evaluate(eval_dataset)
  print ('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))
2021-08-04 01:25:53.544239: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:461] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
157/157 [==============================] - 2s 2ms/step - loss: 0.0371 - accuracy: 0.9875
Eval loss: 0.03712465986609459, Eval Accuracy: 0.987500011920929

منابع اضافی

مثال های بیشتر که با استفاده از استراتژی توزیع های مختلف با Keras Model.fit API:

  1. حل مشکلات GLUE با استفاده از برت در TPU آموزش استفاده از tf.distribute.MirroredStrategy برای آموزش در GPU ها و tf.distribute.TPUStrategy -ان TPU ها.
  2. ذخیره و بارگذاری یک مدل با استفاده از یک استراتژی توزیع demonstates آموزش نحوه استفاده از API های SavedModel با tf.distribute.Strategy .
  3. مدل رسمی TensorFlow می توان به پیکربندی اجرا استراتژی توزیع های متعدد.

برای کسب اطلاعات بیشتر در مورد استراتژی های توزیع TensorFlow:

  1. آموزش سفارشی با tf.distribute.Strategy آموزش نشان می دهد که چگونه به استفاده از tf.distribute.MirroredStrategy برای آموزش تک کارگر با یک حلقه آموزش سفارشی.
  2. آموزش چند کارگر با Keras آموزش نشان می دهد که چگونه به استفاده از MultiWorkerMirroredStrategy با Model.fit .
  3. حلقه آموزش سفارشی با Keras و MultiWorkerMirroredStrategy آموزش نشان می دهد که چگونه به استفاده از MultiWorkerMirroredStrategy با Keras و یک حلقه آموزش سفارشی.
  4. آموزش توزیع شده در TensorFlow راهنمای یک مرور کلی از استراتژی توزیع موجود فراهم می کند.
  5. عملکرد بهتر با tf.function راهنمای اطلاعات در مورد استراتژی های دیگر و ابزار، مانند فراهم می کند TensorFlow نیمرخ شما می توانید به بهینه سازی عملکرد مدل TensorFlow خود استفاده کنید.