Distributed training with Keras

Overview

The tf.distribute.Strategy API provides an abstraction for distributing your training across multiple processing units. It lets you carry out distributed training using existing models and training code with minimal changes.

This tutorial demonstrates how to use tf.distribute.MirroredStrategy to perform in-graph replication with synchronous training on multiple GPUs on one machine. The strategy essentially copies all of the model's variables to each processor. It then uses all-reduce to combine the gradients from all processors and applies the combined value to every copy of the model.
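The synchronous update described above can be illustrated with a toy simulation. This is a minimal sketch in plain Python, not TensorFlow code: the helper all_reduce_sum, the two-replica setup, the gradient values, and the use of a sum as the combining operation are all assumptions made for illustration.

```python
# Toy simulation of synchronous data-parallel training with all-reduce.
# Plain Python lists stand in for tensors; this is not TensorFlow code.

def all_reduce_sum(per_replica_values):
    """Return the element-wise sum, delivered identically to every replica."""
    total = [sum(column) for column in zip(*per_replica_values)]
    return [list(total) for _ in per_replica_values]

# Two replicas start from identical copies of the same variables.
replica_weights = [[1.0, 2.0], [1.0, 2.0]]

# Each replica computes gradients on its own slice of the global batch
# (hypothetical numbers).
replica_grads = [[0.5, -1.0], [1.5, 3.0]]

learning_rate = 0.1
combined = all_reduce_sum(replica_grads)

# Every replica applies the same combined gradient, so the copies stay in sync.
for weights, grads in zip(replica_weights, combined):
    for i, g in enumerate(grads):
        weights[i] -= learning_rate * g

print(replica_weights)
```

Because every replica receives the same combined gradient and starts from the same values, the copies of the variables remain identical after each step.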

You will use the tf.keras APIs to build the model and Model.fit to train it. (To learn about distributed training with a custom training loop and MirroredStrategy, see the Custom training with tf.distribute.Strategy tutorial.)

MirroredStrategy trains your model on multiple GPUs on a single machine. For synchronous training on many GPUs across multiple workers, use tf.distribute.MultiWorkerMirroredStrategy with Keras Model.fit or a custom training loop. For other options, refer to the Distributed training guide.

To learn about the other available strategies, see the Distributed training with TensorFlow guide.

Setup

import tensorflow_datasets as tfds
import tensorflow as tf

import os

# Load the TensorBoard notebook extension.
%load_ext tensorboard
2021-08-04 01:24:55.165631: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
print(tf.__version__)
2.5.0

Download the dataset

Load the MNIST dataset from TensorFlow Datasets. This returns a dataset in the tf.data format.

Setting the with_info argument to True includes the metadata for the entire dataset, which is saved here to info. Among other things, this metadata object includes the number of train and test examples.

datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)

mnist_train, mnist_test = datasets['train'], datasets['test']
2021-08-04 01:25:00.048530: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-08-04 01:25:00.691099: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:00.691993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:00:05.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-08-04 01:25:00.692033: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-04 01:25:00.695439: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-08-04 01:25:00.695536: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-08-04 01:25:00.696685: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-08-04 01:25:00.697009: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-08-04 01:25:00.698067: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-08-04 01:25:00.698998: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-08-04 01:25:00.699164: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-04 01:25:00.699264: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:00.700264: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:00.701157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-08-04 01:25:00.701928: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-04 01:25:00.702642: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:00.703535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:00:05.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-08-04 01:25:00.703621: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:00.704507: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:00.705349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-08-04 01:25:00.705388: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-04 01:25:01.356483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-04 01:25:01.356521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]   0 
2021-08-04 01:25:01.356530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:  N 
2021-08-04 01:25:01.356777: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:01.357792: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:01.358756: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 01:25:01.359641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14646 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0)

Define the distribution strategy

Create a MirroredStrategy object. This will handle distribution and provide a context manager (MirroredStrategy.scope) to build your model inside.

strategy = tf.distribute.MirroredStrategy()
WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
Number of devices: 1

Set up the input pipeline

When training a model with multiple GPUs, you can use the extra computing power effectively by increasing the batch size. In general, use the largest batch size that fits the GPU memory and tune the learning rate accordingly.

# You can also do info.splits.total_num_examples to get the total
# number of examples in the dataset.

num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples

BUFFER_SIZE = 10000

BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
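To see how the global batch size changes with the number of replicas, the arithmetic can be sketched in plain Python. The helper plan is hypothetical, and scaling the learning rate linearly with the number of replicas is a common heuristic assumed here for illustration, not something this tutorial prescribes.

```python
import math

# Values taken from the tutorial; MNIST has 60,000 training examples.
NUM_TRAIN_EXAMPLES = 60000
BATCH_SIZE_PER_REPLICA = 64
BASE_LEARNING_RATE = 1e-3  # initial learning rate used later in the tutorial

def plan(num_replicas):
    """Return (global batch size, steps per epoch, linearly scaled LR)."""
    global_batch = BATCH_SIZE_PER_REPLICA * num_replicas
    steps = math.ceil(NUM_TRAIN_EXAMPLES / global_batch)
    # Linear scaling is a heuristic assumed here, not a rule from the tutorial.
    scaled_lr = BASE_LEARNING_RATE * num_replicas
    return global_batch, steps, scaled_lr

for replicas in (1, 2, 4, 8):
    print(replicas, plan(replicas))
```

With a single replica this gives 938 steps per epoch, which matches the progress bar in the training output of this run.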

Define a function that normalizes the image pixel values from the [0, 255] range to the [0, 1] range (feature scaling):

def scale(image, label):
 image = tf.cast(image, tf.float32)
 image /= 255

 return image, label

Apply this scale function to the training and test data, and then use the tf.data.Dataset APIs to shuffle the training data (Dataset.shuffle) and batch it (Dataset.batch). Notice that you are also keeping an in-memory cache of the training data to improve performance (Dataset.cache).

train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)
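BUFFER_SIZE is smaller than the training set, so it helps to picture what a fixed-size shuffle buffer does before batching. The following is a rough pure-Python simulation of that idea, not the tf.data implementation; the helpers buffered_shuffle and batched are hypothetical names and assume buffer_size is at least 1.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in the order a fixed-size shuffle buffer would emit them."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

def batched(items, batch_size):
    """Group items into lists of batch_size; the last one may be smaller."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

batches = list(batched(buffered_shuffle(range(10), buffer_size=4), batch_size=4))
print(batches)
```

In this simulation the first element emitted always comes from the first buffer_size inputs, which is why a buffer smaller than the dataset gives only an approximate shuffle. A buffer at least as large as the dataset is needed for a uniform one.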

Create the model

Create and compile the Keras model in the context of Strategy.scope:

with strategy.scope():
 model = tf.keras.Sequential([
   tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
   tf.keras.layers.MaxPooling2D(),
   tf.keras.layers.Flatten(),
   tf.keras.layers.Dense(64, activation='relu'),
   tf.keras.layers.Dense(10)
 ])

 model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.Adam(),
        metrics=['accuracy'])
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
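Since MirroredStrategy keeps a copy of every variable on each device, it can be useful to know how many parameters the model above has. The count can be worked out by hand; this sketch assumes the Keras defaults for the layers shown ('valid' padding for Conv2D and a 2x2 pool for MaxPooling2D), and the dictionary keys are just labels chosen for illustration.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights plus one bias per unit for a Dense layer."""
    return inputs * units + units

# Conv2D(32, 3) with 'valid' padding on a 28x28x1 input.
conv_out = 28 - 3 + 1            # spatial size after the convolution
pooled = conv_out // 2           # spatial size after 2x2 max pooling
flat = pooled * pooled * 32      # number of features after Flatten

layer_params = {
    'conv2d': conv2d_params(3, 1, 32),
    'dense': dense_params(flat, 64),
    'dense_1': dense_params(64, 10),
}
total = sum(layer_params.values())
print(layer_params, total)
```

MaxPooling2D and Flatten have no trainable parameters, so only the three layers listed contribute to the total.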

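Define the callbacks

Define the following tf.keras.callbacks: TensorBoard, which writes logs for TensorBoard; ModelCheckpoint, which saves the model weights after each epoch; and LearningRateScheduler, which sets the learning rate for each epoch using the decay function.

For illustrative purposes, add a custom callback called PrintLR to display the learning rate in the notebook.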

# Define the checkpoint directory to store the checkpoints.
checkpoint_dir = './training_checkpoints'
# Define the name of the checkpoint files.
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
# Define a function for decaying the learning rate.
# You can define any decay function you need.
def decay(epoch):
 if epoch < 3:
  return 1e-3
 elif epoch >= 3 and epoch < 7:
  return 1e-4
 else:
  return 1e-5
# Define a callback for printing the learning rate at the end of each epoch.
# Use self.model (set by Keras) instead of relying on a global `model` variable.
class PrintLR(tf.keras.callbacks.Callback):
 def on_epoch_end(self, epoch, logs=None):
  print('\nLearning rate for epoch {} is {}'.format(epoch + 1,
                           self.model.optimizer.lr.numpy()))
# Put all the callbacks together.
callbacks = [
  tf.keras.callbacks.TensorBoard(log_dir='./logs'),
  tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
                    save_weights_only=True),
  tf.keras.callbacks.LearningRateScheduler(decay),
  PrintLR()
]
2021-08-04 01:25:02.054144: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session initializing.
2021-08-04 01:25:02.054179: I tensorflow/core/profiler/lib/profiler_session.cc:141] Profiler session started.
2021-08-04 01:25:02.054232: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1611] Profiler found 1 GPUs
2021-08-04 01:25:02.098001: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcupti.so.11.2
2021-08-04 01:25:02.288095: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session tear down.
2021-08-04 01:25:02.292220: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1743] CUPTI activity buffer flushed
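The decay function above can be checked on its own, without training. The sketch below repeats the function (with the middle condition written as a chained comparison, which behaves identically) and lists the learning rate for each of the 12 epochs. LearningRateScheduler passes a zero-based epoch index, so index 3 corresponds to the epoch that Keras prints as "Epoch 4".

```python
def decay(epoch):
    """Piecewise-constant schedule, same thresholds as in the tutorial."""
    if epoch < 3:
        return 1e-3
    elif 3 <= epoch < 7:
        return 1e-4
    else:
        return 1e-5

EPOCHS = 12
schedule = [decay(epoch) for epoch in range(EPOCHS)]

# Print with one-based epoch numbers to match the Keras progress output.
for epoch, lr in enumerate(schedule):
    print('epoch {:2d}: {}'.format(epoch + 1, lr))
```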

Train and evaluate

Now, train the model in the usual way by calling Model.fit on the model and passing in the dataset created at the beginning of the tutorial. This step is the same whether you are distributing the training or not.

EPOCHS = 12

model.fit(train_dataset, epochs=EPOCHS, callbacks=callbacks)
2021-08-04 01:25:02.342811: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:461] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2021-08-04 01:25:02.389307: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-08-04 01:25:02.389734: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2000179999 Hz
Epoch 1/12
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2021-08-04 01:25:05.851687: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-04 01:25:07.965516: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100
2021-08-04 01:25:13.166255: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-08-04 01:25:13.566160: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
1/938 [..............................] - ETA: 3:09:47 - loss: 2.2850 - accuracy: 0.1094
2021-08-04 01:25:14.615346: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session initializing.
2021-08-04 01:25:14.615388: I tensorflow/core/profiler/lib/profiler_session.cc:141] Profiler session started.
3/938 [..............................] - ETA: 4:21 - loss: 2.1694 - accuracy: 0.3333
WARNING:tensorflow:Callback method `on_train_batch_begin` is slow compared to the batch time (batch time: 0.0045s vs `on_train_batch_begin` time: 0.0762s). Check your callbacks.
2021-08-04 01:25:15.082713: I tensorflow/core/profiler/lib/profiler_session.cc:66] Profiler session collecting data.
2021-08-04 01:25:15.085886: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1743] CUPTI activity buffer flushed
2021-08-04 01:25:15.122453: I tensorflow/core/profiler/internal/gpu/cupti_collector.cc:673] GpuTracer has collected 96 callback api events and 93 activity events. 
2021-08-04 01:25:15.126946: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session tear down.
2021-08-04 01:25:15.138108: I tensorflow/core/profiler/rpc/client/save_profile.cc:137] Creating directory: ./logs/train/plugins/profile/2021_08_04_01_25_15
2021-08-04 01:25:15.146767: I tensorflow/core/profiler/rpc/client/save_profile.cc:143] Dumped gzipped tool data for trace.json.gz to ./logs/train/plugins/profile/2021_08_04_01_25_15/kokoro-gcp-ubuntu-prod-1251741625.trace.json.gz
2021-08-04 01:25:15.154434: I tensorflow/core/profiler/rpc/client/save_profile.cc:137] Creating directory: ./logs/train/plugins/profile/2021_08_04_01_25_15
2021-08-04 01:25:15.155169: I tensorflow/core/profiler/rpc/client/save_profile.cc:143] Dumped gzipped tool data for memory_profile.json.gz to ./logs/train/plugins/profile/2021_08_04_01_25_15/kokoro-gcp-ubuntu-prod-1251741625.memory_profile.json.gz
2021-08-04 01:25:15.155597: I tensorflow/core/profiler/rpc/client/capture_profile.cc:251] Creating directory: ./logs/train/plugins/profile/2021_08_04_01_25_15Dumped tool data for xplane.pb to ./logs/train/plugins/profile/2021_08_04_01_25_15/kokoro-gcp-ubuntu-prod-1251741625.xplane.pb
Dumped tool data for overview_page.pb to ./logs/train/plugins/profile/2021_08_04_01_25_15/kokoro-gcp-ubuntu-prod-1251741625.overview_page.pb
Dumped tool data for input_pipeline.pb to ./logs/train/plugins/profile/2021_08_04_01_25_15/kokoro-gcp-ubuntu-prod-1251741625.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to ./logs/train/plugins/profile/2021_08_04_01_25_15/kokoro-gcp-ubuntu-prod-1251741625.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to ./logs/train/plugins/profile/2021_08_04_01_25_15/kokoro-gcp-ubuntu-prod-1251741625.kernel_stats.pb

WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0045s vs `on_train_batch_end` time: 0.0155s). Check your callbacks.
938/938 [==============================] - 16s 4ms/step - loss: 0.1997 - accuracy: 0.9421

Learning rate for epoch 1 is 0.0010000000474974513
Epoch 2/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0656 - accuracy: 0.9805

Learning rate for epoch 2 is 0.0010000000474974513
Epoch 3/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0461 - accuracy: 0.9857

Learning rate for epoch 3 is 0.0010000000474974513
Epoch 4/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0244 - accuracy: 0.9935

Learning rate for epoch 4 is 9.999999747378752e-05
Epoch 5/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0217 - accuracy: 0.9943

Learning rate for epoch 5 is 9.999999747378752e-05
Epoch 6/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0199 - accuracy: 0.9948

Learning rate for epoch 6 is 9.999999747378752e-05
Epoch 7/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0182 - accuracy: 0.9955

Learning rate for epoch 7 is 9.999999747378752e-05
Epoch 8/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0156 - accuracy: 0.9963

Learning rate for epoch 8 is 9.999999747378752e-06
Epoch 9/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0154 - accuracy: 0.9964

Learning rate for epoch 9 is 9.999999747378752e-06
Epoch 10/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0152 - accuracy: 0.9965

Learning rate for epoch 10 is 9.999999747378752e-06
Epoch 11/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0150 - accuracy: 0.9966

Learning rate for epoch 11 is 9.999999747378752e-06
Epoch 12/12
938/938 [==============================] - 3s 3ms/step - loss: 0.0149 - accuracy: 0.9967

Learning rate for epoch 12 is 9.999999747378752e-06
<tensorflow.python.keras.callbacks.History at 0x7f4e5c176dd0>

Check for saved checkpoints:

# Check the checkpoint directory.
ls {checkpoint_dir}
checkpoint      ckpt_4.data-00000-of-00001
ckpt_1.data-00000-of-00001  ckpt_4.index
ckpt_1.index       ckpt_5.data-00000-of-00001
ckpt_10.data-00000-of-00001 ckpt_5.index
ckpt_10.index      ckpt_6.data-00000-of-00001
ckpt_11.data-00000-of-00001 ckpt_6.index
ckpt_11.index      ckpt_7.data-00000-of-00001
ckpt_12.data-00000-of-00001 ckpt_7.index
ckpt_12.index      ckpt_8.data-00000-of-00001
ckpt_2.data-00000-of-00001  ckpt_8.index
ckpt_2.index       ckpt_9.data-00000-of-00001
ckpt_3.data-00000-of-00001  ckpt_9.index
ckpt_3.index
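The file names in the listing follow from the checkpoint_prefix template defined earlier. As a small sketch using only the standard library, the template can be expanded by hand; Keras fills {epoch} with the one-based epoch number when it saves.

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Expand the template for the 12 epochs of this run.
names = [os.path.basename(checkpoint_prefix.format(epoch=e))
         for e in range(1, 13)]
print(names)

# A directory listing sorts names as text, so ckpt_10 comes before ckpt_2.
print(sorted(names))
```

This text ordering explains why ckpt_10 appears before ckpt_2 in the listing. tf.train.latest_checkpoint does not depend on that ordering, because it reads the checkpoint file in the same directory to find the most recent save.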

To check how well the model performs, load the latest checkpoint and call Model.evaluate on the test data:

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

eval_loss, eval_acc = model.evaluate(eval_dataset)

print('Eval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))
2021-08-04 01:25:49.277864: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:461] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
157/157 [==============================] - 2s 4ms/step - loss: 0.0371 - accuracy: 0.9875
Eval loss: 0.03712465986609459, Eval accuracy: 0.987500011920929
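The final Dense(10) layer has no activation, so the model outputs raw logits, which is why the loss is created with from_logits=True. As a rough illustration of the per-example quantity that the reported loss averages, here is a pure-Python sketch of sparse categorical cross-entropy computed from logits. The function name and the logit values are hypothetical, and this is not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy_from_logits(logits, label):
    """Cross-entropy for one example given raw logits and an integer label."""
    # Subtract the max logit for numerical stability (log-sum-exp trick).
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[label]

# Hypothetical logits for a 10-class problem; class 3 has the largest score.
logits = [0.0] * 10
logits[3] = 5.0

loss_correct = sparse_categorical_crossentropy_from_logits(logits, 3)
loss_wrong = sparse_categorical_crossentropy_from_logits(logits, 7)

# The predicted class is the index of the largest logit.
predicted = max(range(len(logits)), key=logits.__getitem__)
print(predicted, loss_correct, loss_wrong)
```

The loss is small when the true label has the largest logit and grows by the logit gap when it does not.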

To visualize the output, launch TensorBoard and view the logs:

%tensorboard --logdir=logs

ls -sh ./logs
total 4.0K
4.0K train

Export to SavedModel

Export the graph and the variables to the platform-agnostic SavedModel format using Model.save. After your model is saved, you can load it with or without Strategy.scope.

path = 'saved_model/'
model.save(path, save_format='tf')
2021-08-04 01:25:51.983973: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
INFO:tensorflow:Assets written to: saved_model/assets

Now, load the model without Strategy.scope:

unreplicated_model = tf.keras.models.load_model(path)

unreplicated_model.compile(
  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
  optimizer=tf.keras.optimizers.Adam(),
  metrics=['accuracy'])

eval_loss, eval_acc = unreplicated_model.evaluate(eval_dataset)

print('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))
157/157 [==============================] - 0s 2ms/step - loss: 0.0371 - accuracy: 0.9875
Eval loss: 0.03712465986609459, Eval Accuracy: 0.987500011920929

Load the model with Strategy.scope:

with strategy.scope():
 replicated_model = tf.keras.models.load_model(path)
 replicated_model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

 eval_loss, eval_acc = replicated_model.evaluate(eval_dataset)
 print('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))
2021-08-04 01:25:53.544239: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:461] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
157/157 [==============================] - 2s 2ms/step - loss: 0.0371 - accuracy: 0.9875
Eval loss: 0.03712465986609459, Eval Accuracy: 0.987500011920929

Additional resources

More examples that use different distribution strategies with the Keras Model.fit API:

 1. The Solve GLUE tasks using BERT on TPU tutorial uses tf.distribute.MirroredStrategy for training on GPUs and tf.distribute.TPUStrategy on TPUs.
 2. The Save and load a model using a distribution strategy tutorial demonstrates how to use the SavedModel APIs with tf.distribute.Strategy.
 3. The official TensorFlow models can be configured to run multiple distribution strategies.

To learn more about TensorFlow distribution strategies:

 1. The Custom training with tf.distribute.Strategy tutorial shows how to use tf.distribute.MirroredStrategy for single-worker training with a custom training loop.
 2. The Multi-worker training with Keras tutorial shows how to use MultiWorkerMirroredStrategy with Model.fit.
 3. The Custom training loop with Keras and MultiWorkerMirroredStrategy tutorial shows how to use MultiWorkerMirroredStrategy with Keras and a custom training loop.
 4. The Distributed training in TensorFlow guide provides an overview of the available distribution strategies.
 5. The Better performance with tf.function guide provides information about other strategies and tools, such as the TensorFlow Profiler, which you can use to optimize the performance of your TensorFlow models.