Đào tạo phân tán NasNet với tensorflow_cloud và Google Cloud

Xem trên TensorFlow.org

Chạy trong Google Colab

Xem trên GitHub

Tải xuống sổ ghi chép

Chạy ở Kaggle

Ví dụ này dựa trên phân loại Hình ảnh thông qua việc tinh chỉnh bằng EffientNet để trình bày cách đào tạo mô hình NasNetMobile bằng cách sử dụng tenorflow_cloud và Google Cloud Platform trên quy mô lớn bằng cách sử dụng đào tạo phân tán.

Nhập các mô-đun cần thiết

import tensorflow as tf
tf.version.VERSION

'2.6.0'

! pip install -q tensorflow-cloud

import tensorflow_cloud as tfc
tfc.__version__

import sys

Cấu hình dự án

Thiết lập các thông số dự án. Để biết các thông số cụ thể của Google Cloud, hãy tham khảo Hướng dẫn thiết lập dự án Google Cloud .

# Set Google Cloud Specific parameters

# TODO: Please set GCP_PROJECT_ID to your own Google Cloud project ID.
GCP_PROJECT_ID = 'YOUR_PROJECT_ID'

# TODO: set GCS_BUCKET to your own Google Cloud Storage (GCS) bucket.
GCS_BUCKET = 'YOUR_GCS_BUCKET_NAME'

# DO NOT CHANGE: Currently only the 'us-central1' region is supported.
REGION = 'us-central1'

# OPTIONAL: You can change the job name to any string.
JOB_NAME = 'nasnet'

# Setting location were training logs and checkpoints will be stored
GCS_BASE_PATH = f'gs://{GCS_BUCKET}/{JOB_NAME}'
TENSORBOARD_LOGS_DIR = os.path.join(GCS_BASE_PATH,"logs")
MODEL_CHECKPOINT_DIR = os.path.join(GCS_BASE_PATH,"checkpoints")
SAVED_MODEL_DIR = os.path.join(GCS_BASE_PATH,"saved_model")

Xác thực sổ ghi chép để sử dụng Dự án Google Cloud của bạn

Đối với Kaggle Notebook, hãy nhấp vào "Tiện ích bổ sung" -> "Google Cloud SDK" trước khi chạy ô bên dưới.

# Using tfc.remote() to ensure this code only runs in notebook
if not tfc.remote():

    # Authentication for Kaggle Notebooks
    if "kaggle_secrets" in sys.modules:
        from kaggle_secrets import UserSecretsClient
        UserSecretsClient().set_gcloud_credentials(project=GCP_PROJECT_ID)

    # Authentication for Colab Notebooks
    if "google.colab" in sys.modules:
        from google.colab import auth
        auth.authenticate_user()
        os.environ["GOOGLE_CLOUD_PROJECT"] = GCP_PROJECT_ID

Tải và chuẩn bị dữ liệu

Đọc dữ liệu thô và phân chia để huấn luyện và kiểm tra các tập dữ liệu.

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

# Setting input specific parameters
# The model expects input of dimension (INPUT_IMG_SIZE, INPUT_IMG_SIZE, 3)
INPUT_IMG_SIZE = 32
NUM_CLASSES = 10

Thêm API lớp tiền xử lý để tăng cường hình ảnh.

from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras.models import Sequential


img_augmentation = Sequential(
    [
        # Resizing input to better match ImageNet size
        preprocessing.Resizing(256, 256),
        preprocessing.RandomRotation(factor=0.15),
        preprocessing.RandomFlip(),
        preprocessing.RandomContrast(factor=0.1),
    ],
    name="img_augmentation",
)

Tải mô hình và chuẩn bị đào tạo

Chúng tôi sẽ tải một mô hình được huấn luyện trước NASNetMobile (có trọng số) và giải phóng một số lớp để tinh chỉnh mô hình cho phù hợp hơn với tập dữ liệu.

from tensorflow.keras import layers

def build_model(num_classes, input_image_size):
    inputs = layers.Input(shape=(input_image_size, input_image_size, 3))
    x = img_augmentation(inputs)

    model = tf.keras.applications.NASNetMobile(
        input_shape=None,
        include_top=False,
        weights="imagenet",
        input_tensor=x,
        pooling=None,
        classes=num_classes,
    )

    # Freeze the pretrained weights
    model.trainable = False

    # We unfreeze the top 20 layers while leaving BatchNorm layers frozen
    for layer in model.layers[-20:]:
        if not isinstance(layer, layers.BatchNormalization):
            layer.trainable = True

    # Rebuild top
    x = layers.GlobalAveragePooling2D(name="avg_pool")(model.output)
    x = layers.BatchNormalization()(x)

    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax", name="pred")(x)

    # Compile
    model = tf.keras.Model(inputs, outputs, name="NASNetMobile")
    optimizer = tf.keras.optimizers.Adam(learning_rate=3e-4)
    model.compile(
        optimizer=optimizer,
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"]
    )
    return model

model = build_model(NUM_CLASSES, INPUT_IMG_SIZE)

if tfc.remote():
    # Configure Tensorboard logs
    callbacks=[
        tf.keras.callbacks.TensorBoard(log_dir=TENSORBOARD_LOGS_DIR),
        tf.keras.callbacks.ModelCheckpoint(
            MODEL_CHECKPOINT_DIR,
            save_best_only=True),
        tf.keras.callbacks.EarlyStopping(
            monitor='loss',
            min_delta =0.001,
            patience=3)]

    model.fit(x=x_train, y=y_train, epochs=100,
              validation_split=0.2, callbacks=callbacks)

    model.save(SAVED_MODEL_DIR)

else:
    # Run the training for 1 epoch and a small subset of the data to validate setup
    model.fit(x=x_train[:100], y=y_train[:100], validation_split=0.2, epochs=1)

Bắt đầu đào tạo từ xa

Bước này sẽ chuẩn bị mã của bạn từ sổ ghi chép này để thực thi từ xa và bắt đầu đào tạo phân tán từ xa trên Google Cloud Platform để đào tạo mô hình. Sau khi công việc được gửi, bạn có thể chuyển sang bước tiếp theo để theo dõi tiến độ công việc thông qua Tensorboard.

# If you are using a custom image you can install modules via requirements
# txt file.
with open('requirements.txt','w') as f:
    f.write('tensorflow-cloud\n')

# Optional: Some recommended base images. If you provide none the system
# will choose one for you.
TF_GPU_IMAGE= "tensorflow/tensorflow:latest-gpu"
TF_CPU_IMAGE= "tensorflow/tensorflow:latest"

# Submit a distributed training job using GPUs.
tfc.run(
    distribution_strategy='auto',
    requirements_txt='requirements.txt',
    docker_config=tfc.DockerConfig(
        parent_image=TF_GPU_IMAGE,
        image_build_bucket=GCS_BUCKET
        ),
    chief_config=tfc.COMMON_MACHINE_CONFIGS['K80_1X'],
      worker_config=tfc.COMMON_MACHINE_CONFIGS['K80_1X'],
      worker_count=3,
    job_labels={'job': JOB_NAME}
)

Kết quả đào tạo

Kết nối lại phiên bản Colab của bạn

Hầu hết các công việc đào tạo từ xa đều diễn ra trong thời gian dài. Nếu bạn đang sử dụng Colab thì có thể sẽ hết thời gian chờ trước khi có kết quả đào tạo. Trong trường hợp đó, hãy chạy lại các phần sau để kết nối lại và định cấu hình phiên bản Colab của bạn nhằm truy cập vào kết quả đào tạo. Chạy các phần sau theo thứ tự:

Nhập các mô-đun cần thiết
Cấu hình dự án
Xác thực sổ ghi chép để sử dụng Dự án Google Cloud của bạn

Tải bảng kéo

Trong khi quá trình đào tạo đang diễn ra, bạn có thể sử dụng Tensorboard để xem kết quả. Lưu ý rằng kết quả sẽ chỉ hiển thị sau khi quá trình đào tạo của bạn bắt đầu. Có thể sẽ mất vài phút.

%load_ext tensorboard
%tensorboard --logdir $TENSORBOARD_LOGS_DIR

Tải mô hình được đào tạo của bạn

trained_model = tf.keras.models.load_model(SAVED_MODEL_DIR)
trained_model.summary()