Transformer model for language understanding

This tutorial trains a Transformer model to translate a Portuguese to English dataset. This is an advanced example that assumes knowledge of text generation and attention.

The core idea behind the Transformer model is self-attention: the ability to attend to different positions of the input sequence to compute a representation of that sequence. The Transformer creates stacks of self-attention layers and is explained below in the sections Scaled dot-product attention and Multi-head attention.

A Transformer model handles variable-sized input using stacks of self-attention layers instead of RNNs or CNNs. This general architecture has a number of advantages:

  • It makes no assumptions about the temporal/spatial relationships across the data. This is ideal for processing a set of objects (for example, StarCraft units).
  • Layer outputs can be calculated in parallel, instead of in series like an RNN.
  • Distant items can affect each other's output without passing through many RNN steps or convolution layers (see Scene Memory Transformer, for example).
  • It can learn long-range dependencies. This is a challenge in many sequence tasks.

The downsides of this architecture are:

  • For a time series, the output for a time step is calculated from the entire history instead of only the inputs and the current hidden state. This may be less efficient.
  • If the input does have a temporal/spatial relationship, like text, some positional encoding must be added or the model will effectively see a bag of words.

After training the model in this notebook, you will be able to input a Portuguese sentence and return the English translation.

Attention heatmap

Setup

pip install tensorflow_datasets
pip install -U tensorflow-text
import collections
import logging
import os
import pathlib
import re
import string
import sys
import time

import numpy as np
import matplotlib.pyplot as plt

import tensorflow_datasets as tfds
import tensorflow_text as text
import tensorflow as tf
logging.getLogger('tensorflow').setLevel(logging.ERROR)  # suppress warnings

Download the dataset

Use TensorFlow Datasets to load the Portuguese-English translation dataset from the TED Talks Open Translation Project.

This dataset contains approximately 50000 training examples, 1100 validation examples, and 2000 test examples.

examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

The tf.data.Dataset object returned by TensorFlow Datasets yields pairs of text examples:

for pt_examples, en_examples in train_examples.batch(3).take(1):
  for pt in pt_examples.numpy():
    print(pt.decode('utf-8'))

  print()

  for en in en_examples.numpy():
    print(en.decode('utf-8'))
e quando melhoramos a procura , tiramos a única vantagem da impressão , que é a serendipidade .
mas e se estes fatores fossem ativos ?
mas eles não tinham a curiosidade de me testar .

and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n't test for curiosity .

Text tokenization & detokenization

You can't train a model directly on text. The text needs to be converted to some numeric representation first. Typically, you convert the text into sequences of token IDs, which are used as indices into an embedding.
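
As a quick illustration (the vocabulary size and token IDs below are made up for this sketch, not taken from the tutorial's tokenizers), token IDs simply index rows of an embedding matrix:

# Hypothetical vocabulary size and token IDs, for illustration only.
embedding = tf.keras.layers.Embedding(input_dim=8000, output_dim=4)
vectors = embedding(tf.constant([[2, 71, 3]]))  # a batch with 3 token IDs
print(vectors.shape)  # (1, 3, 4): one 4-dimensional vector per token ID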

A popular implementation is demonstrated in the subword tokenizer tutorial, which builds subword tokenizers (text.BertTokenizer) optimized for this dataset and exports them in a saved_model.

Download, unzip, and import the saved_model:

model_name = "ted_hrlr_translate_pt_en_converter"
tf.keras.utils.get_file(
    f"{model_name}.zip",
    f"https://storage.googleapis.com/download.tensorflow.org/models/{model_name}.zip",
    cache_dir='.', cache_subdir='', extract=True
)
Downloading data from https://storage.googleapis.com/download.tensorflow.org/models/ted_hrlr_translate_pt_en_converter.zip
188416/184801 [==============================] - 0s 0us/step
'./ted_hrlr_translate_pt_en_converter.zip'
tokenizers = tf.saved_model.load(model_name)

The tf.saved_model contains two text tokenizers, one for English and one for Portuguese. Both have the same methods:

[item for item in dir(tokenizers.en) if not item.startswith('_')]
['detokenize',
 'get_reserved_tokens',
 'get_vocab_path',
 'get_vocab_size',
 'lookup',
 'tokenize',
 'tokenizer',
 'vocab']

The tokenize method converts a batch of strings to a padded batch of token IDs. This method splits punctuation, lowercases, and unicode-normalizes the input before tokenizing. That standardization is not visible here because the input data is already standardized.

for en in en_examples.numpy():
  print(en.decode('utf-8'))
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n't test for curiosity .
encoded = tokenizers.en.tokenize(en_examples)

for row in encoded.to_list():
  print(row)
[2, 72, 117, 79, 1259, 1491, 2362, 13, 79, 150, 184, 311, 71, 103, 2308, 74, 2679, 13, 148, 80, 55, 4840, 1434, 2423, 540, 15, 3]
[2, 87, 90, 107, 76, 129, 1852, 30, 3]
[2, 87, 83, 149, 50, 9, 56, 664, 85, 2512, 15, 3]

The detokenize method attempts to convert these token IDs back to human-readable text:

round_trip = tokenizers.en.detokenize(encoded)
for line in round_trip.numpy():
  print(line.decode('utf-8'))
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n ' t test for curiosity .

The lower-level lookup method converts token IDs to token text:

tokens = tokenizers.en.lookup(encoded)
tokens
<tf.RaggedTensor [[b'[START]', b'and', b'when', b'you', b'improve', b'search', b'##ability', b',', b'you', b'actually', b'take', b'away', b'the', b'one', b'advantage', b'of', b'print', b',', b'which', b'is', b's', b'##ere', b'##nd', b'##ip', b'##ity', b'.', b'[END]'], [b'[START]', b'but', b'what', b'if', b'it', b'were', b'active', b'?', b'[END]'], [b'[START]', b'but', b'they', b'did', b'n', b"'", b't', b'test', b'for', b'curiosity', b'.', b'[END]']]>

Here you can see the "subword" aspect of the tokenizers. The word "searchability" is decomposed into "search ##ability" and the word "serendipity" into "s ##ere ##nd ##ip ##ity".
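
If you want to check this yourself, a minimal probe (assuming the tokenizers loaded above) is to tokenize a single word and look up its subword pieces:

# Tokenize one word and map the resulting IDs back to subword tokens.
ids = tokenizers.en.tokenize(tf.constant(['searchability']))
print(tokenizers.en.lookup(ids))
# Expected pieces: [START], search, ##ability, [END]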

Set up the input pipeline

To build an input pipeline suitable for training, you'll apply some transformations to the dataset.

This function will be used to encode the batches of raw text:

def tokenize_pairs(pt, en):
    pt = tokenizers.pt.tokenize(pt)
    # Convert from ragged to dense, padding with zeros.
    pt = pt.to_tensor()

    en = tokenizers.en.tokenize(en)
    # Convert from ragged to dense, padding with zeros.
    en = en.to_tensor()
    return pt, en

Here's a simple input pipeline that processes, shuffles, and batches the data:

BUFFER_SIZE = 20000
BATCH_SIZE = 64
def make_batches(ds):
  return (
      ds
      .cache()
      .shuffle(BUFFER_SIZE)
      .batch(BATCH_SIZE)
      .map(tokenize_pairs, num_parallel_calls=tf.data.AUTOTUNE)
      .prefetch(tf.data.AUTOTUNE))


train_batches = make_batches(train_examples)
val_batches = make_batches(val_examples)

Positional encoding

Attention layers see their input as a set of vectors, with no sequential order. This model also doesn't contain any recurrent or convolutional layers. Because of this, a "positional encoding" is added to give the model some information about the relative position of the tokens in the sentence.

The positional encoding vector is added to the embedding vector. Embeddings represent a token in a d-dimensional space where tokens with similar meaning are closer to each other. But the embeddings do not encode the relative position of tokens in a sentence. So after adding the positional encoding, tokens will be closer to each other based on the similarity of their meaning and their position in the sentence, in the d-dimensional space.

The formula for calculating the positional encoding is as follows:

$$\Large{PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model} })} $$
$$\Large{PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model} })} $$
def get_angles(pos, i, d_model):
  angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
  return pos * angle_rates
def positional_encoding(position, d_model):
  angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)

  # apply sin to even indices in the array; 2i
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

  # apply cos to odd indices in the array; 2i+1
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

  pos_encoding = angle_rads[np.newaxis, ...]

  return tf.cast(pos_encoding, dtype=tf.float32)
n, d = 2048, 512
pos_encoding = positional_encoding(n, d)
print(pos_encoding.shape)
pos_encoding = pos_encoding[0]

# Juggle the dimensions for the plot
pos_encoding = tf.reshape(pos_encoding, (n, d//2, 2))
pos_encoding = tf.transpose(pos_encoding, (2, 1, 0))
pos_encoding = tf.reshape(pos_encoding, (d, n))

plt.pcolormesh(pos_encoding, cmap='RdBu')
plt.ylabel('Depth')
plt.xlabel('Position')
plt.colorbar()
plt.show()
(1, 2048, 512)

(positional encoding plot)

Masking

Mask all the pad tokens in the batch of sequences. This ensures that the model does not treat padding as input. The mask indicates where the pad value 0 is present: it outputs a 1 at those locations, and a 0 otherwise.

def create_padding_mask(seq):
  seq = tf.cast(tf.math.equal(seq, 0), tf.float32)

  # add extra dimensions to add the padding
  # to the attention logits.
  return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)
x = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
create_padding_mask(x)
<tf.Tensor: shape=(3, 1, 1, 5), dtype=float32, numpy=
array([[[[0., 0., 1., 1., 0.]]],


       [[[0., 0., 0., 1., 1.]]],


       [[[1., 1., 1., 0., 0.]]]], dtype=float32)>

The look-ahead mask is used to mask the future tokens in a sequence. In other words, the mask indicates which entries should not be used.

This means that to predict the third token, only the first and second tokens will be used. Similarly, to predict the fourth token, only the first, second and third tokens will be used, and so on.

def create_look_ahead_mask(size):
  mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
  return mask  # (seq_len, seq_len)
x = tf.random.uniform((1, 3))
temp = create_look_ahead_mask(x.shape[1])
temp
<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[0., 1., 1.],
       [0., 0., 1.],
       [0., 0., 0.]], dtype=float32)>

Scaled dot-product attention

scaled_dot_product_attention

The attention function used by the Transformer takes three inputs: Q (query), K (key), V (value). The equation used to calculate the attention weights is:

$$\Large{Attention(Q, K, V) = softmax_k\left(\frac{QK^T}{\sqrt{d_k} }\right) V} $$

The dot-product attention is scaled by a factor of the square root of the depth. This is done because for large values of depth, the dot product grows large in magnitude, pushing the softmax function into regions where it has small gradients, resulting in a very hard softmax.

For example, consider that Q and K have a mean of 0 and a variance of 1. Their matrix multiplication will have a mean of 0 and a variance of dk. So the square root of dk is used for scaling, giving a consistent variance regardless of the value of dk. If the variance is too low, the output may be too flat to optimize effectively. If the variance is too high, the softmax may saturate at initialization, making it difficult to learn.
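
Here is a quick numeric check of that scaling argument (a standalone sketch, not part of the model code):

# With q, k ~ N(0, 1), the raw dot products have variance close to `depth`;
# dividing by sqrt(depth) brings the variance back to roughly 1.
depth = 64
q = tf.random.normal((1000, depth))
k = tf.random.normal((1000, depth))
logits = tf.matmul(q, k, transpose_b=True)
print(tf.math.reduce_variance(logits).numpy())  # roughly depth
print(tf.math.reduce_variance(logits / tf.math.sqrt(float(depth))).numpy())  # roughly 1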

The mask is multiplied by -1e9 (close to negative infinity). This is done because the mask is summed with the scaled matrix multiplication of Q and K and is applied immediately before a softmax. The goal is to zero out these cells, and large negative inputs to softmax are near zero in the output.

def scaled_dot_product_attention(q, k, v, mask):
  """Calculate the attention weights.
  q, k, v must have matching leading dimensions.
  k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
  The mask has different shapes depending on its type(padding or look ahead)
  but it must be broadcastable for addition.

  Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable
          to (..., seq_len_q, seq_len_k). Defaults to None.

  Returns:
    output, attention_weights
  """

  matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

  # scale matmul_qk
  dk = tf.cast(tf.shape(k)[-1], tf.float32)
  scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

  # add the mask to the scaled tensor.
  if mask is not None:
    scaled_attention_logits += (mask * -1e9)

  # softmax is normalized on the last axis (seq_len_k) so that the scores
  # add up to 1.
  attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

  output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

  return output, attention_weights

As the softmax normalization is done along K, its values decide the amount of importance given to Q.

The output represents the multiplication of the attention weights and the V (value) vector. This ensures that the tokens you want to focus on are kept as-is and the irrelevant tokens are flushed out.

def print_out(q, k, v):
  temp_out, temp_attn = scaled_dot_product_attention(
      q, k, v, None)
  print('Attention weights are:')
  print(temp_attn)
  print('Output is:')
  print(temp_out)
np.set_printoptions(suppress=True)

temp_k = tf.constant([[10, 0, 0],
                      [0, 10, 0],
                      [0, 0, 10],
                      [0, 0, 10]], dtype=tf.float32)  # (4, 3)

temp_v = tf.constant([[1, 0],
                      [10, 0],
                      [100, 5],
                      [1000, 6]], dtype=tf.float32)  # (4, 2)

# This `query` aligns with the second `key`,
# so the second `value` is returned.
temp_q = tf.constant([[0, 10, 0]], dtype=tf.float32)  # (1, 3)
print_out(temp_q, temp_k, temp_v)
Attention weights are:
tf.Tensor([[0. 1. 0. 0.]], shape=(1, 4), dtype=float32)
Output is:
tf.Tensor([[10.  0.]], shape=(1, 2), dtype=float32)
# This query aligns with a repeated key (third and fourth),
# so all associated values get averaged.
temp_q = tf.constant([[0, 0, 10]], dtype=tf.float32)  # (1, 3)
print_out(temp_q, temp_k, temp_v)
Attention weights are:
tf.Tensor([[0.  0.  0.5 0.5]], shape=(1, 4), dtype=float32)
Output is:
tf.Tensor([[550.    5.5]], shape=(1, 2), dtype=float32)
# This query aligns equally with the first and second key,
# so their values get averaged.
temp_q = tf.constant([[10, 10, 0]], dtype=tf.float32)  # (1, 3)
print_out(temp_q, temp_k, temp_v)
Attention weights are:
tf.Tensor([[0.5 0.5 0.  0. ]], shape=(1, 4), dtype=float32)
Output is:
tf.Tensor([[5.5 0. ]], shape=(1, 2), dtype=float32)

Pass all the queries together.

temp_q = tf.constant([[0, 0, 10],
                      [0, 10, 0],
                      [10, 10, 0]], dtype=tf.float32)  # (3, 3)
print_out(temp_q, temp_k, temp_v)
Attention weights are:
tf.Tensor(
[[0.  0.  0.5 0.5]
 [0.  1.  0.  0. ]
 [0.5 0.5 0.  0. ]], shape=(3, 4), dtype=float32)
Output is:
tf.Tensor(
[[550.    5.5]
 [ 10.    0. ]
 [  5.5   0. ]], shape=(3, 2), dtype=float32)

Multi-head attention

multi-head attention

Multi-head attention consists of four parts:

  • Linear layers and split into heads.
  • Scaled dot-product attention.
  • Concatenation of heads.
  • Final linear layer.

Each multi-head attention block gets three inputs: Q (query), K (key), V (value). These are put through linear (Dense) layers and split up into multiple heads.

The scaled_dot_product_attention defined above is applied to each head (broadcast for efficiency). An appropriate mask must be used in the attention step. The attention output for each head is then concatenated (using tf.transpose and tf.reshape) and put through a final Dense layer.

Instead of one single attention head, Q, K, and V are split into multiple heads because this allows the model to jointly attend to information from different representation subspaces at different positions. After the split, each head has a reduced dimensionality, so the total computation cost is the same as a single-head attention with full dimensionality.
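
To make the split concrete, here is a small standalone shape walkthrough (assuming d_model=512 and num_heads=8, so each head has depth 64); it mirrors the reshape/transpose done in split_heads below:

# (batch, seq_len, d_model) -> (batch, seq_len, num_heads, depth)
#                           -> (batch, num_heads, seq_len, depth)
batch, seq_len, d_model, num_heads = 2, 10, 512, 8
depth = d_model // num_heads
x = tf.random.uniform((batch, seq_len, d_model))
x = tf.reshape(x, (batch, seq_len, num_heads, depth))
x = tf.transpose(x, perm=[0, 2, 1, 3])
print(x.shape)  # (2, 8, 10, 64): each head attends over the full sequence at reduced depth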

class MultiHeadAttention(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads):
    super(MultiHeadAttention, self).__init__()
    self.num_heads = num_heads
    self.d_model = d_model

    assert d_model % self.num_heads == 0

    self.depth = d_model // self.num_heads

    self.wq = tf.keras.layers.Dense(d_model)
    self.wk = tf.keras.layers.Dense(d_model)
    self.wv = tf.keras.layers.Dense(d_model)

    self.dense = tf.keras.layers.Dense(d_model)

  def split_heads(self, x, batch_size):
    """Split the last dimension into (num_heads, depth).
    Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
    """
    x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
    return tf.transpose(x, perm=[0, 2, 1, 3])

  def call(self, v, k, q, mask):
    batch_size = tf.shape(q)[0]

    q = self.wq(q)  # (batch_size, seq_len, d_model)
    k = self.wk(k)  # (batch_size, seq_len, d_model)
    v = self.wv(v)  # (batch_size, seq_len, d_model)

    q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
    k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
    v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

    # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
    # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
    scaled_attention, attention_weights = scaled_dot_product_attention(
        q, k, v, mask)

    scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

    concat_attention = tf.reshape(scaled_attention,
                                  (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

    output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

    return output, attention_weights

Create a MultiHeadAttention layer to try out. At each location in the sequence, y, the MultiHeadAttention runs all 8 attention heads across all other locations in the sequence, returning a new vector of the same length at each location.

temp_mha = MultiHeadAttention(d_model=512, num_heads=8)
y = tf.random.uniform((1, 60, 512))  # (batch_size, encoder_sequence, d_model)
out, attn = temp_mha(y, k=y, q=y, mask=None)
out.shape, attn.shape
(TensorShape([1, 60, 512]), TensorShape([1, 8, 60, 60]))

Point-wise feed-forward network

The point-wise feed-forward network consists of two fully connected layers with a ReLU activation in between.

def point_wise_feed_forward_network(d_model, dff):
  return tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
      tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
  ])
sample_ffn = point_wise_feed_forward_network(512, 2048)
sample_ffn(tf.random.uniform((64, 50, 512))).shape
TensorShape([64, 50, 512])

Encoder and decoder

transformer

The Transformer model follows the same general pattern as a standard sequence-to-sequence with attention model.

  • The input sentence is passed through N encoder layers that generate an output for each token in the sequence.
  • The decoder attends to the encoder's output and its own input (self-attention) to predict the next word.

Encoder layer

Each encoder layer consists of sublayers:

  1. Multi-head attention (with padding mask)
  2. Point-wise feed-forward networks.

Each of these sublayers has a residual connection around it, followed by a layer normalization. Residual connections help avoid the vanishing gradient problem in deep networks.

The output of each sublayer is LayerNorm(x + Sublayer(x)). The normalization is done on the d_model (last) axis. There are N encoder layers in the Transformer.

class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super(EncoderLayer, self).__init__()

    self.mha = MultiHeadAttention(d_model, num_heads)
    self.ffn = point_wise_feed_forward_network(d_model, dff)

    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):

    attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
    attn_output = self.dropout1(attn_output, training=training)
    out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

    ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
    ffn_output = self.dropout2(ffn_output, training=training)
    out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

    return out2
sample_encoder_layer = EncoderLayer(512, 8, 2048)

sample_encoder_layer_output = sample_encoder_layer(
    tf.random.uniform((64, 43, 512)), False, None)

sample_encoder_layer_output.shape  # (batch_size, input_seq_len, d_model)
TensorShape([64, 43, 512])

Decoder layer

Each decoder layer consists of sublayers:

  1. Masked multi-head attention (with look-ahead mask and padding mask)
  2. Multi-head attention (with padding mask). V (value) and K (key) receive the encoder output as inputs. Q (query) receives the output from the masked multi-head attention sublayer.
  3. Point-wise feed-forward networks

Each of these sublayers has a residual connection around it, followed by a layer normalization. The output of each sublayer is LayerNorm(x + Sublayer(x)). The normalization is done on the d_model (last) axis.

There are N decoder layers in the Transformer.

As Q receives the output from the decoder's first attention block, and K receives the encoder output, the attention weights represent the importance given to the decoder's input based on the encoder's output. In other words, the decoder predicts the next token by looking at the encoder output and self-attending to its own output. See the demonstration above in the scaled dot-product attention section.

class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super(DecoderLayer, self).__init__()

    self.mha1 = MultiHeadAttention(d_model, num_heads)
    self.mha2 = MultiHeadAttention(d_model, num_heads)

    self.ffn = point_wise_feed_forward_network(d_model, dff)

    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)
    self.dropout3 = tf.keras.layers.Dropout(rate)

  def call(self, x, enc_output, training,
           look_ahead_mask, padding_mask):
    # enc_output.shape == (batch_size, input_seq_len, d_model)

    attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
    attn1 = self.dropout1(attn1, training=training)
    out1 = self.layernorm1(attn1 + x)

    attn2, attn_weights_block2 = self.mha2(
        enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
    attn2 = self.dropout2(attn2, training=training)
    out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

    ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
    ffn_output = self.dropout3(ffn_output, training=training)
    out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

    return out3, attn_weights_block1, attn_weights_block2
sample_decoder_layer = DecoderLayer(512, 8, 2048)

sample_decoder_layer_output, _, _ = sample_decoder_layer(
    tf.random.uniform((64, 50, 512)), sample_encoder_layer_output,
    False, None, None)

sample_decoder_layer_output.shape  # (batch_size, target_seq_len, d_model)
TensorShape([64, 50, 512])

Encoder

The Encoder consists of:

  1. Input embedding
  2. Positional encoding
  3. N encoder layers

The input is put through an embedding, which is summed with the positional encoding. The output of this summation is the input to the encoder layers. The output of the encoder is the input to the decoder.

class Encoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
               maximum_position_encoding, rate=0.1):
    super(Encoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
    self.pos_encoding = positional_encoding(maximum_position_encoding,
                                            self.d_model)

    self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate)
                       for _ in range(num_layers)]

    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):

    seq_len = tf.shape(x)[1]

    # adding embedding and position encoding.
    x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x = self.enc_layers[i](x, training, mask)

    return x  # (batch_size, input_seq_len, d_model)
sample_encoder = Encoder(num_layers=2, d_model=512, num_heads=8,
                         dff=2048, input_vocab_size=8500,
                         maximum_position_encoding=10000)
temp_input = tf.random.uniform((64, 62), dtype=tf.int64, minval=0, maxval=200)

sample_encoder_output = sample_encoder(temp_input, training=False, mask=None)

print(sample_encoder_output.shape)  # (batch_size, input_seq_len, d_model)
(64, 62, 512)

Decoder

The Decoder consists of:

  1. Output embedding
  2. Positional encoding
  3. N decoder layers

The target is put through an embedding, which is summed with the positional encoding. The output of this summation is the input to the decoder layers. The output of the decoder is the input to the final linear layer.

class Decoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size,
               maximum_position_encoding, rate=0.1):
    super(Decoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
    self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)

    self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate)
                       for _ in range(num_layers)]
    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, enc_output, training,
           look_ahead_mask, padding_mask):

    seq_len = tf.shape(x)[1]
    attention_weights = {}

    x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                                             look_ahead_mask, padding_mask)

      attention_weights[f'decoder_layer{i+1}_block1'] = block1
      attention_weights[f'decoder_layer{i+1}_block2'] = block2

    # x.shape == (batch_size, target_seq_len, d_model)
    return x, attention_weights
sample_decoder = Decoder(num_layers=2, d_model=512, num_heads=8,
                         dff=2048, target_vocab_size=8000,
                         maximum_position_encoding=5000)
temp_input = tf.random.uniform((64, 26), dtype=tf.int64, minval=0, maxval=200)

output, attn = sample_decoder(temp_input,
                              enc_output=sample_encoder_output,
                              training=False,
                              look_ahead_mask=None,
                              padding_mask=None)

output.shape, attn['decoder_layer2_block2'].shape
(TensorShape([64, 26, 512]), TensorShape([64, 8, 26, 62]))

Create the Transformer

The Transformer consists of the encoder, the decoder, and a final linear layer. The output of the decoder is the input to the linear layer and its output is returned.

class Transformer(tf.keras.Model):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
               target_vocab_size, pe_input, pe_target, rate=0.1):
    super().__init__()
    self.encoder = Encoder(num_layers, d_model, num_heads, dff,
                             input_vocab_size, pe_input, rate)

    self.decoder = Decoder(num_layers, d_model, num_heads, dff,
                           target_vocab_size, pe_target, rate)

    self.final_layer = tf.keras.layers.Dense(target_vocab_size)

  def call(self, inputs, training):
    # Keras models prefer if you pass all your inputs in the first argument
    inp, tar = inputs

    enc_padding_mask, look_ahead_mask, dec_padding_mask = self.create_masks(inp, tar)

    enc_output = self.encoder(inp, training, enc_padding_mask)  # (batch_size, inp_seq_len, d_model)

    # dec_output.shape == (batch_size, tar_seq_len, d_model)
    dec_output, attention_weights = self.decoder(
        tar, enc_output, training, look_ahead_mask, dec_padding_mask)

    final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

    return final_output, attention_weights

  def create_masks(self, inp, tar):
    # Encoder padding mask
    enc_padding_mask = create_padding_mask(inp)

    # Used in the 2nd attention block in the decoder.
    # This padding mask is used to mask the encoder outputs.
    dec_padding_mask = create_padding_mask(inp)

    # Used in the 1st attention block in the decoder.
    # It is used to pad and mask future tokens in the input received by
    # the decoder.
    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    dec_target_padding_mask = create_padding_mask(tar)
    look_ahead_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

    return enc_padding_mask, look_ahead_mask, dec_padding_mask
sample_transformer = Transformer(
    num_layers=2, d_model=512, num_heads=8, dff=2048,
    input_vocab_size=8500, target_vocab_size=8000,
    pe_input=10000, pe_target=6000)

temp_input = tf.random.uniform((64, 38), dtype=tf.int64, minval=0, maxval=200)
temp_target = tf.random.uniform((64, 36), dtype=tf.int64, minval=0, maxval=200)

fn_out, _ = sample_transformer([temp_input, temp_target], training=False)

fn_out.shape  # (batch_size, tar_seq_len, target_vocab_size)
TensorShape([64, 36, 8000])

Set hyperparameters

To keep this example small and relatively fast, the values for num_layers, d_model, and dff have been reduced.

The base model described in the paper used: num_layers=6, d_model=512, dff=2048.

num_layers = 4
d_model = 128
dff = 512
num_heads = 8
dropout_rate = 0.1

Optimizer

Use the Adam optimizer with a custom learning rate scheduler, according to the formula in the paper.

$$\Large{lrate = d_{model}^{-0.5} * \min(step{\_}num^{-0.5}, step{\_}num \cdot warmup{\_}steps^{-1.5})}$$
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
  def __init__(self, d_model, warmup_steps=4000):
    super(CustomSchedule, self).__init__()

    self.d_model = d_model
    self.d_model = tf.cast(self.d_model, tf.float32)

    self.warmup_steps = warmup_steps

  def __call__(self, step):
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)

    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
learning_rate = CustomSchedule(d_model)

optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)
temp_learning_rate_schedule = CustomSchedule(d_model)

plt.plot(temp_learning_rate_schedule(tf.range(40000, dtype=tf.float32)))
plt.ylabel("Learning Rate")
plt.xlabel("Train Step")
Text(0.5, 0, 'Train Step')

(learning rate schedule plot)

Loss and metrics

Since the target sequences are padded, it is important to apply a padding mask when calculating the loss.

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')
def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_sum(loss_)/tf.reduce_sum(mask)


def accuracy_function(real, pred):
  accuracies = tf.equal(real, tf.argmax(pred, axis=2))

  mask = tf.math.logical_not(tf.math.equal(real, 0))
  accuracies = tf.math.logical_and(mask, accuracies)

  accuracies = tf.cast(accuracies, dtype=tf.float32)
  mask = tf.cast(mask, dtype=tf.float32)
  return tf.reduce_sum(accuracies)/tf.reduce_sum(mask)
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.Mean(name='train_accuracy')

Training and checkpointing

transformer = Transformer(
    num_layers=num_layers,
    d_model=d_model,
    num_heads=num_heads,
    dff=dff,
    input_vocab_size=tokenizers.pt.get_vocab_size().numpy(),
    target_vocab_size=tokenizers.en.get_vocab_size().numpy(),
    pe_input=1000,
    pe_target=1000,
    rate=dropout_rate)

Create the checkpoint path and the checkpoint manager. This will be used to save checkpoints every n epochs.

checkpoint_path = "./checkpoints/train"

ckpt = tf.train.Checkpoint(transformer=transformer,
                           optimizer=optimizer)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

# if a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:
  ckpt.restore(ckpt_manager.latest_checkpoint)
  print('Latest checkpoint restored!!')

The target is divided into tar_inp and tar_real. tar_inp is passed as input to the decoder. tar_real is that same input shifted by 1: at each location in tar_inp, tar_real contains the next token that should be predicted.

For example, sentence = "SOS A lion in the jungle is sleeping EOS"

tar_inp = "SOS A lion in the jungle is sleeping"

tar_real = "A lion in the jungle is sleeping EOS"
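
In code this shift is just a pair of slices, as done in train_step below; here is a tiny illustration with hypothetical token IDs:

# Hypothetical token IDs: 2 = [START], 3 = [END], 0 = padding.
tar = tf.constant([[2, 71, 142, 15, 3, 0]])
tar_inp = tar[:, :-1]   # [[2, 71, 142, 15, 3]]  -> fed to the decoder
tar_real = tar[:, 1:]   # [[71, 142, 15, 3, 0]]  -> what the decoder should predict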

The Transformer is an auto-regressive model: it makes predictions one part at a time, and uses its output so far to decide what to do next.

During training, this example uses teacher forcing (as in the text generation tutorial). Teacher forcing is passing the true output to the next time step regardless of what the model predicts at the current time step.

As the Transformer predicts each token, self-attention allows it to look at the previous tokens in the input sequence to better predict the next token.

To prevent the model from peeking at the expected output, the model uses a look-ahead mask.

EPOCHS = 20
# The @tf.function trace-compiles train_step into a TF graph for faster
# execution. The function specializes to the precise shape of the argument
# tensors. To avoid re-tracing due to the variable sequence lengths or variable
# batch sizes (the last batch is smaller), use input_signature to specify
# more generic shapes.

train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
]


@tf.function(input_signature=train_step_signature)
def train_step(inp, tar):
  tar_inp = tar[:, :-1]
  tar_real = tar[:, 1:]

  with tf.GradientTape() as tape:
    predictions, _ = transformer([inp, tar_inp],
                                 training = True)
    loss = loss_function(tar_real, predictions)

  gradients = tape.gradient(loss, transformer.trainable_variables)
  optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

  train_loss(loss)
  train_accuracy(accuracy_function(tar_real, predictions))

Portuguese is used as the input language and English is the target language.

for epoch in range(EPOCHS):
  start = time.time()

  train_loss.reset_states()
  train_accuracy.reset_states()

  # inp -> portuguese, tar -> english
  for (batch, (inp, tar)) in enumerate(train_batches):
    train_step(inp, tar)

    if batch % 50 == 0:
      print(f'Epoch {epoch + 1} Batch {batch} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')

  if (epoch + 1) % 5 == 0:
    ckpt_save_path = ckpt_manager.save()
    print(f'Saving checkpoint for epoch {epoch+1} at {ckpt_save_path}')

  print(f'Epoch {epoch + 1} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')

  print(f'Time taken for 1 epoch: {time.time() - start:.2f} secs\n')
Epoch 1 Batch 0 Loss 8.8834 Accuracy 0.0000
Epoch 1 Batch 50 Loss 8.8171 Accuracy 0.0003
Epoch 1 Batch 100 Loss 8.7146 Accuracy 0.0117
Epoch 1 Batch 150 Loss 8.6025 Accuracy 0.0238
Epoch 1 Batch 200 Loss 8.4625 Accuracy 0.0317
Epoch 1 Batch 250 Loss 8.2914 Accuracy 0.0398
Epoch 1 Batch 300 Loss 8.1014 Accuracy 0.0487
Epoch 1 Batch 350 Loss 7.9027 Accuracy 0.0587
Epoch 1 Batch 400 Loss 7.7155 Accuracy 0.0669
Epoch 1 Batch 450 Loss 7.5463 Accuracy 0.0746
Epoch 1 Batch 500 Loss 7.3971 Accuracy 0.0814
Epoch 1 Batch 550 Loss 7.2641 Accuracy 0.0878
Epoch 1 Batch 600 Loss 7.1401 Accuracy 0.0944
Epoch 1 Batch 650 Loss 7.0245 Accuracy 0.1011
Epoch 1 Batch 700 Loss 6.9162 Accuracy 0.1077
Epoch 1 Batch 750 Loss 6.8175 Accuracy 0.1136
Epoch 1 Batch 800 Loss 6.7260 Accuracy 0.1192
Epoch 1 Loss 6.7104 Accuracy 0.1201
Time taken for 1 epoch: 63.51 secs

Epoch 2 Batch 0 Loss 5.0207 Accuracy 0.2346
Epoch 2 Batch 50 Loss 5.2307 Accuracy 0.2122
Epoch 2 Batch 100 Loss 5.2121 Accuracy 0.2158
Epoch 2 Batch 150 Loss 5.1723 Accuracy 0.2200
Epoch 2 Batch 200 Loss 5.1427 Accuracy 0.2231
Epoch 2 Batch 250 Loss 5.1162 Accuracy 0.2259
Epoch 2 Batch 300 Loss 5.0927 Accuracy 0.2283
Epoch 2 Batch 350 Loss 5.0732 Accuracy 0.2303
Epoch 2 Batch 400 Loss 5.0555 Accuracy 0.2322
Epoch 2 Batch 450 Loss 5.0329 Accuracy 0.2345
Epoch 2 Batch 500 Loss 5.0156 Accuracy 0.2357
Epoch 2 Batch 550 Loss 4.9974 Accuracy 0.2373
Epoch 2 Batch 600 Loss 4.9774 Accuracy 0.2391
Epoch 2 Batch 650 Loss 4.9573 Accuracy 0.2408
Epoch 2 Batch 700 Loss 4.9408 Accuracy 0.2422
Epoch 2 Batch 750 Loss 4.9237 Accuracy 0.2436
Epoch 2 Batch 800 Loss 4.9054 Accuracy 0.2451
Epoch 2 Loss 4.9014 Accuracy 0.2455
Time taken for 1 epoch: 50.09 secs

Epoch 3 Batch 0 Loss 4.4613 Accuracy 0.2930
Epoch 3 Batch 50 Loss 4.5732 Accuracy 0.2737
Epoch 3 Batch 100 Loss 4.5668 Accuracy 0.2750
Epoch 3 Batch 150 Loss 4.5608 Accuracy 0.2747
Epoch 3 Batch 200 Loss 4.5519 Accuracy 0.2754
Epoch 3 Batch 250 Loss 4.5469 Accuracy 0.2755
Epoch 3 Batch 300 Loss 4.5367 Accuracy 0.2761
Epoch 3 Batch 350 Loss 4.5232 Accuracy 0.2777
Epoch 3 Batch 400 Loss 4.5089 Accuracy 0.2794
Epoch 3 Batch 450 Loss 4.4975 Accuracy 0.2807
Epoch 3 Batch 500 Loss 4.4818 Accuracy 0.2827
Epoch 3 Batch 550 Loss 4.4671 Accuracy 0.2844
Epoch 3 Batch 600 Loss 4.4516 Accuracy 0.2862
Epoch 3 Batch 650 Loss 4.4376 Accuracy 0.2879
Epoch 3 Batch 700 Loss 4.4223 Accuracy 0.2897
Epoch 3 Batch 750 Loss 4.4060 Accuracy 0.2916
Epoch 3 Batch 800 Loss 4.3902 Accuracy 0.2936
Epoch 3 Loss 4.3880 Accuracy 0.2939
Time taken for 1 epoch: 49.94 secs

Epoch 4 Batch 0 Loss 4.0719 Accuracy 0.3241
Epoch 4 Batch 50 Loss 4.0570 Accuracy 0.3289
Epoch 4 Batch 100 Loss 4.0280 Accuracy 0.3336
Epoch 4 Batch 150 Loss 4.0128 Accuracy 0.3356
Epoch 4 Batch 200 Loss 3.9983 Accuracy 0.3381
Epoch 4 Batch 250 Loss 3.9811 Accuracy 0.3405
Epoch 4 Batch 300 Loss 3.9700 Accuracy 0.3419
Epoch 4 Batch 350 Loss 3.9590 Accuracy 0.3435
Epoch 4 Batch 400 Loss 3.9454 Accuracy 0.3455
Epoch 4 Batch 450 Loss 3.9319 Accuracy 0.3472
Epoch 4 Batch 500 Loss 3.9187 Accuracy 0.3488
Epoch 4 Batch 550 Loss 3.9056 Accuracy 0.3505
Epoch 4 Batch 600 Loss 3.8928 Accuracy 0.3523
Epoch 4 Batch 650 Loss 3.8805 Accuracy 0.3541
Epoch 4 Batch 700 Loss 3.8652 Accuracy 0.3561
Epoch 4 Batch 750 Loss 3.8496 Accuracy 0.3581
Epoch 4 Batch 800 Loss 3.8367 Accuracy 0.3598
Epoch 4 Loss 3.8342 Accuracy 0.3602
Time taken for 1 epoch: 50.11 secs

Epoch 5 Batch 0 Loss 3.3066 Accuracy 0.4257
Epoch 5 Batch 50 Loss 3.5243 Accuracy 0.3967
Epoch 5 Batch 100 Loss 3.4975 Accuracy 0.4005
Epoch 5 Batch 150 Loss 3.4915 Accuracy 0.4017
Epoch 5 Batch 200 Loss 3.4911 Accuracy 0.4016
Epoch 5 Batch 250 Loss 3.4790 Accuracy 0.4037
Epoch 5 Batch 300 Loss 3.4692 Accuracy 0.4048
Epoch 5 Batch 350 Loss 3.4599 Accuracy 0.4059
Epoch 5 Batch 400 Loss 3.4491 Accuracy 0.4074
Epoch 5 Batch 450 Loss 3.4356 Accuracy 0.4092
Epoch 5 Batch 500 Loss 3.4243 Accuracy 0.4106
Epoch 5 Batch 550 Loss 3.4159 Accuracy 0.4115
Epoch 5 Batch 600 Loss 3.4083 Accuracy 0.4124
Epoch 5 Batch 650 Loss 3.4002 Accuracy 0.4134
Epoch 5 Batch 700 Loss 3.3922 Accuracy 0.4146
Epoch 5 Batch 750 Loss 3.3824 Accuracy 0.4158
Epoch 5 Batch 800 Loss 3.3739 Accuracy 0.4168
Saving checkpoint for epoch 5 at ./checkpoints/train/ckpt-1
Epoch 5 Loss 3.3715 Accuracy 0.4171
Time taken for 1 epoch: 50.52 secs

Epoch 6 Batch 0 Loss 3.2623 Accuracy 0.4515
Epoch 6 Batch 50 Loss 3.1389 Accuracy 0.4400
Epoch 6 Batch 100 Loss 3.1121 Accuracy 0.4445
Epoch 6 Batch 150 Loss 3.0973 Accuracy 0.4471
Epoch 6 Batch 200 Loss 3.0844 Accuracy 0.4496
Epoch 6 Batch 250 Loss 3.0725 Accuracy 0.4519
Epoch 6 Batch 300 Loss 3.0648 Accuracy 0.4533
Epoch 6 Batch 350 Loss 3.0572 Accuracy 0.4545
Epoch 6 Batch 400 Loss 3.0523 Accuracy 0.4551
Epoch 6 Batch 450 Loss 3.0423 Accuracy 0.4568
Epoch 6 Batch 500 Loss 3.0327 Accuracy 0.4582
Epoch 6 Batch 550 Loss 3.0237 Accuracy 0.4594
Epoch 6 Batch 600 Loss 3.0123 Accuracy 0.4609
Epoch 6 Batch 650 Loss 3.0031 Accuracy 0.4622
Epoch 6 Batch 700 Loss 2.9950 Accuracy 0.4634
Epoch 6 Batch 750 Loss 2.9881 Accuracy 0.4646
Epoch 6 Batch 800 Loss 2.9792 Accuracy 0.4658
Epoch 6 Loss 2.9779 Accuracy 0.4660
Time taken for 1 epoch: 49.88 secs

Epoch 7 Batch 0 Loss 2.6813 Accuracy 0.5090
Epoch 7 Batch 50 Loss 2.7174 Accuracy 0.4964
Epoch 7 Batch 100 Loss 2.7045 Accuracy 0.4980
Epoch 7 Batch 150 Loss 2.6982 Accuracy 0.4991
Epoch 7 Batch 200 Loss 2.6947 Accuracy 0.5002
Epoch 7 Batch 250 Loss 2.6909 Accuracy 0.5007
Epoch 7 Batch 300 Loss 2.6822 Accuracy 0.5023
Epoch 7 Batch 350 Loss 2.6765 Accuracy 0.5036
Epoch 7 Batch 400 Loss 2.6743 Accuracy 0.5043
Epoch 7 Batch 450 Loss 2.6693 Accuracy 0.5050
Epoch 7 Batch 500 Loss 2.6641 Accuracy 0.5055
Epoch 7 Batch 550 Loss 2.6581 Accuracy 0.5063
Epoch 7 Batch 600 Loss 2.6516 Accuracy 0.5074
Epoch 7 Batch 650 Loss 2.6488 Accuracy 0.5080
Epoch 7 Batch 700 Loss 2.6465 Accuracy 0.5087
Epoch 7 Batch 750 Loss 2.6440 Accuracy 0.5091
Epoch 7 Batch 800 Loss 2.6428 Accuracy 0.5093
Epoch 7 Loss 2.6422 Accuracy 0.5094
Time taken for 1 epoch: 50.18 secs

Epoch 8 Batch 0 Loss 2.4922 Accuracy 0.5348
Epoch 8 Batch 50 Loss 2.4303 Accuracy 0.5361
Epoch 8 Batch 100 Loss 2.4228 Accuracy 0.5369
Epoch 8 Batch 150 Loss 2.4262 Accuracy 0.5373
Epoch 8 Batch 200 Loss 2.4359 Accuracy 0.5361
Epoch 8 Batch 250 Loss 2.4285 Accuracy 0.5373
Epoch 8 Batch 300 Loss 2.4315 Accuracy 0.5368
Epoch 8 Batch 350 Loss 2.4285 Accuracy 0.5368
Epoch 8 Batch 400 Loss 2.4265 Accuracy 0.5373
Epoch 8 Batch 450 Loss 2.4289 Accuracy 0.5371
Epoch 8 Batch 500 Loss 2.4252 Accuracy 0.5377
Epoch 8 Batch 550 Loss 2.4228 Accuracy 0.5382
Epoch 8 Batch 600 Loss 2.4201 Accuracy 0.5386
Epoch 8 Batch 650 Loss 2.4199 Accuracy 0.5388
Epoch 8 Batch 700 Loss 2.4177 Accuracy 0.5394
Epoch 8 Batch 750 Loss 2.4135 Accuracy 0.5400
Epoch 8 Batch 800 Loss 2.4100 Accuracy 0.5407
Epoch 8 Loss 2.4093 Accuracy 0.5407
Time taken for 1 epoch: 50.20 secs

Epoch 9 Batch 0 Loss 2.1292 Accuracy 0.5823
Epoch 9 Batch 50 Loss 2.2265 Accuracy 0.5628
Epoch 9 Batch 100 Loss 2.2201 Accuracy 0.5643
Epoch 9 Batch 150 Loss 2.2378 Accuracy 0.5617
Epoch 9 Batch 200 Loss 2.2386 Accuracy 0.5619
Epoch 9 Batch 250 Loss 2.2369 Accuracy 0.5620
Epoch 9 Batch 300 Loss 2.2367 Accuracy 0.5624
Epoch 9 Batch 350 Loss 2.2420 Accuracy 0.5618
Epoch 9 Batch 400 Loss 2.2416 Accuracy 0.5621
Epoch 9 Batch 450 Loss 2.2421 Accuracy 0.5620
Epoch 9 Batch 500 Loss 2.2397 Accuracy 0.5624
Epoch 9 Batch 550 Loss 2.2369 Accuracy 0.5631
Epoch 9 Batch 600 Loss 2.2350 Accuracy 0.5636
Epoch 9 Batch 650 Loss 2.2332 Accuracy 0.5640
Epoch 9 Batch 700 Loss 2.2332 Accuracy 0.5641
Epoch 9 Batch 750 Loss 2.2344 Accuracy 0.5641
Epoch 9 Batch 800 Loss 2.2331 Accuracy 0.5644
Epoch 9 Loss 2.2332 Accuracy 0.5644
Time taken for 1 epoch: 49.98 secs

Epoch 10 Batch 0 Loss 1.9345 Accuracy 0.6113
Epoch 10 Batch 50 Loss 2.0750 Accuracy 0.5826
Epoch 10 Batch 100 Loss 2.0660 Accuracy 0.5854
Epoch 10 Batch 150 Loss 2.0716 Accuracy 0.5848
Epoch 10 Batch 200 Loss 2.0771 Accuracy 0.5843
Epoch 10 Batch 250 Loss 2.0872 Accuracy 0.5827
Epoch 10 Batch 300 Loss 2.0928 Accuracy 0.5822
Epoch 10 Batch 350 Loss 2.0927 Accuracy 0.5824
Epoch 10 Batch 400 Loss 2.0932 Accuracy 0.5823
Epoch 10 Batch 450 Loss 2.0987 Accuracy 0.5815
Epoch 10 Batch 500 Loss 2.0972 Accuracy 0.5817
Epoch 10 Batch 550 Loss 2.0952 Accuracy 0.5821
Epoch 10 Batch 600 Loss 2.0951 Accuracy 0.5823
Epoch 10 Batch 650 Loss 2.0961 Accuracy 0.5824
Epoch 10 Batch 700 Loss 2.0953 Accuracy 0.5825
Epoch 10 Batch 750 Loss 2.0957 Accuracy 0.5826
Epoch 10 Batch 800 Loss 2.0964 Accuracy 0.5827
Saving checkpoint for epoch 10 at ./checkpoints/train/ckpt-2
Epoch 10 Loss 2.0967 Accuracy 0.5826
Time taken for 1 epoch: 50.31 secs

Epoch 11 Batch 0 Loss 1.8751 Accuracy 0.6013
Epoch 11 Batch 50 Loss 1.9732 Accuracy 0.5984
Epoch 11 Batch 100 Loss 1.9554 Accuracy 0.6015
Epoch 11 Batch 150 Loss 1.9659 Accuracy 0.6000
Epoch 11 Batch 200 Loss 1.9657 Accuracy 0.6003
Epoch 11 Batch 250 Loss 1.9708 Accuracy 0.5998
Epoch 11 Batch 300 Loss 1.9733 Accuracy 0.5998
Epoch 11 Batch 350 Loss 1.9698 Accuracy 0.6006
Epoch 11 Batch 400 Loss 1.9696 Accuracy 0.6006
Epoch 11 Batch 450 Loss 1.9692 Accuracy 0.6006
Epoch 11 Batch 500 Loss 1.9704 Accuracy 0.6008
Epoch 11 Batch 550 Loss 1.9720 Accuracy 0.6006
Epoch 11 Batch 600 Loss 1.9744 Accuracy 0.6002
Epoch 11 Batch 650 Loss 1.9782 Accuracy 0.5996
Epoch 11 Batch 700 Loss 1.9801 Accuracy 0.5993
Epoch 11 Batch 750 Loss 1.9810 Accuracy 0.5993
Epoch 11 Batch 800 Loss 1.9847 Accuracy 0.5988
Epoch 11 Loss 1.9847 Accuracy 0.5988
Time taken for 1 epoch: 50.20 secs

Epoch 12 Batch 0 Loss 1.8019 Accuracy 0.6165
Epoch 12 Batch 50 Loss 1.8706 Accuracy 0.6121
Epoch 12 Batch 100 Loss 1.8494 Accuracy 0.6170
Epoch 12 Batch 150 Loss 1.8620 Accuracy 0.6141
Epoch 12 Batch 200 Loss 1.8596 Accuracy 0.6152
Epoch 12 Batch 250 Loss 1.8639 Accuracy 0.6149
Epoch 12 Batch 300 Loss 1.8700 Accuracy 0.6141
Epoch 12 Batch 350 Loss 1.8721 Accuracy 0.6137
Epoch 12 Batch 400 Loss 1.8756 Accuracy 0.6137
Epoch 12 Batch 450 Loss 1.8741 Accuracy 0.6140
Epoch 12 Batch 500 Loss 1.8762 Accuracy 0.6135
Epoch 12 Batch 550 Loss 1.8781 Accuracy 0.6136
Epoch 12 Batch 600 Loss 1.8792 Accuracy 0.6137
Epoch 12 Batch 650 Loss 1.8813 Accuracy 0.6135
Epoch 12 Batch 700 Loss 1.8833 Accuracy 0.6132
Epoch 12 Batch 750 Loss 1.8870 Accuracy 0.6128
Epoch 12 Batch 800 Loss 1.8878 Accuracy 0.6127
Epoch 12 Loss 1.8877 Accuracy 0.6128
Time taken for 1 epoch: 50.11 secs

Epoch 13 Batch 0 Loss 1.7577 Accuracy 0.6166
Epoch 13 Batch 50 Loss 1.7580 Accuracy 0.6288
Epoch 13 Batch 100 Loss 1.7815 Accuracy 0.6261
Epoch 13 Batch 150 Loss 1.7886 Accuracy 0.6255
Epoch 13 Batch 200 Loss 1.7870 Accuracy 0.6266
Epoch 13 Batch 250 Loss 1.7852 Accuracy 0.6269
Epoch 13 Batch 300 Loss 1.7886 Accuracy 0.6266
Epoch 13 Batch 350 Loss 1.7887 Accuracy 0.6265
Epoch 13 Batch 400 Loss 1.7909 Accuracy 0.6263
Epoch 13 Batch 450 Loss 1.7928 Accuracy 0.6259
Epoch 13 Batch 500 Loss 1.7948 Accuracy 0.6257
Epoch 13 Batch 550 Loss 1.7940 Accuracy 0.6260
Epoch 13 Batch 600 Loss 1.7980 Accuracy 0.6255
Epoch 13 Batch 650 Loss 1.8034 Accuracy 0.6248
Epoch 13 Batch 700 Loss 1.8055 Accuracy 0.6246
Epoch 13 Batch 750 Loss 1.8079 Accuracy 0.6242
Epoch 13 Batch 800 Loss 1.8100 Accuracy 0.6240
Epoch 13 Loss 1.8095 Accuracy 0.6242
Time taken for 1 epoch: 50.46 secs

Epoch 14 Batch 0 Loss 1.5412 Accuracy 0.6549
Epoch 14 Batch 50 Loss 1.6754 Accuracy 0.6440
Epoch 14 Batch 100 Loss 1.6976 Accuracy 0.6404
Epoch 14 Batch 150 Loss 1.7039 Accuracy 0.6400
Epoch 14 Batch 200 Loss 1.7044 Accuracy 0.6393
Epoch 14 Batch 250 Loss 1.7092 Accuracy 0.6387
Epoch 14 Batch 300 Loss 1.7117 Accuracy 0.6381
Epoch 14 Batch 350 Loss 1.7155 Accuracy 0.6378
Epoch 14 Batch 400 Loss 1.7190 Accuracy 0.6373
Epoch 14 Batch 450 Loss 1.7239 Accuracy 0.6366
Epoch 14 Batch 500 Loss 1.7270 Accuracy 0.6363
Epoch 14 Batch 550 Loss 1.7268 Accuracy 0.6365
Epoch 14 Batch 600 Loss 1.7279 Accuracy 0.6364
Epoch 14 Batch 650 Loss 1.7310 Accuracy 0.6361
Epoch 14 Batch 700 Loss 1.7349 Accuracy 0.6356
Epoch 14 Batch 750 Loss 1.7368 Accuracy 0.6354
Epoch 14 Batch 800 Loss 1.7388 Accuracy 0.6352
Epoch 14 Loss 1.7391 Accuracy 0.6350
Time taken for 1 epoch: 50.06 secs

Epoch 15 Batch 0 Loss 1.4886 Accuracy 0.6655
Epoch 15 Batch 50 Loss 1.6380 Accuracy 0.6488
Epoch 15 Batch 100 Loss 1.6306 Accuracy 0.6508
Epoch 15 Batch 150 Loss 1.6376 Accuracy 0.6498
Epoch 15 Batch 200 Loss 1.6446 Accuracy 0.6488
Epoch 15 Batch 250 Loss 1.6529 Accuracy 0.6472
Epoch 15 Batch 300 Loss 1.6506 Accuracy 0.6480
Epoch 15 Batch 350 Loss 1.6529 Accuracy 0.6476
Epoch 15 Batch 400 Loss 1.6582 Accuracy 0.6466
Epoch 15 Batch 450 Loss 1.6600 Accuracy 0.6462
Epoch 15 Batch 500 Loss 1.6602 Accuracy 0.6461
Epoch 15 Batch 550 Loss 1.6621 Accuracy 0.6460
Epoch 15 Batch 600 Loss 1.6657 Accuracy 0.6455
Epoch 15 Batch 650 Loss 1.6684 Accuracy 0.6452
Epoch 15 Batch 700 Loss 1.6717 Accuracy 0.6447
Epoch 15 Batch 750 Loss 1.6751 Accuracy 0.6442
Epoch 15 Batch 800 Loss 1.6780 Accuracy 0.6438
Saving checkpoint for epoch 15 at ./checkpoints/train/ckpt-3
Epoch 15 Loss 1.6776 Accuracy 0.6439
Time taken for 1 epoch: 50.45 secs

Epoch 16 Batch 0 Loss 1.6809 Accuracy 0.6423
Epoch 16 Batch 50 Loss 1.5821 Accuracy 0.6579
Epoch 16 Batch 100 Loss 1.5793 Accuracy 0.6589
Epoch 16 Batch 150 Loss 1.5799 Accuracy 0.6592
Epoch 16 Batch 200 Loss 1.5882 Accuracy 0.6573
Epoch 16 Batch 250 Loss 1.5968 Accuracy 0.6562
Epoch 16 Batch 300 Loss 1.5993 Accuracy 0.6558
Epoch 16 Batch 350 Loss 1.6046 Accuracy 0.6549
Epoch 16 Batch 400 Loss 1.6066 Accuracy 0.6549
Epoch 16 Batch 450 Loss 1.6089 Accuracy 0.6543
Epoch 16 Batch 500 Loss 1.6086 Accuracy 0.6547
Epoch 16 Batch 550 Loss 1.6116 Accuracy 0.6542
Epoch 16 Batch 600 Loss 1.6148 Accuracy 0.6537
Epoch 16 Batch 650 Loss 1.6167 Accuracy 0.6534
Epoch 16 Batch 700 Loss 1.6182 Accuracy 0.6532
Epoch 16 Batch 750 Loss 1.6223 Accuracy 0.6526
Epoch 16 Batch 800 Loss 1.6257 Accuracy 0.6522
Epoch 16 Loss 1.6262 Accuracy 0.6521
Time taken for 1 epoch: 50.25 secs

Epoch 17 Batch 0 Loss 1.4921 Accuracy 0.6865
Epoch 17 Batch 50 Loss 1.5425 Accuracy 0.6633
Epoch 17 Batch 100 Loss 1.5382 Accuracy 0.6653
Epoch 17 Batch 150 Loss 1.5448 Accuracy 0.6641
Epoch 17 Batch 200 Loss 1.5442 Accuracy 0.6642
Epoch 17 Batch 250 Loss 1.5507 Accuracy 0.6631
Epoch 17 Batch 300 Loss 1.5498 Accuracy 0.6634
Epoch 17 Batch 350 Loss 1.5504 Accuracy 0.6633
Epoch 17 Batch 400 Loss 1.5554 Accuracy 0.6626
Epoch 17 Batch 450 Loss 1.5544 Accuracy 0.6627
Epoch 17 Batch 500 Loss 1.5573 Accuracy 0.6623
Epoch 17 Batch 550 Loss 1.5585 Accuracy 0.6621
Epoch 17 Batch 600 Loss 1.5618 Accuracy 0.6616
Epoch 17 Batch 650 Loss 1.5658 Accuracy 0.6610
Epoch 17 Batch 700 Loss 1.5685 Accuracy 0.6608
Epoch 17 Batch 750 Loss 1.5706 Accuracy 0.6605
Epoch 17 Batch 800 Loss 1.5733 Accuracy 0.6601
Epoch 17 Loss 1.5730 Accuracy 0.6602
Time taken for 1 epoch: 49.99 secs

Epoch 18 Batch 0 Loss 1.3174 Accuracy 0.7121
Epoch 18 Batch 50 Loss 1.4979 Accuracy 0.6729
Epoch 18 Batch 100 Loss 1.5023 Accuracy 0.6712
Epoch 18 Batch 150 Loss 1.4990 Accuracy 0.6719
Epoch 18 Batch 200 Loss 1.5051 Accuracy 0.6705
Epoch 18 Batch 250 Loss 1.5029 Accuracy 0.6709
Epoch 18 Batch 300 Loss 1.5063 Accuracy 0.6706
Epoch 18 Batch 350 Loss 1.5092 Accuracy 0.6701
Epoch 18 Batch 400 Loss 1.5103 Accuracy 0.6702
Epoch 18 Batch 450 Loss 1.5150 Accuracy 0.6692
Epoch 18 Batch 500 Loss 1.5157 Accuracy 0.6690
Epoch 18 Batch 550 Loss 1.5176 Accuracy 0.6686
Epoch 18 Batch 600 Loss 1.5211 Accuracy 0.6682
Epoch 18 Batch 650 Loss 1.5220 Accuracy 0.6681
Epoch 18 Batch 700 Loss 1.5251 Accuracy 0.6676
Epoch 18 Batch 750 Loss 1.5284 Accuracy 0.6671
Epoch 18 Batch 800 Loss 1.5310 Accuracy 0.6667
Epoch 18 Loss 1.5326 Accuracy 0.6664
Time taken for 1 epoch: 50.17 secs

Epoch 19 Batch 0 Loss 1.1448 Accuracy 0.7440
Epoch 19 Batch 50 Loss 1.4359 Accuracy 0.6841
Epoch 19 Batch 100 Loss 1.4299 Accuracy 0.6847
Epoch 19 Batch 150 Loss 1.4454 Accuracy 0.6812
Epoch 19 Batch 200 Loss 1.4525 Accuracy 0.6800
Epoch 19 Batch 250 Loss 1.4549 Accuracy 0.6792
Epoch 19 Batch 300 Loss 1.4585 Accuracy 0.6785
Epoch 19 Batch 350 Loss 1.4612 Accuracy 0.6780
Epoch 19 Batch 400 Loss 1.4655 Accuracy 0.6771
Epoch 19 Batch 450 Loss 1.4698 Accuracy 0.6762
Epoch 19 Batch 500 Loss 1.4731 Accuracy 0.6756
Epoch 19 Batch 550 Loss 1.4752 Accuracy 0.6754
Epoch 19 Batch 600 Loss 1.4789 Accuracy 0.6749
Epoch 19 Batch 650 Loss 1.4824 Accuracy 0.6743
Epoch 19 Batch 700 Loss 1.4861 Accuracy 0.6737
Epoch 19 Batch 750 Loss 1.4904 Accuracy 0.6730
Epoch 19 Batch 800 Loss 1.4922 Accuracy 0.6727
Epoch 19 Loss 1.4918 Accuracy 0.6728
Time taken for 1 epoch: 50.39 secs

Epoch 20 Batch 0 Loss 1.6377 Accuracy 0.6362
Epoch 20 Batch 50 Loss 1.4039 Accuracy 0.6865
Epoch 20 Batch 100 Loss 1.4086 Accuracy 0.6860
Epoch 20 Batch 150 Loss 1.4210 Accuracy 0.6838
Epoch 20 Batch 200 Loss 1.4186 Accuracy 0.6847
Epoch 20 Batch 250 Loss 1.4228 Accuracy 0.6839
Epoch 20 Batch 300 Loss 1.4270 Accuracy 0.6827
Epoch 20 Batch 350 Loss 1.4326 Accuracy 0.6817
Epoch 20 Batch 400 Loss 1.4357 Accuracy 0.6811
Epoch 20 Batch 450 Loss 1.4408 Accuracy 0.6804
Epoch 20 Batch 500 Loss 1.4422 Accuracy 0.6801
Epoch 20 Batch 550 Loss 1.4448 Accuracy 0.6797
Epoch 20 Batch 600 Loss 1.4451 Accuracy 0.6798
Epoch 20 Batch 650 Loss 1.4471 Accuracy 0.6796
Epoch 20 Batch 700 Loss 1.4501 Accuracy 0.6792
Epoch 20 Batch 750 Loss 1.4512 Accuracy 0.6792
Epoch 20 Batch 800 Loss 1.4548 Accuracy 0.6788
Saving checkpoint for epoch 20 at ./checkpoints/train/ckpt-4
Epoch 20 Loss 1.4550 Accuracy 0.6787
Time taken for 1 epoch: 50.33 secs

Run inference

The following steps are used for inference:

  • Encode the input sentence using the Portuguese tokenizer (tokenizers.pt). This is the encoder input.
  • The decoder input is initialized to the [START] token.
  • Calculate the padding masks and the look-ahead masks (a brief sketch of the look-ahead mask follows this list).
  • The decoder then outputs the predictions by looking at the encoder output and its own output (self-attention).
  • Concatenate the predicted token to the decoder input and pass it back to the decoder.
  • In this approach, the decoder predicts the next token based on the previous tokens it predicted.
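
As a quick reminder of the mask step above, here is a minimal sketch of a look-ahead mask, using the convention where 1 marks a position the decoder must not attend to:

# Look-ahead mask for a target sequence of length 4: position i may only
# attend to positions 0..i, so the upper triangle (the future positions)
# is filled with 1s, meaning "do not attend here".
size = 4
look_ahead_mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
print(look_ahead_mask)
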
class Translator(tf.Module):
  def __init__(self, tokenizers, transformer):
    self.tokenizers = tokenizers
    self.transformer = transformer

  def __call__(self, sentence, max_length=20):
    # input sentence is portuguese, hence adding the start and end token
    assert isinstance(sentence, tf.Tensor)
    if len(sentence.shape) == 0:
      sentence = sentence[tf.newaxis]

    sentence = self.tokenizers.pt.tokenize(sentence).to_tensor()

    encoder_input = sentence

    # as the target is english, the first token to the transformer should be the
    # english start token.
    start_end = self.tokenizers.en.tokenize([''])[0]
    start = start_end[0][tf.newaxis]
    end = start_end[1][tf.newaxis]

    # `tf.TensorArray` is required here (instead of a python list) so that the
    # dynamic-loop can be traced by `tf.function`.
    output_array = tf.TensorArray(dtype=tf.int64, size=0, dynamic_size=True)
    output_array = output_array.write(0, start)

    for i in tf.range(max_length):
      output = tf.transpose(output_array.stack())
      predictions, _ = self.transformer([encoder_input, output], training=False)

      # select the last token from the seq_len dimension
      predictions = predictions[:, -1:, :]  # (batch_size, 1, vocab_size)

      predicted_id = tf.argmax(predictions, axis=-1)

      # concatenate the predicted_id to the output which is given to the decoder
      # as its input.
      output_array = output_array.write(i+1, predicted_id[0])

      if predicted_id == end:
        break

    output = tf.transpose(output_array.stack())
    # output.shape (1, tokens)
    text = self.tokenizers.en.detokenize(output)[0]  # shape: ()

    tokens = self.tokenizers.en.lookup(output)[0]

    # `tf.function` prevents us from using the attention_weights that were
    # calculated on the last iteration of the loop. So recalculate them outside
    # the loop.
    _, attention_weights = self.transformer([encoder_input, output[:,:-1]], training=False)

    return text, tokens, attention_weights

Create an instance of this Translator class, and try it out a few times:

translator = Translator(tokenizers, transformer)
def print_translation(sentence, tokens, ground_truth):
  print(f'{"Input:":15s}: {sentence}')
  print(f'{"Prediction":15s}: {tokens.numpy().decode("utf-8")}')
  print(f'{"Ground truth":15s}: {ground_truth}')
sentence = "este é um problema que temos que resolver."
ground_truth = "this is a problem we have to solve ."

translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)
Input:         : este é um problema que temos que resolver.
Prediction     : this is a problem that we have to solve .
Ground truth   : this is a problem we have to solve .
sentence = "os meus vizinhos ouviram sobre esta ideia."
ground_truth = "and my neighboring homes heard about this idea ."

translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)
Input:         : os meus vizinhos ouviram sobre esta ideia.
Prediction     : my neighbors heard about this idea .
Ground truth   : and my neighboring homes heard about this idea .
sentence = "vou então muito rapidamente partilhar convosco algumas histórias de algumas coisas mágicas que aconteceram."
ground_truth = "so i \'ll just share with you some stories very quickly of some magical things that have happened ."

translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)
Input:         : vou então muito rapidamente partilhar convosco algumas histórias de algumas coisas mágicas que aconteceram.
Prediction     : so i ' m going to be very quickly to share some stories of some magic stories that happened .
Ground truth   : so i 'll just share with you some stories very quickly of some magical things that have happened .

Attention plots

The Translator class returns a dictionary of attention maps that you can use to visualize the inner workings of the model.
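
As a quick orientation before plotting, here is a small sketch (reusing the attention_weights returned by the previous translation) that prints the name and shape of each map; each entry is one decoder layer's attention, with shape (batch, num_heads, seq_len_q, seq_len_k):

# Inspect the attention dictionary returned by the Translator.
for name, weights in attention_weights.items():
  print(f'{name}: {weights.shape}')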

sentence = "este é o primeiro livro que eu fiz."
ground_truth = "this is the first book i've ever done."

translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)
Input:         : este é o primeiro livro que eu fiz.
Prediction     : this is the first book i did .
Ground truth   : this is the first book i've ever done.
def plot_attention_head(in_tokens, translated_tokens, attention):
  # The plot is of the attention when a token was generated.
  # The model didn't generate `[START]` in the output. Skip it.
  translated_tokens = translated_tokens[1:]

  ax = plt.gca()
  ax.matshow(attention)
  ax.set_xticks(range(len(in_tokens)))
  ax.set_yticks(range(len(translated_tokens)))

  labels = [label.decode('utf-8') for label in in_tokens.numpy()]
  ax.set_xticklabels(
      labels, rotation=90)

  labels = [label.decode('utf-8') for label in translated_tokens.numpy()]
  ax.set_yticklabels(labels)
head = 0
# shape: (batch=1, num_heads, seq_len_q, seq_len_k)
attention_heads = tf.squeeze(
  attention_weights['decoder_layer4_block2'], 0)
attention = attention_heads[head]
attention.shape
TensorShape([9, 11])
in_tokens = tf.convert_to_tensor([sentence])
in_tokens = tokenizers.pt.tokenize(in_tokens).to_tensor()
in_tokens = tokenizers.pt.lookup(in_tokens)[0]
in_tokens
<tf.Tensor: shape=(11,), dtype=string, numpy=
array([b'[START]', b'este', b'e', b'o', b'primeiro', b'livro', b'que',
       b'eu', b'fiz', b'.', b'[END]'], dtype=object)>
translated_tokens
<tf.Tensor: shape=(10,), dtype=string, numpy=
array([b'[START]', b'this', b'is', b'the', b'first', b'book', b'i',
       b'did', b'.', b'[END]'], dtype=object)>
plot_attention_head(in_tokens, translated_tokens, attention)

png

def plot_attention_weights(sentence, translated_tokens, attention_heads):
  in_tokens = tf.convert_to_tensor([sentence])
  in_tokens = tokenizers.pt.tokenize(in_tokens).to_tensor()
  in_tokens = tokenizers.pt.lookup(in_tokens)[0]

  fig = plt.figure(figsize=(16, 8))

  for h, head in enumerate(attention_heads):
    ax = fig.add_subplot(2, 4, h+1)

    plot_attention_head(in_tokens, translated_tokens, head)

    ax.set_xlabel(f'Head {h+1}')

  plt.tight_layout()
  plt.show()
plot_attention_weights(sentence, translated_tokens,
                       attention_weights['decoder_layer4_block2'][0])

png

The model does okay on unknown words. Neither "triceratops" nor "encyclopedia" appears in the input dataset, and the model almost learns to transliterate them, even without a shared vocabulary (the tokenizer sketch after the plots below shows how these words get split into subwords):

sentence = "Eu li sobre triceratops na enciclopédia."
ground_truth = "I read about triceratops in the encyclopedia."

translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)

plot_attention_weights(sentence, translated_tokens,
                       attention_weights['decoder_layer4_block2'][0])
Input:         : Eu li sobre triceratops na enciclopédia.
Prediction     : i read about triopters on the encyclopedia .
Ground truth   : I read about triceratops in the encyclopedia.

png
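
To see why this works, it helps to look at how the subword tokenizer splits the out-of-vocabulary words (a minimal sketch, reusing the tokenizers object loaded earlier in this notebook): "triceratops" and "enciclopédia" are broken into smaller known pieces rather than being mapped to a single unknown-word token.

# Tokenize the sentence and look up the subword pieces, mirroring the
# in_tokens inspection done above for the attention plots.
pieces = tokenizers.pt.tokenize(tf.constant(["Eu li sobre triceratops na enciclopédia."])).to_tensor()
print(tokenizers.pt.lookup(pieces)[0].numpy())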

Export

That inference model is working, so next you'll export it as a tf.saved_model.

To do that, wrap it in yet another tf.Module subclass, this time with a tf.function on the __call__ method:

class ExportTranslator(tf.Module):
  def __init__(self, translator):
    self.translator = translator

  @tf.function(input_signature=[tf.TensorSpec(shape=[], dtype=tf.string)])
  def __call__(self, sentence):
    (result, 
     tokens,
     attention_weights) = self.translator(sentence, max_length=100)

    return result

In the tf.function above, only the output sentence is returned. Thanks to the non-strict execution in tf.function, any unnecessary values are never computed.
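
As a quick aside (not part of the translation pipeline), here is a minimal sketch along the lines of the non-strict-execution example in the tf.function guide: the out-of-range tf.gather below fails when executed eagerly on CPU, but inside a tf.function its unused result is pruned from the graph and never executed.

def gather_unused(x):
  tf.gather(x, [1])  # out of range for a length-1 tensor; the result is never used
  return x

try:
  # Eager execution is strict: the failing gather runs and raises an error (on CPU).
  gather_unused(tf.constant([0.0]))
except tf.errors.InvalidArgumentError as e:
  print('Eager:', type(e).__name__)

# Inside `tf.function` the unused op is pruned from the graph, so no error is raised.
print('Graph:', tf.function(gather_unused)(tf.constant([0.0])))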

translator = ExportTranslator(translator)

Since the model decodes the predictions using tf.argmax, the predictions are deterministic. The original model and one reloaded from its SavedModel should give identical predictions:

translator("este é o primeiro livro que eu fiz.").numpy()
b'this is the first book i did .'
tf.saved_model.save(translator, export_dir='translator')
2021-08-11 18:23:29.706465: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
WARNING:absl:Found untraced functions such as embedding_4_layer_call_fn, embedding_4_layer_call_and_return_conditional_losses, dropout_37_layer_call_fn, dropout_37_layer_call_and_return_conditional_losses, embedding_5_layer_call_fn while saving (showing 5 of 560). These functions will not be directly callable after loading.
reloaded = tf.saved_model.load('translator')
reloaded("este é o primeiro livro que eu fiz.").numpy()
b'this is the first book i did .'
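
A quick way to confirm this is to compare the two outputs directly (a small sketch, reusing the translator and reloaded objects defined above):

# Greedy decoding with tf.argmax is deterministic, so the wrapped model and the
# reloaded SavedModel should agree byte-for-byte on the same input sentence.
original = translator("este é o primeiro livro que eu fiz.").numpy()
restored = reloaded("este é o primeiro livro que eu fiz.").numpy()
assert original == restored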

Summary

In this tutorial, you learned about positional encoding, multi-head attention, the importance of masking, and how to put together a Transformer.

Try using a different dataset to train the Transformer. You can also create the base Transformer or Transformer XL by changing the hyperparameters used above (a sketch of the base configuration follows below). You can also use the layers defined here to create BERT and train state-of-the-art models. Furthermore, you can implement beam search to get better predictions.
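
For reference, here is a minimal sketch of the "base" hyperparameters from the original "Attention Is All You Need" paper; the variable names are illustrative rather than this notebook's exact configuration.

# "Base" Transformer configuration from Vaswani et al. (2017).
num_layers = 6      # encoder and decoder layers
d_model = 512       # embedding / model width
dff = 2048          # inner dimension of the point-wise feed-forward network
num_heads = 8       # attention heads
dropout_rate = 0.1  # dropout applied to embeddings and sub-layer outputs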