Ayuda a proteger la Gran Barrera de Coral con TensorFlow en Kaggle Únete Challenge

Preprocesar datos con TensorFlow Transform

El componente de ingeniería de funciones de TensorFlow Extended (TFX)

Este cuaderno ejemplo colab ofrece un ejemplo muy sencillo de cómo TensorFlow Transform ( tf.Transform ) se puede utilizar para preprocesar los datos usando exactamente el mismo código tanto para la formación de un modelo y servir inferencias en la producción.

TensorFlow Transform es una biblioteca para preprocesar datos de entrada para TensorFlow, incluida la creación de funciones que requieren un pase completo sobre el conjunto de datos de entrenamiento. Por ejemplo, con TensorFlow Transform podrías:

  • Normalizar un valor de entrada utilizando la media y la desviación estándar
  • Convierta cadenas en números enteros generando un vocabulario sobre todos los valores de entrada
  • Convierta flotantes en números enteros asignándolos a depósitos, según la distribución de datos observada

TensorFlow tiene soporte integrado para manipulaciones en un solo ejemplo o un lote de ejemplos. tf.Transform extiende estas capacidades para apoyar pases completos sobre todo el conjunto de datos de entrenamiento.

La salida de tf.Transform se exporta como un gráfico TensorFlow que se puede utilizar tanto para la formación y servir. Usar el mismo gráfico tanto para el entrenamiento como para el servicio puede evitar el sesgo, ya que se aplican las mismas transformaciones en ambas etapas.

Actualizar Pip

Para evitar actualizar Pip en un sistema cuando se ejecuta localmente, verifique que estemos ejecutando en Colab. Por supuesto, los sistemas locales se pueden actualizar por separado.

try:
  import colab
  !pip install --upgrade pip
except:
  pass

Instalar TensorFlow Transform

pip install -q -U tensorflow_transform==0.24.1

¿Reinició el tiempo de ejecución?

Si está utilizando Google Colab, la primera vez que ejecuta la celda anterior, debe reiniciar el tiempo de ejecución (Tiempo de ejecución> Reiniciar tiempo de ejecución ...). Esto se debe a la forma en que Colab carga los paquetes.

Importaciones

import pprint
import tempfile

import tensorflow as tf
import tensorflow_transform as tft

import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import schema_utils
2021-12-04 10:36:49.432439: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory

Datos: crea algunos datos ficticios

Crearemos algunos datos ficticios simples para nuestro ejemplo simple:

  • raw_data son los datos en bruto inicial que vamos a preproceso
  • raw_data_metadata contiene el esquema que nos indica el tipo de cada una de las columnas en raw_data . En este caso, es muy sencillo.
raw_data = [
      {'x': 1, 'y': 1, 's': 'hello'},
      {'x': 2, 'y': 2, 's': 'world'},
      {'x': 3, 'y': 3, 's': 'hello'}
  ]

raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'y': tf.io.FixedLenFeature([], tf.float32),
        'x': tf.io.FixedLenFeature([], tf.float32),
        's': tf.io.FixedLenFeature([], tf.string),
    }))

Transformar: crear una función de preprocesamiento

La función de pre-procesamiento es el concepto más importante de tf.Transform. Una función de preprocesamiento es donde realmente ocurre la transformación del conjunto de datos. Se acepta y devuelve un diccionario de los tensores, donde significa un tensor de un Tensor oSparseTensor . Hay dos grupos principales de llamadas a API que normalmente forman el corazón de una función de preprocesamiento:

  1. TensorFlow Operaciones: Cualquier función que acepta y vuelve tensores, lo que normalmente significa ops TensorFlow. Estos agregan operaciones de TensorFlow al gráfico que transforma los datos sin procesar en datos transformados, un vector de características a la vez. Estos se ejecutarán para cada ejemplo, tanto durante el entrenamiento como durante el servicio.
  2. Tensorflow Transformar Analizadores / asignadores: Cualquiera de los analizadores / creadores de mapas proporcionados por tf.Transform. Estos también aceptan y devuelven tensores y, por lo general, contienen una combinación de operaciones de Tensorflow y cálculo de Beam, pero a diferencia de las operaciones de TensorFlow, solo se ejecutan en la canalización de Beam durante el análisis, lo que requiere un pase completo sobre todo el conjunto de datos de entrenamiento. El cálculo de Beam se ejecuta solo una vez, durante el entrenamiento, y normalmente realiza una pasada completa sobre todo el conjunto de datos de entrenamiento. Crean constantes tensoriales, que se agregan a su gráfico. Por ejemplo, tft.min calcula el mínimo de un tensor sobre el conjunto de datos de entrenamiento, mientras que tft.scale_by_min_max primero calcula el mínimo y el máximo de un tensor sobre el conjunto de datos de entrenamiento y luego escala el tensor para que esté dentro de un rango especificado por el usuario, [output_min, output_max]. tf.Transform proporciona un conjunto fijo de estos analizadores / mapeadores, pero esto se ampliará en versiones futuras.
def preprocessing_fn(inputs):
    """Preprocess input columns into transformed columns."""
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_integerized = tft.compute_and_apply_vocabulary(s)
    x_centered_times_y_normalized = (x_centered * y_normalized)
    return {
        'x_centered': x_centered,
        'y_normalized': y_normalized,
        's_integerized': s_integerized,
        'x_centered_times_y_normalized': x_centered_times_y_normalized,
    }

Poniendolo todo junto

Ahora estamos listos para transformar nuestros datos. Usaremos Apache Beam con un ejecutor directo y proporcionaremos tres entradas:

  1. raw_data - Los datos de entrada sin procesar que hemos creado anteriormente
  2. raw_data_metadata - El esquema de los datos en bruto
  3. preprocessing_fn - La función que hemos creado para hacer nuestra transformación
def main():
  # Ignore the warnings
  with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (  # pylint: disable=unused-variable
        (raw_data, raw_data_metadata) | tft_beam.AnalyzeAndTransformDataset(
            preprocessing_fn))

  transformed_data, transformed_metadata = transformed_dataset  # pylint: disable=unused-variable

  print('\nRaw data:\n{}\n'.format(pprint.pformat(raw_data)))
  print('Transformed data:\n{}'.format(pprint.pformat(transformed_data)))

if __name__ == '__main__':
  main()
WARNING:tensorflow:Tensorflow version (2.3.4) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended.
WARNING:apache_beam.runners.interactive.interactive_environment:Dependencies required for Interactive Beam PCollection visualization are not available, please use: `pip install apache-beam[interactive]` to install necessary dependencies to enable all data visualization features.
WARNING:tensorflow:Tensorflow version (2.3.4) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended.
WARNING:tensorflow:Tensorflow version (2.3.4) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended.
WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).
WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow_transform/tf_utils.py:218: Tensor.experimental_ref (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use ref() instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow_transform/tf_utils.py:218: Tensor.experimental_ref (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use ref() instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow/python/saved_model/signature_def_utils_impl.py:201: build_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.
2021-12-04 10:36:53.509410: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-12-04 10:36:53.509550: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:53.511132: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:53.511207: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:55.915819: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow/python/saved_model/signature_def_utils_impl.py:201: build_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:No assets to write.
WARNING:tensorflow:Issue encountered when serializing tft_analyzer_use.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'Counter' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing tft_analyzer_use.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'Counter' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing tft_mapper_use.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'Counter' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing tft_mapper_use.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'Counter' object has no attribute 'name'
INFO:tensorflow:SavedModel written to: /tmp/tmpkna9acoi/tftransform_tmp/a5ed953ca79c40928ed395f17a62dd60/saved_model.pb
INFO:tensorflow:SavedModel written to: /tmp/tmpkna9acoi/tftransform_tmp/a5ed953ca79c40928ed395f17a62dd60/saved_model.pb
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:No assets to write.
WARNING:tensorflow:Issue encountered when serializing tft_analyzer_use.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'Counter' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing tft_analyzer_use.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'Counter' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing tft_mapper_use.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'Counter' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing tft_mapper_use.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'Counter' object has no attribute 'name'
INFO:tensorflow:SavedModel written to: /tmp/tmpkna9acoi/tftransform_tmp/8cf4760564cd4ca883a37862ebd6adf3/saved_model.pb
INFO:tensorflow:SavedModel written to: /tmp/tmpkna9acoi/tftransform_tmp/8cf4760564cd4ca883a37862ebd6adf3/saved_model.pb
WARNING:tensorflow:Tensorflow version (2.3.4) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended.
WARNING:tensorflow:Tensorflow version (2.3.4) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended.
WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).
WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['/tmpfs/src/tf_docs_env/lib/python3.7/site-packages/ipykernel_launcher.py', '-f', '/tmp/tmp1e0i3s6r.json', '--HistoryManager.hist_file=:memory:']
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
2021-12-04 10:36:57.429894: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-12-04 10:36:57.430028: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:57.430105: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:57.430179: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:57.430197: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:Assets written to: /tmp/tmpkna9acoi/tftransform_tmp/8347ab3739e64ab4ab4ef1ec0e3f5b3c/assets
INFO:tensorflow:Assets written to: /tmp/tmpkna9acoi/tftransform_tmp/8347ab3739e64ab4ab4ef1ec0e3f5b3c/assets
INFO:tensorflow:SavedModel written to: /tmp/tmpkna9acoi/tftransform_tmp/8347ab3739e64ab4ab4ef1ec0e3f5b3c/saved_model.pb
INFO:tensorflow:SavedModel written to: /tmp/tmpkna9acoi/tftransform_tmp/8347ab3739e64ab4ab4ef1ec0e3f5b3c/saved_model.pb
WARNING:tensorflow:Expected binary or unicode string, got type_url: "type.googleapis.com/tensorflow.AssetFileDef"
value: "\n\013\n\tConst_3:0\022-vocab_compute_and_apply_vocabulary_vocabulary"
WARNING:tensorflow:Expected binary or unicode string, got type_url: "type.googleapis.com/tensorflow.AssetFileDef"
value: "\n\013\n\tConst_3:0\022-vocab_compute_and_apply_vocabulary_vocabulary"
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
WARNING:tensorflow:Expected binary or unicode string, got type_url: "type.googleapis.com/tensorflow.AssetFileDef"
value: "\n\013\n\tConst_3:0\022-vocab_compute_and_apply_vocabulary_vocabulary"
2021-12-04 10:36:58.058030: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-12-04 10:36:58.058120: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:58.058198: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:58.058256: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:58.058291: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
WARNING:tensorflow:Expected binary or unicode string, got type_url: "type.googleapis.com/tensorflow.AssetFileDef"
value: "\n\013\n\tConst_3:0\022-vocab_compute_and_apply_vocabulary_vocabulary"
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
Raw data:
[{'s': 'hello', 'x': 1, 'y': 1},
 {'s': 'world', 'x': 2, 'y': 2},
 {'s': 'hello', 'x': 3, 'y': 3}]

Transformed data:
[{'s_integerized': 0,
  'x_centered': -1.0,
  'x_centered_times_y_normalized': -0.0,
  'y_normalized': 0.0},
 {'s_integerized': 1,
  'x_centered': 0.0,
  'x_centered_times_y_normalized': 0.0,
  'y_normalized': 0.5},
 {'s_integerized': 0,
  'x_centered': 1.0,
  'x_centered_times_y_normalized': 1.0,
  'y_normalized': 1.0}]

¿Es esta la respuesta correcta?

Anteriormente, hemos utilizado tf.Transform para hacer esto:

x_centered = x - tft.mean(x)
y_normalized = tft.scale_to_0_1(y)
s_integerized = tft.compute_and_apply_vocabulary(s)
x_centered_times_y_normalized = (x_centered * y_normalized)

x_centrado

Con la entrada de [1, 2, 3] la media de x es 2, y restarlo de x para centrar nuestros valores de x en 0. Así que nuestro resultado de [-1.0, 0.0, 1.0] es correcta.

y_normalizado

Quisimos escalar nuestros valores de y entre 0 y 1. Nuestro entrada fue [1, 2, 3] por lo que nuestro resultado de [0.0, 0.5, 1.0] es correcto.

s_integerized

Queríamos mapear nuestras cadenas a índices en un vocabulario, y solo había 2 palabras en nuestro vocabulario ("hola" y "mundo"). Así que con la entrada de ["hello", "world", "hello"] nuestro resultado de [0, 1, 0] es la correcta. Dado que "hola" aparece con mayor frecuencia en estos datos, será la primera entrada en el vocabulario.

x_centered_times_y_normalized

Quisimos crear una nueva característica mediante el cruce de x_centered y y_normalized mediante la multiplicación. Tenga en cuenta que esto multiplica los resultados, no los valores originales, y nuestro nuevo resultado de [-0.0, 0.0, 1.0] es la correcta.