
Preprocess data with TensorFlow Transform

The feature engineering component of TensorFlow Extended (TFX)

This example Colab notebook provides a very simple example of how TensorFlow Transform (tf.Transform) can be used to preprocess data using exactly the same code for both training a model and serving inferences in production.

TensorFlow Transform is a library for preprocessing input data for TensorFlow, including creating features that require a full pass over the training dataset. For example, using TensorFlow Transform you can:

  • Normalize an input value by using the mean and standard deviation
  • Convert strings to integers by generating a vocabulary over all of the input values
  • Convert floats to integers by assigning them to buckets, based on the observed data distribution
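
To make these three transformations concrete, here is a minimal sketch in plain Python. It does not use the tf.Transform API; the columns `ages` and `colors` are hypothetical toy data, and the single median boundary is just one simple choice of bucket boundary. The point is that each step first needs a statistic computed over the whole column.

```python
import bisect
import statistics
from collections import Counter

# Hypothetical toy columns (not part of this tutorial's dataset).
ages = [20.0, 30.0, 40.0, 50.0]
colors = ["red", "blue", "red", "green", "red", "blue"]

# 1. Normalize using the mean and (population) standard deviation.
mean = statistics.mean(ages)
std = statistics.pstdev(ages)
z_scores = [(a - mean) / std for a in ages]

# 2. Build a vocabulary over all values (most frequent first),
#    then map each string to its integer index.
vocab = {tok: i for i, (tok, _) in enumerate(Counter(colors).most_common())}
color_ids = [vocab[c] for c in colors]

# 3. Assign floats to buckets using a boundary observed in the data
#    (here a single boundary at the median, giving two buckets).
boundaries = [statistics.median(ages)]
bucket_ids = [bisect.bisect_right(boundaries, a) for a in ages]

print(z_scores)
print(color_ids)
print(bucket_ids)
```

None of these values can be computed from a single example in isolation, which is why a full pass over the dataset is needed.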

TensorFlow has built-in support for manipulations on a single example or a batch of examples. tf.Transform extends these capabilities to support full passes over the entire training dataset.

The output of tf.Transform is exported as a TensorFlow graph that you can use for both training and serving. Using the same graph for both training and serving can prevent skew, since the same transformations are applied in both stages.
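
The following plain-Python sketch illustrates the kind of skew this avoids. It is only an analogy for the exported graph: the helper names `analyze` and `transform` and the serving values are hypothetical, not part of tf.Transform.

```python
def analyze(training_values):
    """Full pass over the training data; returns the constant to freeze."""
    return sum(training_values) / len(training_values)


def transform(value, frozen_mean):
    """Per-example transformation that reuses the frozen constant."""
    return value - frozen_mean


training_values = [1.0, 2.0, 3.0]
frozen_mean = analyze(training_values)  # computed once, at training time

# Training and serving apply the same function with the same constant.
train_features = [transform(v, frozen_mean) for v in training_values]
serving_feature = transform(2.5, frozen_mean)

# Skewed alternative: recomputing the statistic from a serving batch
# gives a different feature value for the very same input.
serving_batch = [2.5, 4.5]
skewed_feature = transform(2.5, analyze(serving_batch))

print(serving_feature, skewed_feature)
```

Freezing the statistic at training time and shipping it together with the transformation is what keeps the two stages consistent.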

Upgrade Pip

To avoid upgrading Pip in a system when running locally, check to make sure that we're running in Colab. Local systems can of course be upgraded separately.

try:
  import colab
  !pip install --upgrade pip
except:
  pass

Install TensorFlow Transform

pip install -q -U tensorflow_transform==0.24.1

Did you restart the runtime?

If you are using Google Colab, the first time that you run the cell above, you must restart the runtime (Runtime > Restart runtime...). This is because of the way that Colab loads packages.

Imports

import pprint
import tempfile

import tensorflow as tf
import tensorflow_transform as tft

import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import schema_utils

Data: Create some dummy data

We'll create some simple dummy data for our simple example:

  • raw_data is the initial raw data that we're going to preprocess
  • raw_data_metadata contains the schema that tells us the types of each of the columns in raw_data. In this case, it's very simple.

raw_data = [
    {'x': 1, 'y': 1, 's': 'hello'},
    {'x': 2, 'y': 2, 's': 'world'},
    {'x': 3, 'y': 3, 's': 'hello'},
]

raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'y': tf.io.FixedLenFeature([], tf.float32),
        'x': tf.io.FixedLenFeature([], tf.float32),
        's': tf.io.FixedLenFeature([], tf.string),
    }))

Transform: Create a preprocessing function

The preprocessing function is the most important concept of tf.Transform. A preprocessing function is where the transformation of the dataset really happens. It accepts and returns a dictionary of tensors, where a tensor means a Tensor or SparseTensor. There are two main groups of API calls that typically form the heart of a preprocessing function:

  1. TensorFlow Ops: Any function that accepts and returns tensors, which usually means TensorFlow ops. These add TensorFlow operations to the graph that transform raw data into transformed data, one feature vector at a time. They run for every example, during both training and serving.
  2. TensorFlow Transform Analyzers/Mappers: Any of the analyzers/mappers provided by tf.Transform. These also accept and return tensors, and typically contain a combination of TensorFlow ops and Beam computation. Unlike TensorFlow ops, the Beam computation only runs in the Beam pipeline during analysis: it runs once, during training, and typically makes a full pass over the entire training dataset. The analyzers create tensor constants, which are added to your graph. For example, tft.min computes the minimum of a tensor over the training dataset, while tft.scale_by_min_max first computes the min and max of a tensor over the training dataset and then scales the tensor to be within a user-specified range, [output_min, output_max]. tf.Transform provides a fixed set of such analyzers/mappers, but this will be extended in future versions.
def preprocessing_fn(inputs):
    """Preprocess input columns into transformed columns."""
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_integerized = tft.compute_and_apply_vocabulary(s)
    x_centered_times_y_normalized = (x_centered * y_normalized)
    return {
        'x_centered': x_centered,
        'y_normalized': y_normalized,
        's_integerized': s_integerized,
        'x_centered_times_y_normalized': x_centered_times_y_normalized,
    }
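
To illustrate the analyzer/mapper split described above, here is a minimal plain-Python sketch of min-max scaling to a user-specified range. It is not the tf.Transform implementation: the name `fit_min_max_scaler` is hypothetical, and raising an error for a constant column is a choice made for this sketch only. The analysis step makes one full pass to find the min and max; the returned mapper then works one value at a time using those frozen constants.

```python
def fit_min_max_scaler(values, output_min=0.0, output_max=1.0):
    """Analysis phase: one full pass to find min and max, frozen as constants."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for this sketch only: reject constant columns.
        raise ValueError("cannot scale a constant column")

    def mapper(value):
        """Per-example phase: uses only the frozen constants lo and hi."""
        unit = (value - lo) / (hi - lo)  # position within [0, 1]
        return unit * (output_max - output_min) + output_min

    return mapper


scale = fit_min_max_scaler([1.0, 2.0, 3.0], output_min=-1.0, output_max=1.0)
scaled = [scale(v) for v in [1.0, 2.0, 3.0]]
print(scaled)

# A value outside the range seen during analysis falls outside the output range.
print(scale(4.0))
```

With output_min=0.0 and output_max=1.0 this reduces to the scaling that preprocessing_fn applies to y.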

Putting it all together

Now we're ready to transform our data. We'll use Apache Beam with a direct runner, and supply three inputs:

  1. raw_data - The raw input data that we created above
  2. raw_data_metadata - The schema for the raw data
  3. preprocessing_fn - The function that we created to do our transformation

def main():
  # Ignore the warnings
  with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (  # pylint: disable=unused-variable
        (raw_data, raw_data_metadata) | tft_beam.AnalyzeAndTransformDataset(
            preprocessing_fn))

  transformed_data, transformed_metadata = transformed_dataset  # pylint: disable=unused-variable

  print('\nRaw data:\n{}\n'.format(pprint.pformat(raw_data)))
  print('Transformed data:\n{}'.format(pprint.pformat(transformed_data)))

if __name__ == '__main__':
  main()
Raw data:
[{'s': 'hello', 'x': 1, 'y': 1},
 {'s': 'world', 'x': 2, 'y': 2},
 {'s': 'hello', 'x': 3, 'y': 3}]

Transformed data:
[{'s_integerized': 0,
  'x_centered': -1.0,
  'x_centered_times_y_normalized': -0.0,
  'y_normalized': 0.0},
 {'s_integerized': 1,
  'x_centered': 0.0,
  'x_centered_times_y_normalized': 0.0,
  'y_normalized': 0.5},
 {'s_integerized': 0,
  'x_centered': 1.0,
  'x_centered_times_y_normalized': 1.0,
  'y_normalized': 1.0}]

Is this the right answer?

Previously, we used tf.Transform to do this:

x_centered = x - tft.mean(x)
y_normalized = tft.scale_to_0_1(y)
s_integerized = tft.compute_and_apply_vocabulary(s)
x_centered_times_y_normalized = (x_centered * y_normalized)

x_centered

With input of [1, 2, 3] the mean of x is 2, and we subtract it from x to center our x values at 0. So our result of [-1.0, 0.0, 1.0] is correct.

y_normalized

We wanted to scale our y values between 0 and 1. Our input was [1, 2, 3], so our result of [0.0, 0.5, 1.0] is correct.

s_integerized

We wanted to map our strings to indexes in a vocabulary, and there were only 2 words in our vocabulary ("hello" and "world"). So with input of ["hello", "world", "hello"] our result of [0, 1, 0] is correct. Since "hello" occurs most frequently in this data, it will be the first entry in the vocabulary.

x_centered_times_y_normalized

We wanted to create a new feature by crossing x_centered and y_normalized using multiplication. Note that this multiplies the results, not the original values, and our new result of [-0.0, 0.0, 1.0] is correct.
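
As a cross-check, the same four features can be recomputed by hand in plain Python, without tf.Transform. This is only a sketch for verification: the variable names are ours, and the vocabulary is simply ordered by descending frequency, as described above.

```python
from collections import Counter

raw_data = [
    {'x': 1, 'y': 1, 's': 'hello'},
    {'x': 2, 'y': 2, 's': 'world'},
    {'x': 3, 'y': 3, 's': 'hello'},
]

xs = [float(row['x']) for row in raw_data]
ys = [float(row['y']) for row in raw_data]
ss = [row['s'] for row in raw_data]

# Full-pass statistics (the role played by the analyzers).
x_mean = sum(xs) / len(xs)
y_min, y_max = min(ys), max(ys)
vocab = {tok: i for i, (tok, _) in enumerate(Counter(ss).most_common())}

# Per-example transformations (the role played by the TensorFlow ops).
expected = []
for x, y, s in zip(xs, ys, ss):
    x_centered = x - x_mean
    y_normalized = (y - y_min) / (y_max - y_min)
    expected.append({
        'x_centered': x_centered,
        'y_normalized': y_normalized,
        's_integerized': vocab[s],
        'x_centered_times_y_normalized': x_centered * y_normalized,
    })

for row in expected:
    print(row)
```

The first cross value is -1.0 * 0.0, which floating-point arithmetic represents as -0.0; it compares equal to 0.0, which is why the transformed data shows -0.0 for the first row.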