ヘルプKaggleにTensorFlowグレートバリアリーフを保護チャレンジに参加

TensorFlowTransformを使用してデータを前処理します

TensorFlow Extended(TFX)の特徴エンジニアリングコンポーネント

この例コラボノートはどのように非常に簡単な例を提供TensorFlowは、(変換tf.Transformモデルを訓練し、生産に推論をサービングの両方のための正確に同じコードを使用して、前処理データを用いることができます。

TensorFlow Transformは、トレーニングデータセットのフルパスを必要とする機能の作成など、TensorFlowの入力データを前処理するためのライブラリです。たとえば、TensorFlow Transformを使用すると、次のことができます。

  • 平均と標準偏差を使用して入力値を正規化します
  • すべての入力値に対して語彙を生成することにより、文字列を整数に変換します
  • 観測されたデータ分布に基づいて、フロートをバケットに割り当てることにより、フロートを整数に変換します

TensorFlowには、単一のサンプルまたはサンプルのバッチに対する操作のサポートが組み込まれています。 tf.Transform全体のトレーニングデータセットを完全にパスをサポートするために、これらの機能を拡張します。

出力tf.Transformあなたがトレーニングとサービス提供の両方に使用できるTensorFlowグラフとしてエクスポートされます。トレーニングとサービングの両方に同じグラフを使用すると、両方の段階で同じ変換が適用されるため、スキューを防ぐことができます。

アップグレードピップ

ローカルで実行しているときにシステムでPipをアップグレードしないようにするには、Colabで実行していることを確認してください。もちろん、ローカルシステムは個別にアップグレードできます。

try:
  import colab
  !pip install --upgrade pip
except:
  pass

TensorFlowTransformをインストールします

pip install -q -U tensorflow_transform==0.24.1

ランタイムを再起動しましたか?

上記のセルを初めて実行するときにGoogleColabを使用している場合は、ランタイムを再起動する必要があります([ランタイム]> [ランタイムの再起動...])。これは、Colabがパッケージをロードする方法によるものです。

輸入

import pprint
import tempfile

import tensorflow as tf
import tensorflow_transform as tft

import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import schema_utils
2021-12-04 10:36:49.432439: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory

データ:ダミーデータを作成します

簡単な例として、いくつかの簡単なダミーデータを作成します。

  • raw_data私たちが前処理しようとしていることを最初の生データであり、
  • raw_data_metadata私たちの列の各種類告げるスキーマが含まraw_data 。この場合、それは非常に簡単です。
raw_data = [
      {'x': 1, 'y': 1, 's': 'hello'},
      {'x': 2, 'y': 2, 's': 'world'},
      {'x': 3, 'y': 3, 's': 'hello'}
  ]

raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'y': tf.io.FixedLenFeature([], tf.float32),
        'x': tf.io.FixedLenFeature([], tf.float32),
        's': tf.io.FixedLenFeature([], tf.string),
    }))

変換:前処理関数を作成します

前処理機能はtf.Transformの最も重要な概念です。前処理関数は、データセットの変換が実際に行われる場所です。それは受け入れ、テンソルは意味テンソルの辞書、返しTensorまたはSparseTensor 。通常、前処理関数の中心を形成するAPI呼び出しには2つの主要なグループがあります。

  1. TensorFlowオプス:テンソルを受け入れ、返す任意の関数で、通常TensorFlowオプスを意味しています。これらは、生データを一度に1つの特徴ベクトルで変換されたデータに変換するグラフにTensorFlow操作を追加します。これらは、トレーニングとサービングの両方で、すべての例で実行されます。
  2. アナライザ/マッパーを変換Tensorflow:tf.Transformが提供する解析器/マッパーのいずれか。これらはテンソルも受け入れて返し、通常はTensorflow opsとBeam計算の組み合わせを含みますが、TensorFlow opsとは異なり、分析中にBeamパイプラインでのみ実行され、トレーニングデータセット全体を完全に通過する必要があります。ビーム計算は、トレーニング中に1回だけ実行され、通常、トレーニングデータセット全体を完全に通過します。それらはテンソル定数を作成し、それがグラフに追加されます。たとえば、tft.minはトレーニングデータセットのテンソルの最小値を計算し、tft.scale_by_min_maxは最初にトレーニングデータセットのテンソルの最小値と最大値を計算してから、ユーザー指定の範囲[output_min、 output_max]。 tf.Transformは、そのようなアナライザー/マッパーの固定セットを提供しますが、これは将来のバージョンで拡張される予定です。
def preprocessing_fn(inputs):
    """Preprocess input columns into transformed columns."""
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_integerized = tft.compute_and_apply_vocabulary(s)
    x_centered_times_y_normalized = (x_centered * y_normalized)
    return {
        'x_centered': x_centered,
        'y_normalized': y_normalized,
        's_integerized': s_integerized,
        'x_centered_times_y_normalized': x_centered_times_y_normalized,
    }

すべてを一緒に入れて

これで、データを変換する準備が整いました。ダイレクトランナーでApacheBeamを使用し、次の3つの入力を提供します。

  1. raw_data -私たちは上記で作成したことを生の入力データを
  2. raw_data_metadata -生データのスキーマ
  3. preprocessing_fn -私たちは私たちの変換を行うために作成した機能
def main():
  # Ignore the warnings
  with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (  # pylint: disable=unused-variable
        (raw_data, raw_data_metadata) | tft_beam.AnalyzeAndTransformDataset(
            preprocessing_fn))

  transformed_data, transformed_metadata = transformed_dataset  # pylint: disable=unused-variable

  print('\nRaw data:\n{}\n'.format(pprint.pformat(raw_data)))
  print('Transformed data:\n{}'.format(pprint.pformat(transformed_data)))

if __name__ == '__main__':
  main()
WARNING:tensorflow:Tensorflow version (2.3.4) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended.
WARNING:apache_beam.runners.interactive.interactive_environment:Dependencies required for Interactive Beam PCollection visualization are not available, please use: `pip install apache-beam[interactive]` to install necessary dependencies to enable all data visualization features.
WARNING:tensorflow:Tensorflow version (2.3.4) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended.
WARNING:tensorflow:Tensorflow version (2.3.4) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended.
WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).
WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow_transform/tf_utils.py:218: Tensor.experimental_ref (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use ref() instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow_transform/tf_utils.py:218: Tensor.experimental_ref (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use ref() instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow/python/saved_model/signature_def_utils_impl.py:201: build_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.
2021-12-04 10:36:53.509410: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-12-04 10:36:53.509550: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:53.511132: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:53.511207: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:55.915819: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow/python/saved_model/signature_def_utils_impl.py:201: build_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:No assets to write.
WARNING:tensorflow:Issue encountered when serializing tft_analyzer_use.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'Counter' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing tft_analyzer_use.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'Counter' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing tft_mapper_use.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'Counter' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing tft_mapper_use.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'Counter' object has no attribute 'name'
INFO:tensorflow:SavedModel written to: /tmp/tmpkna9acoi/tftransform_tmp/a5ed953ca79c40928ed395f17a62dd60/saved_model.pb
INFO:tensorflow:SavedModel written to: /tmp/tmpkna9acoi/tftransform_tmp/a5ed953ca79c40928ed395f17a62dd60/saved_model.pb
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:No assets to write.
WARNING:tensorflow:Issue encountered when serializing tft_analyzer_use.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'Counter' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing tft_analyzer_use.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'Counter' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing tft_mapper_use.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'Counter' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing tft_mapper_use.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'Counter' object has no attribute 'name'
INFO:tensorflow:SavedModel written to: /tmp/tmpkna9acoi/tftransform_tmp/8cf4760564cd4ca883a37862ebd6adf3/saved_model.pb
INFO:tensorflow:SavedModel written to: /tmp/tmpkna9acoi/tftransform_tmp/8cf4760564cd4ca883a37862ebd6adf3/saved_model.pb
WARNING:tensorflow:Tensorflow version (2.3.4) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended.
WARNING:tensorflow:Tensorflow version (2.3.4) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended.
WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).
WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['/tmpfs/src/tf_docs_env/lib/python3.7/site-packages/ipykernel_launcher.py', '-f', '/tmp/tmp1e0i3s6r.json', '--HistoryManager.hist_file=:memory:']
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
2021-12-04 10:36:57.429894: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-12-04 10:36:57.430028: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:57.430105: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:57.430179: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:57.430197: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:Assets written to: /tmp/tmpkna9acoi/tftransform_tmp/8347ab3739e64ab4ab4ef1ec0e3f5b3c/assets
INFO:tensorflow:Assets written to: /tmp/tmpkna9acoi/tftransform_tmp/8347ab3739e64ab4ab4ef1ec0e3f5b3c/assets
INFO:tensorflow:SavedModel written to: /tmp/tmpkna9acoi/tftransform_tmp/8347ab3739e64ab4ab4ef1ec0e3f5b3c/saved_model.pb
INFO:tensorflow:SavedModel written to: /tmp/tmpkna9acoi/tftransform_tmp/8347ab3739e64ab4ab4ef1ec0e3f5b3c/saved_model.pb
WARNING:tensorflow:Expected binary or unicode string, got type_url: "type.googleapis.com/tensorflow.AssetFileDef"
value: "\n\013\n\tConst_3:0\022-vocab_compute_and_apply_vocabulary_vocabulary"
WARNING:tensorflow:Expected binary or unicode string, got type_url: "type.googleapis.com/tensorflow.AssetFileDef"
value: "\n\013\n\tConst_3:0\022-vocab_compute_and_apply_vocabulary_vocabulary"
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
WARNING:tensorflow:Expected binary or unicode string, got type_url: "type.googleapis.com/tensorflow.AssetFileDef"
value: "\n\013\n\tConst_3:0\022-vocab_compute_and_apply_vocabulary_vocabulary"
2021-12-04 10:36:58.058030: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-12-04 10:36:58.058120: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:58.058198: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:58.058256: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory
2021-12-04 10:36:58.058291: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
WARNING:tensorflow:Expected binary or unicode string, got type_url: "type.googleapis.com/tensorflow.AssetFileDef"
value: "\n\013\n\tConst_3:0\022-vocab_compute_and_apply_vocabulary_vocabulary"
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
Raw data:
[{'s': 'hello', 'x': 1, 'y': 1},
 {'s': 'world', 'x': 2, 'y': 2},
 {'s': 'hello', 'x': 3, 'y': 3}]

Transformed data:
[{'s_integerized': 0,
  'x_centered': -1.0,
  'x_centered_times_y_normalized': -0.0,
  'y_normalized': 0.0},
 {'s_integerized': 1,
  'x_centered': 0.0,
  'x_centered_times_y_normalized': 0.0,
  'y_normalized': 0.5},
 {'s_integerized': 0,
  'x_centered': 1.0,
  'x_centered_times_y_normalized': 1.0,
  'y_normalized': 1.0}]

これは正しい答えですか?

以前、我々は、使用tf.Transformこれを行うには:

x_centered = x - tft.mean(x)
y_normalized = tft.scale_to_0_1(y)
s_integerized = tft.compute_and_apply_vocabulary(s)
x_centered_times_y_normalized = (x_centered * y_normalized)

x_centered

入力と[1, 2, 3] 、xの平均値は2であり、我々の結果ので、0に我々のxの値を中心にXからそれを差し引く[-1.0, 0.0, 1.0]正しいです。

y_normalized

我々があった0と1私たちの入力との間に我々のyの値をスケーリングしたかった[1, 2, 3]の我々の結果に[0.0, 0.5, 1.0]正しいです。

s_integerized

文字列を語彙のインデックスにマッピングしたかったのですが、語彙には2つの単語(「hello」と「world」)しかありませんでした。だから、の入力して["hello", "world", "hello"]の私達の結果[0, 1, 0]正しいです。このデータでは「hello」が最も頻繁に発生するため、語彙の最初のエントリになります。

x_centered_times_y_normalized

私たちは、交配することによって新しい機能を作成したいx_centeredy_normalized乗算を使用して。この乗算結果ではなく、元の値、および当社の新しい結果という注意[-0.0, 0.0, 1.0]正しいです。