
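Preprocess data with TensorFlow Transform

The Feature Engineering Component of TensorFlow Extended (TFX)

This example colab notebook provides a very simple example of how TensorFlow Transform (tf.Transform) can be used to preprocess data using exactly the same code for both training a model and serving inferences in production.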

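TensorFlow Transform is a library for preprocessing input data for TensorFlow, including creating features that require a full pass over the training dataset. For example, using TensorFlow Transform you could:

  • Normalize an input value by using the mean and standard deviation
  • Convert strings to integers by generating a vocabulary over all of the input values
  • Convert floats to integers by assigning them to buckets, based on the observed data distribution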
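
To make the first bullet concrete, here is a minimal plain-Python sketch of normalization by mean and standard deviation. It is not tf.Transform code: the helper names analyze and z_score, the sample column, and the choice of the population standard deviation are all assumptions made for illustration.

```python
import statistics


def analyze(values):
    """Full pass over the training column: collect the statistics once."""
    return {
        "mean": statistics.mean(values),
        # Population standard deviation; this choice is an assumption of the sketch.
        "std": statistics.pstdev(values),
    }


def z_score(value, stats):
    """Per-example step: uses only the constants computed by analyze()."""
    return (value - stats["mean"]) / stats["std"]


train_column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
stats = analyze(train_column)
normalized = [z_score(v, stats) for v in train_column]
print(stats)
print(normalized)
```

The point of the split is that analyze needs to see every value, while z_score can run on one example at a time once the statistics are known.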

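TensorFlow has built-in support for manipulations on a single example or a batch of examples. tf.Transform extends these capabilities to support full passes over the entire training dataset.

The output of tf.Transform is exported as a TensorFlow graph which you can use for both training and serving. Using the same graph for both training and serving can prevent skew, since the same transformations are applied in both stages.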
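
The skew argument can be illustrated without TensorFlow. The sketch below is a plain-Python analogy, not the exported graph: the names train_mean and center and the sample values are hypothetical. It contrasts reusing the training-time constant at serving time with recomputing the statistic from the serving request.

```python
import statistics

train_x = [4.0, 6.0, 8.0]  # hypothetical training column

# Analysis happens once, over the full training data.
train_mean = statistics.mean(train_x)


def center(x, mean=train_mean):
    """The single transformation shared by training and serving."""
    return x - mean


# Training path: every example goes through center().
centered_train = [center(x) for x in train_x]

# Serving path: one request, same function, same frozen constant.
serving = center(10.0)

# Skewed alternative: recomputing the mean from the request itself
# silently produces a different feature than the model was trained on.
skewed = 10.0 - statistics.mean([10.0])

print(centered_train, serving, skewed)
```

Because the constant is frozen after analysis, the serving value stays on the same scale as the training features, while the recomputed version collapses to zero.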

Upgrade Pip

To avoid upgrading Pip in a system when running locally, check to make sure that we're running in Colab. Local systems can of course be upgraded separately.

try:
  import colab
  !pip install --upgrade pip
except ImportError:
  pass

Install TensorFlow Transform

pip install -q -U tensorflow_transform==0.24.1

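Did you restart the runtime?

If you are using Google Colab, the first time that you run the cell above, you must restart the runtime (Runtime > Restart runtime ...). This is because of the way that Colab loads packages.

Imports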

import pprint
import tempfile

import tensorflow as tf
import tensorflow_transform as tft

import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import schema_utils

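Data: Create some dummy data

We'll create some simple dummy data for our simple example:

  • raw_data is the initial raw data that we're going to preprocess
  • raw_data_metadata contains the schema that tells us the types of each of the columns in raw_data. In this case, it's very simple.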
raw_data = [
      {'x': 1, 'y': 1, 's': 'hello'},
      {'x': 2, 'y': 2, 's': 'world'},
      {'x': 3, 'y': 3, 's': 'hello'}
  ]

raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'y': tf.io.FixedLenFeature([], tf.float32),
        'x': tf.io.FixedLenFeature([], tf.float32),
        's': tf.io.FixedLenFeature([], tf.string),
    }))
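
As a rough illustration of what such a schema expresses, the plain-Python sketch below checks that a row has exactly the expected columns with compatible types. It is not how tf.Transform uses the metadata: EXPECTED_TYPES and conforms are hypothetical names, and accepting Python ints for the float32 columns is an assumption that mirrors the dummy data above.

```python
import numbers

# Hypothetical stand-in for the feature spec: column name -> accepted Python type.
EXPECTED_TYPES = {"x": numbers.Real, "y": numbers.Real, "s": str}


def conforms(row, expected_types=EXPECTED_TYPES):
    """Return True when the row has exactly the expected columns and types."""
    if set(row) != set(expected_types):
        return False
    return all(isinstance(row[name], kind) for name, kind in expected_types.items())


good_row = {"x": 1, "y": 1, "s": "hello"}
bad_row = {"x": 1, "y": "one", "s": "hello"}  # 'y' is not numeric
print(conforms(good_row), conforms(bad_row))
```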

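Transform: Create a preprocessing function

The preprocessing function is the most important concept of tf.Transform. A preprocessing function is where the transformation of the dataset really happens. It accepts and returns a dictionary of tensors, where a tensor means a Tensor or SparseTensor. There are two main groups of API calls that typically form the heart of a preprocessing function:

  1. TensorFlow Ops: Any function that accepts and returns tensors, which usually means TensorFlow ops. These add TensorFlow operations to the graph that transform raw data into transformed data one feature vector at a time. These run for every example, during both training and serving.
  2. TensorFlow Transform Analyzers/Mappers: Any of the analyzers/mappers provided by tf.Transform. These also accept and return tensors, and typically contain a combination of TensorFlow ops and Beam computation, but unlike TensorFlow ops they only run in the Beam pipeline during analysis, which requires a full pass over the entire training dataset. The Beam computation runs only once, during training, and typically makes a full pass over the entire training dataset. It creates tensor constants, which are added to your graph. For example, tft.min computes the minimum of a tensor over the training dataset, while tft.scale_by_min_max first computes the min and max of a tensor over the training dataset and then scales the tensor to be within a user-specified range, [output_min, output_max]. tf.Transform provides a fixed set of such analyzers/mappers, but this will be extended in future versions.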
def preprocessing_fn(inputs):
    """Preprocess input columns into transformed columns."""
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_integerized = tft.compute_and_apply_vocabulary(s)
    x_centered_times_y_normalized = (x_centered * y_normalized)
    return {
        'x_centered': x_centered,
        'y_normalized': y_normalized,
        's_integerized': s_integerized,
        'x_centered_times_y_normalized': x_centered_times_y_normalized,
    }
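
The two-phase behavior described for tft.scale_by_min_max can be sketched in plain Python. This is only an analogy under stated assumptions, not the library's implementation: analyze_min_max and the local scale_by_min_max are hypothetical helpers, and the sketch does not handle a column whose minimum equals its maximum.

```python
def analyze_min_max(column):
    """Analyzer phase: one full pass over the training column."""
    return min(column), max(column)


def scale_by_min_max(value, col_min, col_max, output_min=0.0, output_max=1.0):
    """Per-example phase: rescale using the constants found by the analyzer."""
    unit = (value - col_min) / (col_max - col_min)
    return unit * (output_max - output_min) + output_min


y_column = [1.0, 2.0, 3.0]
y_min, y_max = analyze_min_max(y_column)

# Default range [0, 1], comparable to scaling to the unit interval.
scaled_unit = [scale_by_min_max(y, y_min, y_max) for y in y_column]

# A user-specified range, here [-1, 1].
scaled_wide = [scale_by_min_max(y, y_min, y_max, -1.0, 1.0) for y in y_column]

print(scaled_unit)
print(scaled_wide)
```

Only analyze_min_max needs the whole dataset. Once y_min and y_max are known they behave like constants, which is the role the tensor constants play in the exported graph.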

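Putting it all together

Now we're ready to transform our data. We'll use Apache Beam with a direct runner, and supply three inputs:

  1. raw_data - The raw input data that we created above
  2. raw_data_metadata - The schema for the raw data
  3. preprocessing_fn - The function that we created to do our transformation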
def main():
  # Ignore the warnings
  with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (  # pylint: disable=unused-variable
        (raw_data, raw_data_metadata) | tft_beam.AnalyzeAndTransformDataset(
            preprocessing_fn))

  transformed_data, transformed_metadata = transformed_dataset  # pylint: disable=unused-variable

  print('\nRaw data:\n{}\n'.format(pprint.pformat(raw_data)))
  print('Transformed data:\n{}'.format(pprint.pformat(transformed_data)))

if __name__ == '__main__':
  main()
Raw data:
[{'s': 'hello', 'x': 1, 'y': 1},
 {'s': 'world', 'x': 2, 'y': 2},
 {'s': 'hello', 'x': 3, 'y': 3}]

Transformed data:
[{'s_integerized': 0,
  'x_centered': -1.0,
  'x_centered_times_y_normalized': -0.0,
  'y_normalized': 0.0},
 {'s_integerized': 1,
  'x_centered': 0.0,
  'x_centered_times_y_normalized': 0.0,
  'y_normalized': 0.5},
 {'s_integerized': 0,
  'x_centered': 1.0,
  'x_centered_times_y_normalized': 1.0,
  'y_normalized': 1.0}]

Is this the right answer?

Previously, we used tf.Transform to do this:

x_centered = x - tft.mean(x)
y_normalized = tft.scale_to_0_1(y)
s_integerized = tft.compute_and_apply_vocabulary(s)
x_centered_times_y_normalized = (x_centered * y_normalized)

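x_centered

With input of [1, 2, 3] the mean of x is 2, and we subtract it from x to center our x values at 0. So our result of [-1.0, 0.0, 1.0] is correct.

y_normalized

We wanted to scale our y values between 0 and 1. Our input was [1, 2, 3] so our result of [0.0, 0.5, 1.0] is correct.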

s_integerized

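We wanted to map our strings to indexes in a vocabulary, and there were only 2 words in our vocabulary ("hello" and "world"). So with input of ["hello", "world", "hello"] our result of [0, 1, 0] is correct. Since "hello" occurs most frequently in this data, it will be the first entry in the vocabulary.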
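
The frequency ordering can be reproduced with a short plain-Python sketch. It only mimics the behavior described here and is not the tft.compute_and_apply_vocabulary implementation: build_vocabulary is a hypothetical helper, and breaking ties alphabetically is an assumption of the sketch.

```python
from collections import Counter


def build_vocabulary(tokens):
    """Full pass: order distinct tokens by descending frequency."""
    counts = Counter(tokens)
    # Tie-break alphabetically; this rule is an assumption of the sketch.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: index for index, tok in enumerate(ordered)}


s = ["hello", "world", "hello"]
vocabulary = build_vocabulary(s)
s_integerized = [vocabulary[tok] for tok in s]

print(vocabulary)
print(s_integerized)
```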

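x_centered_times_y_normalized

We wanted to create a new feature by crossing x_centered and y_normalized using multiplication. Note that this multiplies the results, not the original values, and our new result of [-0.0, 0.0, 1.0] is correct.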
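
The numeric columns can be re-derived by hand. The following plain-Python sketch recomputes them from the raw x and y values without tf.Transform; the variable names follow the preprocessing function, and everything else is ordinary float arithmetic. It also shows where the -0.0 in the first row comes from.

```python
import math

x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0]

# Center x around its mean.
x_mean = sum(x) / len(x)
x_centered = [v - x_mean for v in x]

# Scale y to the unit interval.
y_min, y_max = min(y), max(y)
y_normalized = [(v - y_min) / (y_max - y_min) for v in y]

# Cross the two transformed columns, not the raw ones.
crossed = [a * b for a, b in zip(x_centered, y_normalized)]

print(x_centered)    # [-1.0, 0.0, 1.0]
print(y_normalized)  # [0.0, 0.5, 1.0]
print(crossed)       # [-0.0, 0.0, 1.0]

# -1.0 * 0.0 is a negative zero in IEEE 754: equal to 0.0, but with the sign bit set.
first_is_negative_zero = crossed[0] == 0.0 and math.copysign(1.0, crossed[0]) == -1.0
print(first_is_negative_zero)
```

The -0.0 is therefore not an error: it compares equal to 0.0 and is simply how the product of a negative number and zero is printed.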