Migrating feature columns to TF2's Keras preprocessing layers

Training a model usually involves some amount of feature preprocessing, particularly when dealing with structured data. When training a tf.estimator.Estimator in TF1, this feature preprocessing is usually done with the tf.feature_column API. In TF2, this preprocessing can be done directly with Keras layers, called preprocessing layers.

In this migration guide, you will perform some common feature transformations using both feature columns and preprocessing layers, and then train a complete model with each API.

First, start with a few necessary imports:

import tensorflow as tf
import tensorflow.compat.v1 as tf1
import math

Then add a utility for calling a feature column for demonstration:

def call_feature_columns(feature_columns, inputs):
  # This is a convenient way to call a `feature_column` outside of an estimator
  # to display its output.
  feature_layer = tf1.keras.layers.DenseFeatures(feature_columns)
  return feature_layer(inputs)

Input handling

To use feature columns with an estimator, model inputs are always expected to be a dictionary of tensors:

input_dict = {
  'foo': tf.constant([1]),
  'bar': tf.constant([0]),
  'baz': tf.constant([-1])
}

Each feature column needs to be created with a key that indexes into the source data. The outputs of all feature columns are concatenated and used by the estimator model.

columns = [
  tf1.feature_column.numeric_column('foo'),
  tf1.feature_column.numeric_column('bar'),
  tf1.feature_column.numeric_column('baz'),
]
call_feature_columns(columns, input_dict)

<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[ 0., -1.,  1.]], dtype=float32)>
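Note that the values above appear in the order bar, baz, foo rather than the order in which the columns were listed. This is consistent with the columns being ordered by name before concatenation. A plain-Python sketch of that ordering follows; the helper `concat_by_sorted_key` is hypothetical and only mimics the observed behavior, it is not the TensorFlow implementation:

```python
def concat_by_sorted_key(features):
    """Concatenate per-feature values in sorted key order."""
    row = []
    for key in sorted(features):
        row.extend(features[key])
    return row

# Same values as `input_dict` above, written as plain Python floats.
input_values = {'foo': [1.0], 'bar': [0.0], 'baz': [-1.0]}
ordered_row = concat_by_sorted_key(input_values)
print(ordered_row)  # [0.0, -1.0, 1.0]
```

If the position of each feature in the concatenated tensor matters to downstream code, it is safer to check the order than to assume it follows the list.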

In Keras, model input is much more flexible. A tf.keras.Model can handle a single tensor input, a list of tensor features, or a dictionary of tensor features. You can handle dictionary input by passing a dictionary of tf.keras.Input on model creation. Inputs are not concatenated automatically, which allows them to be used in much more flexible ways. They can be concatenated with tf.keras.layers.Concatenate.

inputs = {
  'foo': tf.keras.Input(shape=()),
  'bar': tf.keras.Input(shape=()),
  'baz': tf.keras.Input(shape=()),
}
# Inputs are typically transformed by preprocessing layers before concatenation.
outputs = tf.keras.layers.Concatenate()(inputs.values())
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model(input_dict)

<tf.Tensor: shape=(3,), dtype=float32, numpy=array([ 1.,  0., -1.], dtype=float32)>

One-hot encoding integer IDs

A common feature transformation is one-hot encoding integer inputs of a known range. Here is an example using feature columns:

categorical_col = tf1.feature_column.categorical_column_with_identity(
    'type', num_buckets=3)
indicator_col = tf1.feature_column.indicator_column(categorical_col)
call_feature_columns(indicator_col, {'type': [0, 1, 2]})

<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)>

Using Keras preprocessing layers, these columns can be replaced by a single tf.keras.layers.CategoryEncoding layer with output_mode set to 'one_hot':

one_hot_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=3, output_mode='one_hot')
one_hot_layer([0, 1, 2])

<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)>
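To make the transformation itself concrete, here is a minimal NumPy sketch of one-hot encoding for integer IDs in a known range. The helper `one_hot` is a hypothetical name used only for illustration, and it covers only the simple case shown above:

```python
import numpy as np

def one_hot(ids, num_tokens):
    """Return a float32 one-hot matrix with one row per integer id."""
    ids = np.asarray(ids)
    if ids.min() < 0 or ids.max() >= num_tokens:
        raise ValueError("ids must lie in [0, num_tokens)")
    # Row i of the identity matrix is the one-hot vector for id i.
    return np.eye(num_tokens, dtype=np.float32)[ids]

encoded = one_hot([0, 1, 2], num_tokens=3)
print(encoded)
```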

Normalizing numeric features

When handling continuous, floating-point features with feature columns, you need to use a tf.feature_column.numeric_column. If the input is already normalized, converting this to Keras is trivial: you can feed a tf.keras.Input directly into your model, as shown above.

A numeric_column can also be used to normalize input:

def normalize(x):
  mean, variance = (2.0, 1.0)
  return (x - mean) / math.sqrt(variance)
numeric_col = tf1.feature_column.numeric_column('col', normalizer_fn=normalize)
call_feature_columns(numeric_col, {'col': tf.constant([[0.], [1.], [2.]])})

<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[-2.],
       [-1.],
       [ 0.]], dtype=float32)>

In contrast, with Keras, this normalization can be done with tf.keras.layers.Normalization.

normalization_layer = tf.keras.layers.Normalization(mean=2.0, variance=1.0)
normalization_layer(tf.constant([[0.], [1.], [2.]]))

<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[-2.],
       [-1.],
       [ 0.]], dtype=float32)>
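Both versions apply the same arithmetic, (x - mean) / sqrt(variance). A small NumPy sketch of that formula, using the mean and variance from the example above (the function name `standardize` is an illustrative assumption):

```python
import numpy as np

def standardize(x, mean, variance):
    """Shift by the mean and scale by the standard deviation."""
    x = np.asarray(x, dtype=np.float32)
    return (x - mean) / np.sqrt(variance)

normalized = standardize([[0.], [1.], [2.]], mean=2.0, variance=1.0)
print(normalized)  # [[-2.] [-1.] [ 0.]]
```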

Bucketizing and one-hot encoding numeric features

Another common transformation of continuous, floating-point inputs is to bucketize them into integers of a fixed range.

With feature columns, this can be achieved with a tf.feature_column.bucketized_column:

numeric_col = tf1.feature_column.numeric_column('col')
bucketized_col = tf1.feature_column.bucketized_column(numeric_col, [1, 4, 5])
call_feature_columns(bucketized_col, {'col': tf.constant([1., 2., 3., 4., 5.])})

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]], dtype=float32)>

In Keras, this can be replaced by tf.keras.layers.Discretization:

discretization_layer = tf.keras.layers.Discretization(bin_boundaries=[1, 4, 5])
one_hot_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=4, output_mode='one_hot')
one_hot_layer(discretization_layer([1., 2., 3., 4., 5.]))

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]], dtype=float32)>
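The outputs above show that three boundaries produce four buckets, and that a value equal to a boundary falls into the bucket to its right (1.0 lands in bucket 1, 4.0 in bucket 2). A NumPy sketch that reproduces the same bucket indices with `np.digitize`, offered only as an illustration of the arithmetic:

```python
import numpy as np

boundaries = [1, 4, 5]
values = np.array([1., 2., 3., 4., 5.])

# Bucket i holds values v with boundaries[i-1] <= v < boundaries[i];
# values below the first boundary go to bucket 0.
bucket_ids = np.digitize(values, boundaries)

# len(boundaries) + 1 buckets in total, hence num_tokens=4 above.
bucket_one_hot = np.eye(len(boundaries) + 1, dtype=np.float32)[bucket_ids]
print(bucket_ids)  # [1 1 1 2 3]
```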

One-hot encoding string data with a vocabulary

Handling string features often requires a vocabulary lookup to translate strings into indices. Here is an example that uses feature columns to look up strings and then one-hot encode the indices:

vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(
    'sizes',
    vocabulary_list=['small', 'medium', 'large'],
    num_oov_buckets=0)
indicator_col = tf1.feature_column.indicator_column(vocab_col)
call_feature_columns(indicator_col, {'sizes': ['small', 'medium', 'large']})

<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)>

Using Keras preprocessing layers, use the tf.keras.layers.StringLookup layer with output_mode set to 'one_hot':

string_lookup_layer = tf.keras.layers.StringLookup(
    vocabulary=['small', 'medium', 'large'],
    num_oov_indices=0,
    output_mode='one_hot')
string_lookup_layer(['small', 'medium', 'large'])

<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)>
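Conceptually, the lookup is a mapping from each vocabulary string to its position, followed by one-hot encoding. A plain-Python sketch under that assumption follows. The helper `one_hot_lookup` is hypothetical, and it simply raises `KeyError` for unknown strings rather than modeling how either API treats out-of-vocabulary input:

```python
vocabulary = ['small', 'medium', 'large']
# Position in the list is the index, matching the case with no OOV slots.
index_of = {token: i for i, token in enumerate(vocabulary)}

def one_hot_lookup(tokens):
    """Map each token to a one-hot row over the vocabulary."""
    rows = []
    for token in tokens:
        row = [0.0] * len(vocabulary)
        row[index_of[token]] = 1.0  # KeyError if token is not in the vocabulary
        rows.append(row)
    return rows

looked_up = one_hot_lookup(['small', 'medium', 'large'])
print(looked_up)
```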

Embedding string data with a vocabulary

For larger vocabularies, an embedding is often needed for good performance. Here is an example of embedding a string feature using feature columns:

vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(
    'col',
    vocabulary_list=['small', 'medium', 'large'],
    num_oov_buckets=0)
embedding_col = tf1.feature_column.embedding_column(vocab_col, 4)
call_feature_columns(embedding_col, {'col': ['small', 'medium', 'large']})

<tf.Tensor: shape=(3, 4), dtype=float32, numpy=
array([[-0.01798586, -0.2808677 ,  0.27639154,  0.06081508],
       [ 0.05771849,  0.02464074,  0.20080602,  0.50164527],
       [-0.9208247 , -0.40816694, -0.49132794,  0.9203153 ]],
      dtype=float32)>

Using Keras preprocessing layers, this can be achieved by combining a tf.keras.layers.StringLookup layer and a tf.keras.layers.Embedding layer. The default output of StringLookup is integer indices, which can be fed directly into an embedding.

string_lookup_layer = tf.keras.layers.StringLookup(
    vocabulary=['small', 'medium', 'large'], num_oov_indices=0)
embedding = tf.keras.layers.Embedding(3, 4)
embedding(string_lookup_layer(['small', 'medium', 'large']))

<tf.Tensor: shape=(3, 4), dtype=float32, numpy=
array([[ 0.04838837, -0.04014301,  0.02001903, -0.01150769],
       [-0.04580117, -0.04319514,  0.03725603, -0.00572466],
       [-0.0401094 ,  0.00997342,  0.00111955,  0.00132702]],
      dtype=float32)>

Summing weighted categorical data

In some cases, you need to deal with categorical data where each occurrence of a category comes with an associated weight. With feature columns, this is handled with tf.feature_column.weighted_categorical_column. When paired with an indicator_column, this has the effect of summing the weights per category.

ids = tf.constant([[5, 11, 5, 17, 17]])
weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])

categorical_col = tf1.feature_column.categorical_column_with_identity(
    'ids', num_buckets=20)
weighted_categorical_col = tf1.feature_column.weighted_categorical_column(
    categorical_col, 'weights')
indicator_col = tf1.feature_column.indicator_column(weighted_categorical_col)
call_feature_columns(indicator_col, {'ids': ids, 'weights': weights})

WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow/python/feature_column/feature_column_v2.py:4203: sparse_merge (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
No similar op available at this time.
<tf.Tensor: shape=(1, 20), dtype=float32, numpy=
array([[0. , 0. , 0. , 0. , 0. , 1.2, 0. , 0. , 0. , 0. , 0. , 1.5, 0. ,
        0. , 0. , 0. , 0. , 2. , 0. , 0. ]], dtype=float32)>

In Keras, this can be done by passing a count_weights input to tf.keras.layers.CategoryEncoding with output_mode='count'.

ids = tf.constant([[5, 11, 5, 17, 17]])
weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])

# Using sparse output is more efficient when `num_tokens` is large.
count_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=20, output_mode='count', sparse=True)
tf.sparse.to_dense(count_layer(ids, count_weights=weights))

<tf.Tensor: shape=(1, 20), dtype=float32, numpy=
array([[0. , 0. , 0. , 0. , 0. , 1.2, 0. , 0. , 0. , 0. , 0. , 1.5, 0. ,
        0. , 0. , 0. , 0. , 2. , 0. , 0. ]], dtype=float32)>
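The per-category sums can be checked by hand: id 5 appears with weights 0.5 and 0.7 (sum 1.2), id 11 with 1.5, and id 17 with 1.8 and 0.2 (sum 2.0). A NumPy sketch of the same computation for a single sample, using `np.bincount` as a stand-in for the layers above:

```python
import numpy as np

ids = np.array([5, 11, 5, 17, 17])
weights = np.array([0.5, 1.5, 0.7, 1.8, 0.2])

# Sum the weights of every occurrence of each id; ids that never
# occur keep a total of 0. `minlength` plays the role of num_tokens.
weighted_counts = np.bincount(ids, weights=weights, minlength=20)
print(weighted_counts[[5, 11, 17]])
```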

Embedding weighted categorical data

Alternatively, you might want to embed weighted categorical inputs. With feature columns, the embedding_column takes a combiner argument. If a sample contains multiple entries for a category, they are combined according to that argument (by default 'mean').

ids = tf.constant([[5, 11, 5, 17, 17]])
weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])

categorical_col = tf1.feature_column.categorical_column_with_identity(
    'ids', num_buckets=20)
weighted_categorical_col = tf1.feature_column.weighted_categorical_column(
    categorical_col, 'weights')
embedding_col = tf1.feature_column.embedding_column(
    weighted_categorical_col, 4, combiner='mean')
call_feature_columns(embedding_col, {'ids': ids, 'weights': weights})

<tf.Tensor: shape=(1, 4), dtype=float32, numpy=
array([[ 0.02666993,  0.289671  ,  0.18065728, -0.21045178]],
      dtype=float32)>

In Keras, tf.keras.layers.Embedding has no combiner option, but you can achieve the same effect with tf.keras.layers.Dense. The embedding_column above simply combines the embedding vectors linearly according to category weight. Though not obvious at first, this is exactly equivalent to representing your categorical inputs as a sparse weight vector of size (num_tokens) and multiplying it by a Dense kernel of shape (num_tokens, embedding_size).

ids = tf.constant([[5, 11, 5, 17, 17]])
weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])

# For `combiner='mean'`, normalize your weights to sum to 1. Removing this line
# would be equivalent to an `embedding_column` with `combiner='sum'`.
weights = weights / tf.reduce_sum(weights, axis=-1, keepdims=True)

count_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=20, output_mode='count', sparse=True)
embedding_layer = tf.keras.layers.Dense(4, use_bias=False)
embedding_layer(count_layer(ids, count_weights=weights))

<tf.Tensor: shape=(1, 4), dtype=float32, numpy=
array([[-0.03897291, -0.27131438,  0.09332469,  0.04333957]],
      dtype=float32)>
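The two outputs above differ only because each example initializes its weights randomly. The equivalence itself can be checked numerically with a shared table. The NumPy sketch below uses a randomly generated, purely hypothetical embedding table and compares a weighted mean of looked-up rows against a matrix product with the normalized weight vector:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, embedding_size = 20, 4
# Hypothetical embedding table: one row per category id.
table = rng.normal(size=(num_tokens, embedding_size))

ids = np.array([5, 11, 5, 17, 17])
weights = np.array([0.5, 1.5, 0.7, 1.8, 0.2])

# (a) Weighted mean of the looked-up embedding rows.
weighted_mean = (weights[:, None] * table[ids]).sum(axis=0) / weights.sum()

# (b) Normalized per-category weight vector multiplied by the same
#     table, used here as a (num_tokens, embedding_size) kernel.
weight_vector = np.bincount(
    ids, weights=weights / weights.sum(), minlength=num_tokens)
dense_output = weight_vector @ table

print(np.allclose(weighted_mean, dense_output))  # True
```

Skipping the division by `weights.sum()` in both branches gives the corresponding check for a sum combiner.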

Complete training example

To show a complete training workflow, first prepare some data with three features of different types:

features = {
    'type': [0, 1, 1],
    'size': ['small', 'small', 'medium'],
    'weight': [2.7, 1.8, 1.6],
}
labels = [1, 1, 0]
predict_features = {'type': [0], 'size': ['foo'], 'weight': [-0.7]}

Define some common constants for both the TF1 and TF2 workflows:

vocab = ['small', 'medium', 'large']
one_hot_dims = 3
embedding_dims = 4
weight_mean = 2.0
weight_variance = 1.0

With feature columns

Feature columns must be passed as a list to the estimator on creation, and are called implicitly during training.

categorical_col = tf1.feature_column.categorical_column_with_identity(
    'type', num_buckets=one_hot_dims)
# Convert index to one-hot; e.g. [2] -> [0,0,1].
indicator_col = tf1.feature_column.indicator_column(categorical_col)

# Convert strings to indices; e.g. ['small'] -> [0]. Out-of-vocabulary
# strings map to the extra bucket after the vocabulary (index 3 here).
vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(
    'size', vocabulary_list=vocab, num_oov_buckets=1)
# Embed the indices.
embedding_col = tf1.feature_column.embedding_column(vocab_col, embedding_dims)

normalizer_fn = lambda x: (x - weight_mean) / math.sqrt(weight_variance)
# Normalize the numeric inputs; e.g. [2.0] -> [0.0].
numeric_col = tf1.feature_column.numeric_column(
    'weight', normalizer_fn=normalizer_fn)

estimator = tf1.estimator.DNNClassifier(
    feature_columns=[indicator_col, embedding_col, numeric_col],
    hidden_units=[1])

def _input_fn():
  return tf1.data.Dataset.from_tensor_slices((features, labels)).batch(1)

estimator.train(_input_fn)

INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmp8lwbuor2
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmp8lwbuor2', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow/python/training/adagrad.py:77: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmp8lwbuor2/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 0.54634213, step = 0
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 3...
INFO:tensorflow:Saving checkpoints for 3 into /tmp/tmp8lwbuor2/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 3...
INFO:tensorflow:Loss for final step: 0.7308526.
<tensorflow_estimator.python.estimator.canned.dnn.DNNClassifier at 0x7f90685d53d0>

The feature columns are also used to transform input data when running inference on the model.

def _predict_fn():
  return tf1.data.Dataset.from_tensor_slices(predict_features).batch(1)

next(estimator.predict(_predict_fn))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp8lwbuor2/model.ckpt-3
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
{'logits': array([0.5172372], dtype=float32),
 'logistic': array([0.6265015], dtype=float32),
 'probabilities': array([0.37349847, 0.6265015 ], dtype=float32),
 'class_ids': array([1]),
 'classes': array([b'1'], dtype=object),
 'all_class_ids': array([0, 1], dtype=int32),
 'all_classes': array([b'0', b'1'], dtype=object)}

With Keras preprocessing layers

Keras preprocessing layers are more flexible in where they can be called. A layer can be applied directly to tensors, used inside a tf.data input pipeline, or built directly into a trainable Keras model.

In this example, you will apply preprocessing layers inside a tf.data input pipeline. To do this, you can define a separate tf.keras.Model to preprocess your input features. This model is not trainable, but it is a convenient way to group preprocessing layers.

inputs = {
  'type': tf.keras.Input(shape=(), dtype='int64'),
  'size': tf.keras.Input(shape=(), dtype='string'),
  'weight': tf.keras.Input(shape=(), dtype='float32'),
}
# Convert index to one-hot; e.g. [2] -> [0,0,1].
type_output = tf.keras.layers.CategoryEncoding(
      one_hot_dims, output_mode='one_hot')(inputs['type'])
# Convert size strings to indices; e.g. ['small'] -> [1].
size_output = tf.keras.layers.StringLookup(vocabulary=vocab)(inputs['size'])
# Normalize the numeric inputs; e.g. [2.0] -> [0.0].
weight_output = tf.keras.layers.Normalization(
      axis=None, mean=weight_mean, variance=weight_variance)(inputs['weight'])
outputs = {
  'type': type_output,
  'size': size_output,
  'weight': weight_output,
}
preprocessing_model = tf.keras.Model(inputs, outputs)

You can now apply this model inside a call to tf.data.Dataset.map. Note that the function passed to map is automatically converted into a tf.function, so the usual caveats for writing tf.function code apply (in particular, the function should have no side effects).

# Apply the preprocessing in tf.data.Dataset.map.
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(1)
dataset = dataset.map(lambda x, y: (preprocessing_model(x), y),
                      num_parallel_calls=tf.data.AUTOTUNE)
# Display a preprocessed input sample.
next(dataset.take(1).as_numpy_iterator())

({'type': array([[1., 0., 0.]], dtype=float32),
  'size': array([1]),
  'weight': array([0.70000005], dtype=float32)},
 array([1], dtype=int32))
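The preprocessed sample above can be reproduced by hand from the first training row (type 0, size 'small', weight 2.7). The plain-Python sketch below mirrors the three transformations. The `preprocess` helper is hypothetical, and it assumes the lookup reserves index 0 for out-of-vocabulary strings, which matches 'small' mapping to 1 in the output above:

```python
import math

vocab = ['small', 'medium', 'large']
one_hot_dims = 3
weight_mean, weight_variance = 2.0, 1.0

def preprocess(row):
    """Apply one-hot, vocabulary lookup, and normalization to one row."""
    type_one_hot = [1.0 if i == row['type'] else 0.0
                    for i in range(one_hot_dims)]
    # Assumption: index 0 is the out-of-vocabulary slot, so known
    # strings are shifted by one ('small' -> 1, 'large' -> 3).
    size_index = vocab.index(row['size']) + 1 if row['size'] in vocab else 0
    weight = (row['weight'] - weight_mean) / math.sqrt(weight_variance)
    return {'type': type_one_hot, 'size': size_index, 'weight': weight}

first = preprocess({'type': 0, 'size': 'small', 'weight': 2.7})
unknown = preprocess({'type': 0, 'size': 'foo', 'weight': -0.7})
print(first)
print(unknown)
```

Under that assumption the lookup can emit len(vocab) + 1 distinct indices, which matters when sizing any layer that consumes them.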

Next, you can define a separate Model containing the trainable layers. Note how the inputs to this model now reflect the preprocessed feature types and shapes.

inputs = {
  'type': tf.keras.Input(shape=(one_hot_dims,), dtype='float32'),
  'size': tf.keras.Input(shape=(), dtype='int64'),
  'weight': tf.keras.Input(shape=(), dtype='float32'),
}
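# Since the embedding is trainable, it needs to be part of the training model.
# StringLookup reserves index 0 for out-of-vocabulary strings by default, so
# the embedding needs len(vocab) + 1 rows to cover every index it can emit.
embedding = tf.keras.layers.Embedding(len(vocab) + 1, embedding_dims)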
outputs = tf.keras.layers.Concatenate()([
  inputs['type'],
  embedding(inputs['size']),
  tf.expand_dims(inputs['weight'], -1),
])
outputs = tf.keras.layers.Dense(1)(outputs)
training_model = tf.keras.Model(inputs, outputs)

You can now train the training_model with tf.keras.Model.fit.

# Train on the preprocessed data.
training_model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
training_model.fit(dataset)

3/3 [==============================] - 0s 3ms/step - loss: 0.7248
<keras.callbacks.History at 0x7f9041a294d0>

Finally, at inference time, it can be useful to combine these separate stages into a single model that handles raw feature inputs.

inputs = preprocessing_model.input
outputs = training_model(preprocessing_model(inputs))
inference_model = tf.keras.Model(inputs, outputs)

predict_dataset = tf.data.Dataset.from_tensor_slices(predict_features).batch(1)
inference_model.predict(predict_dataset)

array([[0.936637]], dtype=float32)

This composed model can be saved as a SavedModel for later use.

inference_model.save('model')
restored_model = tf.keras.models.load_model('model')
restored_model.predict(predict_dataset)

WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
2021-10-27 01:23:25.649967: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
INFO:tensorflow:Assets written to: model/assets
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
array([[0.936637]], dtype=float32)

Note: Preprocessing layers are not trainable, which allows you to apply them asynchronously using tf.data. This has performance benefits, as you can both prefetch preprocessed batches and free up any accelerators to focus on the differentiable parts of a model. As this guide shows, separating preprocessing during training and composing it during inference is a flexible way to take advantage of these performance gains. However, if your model is small or preprocessing time is negligible, it may be simpler to build preprocessing into a complete model from the start. To do this, you can build a single model starting with tf.keras.Input, followed by preprocessing layers, followed by trainable layers.

Feature column equivalence table

For reference, here is an approximate correspondence between feature columns and preprocessing layers:

Feature column	Keras layer
`feature_column.bucketized_column`	`layers.Discretization`
`feature_column.categorical_column_with_hash_bucket`	`layers.Hashing`
`feature_column.categorical_column_with_identity`	`layers.CategoryEncoding`
`feature_column.categorical_column_with_vocabulary_file`	`layers.StringLookup` or `layers.IntegerLookup`
`feature_column.categorical_column_with_vocabulary_list`	`layers.StringLookup` or `layers.IntegerLookup`
`feature_column.crossed_column`	Not implemented.
`feature_column.embedding_column`	`layers.Embedding`
`feature_column.indicator_column`	`output_mode='one_hot'` or `output_mode='multi_hot'` *
`feature_column.numeric_column`	`layers.Normalization`
`feature_column.sequence_categorical_column_with_hash_bucket`	`layers.Hashing`
`feature_column.sequence_categorical_column_with_identity`	`layers.CategoryEncoding`
`feature_column.sequence_categorical_column_with_vocabulary_file`	`layers.StringLookup`, `layers.IntegerLookup`, or `layers.TextVectorization` †
`feature_column.sequence_categorical_column_with_vocabulary_list`	`layers.StringLookup`, `layers.IntegerLookup`, or `layers.TextVectorization` †
`feature_column.sequence_numeric_column`	`layers.Normalization`
`feature_column.weighted_categorical_column`	`layers.CategoryEncoding`

* output_mode can be passed to layers.CategoryEncoding, layers.StringLookup, layers.IntegerLookup, and layers.TextVectorization.

† layers.TextVectorization can handle freeform text input directly (for example, entire sentences or paragraphs). This is not a one-to-one replacement for categorical sequence handling in TF1, but it may offer a convenient replacement for ad-hoc text preprocessing.

Next steps

For more information on Keras preprocessing layers, see the guide to preprocessing layers.
For a more in-depth example of applying preprocessing layers to structured data, see the structured data tutorial.