Recommending movies: ranking

Real-world recommender systems are often composed of two stages:

The retrieval stage is responsible for selecting an initial set of hundreds of candidates from all possible candidates. The main objective of this model is to efficiently weed out all candidates that the user is not interested in. Because the retrieval model may be dealing with millions of candidates, it has to be computationally efficient.
The ranking stage takes the outputs of the retrieval model and fine-tunes them to select the best possible handful of recommendations. Its task is to narrow down the set of items the user may be interested in to a shortlist of likely candidates.
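
The two stages described above can be sketched in plain Python. This is only an illustration, not how TFRS implements them: the recommend function, the catalog, and both score tables are hypothetical, and the scores are made up.

```python
from typing import Callable, List


def recommend(
    user: str,
    catalog: List[str],
    retrieval_score: Callable[[str, str], float],
    ranking_score: Callable[[str, str], float],
    num_candidates: int,
    num_results: int,
) -> List[str]:
    """Two-stage recommendation: cheap retrieval, then costlier ranking."""
    # Stage 1: a cheap scorer trims the full catalog to a candidate set.
    candidates = sorted(
        catalog, key=lambda item: retrieval_score(user, item), reverse=True
    )[:num_candidates]
    # Stage 2: a more expensive scorer orders only the surviving candidates.
    ranked = sorted(
        candidates, key=lambda item: ranking_score(user, item), reverse=True
    )
    return ranked[:num_results]


# Hypothetical scores for a single user; real systems would use trained models.
retrieval_scores = {"Speed": 0.9, "Heat": 0.8, "Casablanca": 0.1, "Alien": 0.7, "Fargo": 0.2}
ranking_scores = {"Speed": 3.5, "Heat": 4.1, "Casablanca": 4.9, "Alien": 4.6, "Fargo": 4.0}

top = recommend(
    "42",
    list(retrieval_scores),
    retrieval_score=lambda user, item: retrieval_scores[item],
    ranking_score=lambda user, item: ranking_scores[item],
    num_candidates=3,
    num_results=2,
)
print(top)  # ['Alien', 'Heat']
```

Note that "Casablanca" has the highest ranking score in this toy data but never reaches the ranking stage, because retrieval filtered it out. The ranking model only ever sees what retrieval lets through.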

We're going to focus on the second stage, ranking. If you are interested in the retrieval stage, have a look at our retrieval tutorial.

In this tutorial, we're going to:

Get our data and split it into a training and test set.
Implement a ranking model.
Fit and evaluate it.

Imports

Let's first get our imports out of the way.

pip install -q tensorflow-recommenders
pip install -q --upgrade tensorflow-datasets

import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

import tensorflow_recommenders as tfrs

Preparing the dataset

We're going to use the same data as the retrieval tutorial. This time, we're also going to keep the ratings: these are the objectives we are trying to predict.

ratings = tfds.load("movielens/100k-ratings", split="train")

ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "user_rating": x["user_rating"]
})

As before, we'll split the data by putting 80% of the ratings in the train set, and 20% in the test set.

tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)
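
The following standard-library sketch mimics the shuffle-once, then take/skip pattern with plain integers standing in for the 100,000 ratings. It is an analogy rather than the tf.data implementation, and the names examples, train_ids, and test_ids are made up for the illustration. The point it tries to show is why the shuffle order should be fixed: if the order changed between the two slices (which is what reshuffle_each_iteration=False is there to prevent, as far as we understand it), the train and test sets could overlap.

```python
import random

# Stand-in for the 100,000 MovieLens ratings: plain integer ids.
examples = list(range(100_000))

# Shuffle exactly once with a fixed seed, then slice the same ordering twice.
rng = random.Random(42)
rng.shuffle(examples)

train_ids = examples[:80_000]         # analogous to shuffled.take(80_000)
test_ids = examples[80_000:100_000]   # analogous to shuffled.skip(80_000).take(20_000)

print(len(train_ids), len(test_ids))        # 80000 20000
print(set(train_ids).isdisjoint(test_ids))  # True
```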

Let's also figure out the unique user ids and movie titles present in the data.

This is important because we need to be able to map the raw values of our categorical features to embedding vectors in our models. To do that, we need a vocabulary that maps a raw feature value to an integer in a contiguous range: this allows us to look up the corresponding embeddings in our embedding tables.

movie_titles = ratings.batch(1_000_000).map(lambda x: x["movie_title"])
user_ids = ratings.batch(1_000_000).map(lambda x: x["user_id"])

unique_movie_titles = np.unique(np.concatenate(list(movie_titles)))
unique_user_ids = np.unique(np.concatenate(list(user_ids)))
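
To make the vocabulary idea concrete, here is a minimal pure-Python sketch of mapping raw strings to contiguous integers and then to rows of an embedding table. The helpers build_vocabulary and lookup are hypothetical names used only here. Reserving index 0 for unknown values is an assumption chosen to mirror how a string lookup layer with one out-of-vocabulary bucket and no mask token behaves, which is also why the model later sizes its embedding tables as vocabulary length plus one.

```python
from typing import Dict, List


def build_vocabulary(raw_values: List[str]) -> Dict[str, int]:
    """Map each unique raw value to an integer in 1..N; 0 is kept for unknowns."""
    unique_values = sorted(set(raw_values))
    return {value: index for index, value in enumerate(unique_values, start=1)}


def lookup(vocabulary: Dict[str, int], value: str) -> int:
    # Values missing from the vocabulary share the out-of-vocabulary index 0.
    return vocabulary.get(value, 0)


user_vocab = build_vocabulary(["42", "7", "42", "138"])

# One embedding row per vocabulary entry, plus one row for index 0.
embedding_dim = 4
embedding_table = [[0.0] * embedding_dim for _ in range(len(user_vocab) + 1)]

index = lookup(user_vocab, "42")
print(user_vocab)                         # {'138': 1, '42': 2, '7': 3}
print(index, embedding_table[index])      # 2 [0.0, 0.0, 0.0, 0.0]
print(lookup(user_vocab, "unseen-user"))  # 0
```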

Implementing a model

Architecture

Ranking models do not face the same efficiency constraints as retrieval models do, and so we have a little bit more freedom in our choice of architectures.

A model composed of multiple stacked dense layers is a relatively common architecture for ranking tasks. We can implement it as follows:

class RankingModel(tf.keras.Model):

  def __init__(self):
    super().__init__()
    embedding_dimension = 32

    # Compute embeddings for users.
    self.user_embeddings = tf.keras.Sequential([
      tf.keras.layers.StringLookup(
        vocabulary=unique_user_ids, mask_token=None),
      tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension)
    ])

    # Compute embeddings for movies.
    self.movie_embeddings = tf.keras.Sequential([
      tf.keras.layers.StringLookup(
        vocabulary=unique_movie_titles, mask_token=None),
      tf.keras.layers.Embedding(len(unique_movie_titles) + 1, embedding_dimension)
    ])

    # Compute predictions.
    self.ratings = tf.keras.Sequential([
      # Learn multiple dense layers.
      tf.keras.layers.Dense(256, activation="relu"),
      tf.keras.layers.Dense(64, activation="relu"),
      # Make rating predictions in the final layer.
      tf.keras.layers.Dense(1)
    ])

  def call(self, inputs):

    user_id, movie_title = inputs

    user_embedding = self.user_embeddings(user_id)
    movie_embedding = self.movie_embeddings(movie_title)

    return self.ratings(tf.concat([user_embedding, movie_embedding], axis=1))
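
The data flow in call can be traced without TensorFlow. The sketch below is a rough, untrained stand-in written with the standard library only: the helpers dense and random_layer are hypothetical, the weights are random, and the embeddings are random vectors rather than looked-up rows. It only demonstrates the shapes involved: two 32-dimensional embeddings are concatenated into 64 features, passed through layers of width 256 and 64 with ReLU, and reduced to a single predicted rating.

```python
import random
from typing import List, Tuple


def dense(inputs: List[float], weights: List[List[float]], bias: List[float],
          relu: bool = False) -> List[float]:
    """Fully connected layer: one output per row of `weights`."""
    outputs = [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]
    return [max(0.0, o) for o in outputs] if relu else outputs


def random_layer(rng: random.Random, n_in: int,
                 n_out: int) -> Tuple[List[List[float]], List[float]]:
    """Random weights of shape [n_out][n_in] and a zero bias (untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
user_embedding = [rng.uniform(-1.0, 1.0) for _ in range(32)]
movie_embedding = [rng.uniform(-1.0, 1.0) for _ in range(32)]

# Concatenate the two embeddings, as tf.concat(..., axis=1) does per example.
features = user_embedding + movie_embedding

w1, b1 = random_layer(rng, 64, 256)
w2, b2 = random_layer(rng, 256, 64)
w3, b3 = random_layer(rng, 64, 1)

hidden1 = dense(features, w1, b1, relu=True)
hidden2 = dense(hidden1, w2, b2, relu=True)
prediction = dense(hidden2, w3, b3)  # no activation on the final layer

print(len(features), len(hidden1), len(hidden2), len(prediction))  # 64 256 64 1
```

Because the weights are random, the predicted value itself is meaningless until the model is trained; the same is true of the untrained RankingModel.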

This model takes user ids and movie titles, and outputs a predicted rating:

RankingModel()((["42"], ["One Flew Over the Cuckoo's Nest (1975)"]))

WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor, but we receive a <class 'list'> input: ['42']
Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor, but we receive a <class 'list'> input: ["One Flew Over the Cuckoo's Nest (1975)"]
Consider rewriting this model with the Functional API.
<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[0.03740937]], dtype=float32)>

Loss and metrics

The next component is the loss used to train our model. TFRS has several loss layers and tasks to make this easy.

In this instance, we'll make use of the Ranking task object: a convenience wrapper that bundles together the loss function and metric computation.

We'll use it together with the MeanSquaredError Keras loss in order to predict the ratings.

task = tfrs.tasks.Ranking(
  loss = tf.keras.losses.MeanSquaredError(),
  metrics=[tf.keras.metrics.RootMeanSquaredError()]
)

The task itself is a Keras layer that takes the true and predicted values as arguments, and returns the computed loss. We'll use that to implement the model's training loop.
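
As a small worked example of the quantities involved, the sketch below computes mean squared error and its square root (RMSE) by hand for four made-up ratings. The function name mean_squared_error and the numbers are illustrative only; the Keras classes compute the same quantities over batches of tensors.

```python
import math
from typing import Sequence


def mean_squared_error(labels: Sequence[float], predictions: Sequence[float]) -> float:
    """Average of the squared differences between labels and predictions."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)


true_ratings = [4.0, 3.0, 5.0, 1.0]
predicted_ratings = [3.5, 3.0, 4.0, 2.0]

# Squared errors: 0.25, 0.0, 1.0, 1.0 -> mean 0.5625
mse = mean_squared_error(true_ratings, predicted_ratings)
rmse = math.sqrt(mse)

print(mse, rmse)  # 0.5625 0.75
```

RMSE is expressed in the same units as the ratings, so an RMSE of 0.75 means predictions are off by roughly three quarters of a star in this root-mean-square sense.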

The full model

We can now put it all together into a model. TFRS exposes a base model class (tfrs.models.Model) which streamlines building models: all we need to do is set up the components in the __init__ method, and implement the compute_loss method, taking in the raw features and returning a loss value.

The base model will then take care of creating the appropriate training loop to fit our model.

class MovielensModel(tfrs.models.Model):

  def __init__(self):
    super().__init__()
    self.ranking_model: tf.keras.Model = RankingModel()
    self.task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
      loss = tf.keras.losses.MeanSquaredError(),
      metrics=[tf.keras.metrics.RootMeanSquaredError()]
    )

  def call(self, features: Dict[str, tf.Tensor]) -> tf.Tensor:
    return self.ranking_model(
        (features["user_id"], features["movie_title"]))

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    labels = features.pop("user_rating")

    rating_predictions = self(features)

    # The task computes the loss and the metrics.
    return self.task(labels=labels, predictions=rating_predictions)

Fitting and evaluating

After defining the model, we can use standard Keras fitting and evaluation routines to fit and evaluate the model.

Let's first instantiate the model.

model = MovielensModel()
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

Then shuffle, batch, and cache the training and evaluation data.

cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()
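
A quick arithmetic check shows how these batch sizes translate into steps per epoch. The helper batch_plan is a hypothetical name for this sketch, and it assumes the final partial batch is kept rather than dropped, which is the default behaviour of batch as far as we know. The results line up with the 10/10 and 5/5 step counts in the training and evaluation logs.

```python
import math
from typing import Tuple


def batch_plan(num_examples: int, batch_size: int) -> Tuple[int, int]:
    """Return (number of batches, size of the final batch)."""
    num_batches = math.ceil(num_examples / batch_size)
    last_batch = num_examples - (num_batches - 1) * batch_size
    return num_batches, last_batch


train_plan = batch_plan(80_000, 8192)
test_plan = batch_plan(20_000, 4096)

print(train_plan)  # (10, 6272)
print(test_plan)   # (5, 3616)
```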

Then train the model:

model.fit(cached_train, epochs=3)

Epoch 1/3
10/10 [==============================] - 2s 26ms/step - root_mean_squared_error: 2.1718 - loss: 4.3303 - regularization_loss: 0.0000e+00 - total_loss: 4.3303
Epoch 2/3
10/10 [==============================] - 0s 8ms/step - root_mean_squared_error: 1.1227 - loss: 1.2602 - regularization_loss: 0.0000e+00 - total_loss: 1.2602
Epoch 3/3
10/10 [==============================] - 0s 8ms/step - root_mean_squared_error: 1.1162 - loss: 1.2456 - regularization_loss: 0.0000e+00 - total_loss: 1.2456
<keras.callbacks.History at 0x7f28389eaa90>

As the model trains, the loss is falling and the RMSE metric is improving.

Finally, we can evaluate our model on the test set:

model.evaluate(cached_test, return_dict=True)

5/5 [==============================] - 2s 14ms/step - root_mean_squared_error: 1.1108 - loss: 1.2287 - regularization_loss: 0.0000e+00 - total_loss: 1.2287
{'root_mean_squared_error': 1.1108061075210571,
 'loss': 1.2062578201293945,
 'regularization_loss': 0,
 'total_loss': 1.2062578201293945}

The lower the RMSE metric, the more accurate our model is at predicting ratings.

Testing the ranking model

Now we can test the ranking model by computing predictions for a set of movies and then ranking these movies based on the predictions:

test_ratings = {}
test_movie_titles = ["M*A*S*H (1970)", "Dances with Wolves (1990)", "Speed (1994)"]
for movie_title in test_movie_titles:
  test_ratings[movie_title] = model({
      "user_id": np.array(["42"]),
      "movie_title": np.array([movie_title])
  })

print("Ratings:")
for title, score in sorted(test_ratings.items(), key=lambda x: x[1], reverse=True):
  print(f"{title}: {score}")

Ratings:
M*A*S*H (1970): [[3.584712]]
Dances with Wolves (1990): [[3.551556]]
Speed (1994): [[3.5215874]]
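
The ranking step itself is just a sort on the predicted scores. The sketch below repeats it with plain Python floats copied from the output above, deliberately listed out of order; the variable names are illustrative. In the TensorFlow version each score is a tensor of shape (1, 1), so one option (an assumption, not something the tutorial does) would be to convert each score to a float first, for example with score.numpy().item(), to get cleaner printed values.

```python
# Predicted ratings for user "42", copied from the output above (unordered here).
predicted = {
    "Speed (1994)": 3.5215874,
    "M*A*S*H (1970)": 3.584712,
    "Dances with Wolves (1990)": 3.551556,
}

# Rank titles from the highest predicted rating to the lowest.
ranked = sorted(predicted.items(), key=lambda item: item[1], reverse=True)

for title, score in ranked:
    print(f"{title}: {score:.3f}")
# M*A*S*H (1970): 3.585
# Dances with Wolves (1990): 3.552
# Speed (1994): 3.522
```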

Exporting for serving

The model can be easily exported for serving:

tf.saved_model.save(model, "export")

2021-10-02 11:04:38.235611: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
WARNING:absl:Found untraced functions such as ranking_1_layer_call_and_return_conditional_losses, ranking_1_layer_call_fn, ranking_1_layer_call_fn, ranking_1_layer_call_and_return_conditional_losses, ranking_1_layer_call_and_return_conditional_losses while saving (showing 5 of 5). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: export/assets

We can now load it back and perform predictions:

loaded = tf.saved_model.load("export")

loaded({"user_id": np.array(["42"]), "movie_title": ["Speed (1994)"]}).numpy()

array([[3.5215874]], dtype=float32)

Next steps

The model above gives us a decent start towards building a ranking system.

Of course, making a practical ranking system requires much more effort.

In most cases, a ranking model can be substantially improved by using more features rather than just user and candidate identifiers. To see how to do that, have a look at the side features tutorial.

A careful understanding of the objectives worth optimizing is also necessary. To get started on building a recommender that optimizes multiple objectives, have a look at our multitask tutorial.