word2vec

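word2vec is not a single algorithm. Rather, it is a family of model architectures and optimizations that can be used to learn word embeddings from large datasets. Embeddings learned through word2vec have proven successful on a variety of downstream natural language processing tasks.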

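Note: This tutorial is based on Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality. It is not an exact implementation of the papers. Rather, it is intended to illustrate the key ideas.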

These papers propose two methods for learning representations of words:

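Continuous bag-of-words model: predicts the middle word based on the surrounding context words. The context consists of a few words before and after the current (middle) word. This architecture is called a bag-of-words model because the order of words in the context does not matter.
Continuous skip-gram model: predicts the words within a certain range before and after the current word in the same sentence. A worked example of this is given below.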
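To make the contrast between the two formulations concrete, here is a minimal pure-Python sketch (not taken from the papers) that builds training samples for both from one tokenized sentence. The helper names `cbow_samples` and `skipgram_samples` and the `window` parameter are hypothetical and chosen only for illustration.

```python
def cbow_samples(tokens, window):
    """Return (context_words, center_word) samples: many inputs, one output."""
    samples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        samples.append((context, center))
    return samples


def skipgram_samples(tokens, window):
    """Return (center_word, context_word) samples: one input, one output per pair."""
    samples = []
    for context, center in cbow_samples(tokens, window):
        samples.extend((center, word) for word in context)
    return samples


tokens = "the wide road shimmered in the hot sun".split()
print(cbow_samples(tokens, 1)[1])       # (['the', 'road'], 'wide')
print(skipgram_samples(tokens, 1)[:3])  # [('the', 'wide'), ('wide', 'the'), ('wide', 'road')]
```

The same window produces one sample per position for the bag-of-words view and one sample per (center, neighbor) pair for the skip-gram view.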

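You'll use the skip-gram approach in this tutorial. First, you'll explore skip-grams and other concepts using a single sentence for illustration. Next, you'll train your own word2vec model on a small dataset. This tutorial also contains code to export the trained embeddings and visualize them in the TensorFlow Embedding Projector.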

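Skip-gram and negative sampling

While a bag-of-words model predicts a word given the neighboring context, a skip-gram model predicts the context (or neighbors) of a word, given the word itself. The model is trained on skip-grams, which are n-grams that allow tokens to be skipped (see the diagram below for an example). The context of a word can be represented through a set of skip-gram pairs of (target_word, context_word), where context_word appears in the neighboring context of target_word.

Consider the following sentence of eight words: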

The wide road shimmered in the hot sun.

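The context words for each of the eight words of this sentence are defined by a window size. The window size determines the span of words on either side of a target_word that can be considered a context word. Below is a table of skip-grams for target words based on different window sizes.

Note: For this tutorial, a window size of n implies n words on each side of a word, for a total window span of 2*n+1 words.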

[Figure: word2vec_skipgrams. Table of skip-grams for target words at different window sizes.]
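The effect of the window size can be reproduced with a few lines of plain Python. This sketch lists the context words of one target position for window sizes 1 and 2; `context_for` is a hypothetical helper written for this illustration, not part of any library.

```python
def context_for(tokens, position, window_size):
    """Context words within `window_size` positions on either side of `position`."""
    lo = max(0, position - window_size)
    hi = min(len(tokens), position + window_size + 1)
    return [tokens[j] for j in range(lo, hi) if j != position]


tokens = "the wide road shimmered in the hot sun".split()
target = tokens.index("shimmered")  # position 3

for n in (1, 2):
    pairs = [("shimmered", word) for word in context_for(tokens, target, n)]
    print(f"window_size={n}: span={2 * n + 1} words, pairs={pairs}")
```

Near the sentence boundaries the window is simply clipped, so edge words get fewer context words than the nominal 2*n.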

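The training objective of the skip-gram model is to maximize the probability of predicting context words given the target word. For a sequence of words w₁, w₂, ... w_T, the objective can be written as the average log probability: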

$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$

where c is the size of the training context. The basic skip-gram formulation defines this probability using the softmax function:

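$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$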

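where v and v' are the target and context vector representations of words and W is the vocabulary size.

Computing the denominator of this formulation involves performing a full softmax over the entire vocabulary, which often has a large number of terms (10⁵ to 10⁷).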
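The cost of that denominator is easy to see in code. The sketch below evaluates the full softmax probability for one (input, output) pair using small made-up vectors. The names `full_softmax_prob`, `target_vecs`, and `context_vecs` are hypothetical, and the values are random numbers, not trained embeddings. Note that a single probability requires one dot product per vocabulary entry.

```python
import math
import random


def full_softmax_prob(target_vecs, context_vecs, w_in, w_out):
    """p(w_out | w_in) under the full softmax; touches every vocabulary row."""
    v_in = target_vecs[w_in]
    # One dot product per vocabulary word: this is the expensive part.
    scores = [sum(a * b for a, b in zip(row, v_in)) for row in context_vecs]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    return exps[w_out] / sum(exps)


rng = random.Random(0)
W, dim = 8, 4  # toy vocabulary size and embedding dimension (assumed values)
target_vecs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(W)]
context_vecs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(W)]

probs = [full_softmax_prob(target_vecs, context_vecs, 2, w) for w in range(W)]
print(sum(probs))  # approximately 1.0
```

With W in the range of 10⁵ to 10⁷, repeating this for every training pair is what motivates the cheaper approximations discussed next.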

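The noise contrastive estimation (NCE) loss function is an efficient approximation of the full softmax. Because the objective here is to learn word embeddings rather than to model the word distribution, the NCE loss can be simplified to use negative sampling.

The simplified negative sampling objective for a target word is to distinguish the context word from num_ns negative samples drawn from a noise distribution P_n(w) of words. More precisely, an efficient approximation of the full softmax over the vocabulary is, for a skip-gram pair, to pose the loss for a target word as a classification problem between the context word and num_ns negative samples.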
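As a rough illustration of this classification view, the loss for one skip-gram pair can be written with the logistic sigmoid: the negative log-sigmoid of the true context score, plus the negative log-sigmoid of the negated score of each of the num_ns negative samples. The plain-Python sketch below follows that form with made-up vectors. The function names are hypothetical, and this is not the loss used later in this tutorial.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Push the true context score up and each negative sample's score down."""
    loss = -log_sigmoid(dot(context_vec, target_vec))
    for negative_vec in negative_vecs:
        loss -= log_sigmoid(-dot(negative_vec, target_vec))
    return loss


target = [1.0, 0.5]
# Context aligned with the target, negatives pointing away: low loss.
good = negative_sampling_loss(target, [2.0, 1.0], [[-2.0, -1.0]] * 4)
# The reverse situation: high loss.
bad = negative_sampling_loss(target, [-2.0, -1.0], [[2.0, 1.0]] * 4)
print(good < bad)  # True
```

Only 1 + num_ns dot products are needed per pair, instead of one per vocabulary word.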

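A negative sample is defined as a (target_word, context_word) pair such that the context_word does not appear in the window_size neighborhood of the target_word. For the example sentence, these are a few potential negative samples (when window_size is 2).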

(hot, shimmered)
(wide, hot)
(wide, sun)
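The definition can be checked mechanically. The following sketch, using a hypothetical helper `is_negative_pair`, confirms that the three pairs above are valid negative samples for a window size of 2, while a true neighbor such as (wide, road) is not.

```python
def is_negative_pair(tokens, target, context, window_size):
    """True if `context` never occurs within `window_size` positions of any `target` occurrence."""
    for i, token in enumerate(tokens):
        if token != target:
            continue
        lo, hi = max(0, i - window_size), min(len(tokens), i + window_size + 1)
        if context in tokens[lo:i] + tokens[i + 1:hi]:
            return False
    return True


tokens = "the wide road shimmered in the hot sun".split()
for pair in [("hot", "shimmered"), ("wide", "hot"), ("wide", "sun"), ("wide", "road")]:
    print(pair, is_negative_pair(tokens, *pair, window_size=2))
```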

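In the next section, you'll generate skip-grams and negative samples for a single sentence. You'll also learn about subsampling techniques and train a classification model for positive and negative training examples later in the tutorial.

Setup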

import io
import re
import string
import tqdm

import numpy as np

import tensorflow as tf
from tensorflow.keras import layers


# Load the TensorBoard notebook extension
%load_ext tensorboard

SEED = 42
AUTOTUNE = tf.data.AUTOTUNE

Vectorize an example sentence

Consider the following sentence:

The wide road shimmered in the hot sun.

Tokenize the sentence:

sentence = "The wide road shimmered in the hot sun"
tokens = list(sentence.lower().split())
print(len(tokens))

Create a vocabulary to save mappings from tokens to integer indices:

vocab, index = {}, 1  # start indexing from 1
vocab['<pad>'] = 0  # add a padding token
for token in tokens:
  if token not in vocab:
    vocab[token] = index
    index += 1
vocab_size = len(vocab)
print(vocab)

{'<pad>': 0, 'the': 1, 'wide': 2, 'road': 3, 'shimmered': 4, 'in': 5, 'hot': 6, 'sun': 7}

Create an inverse vocabulary to save mappings from integer indices to tokens:

inverse_vocab = {index: token for token, index in vocab.items()}
print(inverse_vocab)

{0: '<pad>', 1: 'the', 2: 'wide', 3: 'road', 4: 'shimmered', 5: 'in', 6: 'hot', 7: 'sun'}

Vectorize your sentence:

example_sequence = [vocab[word] for word in tokens]
print(example_sequence)

[1, 2, 3, 4, 5, 1, 6, 7]

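Generate skip-grams from one sentence

The tf.keras.preprocessing.sequence module provides useful functions that simplify data preparation for word2vec. You can use tf.keras.preprocessing.sequence.skipgrams to generate skip-gram pairs from the example_sequence with a given window_size, from tokens in the range [0, vocab_size).

Note: negative_samples is set to 0 here, because batching the negative samples generated by this function requires a bit of code. You will use another function to perform negative sampling in the next section.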

window_size = 2
positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
      example_sequence,
      vocabulary_size=vocab_size,
      window_size=window_size,
      negative_samples=0)
print(len(positive_skip_grams))

Print a few positive skip-grams:

for target, context in positive_skip_grams[:5]:
  print(f"({target}, {context}): ({inverse_vocab[target]}, {inverse_vocab[context]})")

(2, 1): (wide, the)
(1, 7): (the, sun)
(1, 3): (the, road)
(6, 7): (hot, sun)
(3, 5): (road, in)

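Negative sampling for one skip-gram

The skipgrams function returns all positive skip-gram pairs by sliding over a given window span. To produce additional skip-gram pairs that serve as negative samples for training, you need to sample random words from the vocabulary. Use the tf.random.log_uniform_candidate_sampler function to sample num_ns negative samples for a given target word in a window. You call the function for one skip-gram's target word and pass the context word as the true class, with the intent of excluding it from being sampled. Note that the sample output below still contains the context word's index among the candidates, so this exclusion should not be assumed to be guaranteed.

Key point: num_ns (the number of negative samples per positive context word) in the [5, 20] range is reported to work best for smaller datasets, while num_ns in the [2, 5] range suffices for larger datasets.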

# Get target and context words for one positive skip-gram.
target_word, context_word = positive_skip_grams[0]

# Set the number of negative samples per positive context.
num_ns = 4

context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1, 1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,  # class that should be sampled as 'positive'
    num_true=1,  # each positive skip-gram has 1 positive context class
    num_sampled=num_ns,  # number of negative context words to sample
    unique=True,  # all the negative samples should be unique
    range_max=vocab_size,  # pick index of the samples from [0, vocab_size)
    seed=SEED,  # seed for reproducibility
    name="negative_sampling"  # name of this operation
)
print(negative_sampling_candidates)
print([inverse_vocab[index.numpy()] for index in negative_sampling_candidates])

tf.Tensor([2 1 4 3], shape=(4,), dtype=int64)
['wide', 'the', 'shimmered', 'road']

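Construct one training example

For a given positive (target_word, context_word) skip-gram, you now also have num_ns negative sampled context words that are meant not to appear in the window-size neighborhood of target_word. Batch the 1 positive context_word and the num_ns negative context words into one tensor. This produces a set of positive skip-grams (labeled as 1) and negative samples (labeled as 0) for each target word.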

# Reduce a dimension so you can use concatenation (in the next step).
squeezed_context_class = tf.squeeze(context_class, 1)

# Concatenate a positive context word with negative sampled words.
context = tf.concat([squeezed_context_class, negative_sampling_candidates], 0)

# Label the first context word as `1` (positive) followed by `num_ns` `0`s (negative).
label = tf.constant([1] + [0]*num_ns, dtype="int64")
target = target_word

Check out the context and the corresponding labels for the target word from the skip-gram example above:

print(f"target_index    : {target}")
print(f"target_word     : {inverse_vocab[target_word]}")
print(f"context_indices : {context}")
print(f"context_words   : {[inverse_vocab[c.numpy()] for c in context]}")
print(f"label           : {label}")

target_index    : 2
target_word     : wide
context_indices : [1 2 1 4 3]
context_words   : ['the', 'wide', 'the', 'shimmered', 'road']
label           : [1 0 0 0 0]
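In the printed example, index 1 (the positive context word "the") shows up again among the negative samples. A small check like the following sketch makes such collisions visible. `accidental_hits` is a hypothetical helper written for this illustration, and the two lists are copied from the output above.

```python
def accidental_hits(context, label):
    """Positions of negative samples (label 0) whose index equals a positive (label 1) index."""
    positives = {c for c, l in zip(context, label) if l == 1}
    return [i for i, (c, l) in enumerate(zip(context, label)) if l == 0 and c in positives]


context = [1, 2, 1, 4, 3]  # context_indices printed above
label = [1, 0, 0, 0, 0]
print(accidental_hits(context, label))  # [2]
```

On a vocabulary of only eight tokens such collisions are likely. On a realistically sized vocabulary they are much rarer, which is presumably why the tutorial does not filter them.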

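A tuple of (target, context, label) tensors constitutes one training example for training your skip-gram negative sampling word2vec model. Notice that the context and label have shape (1+num_ns,), while the target is a single index (printed below as a plain integer).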

print("target  :", target)
print("context :", context)
print("label   :", label)

target  : 2
context : tf.Tensor([1 2 1 4 3], shape=(5,), dtype=int64)
label   : tf.Tensor([1 0 0 0 0], shape=(5,), dtype=int64)

Summary

This diagram summarizes the procedure of generating a training example from a sentence:

[Figure: word2vec_negative_sampling. Procedure for generating a training example from a sentence.]

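Notice that the words temperature and code are not part of the input sentence. They belong to the vocabulary, like certain other indices used in the diagram above.

Compile all steps into one function

Skip-gram sampling table

A large dataset means a larger vocabulary with a higher number of more frequent words such as stopwords. Training examples obtained from sampling commonly occurring words (such as the, is, on) don't add much useful information for the model to learn from. Mikolov et al. suggest subsampling of frequent words as a helpful practice to improve embedding quality.

The tf.keras.preprocessing.sequence.skipgrams function accepts a sampling table argument to encode the probabilities of sampling any token. You can use tf.keras.preprocessing.sequence.make_sampling_table to generate a probabilistic sampling table based on word-frequency rank and pass it to the skipgrams function. Inspect the sampling probabilities for a vocab_size of 10: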

sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(size=10)
print(sampling_table)

[0.00315225 0.00315225 0.00547597 0.00741556 0.00912817 0.01068435
 0.01212381 0.01347162 0.01474487 0.0159558 ]

sampling_table[i] denotes the probability of sampling the i-th most common word in a dataset. The function assumes a Zipf's distribution of the word frequencies for sampling.
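The printed values can be reproduced in plain Python. The sketch below mirrors the formula as it appears in the Keras source to the best of my reading. The constant 0.577 and the default `sampling_factor=1e-5` are assumptions taken from that reading, and the local `sampling_table` function is a stand-in for the library call, not the library call itself.

```python
import math


def sampling_table(size, sampling_factor=1e-5):
    """Rank-based keep probabilities under an assumed Zipf-like frequency law."""
    gamma = 0.577  # approximation of the Euler-Mascheroni constant
    table = []
    for i in range(size):
        rank = max(i, 1)  # rank 0 is treated like rank 1
        # Approximate inverse frequency of the word at this rank.
        inv_freq = rank * (math.log(rank) + gamma) + 0.5 - 1.0 / (12.0 * rank)
        f = sampling_factor * inv_freq
        table.append(min(1.0, f / math.sqrt(f)))
    return table


print([round(p, 8) for p in sampling_table(10)])
```

The first two entries coincide because rank 0 is mapped to rank 1, which matches the repeated 0.00315225 in the output above. Rarer words (higher ranks) get a higher probability of being kept.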

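Key point: tf.random.log_uniform_candidate_sampler already assumes that the vocabulary frequency follows a log-uniform (Zipf's) distribution. Using this distribution-weighted sampling also helps approximate the noise contrastive estimation (NCE) loss with simpler loss functions for training a negative sampling objective.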
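For intuition, the log-uniform distribution assigns class i the probability (log(i + 2) - log(i + 1)) / log(range_max + 1), as documented for the sampler (worth double-checking against the TensorFlow API reference for your version). The following sketch evaluates it for the eight-token toy vocabulary; `log_uniform_prob` is a hypothetical helper.

```python
import math


def log_uniform_prob(class_id, range_max):
    """Probability of drawing `class_id` from a log-uniform distribution over [0, range_max)."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)


vocab_size = 8
probs = [log_uniform_prob(i, vocab_size) for i in range(vocab_size)]
for index, p in enumerate(probs):
    print(index, round(p, 4))
print(sum(probs))  # approximately 1.0
```

Low indices are sampled far more often than high ones, which is why the vocabulary should be ordered by decreasing frequency for this sampler to act as a reasonable noise distribution.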

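Generate training data

Compile all the steps described above into a function that can be called on a list of vectorized sentences obtained from any text dataset. Notice that the sampling table is built before sampling skip-gram word pairs. You will use this function in the later sections.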

# Generates skip-gram pairs with negative sampling for a list of sequences
# (int-encoded sentences) based on window size, number of negative samples
# and vocabulary size.
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
  # Elements of each training example are appended to these lists.
  targets, contexts, labels = [], [], []

  # Build the sampling table for `vocab_size` tokens.
  sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

  # Iterate over all sequences (sentences) in the dataset.
  for sequence in tqdm.tqdm(sequences):

    # Generate positive skip-gram pairs for a sequence (sentence).
    positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
          sequence,
          vocabulary_size=vocab_size,
          sampling_table=sampling_table,
          window_size=window_size,
          negative_samples=0)

    # Iterate over each positive skip-gram pair to produce training examples
    # with a positive context word and negative samples.
    for target_word, context_word in positive_skip_grams:
      context_class = tf.expand_dims(
          tf.constant([context_word], dtype="int64"), 1)
      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1,
          num_sampled=num_ns,
          unique=True,
          range_max=vocab_size,
          seed=seed,
          name="negative_sampling")

      # Build context and label vectors (for one target word)
      context = tf.concat([tf.squeeze(context_class,1), negative_sampling_candidates], 0)
      label = tf.constant([1] + [0]*num_ns, dtype="int64")

      # Append each element from the training example to global lists.
      targets.append(target_word)
      contexts.append(context)
      labels.append(label)

  return targets, contexts, labels

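Prepare training data for word2vec

With an understanding of how to work with one sentence for a skip-gram negative sampling based word2vec model, you can proceed to generate training examples from a larger list of sentences.

Download text corpus

You will use a text file of Shakespeare's writing for this tutorial. Change the following line to run this code on your own data.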

path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
1115394/1115394 [==============================] - 0s 0us/step

Read the text from the file and print the first few lines:

with open(path_to_file) as f:
  lines = f.read().splitlines()
for line in lines[:20]:
  print(line)

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.

Use the non-empty lines to construct a tf.data.TextLineDataset object for the next steps:

text_ds = tf.data.TextLineDataset(path_to_file).filter(lambda x: tf.cast(tf.strings.length(x), bool))


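Vectorize sentences from the corpus

You can use the TextVectorization layer to vectorize sentences from the corpus. Learn more about using this layer in the Text classification tutorial. Notice from the first few sentences above that the text needs to be in one case and punctuation needs to be removed. To do this, define a custom_standardization function that can be used in the TextVectorization layer.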

# Now, create a custom standardization function to lowercase the text and
# remove punctuation.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  return tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation), '')


# Define the vocabulary size and the number of words in a sequence.
vocab_size = 4096
sequence_length = 10

# Use the `TextVectorization` layer to normalize, split, and map strings to
# integers. Set the `output_sequence_length` length to pad all samples to the
# same length.
vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

Call TextVectorization.adapt on the text dataset to create the vocabulary:

vectorize_layer.adapt(text_ds.batch(1024))

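Once the state of the layer has been adapted to represent the text corpus, the vocabulary can be accessed with TextVectorization.get_vocabulary. This function returns a list of all vocabulary tokens sorted (descending) by their frequency.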

# Save the created vocabulary for reference.
inverse_vocab = vectorize_layer.get_vocabulary()
print(inverse_vocab[:20])

['', '[UNK]', 'the', 'and', 'to', 'i', 'of', 'you', 'my', 'a', 'that', 'in', 'is', 'not', 'for', 'with', 'me', 'it', 'be', 'your']

The vectorize_layer can now be used to generate vectors for each element in text_ds (a tf.data.Dataset). Apply Dataset.batch, Dataset.prefetch, Dataset.map, and Dataset.unbatch:

# Vectorize the data in text_ds.
text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()

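Obtain sequences from the dataset

You now have a tf.data.Dataset of integer-encoded sentences. To prepare the dataset for training a word2vec model, flatten the dataset into a list of sentence vector sequences. This step is required because you will iterate over each sentence in the dataset to produce positive and negative examples.

Note: Since the generate_training_data() function defined earlier uses non-TensorFlow Python/NumPy functions, you could also use a tf.py_function or tf.numpy_function with tf.data.Dataset.map.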

sequences = list(text_vector_ds.as_numpy_iterator())
print(len(sequences))

Inspect a few examples from sequences:

for seq in sequences[:5]:
  print(f"{seq} => {[inverse_vocab[i] for i in seq]}")

[ 89 270   0   0   0   0   0   0   0   0] => ['first', 'citizen', '', '', '', '', '', '', '', '']
[138  36 982 144 673 125  16 106   0   0] => ['before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak', '', '']
[34  0  0  0  0  0  0  0  0  0] => ['all', '', '', '', '', '', '', '', '', '']
[106 106   0   0   0   0   0   0   0   0] => ['speak', 'speak', '', '', '', '', '', '', '', '']
[ 89 270   0   0   0   0   0   0   0   0] => ['first', 'citizen', '', '', '', '', '', '', '', '']
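The padded rows above can be reproduced in plain Python to see what the standardize, split, index, and pad steps amount to. This is a rough sketch, not the TextVectorization implementation: `standardize` and `vectorize` are hypothetical helpers, the three-entry `toy_vocab` simply copies indices from the output above, and index 1 for unknown tokens mirrors the '[UNK]' entry of the printed vocabulary.

```python
import re
import string


def standardize(text):
    """Lowercase and strip ASCII punctuation, like custom_standardization."""
    return re.sub('[%s]' % re.escape(string.punctuation), '', text.lower())


def vectorize(text, vocab, sequence_length):
    """Map tokens to ids (1 = unknown), then truncate or zero-pad to a fixed length."""
    ids = [vocab.get(token, 1) for token in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


toy_vocab = {"first": 89, "citizen": 270, "speak": 106}  # indices copied from the output above
print(standardize("We know't, we know't."))        # we knowt we knowt
print(vectorize("First Citizen:", toy_vocab, 10))  # [89, 270, 0, 0, 0, 0, 0, 0, 0, 0]
print(vectorize("Speak, speak.", toy_vocab, 10))   # [106, 106, 0, 0, 0, 0, 0, 0, 0, 0]
```

Because most lines in this corpus are short, many rows are dominated by the padding index 0.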

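Generate training examples from sequences

sequences is now a list of int-encoded sentences. Call the generate_training_data function defined earlier to generate training examples for the word2vec model. To recap, the function iterates over each word from each sequence to collect positive and negative context words. The lengths of targets, contexts, and labels should be the same, representing the total number of training examples.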

targets, contexts, labels = generate_training_data(
    sequences=sequences,
    window_size=2,
    num_ns=4,
    vocab_size=vocab_size,
    seed=SEED)

targets = np.array(targets)
contexts = np.array(contexts)
labels = np.array(labels)

print('\n')
print(f"targets.shape: {targets.shape}")
print(f"contexts.shape: {contexts.shape}")
print(f"labels.shape: {labels.shape}")

100%|██████████| 32777/32777 [00:51<00:00, 637.32it/s]
targets.shape: (65598,)
contexts.shape: (65598, 5)
labels.shape: (65598, 5)

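Configure the dataset for performance

To perform efficient batching for the potentially large number of training examples, use the tf.data.Dataset API. After this step, you will have a tf.data.Dataset object of (target_word, context_word), (label) elements to train your word2vec model.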

BATCH_SIZE = 1024
BUFFER_SIZE = 10000
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset)

<BatchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int64, name=None), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None)), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None))>
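As a quick sanity check on the batching arithmetic, the sketch below uses the example count printed earlier (65598) and the batch size of 1024. The variable names are illustrative only.

```python
num_examples = 65598  # targets.shape[0] printed earlier
batch_size = 1024

# With drop_remainder=True, only full batches are kept.
full_batches, leftover = divmod(num_examples, batch_size)
print(full_batches, leftover)  # 64 62
```

The 62 leftover examples are dropped in each pass, and the 64 full batches correspond to the 64 steps per epoch shown in the training log later in this tutorial.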

Apply Dataset.cache and Dataset.prefetch to improve performance:

dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
print(dataset)

<PrefetchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int64, name=None), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None)), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None))>

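Model and training

The word2vec model can be implemented as a classifier to distinguish between true context words from skip-grams and false context words obtained through negative sampling. You can perform a dot product between the embeddings of target and context words to obtain predictions for labels and compute the loss function against the true labels in the dataset.

Subclassed word2vec model

Use the Keras Subclassing API to define your word2vec model with the following layers:

target_embedding: A tf.keras.layers.Embedding layer, which looks up the embedding of a word when it appears as a target word. The number of parameters in this layer is (vocab_size * embedding_dim).
context_embedding: Another tf.keras.layers.Embedding layer, which looks up the embedding of a word when it appears as a context word. The number of parameters in this layer is the same as in target_embedding, that is, (vocab_size * embedding_dim).
dots: A tf.keras.layers.Dot layer that computes the dot product of target and context embeddings from a training pair.
flatten: A tf.keras.layers.Flatten layer to flatten the results of the dots layer into logits.

With the subclassed model, you can define the call() function that accepts (target, context) pairs, which can then be passed into their corresponding embedding layers. Reshape the context_embedding to perform a dot product with target_embedding and return the flattened result. In the code below, tf.einsum performs the dot product and returns logits of shape (batch, context) directly, so no separate Dot or Flatten layer appears.

Key point: The target_embedding and context_embedding layers can be shared as well. You could also use a concatenation of both embeddings as the final word2vec embedding.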

class Word2Vec(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim):
    super(Word2Vec, self).__init__()
    self.target_embedding = layers.Embedding(vocab_size,
                                      embedding_dim,
                                      input_length=1,
                                      name="w2v_embedding")
    self.context_embedding = layers.Embedding(vocab_size,
                                       embedding_dim,
                                       input_length=num_ns+1)

  def call(self, pair):
    target, context = pair
    # target: (batch, dummy?)  # The dummy axis doesn't exist in TF2.7+
    # context: (batch, context)
    if len(target.shape) == 2:
      target = tf.squeeze(target, axis=1)
    # target: (batch,)
    word_emb = self.target_embedding(target)
    # word_emb: (batch, embed)
    context_emb = self.context_embedding(context)
    # context_emb: (batch, context, embed)
    dots = tf.einsum('be,bce->bc', word_emb, context_emb)
    # dots: (batch, context)
    return dots
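The contraction 'be,bce->bc' used in call() can be spelled out with plain loops: for every example in the batch, the single target embedding is dotted with each of its context embeddings. The sketch below uses nested Python lists and made-up numbers; `batched_dots` is a hypothetical name.

```python
def batched_dots(word_emb, context_emb):
    """Pure-Python equivalent of einsum('be,bce->bc', word_emb, context_emb)."""
    return [
        [sum(w * c for w, c in zip(word_vec, ctx_vec)) for ctx_vec in ctx_rows]
        for word_vec, ctx_rows in zip(word_emb, context_emb)
    ]


word_emb = [[1.0, 0.0], [0.5, 0.5]]                   # shape (batch=2, embed=2)
context_emb = [[[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]],  # shape (batch=2, context=3, embed=2)
               [[2.0, 2.0], [1.0, -1.0], [0.0, 4.0]]]
print(batched_dots(word_emb, context_emb))  # [[1.0, 3.0, 0.0], [2.0, 0.0, 2.0]]
```

Each output row holds one logit per candidate context word, which lines up with the (batch, 1 + num_ns) label tensor.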

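Define loss function and compile model

For simplicity, you can use tf.keras.losses.CategoricalCrossentropy as an alternative to the negative sampling loss. If you would like to write your own custom loss function, you can also do so as follows: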

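def custom_loss(y_true, y_pred):
  # Keras calls losses as loss(y_true, y_pred); labels must share the logits' dtype.
  labels = tf.cast(y_true, y_pred.dtype)
  return tf.nn.sigmoid_cross_entropy_with_logits(logits=y_pred, labels=labels)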

It's time to build your model! Instantiate the Word2Vec class with an embedding dimension of 128 (you could experiment with different values), and compile the model with the tf.keras.optimizers.Adam optimizer.

embedding_dim = 128
word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(optimizer='adam',
                 loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])
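To get a feel for the numbers this loss produces, the sketch below computes softmax cross-entropy from logits by hand for one training example with one positive and four negative context words. `softmax_cross_entropy` is a hypothetical helper, not the Keras implementation, and the logits are made up.

```python
import math


def softmax_cross_entropy(logits, one_hot):
    """Cross-entropy between a one-hot label and softmax(logits), computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return -sum(y * (z - log_norm) for z, y in zip(logits, one_hot))


label = [1, 0, 0, 0, 0]
untrained = softmax_cross_entropy([0.0] * 5, label)             # all logits equal
confident = softmax_cross_entropy([4.0, 0.0, 0.0, 0.0, 0.0], label)
print(untrained)  # log(5), about 1.609
print(confident)
```

An untrained model with near-zero logits sits at log(5) ≈ 1.609, which is close to the first-epoch loss of 1.6082 in the training log that follows. The loss falls as the positive logit grows relative to the negatives.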

Also define a callback to log training statistics for TensorBoard:

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

Train the model on the dataset for some number of epochs:

word2vec.fit(dataset, epochs=20, callbacks=[tensorboard_callback])

Epoch 1/20
64/64 [==============================] - 10s 147ms/step - loss: 1.6082 - accuracy: 0.2331
Epoch 2/20
64/64 [==============================] - 0s 3ms/step - loss: 1.5883 - accuracy: 0.5508
Epoch 3/20
64/64 [==============================] - 0s 3ms/step - loss: 1.5387 - accuracy: 0.5880
Epoch 4/20
64/64 [==============================] - 0s 3ms/step - loss: 1.4541 - accuracy: 0.5629
Epoch 5/20
64/64 [==============================] - 0s 3ms/step - loss: 1.3560 - accuracy: 0.5719
Epoch 6/20
64/64 [==============================] - 0s 3ms/step - loss: 1.2605 - accuracy: 0.6006
Epoch 7/20
64/64 [==============================] - 0s 3ms/step - loss: 1.1717 - accuracy: 0.6352
Epoch 8/20
64/64 [==============================] - 0s 3ms/step - loss: 1.0892 - accuracy: 0.6698
Epoch 9/20
64/64 [==============================] - 0s 3ms/step - loss: 1.0121 - accuracy: 0.7037
Epoch 10/20
64/64 [==============================] - 0s 3ms/step - loss: 0.9403 - accuracy: 0.7342
Epoch 11/20
64/64 [==============================] - 0s 3ms/step - loss: 0.8735 - accuracy: 0.7619
Epoch 12/20
64/64 [==============================] - 0s 3ms/step - loss: 0.8116 - accuracy: 0.7850
Epoch 13/20
64/64 [==============================] - 0s 3ms/step - loss: 0.7545 - accuracy: 0.8042
Epoch 14/20
64/64 [==============================] - 0s 3ms/step - loss: 0.7021 - accuracy: 0.8224
Epoch 15/20
64/64 [==============================] - 0s 3ms/step - loss: 0.6542 - accuracy: 0.8377
Epoch 16/20
64/64 [==============================] - 0s 3ms/step - loss: 0.6104 - accuracy: 0.8513
Epoch 17/20
64/64 [==============================] - 0s 3ms/step - loss: 0.5705 - accuracy: 0.8643
Epoch 18/20
64/64 [==============================] - 0s 3ms/step - loss: 0.5342 - accuracy: 0.8746
Epoch 19/20
64/64 [==============================] - 0s 3ms/step - loss: 0.5011 - accuracy: 0.8848
Epoch 20/20
64/64 [==============================] - 0s 3ms/step - loss: 0.4711 - accuracy: 0.8930
<keras.callbacks.History at 0x7f22a8241c40>

TensorBoard now shows the word2vec model's accuracy and loss:

#docs_infra: no_execute
%tensorboard --logdir logs

Embedding lookup and analysis

Obtain the weights from the model using Model.get_layer and Layer.get_weights. The TextVectorization.get_vocabulary function provides the vocabulary to build a metadata file with one token per line.

weights = word2vec.get_layer('w2v_embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

Create and save the vectors and metadata files:

out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')

for index, word in enumerate(vocab):
  if index == 0:
    continue  # skip 0, it's padding.
  vec = weights[index]
  out_v.write('\t'.join([str(x) for x in vec]) + "\n")
  out_m.write(word + "\n")
out_v.close()
out_m.close()

Download vectors.tsv and metadata.tsv to analyze the obtained embeddings in the Embedding Projector:

try:
  from google.colab import files
  files.download('vectors.tsv')
  files.download('metadata.tsv')
except Exception:
  pass

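Next steps

This tutorial has shown you how to implement a skip-gram word2vec model with negative sampling from scratch and visualize the obtained word embeddings.

To learn more about word vectors and their mathematical representations, refer to these notes.
To learn more about advanced text processing, read the Transformer model for language understanding tutorial.
If you're interested in pre-trained embedding models, you may also be interested in Exploring the TF-Hub CORD-19 Swivel Embeddings, or the Multilingual Universal Sentence Encoder.
You may also like to train the model on a new dataset (there are many available in TensorFlow Datasets).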