Approximate Nearest Neighbor(ANN) 및 텍스트 임베딩을 사용한 의미론적 검색

TensorFlow.org에서 보기 Google Colab에서 실행 GitHub에서 소스 보기 노트북 다운로드 TF Hub 모델보기

이 튜토리얼에서는 입력 데이터가 제공된 TensorFlow Hub(TF-Hub) 모듈에서 임베딩을 생성하고 추출된 임베딩을 사용하여 approximate nearest neighbour(ANN) 인덱스를 빌드하는 방법을 보여줍니다. 그런 다음 이 인덱스를 실시간 유사성 일치 및 검색에 사용할 수 있습니다.

많은 양의 데이터를 처리할 때 전체 리포지토리를 스캔하여 주어진 쿼리와 가장 유사한 항목을 실시간으로 찾는 식으로 정확한 일치 작업을 수행하는 것은 효율적이지 않습니다. 따라서 속도를 크게 높이기 위해 정확한 nearest neighbor(NN) 일치를 찾을 때 약간의 정확성을 절충할 수 있는 근사 유사성 일치 알고리즘을 사용합니다.

이 튜토리얼에서는 쿼리와 가장 유사한 헤드라인을 찾기 위해 뉴스 헤드라인 자료의 텍스트를 실시간으로 검색하는 예를 보여줍니다. 키워드 검색과 달리 이 검색으로 텍스트 임베딩에 인코딩된 의미론적 유사성이 포착됩니다.

이 튜토리얼의 단계는 다음과 같습니다.

  1. 샘플 데이터를 다운로드합니다.
  2. TF-Hub 모듈을 사용하여 데이터에 대한 임베딩을 생성합니다.
  3. 임베딩에 대한 ANN 인덱스를 빌드합니다.
  4. 유사성 일치에 인덱스를 사용합니다.

TensorFlow Transform(TF-Transform)과 함께 Apache Beam을 사용하여 TF-Hub 모듈에서 임베딩을 생성합니다. 또한 Spotify의 ANNOY 라이브러리를 사용하여 nearest neighbour(NN) 인덱스를 빌드합니다. 이 Github 리포지토리에서 ANN 프레임워크의 벤치마킹을 찾을 수 있습니다.

이 튜토리얼에서는 TensorFlow 1.0을 사용하며 TF-Hub의 TF1 Hub 모듈에서만 동작합니다. 본 튜토리얼의 TF2 업데이트 버전을 참조하세요.

설정

필요한 라이브러리를 설치합니다.

pip install -q apache_beam
pip install -q 'scikit_learn~=0.23.0'  # For gaussian_random_matrix.
pip install -q annoy

필요한 라이브러리를 가져옵니다.

import os
import sys
import pathlib
import pickle
from collections import namedtuple
from datetime import datetime

import numpy as np
import apache_beam as beam
import annoy
from sklearn.random_projection import gaussian_random_matrix

import tensorflow.compat.v1 as tf
import tensorflow_hub as hub
# TFT needs to be installed afterwards
!pip install -q tensorflow_transform==0.24
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
print('TF version: {}'.format(tf.__version__))
print('TF-Hub version: {}'.format(hub.__version__))
print('TF-Transform version: {}'.format(tft.__version__))
print('Apache Beam version: {}'.format(beam.__version__))
TF version: 2.6.0
TF-Hub version: 0.12.0
TF-Transform version: 0.24.0
Apache Beam version: 2.31.0

1. 샘플 데이터 다운로드하기

A Million News Headlines 데이터세트에는 평판이 좋은 Australian Broadcasting Corp. (ABC)에서 공급한 15년치의 뉴스 헤드라인이 수록되어 있습니다. 이 뉴스 데이터세트에는 호주에 보다 세분화된 초점을 두고 2003년 초부터 2017년 말까지 전 세계적으로 일어난 주목할만한 사건에 대한 역사적 기록이 요약되어 있습니다.

형식: 탭으로 구분된 2열 데이터: 1) 발행일 및 2) 헤드라인 텍스트. 여기서는 헤드라인 텍스트에만 관심이 있습니다.

wget 'https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true' -O raw.tsv
wc -l raw.tsv
head raw.tsv
--2021-08-14 00:48:53--  https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true
Resolving dataverse.harvard.edu (dataverse.harvard.edu)... 72.44.40.54, 54.162.175.159, 18.211.119.52
Connecting to dataverse.harvard.edu (dataverse.harvard.edu)|72.44.40.54|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57600231 (55M) [text/tab-separated-values]
Saving to: ‘raw.tsv’

raw.tsv             100%[===================>]  54.93M  14.7MB/s    in 4.4s    

2021-08-14 00:48:59 (12.4 MB/s) - ‘raw.tsv’ saved [57600231/57600231]

1103664 raw.tsv
publish_date    headline_text
20030219    "aba decides against community broadcasting licence"
20030219    "act fire witnesses must be aware of defamation"
20030219    "a g calls for infrastructure protection summit"
20030219    "air nz staff in aust strike for pay rise"
20030219    "air nz strike to affect australian travellers"
20030219    "ambitious olsson wins triple jump"
20030219    "antic delighted with record breaking barca"
20030219    "aussie qualifier stosur wastes four memphis match"
20030219    "aust addresses un security council over iraq"

단순화를 위해 헤드라인 텍스트만 유지하고 발행일은 제거합니다.

!rm -r corpus
!mkdir corpus

with open('corpus/text.txt', 'w') as out_file:
  with open('raw.tsv', 'r') as in_file:
    for line in in_file:
      headline = line.split('\t')[1].strip().strip('"')
      out_file.write(headline+"\n")
rm: cannot remove 'corpus': No such file or directory
tail corpus/text.txt
severe storms forecast for nye in south east queensland
snake catcher pleads for people not to kill reptiles
south australia prepares for party to welcome new year
strikers cool off the heat with big win in adelaide
stunning images from the sydney to hobart yacht
the ashes smiths warners near miss liven up boxing day test
timelapse: brisbanes new year fireworks
what 2017 meant to the kids of australia
what the papodopoulos meeting may mean for ausus
who is george papadopoulos the former trump campaign aide

TF-Hub 모듈을 로드하는 도우미 함수

def load_module(module_url):
  embed_module = hub.Module(module_url)
  placeholder = tf.placeholder(dtype=tf.string)
  embed = embed_module(placeholder)
  session = tf.Session()
  session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  print('TF-Hub module is loaded.')

  def _embeddings_fn(sentences):
    computed_embeddings = session.run(
        embed, feed_dict={placeholder: sentences})
    return computed_embeddings

  return _embeddings_fn

2. 데이터에 대한 임베딩 생성하기

이 튜토리얼에서는 Universal Sentence Encoder를 사용하여 헤드라인 데이터에 대한 임베딩을 생성합니다. 그런 다음 문장 임베딩을 사용하여 문장 수준의 의미 유사성을 쉽게 계산할 수 있습니다. Apache Beam과 TF-Transform을 사용하여 임베딩 생성 프로세스를 실행합니다.

임베딩 추출 메서드

encoder = None

def embed_text(text, module_url, random_projection_matrix):
  # Beam will run this function in different processes that need to
  # import hub and load embed_fn (if not previously loaded)
  global encoder
  if not encoder:
    encoder = hub.Module(module_url)
  embedding = encoder(text)
  if random_projection_matrix is not None:
    # Perform random projection for the embedding
    embedding = tf.matmul(
        embedding, tf.cast(random_projection_matrix, embedding.dtype))
  return embedding

TFT preprocess_fn 메서드 만들기

def make_preprocess_fn(module_url, random_projection_matrix=None):
  '''Makes a tft preprocess_fn'''

  def _preprocess_fn(input_features):
    '''tft preprocess_fn'''
    text = input_features['text']
    # Generate the embedding for the input text
    embedding = embed_text(text, module_url, random_projection_matrix)

    output_features = {
        'text': text, 
        'embedding': embedding
        }

    return output_features

  return _preprocess_fn

데이터세트 메타데이터 만들기

def create_metadata():
  '''Creates metadata for the raw data'''
  from tensorflow_transform.tf_metadata import dataset_metadata
  from tensorflow_transform.tf_metadata import schema_utils
  feature_spec = {'text': tf.FixedLenFeature([], dtype=tf.string)}
  schema = schema_utils.schema_from_feature_spec(feature_spec)
  metadata = dataset_metadata.DatasetMetadata(schema)
  return metadata

Beam 파이프라인

def run_hub2emb(args):
  '''Runs the embedding generation pipeline'''

  options = beam.options.pipeline_options.PipelineOptions(**args)
  args = namedtuple("options", args.keys())(*args.values())

  raw_metadata = create_metadata()
  converter = tft.coders.CsvCoder(
      column_names=['text'], schema=raw_metadata.schema)

  with beam.Pipeline(args.runner, options=options) as pipeline:
    with tft_beam.Context(args.temporary_dir):
      # Read the sentences from the input file
      sentences = ( 
          pipeline
          | 'Read sentences from files' >> beam.io.ReadFromText(
              file_pattern=args.data_dir)
          | 'Convert to dictionary' >> beam.Map(converter.decode)
      )

      sentences_dataset = (sentences, raw_metadata)
      preprocess_fn = make_preprocess_fn(args.module_url, args.random_projection_matrix)
      # Generate the embeddings for the sentence using the TF-Hub module
      embeddings_dataset, _ = (
          sentences_dataset
          | 'Extract embeddings' >> tft_beam.AnalyzeAndTransformDataset(preprocess_fn)
      )

      embeddings, transformed_metadata = embeddings_dataset
      # Write the embeddings to TFRecords files
      embeddings | 'Write embeddings to TFRecords' >> beam.io.tfrecordio.WriteToTFRecord(
          file_path_prefix='{}/emb'.format(args.output_dir),
          file_name_suffix='.tfrecords',
          coder=tft.coders.ExampleProtoCoder(transformed_metadata.schema))

무작위 투영 가중치 행렬 생성하기

무작위 투영은 유클리드 공간에 있는 점 집합의 차원을 줄이는 데 사용되는 간단하지만 강력한 기술입니다. 이론적 배경은 Johnson-Lindenstrauss 보조 정리를 참조하세요.

무작위 투영으로 임베딩의 차원을 줄이면 ANN 인덱스를 빌드하고 쿼리하는 데 필요한 시간이 줄어듭니다.

이 튜토리얼에서는 Scikit-learn 라이브러리의 가우스 무작위 투영을 사용합니다.

def generate_random_projection_weights(original_dim, projected_dim):
  random_projection_matrix = None
  if projected_dim and original_dim > projected_dim:
    random_projection_matrix = gaussian_random_matrix(
        n_components=projected_dim, n_features=original_dim).T
    print("A Gaussian random weight matrix was creates with shape of {}".format(random_projection_matrix.shape))
    print('Storing random projection matrix to disk...')
    with open('random_projection_matrix', 'wb') as handle:
      pickle.dump(random_projection_matrix, 
                  handle, protocol=pickle.HIGHEST_PROTOCOL)

  return random_projection_matrix

매개변수 설정하기

무작위 투영 없이 원래 임베딩 공간을 사용하여 인덱스를 빌드하려면 projected_dim 매개변수를 None으로 설정합니다. 그러면 높은 차원의 임베딩에 대한 인덱싱 스텝이 느려집니다.

파이프라인 실행하기

import tempfile

output_dir = pathlib.Path(tempfile.mkdtemp())
temporary_dir = pathlib.Path(tempfile.mkdtemp())

g = tf.Graph()
with g.as_default():
  original_dim = load_module(module_url)(['']).shape[1]
  random_projection_matrix = None

  if projected_dim:
    random_projection_matrix = generate_random_projection_weights(
        original_dim, projected_dim)

args = {
    'job_name': 'hub2emb-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S')),
    'runner': 'DirectRunner',
    'batch_size': 1024,
    'data_dir': 'corpus/*.txt',
    'output_dir': output_dir,
    'temporary_dir': temporary_dir,
    'module_url': module_url,
    'random_projection_matrix': random_projection_matrix,
}

print("Pipeline args are set.")
args
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
2021-08-14 00:49:22.687543: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:22.696705: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:22.697661: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:22.699546: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-14 00:49:22.700069: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:22.701210: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:22.702228: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:23.252248: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:23.253245: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:23.254261: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:23.255168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0
TF-Hub module is loaded.
A Gaussian random weight matrix was creates with shape of (512, 64)
Storing random projection matrix to disk...
Pipeline args are set.
/tmpfs/src/tf_docs_env/lib/python3.7/site-packages/sklearn/utils/deprecation.py:86: FutureWarning: Function gaussian_random_matrix is deprecated; gaussian_random_matrix is deprecated in 0.22 and will be removed in version 0.24.
  warnings.warn(msg, category=FutureWarning)
{'job_name': 'hub2emb-210814-004928',
 'runner': 'DirectRunner',
 'batch_size': 1024,
 'data_dir': 'corpus/*.txt',
 'output_dir': PosixPath('/tmp/tmp_h614evn'),
 'temporary_dir': PosixPath('/tmp/tmpewtas0ma'),
 'module_url': 'https://tfhub.dev/google/universal-sentence-encoder/2',
 'random_projection_matrix': array([[ 0.08429973,  0.02862848,  0.0283129 , ..., -0.06366567,
         -0.07289439,  0.09433901],
        [-0.09943071, -0.00912436,  0.03518332, ..., -0.07089249,
          0.15142502, -0.25058083],
        [ 0.04935145, -0.06591867, -0.00034778, ..., -0.28690692,
         -0.14622915, -0.01798296],
        ...,
        [-0.11658382, -0.28600734, -0.21268456, ...,  0.10686914,
         -0.10561194, -0.16668612],
        [-0.22833123, -0.06371826,  0.13712559, ..., -0.19414457,
         -0.03959951,  0.16588403],
        [ 0.00710554, -0.08813146,  0.15991346, ..., -0.04155164,
          0.06329731, -0.0267125 ]])}
!rm -r {output_dir}
!rm -r {temporary_dir}

print("Running pipeline...")
%time run_hub2emb(args)
print("Pipeline is done.")
WARNING:apache_beam.runners.interactive.interactive_environment:Dependencies required for Interactive Beam PCollection visualization are not available, please use: `pip install apache-beam[interactive]` to install necessary dependencies to enable all data visualization features.
Running pipeline...
WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended.
WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended.
WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended.
WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended.
WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).
WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).

---------------------------------------------------------------------------

ModuleNotFoundError                       Traceback (most recent call last)

/tmpfs/src/tf_docs_env/lib/python3.7/site-packages/pyarrow/pandas-shim.pxi in pyarrow.lib._PandasAPIShim._check_import()


/tmpfs/src/tf_docs_env/lib/python3.7/site-packages/pyarrow/pandas-shim.pxi in pyarrow.lib._PandasAPIShim._import_pandas()


ModuleNotFoundError: No module named 'pyarrow.vendored'

Exception ignored in: 'pyarrow.lib._PandasAPIShim._have_pandas_internal'
Traceback (most recent call last):
  File "pyarrow/pandas-shim.pxi", line 110, in pyarrow.lib._PandasAPIShim._check_import
  File "pyarrow/pandas-shim.pxi", line 56, in pyarrow.lib._PandasAPIShim._import_pandas
ModuleNotFoundError: No module named 'pyarrow.vendored'
2021-08-14 00:49:30.064033: W tensorflow/core/common_runtime/graph_constructor.cc:1511] Importing a graph with a lower producer version 26 into an existing graph with producer version 808. Shape inference will have run different parts of the graph with different producer versions.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
2021-08-14 00:49:31.632145: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:31.632728: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:31.633092: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:31.633512: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:31.633860: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:31.634185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow/python/saved_model/signature_def_utils_impl.py:201: build_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow/python/saved_model/signature_def_utils_impl.py:201: build_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: /tmp/tmpewtas0ma/tftransform_tmp/6af6832ff290461aafc6fea39c8f0d86/saved_model.pb
INFO:tensorflow:SavedModel written to: /tmp/tmpewtas0ma/tftransform_tmp/6af6832ff290461aafc6fea39c8f0d86/saved_model.pb
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow_transform/tf_utils.py:218: Tensor.experimental_ref (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use ref() instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow_transform/tf_utils.py:218: Tensor.experimental_ref (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use ref() instead.
WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended.
WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended.
WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).
WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['-f', '/tmp/tmp_97ou08j.json', '--HistoryManager.hist_file=:memory:']
WARNING:apache_beam.options.pipeline_options:Discarding invalid overrides: {'batch_size': 1024, 'data_dir': 'corpus/*.txt', 'output_dir': PosixPath('/tmp/tmp_h614evn'), 'temporary_dir': PosixPath('/tmp/tmpewtas0ma'), 'module_url': 'https://tfhub.dev/google/universal-sentence-encoder/2', 'random_projection_matrix': array([[ 0.08429973,  0.02862848,  0.0283129 , ..., -0.06366567,
        -0.07289439,  0.09433901],
       [-0.09943071, -0.00912436,  0.03518332, ..., -0.07089249,
         0.15142502, -0.25058083],
       [ 0.04935145, -0.06591867, -0.00034778, ..., -0.28690692,
        -0.14622915, -0.01798296],
       ...,
       [-0.11658382, -0.28600734, -0.21268456, ...,  0.10686914,
        -0.10561194, -0.16668612],
       [-0.22833123, -0.06371826,  0.13712559, ..., -0.19414457,
        -0.03959951,  0.16588403],
       [ 0.00710554, -0.08813146,  0.15991346, ..., -0.04155164,
         0.06329731, -0.0267125 ]])}
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.
2021-08-14 00:49:35.940398: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:35.940906: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:35.941290: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:35.941754: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:35.942120: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:35.942442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0
2021-08-14 00:49:42.667032: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:42.667678: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:42.668112: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:42.668602: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:42.669066: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:49:42.669412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0
WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.
CPU times: user 2min 44s, sys: 6.68 s, total: 2min 51s
Wall time: 2min 35s
Pipeline is done.
ls {output_dir}
emb-00000-of-00001.tfrecords

생성된 임베딩의 일부를 읽습니다.

import itertools

embed_file = os.path.join(output_dir, 'emb-00000-of-00001.tfrecords')
sample = 5
record_iterator =  tf.io.tf_record_iterator(path=embed_file)
for string_record in itertools.islice(record_iterator, sample):
  example = tf.train.Example()
  example.ParseFromString(string_record)
  text = example.features.feature['text'].bytes_list.value
  embedding = np.array(example.features.feature['embedding'].float_list.value)
  print("Embedding dimensions: {}".format(embedding.shape[0]))
  print("{}: {}".format(text, embedding[:10]))
WARNING:tensorflow:From /tmp/ipykernel_17013/2258356591.py:5: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
WARNING:tensorflow:From /tmp/ipykernel_17013/2258356591.py:5: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
Embedding dimensions: 64
[b'headline_text']: [-0.21006317 -0.22012915  0.2056008  -0.04907456 -0.1095423  -0.04155223
 -0.02227692  0.14556509 -0.10686573 -0.0118738 ]
Embedding dimensions: 64
[b'aba decides against community broadcasting licence']: [ 0.15086523 -0.03653343 -0.05894864 -0.11833061  0.02938757 -0.05750314
 -0.01783411 -0.07162338 -0.09147523  0.11616055]
Embedding dimensions: 64
[b'act fire witnesses must be aware of defamation']: [-0.07128062  0.05750276 -0.01026567 -0.19047701  0.06809171 -0.24444363
 -0.1406101   0.00596204  0.17275015  0.02916648]
Embedding dimensions: 64
[b'a g calls for infrastructure protection summit']: [ 0.09898694  0.01707831 -0.19118081 -0.34395882 -0.04306564 -0.14949776
 -0.23413622 -0.00929208 -0.15094939  0.02297503]
Embedding dimensions: 64
[b'air nz staff in aust strike for pay rise']: [-0.06256607  0.1949358  -0.12793431 -0.23692016  0.15559241 -0.174711
 -0.0854648  -0.04562813  0.01706587 -0.07893752]

3. 임베딩을 위한 ANN 인덱스 빌드하기

Approximate Nearest Neighbors Oh Yeah(ANNOY)는 주어진 쿼리 포인트에 가까운 공간에서 포인트를 검색하기 위한 Python 바인딩이 있는 C++ 라이브러리입니다. 또한 ANNOY는 메모리에 매핑되는 대규모 읽기 전용 파일 기반 데이터 구조를 만들며, Spotify에서 음악 추천을 위해 빌드하고 사용합니다.

def build_index(embedding_files_pattern, index_filename, vector_length, 
    metric='angular', num_trees=100):
  '''Builds an ANNOY index'''

  annoy_index = annoy.AnnoyIndex(vector_length, metric=metric)
  # Mapping between the item and its identifier in the index
  mapping = {}

  embed_files = tf.gfile.Glob(embedding_files_pattern)
  print('Found {} embedding file(s).'.format(len(embed_files)))

  item_counter = 0
  for f, embed_file in enumerate(embed_files):
    print('Loading embeddings in file {} of {}...'.format(
      f+1, len(embed_files)))
    record_iterator = tf.io.tf_record_iterator(
      path=embed_file)

    for string_record in record_iterator:
      example = tf.train.Example()
      example.ParseFromString(string_record)
      text = example.features.feature['text'].bytes_list.value[0].decode("utf-8")
      mapping[item_counter] = text
      embedding = np.array(
        example.features.feature['embedding'].float_list.value)
      annoy_index.add_item(item_counter, embedding)
      item_counter += 1
      if item_counter % 100000 == 0:
        print('{} items loaded to the index'.format(item_counter))

  print('A total of {} items added to the index'.format(item_counter))

  print('Building the index with {} trees...'.format(num_trees))
  annoy_index.build(n_trees=num_trees)
  print('Index is successfully built.')

  print('Saving index to disk...')
  annoy_index.save(index_filename)
  print('Index is saved to disk.')
  print("Index file size: {} GB".format(
    round(os.path.getsize(index_filename) / float(1024 ** 3), 2)))
  annoy_index.unload()

  print('Saving mapping to disk...')
  with open(index_filename + '.mapping', 'wb') as handle:
    pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)
  print('Mapping is saved to disk.')
  print("Mapping file size: {} MB".format(
    round(os.path.getsize(index_filename + '.mapping') / float(1024 ** 2), 2)))
embedding_files = "{}/emb-*.tfrecords".format(output_dir)
embedding_dimension = projected_dim
index_filename = "index"

!rm {index_filename}
!rm {index_filename}.mapping

%time build_index(embedding_files, index_filename, embedding_dimension)
rm: cannot remove 'index': No such file or directory
rm: cannot remove 'index.mapping': No such file or directory
Found 1 embedding file(s).
Loading embeddings in file 1 of 1...
100000 items loaded to the index
200000 items loaded to the index
300000 items loaded to the index
400000 items loaded to the index
500000 items loaded to the index
600000 items loaded to the index
700000 items loaded to the index
800000 items loaded to the index
900000 items loaded to the index
1000000 items loaded to the index
1100000 items loaded to the index
A total of 1103664 items added to the index
Building the index with 100 trees...
Index is successfully built.
Saving index to disk...
Index is saved to disk.
Index file size: 1.66 GB
Saving mapping to disk...
Mapping is saved to disk.
Mapping file size: 50.61 MB
CPU times: user 6min 12s, sys: 3.69 s, total: 6min 15s
Wall time: 1min 38s
ls
corpus  index.mapping         raw.tsv
index   random_projection_matrix  semantic_approximate_nearest_neighbors.ipynb

4. 유사성 일치에 인덱스 사용하기

이제 ANN 인덱스를 사용하여 의미상 입력 쿼리에 가까운 뉴스 헤드라인을 찾을 수 있습니다.

인덱스 및 매핑 파일 로드하기

index = annoy.AnnoyIndex(embedding_dimension)
index.load(index_filename, prefault=True)
print('Annoy index is loaded.')
with open(index_filename + '.mapping', 'rb') as handle:
  mapping = pickle.load(handle)
print('Mapping file is loaded.')
Annoy index is loaded.
/home/kbuilder/.local/lib/python3.7/site-packages/ipykernel_launcher.py:1: FutureWarning: The default argument for metric will be removed in future version of Annoy. Please pass metric='angular' explicitly.
  """Entry point for launching an IPython kernel.
Mapping file is loaded.

유사성 일치 메서드

def find_similar_items(embedding, num_matches=5):
  '''Finds similar items to a given embedding in the ANN index'''
  ids = index.get_nns_by_vector(
  embedding, num_matches, search_k=-1, include_distances=False)
  items = [mapping[i] for i in ids]
  return items

주어진 쿼리에서 임베딩 추출하기

# Load the TF-Hub module
print("Loading the TF-Hub module...")
g = tf.Graph()
with g.as_default():
  embed_fn = load_module(module_url)
print("TF-Hub module is loaded.")

random_projection_matrix = None
if os.path.exists('random_projection_matrix'):
  print("Loading random projection matrix...")
  with open('random_projection_matrix', 'rb') as handle:
    random_projection_matrix = pickle.load(handle)
  print('random projection matrix is loaded.')

def extract_embeddings(query):
  '''Generates the embedding for the query'''
  query_embedding =  embed_fn([query])[0]
  if random_projection_matrix is not None:
    query_embedding = query_embedding.dot(random_projection_matrix)
  return query_embedding
Loading the TF-Hub module...
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
2021-08-14 00:53:45.852257: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:53:45.852860: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:53:45.853200: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:53:45.853623: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:53:45.853944: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:53:45.854252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0
TF-Hub module is loaded.
TF-Hub module is loaded.
Loading random projection matrix...
random projection matrix is loaded.
extract_embeddings("Hello Machine Learning!")[:10]
array([ 0.12434553, -0.04973917,  0.03094718, -0.14911301, -0.02944927,
        0.061433  ,  0.08337866, -0.1080807 ,  0.0259618 ,  0.03843714])

가장 유사한 항목을 찾기 위한 쿼리 입력하기

Generating embedding for the query...
CPU times: user 23.1 ms, sys: 16.5 ms, total: 39.6 ms
Wall time: 5.07 ms

Finding relevant items in the index...
CPU times: user 5.31 ms, sys: 1.81 ms, total: 7.12 ms
Wall time: 899 µs

Results:
=========
confronting global challenges
lgbti challenges in 2015
disappearing apostles prompt marketing dilemma
world science festival food for thought
5 lgbti challenges in 2015
aust criticised over solomons mission
baillieu says challenges stem from federal turmoil
complaint to the united nations about proposed changes to victi
ndis forum to hear challenges
hopes mulesing changes will end boycott

더 자세히 알고 싶나요?

tensorflow.org에서 TensorFlow에 대해 자세히 알아보고 tensorflow.org/hub에서 TF-Hub API 설명서를 확인할 수 있습니다. 추가적인 텍스트 임베딩 모듈 및 이미지 특성 벡터 모듈을 포함해 tfhub.dev에서 사용 가능한 TensorFlow Hub 모듈을 찾아보세요.

빠르게 진행되는 Google의 머신러닝 실무 개요 과정인 머신러닝 집중 과정도 확인해 보세요.