Semantic Search with Approximate Nearest Neighbors and Text Embeddings

This tutorial shows how to generate embeddings for input data with a TensorFlow Hub (TF-Hub) module, and how to build an approximate nearest neighbours (ANN) index from the extracted embeddings. The index can then be used for real-time similarity matching and retrieval.

When dealing with a large corpus, scanning the whole repository to find the items most similar to a given query is too slow for real-time use. We therefore use an approximate similarity matching algorithm, which trades a little accuracy in the nearest neighbour matches for a significant gain in speed.
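To make the cost of exact matching concrete, here is a minimal pure-Python sketch of a brute-force search that compares the query against every stored vector. The function names and the toy vectors are illustrative assumptions and are not part of the tutorial's pipeline.

```python
import heapq

def squared_distance(u, v):
  """Squared Euclidean distance between two equal-length vectors."""
  return sum((a - b) ** 2 for a, b in zip(u, v))

def exact_nearest(query, vectors, k):
  """Returns the ids of the k closest vectors by scanning all of them."""
  return heapq.nsmallest(
      k, range(len(vectors)),
      key=lambda i: squared_distance(query, vectors[i]))

# Hypothetical toy data, for illustration only.
vectors = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
nearest = exact_nearest([1.0, 1.1], vectors, k=2)
print(nearest)
```

Every query touches all N vectors, so the work grows linearly with the size of the corpus. An ANN index avoids that scan.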

In this tutorial, we show an example of real-time text search over a corpus of news headlines, finding the headlines most similar to a query. Unlike keyword search, this captures the semantic similarity encoded in the text embedding.

The steps of this tutorial are:

  1. Download sample data.
  2. Generate embeddings for the data using a TF-Hub module.
  3. Build an ANN index for the embeddings.
  4. Use the index for similarity matching.

We use Apache Beam with TensorFlow Transform (TF-Transform) to generate the embeddings from the TF-Hub module. We also use Spotify's ANNOY library to build the approximate nearest neighbours index. You can find a benchmark of ANN frameworks in this GitHub repository.

This tutorial uses the TensorFlow 1 API (through tensorflow.compat.v1) and works only with TF1 Hub modules from TF-Hub. See the updated TF2 version of this tutorial.

Setup

Install the required libraries.

pip install -q apache_beam
# gaussian_random_matrix (used below) is removed in scikit-learn 0.24
pip install -q "scikit-learn<0.24"
pip install -q annoy

Import the required libraries.

import os
import sys
import pathlib
import pickle
from collections import namedtuple
from datetime import datetime

import numpy as np
import apache_beam as beam
import annoy
from sklearn.random_projection import gaussian_random_matrix

import tensorflow.compat.v1 as tf
import tensorflow_hub as hub
# TFT needs to be installed afterwards
!pip install -q tensorflow_transform==0.24
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
print('TF version: {}'.format(tf.__version__))
print('TF-Hub version: {}'.format(hub.__version__))
print('TF-Transform version: {}'.format(tft.__version__))
print('Apache Beam version: {}'.format(beam.__version__))
TF version: 2.3.1
TF-Hub version: 0.10.0
TF-Transform version: 0.24.0
Apache Beam version: 2.25.0

1. Download Sample Data

The A Million News Headlines dataset contains news headlines published over a period of 15 years, sourced from the reputable Australian Broadcasting Corp. (ABC). It is a summarised historical record of noteworthy events around the world from early 2003 to the end of 2017, with a more granular focus on Australia.

Format: tab-separated, two-column data: 1) publication date and 2) headline text. We are only interested in the headline text.

wget 'https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true' -O raw.tsv
wc -l raw.tsv
head raw.tsv
--2020-12-03 12:12:21--  https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true
Resolving dataverse.harvard.edu (dataverse.harvard.edu)... 206.191.184.198
Connecting to dataverse.harvard.edu (dataverse.harvard.edu)|206.191.184.198|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57600231 (55M) [text/tab-separated-values]
Saving to: ‘raw.tsv’

raw.tsv             100%[===================>]  54.93M  15.1MB/s    in 4.3s    

2020-12-03 12:12:27 (12.7 MB/s) - ‘raw.tsv’ saved [57600231/57600231]

1103664 raw.tsv
publish_date    headline_text
20030219    "aba decides against community broadcasting licence"
20030219    "act fire witnesses must be aware of defamation"
20030219    "a g calls for infrastructure protection summit"
20030219    "air nz staff in aust strike for pay rise"
20030219    "air nz strike to affect australian travellers"
20030219    "ambitious olsson wins triple jump"
20030219    "antic delighted with record breaking barca"
20030219    "aussie qualifier stosur wastes four memphis match"
20030219    "aust addresses un security council over iraq"

For simplicity, we keep only the headline text and drop the publication date.

!rm -r corpus
!mkdir corpus

with open('corpus/text.txt', 'w') as out_file:
  with open('raw.tsv', 'r') as in_file:
    for line in in_file:
      headline = line.split('\t')[1].strip().strip('"')
      out_file.write(headline+"\n")
rm: cannot remove 'corpus': No such file or directory

tail corpus/text.txt
severe storms forecast for nye in south east queensland
snake catcher pleads for people not to kill reptiles
south australia prepares for party to welcome new year
strikers cool off the heat with big win in adelaide
stunning images from the sydney to hobart yacht
the ashes smiths warners near miss liven up boxing day test
timelapse: brisbanes new year fireworks
what 2017 meant to the kids of australia
what the papodopoulos meeting may mean for ausus
who is george papadopoulos the former trump campaign aide
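The loop above also writes the header row (headline_text) into the corpus, which is why it shows up among the embedded items later on. As a hedged alternative, here is a minimal sketch that parses the same two-column layout with the standard csv module and skips the header. The helper name iter_headlines is hypothetical, and the sample rows are copied from the head output shown earlier.

```python
import csv
import io

def iter_headlines(tsv_file):
  """Yields headline text from a two-column TSV, skipping the header row."""
  reader = csv.reader(tsv_file, delimiter='\t')
  next(reader, None)  # assumes the first row is the column header
  for row in reader:
    if len(row) >= 2:
      yield row[1].strip()

# In-memory stand-in for raw.tsv so the sketch runs on its own.
sample = io.StringIO(
    'publish_date\theadline_text\n'
    '20030219\t"aba decides against community broadcasting licence"\n'
    '20030219\t"ambitious olsson wins triple jump"\n'
)
headlines = list(iter_headlines(sample))
print(headlines)
```

The csv reader removes the surrounding double quotes itself, so no extra strip('"') call is needed.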

Helper function to load a TF-Hub module

def load_module(module_url):
  embed_module = hub.Module(module_url)
  placeholder = tf.placeholder(dtype=tf.string)
  embed = embed_module(placeholder)
  session = tf.Session()
  session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  print('TF-Hub module is loaded.')

  def _embeddings_fn(sentences):
    computed_embeddings = session.run(
        embed, feed_dict={placeholder: sentences})
    return computed_embeddings

  return _embeddings_fn

2. Generate Embeddings for the Data

In this tutorial, we use the Universal Sentence Encoder to generate embeddings for the headline data. The sentence embeddings can then be used to compute sentence-level semantic similarity. We run the embedding generation process with Apache Beam and TF-Transform.
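As an illustration of how sentence-level similarity can be computed from embeddings, here is a minimal sketch of cosine similarity in pure Python. The three-dimensional vectors and their names are made up for the example. Real Universal Sentence Encoder embeddings have far more dimensions.

```python
import math

def cosine_similarity(u, v):
  """Cosine of the angle between two equal-length vectors."""
  dot = sum(a * b for a, b in zip(u, v))
  norm_u = math.sqrt(sum(a * a for a in u))
  norm_v = math.sqrt(sum(b * b for b in v))
  return dot / (norm_u * norm_v)

# Hypothetical embeddings, for illustration only.
strike_pay = [0.8, 0.1, 0.1]
strike_travel = [0.7, 0.2, 0.1]
triple_jump = [0.0, 0.2, 0.9]

related = cosine_similarity(strike_pay, strike_travel)
unrelated = cosine_similarity(strike_pay, triple_jump)
print(related > unrelated)
```

Vectors that point in similar directions score close to 1, while unrelated ones score closer to 0.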

Embedding extraction method

encoder = None

def embed_text(text, module_url, random_projection_matrix):
  # Beam will run this function in different processes that need to
  # import hub and load embed_fn (if not previously loaded)
  global encoder
  if not encoder:
    encoder = hub.Module(module_url)
  embedding = encoder(text)
  if random_projection_matrix is not None:
    # Perform random projection for the embedding
    embedding = tf.matmul(
        embedding, tf.cast(random_projection_matrix, embedding.dtype))
  return embedding

Make the TFT preprocess_fn method

def make_preprocess_fn(module_url, random_projection_matrix=None):
  '''Makes a tft preprocess_fn'''

  def _preprocess_fn(input_features):
    '''tft preprocess_fn'''
    text = input_features['text']
    # Generate the embedding for the input text
    embedding = embed_text(text, module_url, random_projection_matrix)

    output_features = {
        'text': text, 
        'embedding': embedding
        }

    return output_features

  return _preprocess_fn

Create dataset metadata

def create_metadata():
  '''Creates metadata for the raw data'''
  from tensorflow_transform.tf_metadata import dataset_metadata
  from tensorflow_transform.tf_metadata import schema_utils
  feature_spec = {'text': tf.FixedLenFeature([], dtype=tf.string)}
  schema = schema_utils.schema_from_feature_spec(feature_spec)
  metadata = dataset_metadata.DatasetMetadata(schema)
  return metadata

Beam pipeline

def run_hub2emb(args):
  '''Runs the embedding generation pipeline'''

  options = beam.options.pipeline_options.PipelineOptions(**args)
  args = namedtuple("options", args.keys())(*args.values())

  raw_metadata = create_metadata()
  converter = tft.coders.CsvCoder(
      column_names=['text'], schema=raw_metadata.schema)

  with beam.Pipeline(args.runner, options=options) as pipeline:
    with tft_beam.Context(args.temporary_dir):
      # Read the sentences from the input file
      sentences = ( 
          pipeline
          | 'Read sentences from files' >> beam.io.ReadFromText(
              file_pattern=args.data_dir)
          | 'Convert to dictionary' >> beam.Map(converter.decode)
      )

      sentences_dataset = (sentences, raw_metadata)
      preprocess_fn = make_preprocess_fn(args.module_url, args.random_projection_matrix)
      # Generate the embeddings for the sentence using the TF-Hub module
      embeddings_dataset, _ = (
          sentences_dataset
          | 'Extract embeddings' >> tft_beam.AnalyzeAndTransformDataset(preprocess_fn)
      )

      embeddings, transformed_metadata = embeddings_dataset
      # Write the embeddings to TFRecords files
      embeddings | 'Write embeddings to TFRecords' >> beam.io.tfrecordio.WriteToTFRecord(
          file_path_prefix='{}/emb'.format(args.output_dir),
          file_name_suffix='.tfrecords',
          coder=tft.coders.ExampleProtoCoder(transformed_metadata.schema))

Generating the Random Projection Weight Matrix

Random projection is a simple yet powerful technique for reducing the dimensionality of a set of points that lie in Euclidean space. For the theoretical background, see the Johnson-Lindenstrauss lemma.

Reducing the dimensionality of the embeddings with random projection means less time is needed to build and query the ANN index.

In this tutorial we use Gaussian Random Projection from the Scikit-learn library.

def generate_random_projection_weights(original_dim, projected_dim):
  random_projection_matrix = None
  if projected_dim and original_dim > projected_dim:
    random_projection_matrix = gaussian_random_matrix(
        n_components=projected_dim, n_features=original_dim).T
    print("A Gaussian random weight matrix was creates with shape of {}".format(random_projection_matrix.shape))
    print('Storing random projection matrix to disk...')
    with open('random_projection_matrix', 'wb') as handle:
      pickle.dump(random_projection_matrix, 
                  handle, protocol=pickle.HIGHEST_PROTOCOL)

  return random_projection_matrix
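To show what the projection does to distances, here is a minimal pure-Python sketch. It is not the scikit-learn implementation: it simply draws a (512, 64) matrix with entries from N(0, 1/projected_dim) and compares the distance between two random vectors before and after projection. The helper names, seeds, and test vectors are assumptions made for the example.

```python
import math
import random

def gaussian_projection_matrix(original_dim, projected_dim, seed=0):
  """Draws an original_dim x projected_dim matrix from N(0, 1/projected_dim)."""
  rng = random.Random(seed)
  scale = 1.0 / math.sqrt(projected_dim)
  return [[rng.gauss(0.0, scale) for _ in range(projected_dim)]
          for _ in range(original_dim)]

def project(vector, matrix):
  """Computes the row vector times the matrix."""
  return [sum(vector[i] * matrix[i][j] for i in range(len(vector)))
          for j in range(len(matrix[0]))]

def norm(v):
  return math.sqrt(sum(x * x for x in v))

rng = random.Random(1)
u = [rng.uniform(-1, 1) for _ in range(512)]
v = [rng.uniform(-1, 1) for _ in range(512)]
matrix = gaussian_projection_matrix(512, 64)

projected_u = project(u, matrix)
projected_v = project(v, matrix)
original_distance = norm([a - b for a, b in zip(u, v)])
projected_distance = norm([a - b for a, b in zip(projected_u, projected_v)])
ratio = projected_distance / original_distance
print(len(projected_u), ratio)
```

The ratio should come out close to 1, which is the sense in which the projection approximately preserves distances while shrinking each vector from 512 to 64 values.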

Set parameters

If you want to build an index using the original embedding space without random projection, set the projected_dim parameter to None. Note that this will slow down the indexing step for high-dimensional embeddings.
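The cell that sets these parameters is not reproduced in this text. Judging from the output printed further down (the module URL in the pipeline arguments and the (512, 64) shape of the projection matrix), the settings were presumably along these lines. Treat the values as inferred, not authoritative.

```python
# Values inferred from the output shown later in this tutorial.
module_url = 'https://tfhub.dev/google/universal-sentence-encoder/2'
# Set to None to keep the original 512-dimensional embeddings.
projected_dim = 64
```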

Run the pipeline

import tempfile

output_dir = pathlib.Path(tempfile.mkdtemp())
temporary_dir = pathlib.Path(tempfile.mkdtemp())

g = tf.Graph()
with g.as_default():
  original_dim = load_module(module_url)(['']).shape[1]
  random_projection_matrix = None

  if projected_dim:
    random_projection_matrix = generate_random_projection_weights(
        original_dim, projected_dim)

args = {
    'job_name': 'hub2emb-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S')),
    'runner': 'DirectRunner',
    'batch_size': 1024,
    'data_dir': 'corpus/*.txt',
    'output_dir': output_dir,
    'temporary_dir': temporary_dir,
    'module_url': module_url,
    'random_projection_matrix': random_projection_matrix,
}

print("Pipeline args are set.")
args
INFO:tensorflow:Saver not created because there are no variables in the graph to restore

TF-Hub module is loaded.
A Gaussian random weight matrix was creates with shape of (512, 64)
Storing random projection matrix to disk...
Pipeline args are set.

/home/kbuilder/.local/lib/python3.6/site-packages/sklearn/utils/deprecation.py:86: FutureWarning: Function gaussian_random_matrix is deprecated; gaussian_random_matrix is deprecated in 0.22 and will be removed in version 0.24.
  warnings.warn(msg, category=FutureWarning)

{'job_name': 'hub2emb-201203-121305',
 'runner': 'DirectRunner',
 'batch_size': 1024,
 'data_dir': 'corpus/*.txt',
 'output_dir': PosixPath('/tmp/tmp3_9agsp3'),
 'temporary_dir': PosixPath('/tmp/tmp75ty7xfk'),
 'module_url': 'https://tfhub.dev/google/universal-sentence-encoder/2',
 'random_projection_matrix': array([[ 0.21470759, -0.05258816, -0.0972597 , ...,  0.04385087,
         -0.14274348,  0.11220471],
        [ 0.03580492, -0.16426251, -0.14089037, ...,  0.0101535 ,
         -0.22515438, -0.21514454],
        [-0.15639698,  0.01808027, -0.13684782, ...,  0.11841098,
         -0.04303762,  0.00745478],
        ...,
        [-0.18584684,  0.14040793,  0.18339619, ...,  0.13763638,
         -0.13028201, -0.16183348],
        [ 0.20997704, -0.2241034 , -0.12709368, ..., -0.03352462,
          0.11281993, -0.16342795],
        [-0.23761595,  0.00275779, -0.1585855 , ..., -0.08995121,
          0.1475089 , -0.26595401]])}
!rm -r {output_dir}
!rm -r {temporary_dir}

print("Running pipeline...")
%time run_hub2emb(args)
print("Pipeline is done.")
WARNING:apache_beam.runners.interactive.interactive_environment:Dependencies required for Interactive Beam PCollection visualization are not available, please use: `pip install apache-beam[interactive]` to install necessary dependencies to enable all data visualization features.

Running pipeline...

Warning:tensorflow:Tensorflow version (2.3.1) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. 

Warning:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).

INFO:tensorflow:Saver not created because there are no variables in the graph to restore

Warning:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/saved_model/signature_def_utils_impl.py:201: build_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.

INFO:tensorflow:Assets added to graph.

INFO:tensorflow:No assets to write.

INFO:tensorflow:SavedModel written to: /tmp/tmp75ty7xfk/tftransform_tmp/0839c04b1a8d4dd0b3d2832fbe9f5904/saved_model.pb

Warning:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_transform/tf_utils.py:218: Tensor.experimental_ref (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use ref() instead.

WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.

CPU times: user 2min 50s, sys: 6.6 s, total: 2min 57s
Wall time: 2min 40s
Pipeline is done.

ls {output_dir}
emb-00000-of-00001.tfrecords

Read some of the generated embeddings...

import itertools

embed_file = os.path.join(output_dir, 'emb-00000-of-00001.tfrecords')
sample = 5
record_iterator =  tf.io.tf_record_iterator(path=embed_file)
for string_record in itertools.islice(record_iterator, sample):
  example = tf.train.Example()
  example.ParseFromString(string_record)
  text = example.features.feature['text'].bytes_list.value
  embedding = np.array(example.features.feature['embedding'].float_list.value)
  print("Embedding dimensions: {}".format(embedding.shape[0]))
  print("{}: {}".format(text, embedding[:10]))
WARNING:tensorflow:From <ipython-input-1-3d6f4d54c65b>:5: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`

Embedding dimensions: 64
[b'headline_text']: [-0.04724706  0.27573067 -0.02340046  0.12461437  0.04809146  0.00246292
  0.15367804 -0.17551982 -0.02778188 -0.185176  ]
Embedding dimensions: 64
[b'aba decides against community broadcasting licence']: [-0.0466345   0.00110549 -0.08875479  0.05938878  0.01933165 -0.05704207
  0.18913773 -0.12833942  0.1816328   0.06035798]
Embedding dimensions: 64
[b'act fire witnesses must be aware of defamation']: [-0.31556517 -0.07618773 -0.14239314 -0.14500496  0.04438541 -0.00983415
  0.01349827 -0.15908629 -0.12947078  0.31871504]
Embedding dimensions: 64
[b'a g calls for infrastructure protection summit']: [ 0.15422247 -0.09829048 -0.16913125 -0.17129296  0.01204466 -0.16008876
 -0.00540507 -0.20552996  0.11388192 -0.03878446]
Embedding dimensions: 64
[b'air nz staff in aust strike for pay rise']: [ 0.13039729 -0.06921542 -0.08830801 -0.09704516 -0.05936369 -0.13036506
 -0.16644046 -0.06228216  0.00742535 -0.13592219]

3. Build the ANN Index for the Embeddings

ANNOY (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings for searching for points in space that are close to a given query point. It also creates large read-only, file-based data structures that are memory-mapped. It was built by Spotify, which uses it for music recommendations.

def build_index(embedding_files_pattern, index_filename, vector_length, 
    metric='angular', num_trees=100):
  '''Builds an ANNOY index'''

  annoy_index = annoy.AnnoyIndex(vector_length, metric=metric)
  # Mapping between the item and its identifier in the index
  mapping = {}

  embed_files = tf.gfile.Glob(embedding_files_pattern)
  print('Found {} embedding file(s).'.format(len(embed_files)))

  item_counter = 0
  for f, embed_file in enumerate(embed_files):
    print('Loading embeddings in file {} of {}...'.format(
      f+1, len(embed_files)))
    record_iterator = tf.io.tf_record_iterator(
      path=embed_file)

    for string_record in record_iterator:
      example = tf.train.Example()
      example.ParseFromString(string_record)
      text = example.features.feature['text'].bytes_list.value[0].decode("utf-8")
      mapping[item_counter] = text
      embedding = np.array(
        example.features.feature['embedding'].float_list.value)
      annoy_index.add_item(item_counter, embedding)
      item_counter += 1
      if item_counter % 100000 == 0:
        print('{} items loaded to the index'.format(item_counter))

  print('A total of {} items added to the index'.format(item_counter))

  print('Building the index with {} trees...'.format(num_trees))
  annoy_index.build(n_trees=num_trees)
  print('Index is successfully built.')

  print('Saving index to disk...')
  annoy_index.save(index_filename)
  print('Index is saved to disk.')
  print("Index file size: {} GB".format(
    round(os.path.getsize(index_filename) / float(1024 ** 3), 2)))
  annoy_index.unload()

  print('Saving mapping to disk...')
  with open(index_filename + '.mapping', 'wb') as handle:
    pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)
  print('Mapping is saved to disk.')
  print("Mapping file size: {} MB".format(
    round(os.path.getsize(index_filename + '.mapping') / float(1024 ** 2), 2)))
embedding_files = "{}/emb-*.tfrecords".format(output_dir)
embedding_dimension = projected_dim
index_filename = "index"

!rm {index_filename}
!rm {index_filename}.mapping

%time build_index(embedding_files, index_filename, embedding_dimension)
rm: cannot remove 'index': No such file or directory
rm: cannot remove 'index.mapping': No such file or directory
Found 1 embedding file(s).
Loading embeddings in file 1 of 1...
100000 items loaded to the index
200000 items loaded to the index
300000 items loaded to the index
400000 items loaded to the index
500000 items loaded to the index
600000 items loaded to the index
700000 items loaded to the index
800000 items loaded to the index
900000 items loaded to the index
1000000 items loaded to the index
1100000 items loaded to the index
A total of 1103664 items added to the index
Building the index with 100 trees...
Index is successfully built.
Saving index to disk...
Index is saved to disk.
Index file size: 1.66 GB
Saving mapping to disk...
Mapping is saved to disk.
Mapping file size: 50.61 MB
CPU times: user 6min 10s, sys: 3.7 s, total: 6min 14s
Wall time: 1min 36s

ls
corpus  index.mapping         raw.tsv
index   random_projection_matrix  semantic_approximate_nearest_neighbors.ipynb
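To give some intuition for the num_trees parameter, here is a simplified pure-Python sketch of a forest of random-hyperplane trees. It follows the general idea behind tree-based ANN indexes such as ANNOY, but it is not ANNOY's implementation: it uses Euclidean distance, ignores the search_k budget, and all names and the toy data are assumptions made for the example.

```python
import random

def dot(u, v):
  return sum(a * b for a, b in zip(u, v))

def squared_distance(u, v):
  return sum((a - b) ** 2 for a, b in zip(u, v))

def build_tree(ids, vectors, leaf_size, rng):
  """Recursively splits ids with random hyperplanes until leaves are small."""
  if len(ids) <= leaf_size:
    return ids  # leaf: a plain list of item ids
  a, b = rng.sample(ids, 2)
  # Hyperplane equidistant from two randomly chosen items.
  normal = [x - y for x, y in zip(vectors[a], vectors[b])]
  midpoint = [(x + y) / 2.0 for x, y in zip(vectors[a], vectors[b])]
  offset = dot(normal, midpoint)
  left = [i for i in ids if dot(normal, vectors[i]) < offset]
  right = [i for i in ids if dot(normal, vectors[i]) >= offset]
  if not left or not right:
    return ids  # degenerate split, keep as a leaf
  return (normal, offset,
          build_tree(left, vectors, leaf_size, rng),
          build_tree(right, vectors, leaf_size, rng))

def leaf_for(tree, query):
  """Descends to the leaf on the query's side of every split."""
  while isinstance(tree, tuple):
    normal, offset, left, right = tree
    tree = left if dot(normal, query) < offset else right
  return tree

def query_forest(forest, vectors, query, k):
  """Collects one leaf per tree, then ranks only those candidates exactly."""
  candidates = set()
  for tree in forest:
    candidates.update(leaf_for(tree, query))
  return sorted(
      candidates, key=lambda i: squared_distance(query, vectors[i]))[:k]

# Hypothetical toy data: 200 random 8-dimensional vectors, 5 trees.
rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
all_ids = list(range(len(vectors)))
forest = [build_tree(all_ids, vectors, leaf_size=10, rng=rng)
          for _ in range(5)]
neighbours = query_forest(forest, vectors, vectors[42], k=3)
print(neighbours)
```

Each tree only inspects one small leaf, so a single tree can miss true neighbours that fall on the other side of a split. Adding trees enlarges the candidate set and improves accuracy, at the cost of a larger index and a longer build.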

4. Use the Index for Similarity Matching

Now we can use the ANN index to find news headlines that are semantically close to an input query.

Load the index and the mapping files

# Pass the metric explicitly; it must match the one used in build_index.
index = annoy.AnnoyIndex(embedding_dimension, metric='angular')
index.load(index_filename, prefault=True)
print('Annoy index is loaded.')
with open(index_filename + '.mapping', 'rb') as handle:
  mapping = pickle.load(handle)
print('Mapping file is loaded.')
Annoy index is loaded.
Mapping file is loaded.

Similarity matching method

def find_similar_items(embedding, num_matches=5):
  '''Finds similar items to a given embedding in the ANN index'''
  ids = index.get_nns_by_vector(
      embedding, num_matches, search_k=-1, include_distances=False)
  items = [mapping[i] for i in ids]
  return items

Extract the embedding for a given query

# Load the TF-Hub module
print("Loading the TF-Hub module...")
g = tf.Graph()
with g.as_default():
  embed_fn = load_module(module_url)
print("TF-Hub module is loaded.")

random_projection_matrix = None
if os.path.exists('random_projection_matrix'):
  print("Loading random projection matrix...")
  with open('random_projection_matrix', 'rb') as handle:
    random_projection_matrix = pickle.load(handle)
  print('random projection matrix is loaded.')

def extract_embeddings(query):
  '''Generates the embedding for the query'''
  query_embedding =  embed_fn([query])[0]
  if random_projection_matrix is not None:
    query_embedding = query_embedding.dot(random_projection_matrix)
  return query_embedding
Loading the TF-Hub module...
INFO:tensorflow:Saver not created because there are no variables in the graph to restore

TF-Hub module is loaded.
TF-Hub module is loaded.
Loading random projection matrix...
random projection matrix is loaded.

extract_embeddings("Hello Machine Learning!")[:10]
array([-0.06277051,  0.14012653, -0.15893948,  0.15775941, -0.1226441 ,
       -0.11202384,  0.07953477, -0.08003543,  0.03763271,  0.0302215 ])

Enter a query to find the most similar items

Generating embedding for the query...
CPU times: user 32.9 ms, sys: 19.8 ms, total: 52.7 ms
Wall time: 6.96 ms

Finding relevant items in the index...
CPU times: user 7.19 ms, sys: 370 µs, total: 7.56 ms
Wall time: 953 µs

Results:
=========
confronting global challenges
downer challenges un to follow aust example
fairfax loses oshane challenge
jericho social media and the border farce
territory on search for raw comedy talent
interview gred jericho
interview: josh frydenberg; environment and energy
interview: josh frydenberg; environment and energy
world science festival music and climate change
interview with aussie bobsledder
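The cell that produced the output above is not included in this text. Here is a minimal sketch of what such a cell might look like, written as a function so it can run on its own. The name run_query and the two stand-in functions are hypothetical, and the query string is only a guess based on the first result. In the notebook you would pass extract_embeddings and find_similar_items instead of the stand-ins.

```python
def run_query(query, embed, search, num_matches=10):
  """Embeds the query, looks up its neighbours, and prints them."""
  print("Generating embedding for the query...")
  query_embedding = embed(query)
  print("")
  print("Finding relevant items in the index...")
  items = search(query_embedding, num_matches)
  print("")
  print("Results:")
  print("=========")
  for item in items:
    print(item)
  return items

# Hypothetical stand-ins so the sketch is self-contained.
def fake_embed(query):
  return [float(len(query))]

def fake_search(embedding, num_matches):
  return ["headline {}".format(i) for i in range(num_matches)]

items = run_query(
    "confronting global challenges", fake_embed, fake_search, num_matches=3)
```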

Want to learn more?

You can learn more about TensorFlow at tensorflow.org and see the TF-Hub API documentation at tensorflow.org/hub. Find available TensorFlow Hub modules at tfhub.dev, including more text embedding modules and image feature vector modules.

Also check out the Machine Learning Crash Course, Google's fast-paced, practical introduction to machine learning.