
Semantic Search with Approximate Nearest Neighbors and Text Embeddings


This tutorial illustrates how to generate embeddings from a TensorFlow Hub (TF-Hub) module given input data, and how to build an approximate nearest neighbors (ANN) index from the extracted embeddings. The index can then be used for real-time similarity matching and retrieval.

When dealing with a large corpus of data, it is not efficient to perform exact matching by scanning the whole repository to find the items most similar to a given query in real time. Instead, we use an approximate similarity matching algorithm, which trades a little accuracy in finding the exact nearest neighbors for a significant boost in speed.
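To make the cost of exact matching concrete, here is a minimal sketch of a brute-force search. The function names and the toy 2-D vectors are hypothetical and used only for illustration; the point is that every query compares against all N items, so the cost grows linearly with the corpus size.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def exact_top_k(query, corpus, k):
    """Compares the query with every corpus vector: O(N * d) work per query."""
    scored = [(squared_distance(query, vec), idx) for idx, vec in enumerate(corpus)]
    scored.sort()
    return [idx for _, idx in scored[:k]]

# Toy 2-D "embeddings" (hypothetical values, for illustration only).
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
top2 = exact_top_k([1.0, 0.05], toy_corpus, k=2)
print(top2)
```

An ANN index avoids the full scan by looking only at a small candidate set per query, at the price of occasionally missing a true nearest neighbor.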

In this tutorial, we show an example of real-time text search over a corpus of news headlines to find the headlines most similar to a query. Unlike keyword search, this captures the semantic similarity encoded in the text embedding.

The steps of this tutorial are:

  1. Download sample data.
  2. Generate embeddings for the data using a TF-Hub module.
  3. Build an ANN index for the embeddings.
  4. Use the index for similarity matching.

We use Apache Beam with TensorFlow Transform (TF-Transform) to generate the embeddings from the TF-Hub module. We also use Spotify's ANNOY library to build the approximate nearest neighbors index. You can find benchmarks of ANN frameworks in this GitHub repository.

This tutorial uses the TensorFlow 1.x API (imported as tensorflow.compat.v1) and works only with TF1 Hub modules from TF-Hub. See the updated TF2 version of this tutorial.

Setup

Install the required libraries.

pip install -q tensorflow_transform
pip install -q apache_beam
pip install -q scikit-learn
pip install -q annoy
ERROR: tfx-bsl 0.22.0 has requirement pyarrow<0.17,>=0.16.0, but you'll have pyarrow 0.17.1 which is incompatible.

Import the required libraries

import os
import sys
import pathlib
import pickle
from collections import namedtuple
from datetime import datetime

import numpy as np
import apache_beam as beam
import annoy
from sklearn.random_projection import gaussian_random_matrix

import tensorflow.compat.v1 as tf
import tensorflow_transform as tft
import tensorflow_hub as hub
import tensorflow_transform.beam as tft_beam
Error importing tfx_bsl_extension.arrow.array_util. Some tfx_bsl functionalities are not available
print('TF version: {}'.format(tf.__version__))
print('TF-Hub version: {}'.format(hub.__version__))
print('TF-Transform version: {}'.format(tft.__version__))
print('Apache Beam version: {}'.format(beam.__version__))
TF version: 2.2.0
TF-Hub version: 0.8.0
TF-Transform version: 0.22.0
Apache Beam version: 2.22.0

1. Download the Sample Data

The A Million News Headlines dataset contains news headlines published over a period of 15 years, sourced from the reputable Australian Broadcasting Corp. (ABC). It is a summarized historical record of noteworthy events around the globe from early 2003 to the end of 2017, with a more granular focus on Australia.

Format: tab-separated, two-column data: 1) publication date and 2) headline text. We are only interested in the headline text.

wget 'https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true' -O raw.tsv
wc -l raw.tsv
head raw.tsv
--2020-06-12 12:17:13--  https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true
Resolving dataverse.harvard.edu (dataverse.harvard.edu)... 206.191.184.198
Connecting to dataverse.harvard.edu (dataverse.harvard.edu)|206.191.184.198|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57600231 (55M) [text/tab-separated-values]
Saving to: ‘raw.tsv’

raw.tsv             100%[===================>]  54.93M  14.7MB/s    in 4.5s    

2020-06-12 12:17:19 (12.3 MB/s) - ‘raw.tsv’ saved [57600231/57600231]

1103664 raw.tsv
publish_date    headline_text
20030219    "aba decides against community broadcasting licence"
20030219    "act fire witnesses must be aware of defamation"
20030219    "a g calls for infrastructure protection summit"
20030219    "air nz staff in aust strike for pay rise"
20030219    "air nz strike to affect australian travellers"
20030219    "ambitious olsson wins triple jump"
20030219    "antic delighted with record breaking barca"
20030219    "aussie qualifier stosur wastes four memphis match"
20030219    "aust addresses un security council over iraq"

For simplicity, we keep only the headline text and remove the publication date.

!rm -r corpus
!mkdir corpus

with open('corpus/text.txt', 'w') as out_file:
  with open('raw.tsv', 'r') as in_file:
    for line in in_file:
      headline = line.split('\t')[1].strip().strip('"')
      out_file.write(headline+"\n")
rm: cannot remove 'corpus': No such file or directory

tail corpus/text.txt
severe storms forecast for nye in south east queensland
snake catcher pleads for people not to kill reptiles
south australia prepares for party to welcome new year
strikers cool off the heat with big win in adelaide
stunning images from the sydney to hobart yacht
the ashes smiths warners near miss liven up boxing day test
timelapse: brisbanes new year fireworks
what 2017 meant to the kids of australia
what the papodopoulos meeting may mean for ausus
who is george papadopoulos the former trump campaign aide
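Note that the loop above also writes the header row (`headline_text`) into the corpus as if it were a headline. The following is a hedged sketch of a variant that skips the header and lets the standard `csv` module handle the quoting. The function name `extract_headlines` and the in-memory sample are assumptions made for illustration, not part of the tutorial; using it on the real file would change the item counts shown in the outputs of this page by one.

```python
import csv
import io

def extract_headlines(tsv_lines, skip_header=True):
    """Yields the second column (headline text) of tab-separated lines."""
    reader = csv.reader(tsv_lines, delimiter='\t', quotechar='"')
    if skip_header:
        next(reader, None)  # Drop the 'publish_date / headline_text' row.
    for row in reader:
        if len(row) >= 2:
            yield row[1].strip()

# A small in-memory sample in the same format as raw.tsv.
sample = io.StringIO(
    'publish_date\theadline_text\n'
    '20030219\t"aba decides against community broadcasting licence"\n'
    '20030219\t"act fire witnesses must be aware of defamation"\n'
)
headlines = list(extract_headlines(sample))
print(headlines)
```

With a real file, `tsv_lines` would be the open file handle for `raw.tsv`.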

Helper function to load a TF-Hub module

def load_module(module_url):
  embed_module = hub.Module(module_url)
  placeholder = tf.placeholder(dtype=tf.string)
  embed = embed_module(placeholder)
  session = tf.Session()
  session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  print('TF-Hub module is loaded.')

  def _embeddings_fn(sentences):
    computed_embeddings = session.run(
        embed, feed_dict={placeholder: sentences})
    return computed_embeddings

  return _embeddings_fn

2. Generate Embeddings for the Data

In this tutorial, we use the Universal Sentence Encoder to generate embeddings for the headline data. The sentence embeddings can then be used to compute sentence-level semantic similarity. We run the embedding generation process using Apache Beam and TF-Transform.
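As an illustration of how sentence embeddings are typically compared, here is a small sketch of cosine similarity in plain Python. The vectors are toy values, not real Universal Sentence Encoder outputs, and the function name is an assumption made for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError('cosine similarity is undefined for zero vectors')
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical): same direction, then orthogonal.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Values close to 1 indicate vectors pointing in nearly the same direction, which for sentence embeddings is usually read as similar meaning.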

Embedding extraction method

encoder = None

def embed_text(text, module_url, random_projection_matrix):
  # Beam will run this function in different processes that need to
  # import hub and load embed_fn (if not previously loaded)
  global encoder
  if not encoder:
    encoder = hub.Module(module_url)
  embedding = encoder(text)
  if random_projection_matrix is not None:
    # Perform random projection for the embedding
    embedding = tf.matmul(
        embedding, tf.cast(random_projection_matrix, embedding.dtype))
  return embedding

Make the TFT preprocess_fn method

def make_preprocess_fn(module_url, random_projection_matrix=None):
  '''Makes a tft preprocess_fn'''

  def _preprocess_fn(input_features):
    '''tft preprocess_fn'''
    text = input_features['text']
    # Generate the embedding for the input text
    embedding = embed_text(text, module_url, random_projection_matrix)
    
    output_features = {
        'text': text, 
        'embedding': embedding
        }
        
    return output_features
  
  return _preprocess_fn

Create dataset metadata

def create_metadata():
  '''Creates metadata for the raw data'''
  from tensorflow_transform.tf_metadata import dataset_metadata
  from tensorflow_transform.tf_metadata import schema_utils
  feature_spec = {'text': tf.FixedLenFeature([], dtype=tf.string)}
  schema = schema_utils.schema_from_feature_spec(feature_spec)
  metadata = dataset_metadata.DatasetMetadata(schema)
  return metadata

Beam pipeline

def run_hub2emb(args):
  '''Runs the embedding generation pipeline'''

  options = beam.options.pipeline_options.PipelineOptions(**args)
  args = namedtuple("options", args.keys())(*args.values())

  raw_metadata = create_metadata()
  converter = tft.coders.CsvCoder(
      column_names=['text'], schema=raw_metadata.schema)

  with beam.Pipeline(args.runner, options=options) as pipeline:
    with tft_beam.Context(args.temporary_dir):
      # Read the sentences from the input file
      sentences = ( 
          pipeline
          | 'Read sentences from files' >> beam.io.ReadFromText(
              file_pattern=args.data_dir)
          | 'Convert to dictionary' >> beam.Map(converter.decode)
      )

      sentences_dataset = (sentences, raw_metadata)
      preprocess_fn = make_preprocess_fn(args.module_url, args.random_projection_matrix)
      # Generate the embeddings for the sentence using the TF-Hub module
      embeddings_dataset, _ = (
          sentences_dataset
          | 'Extract embeddings' >> tft_beam.AnalyzeAndTransformDataset(preprocess_fn)
      )

      embeddings, transformed_metadata = embeddings_dataset
      # Write the embeddings to TFRecords files
      embeddings | 'Write embeddings to TFRecords' >> beam.io.tfrecordio.WriteToTFRecord(
          file_path_prefix='{}/emb'.format(args.output_dir),
          file_name_suffix='.tfrecords',
          coder=tft.coders.ExampleProtoCoder(transformed_metadata.schema))

Generating the Random Projection Weight Matrix

Random projection is a simple yet powerful technique for reducing the dimensionality of a set of points that lie in Euclidean space. For the theoretical background, see the Johnson-Lindenstrauss lemma.

Reducing the dimensionality of the embeddings with random projection means less time is needed to build and query the ANN index.
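A rough back-of-the-envelope calculation shows how much data the index has to handle before and after projection. This sketch assumes 4-byte floats and counts only the raw vectors, ignoring any tree or metadata overhead; the item count and the 512 and 64 dimensions are the values reported in the outputs of this tutorial.

```python
NUM_ITEMS = 1103664   # Item count reported by the index-building step.
BYTES_PER_FLOAT = 4   # Assumption: vectors are stored as 32-bit floats.

def raw_vector_bytes(num_items, dim, bytes_per_value=BYTES_PER_FLOAT):
    """Bytes needed to store num_items vectors of the given dimension."""
    return num_items * dim * bytes_per_value

full_size = raw_vector_bytes(NUM_ITEMS, 512)
projected_size = raw_vector_bytes(NUM_ITEMS, 64)
print('512 dims: {:.2f} GiB'.format(full_size / 1024 ** 3))
print(' 64 dims: {:.2f} GiB'.format(projected_size / 1024 ** 3))
```

Projecting from 512 to 64 dimensions cuts the raw vector data by a factor of eight, and every distance computation during indexing and querying touches proportionally fewer values.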

In this tutorial, we use Gaussian random projection from the Scikit-learn library.

def generate_random_projection_weights(original_dim, projected_dim):
  random_projection_matrix = None
  if projected_dim and original_dim > projected_dim:
    random_projection_matrix = gaussian_random_matrix(
        n_components=projected_dim, n_features=original_dim).T
    print("A Gaussian random weight matrix was creates with shape of {}".format(random_projection_matrix.shape))
    print('Storing random projection matrix to disk...')
    with open('random_projection_matrix', 'wb') as handle:
      pickle.dump(random_projection_matrix, 
                  handle, protocol=pickle.HIGHEST_PROTOCOL)
        
  return random_projection_matrix
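For readers who want to see what such a matrix looks like without relying on `gaussian_random_matrix`, the following is a standard-library sketch, not the Scikit-learn implementation. It assumes entries drawn from a normal distribution with mean 0 and variance 1 / projected_dim, which is the scaling Scikit-learn documents for Gaussian random projection. All function names here are hypothetical.

```python
import math
import random

def gaussian_projection_matrix(original_dim, projected_dim, seed=0):
    """Builds an original_dim x projected_dim matrix with N(0, 1/projected_dim) entries."""
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(projected_dim)
    return [[rng.gauss(0.0, std) for _ in range(projected_dim)]
            for _ in range(original_dim)]

def project(vector, matrix):
    """Multiplies a row vector by the projection matrix."""
    projected_dim = len(matrix[0])
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector)))
            for j in range(projected_dim)]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two random 512-dimensional points (toy data, not real embeddings).
data_rng = random.Random(1)
x = [data_rng.uniform(-1, 1) for _ in range(512)]
y = [data_rng.uniform(-1, 1) for _ in range(512)]

matrix = gaussian_projection_matrix(512, 64)
ratio = euclidean(project(x, matrix), project(y, matrix)) / euclidean(x, y)
print('projected / original distance: {:.2f}'.format(ratio))
```

The printed ratio should be in the neighborhood of 1, which is the sense in which random projection approximately preserves distances. In practice a NumPy matrix product would replace the nested loops.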

Set parameters

If you want to build an index using the original embedding space without random projection, set the projected_dim parameter to None. Note that this will slow down the indexing step for high-dimensional embeddings.

module_url = 'https://tfhub.dev/google/universal-sentence-encoder/2' 
projected_dim = 64  

Run pipeline

import tempfile

output_dir = pathlib.Path(tempfile.mkdtemp())
temporary_dir = pathlib.Path(tempfile.mkdtemp())

g = tf.Graph()
with g.as_default():
  original_dim = load_module(module_url)(['']).shape[1]
  random_projection_matrix = None

  if projected_dim:
    random_projection_matrix = generate_random_projection_weights(
        original_dim, projected_dim)

args = {
    'job_name': 'hub2emb-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S')),
    'runner': 'DirectRunner',
    'batch_size': 1024,
    'data_dir': 'corpus/*.txt',
    'output_dir': output_dir,
    'temporary_dir': temporary_dir,
    'module_url': module_url,
    'random_projection_matrix': random_projection_matrix,
}

print("Pipeline args are set.")
args
INFO:tensorflow:Saver not created because there are no variables in the graph to restore

TF-Hub module is loaded.
A Gaussian random weight matrix was creates with shape of (512, 64)
Storing random projection matrix to disk...
Pipeline args are set.

/home/kbuilder/.local/lib/python3.6/site-packages/sklearn/utils/deprecation.py:86: FutureWarning: Function gaussian_random_matrix is deprecated; gaussian_random_matrix is deprecated in 0.22 and will be removed in version 0.24.
  warnings.warn(msg, category=FutureWarning)

{'job_name': 'hub2emb-200612-121749',
 'runner': 'DirectRunner',
 'batch_size': 1024,
 'data_dir': 'corpus/*.txt',
 'output_dir': PosixPath('/tmp/tmpl_s9_vix'),
 'temporary_dir': PosixPath('/tmp/tmp2mxbymec'),
 'module_url': 'https://tfhub.dev/google/universal-sentence-encoder/2',
 'random_projection_matrix': array([[ 3.64308063e-02,  1.67993103e-01,  1.81644938e-02, ...,
         -4.27425755e-02, -4.77527668e-02,  9.50011149e-02],
        [ 6.46911879e-02,  1.56634683e-02, -1.47471781e-01, ...,
          8.80000727e-03,  7.24554998e-02, -7.17707834e-02],
        [ 7.78754019e-02,  1.01745325e-01, -5.42349991e-05, ...,
         -6.33309663e-02,  4.59838647e-02, -2.06637975e-02],
        ...,
        [-1.23998935e-01, -7.31216982e-02, -4.92907896e-02, ...,
          1.83462424e-03,  8.75368271e-02, -1.21434298e-01],
        [-1.78228447e-01,  1.37973188e-01, -1.78144539e-01, ...,
         -7.57251835e-02,  9.75196613e-02,  5.08420970e-02],
        [ 2.28748179e-02,  9.88932902e-02, -2.06511900e-02, ...,
         -4.58982022e-02,  6.85891550e-02,  2.79062075e-02]])}
!rm -r {output_dir}
!rm -r {temporary_dir}

print("Running pipeline...")
%time run_hub2emb(args)
print("Pipeline is done.")
WARNING:apache_beam.runners.interactive.interactive_environment:Dependencies required for Interactive Beam PCollection visualization are not available, please use: `pip install apache-beam[interactive]` to install necessary dependencies to enable all data visualization features.

Running pipeline...

Warning:tensorflow:Tensorflow version (2.2.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. 

INFO:tensorflow:Saver not created because there are no variables in the graph to restore

Warning:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/saved_model/signature_def_utils_impl.py:201: build_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.

INFO:tensorflow:Assets added to graph.

INFO:tensorflow:No assets to write.

INFO:tensorflow:SavedModel written to: /tmp/tmp2mxbymec/tftransform_tmp/370eb7e3cf9748c3b4f84f81a02d3a43/saved_model.pb

Warning:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_transform/tf_utils.py:220: Tensor.experimental_ref (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use ref() instead.

Warning:tensorflow:Tensorflow version (2.2.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. 

WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.

CPU times: user 2min 2s, sys: 5.94 s, total: 2min 8s
Wall time: 1min 56s
Pipeline is done.

ls {output_dir}
emb-00000-of-00001.tfrecords

Read some of the generated embeddings...

import itertools

embed_file = os.path.join(output_dir, 'emb-00000-of-00001.tfrecords')
sample = 5
record_iterator =  tf.io.tf_record_iterator(path=embed_file)
for string_record in itertools.islice(record_iterator, sample):
  example = tf.train.Example()
  example.ParseFromString(string_record)
  text = example.features.feature['text'].bytes_list.value
  embedding = np.array(example.features.feature['embedding'].float_list.value)
  print("Embedding dimensions: {}".format(embedding.shape[0]))
  print("{}: {}".format(text, embedding[:10]))

WARNING:tensorflow:From <ipython-input-18-3d6f4d54c65b>:5: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`

Embedding dimensions: 64
[b'headline_text']: [ 0.06753581  0.16017342 -0.17897609  0.24942188  0.29144922  0.1098392
 -0.1444453   0.18496816 -0.17940071  0.06810162]
Embedding dimensions: 64
[b'aba decides against community broadcasting licence']: [-0.09007814 -0.0245999  -0.00829813 -0.04423399 -0.02233609  0.01151131
  0.07343563  0.10767256  0.16025199  0.34027126]
Embedding dimensions: 64
[b'act fire witnesses must be aware of defamation']: [-0.11213189  0.06324816  0.06318524 -0.13811822  0.07434208  0.05129745
  0.03772369  0.22248887  0.03524181 -0.07145336]
Embedding dimensions: 64
[b'a g calls for infrastructure protection summit']: [-0.02322459  0.03161157 -0.16749126 -0.16776723  0.03975003  0.11790593
 -0.00100162  0.0499187  -0.09803969  0.1521408 ]
Embedding dimensions: 64
[b'air nz staff in aust strike for pay rise']: [-0.11791844  0.08575241 -0.09476452 -0.1153833   0.19146344 -0.05015895
  0.02001415  0.04246878  0.06008246  0.01942072]

3. Build the ANN Index for the Embeddings

ANNOY (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings for finding points in space that are close to a given query point. It also creates large read-only, file-based data structures that are memory-mapped. It is built and used by Spotify for music recommendations.
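To give an intuition for how a tree-based ANN index works, here is a simplified pure-Python sketch of a forest of random-hyperplane trees. It is not Annoy's implementation: it uses plain Euclidean splits, keeps everything in memory, and visits only one leaf per tree. Each node splits its points with the hyperplane equidistant from two randomly sampled points, and a query collects the leaves it falls into across all trees before ranking those candidates exactly. All names and parameters are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_tree(ids, points, rng, leaf_size=8):
    """Recursively splits ids with hyperplanes equidistant from two sampled points."""
    if len(ids) <= leaf_size:
        return {'leaf': ids}
    a, b = (points[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    midpoint = [(x + y) / 2.0 for x, y in zip(a, b)]
    offset = dot(normal, midpoint)
    left = [i for i in ids if dot(normal, points[i]) < offset]
    right = [i for i in ids if dot(normal, points[i]) >= offset]
    if not left or not right:  # Degenerate split, e.g. duplicate points.
        return {'leaf': ids}
    return {'normal': normal, 'offset': offset,
            'left': build_tree(left, points, rng, leaf_size),
            'right': build_tree(right, points, rng, leaf_size)}

def leaf_for(tree, query):
    """Walks down one tree to the leaf on the query's side of each split."""
    while 'leaf' not in tree:
        side = 'left' if dot(tree['normal'], query) < tree['offset'] else 'right'
        tree = tree[side]
    return tree['leaf']

def approximate_nearest(query, points, trees, k):
    """Unions the candidate leaves of all trees, then ranks them exactly."""
    candidates = set()
    for tree in trees:
        candidates.update(leaf_for(tree, query))
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(points[i], query))
    return sorted(candidates, key=sq_dist)[:k]

# Toy data: 500 random 8-dimensional points and a forest of 10 trees.
rng = random.Random(0)
points = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
trees = [build_tree(list(range(len(points))), points, rng) for _ in range(10)]
neighbours = approximate_nearest(points[7], points, trees, k=5)
print(neighbours)
```

More trees enlarge the candidate set, which generally improves accuracy at the cost of a larger index and slower queries. This is the same trade-off that the `num_trees` argument controls in the `build_index` function below.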

def build_index(embedding_files_pattern, index_filename, vector_length, 
    metric='angular', num_trees=100):
  '''Builds an ANNOY index'''

  annoy_index = annoy.AnnoyIndex(vector_length, metric=metric)
  # Mapping between the item and its identifier in the index
  mapping = {}

  embed_files = tf.gfile.Glob(embedding_files_pattern)
  print('Found {} embedding file(s).'.format(len(embed_files)))

  item_counter = 0
  for f, embed_file in enumerate(embed_files):
    print('Loading embeddings in file {} of {}...'.format(
      f+1, len(embed_files)))
    record_iterator = tf.io.tf_record_iterator(
      path=embed_file)

    for string_record in record_iterator:
      example = tf.train.Example()
      example.ParseFromString(string_record)
      text = example.features.feature['text'].bytes_list.value[0].decode("utf-8")
      mapping[item_counter] = text
      embedding = np.array(
        example.features.feature['embedding'].float_list.value)
      annoy_index.add_item(item_counter, embedding)
      item_counter += 1
      if item_counter % 100000 == 0:
        print('{} items loaded to the index'.format(item_counter))

  print('A total of {} items added to the index'.format(item_counter))

  print('Building the index with {} trees...'.format(num_trees))
  annoy_index.build(n_trees=num_trees)
  print('Index is successfully built.')
  
  print('Saving index to disk...')
  annoy_index.save(index_filename)
  print('Index is saved to disk.')
  print("Index file size: {} GB".format(
    round(os.path.getsize(index_filename) / float(1024 ** 3), 2)))
  annoy_index.unload()

  print('Saving mapping to disk...')
  with open(index_filename + '.mapping', 'wb') as handle:
    pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)
  print('Mapping is saved to disk.')
  print("Mapping file size: {} MB".format(
    round(os.path.getsize(index_filename + '.mapping') / float(1024 ** 2), 2)))
embedding_files = "{}/emb-*.tfrecords".format(output_dir)
embedding_dimension = projected_dim
index_filename = "index"

!rm {index_filename}
!rm {index_filename}.mapping

%time build_index(embedding_files, index_filename, embedding_dimension)
rm: cannot remove 'index': No such file or directory
rm: cannot remove 'index.mapping': No such file or directory
Found 1 embedding file(s).
Loading embeddings in file 1 of 1...
100000 items loaded to the index
200000 items loaded to the index
300000 items loaded to the index
400000 items loaded to the index
500000 items loaded to the index
600000 items loaded to the index
700000 items loaded to the index
800000 items loaded to the index
900000 items loaded to the index
1000000 items loaded to the index
1100000 items loaded to the index
A total of 1103664 items added to the index
Building the index with 100 trees...
Index is successfully built.
Saving index to disk...
Index is saved to disk.
Index file size: 1.7 GB
Saving mapping to disk...
Mapping is saved to disk.
Mapping file size: 50.61 MB
CPU times: user 4min, sys: 3.07 s, total: 4min 3s
Wall time: 4min 2s

ls
corpus  index.mapping         raw.tsv
index   random_projection_matrix  semantic_approximate_nearest_neighbors.ipynb

4. Use the Index for Similarity Matching

Now we can use the ANN index to find news headlines that are semantically close to an input query.

Load the index and the mapping files

# Pass the metric explicitly; it must match the metric used in build_index.
index = annoy.AnnoyIndex(embedding_dimension, metric='angular')
index.load(index_filename, prefault=True)
print('Annoy index is loaded.')
with open(index_filename + '.mapping', 'rb') as handle:
  mapping = pickle.load(handle)
print('Mapping file is loaded.')

Annoy index is loaded.
Mapping file is loaded.

Similarity matching method

def find_similar_items(embedding, num_matches=5):
  '''Finds similar items to a given embedding in the ANN index'''
  ids = index.get_nns_by_vector(
      embedding, num_matches, search_k=-1, include_distances=False)
  items = [mapping[i] for i in ids]
  return items
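The function above discards distances. If `include_distances=True` were passed instead, `get_nns_by_vector` would return a pair of lists (ids and distances). With the `angular` metric, Annoy documents the distance as the Euclidean distance between normalized vectors, that is sqrt(2 * (1 - cos)). The sketch below reproduces that relation in plain Python so that a distance can be read as a cosine similarity; the helper names are hypothetical.

```python
import math

def angular_distance(u, v):
    """Euclidean distance between the L2-normalised vectors: sqrt(2 * (1 - cos))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))

def cosine_from_angular(distance):
    """Inverts the relation above: cos = 1 - d^2 / 2."""
    return 1.0 - (distance ** 2) / 2.0

d_same = angular_distance([1.0, 1.0], [2.0, 2.0])       # same direction
d_orthogonal = angular_distance([1.0, 0.0], [0.0, 3.0])  # right angle
d_opposite = angular_distance([1.0, 0.0], [-2.0, 0.0])   # opposite direction
print(d_same, d_orthogonal, d_opposite)
```

Angular distances therefore range from 0 (same direction) to 2 (opposite direction), with sqrt(2) marking orthogonal vectors.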

Extract the embedding for a given query

# Load the TF-Hub module
print("Loading the TF-Hub module...")
g = tf.Graph()
with g.as_default():
  embed_fn = load_module(module_url)
print("TF-Hub module is loaded.")

random_projection_matrix = None
if os.path.exists('random_projection_matrix'):
  print("Loading random projection matrix...")
  with open('random_projection_matrix', 'rb') as handle:
    random_projection_matrix = pickle.load(handle)
  print('random projection matrix is loaded.')

def extract_embeddings(query):
  '''Generates the embedding for the query'''
  query_embedding =  embed_fn([query])[0]
  if random_projection_matrix is not None:
    query_embedding = query_embedding.dot(random_projection_matrix)
  return query_embedding
Loading the TF-Hub module...
INFO:tensorflow:Saver not created because there are no variables in the graph to restore

TF-Hub module is loaded.
TF-Hub module is loaded.
Loading random projection matrix...
random projection matrix is loaded.

extract_embeddings("Hello Machine Learning!")[:10]
array([ 0.01509227,  0.01280743,  0.06226483,  0.28911482,  0.00435164,
        0.0887818 , -0.16593867, -0.04508635,  0.0015209 ,  0.11720917])

Enter a query to find the most similar items


query = "confronting global challenges" 
print("Generating embedding for the query...")
%time query_embedding = extract_embeddings(query)

print("")
print("Finding relevant items in the index...")
%time items = find_similar_items(query_embedding, 10)

print("")
print("Results:")
print("=========")
for item in items:
  print(item)
Generating embedding for the query...
CPU times: user 27.2 ms, sys: 36.1 ms, total: 63.3 ms
Wall time: 8.16 ms

Finding relevant items in the index...
CPU times: user 1.43 ms, sys: 5.07 ms, total: 6.5 ms
Wall time: 873 µs

Results:
=========
confronting global challenges
challenges to austs future
ir changes unethical salvos
costello promises funding for scientific research
objections to exploration
business leaders reveal mostly positive outlook
using science to reframe the reconciliation agenda
concerns over veracity of working with vulnerable people system
review slams act approach to major events
armidale tests disaster response

Want to learn more?

You can learn more about TensorFlow at tensorflow.org and see the TF-Hub API documentation at tensorflow.org/hub. Find available TensorFlow Hub modules at tfhub.dev, including more text embedding modules and image feature vector modules.

Also check out the Machine Learning Crash Course, Google's fast-paced, practical introduction to machine learning.