Semantic Search with Approximate Nearest Neighbors and Text Embeddings

View on TensorFlow.org Run in Google Colab View source on GitHub Download notebook

This tutorial illustrates how to generate embeddings from a TensorFlow Hub (TF-Hub) module given input data, and build an approximate nearest neighbours (ANN) index using the extracted embeddings. The index can then be used for real-time similarity matching and retrieval.

When dealing with a large corpus of data, it's not efficient to perform exact matching by scanning the whole repository to find the most similar items to a given query in real-time. Thus, we use an approximate similarity matching algorithm which allows us to trade off a little bit of accuracy in finding exact nearest neighbor matches for a significant boost in speed.

In this tutorial, we show an example of real-time text search over a corpus of news headlines to find the headlines that are most similar to a query. Unlike keyword search, this captures the semantic similarity encoded in the text embedding.

The steps of this tutorial are:

  1. Download sample data.
  2. Generate embeddings for the data using a TF-Hub module
  3. Build an ANN index for the embeddings
  4. Use the index for similarity matching

We use Apache Beam to generate the embeddings from the TF-Hub module. We also use Spotify's ANNOY library to build the approximate nearest neighbours index.

Setup

Install the required libraries.

pip install -q apache_beam
pip install -q sklearn
pip install -q annoy

Import the required libraries

import os
import sys
import pickle
from collections import namedtuple
from datetime import datetime
import numpy as np
import apache_beam as beam
from apache_beam.transforms import util
import tensorflow as tf
import tensorflow_hub as hub
import annoy
from sklearn.random_projection import gaussian_random_matrix
print('TF version: {}'.format(tf.__version__))
print('TF-Hub version: {}'.format(hub.__version__))
print('Apache Beam version: {}'.format(beam.__version__))
TF version: 2.2.0
TF-Hub version: 0.8.0
Apache Beam version: 2.22.0

1. Download Sample Data

A Million News Headlines dataset contains news headlines published over a period of 15 years sourced from the reputable Australian Broadcasting Corp. (ABC). This news dataset has a summarised historical record of noteworthy events in the globe from early-2003 to end-2017 with a more granular focus on Australia.

Format: Tab-separated two-column data: 1) publication date and 2) headline text. We are only interested in the headline text.

wget 'https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true' -O raw.tsv
wc -l raw.tsv
head raw.tsv
--2020-06-12 11:59:40--  https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true
Resolving dataverse.harvard.edu (dataverse.harvard.edu)... 206.191.184.198
Connecting to dataverse.harvard.edu (dataverse.harvard.edu)|206.191.184.198|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57600231 (55M) [text/tab-separated-values]
Saving to: ‘raw.tsv’

raw.tsv             100%[===================>]  54.93M  14.7MB/s    in 4.4s    

2020-06-12 11:59:46 (12.4 MB/s) - ‘raw.tsv’ saved [57600231/57600231]

1103664 raw.tsv
publish_date    headline_text
20030219    "aba decides against community broadcasting licence"
20030219    "act fire witnesses must be aware of defamation"
20030219    "a g calls for infrastructure protection summit"
20030219    "air nz staff in aust strike for pay rise"
20030219    "air nz strike to affect australian travellers"
20030219    "ambitious olsson wins triple jump"
20030219    "antic delighted with record breaking barca"
20030219    "aussie qualifier stosur wastes four memphis match"
20030219    "aust addresses un security council over iraq"

For simplicity, we only keep the headline text and remove the publication date

!rm -r corpus
!mkdir corpus

with open('corpus/text.txt', 'w') as out_file:
  with open('raw.tsv', 'r') as in_file:
    for line in in_file:
      headline = line.split('\t')[1].strip().strip('"')
      out_file.write(headline+"\n")
rm: cannot remove 'corpus': No such file or directory

tail corpus/text.txt
severe storms forecast for nye in south east queensland
snake catcher pleads for people not to kill reptiles
south australia prepares for party to welcome new year
strikers cool off the heat with big win in adelaide
stunning images from the sydney to hobart yacht
the ashes smiths warners near miss liven up boxing day test
timelapse: brisbanes new year fireworks
what 2017 meant to the kids of australia
what the papodopoulos meeting may mean for ausus
who is george papadopoulos the former trump campaign aide

2. Generate Embeddings for the Data.

In this tutorial, we use the Neural Network Language Model (NNLM) to generate embeddings for the headline data. The sentence embeddings can then be easily used to compute sentence level meaning similarity. We run the embedding generation process using Apache Beam.

Embedding extraction method

embed_fn = None

def generate_embeddings(text, module_url, random_projection_matrix=None):
  # Beam will run this function in different processes that need to
  # import hub and load embed_fn (if not previously loaded)
  global embed_fn
  if embed_fn is None:
    embed_fn = hub.load(module_url)
  embedding = embed_fn(text).numpy()
  if random_projection_matrix is not None:
    embedding = embedding.dot(random_projection_matrix)
  return text, embedding

Convert to tf.Example method

def to_tf_example(entries):
  examples = []

  text_list, embedding_list = entries
  for i in range(len(text_list)):
    text = text_list[i]
    embedding = embedding_list[i]

    features = {
        'text': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[text.encode('utf-8')])),
        'embedding': tf.train.Feature(
            float_list=tf.train.FloatList(value=embedding.tolist()))
    }
  
    example = tf.train.Example(
        features=tf.train.Features(
            feature=features)).SerializeToString(deterministic=True)
  
    examples.append(example)
  
  return examples

Beam pipeline

def run_hub2emb(args):
  '''Runs the embedding generation pipeline'''

  options = beam.options.pipeline_options.PipelineOptions(**args)
  args = namedtuple("options", args.keys())(*args.values())

  with beam.Pipeline(args.runner, options=options) as pipeline:
    (
        pipeline
        | 'Read sentences from files' >> beam.io.ReadFromText(
            file_pattern=args.data_dir)
        | 'Batch elements' >> util.BatchElements(
            min_batch_size=args.batch_size, max_batch_size=args.batch_size)
        | 'Generate embeddings' >> beam.Map(
            generate_embeddings, args.module_url, args.random_projection_matrix)
        | 'Encode to tf example' >> beam.FlatMap(to_tf_example)
        | 'Write to TFRecords files' >> beam.io.WriteToTFRecord(
            file_path_prefix='{}/emb'.format(args.output_dir),
            file_name_suffix='.tfrecords')
    )

Generaring Random Projection Weight Matrix

Random projection is a simple, yet powerfull technique used to reduce the dimensionality of a set of points which lie in Euclidean space. For a theoretical background, see the Johnson-Lindenstrauss lemma.

Reducing the dimensionality of the embeddings with random projection means less time needed to build and query the ANN index.

In this tutorial we use Gaussian Random Projection from the Scikit-learn library.

def generate_random_projection_weights(original_dim, projected_dim):
  random_projection_matrix = None
  random_projection_matrix = gaussian_random_matrix(
      n_components=projected_dim, n_features=original_dim).T
  print("A Gaussian random weight matrix was creates with shape of {}".format(random_projection_matrix.shape))
  print('Storing random projection matrix to disk...')
  with open('random_projection_matrix', 'wb') as handle:
    pickle.dump(random_projection_matrix, 
                handle, protocol=pickle.HIGHEST_PROTOCOL)
        
  return random_projection_matrix

Set parameters

If you want to build an index using the original embedding space without random projection, set the projected_dim parameter to None. Note that this will slow down the indexing step for high-dimensional embeddings.

module_url = 'https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1' 
projected_dim = 64  

Run pipeline

import tempfile

output_dir = tempfile.mkdtemp()
original_dim = hub.load(module_url)(['']).shape[1]
random_projection_matrix = None

if projected_dim:
  random_projection_matrix = generate_random_projection_weights(
      original_dim, projected_dim)

args = {
    'job_name': 'hub2emb-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S')),
    'runner': 'DirectRunner',
    'batch_size': 1024,
    'data_dir': 'corpus/*.txt',
    'output_dir': output_dir,
    'module_url': module_url,
    'random_projection_matrix': random_projection_matrix,
}

print("Pipeline args are set.")
args
A Gaussian random weight matrix was creates with shape of (128, 64)
Storing random projection matrix to disk...
Pipeline args are set.

/home/kbuilder/.local/lib/python3.6/site-packages/sklearn/utils/deprecation.py:86: FutureWarning: Function gaussian_random_matrix is deprecated; gaussian_random_matrix is deprecated in 0.22 and will be removed in version 0.24.
  warnings.warn(msg, category=FutureWarning)

{'job_name': 'hub2emb-200612-120003',
 'runner': 'DirectRunner',
 'batch_size': 1024,
 'data_dir': 'corpus/*.txt',
 'output_dir': '/tmp/tmpms6ms0_c',
 'module_url': 'https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1',
 'random_projection_matrix': array([[ 0.16854957, -0.03325581,  0.02865259, ...,  0.21883682,
         -0.09481203,  0.32354785],
        [-0.20655215,  0.11371441,  0.34903099, ...,  0.06248171,
         -0.147435  , -0.10130263],
        [-0.00954473, -0.12573907, -0.10185224, ..., -0.22453107,
          0.0753951 ,  0.01065091],
        ...,
        [-0.1518803 ,  0.07516912, -0.13614433, ..., -0.21580604,
         -0.28232779, -0.07034141],
        [-0.23765885, -0.08143739, -0.10982153, ...,  0.29401845,
         -0.31201466, -0.24825458],
        [ 0.01055139,  0.15181688, -0.01028288, ...,  0.06804364,
          0.09340422,  0.00996166]])}
print("Running pipeline...")
%time run_hub2emb(args)
print("Pipeline is done.")
WARNING:apache_beam.runners.interactive.interactive_environment:Dependencies required for Interactive Beam PCollection visualization are not available, please use: `pip install apache-beam[interactive]` to install necessary dependencies to enable all data visualization features.

Running pipeline...

Warning:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.

CPU times: user 9min 7s, sys: 9min 58s, total: 19min 5s
Wall time: 2min 26s
Pipeline is done.

ls {output_dir}
emb-00000-of-00001.tfrecords

Read some of the generated embeddings...

embed_file = os.path.join(output_dir, 'emb-00000-of-00001.tfrecords')
sample = 5

# Create a description of the features.
feature_description = {
    'text': tf.io.FixedLenFeature([], tf.string),
    'embedding': tf.io.FixedLenFeature([projected_dim], tf.float32)
}

def _parse_example(example):
  # Parse the input `tf.Example` proto using the dictionary above.
  return tf.io.parse_single_example(example, feature_description)

dataset = tf.data.TFRecordDataset(embed_file)
for record in dataset.take(sample).map(_parse_example):
  print("{}: {}".format(record['text'].numpy().decode('utf-8'), record['embedding'].numpy()[:10]))

headline_text: [ 0.16609916 -0.03438292  0.18651083 -0.11887558  0.12700227 -0.00260543
 -0.1694575   0.03915567 -0.20600714 -0.08033914]
aba decides against community broadcasting licence: [ 0.00164881  0.03420356  0.20437077 -0.15911008  0.01062698 -0.19206303
  0.01367363  0.04079304 -0.06455765 -0.11617825]
act fire witnesses must be aware of defamation: [ 0.03333034  0.13002941 -0.07350241 -0.10036179 -0.12197241 -0.20040572
  0.01972399  0.056002   -0.09918054 -0.10233164]
a g calls for infrastructure protection summit: [ 0.07078221  0.08030576 -0.27882683  0.00746678  0.14450745 -0.05204003
  0.08401233  0.3760463  -0.04126653 -0.15640926]
air nz staff in aust strike for pay rise: [ 0.07462363  0.0608657  -0.2512632  -0.27842686 -0.06503167 -0.04509769
  0.15789445  0.03054392  0.06708738 -0.3316408 ]

3. Build the ANN Index for the Embeddings

ANNOY (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory. It is built and used by Spotify for music recommendations.

def build_index(embedding_files_pattern, index_filename, vector_length, 
    metric='angular', num_trees=100):
  '''Builds an ANNOY index'''

  annoy_index = annoy.AnnoyIndex(vector_length, metric=metric)
  # Mapping between the item and its identifier in the index
  mapping = {}

  embed_files = tf.io.gfile.glob(embedding_files_pattern)
  num_files = len(embed_files)
  print('Found {} embedding file(s).'.format(num_files))

  item_counter = 0
  for i, embed_file in enumerate(embed_files):
    print('Loading embeddings in file {} of {}...'.format(i+1, num_files))
    dataset = tf.data.TFRecordDataset(embed_file)
    for record in dataset.map(_parse_example):
      text = record['text'].numpy().decode("utf-8")
      embedding = record['embedding'].numpy()
      mapping[item_counter] = text
      annoy_index.add_item(item_counter, embedding)
      item_counter += 1
      if item_counter % 100000 == 0:
        print('{} items loaded to the index'.format(item_counter))

  print('A total of {} items added to the index'.format(item_counter))

  print('Building the index with {} trees...'.format(num_trees))
  annoy_index.build(n_trees=num_trees)
  print('Index is successfully built.')
  
  print('Saving index to disk...')
  annoy_index.save(index_filename)
  print('Index is saved to disk.')
  print("Index file size: {} GB".format(
    round(os.path.getsize(index_filename) / float(1024 ** 3), 2)))
  annoy_index.unload()

  print('Saving mapping to disk...')
  with open(index_filename + '.mapping', 'wb') as handle:
    pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)
  print('Mapping is saved to disk.')
  print("Mapping file size: {} MB".format(
    round(os.path.getsize(index_filename + '.mapping') / float(1024 ** 2), 2)))
embedding_files = "{}/emb-*.tfrecords".format(output_dir)
embedding_dimension = projected_dim
index_filename = "index"

!rm {index_filename}
!rm {index_filename}.mapping

%time build_index(embedding_files, index_filename, embedding_dimension)
rm: cannot remove 'index': No such file or directory
rm: cannot remove 'index.mapping': No such file or directory
Found 1 embedding file(s).
Loading embeddings in file 1 of 1...
100000 items loaded to the index
200000 items loaded to the index
300000 items loaded to the index
400000 items loaded to the index
500000 items loaded to the index
600000 items loaded to the index
700000 items loaded to the index
800000 items loaded to the index
900000 items loaded to the index
1000000 items loaded to the index
1100000 items loaded to the index
A total of 1103664 items added to the index
Building the index with 100 trees...
Index is successfully built.
Saving index to disk...
Index is saved to disk.
Index file size: 1.61 GB
Saving mapping to disk...
Mapping is saved to disk.
Mapping file size: 50.61 MB
CPU times: user 7min 45s, sys: 52 s, total: 8min 37s
Wall time: 7min 37s

ls
corpus         random_projection_matrix
index          raw.tsv
index.mapping  tf2_semantic_approximate_nearest_neighbors.ipynb

4. Use the Index for Similarity Matching

Now we can use the ANN index to find news headlines that are semantically close to an input query.

Load the index and the mapping files

index = annoy.AnnoyIndex(embedding_dimension)
index.load(index_filename, prefault=True)
print('Annoy index is loaded.')
with open(index_filename + '.mapping', 'rb') as handle:
  mapping = pickle.load(handle)
print('Mapping file is loaded.')

Annoy index is loaded.

/tmpfs/src/tf_docs_env/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: The default argument for metric will be removed in future version of Annoy. Please pass metric='angular' explicitly.
  """Entry point for launching an IPython kernel.

Mapping file is loaded.

Similarity matching method

def find_similar_items(embedding, num_matches=5):
  '''Finds similar items to a given embedding in the ANN index'''
  ids = index.get_nns_by_vector(
  embedding, num_matches, search_k=-1, include_distances=False)
  items = [mapping[i] for i in ids]
  return items

Extract embedding from a given query

# Load the TF-Hub module
print("Loading the TF-Hub module...")
%time embed_fn = hub.load(module_url)
print("TF-Hub module is loaded.")

random_projection_matrix = None
if os.path.exists('random_projection_matrix'):
  print("Loading random projection matrix...")
  with open('random_projection_matrix', 'rb') as handle:
    random_projection_matrix = pickle.load(handle)
  print('random projection matrix is loaded.')

def extract_embeddings(query):
  '''Generates the embedding for the query'''
  query_embedding =  embed_fn([query])[0].numpy()
  if random_projection_matrix is not None:
    query_embedding = query_embedding.dot(random_projection_matrix)
  return query_embedding

Loading the TF-Hub module...
CPU times: user 726 ms, sys: 621 ms, total: 1.35 s
Wall time: 1.34 s
TF-Hub module is loaded.
Loading random projection matrix...
random projection matrix is loaded.

extract_embeddings("Hello Machine Learning!")[:10]
array([-0.09563696, -0.05436445,  0.16022189,  0.04329931,  0.2186187 ,
        0.09864157, -0.10316328, -0.06747011,  0.0178634 ,  0.02822987])

Enter a query to find the most similar items


query = "confronting global challenges" 

print("Generating embedding for the query...")
%time query_embedding = extract_embeddings(query)

print("")
print("Finding relevant items in the index...")
%time items = find_similar_items(query_embedding, 10)

print("")
print("Results:")
print("=========")
for item in items:
  print(item)
Generating embedding for the query...
CPU times: user 625 µs, sys: 4.54 ms, total: 5.17 ms
Wall time: 2.95 ms

Finding relevant items in the index...
CPU times: user 572 µs, sys: 370 µs, total: 942 µs
Wall time: 643 µs

Results:
=========
confronting global challenges
territory to face unique climate challenges
g20 vital to easing global market turmoil
conference examines challenges facing major cities
indian economy faces continued challenges
global markets edge higher despite greece jitters
world vision warns of worsening drought
advertising faces modern challenges
global markets take fresh battering amid european concerns
global response

Want to learn more?

You can learn more about TensorFlow at tensorflow.org and see the TF-Hub API documentation at tensorflow.org/hub. Find available TensorFlow Hub modules at tfhub.dev including more text embedding modules and image feature vector modules.

Also check out the Machine Learning Crash Course which is Google's fast-paced, practical introduction to machine learning.