Multilingual Universal Sentence Encoder Q&A Retrieval

View on TensorFlow.org Run in Google Colab View source on GitHub Download notebook

This is a demo for using Univeral Encoder Multilingual Q&A model for question-answer retrieval of text, illustrating the use of question_encoder and response_encoder of the model. We use sentences from SQuAD paragraphs as the demo dataset, each sentence and its context (the text surrounding the sentence) is encoded into high dimension embeddings with the response_encoder. These embeddings are stored in an index built using the simpleneighbors library for question-answer retrieval.

On retrieval a random question is selected from the SQuAD dataset and encoded into high dimension embedding with the question_encoder and query the simpleneighbors index returning a list of approximate nearest neighbors in semantic space.

Setup

%%capture

# Install the latest Tensorflow version.
!pip install -q tensorflow_text
!pip install -q simpleneighbors[annoy]
!pip install -q nltk
!pip install -q tqdm

import json
import nltk
import os
import pprint
import random
import simpleneighbors
import urllib
from IPython.display import HTML, display
from tqdm.notebook import tqdm

import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer

nltk.download('punkt')


def download_squad(url):
  return json.load(urllib.request.urlopen(url))

def extract_sentences_from_squad_json(squad):
  all_sentences = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      sentences = nltk.tokenize.sent_tokenize(paragraph['context'])
      all_sentences.extend(zip(sentences, [paragraph['context']] * len(sentences)))
  return list(set(all_sentences)) # remove duplicates

def extract_questions_from_squad_json(squad):
  questions = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      for qas in paragraph['qas']:
        if qas['answers']:
          questions.append((qas['question'], qas['answers'][0]['text']))
  return list(set(questions))

def output_with_highlight(text, highlight):
  output = "<li> "
  i = text.find(highlight)
  while True:
    if i == -1:
      output += text
      break
    output += text[0:i]
    output += '<b>'+text[i:i+len(highlight)]+'</b>'
    text = text[i+len(highlight):]
    i = text.find(highlight)
  return output + "</li>\n"

def display_nearest_neighbors(query_text, answer_text=None):
  query_embedding = model.signatures['question_encoder'](tf.constant([query_text]))['outputs'][0]
  search_results = index.nearest(query_embedding, n=num_results)

  if answer_text:
    result_md = '''
    <p>Random Question from SQuAD:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    <p>Answer:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % (query_text , answer_text)
  else:
    result_md = '''
    <p>Question:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % query_text

  result_md += '''
    <p>Retrieved sentences :
    <ol>
  '''

  if answer_text:
    for s in search_results:
      result_md += output_with_highlight(s, answer_text)
  else:
    for s in search_results:
      result_md += '<li>' + s + '</li>\n'

  result_md += "</ol>"
  display(HTML(result_md))
[nltk_data] Downloading package punkt to /home/kbuilder/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

Run the following code block to download and extract the SQuAD dataset into:

  • sentences is a list of (text, context) tuples - each paragraph from the SQuAD dataset are splitted into sentences using nltk library and the sentence and paragraph text forms the (text, context) tuple.
  • questions is a list of (question, answer) tuples.

squad_url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json' 

squad_json = download_squad(squad_url)
sentences = extract_sentences_from_squad_json(squad_json)
questions = extract_questions_from_squad_json(squad_json)
print("%s sentences, %s questions extracted from SQuAD %s" % (len(sentences), len(questions), squad_url))

print("\nExample sentence and context:\n")
sentence = random.choice(sentences)
print("sentence:\n")
pprint.pprint(sentence[0])
print("\ncontext:\n")
pprint.pprint(sentence[1])
print()
10455 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Example sentence and context:

sentence:

('In the words of the Secretary General of the United Nations Ban Ki-Moon: '
 '"While economic growth is necessary, it is not sufficient for progress on '
 'reducing poverty."')

context:

('While acknowledging the central role economic growth can potentially play in '
 'human development, poverty reduction and the achievement of the Millennium '
 'Development Goals, it is becoming widely understood amongst the development '
 'community that special efforts must be made to ensure poorer sections of '
 'society are able to participate in economic growth. The effect of economic '
 'growth on poverty reduction – the growth elasticity of poverty – can depend '
 'on the existing level of inequality. For instance, with low inequality a '
 'country with a growth rate of 2% per head and 40% of its population living '
 'in poverty, can halve poverty in ten years, but a country with high '
 'inequality would take nearly 60 years to achieve the same reduction. In the '
 'words of the Secretary General of the United Nations Ban Ki-Moon: "While '
 'economic growth is necessary, it is not sufficient for progress on reducing '
 'poverty."')


The following code block setup the tensorflow graph g and session with the Univeral Encoder Multilingual Q&A model's question_encoder and response_encoder signatures.


module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3" 
model = hub.load(module_url)

The following code block compute the embeddings for all the text, context tuples and store them in a simpleneighbors index using the response_encoder.


batch_size = 100

encodings = model.signatures['response_encoder'](
  input=tf.constant([sentences[0][0]]),
  context=tf.constant([sentences[0][1]]))
index = simpleneighbors.SimpleNeighbors(
    len(encodings['outputs'][0]), metric='angular')

print('Computing embeddings for %s sentences' % len(sentences))
slices = zip(*(iter(sentences),) * batch_size)
num_batches = int(len(sentences) / batch_size)
for s in tqdm(slices, total=num_batches):
  response_batch = list([r for r, c in s])
  context_batch = list([c for r, c in s])
  encodings = model.signatures['response_encoder'](
    input=tf.constant(response_batch),
    context=tf.constant(context_batch)
  )
  for batch_index, batch in enumerate(response_batch):
    index.add_one(batch, encodings['outputs'][batch_index])

index.build()
print('simpleneighbors index for %s sentences built.' % len(sentences))

Computing embeddings for 10455 sentences

HBox(children=(FloatProgress(value=0.0, max=104.0), HTML(value='')))

simpleneighbors index for 10455 sentences built.

On retrieval, the question is encoded using the question_encoder and the question embedding is used to query the simpleneighbors index.


num_results = 25 

query = random.choice(questions)
display_nearest_neighbors(query[0], query[1])