This is a demo for using the Universal Sentence Encoder Multilingual Q&A model for question-answer retrieval of text, illustrating the use of the model's question_encoder and response_encoder. We use sentences from SQuAD paragraphs as the demo dataset: each sentence and its context (the text surrounding the sentence) is encoded into a high-dimensional embedding with the response_encoder. These embeddings are stored in an index built with the simpleneighbors library for question-answer retrieval.
At retrieval time, a random question is selected from the SQuAD dataset and encoded into a high-dimensional embedding with the question_encoder, which is then used to query the simpleneighbors index for a list of approximate nearest neighbors in semantic space.
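Schematically, the two signatures are used like this (a sketch only, not runnable on its own; model, sentences, question, and index are all constructed in the cells below):
# Indexing: encode every (sentence, context) pair with the response_encoder
# and add the resulting embeddings to a nearest-neighbor index.
response_embeddings = model.signatures['response_encoder'](
    input=tf.constant([s for s, c in sentences]),
    context=tf.constant([c for s, c in sentences]))['outputs']

# Retrieval: encode the question with the question_encoder and look up its
# approximate nearest neighbors among the response embeddings.
question_embedding = model.signatures['question_encoder'](
    tf.constant([question]))['outputs'][0]
nearest_sentences = index.nearest(question_embedding, n=10)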
More models
You can find all currently hosted text embedding models, as well as all models that have been trained on SQuAD, on tfhub.dev.
Setup
Setup Environment
%%capture
# Install TensorFlow Text (pinned to match the TensorFlow 2.11 release line).
!pip install -q "tensorflow-text==2.11.*"
!pip install -q simpleneighbors[annoy]
!pip install -q nltk
!pip install -q tqdm
Set up common imports and functions
import json
import nltk
import os
import pprint
import random
import simpleneighbors
import urllib
from IPython.display import HTML, display
from tqdm.notebook import tqdm
import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer
nltk.download('punkt')
def download_squad(url):
  return json.load(urllib.request.urlopen(url))

def extract_sentences_from_squad_json(squad):
  # Split every paragraph into sentences and pair each sentence with its paragraph.
  all_sentences = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      sentences = nltk.tokenize.sent_tokenize(paragraph['context'])
      all_sentences.extend(zip(sentences, [paragraph['context']] * len(sentences)))
  return list(set(all_sentences))  # remove duplicates

def extract_questions_from_squad_json(squad):
  # Collect (question, first answer) pairs for questions that have an answer.
  questions = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      for qas in paragraph['qas']:
        if qas['answers']:
          questions.append((qas['question'], qas['answers'][0]['text']))
  return list(set(questions))

def output_with_highlight(text, highlight):
  # Wrap `text` in an HTML list item, bolding every occurrence of `highlight`.
  output = "<li> "
  i = text.find(highlight)
  while True:
    if i == -1:
      output += text
      break
    output += text[0:i]
    output += '<b>' + text[i:i + len(highlight)] + '</b>'
    text = text[i + len(highlight):]
    i = text.find(highlight)
  return output + "</li>\n"
def display_nearest_neighbors(query_text, answer_text=None):
  # Encode the question and query the index for its nearest response sentences.
  query_embedding = model.signatures['question_encoder'](
      tf.constant([query_text]))['outputs'][0]
  search_results = index.nearest(query_embedding, n=num_results)

  if answer_text:
    result_md = '''
    <p>Random Question from SQuAD:</p>
    <p> <b>%s</b></p>
    <p>Answer:</p>
    <p> <b>%s</b></p>
    ''' % (query_text, answer_text)
  else:
    result_md = '''
    <p>Question:</p>
    <p> <b>%s</b></p>
    ''' % query_text

  result_md += '''
  <p>Retrieved sentences:
  <ol>
  '''

  if answer_text:
    # Bold the answer text wherever it appears in the retrieved sentences.
    for s in search_results:
      result_md += output_with_highlight(s, answer_text)
  else:
    for s in search_results:
      result_md += '<li>' + s + '</li>\n'

  result_md += "</ol>"
  display(HTML(result_md))
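As a quick check of the helper above, output_with_highlight wraps a sentence in an HTML list item and bolds every occurrence of the highlight string (the example strings here are made up for illustration):
print(output_with_highlight(
    'The pump pressurizes the working fluid before it enters the boiler.',
    'working fluid'))
# <li> The pump pressurizes the <b>working fluid</b> before it enters the boiler.</li>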
Run the following code block to download and extract the SQuAD dataset into:
- sentences: a list of (text, context) tuples. Each paragraph from the SQuAD dataset is split into sentences with the nltk library, and each sentence together with its paragraph text forms a (text, context) tuple.
- questions: a list of (question, answer) tuples.
Download and extract SQuAD data
squad_url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json'
squad_json = download_squad(squad_url)
sentences = extract_sentences_from_squad_json(squad_json)
questions = extract_questions_from_squad_json(squad_json)
print("%s sentences, %s questions extracted from SQuAD %s" % (len(sentences), len(questions), squad_url))
print("\nExample sentence and context:\n")
sentence = random.choice(sentences)
print("sentence:\n")
pprint.pprint(sentence[0])
print("\ncontext:\n")
pprint.pprint(sentence[1])
print()
10455 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Example sentence and context:

sentence:

('In this cycle a pump is used to pressurize the working fluid which is '
 'received from the condenser as a liquid not as a gas.')

context:

('The Rankine cycle is sometimes referred to as a practical Carnot cycle '
 'because, when an efficient turbine is used, the TS diagram begins to '
 'resemble the Carnot cycle. The main difference is that heat addition (in the '
 'boiler) and rejection (in the condenser) are isobaric (constant pressure) '
 'processes in the Rankine cycle and isothermal (constant temperature) '
 'processes in the theoretical Carnot cycle. In this cycle a pump is used to '
 'pressurize the working fluid which is received from the condenser as a '
 'liquid not as a gas. Pumping the working fluid in liquid form during the '
 'cycle requires a small fraction of the energy to transport it compared to '
 'the energy needed to compress the working fluid in gaseous form in a '
 'compressor (as in the Carnot cycle). The cycle of a reciprocating steam '
 'engine differs from that of turbines because of condensation and '
 're-evaporation occurring in the cylinder or in the steam inlet passages.')
The following code block loads the Universal Sentence Encoder Multilingual Q&A model from TensorFlow Hub, including its question_encoder and response_encoder signatures.
Load model from TensorFlow Hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3"
model = hub.load(module_url)
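If you want to confirm what the loaded SavedModel exposes, you can list its signatures and score one question against one indexed sentence directly. The dot product below mirrors the angular metric used by the index later on; the example question string is made up for illustration:
print(list(model.signatures.keys()))  # ['question_encoder', 'response_encoder']

# Score a hypothetical question against the first (sentence, context) pair.
q_emb = model.signatures['question_encoder'](
    tf.constant(['What is used to pressurize the working fluid?']))['outputs'][0]
r_emb = model.signatures['response_encoder'](
    input=tf.constant([sentences[0][0]]),
    context=tf.constant([sentences[0][1]]))['outputs'][0]
print(float(tf.reduce_sum(q_emb * r_emb)))  # higher means more related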
The following code block computes the embeddings for all the (text, context) tuples with the response_encoder and stores them in a simpleneighbors index.
Compute embeddings and build simpleneighbors index
batch_size = 100

# Encode one example first to determine the embedding dimensionality for the index.
encodings = model.signatures['response_encoder'](
    input=tf.constant([sentences[0][0]]),
    context=tf.constant([sentences[0][1]]))
index = simpleneighbors.SimpleNeighbors(
    len(encodings['outputs'][0]), metric='angular')

print('Computing embeddings for %s sentences' % len(sentences))

# Group the sentences into batches of batch_size. Note that this zip idiom
# drops any final partial batch (here, len(sentences) % batch_size sentences).
slices = zip(*(iter(sentences),) * batch_size)
num_batches = int(len(sentences) / batch_size)

for s in tqdm(slices, total=num_batches):
  response_batch = list([r for r, c in s])
  context_batch = list([c for r, c in s])
  encodings = model.signatures['response_encoder'](
      input=tf.constant(response_batch),
      context=tf.constant(context_batch)
  )
  for batch_index, batch in enumerate(response_batch):
    index.add_one(batch, encodings['outputs'][batch_index])

index.build()
print('simpleneighbors index for %s sentences built.' % len(sentences))
Computing embeddings for 10455 sentences
simpleneighbors index for 10455 sentences built.
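As noted in the comment above, the zip grouping drops any final partial batch, so the last len(sentences) % batch_size sentences are never added to the index. If you want every sentence indexed, one possible alternative is a small batching helper (batched is a hypothetical helper, not part of the original demo):
from itertools import islice

def batched(iterable, size):
  # Yield successive lists of up to `size` items, keeping the final partial batch.
  it = iter(iterable)
  while chunk := list(islice(it, size)):
    yield chunk

num_batches = -(-len(sentences) // batch_size)  # ceiling division
for s in tqdm(batched(sentences, batch_size), total=num_batches):
  ...  # encode the batch and add it to the index, as above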
At retrieval time, the question is encoded with the question_encoder, and the resulting question embedding is used to query the simpleneighbors index.
Retrieve nearest neighbors for a random question from SQuAD
num_results = 25
query = random.choice(questions)
display_nearest_neighbors(query[0], query[1])
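Since display_nearest_neighbors also works without an answer_text argument, you can query the index with a question of your own; the example below is a made-up question about the Rankine cycle context shown earlier. Because the model is multilingual, the question does not have to be in English.
display_nearest_neighbors('What is the Rankine cycle sometimes referred to as?')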