This demo uses the Universal Sentence Encoder Multilingual Q&A model for question-answer retrieval of text, illustrating the model's question_encoder and response_encoder. The demo dataset consists of sentences from SQuAD paragraphs. Each sentence and its context (the text surrounding the sentence) are encoded into a high-dimensional embedding with the response_encoder. These embeddings are stored in an index built with the simpleneighbors library for question-answer retrieval.
At retrieval time, a random question is selected from the SQuAD dataset and encoded into a high-dimensional embedding with the question_encoder. That embedding is used to query the simpleneighbors index, which returns a list of approximate nearest neighbors in semantic space.
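To make "nearest neighbors in semantic space" concrete, here is a minimal sketch of exact, brute-force retrieval by cosine similarity, which yields the same ordering as an angular distance. It is not how simpleneighbors works internally (that index is approximate); the helper names and the 3-dimensional toy vectors are hypothetical and stand in for real model embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, items, n=2):
    """Return the n labels whose vectors are most similar to query_vec.

    items is a list of (label, vector) pairs. This scans every item, so it
    is exact but slow; an approximate index trades accuracy for speed.
    """
    ranked = sorted(
        items,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:n]]

# Hypothetical toy "embeddings"; real ones would come from the model.
toy_items = [
    ("sentence about space", [0.9, 0.1, 0.0]),
    ("sentence about cooking", [0.0, 0.2, 0.9]),
    ("sentence about rockets", [0.8, 0.3, 0.1]),
]
toy_query = [1.0, 0.2, 0.0]
top_two = nearest(toy_query, toy_items, n=2)
print(top_two)
```

In the demo itself, the response_encoder produces the item vectors and the question_encoder produces the query vector, so a question lands near the sentences likely to answer it.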
%%capture
# Install TensorFlow Text 2.8.x and the other dependencies.
!pip install -q "tensorflow-text==2.8.*"
!pip install -q simpleneighbors[annoy]
!pip install -q nltk
!pip install -q tqdm
Set up common imports and functions
import json
import nltk
import os
import pprint
import random
import simpleneighbors
import urllib.request
from IPython.display import HTML, display
from tqdm.notebook import tqdm

import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
# Importing tensorflow_text registers the ops the model needs.
from tensorflow_text import SentencepieceTokenizer

nltk.download('punkt')


def download_squad(url):
  return json.load(urllib.request.urlopen(url))

def extract_sentences_from_squad_json(squad):
  all_sentences = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      sentences = nltk.tokenize.sent_tokenize(paragraph['context'])
      # Pair every sentence with the full paragraph it came from.
      all_sentences.extend(zip(sentences, [paragraph['context']] * len(sentences)))
  return list(set(all_sentences)) # remove duplicates

def extract_questions_from_squad_json(squad):
  questions = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      for qas in paragraph['qas']:
        if qas['answers']:
          # Keep only the first reference answer.
          questions.append((qas['question'], qas['answers'][0]['text']))
  return list(set(questions))

def output_with_highlight(text, highlight):
  # Wrap every occurrence of highlight in <b> tags inside a list item.
  output = "<li> "
  i = text.find(highlight)
  while True:
    if i == -1:
      output += text
      break
    output += text[0:i]
    output += '<b>'+text[i:i+len(highlight)]+'</b>'
    text = text[i+len(highlight):]
    i = text.find(highlight)
  return output + "</li>\n"

def display_nearest_neighbors(query_text, answer_text=None):
  # Encode the question and take the single embedding from the batch.
  query_embedding = model.signatures['question_encoder'](tf.constant([query_text]))['outputs'][0]
  search_results = index.nearest(query_embedding, n=num_results)

  if answer_text:
    result_md = '''
    <p>Random Question from SQuAD:</p>
    <p> <b>%s</b></p>
    <p>Answer:</p>
    <p> <b>%s</b></p>
    ''' % (query_text , answer_text)
  else:
    result_md = '''
    <p>Question:</p>
    <p> <b>%s</b></p>
    ''' % query_text

  result_md += '''
    <p>Retrieved sentences :
    <ol>
  '''

  if answer_text:
    for s in search_results:
      result_md += output_with_highlight(s, answer_text)
  else:
    for s in search_results:
      result_md += '<li>' + s + '</li>\n'

  result_md += "</ol>"
  display(HTML(result_md))
[nltk_data] Downloading package punkt to /home/kbuilder/nltk_data... [nltk_data] Unzipping tokenizers/punkt.zip.
Run the following code block to download and extract the SQuAD dataset into:
- sentences is a list of (text, context) tuples. Each paragraph from the SQuAD dataset is split into sentences using the nltk library, and each sentence together with its paragraph text forms one (text, context) tuple.
- questions is a list of (question, answer) tuples.
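The shape of these two lists can be illustrated on a tiny, made-up record. The sketch below assumes the SQuAD v1.1 JSON layout (data, paragraphs, context, qas, answers) and uses a naive regex splitter as a stand-in for nltk so that it runs without downloads; `toy_squad` and `naive_sentence_split` are hypothetical names, not part of the notebook.

```python
import re

def naive_sentence_split(paragraph):
    """Split on whitespace that follows '.', '!' or '?' (crude nltk stand-in)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', paragraph.strip()) if s]

# Hypothetical miniature record in the SQuAD v1.1 JSON layout.
toy_squad = {
    "data": [{
        "paragraphs": [{
            "context": "The sky is blue. Grass is green.",
            "qas": [{
                "question": "What color is the sky?",
                "answers": [{"text": "blue", "answer_start": 11}],
            }],
        }],
    }],
}

toy_sentences = []
toy_questions = []
for article in toy_squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        # Each sentence is paired with the full paragraph it came from.
        toy_sentences.extend((s, context) for s in naive_sentence_split(context))
        for qa in paragraph["qas"]:
            if qa["answers"]:
                # Keep only the first reference answer.
                toy_questions.append((qa["question"], qa["answers"][0]["text"]))

print(toy_sentences)
print(toy_questions)
```

Every sentence carries its whole paragraph as context, so the same context string appears once per sentence in that paragraph.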
Download and extract SQuAD data
squad_url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json'

squad_json = download_squad(squad_url)
sentences = extract_sentences_from_squad_json(squad_json)
questions = extract_questions_from_squad_json(squad_json)
print("%s sentences, %s questions extracted from SQuAD %s" % (len(sentences), len(questions), squad_url))

# Each entry in sentences is a (text, context) tuple.
print("\nExample sentence and context:\n")
sentence = random.choice(sentences)
print("sentence:\n")
pprint.pprint(sentence[0])
print("\ncontext:\n")
pprint.pprint(sentence[1])
print()
10452 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Example sentence and context:

sentence:

('Deke Slayton, the grounded Mercury astronaut who became Director of Flight '
 'Crew Operations for the Gemini and Apollo programs, selected the first '
 'Apollo crew in January 1966, with Grissom as Command Pilot, White as Senior '
 'Pilot, and rookie Donn F. Eisele as Pilot.')

context:

('Deke Slayton, the grounded Mercury astronaut who became Director of Flight '
 'Crew Operations for the Gemini and Apollo programs, selected the first '
 'Apollo crew in January 1966, with Grissom as Command Pilot, White as Senior '
 'Pilot, and rookie Donn F. Eisele as Pilot. But Eisele dislocated his '
 'shoulder twice aboard the KC135 weightlessness training aircraft, and had to '
 'undergo surgery on January 27. Slayton replaced him with Chaffee. NASA '
 'announced the final crew selection for AS-204 on March 21, 1966, with the '
 'backup crew consisting of Gemini veterans James McDivitt and David Scott, '
 'with rookie Russell L. "Rusty" Schweickart. Mercury/Gemini veteran Wally '
 'Schirra, Eisele, and rookie Walter Cunningham were announced on September 29 '
 'as the prime crew for AS-205.')
The following code block loads the Universal Sentence Encoder Multilingual Q&A model from TensorFlow Hub, which exposes the question_encoder and response_encoder signatures.
Load model from TensorFlow Hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3"
model = hub.load(module_url)
The following code block computes the embeddings for all the (text, context) tuples using the response_encoder and stores them in a simpleneighbors index.
Compute embeddings and build simpleneighbors index
batch_size = 100

# Encode one example first to discover the embedding dimensionality.
encodings = model.signatures['response_encoder'](
  input=tf.constant([sentences[0][0]]),
  context=tf.constant([sentences[0][1]]))
index = simpleneighbors.SimpleNeighbors(
    len(encodings['outputs'][0]), metric='angular')

print('Computing embeddings for %s sentences' % len(sentences))
# Group sentences into tuples of batch_size items.
# Note: this zip idiom drops a trailing partial batch.
slices = zip(*(iter(sentences),) * batch_size)
num_batches = int(len(sentences) / batch_size)
for s in tqdm(slices, total=num_batches):
  response_batch = list([r for r, c in s])
  context_batch = list([c for r, c in s])
  encodings = model.signatures['response_encoder'](
    input=tf.constant(response_batch),
    context=tf.constant(context_batch)
  )
  # Store each sentence text under its embedding.
  for batch_index, batch in enumerate(response_batch):
    index.add_one(batch, encodings['outputs'][batch_index])

index.build()
print('simpleneighbors index for %s sentences built.' % len(sentences))
Computing embeddings for 10452 sentences 0%| | 0/104 [00:00<?, ?it/s] simpleneighbors index for 10452 sentences built.
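The batching loop above groups sentences with zip(*(iter(sentences),) * batch_size), an idiom that only emits full groups. With the 10452 sentences and batch_size of 100 reported above, that gives 104 batches, so the last 52 sentences would not be added to the index. The sketch below shows one possible alternative that keeps the trailing partial batch; `batched` and `toy_pairs` are hypothetical names used only for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, including a shorter final slice."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical stand-in for the list of (text, context) tuples.
toy_pairs = [("sentence %d" % i, "context %d" % i) for i in range(7)]

# The zip idiom only emits full groups, so the 7th pair is lost.
full_only = list(zip(*(iter(toy_pairs),) * 3))

# Slicing keeps the trailing partial batch.
with_remainder = list(batched(toy_pairs, 3))

covered_by_zip = sum(len(batch) for batch in full_only)
covered_by_slices = sum(len(batch) for batch in with_remainder)
print(covered_by_zip, covered_by_slices)
```

If this approach were used in the notebook, the progress-bar total would need to be math.ceil(len(sentences) / batch_size) rather than the integer division.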
On retrieval, the question is encoded using the question_encoder and the question embedding is used to query the simpleneighbors index.
Retrieve nearest neighbors for a random question from SQuAD
num_results = 25

# Each entry in questions is a (question, answer) tuple.
query = random.choice(questions)
display_nearest_neighbors(query[0], query[1])