This is a demo for using the Universal Encoder Multilingual Q&A model for question-answer retrieval of text, illustrating the use of the model's question_encoder and response_encoder. We use sentences from SQuAD paragraphs as the demo dataset: each sentence and its context (the text surrounding the sentence) is encoded into a high-dimensional embedding with the response_encoder. These embeddings are stored in an index built with the simpleneighbors library for question-answer retrieval.
On retrieval, a random question is selected from the SQuAD dataset and encoded into a high-dimensional embedding with the question_encoder, which is then used to query the simpleneighbors index for a list of approximate nearest neighbors in semantic space.
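Before walking through the full pipeline, here is a minimal self-contained sketch of the core idea, assuming the packages installed below; the question and candidate sentences are made up, and the model URL is the same one loaded later in this notebook. Candidates are ranked by the dot product between question and response embeddings.
# Self-contained preview of the core idea (hypothetical example inputs).
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the SentencePiece ops required by the model

m = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3")
q = m.signatures['question_encoder'](
    tf.constant(["Where is the Eiffel Tower?"]))['outputs']
r = m.signatures['response_encoder'](
    input=tf.constant(["The Eiffel Tower is in Paris.", "Bananas are yellow."]),
    context=tf.constant(["The Eiffel Tower is in Paris.", "Bananas are yellow."]))['outputs']
# Higher dot product = better answer candidate for the question.
print(tf.matmul(q, r, transpose_b=True))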
More models
You can find all currently hosted text embedding models here, and all models that have additionally been trained on SQuAD here.
Setup
Setup Environment
%%capture
# Install the latest Tensorflow version.
!pip install -q tensorflow_text
!pip install -q simpleneighbors[annoy]
!pip install -q nltk
!pip install -q tqdm
Setup common imports and functions
import json
import nltk
import os
import pprint
import random
import simpleneighbors
import urllib
from IPython.display import HTML, display
from tqdm.notebook import tqdm
import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer
nltk.download('punkt')
def download_squad(url):
  return json.load(urllib.request.urlopen(url))

def extract_sentences_from_squad_json(squad):
  all_sentences = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      sentences = nltk.tokenize.sent_tokenize(paragraph['context'])
      all_sentences.extend(zip(sentences, [paragraph['context']] * len(sentences)))
  return list(set(all_sentences))  # remove duplicates

def extract_questions_from_squad_json(squad):
  questions = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      for qas in paragraph['qas']:
        if qas['answers']:
          questions.append((qas['question'], qas['answers'][0]['text']))
  return list(set(questions))

def output_with_highlight(text, highlight):
  output = "<li> "
  i = text.find(highlight)
  while True:
    if i == -1:
      output += text
      break
    output += text[0:i]
    output += '<b>' + text[i:i+len(highlight)] + '</b>'
    text = text[i+len(highlight):]
    i = text.find(highlight)
  return output + "</li>\n"

def display_nearest_neighbors(query_text, answer_text=None):
  query_embedding = model.signatures['question_encoder'](tf.constant([query_text]))['outputs'][0]
  search_results = index.nearest(query_embedding, n=num_results)

  if answer_text:
    result_md = '''
    <p>Random Question from SQuAD:</p>
    <p> <b>%s</b></p>
    <p>Answer:</p>
    <p> <b>%s</b></p>
    ''' % (query_text, answer_text)
  else:
    result_md = '''
    <p>Question:</p>
    <p> <b>%s</b></p>
    ''' % query_text

  result_md += '''
    <p>Retrieved sentences :
    <ol>
  '''

  if answer_text:
    for s in search_results:
      result_md += output_with_highlight(s, answer_text)
  else:
    for s in search_results:
      result_md += '<li>' + s + '</li>\n'

  result_md += "</ol>"
  display(HTML(result_md))
[nltk_data] Downloading package punkt to /home/kbuilder/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Run the following code block to download and extract the SQuAD dataset into:
- sentences is a list of (text, context) tuples: each paragraph from the SQuAD dataset is split into sentences using the nltk library, and each sentence together with its paragraph text forms a (text, context) tuple.
- questions is a list of (question, answer) tuples (see the short sketch after this list for a toy example of both structures).
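As a quick sanity check, the sketch below runs the two extraction helpers defined above on a tiny, made-up SQuAD-style dict, just to show the shape of these tuples:
# Hypothetical minimal SQuAD-style input, only to illustrate the tuple shapes.
toy_squad = {
    'data': [{
        'paragraphs': [{
            'context': 'Paris is the capital of France. It lies on the Seine.',
            'qas': [{'question': 'What is the capital of France?',
                     'answers': [{'text': 'Paris'}]}],
        }]
    }]
}
# Each sentence is paired with the full paragraph it came from.
print(extract_sentences_from_squad_json(toy_squad))
# Each question is paired with the text of its first answer.
print(extract_questions_from_squad_json(toy_squad))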
Unduh dan ekstrak data SQuAD
squad_url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json'
squad_json = download_squad(squad_url)
sentences = extract_sentences_from_squad_json(squad_json)
questions = extract_questions_from_squad_json(squad_json)
print("%s sentences, %s questions extracted from SQuAD %s" % (len(sentences), len(questions), squad_url))
print("\nExample sentence and context:\n")
sentence = random.choice(sentences)
print("sentence:\n")
pprint.pprint(sentence[0])
print("\ncontext:\n")
pprint.pprint(sentence[1])
print()
10455 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Example sentence and context:

sentence:

('The Mongol Emperors had built large palaces and pavilions, but some still '
 'continued to live as nomads at times.')

context:

("Since its invention in 1269, the 'Phags-pa script, a unified script for "
 'spelling Mongolian, Tibetan, and Chinese languages, was preserved in the '
 'court until the end of the dynasty. Most of the Emperors could not master '
 'written Chinese, but they could generally converse well in the language. The '
 'Mongol custom of long standing quda/marriage alliance with Mongol clans, the '
 'Onggirat, and the Ikeres, kept the imperial blood purely Mongol until the '
 'reign of Tugh Temur, whose mother was a Tangut concubine. The Mongol '
 'Emperors had built large palaces and pavilions, but some still continued to '
 'live as nomads at times. Nevertheless, a few other Yuan emperors actively '
 'sponsored cultural activities; an example is Tugh Temur (Emperor Wenzong), '
 'who wrote poetry, painted, read Chinese classical texts, and ordered the '
 'compilation of books.')
The following code block loads the Universal Encoder Multilingual Q&A model from TensorFlow Hub; its question_encoder and response_encoder signatures are used in the rest of the notebook.
Load model from tensorflow hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3"
model = hub.load(module_url)
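As a quick check, you can list the signatures the loaded SavedModel exposes; they should include the two used in this notebook:
# List the callable signatures of the loaded SavedModel.
print(list(model.signatures.keys()))
# Expected to include 'question_encoder' and 'response_encoder'.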
The following code block computes the embeddings for all (text, context) tuples and stores them in a simpleneighbors index using the response_encoder.
Compute embeddings and build simpleneighbors index
batch_size = 100

encodings = model.signatures['response_encoder'](
    input=tf.constant([sentences[0][0]]),
    context=tf.constant([sentences[0][1]]))
index = simpleneighbors.SimpleNeighbors(
    len(encodings['outputs'][0]), metric='angular')

print('Computing embeddings for %s sentences' % len(sentences))
# Group the sentences into batches of batch_size; note that this slicing
# silently drops any final partial batch of fewer than batch_size sentences.
slices = zip(*(iter(sentences),) * batch_size)
num_batches = int(len(sentences) / batch_size)
for s in tqdm(slices, total=num_batches):
  response_batch = list([r for r, c in s])
  context_batch = list([c for r, c in s])
  encodings = model.signatures['response_encoder'](
      input=tf.constant(response_batch),
      context=tf.constant(context_batch)
  )
  for batch_index, batch in enumerate(response_batch):
    index.add_one(batch, encodings['outputs'][batch_index])

index.build()
print('simpleneighbors index for %s sentences built.' % len(sentences))
Computing embeddings for 10455 sentences

  0%|          | 0/104 [00:00<?, ?it/s]

simpleneighbors index for 10455 sentences built.
On retrieval, the question is encoded with the question_encoder, and the resulting question embedding is used to query the simpleneighbors index.
Retrieve nearest neighbors for a random question from SQuAD
num_results = 25
query = random.choice(questions)
display_nearest_neighbors(query[0], query[1])
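You can also query the index with a question of your own; when no answer text is passed, display_nearest_neighbors simply lists the retrieved sentences without highlighting. The question below is just an illustrative example:
my_query = "What script was used to write Mongolian at the Yuan court?"  # illustrative example
display_nearest_neighbors(my_query)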