Multilingual Universal Sentence Encoder Q&A Retrieval

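This demo uses the Universal Encoder Multilingual Q&A model for question-answer retrieval of text, and illustrates how to use the model's question_encoder and response_encoder. Sentences from SQuAD paragraphs serve as the demo dataset. Each sentence and its context (the text surrounding the sentence) is encoded into a high-dimensional embedding with the response_encoder. These embeddings are stored in an index built with the simpleneighbors library so that they can be used for question-answer retrieval.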

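On retrieval, a question is selected at random from the SQuAD dataset and encoded into a high-dimensional embedding with the question_encoder. Querying the simpleneighbors index with that embedding returns a list of nearest neighbors in semantic space.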
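To make the retrieval step concrete, here is a minimal sketch of nearest-neighbor search over embeddings, done as a brute-force cosine-similarity scan in plain Python. The three-dimensional vectors and the sentences are made-up stand-ins for real encoder outputs, and `cosine_similarity` and `nearest` are hypothetical helper names. The demo itself uses a simpleneighbors index rather than a linear scan.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query_vec, indexed, n=2):
    """Return the n indexed texts most similar to query_vec (linear scan)."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True)
    return [text for text, _ in ranked[:n]]

# Toy "response" embeddings (assumed values, not real model outputs).
toy_index = [
    ("Paris is the capital of France.", [0.9, 0.1, 0.0]),
    ("The Nile flows through Egypt.", [0.0, 0.2, 0.9]),
    ("France borders Spain.", [0.7, 0.6, 0.1]),
]

# Toy "question" embedding, standing in for an encoded question about France.
toy_query = [1.0, 0.2, 0.0]

results = nearest(toy_query, toy_index, n=2)
print(results)
```

A linear scan like this costs one similarity computation per indexed sentence, which is why a dedicated index becomes attractive as the collection grows.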

More models

See TF Hub for the currently hosted text embedding models and for all models that were also trained on SQuAD.

Setup

Setup Environment

%%capture
# Install tensorflow-text, pinned to 2.11.* (pip pulls in a matching TensorFlow).
!pip install -q "tensorflow-text==2.11.*"
!pip install -q simpleneighbors[annoy]
!pip install -q nltk
!pip install -q tqdm

Setup common imports and functions

import json
import nltk
import os
import pprint
import random
import simpleneighbors
import urllib.request  # `import urllib` alone does not guarantee the request submodule is loaded
from IPython.display import HTML, display
from tqdm.notebook import tqdm

import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
# Imported for its side effect: registers the SentencePiece ops the model needs.
from tensorflow_text import SentencepieceTokenizer

nltk.download('punkt')


def download_squad(url):
  return json.load(urllib.request.urlopen(url))

def extract_sentences_from_squad_json(squad):
  all_sentences = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      sentences = nltk.tokenize.sent_tokenize(paragraph['context'])
      all_sentences.extend(zip(sentences, [paragraph['context']] * len(sentences)))
  return list(set(all_sentences)) # remove duplicates

def extract_questions_from_squad_json(squad):
  questions = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      for qas in paragraph['qas']:
        if qas['answers']:
          questions.append((qas['question'], qas['answers'][0]['text']))
  return list(set(questions))

def output_with_highlight(text, highlight):
  # Build one HTML list item, wrapping every occurrence of `highlight` in <b>.
  output = "<li> "
  i = text.find(highlight)
  while True:
    if i == -1:
      output += text
      break
    output += text[0:i]
    output += '<b>'+text[i:i+len(highlight)]+'</b>'
    text = text[i+len(highlight):]
    i = text.find(highlight)
  return output + "</li>\n"

def display_nearest_neighbors(query_text, answer_text=None):
  # Relies on the globals `model`, `index` and `num_results` defined in later cells.
  query_embedding = model.signatures['question_encoder'](tf.constant([query_text]))['outputs'][0]
  search_results = index.nearest(query_embedding, n=num_results)

  if answer_text:
    result_md = '''
    <p>Random Question from SQuAD:</p>
    <p><b>%s</b></p>
    <p>Answer:</p>
    <p><b>%s</b></p>
    ''' % (query_text , answer_text)
  else:
    result_md = '''
    <p>Question:</p>
    <p><b>%s</b></p>
    ''' % query_text

  result_md += '''
    <p>Retrieved sentences:</p>
    <ol>
  '''

  if answer_text:
    for s in search_results:
      result_md += output_with_highlight(s, answer_text)
  else:
    for s in search_results:
      result_md += '<li>' + s + '</li>\n'

  result_md += "</ol>"
  display(HTML(result_md))

[nltk_data] Downloading package punkt to /home/kbuilder/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

Run the following code block to download the SQuAD dataset and extract it into:

sentences: a list of (text, context) tuples. Each paragraph in the SQuAD dataset is split into sentences with the nltk library, and each sentence together with the text of its paragraph forms a (text, context) tuple.
questions: a list of (question, answer) tuples.
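The shape of these two lists can be illustrated on a tiny SQuAD-shaped record. This is only a sketch: the record content is made up, and `naive_sent_split` is a hypothetical regex-based stand-in for nltk's sentence tokenizer so that the example runs without downloading tokenizer data.

```python
import re

# A tiny SQuAD-shaped record (made-up content, same nesting as the real JSON).
toy_squad = {
    "data": [{
        "paragraphs": [{
            "context": "Annoy builds trees. It answers queries quickly.",
            "qas": [
                {"question": "What does Annoy build?",
                 "answers": [{"text": "trees"}]},
                # A question without answers is skipped below.
                {"question": "Who wrote it?", "answers": []},
            ],
        }]
    }]
}

def naive_sent_split(paragraph):
    """Split on whitespace that follows '.', '!' or '?' (stand-in for nltk)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

toy_sentences = []
toy_questions = []
for article in toy_squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        # Pair every sentence with the full paragraph it came from.
        toy_sentences.extend((s, context) for s in naive_sent_split(context))
        for qa in paragraph["qas"]:
            if qa["answers"]:
                # Keep only the first listed answer for each question.
                toy_questions.append((qa["question"], qa["answers"][0]["text"]))

print(toy_sentences)
print(toy_questions)
```

Every sentence carries the whole paragraph as its context, so the same context string is repeated once per sentence.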

Note: You can use this demo to index the SQuAD train dataset or the smaller dev dataset (1.1 or 2.0) by selecting the squad_url below.
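One way to switch between these datasets is to build the URL from a split and a version. The helper below is a hypothetical sketch: only the dev-v1.1 URL appears in this demo, and the other file names are assumed to follow the same `<split>-v<version>.json` pattern on the same host.

```python
# Base path taken from the dev-v1.1 URL used in this demo.
SQUAD_BASE_URL = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/'

def squad_file_url(split='dev', version='1.1'):
    """Build a SQuAD JSON URL (assumed naming pattern: <split>-v<version>.json)."""
    if split not in ('train', 'dev'):
        raise ValueError("split must be 'train' or 'dev', got %r" % split)
    if version not in ('1.1', '2.0'):
        raise ValueError("version must be '1.1' or '2.0', got %r" % version)
    return '%s%s-v%s.json' % (SQUAD_BASE_URL, split, version)

squad_url_example = squad_file_url('dev', '1.1')
print(squad_url_example)
```

The train files are much larger than the dev files, so indexing them takes correspondingly longer.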

Download and extract SQuAD data

squad_url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json'

squad_json = download_squad(squad_url)
sentences = extract_sentences_from_squad_json(squad_json)
questions = extract_questions_from_squad_json(squad_json)
print("%s sentences, %s questions extracted from SQuAD %s" % (len(sentences), len(questions), squad_url))

print("\nExample sentence and context:\n")
sentence = random.choice(sentences)
print("sentence:\n")
pprint.pprint(sentence[0])
print("\ncontext:\n")
pprint.pprint(sentence[1])
print()

10455 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Example sentence and context:

sentence:

('The internationalist tendencies of the early revolution would be abandoned '
 'until they returned in the framework of a client state in competition with '
 'the Americans during the Cold War.')

context:

('Trotsky, and others, believed that the revolution could only succeed in '
 'Russia as part of a world revolution. Lenin wrote extensively on the matter '
 'and famously declared that Imperialism was the highest stage of capitalism. '
 "However, after Lenin's death, Joseph Stalin established 'socialism in one "
 "country' for the Soviet Union, creating the model for subsequent inward "
 'looking Stalinist states and purging the early Internationalist elements. '
 'The internationalist tendencies of the early revolution would be abandoned '
 'until they returned in the framework of a client state in competition with '
 'the Americans during the Cold War. With the beginning of the new era, the '
 'after Stalin period called the "thaw", in the late 1950s, the new political '
 'leader Nikita Khrushchev put even more pressure on the Soviet-American '
 'relations starting a new wave of anti-imperialist propaganda. In his speech '
 'on the UN conference in 1960, he announced the continuation of the war on '
 'imperialism, stating that soon the people of different countries will come '
 'together and overthrow their imperialist leaders. Although the Soviet Union '
 'declared itself anti-imperialist, critics argue that it exhibited tendencies '
 'common to historic empires. Some scholars hold that the Soviet Union was a '
 'hybrid entity containing elements common to both multinational empires and '
 'nation states. It has also been argued that the USSR practiced colonialism '
 'as did other imperial powers and was carrying on the old Russian tradition '
 'of expansion and control. Mao Zedong once argued that the Soviet Union had '
 'itself become an imperialist power while maintaining a socialist façade. '
 'Moreover, the ideas of imperialism were widely spread in action on the '
 'higher levels of government. Non Russian Marxists within the Russian '
 'Federation and later the USSR, like Sultan Galiev and Vasyl Shakhrai, '
 'considered the Soviet Regime a renewed version of the Russian imperialism '
 'and colonialism.')

The following code block loads the Universal Encoder Multilingual Q&A model from TensorFlow Hub with hub.load. Its question_encoder and response_encoder signatures are then available through model.signatures.

Load model from tensorflow hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3"
model = hub.load(module_url)

The following code block computes the embeddings for all (text, context) tuples with the response_encoder and stores them in a simpleneighbors index.
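Encoding is usually done in batches rather than one tuple at a time. As a minimal sketch of the batching alone (no model involved), slicing with a step keeps the final, shorter batch, whereas grouping idioms built on zip stop at the last full batch. `batched` is a hypothetical helper name and the tuples are placeholders.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items; the last one may be shorter."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder (text, context) tuples standing in for SQuAD sentences.
toy_pairs = [("sentence %d" % i, "context %d" % i) for i in range(7)]

batch_lengths = []
for batch in batched(toy_pairs, 3):
    # Split each batch into the two parallel lists an encoder would receive.
    texts = [text for text, _ in batch]
    contexts = [context for _, context in batch]
    batch_lengths.append(len(texts))

print(batch_lengths)  # → [3, 3, 1]
```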

Compute embeddings and build simpleneighbors index

batch_size = 100

# Encode a single example first to discover the embedding dimensionality.
encodings = model.signatures['response_encoder'](
  input=tf.constant([sentences[0][0]]),
  context=tf.constant([sentences[0][1]]))
index = simpleneighbors.SimpleNeighbors(
    len(encodings['outputs'][0]), metric='angular')

print('Computing embeddings for %s sentences' % len(sentences))
# Step through the data in slices so that the final, shorter batch is
# also encoded and no sentences are dropped.
for start in tqdm(range(0, len(sentences), batch_size)):
  batch = sentences[start:start + batch_size]
  response_batch = [r for r, c in batch]
  context_batch = [c for r, c in batch]
  encodings = model.signatures['response_encoder'](
    input=tf.constant(response_batch),
    context=tf.constant(context_batch)
  )
  for batch_index, response in enumerate(response_batch):
    index.add_one(response, encodings['outputs'][batch_index])

index.build()
print('simpleneighbors index for %s sentences built.' % len(sentences))

Computing embeddings for 10455 sentences
simpleneighbors index for 10455 sentences built.

On retrieval, the question is encoded with the question_encoder, and the resulting question embedding is used to query the simpleneighbors index.
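The index in this demo is created with metric='angular'. Assuming the Annoy backend installed above, that distance is the Euclidean distance between the normalized vectors, sqrt(2 - 2*cos(u, v)), so ranking by ascending angular distance gives the same order as ranking by descending cosine similarity. The sketch below checks this on made-up two-dimensional vectors; `cosine` and `angular_distance` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angular_distance(u, v):
    """sqrt(2 - 2*cos(u, v)): Euclidean distance between the normalized vectors."""
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 2.0 - 2.0 * cosine(u, v)))

# Made-up vectors; vector length does not matter, only direction.
query = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

by_distance = sorted(
    candidates, key=lambda k: angular_distance(query, candidates[k]))
by_similarity = sorted(
    candidates, key=lambda k: cosine(query, candidates[k]), reverse=True)

print(by_distance)
print(by_similarity)
```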

Retrieve nearest neighbors for a random question from SQuAD

num_results = 25

query = random.choice(questions)
display_nearest_neighbors(query[0], query[1])
