Aprenda o que há de mais recente em aprendizado de máquina, IA generativa e muito mais no WiML Symposium 2023 Registre-se

Recuperação de perguntas e respostas do codificador de frases universais multilíngue

Ver no TensorFlow.org

Executar no Google Colab

Ver no GitHub

Baixar caderno

Veja os modelos TF Hub

Esta é uma demonstração para a utilização Universal Encoder Multilingual Q & A modelo para recuperação de pergunta-resposta de texto, ilustrando o uso de question_encoder e response_encoder do modelo. Nós usamos frases do esquadrão parágrafos como o conjunto de dados demo, cada frase e seu contexto (o texto em torno da frase) é codificado em embeddings altos dimensão com o response_encoder. Estes mergulhos são armazenados em um índice construído usando o simpleneighbors biblioteca para recuperação de pergunta-resposta.

Em recuperação de uma pergunta aleatória é selecionado a partir do esquadrão conjunto de dados e codificado em dimensão elevada incorporação com o question_encoder e consulta o índice simpleneighbors retornar uma lista de vizinhos mais próximos aproximados no espaço semântico.

Mais modelos

Você pode encontrar todo o texto atualmente hospedado incorporar modelos aqui e todos os modelos que foram treinados no esquadrão bem aqui .

Configurar

Ambiente de Configuração

%%capture
# Install the latest Tensorflow version.
!pip install -q tensorflow_text
!pip install -q simpleneighbors[annoy]
!pip install -q nltk
!pip install -q tqdm

Configurar importações e funções comuns

import json
import nltk
import os
import pprint
import random
import simpleneighbors
import urllib
from IPython.display import HTML, display
from tqdm.notebook import tqdm

import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer

nltk.download('punkt')


def download_squad(url):
  return json.load(urllib.request.urlopen(url))

def extract_sentences_from_squad_json(squad):
  all_sentences = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      sentences = nltk.tokenize.sent_tokenize(paragraph['context'])
      all_sentences.extend(zip(sentences, [paragraph['context']] * len(sentences)))
  return list(set(all_sentences)) # remove duplicates

def extract_questions_from_squad_json(squad):
  questions = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      for qas in paragraph['qas']:
        if qas['answers']:
          questions.append((qas['question'], qas['answers'][0]['text']))
  return list(set(questions))

def output_with_highlight(text, highlight):
  output = "<li> "
  i = text.find(highlight)
  while True:
    if i == -1:
      output += text
      break
    output += text[0:i]
    output += '<b>'+text[i:i+len(highlight)]+'</b>'
    text = text[i+len(highlight):]
    i = text.find(highlight)
  return output + "</li>\n"

def display_nearest_neighbors(query_text, answer_text=None):
  query_embedding = model.signatures['question_encoder'](tf.constant([query_text]))['outputs'][0]
  search_results = index.nearest(query_embedding, n=num_results)

  if answer_text:
    result_md = '''
    <p>Random Question from SQuAD:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    <p>Answer:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % (query_text , answer_text)
  else:
    result_md = '''
    <p>Question:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % query_text

  result_md += '''
    <p>Retrieved sentences :
    <ol>
  '''

  if answer_text:
    for s in search_results:
      result_md += output_with_highlight(s, answer_text)
  else:
    for s in search_results:
      result_md += '<li>' + s + '</li>\n'

  result_md += "</ol>"
  display(HTML(result_md))

[nltk_data] Downloading package punkt to /home/kbuilder/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

Execute o seguinte bloco de código para baixar e extrair o conjunto de dados SQuAD para:

frases é uma lista de (texto, contexto) tuplas - cada parágrafo do esquadrão conjunto de dados está dividida em frases usando biblioteca nltk ea sentença e parágrafo formas de texto (texto, contexto) tupla.
perguntas é uma lista de (pergunta, resposta) tuplas.

Baixe e extraia dados SQuAD

squad_url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json'

squad_json = download_squad(squad_url)
sentences = extract_sentences_from_squad_json(squad_json)
questions = extract_questions_from_squad_json(squad_json)
print("%s sentences, %s questions extracted from SQuAD %s" % (len(sentences), len(questions), squad_url))

print("\nExample sentence and context:\n")
sentence = random.choice(sentences)
print("sentence:\n")
pprint.pprint(sentence[0])
print("\ncontext:\n")
pprint.pprint(sentence[1])
print()

10455 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Example sentence and context:

sentence:

('The Mongol Emperors had built large palaces and pavilions, but some still '
 'continued to live as nomads at times.')

context:

("Since its invention in 1269, the 'Phags-pa script, a unified script for "
 'spelling Mongolian, Tibetan, and Chinese languages, was preserved in the '
 'court until the end of the dynasty. Most of the Emperors could not master '
 'written Chinese, but they could generally converse well in the language. The '
 'Mongol custom of long standing quda/marriage alliance with Mongol clans, the '
 'Onggirat, and the Ikeres, kept the imperial blood purely Mongol until the '
 'reign of Tugh Temur, whose mother was a Tangut concubine. The Mongol '
 'Emperors had built large palaces and pavilions, but some still continued to '
 'live as nomads at times. Nevertheless, a few other Yuan emperors actively '
 'sponsored cultural activities; an example is Tugh Temur (Emperor Wenzong), '
 'who wrote poetry, painted, read Chinese classical texts, and ordered the '
 'compilation of books.')

A configuração seguinte bloco de código do tensorflow gráfico g e sessão com o Universal Encoder Multilingual Q & A modelo question_encoder 's e assinaturas response_encoder.

Carregar modelo do hub tensorflow

module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3"
model = hub.load(module_url)

O bloco de código a seguir calcular os mergulhos para todos o texto, tuplas de contexto e armazená-los em um simpleneighbors índice usando a response_encoder.

Compute embeddings e crie índice de vizinhos simples

batch_size = 100

encodings = model.signatures['response_encoder'](
  input=tf.constant([sentences[0][0]]),
  context=tf.constant([sentences[0][1]]))
index = simpleneighbors.SimpleNeighbors(
    len(encodings['outputs'][0]), metric='angular')

print('Computing embeddings for %s sentences' % len(sentences))
slices = zip(*(iter(sentences),) * batch_size)
num_batches = int(len(sentences) / batch_size)
for s in tqdm(slices, total=num_batches):
  response_batch = list([r for r, c in s])
  context_batch = list([c for r, c in s])
  encodings = model.signatures['response_encoder'](
    input=tf.constant(response_batch),
    context=tf.constant(context_batch)
  )
  for batch_index, batch in enumerate(response_batch):
    index.add_one(batch, encodings['outputs'][batch_index])

index.build()
print('simpleneighbors index for %s sentences built.' % len(sentences))

Computing embeddings for 10455 sentences
0%|          | 0/104 [00:00<?, ?it/s]
simpleneighbors index for 10455 sentences built.

Na recuperação, a questão é codificado utilizando o question_encoder e a incorporação de questão é usado para consultar o índice simpleneighbors.

Recupere os vizinhos mais próximos para uma pergunta aleatória do SQuAD

num_results = 25

query = random.choice(questions)
display_nearest_neighbors(query[0], query[1])