Multilingual Universal Sentence Encoder Q&A Retrieval

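This is a demo of using the Universal Encoder Multilingual Q&A model for question-answer retrieval of text, illustrating the use of the model's question_encoder and response_encoder. We use sentences from SQuAD paragraphs as the demo dataset. Each sentence and its context (the text surrounding the sentence) is encoded into a high-dimensional embedding with the response_encoder. These embeddings are stored in an index built with the simpleneighbors library and used for question-answer retrieval.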

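On retrieval, a random question is selected from the SQuAD dataset and encoded into a high-dimensional embedding with the question_encoder. Querying the simpleneighbors index with that embedding returns a list of nearest neighbors in semantic space.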

More models

All currently hosted text embedding models, as well as all models that have been trained on SQuAD, are listed on TF Hub.

Setup

Setup Environment

Setup common imports and functions

[nltk_data] Downloading package punkt to /home/kbuilder/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

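Run the following code block to download the SQuAD dataset and extract it into:

  • sentences is a list of (text, context) tuples. Each paragraph from the SQuAD dataset is split into sentences using the NLTK library, and each sentence is paired with its paragraph text to form a (text, context) tuple.
  • questions is a list of (question, answer) tuples.

Note: You can use this demo to index the SQuAD train dataset or the smaller dev dataset (1.1 or 2.0) by selecting the squad_url below.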
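The extraction described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's own cell: the function names, the regex sentence splitter (a stand-in for NLTK), and the toy record are all assumptions made for the example. Only the nested `data` / `paragraphs` / `context` / `qas` layout comes from the SQuAD JSON format.

```python
import re


def naive_sent_split(paragraph):
    """Stand-in sentence splitter (assumption); the demo itself uses NLTK."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def extract_sentences(squad, split=naive_sent_split):
    """Return a list of (text, context) tuples, one per sentence."""
    sentences = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for sentence in split(context):
                sentences.append((sentence, context))
    return sentences


def extract_questions(squad):
    """Return a list of (question, answer) tuples."""
    questions = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # SQuAD 2.0 contains unanswerable questions with no answers.
                if qa["answers"]:
                    questions.append((qa["question"], qa["answers"][0]["text"]))
    return questions


# Tiny in-memory record in the SQuAD layout (made up for illustration).
toy_context = "Doc Films is a student film society. WHPK is a radio station."
toy_squad = {
    "data": [{
        "paragraphs": [{
            "context": toy_context,
            "qas": [{
                "question": "What is WHPK?",
                "answers": [{"text": "a radio station", "answer_start": 45}],
            }],
        }],
    }],
}

sentences = extract_sentences(toy_squad)
questions = extract_questions(toy_squad)
```

Passing a different `split` callable (for example NLTK's sentence tokenizer) would change only how paragraphs are cut into sentences. The tuple shapes stay the same.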

Download and extract SQuAD data

10455 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Example sentence and context:

sentence:

('Students at the University of Chicago run over 400 clubs and organizations '
 'known as Recognized Student Organizations (RSOs).')

context:

('Students at the University of Chicago run over 400 clubs and organizations '
 'known as Recognized Student Organizations (RSOs). These include cultural and '
 'religious groups, academic clubs and teams, and common-interest '
 'organizations. Notable extracurricular groups include the University of '
 'Chicago College Bowl Team, which has won 118 tournaments and 15 national '
 "championships, leading both categories internationally. The university's "
 'competitive Model United Nations team was the top ranked team in North '
 "America in 2013-14 and 2014-2015. Among notable RSOs are the nation's "
 'longest continuously running student film society Doc Films, organizing '
 'committee for the University of Chicago Scavenger Hunt, the twice-weekly '
 'student newspaper The Chicago Maroon, the alternative weekly student '
 "newspaper South Side Weekly, the nation's second oldest continuously running "
 'student improvisational theater troupe Off-Off Campus, and the '
 'university-owned radio station WHPK.')

The following code block sets up the TensorFlow graph g and session with the question_encoder and response_encoder signatures of the Universal Encoder Multilingual Q&A model.

Load model from tensorflow hub

The following code block computes embeddings for all (text, context) tuples with the response_encoder and stores them in a simpleneighbors index.
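As a rough sketch of that loop, the tuples can be encoded in fixed-size batches and collected into an index. Everything named here is hypothetical: `fake_response_encoder` merely stands in for the model's response_encoder signature, the batch size is an assumption, and a plain Python list stands in for the simpleneighbors index.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_response_encoder(texts, contexts):
    """Hypothetical stand-in for response_encoder: one small vector per pair."""
    return [[float(len(t)), float(len(c))] for t, c in zip(texts, contexts)]


def build_index(sentences, encode, batch_size=100):
    """Encode (text, context) tuples batch by batch; return (text, vector) pairs."""
    index = []
    for batch in batches(sentences, batch_size):
        texts = [text for text, _ in batch]
        contexts = [context for _, context in batch]
        vectors = encode(texts, contexts)
        index.extend(zip(texts, vectors))
    return index


# Made-up data: five (text, context) tuples, encoded two at a time.
toy_sentences = [("s%d" % i, "context %d" % i) for i in range(5)]
toy_index = build_index(toy_sentences, fake_response_encoder, batch_size=2)
```

Batching keeps each encoder call small, which matters when the real model processes thousands of sentences.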

Compute embeddings and build simpleneighbors index

Computing embeddings for 10455 sentences
0%|          | 0/104 [00:00<?, ?it/s]
simpleneighbors index for 10455 sentences built.

On retrieval, the question is encoded with the question_encoder, and the resulting question embedding is used to query the simpleneighbors index.
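The query step can be illustrated with a brute-force search over a toy index. This is a sketch under stated assumptions: the vectors are invented, `query_vec` pretends to be the output of the question_encoder, and cosine similarity with an exhaustive sort is swapped in for whatever metric and search structure the simpleneighbors index actually uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(index, query_vec, n=3):
    """Return the texts of the n entries most similar to query_vec."""
    ranked = sorted(index, key=lambda item: cosine(item[1], query_vec), reverse=True)
    return [text for text, _ in ranked[:n]]


# Toy index of (text, vector) pairs; the vectors are made up for illustration.
toy_index = [
    ("WHPK is the university-owned radio station.", [0.9, 0.1, 0.0]),
    ("The College Bowl Team has won 118 tournaments.", [0.1, 0.9, 0.1]),
    ("Doc Films is a student film society.", [0.0, 0.2, 0.9]),
]

query_vec = [0.8, 0.2, 0.1]  # pretend embedding of a question about the radio station
results = nearest(toy_index, query_vec, n=2)
```

With these made-up vectors, the radio station sentence ranks first because its vector points in nearly the same direction as the query.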

Retrieve nearest neighbors for a random question from SQuAD