Solve GLUE tasks using BERT on TPU

BERT can be used to solve many problems in natural language processing. You will learn how to fine-tune BERT for many tasks from the GLUE benchmark:

CoLA (Corpus of Linguistic Acceptability): Is the sentence grammatically correct?
SST-2 (Stanford Sentiment Treebank): The task is to predict the sentiment of a given sentence.
MRPC (Microsoft Research Paraphrase Corpus): Determine whether a pair of sentences are semantically equivalent.
QQP (Quora Question Pairs2): Determine whether a pair of questions are semantically equivalent.
MNLI (Multi-Genre Natural Language Inference): Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral).
QNLI (Question-answering Natural Language Inference): The task is to determine whether the context sentence contains the answer to the question.
RTE (Recognizing Textual Entailment): Determine whether a sentence entails a given hypothesis.
WNLI (Winograd Natural Language Inference): The task is to predict whether the sentence with the pronoun substituted is entailed by the original sentence.

This tutorial contains complete end-to-end code to train these models on a TPU. You can also run this notebook on a GPU by changing one line (described below).

In this notebook, you will:

Load a BERT model from TensorFlow Hub
Choose one of the GLUE tasks and download the dataset
Preprocess the text
Fine-tune BERT (examples are given for single-sentence and multi-sentence datasets)
Save the trained model and use it

Setup

You will use a separate model to preprocess text before using it to fine-tune BERT. This model depends on tensorflow/text, which you will install below.

pip install -q -U tensorflow-text

You will use the AdamW optimizer from tensorflow/models to fine-tune BERT, which you will install as well.

pip install -q -U tf-models-official

pip install -U tfds-nightly

import os
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
import tensorflow_text as text  # A dependency of the preprocessing model
import tensorflow_addons as tfa
from official.nlp import optimization
import numpy as np

tf.get_logger().setLevel('ERROR')

Next, configure TFHub to read checkpoints directly from TFHub's Cloud Storage buckets. This is only recommended when running TFHub models on TPU.

Without this setting, TFHub would download the compressed file and extract the checkpoint locally. Attempting to load from these local files will fail with the following error:

InvalidArgumentError: Unimplemented: File system scheme '[local]' not implemented

This is because the TPU can only read directly from Cloud Storage buckets.

os.environ["TFHUB_MODEL_LOAD_FORMAT"]="UNCOMPRESSED"

Connect to the TPU worker

The following code connects to the TPU worker and changes TensorFlow's default device to the CPU device on the TPU worker. It also defines a TPU distribution strategy that you will use to distribute model training onto the 8 separate TPU cores available on this one TPU worker. See TensorFlow's TPU guide for more information.

import os

# .get() avoids a KeyError on runtimes without a TPU, so the GPU branch below can run.
if os.environ.get('COLAB_TPU_ADDR'):
  cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
  tf.config.experimental_connect_to_cluster(cluster_resolver)
  tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
  strategy = tf.distribute.TPUStrategy(cluster_resolver)
  print('Using TPU')
elif tf.config.list_physical_devices('GPU'):
  strategy = tf.distribute.MirroredStrategy()
  print('Using GPU')
else:
  raise ValueError('Running on CPU is not recommended.')

Using TPU
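
The branch order of the cell above (TPU first, then GPU, otherwise fail) can be sketched without TensorFlow. This is only an illustration: pick_accelerator is a hypothetical helper, the address is a made-up placeholder, and the gpu_devices list stands in for the result of tf.config.list_physical_devices('GPU'). It also shows why reading the environment with .get() matters: indexing a missing key raises KeyError before the GPU branch is ever reached.

```python
def pick_accelerator(env, gpu_devices):
    """Return 'TPU' or 'GPU', mirroring the branch order of the setup cell.

    env: a mapping such as os.environ.
    gpu_devices: a list standing in for tf.config.list_physical_devices('GPU').
    """
    # .get() returns None for a missing key, so a GPU-only runtime
    # falls through to the next branch instead of raising KeyError.
    if env.get('COLAB_TPU_ADDR'):
        return 'TPU'
    if gpu_devices:
        return 'GPU'
    raise ValueError('Running on CPU is not recommended.')


# Hypothetical environments, for illustration only.
print(pick_accelerator({'COLAB_TPU_ADDR': '10.0.0.2:8470'}, []))
print(pick_accelerator({}, ['GPU:0']))
```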

Loading models from TensorFlow Hub

Here you can choose which BERT model you will load from TensorFlow Hub and fine-tune. There are multiple BERT models available to choose from.

BERT-Base, Uncased and seven more models with trained weights released by the original BERT authors.
Small BERTs have the same general architecture but fewer and/or smaller Transformer blocks, which lets you explore tradeoffs between speed, size and quality.
ALBERT: four different sizes of "A Lite BERT" that reduces model size (but not computation time) by sharing parameters between layers.
BERT Experts: eight models that all have the BERT-base architecture but offer a choice between different pre-training domains, to align more closely with the target task.
Electra has the same architecture as BERT (in three different sizes), but gets pre-trained as a discriminator in a set-up that resembles a Generative Adversarial Network (GAN).
BERT with Talking-Heads Attention and Gated GELU [base, large] has two improvements to the core of the Transformer architecture.

See the model documentation linked above for more details.

In this tutorial, you will start with BERT-base. You can use larger and more recent models for higher accuracy, or smaller models for faster training times. To change the model, you only need to switch a single line of code (shown below). All of the differences are encapsulated in the SavedModel you will download from TensorFlow Hub.

Choose a BERT model to fine-tune

bert_model_name = 'bert_en_uncased_L-12_H-768_A-12' 

map_name_to_handle = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3',
    'bert_en_uncased_L-24_H-1024_A-16':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/3',
    'bert_en_wwm_uncased_L-24_H-1024_A-16':
        'https://tfhub.dev/tensorflow/bert_en_wwm_uncased_L-24_H-1024_A-16/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3',
    'bert_en_cased_L-24_H-1024_A-16':
        'https://tfhub.dev/tensorflow/bert_en_cased_L-24_H-1024_A-16/3',
    'bert_en_wwm_cased_L-24_H-1024_A-16':
        'https://tfhub.dev/tensorflow/bert_en_wwm_cased_L-24_H-1024_A-16/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_base/2',
    'albert_en_large':
        'https://tfhub.dev/tensorflow/albert_en_large/2',
    'albert_en_xlarge':
        'https://tfhub.dev/tensorflow/albert_en_xlarge/2',
    'albert_en_xxlarge':
        'https://tfhub.dev/tensorflow/albert_en_xxlarge/2',
    'electra_small':
        'https://tfhub.dev/google/electra_small/2',
    'electra_base':
        'https://tfhub.dev/google/electra_base/2',
    'experts_pubmed':
        'https://tfhub.dev/google/experts/bert/pubmed/2',
    'experts_wiki_books':
        'https://tfhub.dev/google/experts/bert/wiki_books/2',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1',
    'talking-heads_large':
        'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_large/1',
}

map_model_to_preprocess = {
    'bert_en_uncased_L-24_H-1024_A-16':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_en_wwm_cased_L-24_H-1024_A-16':
        'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3',
    'bert_en_cased_L-24_H-1024_A-16':
        'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3',
    'bert_en_wwm_uncased_L-24_H-1024_A-16':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'albert_en_large':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'albert_en_xlarge':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'albert_en_xxlarge':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'electra_small':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'electra_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_pubmed':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_wiki_books':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'talking-heads_large':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
}

tfhub_handle_encoder = map_name_to_handle[bert_model_name]
tfhub_handle_preprocess = map_model_to_preprocess[bert_model_name]

print('BERT model selected           :', tfhub_handle_encoder)
print('Preprocessing model auto-selected:', tfhub_handle_preprocess)

BERT model selected           : https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3
Preprocessing model auto-selected: https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3

Preprocess the text

In the Classify text with BERT colab, the preprocessing model is embedded directly with the BERT encoder.

This tutorial demonstrates how to do preprocessing as part of your input pipeline for training, using Dataset.map, and then merge it into the model that gets exported for inference. That way, both training and inference can work from raw text inputs, although the TPU itself requires numeric inputs.

TPU requirements aside, it can help performance to have preprocessing done asynchronously in an input pipeline (you can learn more in the tf.data performance guide).

This tutorial also demonstrates how to build multi-input models, and how to adjust the sequence length of the inputs to BERT.

Let's demonstrate the preprocessing model.

bert_preprocess = hub.load(tfhub_handle_preprocess)
tok = bert_preprocess.tokenize(tf.constant(['Hello TensorFlow!']))
print(tok)

<tf.RaggedTensor [[[7592], [23435, 12314], [999]]]>

Each preprocessing model also provides a method, .bert_pack_inputs(tensors, seq_length), which takes a list of tokens (like tok above) and a sequence length argument. This packs the inputs to create a dictionary of tensors in the format expected by the BERT model.

text_preprocessed = bert_preprocess.bert_pack_inputs([tok, tok], tf.constant(20))

print('Shape Word Ids : ', text_preprocessed['input_word_ids'].shape)
print('Word Ids       : ', text_preprocessed['input_word_ids'][0, :16])
print('Shape Mask     : ', text_preprocessed['input_mask'].shape)
print('Input Mask     : ', text_preprocessed['input_mask'][0, :16])
print('Shape Type Ids : ', text_preprocessed['input_type_ids'].shape)
print('Type Ids       : ', text_preprocessed['input_type_ids'][0, :16])

Shape Word Ids :  (1, 20)
Word Ids       :  tf.Tensor(
[  101  7592 23435 12314   999   102  7592 23435 12314   999   102     0
     0     0     0     0], shape=(16,), dtype=int32)
Shape Mask     :  (1, 20)
Input Mask     :  tf.Tensor([1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0], shape=(16,), dtype=int32)
Shape Type Ids :  (1, 20)
Type Ids       :  tf.Tensor([0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0], shape=(16,), dtype=int32)

Here are some details to pay attention to:

input_mask: The mask allows the model to cleanly differentiate between the content and the padding. The mask has the same shape as input_word_ids, and contains a 1 anywhere input_word_ids is not padding.
input_type_ids has the same shape as input_mask, but inside the non-padded region, contains a 0 or a 1 indicating which sentence the token is a part of.
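
To make the three tensors concrete, here is a minimal pure-Python sketch of the packing step for a single example. It is not the preprocessing model's implementation: pack_inputs is a hypothetical helper, the start, separator and padding ids (101, 102, 0) are read off the output shown above, and the real packer truncates over-long inputs where this sketch simply raises an error.

```python
# Ids as they appear in the packed output above (assumed for this sketch).
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pack_inputs(segments, seq_length):
    """Pack lists of token ids into BERT-style inputs for one example."""
    word_ids = [CLS_ID]
    type_ids = [0]
    for segment_index, token_ids in enumerate(segments):
        # Each segment is followed by a separator that shares its type id.
        word_ids += list(token_ids) + [SEP_ID]
        type_ids += [segment_index] * (len(token_ids) + 1)
    if len(word_ids) > seq_length:
        # The real packer truncates; this sketch only refuses.
        raise ValueError('segments do not fit in seq_length')
    num_padding = seq_length - len(word_ids)
    return {
        'input_word_ids': word_ids + [PAD_ID] * num_padding,
        'input_mask': [1] * len(word_ids) + [0] * num_padding,
        'input_type_ids': type_ids + [0] * num_padding,
    }


# Flattened word pieces of 'Hello TensorFlow!' from the tokenizer output above.
hello = [7592, 23435, 12314, 999]
packed = pack_inputs([hello, hello], seq_length=20)
print(packed['input_word_ids'][:16])
print(packed['input_mask'][:16])
print(packed['input_type_ids'][:16])
```

The first 16 entries of each list line up with the Word Ids, Input Mask and Type Ids printed by the cell above.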

Next, you will create a preprocessing model that encapsulates all this logic. Your model will take strings as input, and return appropriately formatted objects which can be passed to BERT.

Each BERT model has a specific preprocessing model. Make sure to use the proper one, as described in the BERT model's documentation.

def make_bert_preprocess_model(sentence_features, seq_length=128):
  """Returns Model mapping string features to BERT inputs.

  Args:
    sentence_features: a list with the names of string-valued features.
    seq_length: an integer that defines the sequence length of BERT inputs.

  Returns:
    A Keras Model that can be called on a list or dict of string Tensors
    (with the order or names, resp., given by sentence_features) and
    returns a dict of tensors for input to BERT.
  """

  input_segments = [
      tf.keras.layers.Input(shape=(), dtype=tf.string, name=ft)
      for ft in sentence_features]

  # Tokenize the text to word pieces.
  bert_preprocess = hub.load(tfhub_handle_preprocess)
  tokenizer = hub.KerasLayer(bert_preprocess.tokenize, name='tokenizer')
  segments = [tokenizer(s) for s in input_segments]

  # Optional: Trim segments in a smart way to fit seq_length.
  # Simple cases (like this example) can skip this step and let
  # the next step apply a default truncation to approximately equal lengths.
  truncated_segments = segments

  # Pack inputs. The details (start/end token ids, dict of output tensors)
  # are model-dependent, so this gets loaded from the SavedModel.
  packer = hub.KerasLayer(bert_preprocess.bert_pack_inputs,
                          arguments=dict(seq_length=seq_length),
                          name='packer')
  model_inputs = packer(truncated_segments)
  return tf.keras.Model(input_segments, model_inputs)

Let's demonstrate the preprocessing model. You will create a test with two sentence inputs (input1 and input2). The output is what a BERT model would expect as input: input_word_ids, input_mask and input_type_ids.

test_preprocess_model = make_bert_preprocess_model(['my_input1', 'my_input2'])
test_text = [np.array(['some random test sentence']),
             np.array(['another sentence'])]
text_preprocessed = test_preprocess_model(test_text)

print('Keys           : ', list(text_preprocessed.keys()))
print('Shape Word Ids : ', text_preprocessed['input_word_ids'].shape)
print('Word Ids       : ', text_preprocessed['input_word_ids'][0, :16])
print('Shape Mask     : ', text_preprocessed['input_mask'].shape)
print('Input Mask     : ', text_preprocessed['input_mask'][0, :16])
print('Shape Type Ids : ', text_preprocessed['input_type_ids'].shape)
print('Type Ids       : ', text_preprocessed['input_type_ids'][0, :16])

Keys           :  ['input_word_ids', 'input_mask', 'input_type_ids']
Shape Word Ids :  (1, 128)
Word Ids       :  tf.Tensor(
[ 101 2070 6721 3231 6251  102 2178 6251  102    0    0    0    0    0
    0    0], shape=(16,), dtype=int32)
Shape Mask     :  (1, 128)
Input Mask     :  tf.Tensor([1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0], shape=(16,), dtype=int32)
Shape Type Ids :  (1, 128)
Type Ids       :  tf.Tensor([0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0], shape=(16,), dtype=int32)

Let's take a look at the model's structure, paying attention to the two inputs you just defined.

tf.keras.utils.plot_model(test_preprocess_model, show_shapes=True, show_dtype=True)

('You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) ', 'for plot_model/model_to_dot to work.')

To apply the preprocessing to all the inputs from the dataset, you will use the map function of the dataset. The result is then cached for performance.

AUTOTUNE = tf.data.AUTOTUNE


def load_dataset_from_tfds(in_memory_ds, info, split, batch_size,
                           bert_preprocess_model):
  is_training = split.startswith('train')
  dataset = tf.data.Dataset.from_tensor_slices(in_memory_ds[split])
  num_examples = info.splits[split].num_examples

  if is_training:
    dataset = dataset.shuffle(num_examples)
    dataset = dataset.repeat()
  dataset = dataset.batch(batch_size)
  dataset = dataset.map(lambda ex: (bert_preprocess_model(ex), ex['label']))
  dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
  return dataset, num_examples

Define your model

You are now ready to define your model for sentence or sentence pair classification by feeding the preprocessed inputs through the BERT encoder and putting a linear classifier on top (or another arrangement of layers, as you prefer), using dropout for regularization.

def build_classifier_model(num_classes):

  class Classifier(tf.keras.Model):
    def __init__(self, num_classes):
      super(Classifier, self).__init__(name="prediction")
      self.encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True)
      self.dropout = tf.keras.layers.Dropout(0.1)
      self.dense = tf.keras.layers.Dense(num_classes)

    def call(self, preprocessed_text):
      encoder_outputs = self.encoder(preprocessed_text)
      pooled_output = encoder_outputs["pooled_output"]
      x = self.dropout(pooled_output)
      x = self.dense(x)
      return x

  model = Classifier(num_classes)
  return model

Let's try running the model on some preprocessed inputs.

test_classifier_model = build_classifier_model(2)
bert_raw_result = test_classifier_model(text_preprocessed)
print(tf.sigmoid(bert_raw_result))

tf.Tensor([[0.29329836 0.44367802]], shape=(1, 2), dtype=float32)
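
The cell above applies tf.sigmoid to each logit independently, so the two printed values need not sum to 1. Because the loss configured later in this tutorial is SparseCategoricalCrossentropy(from_logits=True), a softmax over the logits is the normalization that matches it if you want class probabilities. The sketch below shows the difference in plain Python; the logits are hypothetical and not taken from the run above.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Squash a single logit into (0, 1), independently of the others."""
    return 1.0 / (1.0 + math.exp(-x))


logits = [-0.9, -0.2]  # hypothetical logits for a 2-class head
print([round(sigmoid(x), 3) for x in logits])   # independent scores
print([round(p, 3) for p in softmax(logits)])   # probabilities over classes
```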

Choose a task from GLUE

You are going to use a TensorFlow DataSet from the GLUE benchmark suite.

Colab lets you download these small datasets to the local filesystem, and the code below reads them entirely into memory, because the separate TPU worker host cannot access the local filesystem of the colab runtime.

For bigger datasets, you will need to create your own Google Cloud Storage bucket and have the TPU worker read the data from there. You can learn more in the TPU guide.

It is recommended to start with the CoLa dataset (for single sentence) or MRPC (for multi sentence), since these are small and do not take long to fine-tune.

tfds_name = 'glue/cola' 

tfds_info = tfds.builder(tfds_name).info

sentence_features = list(tfds_info.features.keys())
sentence_features.remove('idx')
sentence_features.remove('label')

available_splits = list(tfds_info.splits.keys())
train_split = 'train'
validation_split = 'validation'
test_split = 'test'
if tfds_name == 'glue/mnli':
  validation_split = 'validation_matched'
  test_split = 'test_matched'

num_classes = tfds_info.features['label'].num_classes
num_examples = tfds_info.splits.total_num_examples

print(f'Using {tfds_name} from TFDS')
print(f'This dataset has {num_examples} examples')
print(f'Number of classes: {num_classes}')
print(f'Features {sentence_features}')
print(f'Splits {available_splits}')

with tf.device('/job:localhost'):
  # batch_size=-1 is a way to load the dataset into memory
  in_memory_ds = tfds.load(tfds_name, batch_size=-1, shuffle_files=True)

# The code below is just to show some samples from the selected dataset
print(f'Here are some sample rows from {tfds_name} dataset')
sample_dataset = tf.data.Dataset.from_tensor_slices(in_memory_ds[train_split])

labels_names = tfds_info.features['label'].names
print(labels_names)
print()

sample_i = 1
for sample_row in sample_dataset.take(5):
  samples = [sample_row[feature] for feature in sentence_features]
  print(f'sample row {sample_i}')
  for sample in samples:
    print(sample.numpy())
  sample_label = sample_row['label']

  print(f'label: {sample_label} ({labels_names[sample_label]})')
  print()
  sample_i += 1

Using glue/cola from TFDS
This dataset has 10657 examples
Number of classes: 2
Features ['sentence']
Splits ['train', 'validation', 'test']
Here are some sample rows from glue/cola dataset
['unacceptable', 'acceptable']

sample row 1
b'It is this hat that it is certain that he was wearing.'
label: 1 (acceptable)

sample row 2
b'Her efficient looking up of the answer pleased the boss.'
label: 1 (acceptable)

sample row 3
b'Both the workers will wear carnations.'
label: 1 (acceptable)

sample row 4
b'John enjoyed drawing trees for his syntax homework.'
label: 1 (acceptable)

sample row 5
b'We consider Leslie rather foolish, and Lou a complete idiot.'
label: 1 (acceptable)

The dataset also determines the problem type (classification or regression) and the appropriate loss function for training.

def get_configuration(glue_task):

  loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

  if glue_task == 'glue/cola':
    metrics = tfa.metrics.MatthewsCorrelationCoefficient(num_classes=2)
  else:
    metrics = tf.keras.metrics.SparseCategoricalAccuracy(
        'accuracy', dtype=tf.float32)

  return metrics, loss
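
For CoLA the configuration above switches from accuracy to the Matthews correlation coefficient. As a rough illustration of what that metric measures for binary labels, here is a pure-Python sketch. It is not the tensorflow_addons implementation; matthews_corrcoef is a hypothetical helper, and returning 0 when the denominator vanishes is a convention chosen for this sketch.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention used in this sketch: report 0 when undefined.
        return 0.0
    return (tp * tn - fp * fn) / denominator


labels = [1, 1, 0, 0, 1, 0]
print(matthews_corrcoef(labels, labels))             # perfect agreement
print(matthews_corrcoef(labels, [1] * len(labels)))  # always predicts class 1
```

Unlike accuracy, the coefficient does not reward a classifier that always predicts the same class.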

Train your model

Finally, you can train the model end-to-end on the dataset you chose.

Distribution

Recall the set-up code at the top, which connected the colab runtime to a TPU worker with multiple TPU devices. To distribute training onto them, you will create and compile your main Keras model within the scope of the TPU distribution strategy. (For details, see Distributed training with Keras.)

Preprocessing, on the other hand, runs on the CPU of the worker host, not the TPUs, so the Keras model for preprocessing as well as the training and validation datasets mapped with it are built outside the distribution strategy scope. The call to Model.fit() will take care of distributing the passed-in dataset to the model replicas.
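
As a rough illustration of what distributing a batch means: with a synchronous strategy, the batch size given to the dataset is the global batch size, and each replica processes a share of it per step. The sketch below assumes the 8 TPU cores mentioned in the setup section and the batch size of 32 used in this tutorial. per_replica_batch_size is a hypothetical helper, and requiring an even split is a simplification of this sketch.

```python
def per_replica_batch_size(global_batch_size, num_replicas):
    """Share of a global batch that one replica processes per step."""
    if global_batch_size % num_replicas != 0:
        # Simplification for this sketch: only handle even splits.
        raise ValueError('global batch size should divide evenly across replicas')
    return global_batch_size // num_replicas


print(per_replica_batch_size(32, 8))
```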

Optimizer

Fine-tuning follows the optimizer set-up from BERT pre-training (as in Classify text with BERT): it uses the AdamW optimizer with a linear decay of a notional initial learning rate, prefixed with a linear warm-up phase over the first 10% of training steps (num_warmup_steps). In line with the BERT paper, the initial learning rate is smaller for fine-tuning (best of 5e-5, 3e-5, 2e-5).
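
The shape of that schedule can be sketched in a few lines of plain Python. This is a simplified illustration, not the code behind optimization.create_optimizer: learning_rate_at is a hypothetical helper, the library may compute the decay portion slightly differently, and weight decay is not modeled. The step counts use the 267 steps per epoch reported in this tutorial's CoLA training log.

```python
def learning_rate_at(step, init_lr, num_train_steps, num_warmup_steps):
    """Linear warm-up to init_lr, then linear decay to zero."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    remaining = max(num_train_steps - step, 0)
    return init_lr * remaining / (num_train_steps - num_warmup_steps)


steps_per_epoch = 267  # as reported in the CoLA training log
epochs = 3
num_train_steps = steps_per_epoch * epochs  # 801
num_warmup_steps = num_train_steps // 10    # 80

# Start, mid warm-up, peak, mid decay, end of training.
for step in (0, 40, 80, 440, 801):
    lr = learning_rate_at(step, 2e-5, num_train_steps, num_warmup_steps)
    print(step, lr)
```

The rate climbs from zero to init_lr over the first 80 steps and then falls back to zero by step 801.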

epochs = 3
batch_size = 32
init_lr = 2e-5

print(f'Fine tuning {tfhub_handle_encoder} model')
bert_preprocess_model = make_bert_preprocess_model(sentence_features)

with strategy.scope():

  # metric have to be created inside the strategy scope
  metrics, loss = get_configuration(tfds_name)

  train_dataset, train_data_size = load_dataset_from_tfds(
      in_memory_ds, tfds_info, train_split, batch_size, bert_preprocess_model)
  steps_per_epoch = train_data_size // batch_size
  num_train_steps = steps_per_epoch * epochs
  num_warmup_steps = num_train_steps // 10

  validation_dataset, validation_data_size = load_dataset_from_tfds(
      in_memory_ds, tfds_info, validation_split, batch_size,
      bert_preprocess_model)
  validation_steps = validation_data_size // batch_size

  classifier_model = build_classifier_model(num_classes)

  optimizer = optimization.create_optimizer(
      init_lr=init_lr,
      num_train_steps=num_train_steps,
      num_warmup_steps=num_warmup_steps,
      optimizer_type='adamw')

  classifier_model.compile(optimizer=optimizer, loss=loss, metrics=[metrics])

  classifier_model.fit(
      x=train_dataset,
      validation_data=validation_dataset,
      steps_per_epoch=steps_per_epoch,
      epochs=epochs,
      validation_steps=validation_steps)

Fine tuning https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3 model
/tmpfs/src/tf_docs_env/lib/python3.6/site-packages/keras/engine/functional.py:585: UserWarning: Input dict contained keys ['idx', 'label'] which did not match any model input. They will be ignored by the model.
  [n for n in tensors.keys() if n not in ref_input_names])
Epoch 1/3
/tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/framework/indexed_slices.py:449: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("AdamWeightDecay/gradients/StatefulPartitionedCall:1", shape=(None,), dtype=int32), values=Tensor("clip_by_global_norm/clip_by_global_norm/_0:0", dtype=float32), dense_shape=Tensor("AdamWeightDecay/gradients/StatefulPartitionedCall:2", shape=(None,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "shape. This may consume a large amount of memory." % value)
267/267 [==============================] - 86s 81ms/step - loss: 0.6092 - MatthewsCorrelationCoefficient: 0.0000e+00 - val_loss: 0.4846 - val_MatthewsCorrelationCoefficient: 0.0000e+00
Epoch 2/3
267/267 [==============================] - 14s 53ms/step - loss: 0.3774 - MatthewsCorrelationCoefficient: 0.0000e+00 - val_loss: 0.5322 - val_MatthewsCorrelationCoefficient: 0.0000e+00
Epoch 3/3
267/267 [==============================] - 14s 53ms/step - loss: 0.2623 - MatthewsCorrelationCoefficient: 0.0000e+00 - val_loss: 0.6469 - val_MatthewsCorrelationCoefficient: 0.0000e+00

Export for inference

You will create a final model that has the preprocessing part and the fine-tuned BERT you have just created.

At inference time, preprocessing needs to be part of the model (because there is no longer a separate input queue, as there is for training data, that does it). Preprocessing is not just computation; it has its own resources (the vocab table) that must be attached to the Keras Model that is saved for export. This final assembly is what will be saved.

You are going to save the model on colab, and later you can download it to keep it for the future (View -> Table of contents -> Files).

main_save_path = './my_models'
bert_type = tfhub_handle_encoder.split('/')[-2]
saved_model_name = f'{tfds_name.replace("/", "_")}_{bert_type}'

saved_model_path = os.path.join(main_save_path, saved_model_name)

preprocess_inputs = bert_preprocess_model.inputs
bert_encoder_inputs = bert_preprocess_model(preprocess_inputs)
bert_outputs = classifier_model(bert_encoder_inputs)
model_for_export = tf.keras.Model(preprocess_inputs, bert_outputs)

print('Saving', saved_model_path)

# Save everything on the Colab host (even the variables from TPU memory)
save_options = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')
model_for_export.save(saved_model_path, include_optimizer=False,
                      options=save_options)

Saving ./my_models/glue_cola_bert_en_uncased_L-12_H-768_A-12
WARNING:absl:Found untraced functions such as restored_function_body, restored_function_body, restored_function_body, restored_function_body, restored_function_body while saving (showing 5 of 910). These functions will not be directly callable after loading.
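
To make the naming scheme concrete, the sketch below reproduces how the save path is derived from the TF Hub handle and the dataset name. The handle shown is an assumed example that is consistent with the path printed above; in the notebook, `tfhub_handle_encoder` and `tfds_name` are set by earlier cells.

```python
import os

# Assumed values: illustrative stand-ins for the notebook's
# `tfhub_handle_encoder` and `tfds_name` variables.
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'
tfds_name = 'glue/cola'

# The second-to-last URL segment is the model name; the last is the version.
bert_type = tfhub_handle_encoder.split('/')[-2]
saved_model_name = f'{tfds_name.replace("/", "_")}_{bert_type}'
saved_model_path = os.path.join('./my_models', saved_model_name)
print(saved_model_path)
```

Note that this indexing relies on the handle ending in a version segment with no trailing slash. A handle that ends in `/` would make `split('/')[-2]` return the version number instead of the model name.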

Test the model

The final step is to test the results of your exported model.

To make some comparisons, reload the model and test it using a few inputs from the test split of the dataset.

with tf.device('/job:localhost'):
  reloaded_model = tf.saved_model.load(saved_model_path)

Utility methods

def prepare(record):
  # Wrap each feature in a list to add a batch dimension of size 1.
  model_inputs = [[record[ft]] for ft in sentence_features]
  return model_inputs


def prepare_serving(record):
  # Map each feature name to its value, to pass as keyword arguments
  # to the serving signature.
  model_inputs = {ft: record[ft] for ft in sentence_features}
  return model_inputs


def print_bert_results(test, bert_result, dataset_name):

  bert_result_class = tf.argmax(bert_result, axis=1)[0]

  if dataset_name == 'glue/cola':
    print('sentence:', test[0].numpy())
    if bert_result_class == 1:
      print('This sentence is acceptable')
    else:
      print('This sentence is unacceptable')

  elif dataset_name == 'glue/sst2':
    print('sentence:', test[0])
    if bert_result_class == 1:
      print('This sentence has POSITIVE sentiment')
    else:
      print('This sentence has NEGATIVE sentiment')

  elif dataset_name == 'glue/mrpc':
    print('sentence1:', test[0])
    print('sentence2:', test[1])
    if bert_result_class == 1:
      print('Are a paraphrase')
    else:
      print('Are NOT a paraphrase')

  elif dataset_name == 'glue/qqp':
    print('question1:', test[0])
    print('question2:', test[1])
    if bert_result_class == 1:
      print('Questions are similar')
    else:
      print('Questions are NOT similar')

  elif dataset_name == 'glue/mnli':
    print('premise   :', test[0])
    print('hypothesis:', test[1])
    if bert_result_class == 1:
      print('This premise is NEUTRAL to the hypothesis')
    elif bert_result_class == 2:
      print('This premise CONTRADICTS the hypothesis')
    else:
      print('This premise ENTAILS the hypothesis')

  elif dataset_name == 'glue/qnli':
    print('question:', test[0])
    print('sentence:', test[1])
    if bert_result_class == 1:
      print('The question is NOT answerable by the sentence')
    else:
      print('The question is answerable by the sentence')

  elif dataset_name == 'glue/rte':
    print('sentence1:', test[0])
    print('sentence2:', test[1])
    if bert_result_class == 1:
      print('Sentence1 DOES NOT entail sentence2')
    else:
      print('Sentence1 entails sentence2')

  elif dataset_name == 'glue/wnli':
    print('sentence1:', test[0])
    print('sentence2:', test[1])
    if bert_result_class == 1:
      print('Sentence1 DOES NOT entail sentence2')
    else:
      print('Sentence1 entails sentence2')

  print('BERT raw results:', bert_result[0])
  print()

Test

with tf.device('/job:localhost'):
  test_dataset = tf.data.Dataset.from_tensor_slices(in_memory_ds[test_split])
  for test_row in test_dataset.shuffle(1000).map(prepare).take(5):
    if len(sentence_features) == 1:
      result = reloaded_model(test_row[0])
    else:
      result = reloaded_model(list(test_row))

    print_bert_results(test_row, result, tfds_name)

sentence: [b'An old woman languished in the forest.']
This sentence is acceptable
BERT raw results: tf.Tensor([-1.7032353  3.3714833], shape=(2,), dtype=float32)

sentence: [b"I went to the movies and didn't pick up the shirts."]
This sentence is acceptable
BERT raw results: tf.Tensor([-0.73970896  1.0806316 ], shape=(2,), dtype=float32)

sentence: [b"Every essay that she's written and which I've read is on that pile."]
This sentence is acceptable
BERT raw results: tf.Tensor([-0.7034159  0.6236454], shape=(2,), dtype=float32)

sentence: [b'Either Bill ate the peaches, or Harry.']
This sentence is unacceptable
BERT raw results: tf.Tensor([ 0.05972151 -0.08620442], shape=(2,), dtype=float32)

sentence: [b'I ran into the baker from whom I bought these bagels.']
This sentence is acceptable
BERT raw results: tf.Tensor([-1.6862067  3.285925 ], shape=(2,), dtype=float32)
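
The "BERT raw results" printed above appear to be unnormalized scores (logits) rather than probabilities, since they can be negative and do not sum to 1. Assuming that is the case, a softmax turns them into class probabilities. The `softmax` helper below is a hypothetical pure-Python sketch for illustration; on tensors, `tf.nn.softmax` performs the same computation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats (hypothetical helper)."""
    shift = max(logits)  # Subtract the max so exp() cannot overflow.
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the first example in the output above.
probs = softmax([-1.7032353, 3.3714833])
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

Read this way, the fourth example above (logits of about 0.06 and -0.09) is close to an even split, at roughly 0.54 versus 0.46, so its "unacceptable" label would be a low-confidence prediction.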

If you want to use your model on TF Serving, remember that it will call your SavedModel through one of its named signatures. Note that there are some small differences in the input. In Python, you can test them as follows:

with tf.device('/job:localhost'):
  serving_model = reloaded_model.signatures['serving_default']
  for test_row in test_dataset.shuffle(1000).map(prepare_serving).take(5):
    result = serving_model(**test_row)
    # The 'prediction' key is the classifier's defined model name.
    print_bert_results(list(test_row.values()), result['prediction'], tfds_name)

sentence: b'Everyone attended more than two seminars.'
This sentence is acceptable
BERT raw results: tf.Tensor([-1.5594155  2.862155 ], shape=(2,), dtype=float32)

sentence: b'Most columnists claim that a senior White House official has been briefing them.'
This sentence is acceptable
BERT raw results: tf.Tensor([-1.6298996  3.3155093], shape=(2,), dtype=float32)

sentence: b"That my father, he's lived here all his life is well known to those cops."
This sentence is acceptable
BERT raw results: tf.Tensor([-1.2048947  1.8589772], shape=(2,), dtype=float32)

sentence: b'Ourselves like us.'
This sentence is acceptable
BERT raw results: tf.Tensor([-1.2723312  2.0494034], shape=(2,), dtype=float32)

sentence: b'John is clever.'
This sentence is acceptable
BERT raw results: tf.Tensor([-1.6516167  3.3147635], shape=(2,), dtype=float32)
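
To call the exported model through TF Serving's REST API, a client sends a JSON body to the model's `:predict` endpoint. The sketch below only builds such a request and does not send it. The model name `bert_cola`, the host, and port 8501 are hypothetical placeholders, and the input key `sentence` assumes the `glue/cola` task, whose only feature is named `sentence`.

```python
import json

def build_predict_request(sentences, model_name='bert_cola',
                          host='localhost', port=8501):
    """Returns (url, body) for a TF Serving REST predict call.

    All default values are placeholders, not values from the notebook.
    """
    url = f'http://{host}:{port}/v1/models/{model_name}:predict'
    # Row format: one JSON object per example, keyed by signature input name.
    payload = {
        'signature_name': 'serving_default',
        'instances': [{'sentence': s} for s in sentences],
    }
    return url, json.dumps(payload)

url, body = build_predict_request(['John is clever.'])
print(url)
print(body)
```

For a sentence-pair task, each instance would presumably carry one key per feature in `sentence_features` instead of the single `sentence` key.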

You did it! Your saved model can be used for serving or for simple in-process inference, with a simpler API, less code, and easier maintenance.

Next steps

Now that you have tried one of the base BERT models, you can try other models to achieve higher accuracy, or try smaller model versions.

You can also try other datasets.