Overview
Text preprocessing is the end-to-end transformation of raw text into a model’s integer inputs. NLP models are often accompanied by several hundred (if not thousands of) lines of Python code for preprocessing text. Text preprocessing is often a challenge for models because:
Training-serving skew. It becomes increasingly difficult to ensure that the preprocessing logic of the model's inputs is consistent at all stages of model development (e.g. pretraining, fine-tuning, evaluation, inference). Using different hyperparameters, tokenization, or string preprocessing algorithms, or simply packaging model inputs inconsistently at different stages, can have hard-to-debug and disastrous effects on the model.
Efficiency and flexibility. While preprocessing can be done offline (e.g. by writing processed outputs to files on disk and then reconsuming the preprocessed data in the input pipeline), this method incurs an additional file read and write cost. Preprocessing offline is also inconvenient if there are preprocessing decisions that need to happen dynamically: experimenting with a different option would require regenerating the dataset.
Complex model interface. Text models are much more understandable when their inputs are pure text. It's hard to understand a model when its inputs require an extra, indirect encoding step. Reducing the preprocessing complexity is especially appreciated for model debugging, serving, and evaluation.
Additionally, simpler model interfaces make it more convenient to try the model (e.g. inference or training) on different, unexplored datasets.
Text preprocessing with TF.Text
Using TF.Text's text preprocessing APIs, we can construct a preprocessing function that transforms a user's text dataset into the model's integer inputs. Users can package preprocessing directly as part of their model to alleviate the above-mentioned problems.
This tutorial will show how to use TF.Text preprocessing ops to transform text data into inputs for the BERT model and into inputs for the language masking pretraining task described in "Masked LM and Masking Procedure" of BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. The process involves tokenizing text into subword units, combining sentences, trimming content to a fixed size, and extracting labels for the masked language modeling task.
Setup
Let's import the packages and libraries we need first.
pip install -q -U "tensorflow-text==2.11.*"
import tensorflow as tf
import tensorflow_text as text
import functools
print("TensorFlow version: ", tf.__version__)
TensorFlow version: 2.11.1
Our data contains two text features, and we can create an example tf.data.Dataset. Our goal is to create a function that we can supply to Dataset.map() for use in training.
examples = {
"text_a": [
"Sponge bob Squarepants is an Avenger",
"Marvel Avengers"
],
"text_b": [
"Barack Obama is the President.",
"President is the highest office"
],
}
dataset = tf.data.Dataset.from_tensor_slices(examples)
next(iter(dataset))
{'text_a': <tf.Tensor: shape=(), dtype=string, numpy=b'Sponge bob Squarepants is an Avenger'>, 'text_b': <tf.Tensor: shape=(), dtype=string, numpy=b'Barack Obama is the President.'>}
Tokenizing
Our first step is to run any string preprocessing and tokenize our dataset. This can be done using the text.BertTokenizer, which is a text.Splitter that can tokenize sentences into subwords or wordpieces for the BERT model, given a vocabulary generated from the Wordpiece algorithm. You can learn more about other subword tokenizers available in TF.Text from here.
The vocabulary can be from a previously generated BERT checkpoint, or you can generate one yourself on your own data. For the purposes of this example, let's create a toy vocabulary:
_VOCAB = [
# Special tokens
b"[UNK]", b"[MASK]", b"[RANDOM]", b"[CLS]", b"[SEP]",
# Suffixes
b"##ack", b"##ama", b"##ger", b"##gers", b"##onge", b"##pants", b"##uare",
b"##vel", b"##ven", b"an", b"A", b"Bar", b"Hates", b"Mar", b"Ob",
b"Patrick", b"President", b"Sp", b"Sq", b"bob", b"box", b"has", b"highest",
b"is", b"office", b"the",
]
_START_TOKEN = _VOCAB.index(b"[CLS]")
_END_TOKEN = _VOCAB.index(b"[SEP]")
_MASK_TOKEN = _VOCAB.index(b"[MASK]")
_RANDOM_TOKEN = _VOCAB.index(b"[RANDOM]")
_UNK_TOKEN = _VOCAB.index(b"[UNK]")
_MAX_SEQ_LEN = 8
_MAX_PREDICTIONS_PER_BATCH = 5
_VOCAB_SIZE = len(_VOCAB)
lookup_table = tf.lookup.StaticVocabularyTable(
tf.lookup.KeyValueTensorInitializer(
keys=_VOCAB,
key_dtype=tf.string,
values=tf.range(
tf.size(_VOCAB, out_type=tf.int64), dtype=tf.int64),
value_dtype=tf.int64
),
num_oov_buckets=1
)
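Before moving on, we can sanity-check the table with a few lookups (a quick sketch, not part of the original pipeline). In-vocabulary strings map to their index, while, with num_oov_buckets=1, any out-of-vocabulary string falls into the single OOV bucket with id len(_VOCAB) = 31:
lookup_table.lookup(tf.constant([b"bob", b"[MASK]", b"hairy"]))  # b"hairy" is out of vocabulary
<tf.Tensor: shape=(3,), dtype=int64, numpy=array([24,  1, 31])>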
Let's construct a text.BertTokenizer using the above vocabulary and tokenize the text inputs into a RaggedTensor.
bert_tokenizer = text.BertTokenizer(lookup_table, token_out_type=tf.string)
bert_tokenizer.tokenize(examples["text_a"])
<tf.RaggedTensor [[[b'Sp', b'##onge'], [b'bob'], [b'Sq', b'##uare', b'##pants'], [b'is'], [b'an'], [b'A', b'##ven', b'##ger']] , [[b'Mar', b'##vel'], [b'A', b'##ven', b'##gers']]]>
bert_tokenizer.tokenize(examples["text_b"])
<tf.RaggedTensor [[[b'Bar', b'##ack'], [b'Ob', b'##ama'], [b'is'], [b'the'], [b'President'], [b'[UNK]']] , [[b'President'], [b'is'], [b'the'], [b'highest'], [b'office']]]>
The text output from text.BertTokenizer lets us see how the text is being tokenized, but the model requires integer IDs. We can set the token_out_type param to tf.int64 to obtain integer IDs (which are the indices into the vocabulary).
bert_tokenizer = text.BertTokenizer(lookup_table, token_out_type=tf.int64)
segment_a = bert_tokenizer.tokenize(examples["text_a"])
segment_a
<tf.RaggedTensor [[[22, 9], [24], [23, 11, 10], [28], [14], [15, 13, 7]], [[18, 12], [15, 13, 8]]]>
segment_b = bert_tokenizer.tokenize(examples["text_b"])
segment_b
<tf.RaggedTensor [[[16, 5], [19, 6], [28], [30], [21], [0]], [[21], [28], [30], [27], [29]]]>
text.BertTokenizer returns a RaggedTensor with shape [batch, num_tokens, num_wordpieces]. Because we don't need the extra num_tokens dimension for our current use case, we can merge the last two dimensions to obtain a RaggedTensor with shape [batch, num_wordpieces]:
segment_a = segment_a.merge_dims(-2, -1)
segment_a
<tf.RaggedTensor [[22, 9, 24, 23, 11, 10, 28, 14, 15, 13, 7], [18, 12, 15, 13, 8]]>
segment_b = segment_b.merge_dims(-2, -1)
segment_b
<tf.RaggedTensor [[16, 5, 19, 6, 28, 30, 21, 0], [21, 28, 30, 27, 29]]>
Content Trimming
The main input to BERT is a concatenation of two sentences. However, BERT requires inputs to have a fixed size and shape, and we may have content that exceeds our budget.
We can tackle this by using a text.Trimmer to trim our content down to a predetermined size (once concatenated along the last axis). There are different text.Trimmer types which select content to preserve using different algorithms. text.RoundRobinTrimmer, for example, will allocate quota equally for each segment but may trim the ends of sentences, whereas text.WaterfallTrimmer will trim starting from the end of the last sentence.
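To see the difference, here's a quick sketch (not a verbatim run from this tutorial) of text.WaterfallTrimmer applied to the same segments; because its budget is consumed left to right, earlier segments are kept in full before later segments receive any quota:
waterfall_trimmer = text.WaterfallTrimmer(max_seq_length=_MAX_SEQ_LEN)
waterfall_trimmer.trim([segment_a, segment_b])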
For our example, we will use RoundRobinTrimmer, which selects items from each segment in a left-to-right manner.
trimmer = text.RoundRobinTrimmer(max_seq_length=_MAX_SEQ_LEN)
trimmed = trimmer.trim([segment_a, segment_b])
trimmed
[<tf.RaggedTensor [[22, 9, 24, 23], [18, 12, 15, 13]]>, <tf.RaggedTensor [[16, 5, 19, 6], [21, 28, 30, 27]]>]
trimmed now contains the segments, where each batch row has a total of 8 elements (when concatenated along axis=-1).
Combining segments
Now that we have the segments trimmed, we can combine them together to get a single RaggedTensor. BERT uses special tokens to indicate the beginning ([CLS]) and end of a segment ([SEP]). We also need a RaggedTensor indicating which items in the combined Tensor belong to which segment. We can use text.combine_segments() to get both of these Tensors with the special tokens inserted.
segments_combined, segments_ids = text.combine_segments(
trimmed,
start_of_sequence_id=_START_TOKEN, end_of_segment_id=_END_TOKEN)
segments_combined, segments_ids
(<tf.RaggedTensor [[3, 22, 9, 24, 23, 4, 16, 5, 19, 6, 4], [3, 18, 12, 15, 13, 4, 21, 28, 30, 27, 4]]>, <tf.RaggedTensor [[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]>)
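To visualize where the special tokens landed, we can decode the combined IDs back into wordpieces, using the same tf.gather trick that appears later in this tutorial:
tf.gather(_VOCAB, segments_combined)
<tf.RaggedTensor [[b'[CLS]', b'Sp', b'##onge', b'bob', b'Sq', b'[SEP]', b'Bar', b'##ack', b'Ob', b'##ama', b'[SEP]'], [b'[CLS]', b'Mar', b'##vel', b'A', b'##ven', b'[SEP]', b'President', b'is', b'the', b'highest', b'[SEP]']]>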
Masked Language Model Task
Now that we have our basic inputs, we can begin to extract the inputs needed for the "Masked LM and Masking Procedure" task described in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
The masked language model task has two sub-problems for us to think about: (1) which items to select for masking and (2) what values to assign them.
Item Selection
Because we want to select items randomly for masking, we will use a text.RandomItemSelector. RandomItemSelector randomly selects items in a batch subject to the given restrictions (max_selections_per_batch, selection_rate and unselectable_ids) and returns a boolean mask indicating which items were selected.
random_selector = text.RandomItemSelector(
max_selections_per_batch=_MAX_PREDICTIONS_PER_BATCH,
selection_rate=0.2,
unselectable_ids=[_START_TOKEN, _END_TOKEN, _UNK_TOKEN]
)
selected = random_selector.get_selection_mask(
segments_combined, axis=1)
selected
<tf.RaggedTensor [[False, False, False, False, True, False, True, False, False, False, False], [False, False, True, False, False, False, False, False, True, False, False]]>
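To see which token IDs those positions correspond to, we can apply the boolean mask (a quick sketch; since selection is random, the values will vary between runs):
tf.ragged.boolean_mask(segments_combined, selected)
# With the mask above, this yields <tf.RaggedTensor [[23, 16], [12, 30]]>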
Choosing the Masked Value
The methodology described in the original BERT paper for choosing the value for masking is as follows:
For mask_token_rate of the time, replace the item with the [MASK] token:
"my dog is hairy" -> "my dog is [MASK]"
For random_token_rate of the time, replace the item with a random word:
"my dog is hairy" -> "my dog is apple"
For 1 - mask_token_rate - random_token_rate of the time, keep the item unchanged:
"my dog is hairy" -> "my dog is hairy"
text.MaskValuesChooser encapsulates this logic and can be used for our preprocessing function. Here's an example of what MaskValuesChooser returns given a mask_token_rate of 80% and the default random_token_rate:
mask_values_chooser = text.MaskValuesChooser(_VOCAB_SIZE, _MASK_TOKEN, 0.8)
mask_values_chooser.get_mask_values(segments_combined)
<tf.RaggedTensor [[1, 22, 1, 20, 1, 27, 1, 1, 1, 6, 1], [3, 18, 1, 1, 13, 4, 1, 1, 1, 1, 1]]>
When supplied with a RaggedTensor input, text.MaskValuesChooser returns a RaggedTensor of the same shape with either _MASK_TOKEN (1 in our vocabulary), a random ID, or the same unchanged ID.
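The rates can also be spelled out by keyword, which makes the three cases above explicit. A quick sketch using the documented defaults:
mask_values_chooser = text.MaskValuesChooser(
    vocab_size=_VOCAB_SIZE,
    mask_token=_MASK_TOKEN,
    mask_token_rate=0.8,    # replace with [MASK] 80% of the time
    random_token_rate=0.1)  # replace with a random id 10% of the time; the remaining 10% stay unchanged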
Generating Inputs for Masked Language Model Task
Now that we have a RandomItemSelector to help us select items for masking and a text.MaskValuesChooser to assign the values, we can use text.mask_language_model() to assemble all the inputs of this task for our BERT model.
masked_token_ids, masked_pos, masked_lm_ids = text.mask_language_model(
segments_combined,
item_selector=random_selector, mask_values_chooser=mask_values_chooser)
Let's dive deeper and examine the outputs of mask_language_model(). The output of masked_token_ids is:
masked_token_ids
<tf.RaggedTensor [[3, 22, 9, 24, 1, 4, 16, 5, 1, 6, 4], [3, 18, 1, 15, 13, 4, 21, 28, 30, 1, 4]]>
Remember that our input is encoded using a vocabulary. If we decode masked_token_ids using our vocabulary, we get:
tf.gather(_VOCAB, masked_token_ids)
<tf.RaggedTensor [[b'[CLS]', b'Sp', b'##onge', b'bob', b'[MASK]', b'[SEP]', b'Bar', b'##ack', b'[MASK]', b'##ama', b'[SEP]'], [b'[CLS]', b'Mar', b'[MASK]', b'A', b'##ven', b'[SEP]', b'President', b'is', b'the', b'[MASK]', b'[SEP]']]>
Notice that some wordpiece tokens have been replaced with either the [MASK] token or a random ID, while others are left unchanged. The masked_pos output gives us the indices (in the respective batch) of the tokens that were selected for prediction.
masked_pos
<tf.RaggedTensor [[4, 8], [2, 9]]>
masked_lm_ids gives us the original values of the masked tokens.
masked_lm_ids
<tf.RaggedTensor [[23, 19], [12, 27]]>
We can again decode the IDs here to get human-readable values.
tf.gather(_VOCAB, masked_lm_ids)
<tf.RaggedTensor [[b'Sq', b'Ob'], [b'##vel', b'highest']]>
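As a consistency check (a quick sketch), gathering the original ids from segments_combined at the masked_pos indices should reproduce masked_lm_ids:
tf.gather(segments_combined, masked_pos, batch_dims=1)
# Expected to match masked_lm_ids: <tf.RaggedTensor [[23, 19], [12, 27]]>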
Padding Model Inputs
Now that we have all the inputs for our model, the last step in our preprocessing is to package them into fixed 2-dimensional Tensors with padding, and also to generate a mask Tensor indicating which values are pad values. We can use text.pad_model_inputs() to help us with this task.
# Prepare and pad combined segment inputs
input_word_ids, input_mask = text.pad_model_inputs(
masked_token_ids, max_seq_length=_MAX_SEQ_LEN)
input_type_ids, _ = text.pad_model_inputs(
segments_ids, max_seq_length=_MAX_SEQ_LEN)
# Prepare and pad masking task inputs
masked_lm_positions, masked_lm_weights = text.pad_model_inputs(
masked_pos, max_seq_length=_MAX_PREDICTIONS_PER_BATCH)
masked_lm_ids, _ = text.pad_model_inputs(
masked_lm_ids, max_seq_length=_MAX_PREDICTIONS_PER_BATCH)
model_inputs = {
"input_word_ids": input_word_ids,
"input_mask": input_mask,
"input_type_ids": input_type_ids,
"masked_lm_ids": masked_lm_ids,
"masked_lm_positions": masked_lm_positions,
"masked_lm_weights": masked_lm_weights,
}
model_inputs
{'input_word_ids': <tf.Tensor: shape=(2, 8), dtype=int64, numpy= array([[ 3, 22, 9, 24, 1, 4, 16, 5], [ 3, 18, 1, 15, 13, 4, 21, 28]])>, 'input_mask': <tf.Tensor: shape=(2, 8), dtype=int64, numpy= array([[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]])>, 'input_type_ids': <tf.Tensor: shape=(2, 8), dtype=int64, numpy= array([[0, 0, 0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 0, 0, 1, 1]])>, 'masked_lm_ids': <tf.Tensor: shape=(2, 5), dtype=int64, numpy= array([[23, 19, 0, 0, 0], [12, 27, 0, 0, 0]])>, 'masked_lm_positions': <tf.Tensor: shape=(2, 5), dtype=int64, numpy= array([[4, 8, 0, 0, 0], [2, 9, 0, 0, 0]])>, 'masked_lm_weights': <tf.Tensor: shape=(2, 5), dtype=int64, numpy= array([[1, 1, 0, 0, 0], [1, 1, 0, 0, 0]])>}
Review
Let's review what we have so far and assemble it into our preprocessing function:
def bert_pretrain_preprocess(vocab_table, features):
  # Input is a string Tensor of documents, shape [batch].
  text_a = features["text_a"]
  text_b = features["text_b"]

  # Tokenize segments to shape [num_sentences, (num_words)] each.
  tokenizer = text.BertTokenizer(
      vocab_table,
      token_out_type=tf.int64)
  segments = [tokenizer.tokenize(text).merge_dims(1, -1)
              for text in (text_a, text_b)]

  # Truncate inputs to a maximum length.
  trimmer = text.RoundRobinTrimmer(max_seq_length=6)
  trimmed_segments = trimmer.trim(segments)

  # Combine segments, get segment ids and add special tokens.
  segments_combined, segment_ids = text.combine_segments(
      trimmed_segments,
      start_of_sequence_id=_START_TOKEN,
      end_of_segment_id=_END_TOKEN)

  # Apply dynamic masking task.
  masked_input_ids, masked_lm_positions, masked_lm_ids = (
      text.mask_language_model(
          segments_combined,
          random_selector,
          mask_values_chooser,
      )
  )

  # Prepare and pad combined segment inputs.
  input_word_ids, input_mask = text.pad_model_inputs(
      masked_input_ids, max_seq_length=_MAX_SEQ_LEN)
  input_type_ids, _ = text.pad_model_inputs(
      segment_ids, max_seq_length=_MAX_SEQ_LEN)

  # Prepare and pad masking task inputs.
  masked_lm_positions, masked_lm_weights = text.pad_model_inputs(
      masked_lm_positions, max_seq_length=_MAX_PREDICTIONS_PER_BATCH)
  masked_lm_ids, _ = text.pad_model_inputs(
      masked_lm_ids, max_seq_length=_MAX_PREDICTIONS_PER_BATCH)

  model_inputs = {
      "input_word_ids": input_word_ids,
      "input_mask": input_mask,
      "input_type_ids": input_type_ids,
      "masked_lm_ids": masked_lm_ids,
      "masked_lm_positions": masked_lm_positions,
      "masked_lm_weights": masked_lm_weights,
  }
  return model_inputs
We previously constructed a tf.data.Dataset, and we can now use our assembled preprocessing function bert_pretrain_preprocess() in Dataset.map(). This allows us to create an input pipeline that transforms our raw string data into integer inputs which can be fed directly into our model.
dataset = (
tf.data.Dataset.from_tensors(examples)
.map(functools.partial(bert_pretrain_preprocess, lookup_table))
)
next(iter(dataset))
{'input_word_ids': <tf.Tensor: shape=(2, 8), dtype=int64, numpy= array([[ 3, 22, 9, 1, 4, 16, 5, 1], [ 3, 18, 12, 1, 4, 21, 28, 1]])>, 'input_mask': <tf.Tensor: shape=(2, 8), dtype=int64, numpy= array([[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]])>, 'input_type_ids': <tf.Tensor: shape=(2, 8), dtype=int64, numpy= array([[0, 0, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 0, 1, 1, 1]])>, 'masked_lm_ids': <tf.Tensor: shape=(2, 5), dtype=int64, numpy= array([[24, 19, 0, 0, 0], [15, 30, 0, 0, 0]])>, 'masked_lm_positions': <tf.Tensor: shape=(2, 5), dtype=int64, numpy= array([[3, 7, 0, 0, 0], [3, 7, 0, 0, 0]])>, 'masked_lm_weights': <tf.Tensor: shape=(2, 5), dtype=int64, numpy= array([[1, 1, 0, 0, 0], [1, 1, 0, 0, 0]])>}
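Because the preprocessing is expressed entirely in TensorFlow ops, it can also be exported on its own so that exactly the same logic runs at serving time, avoiding the training-serving skew discussed in the overview. Below is a minimal sketch, assuming the globals defined above; the module name and export path are illustrative, not part of any library API:
class BertPreprocessModule(tf.Module):
  """Hypothetical wrapper that makes the preprocessing function saveable."""

  def __init__(self, vocab_table):
    super().__init__()
    self.vocab_table = vocab_table  # keep a reference so the table is tracked

  @tf.function(input_signature=[{
      "text_a": tf.TensorSpec(shape=[None], dtype=tf.string),
      "text_b": tf.TensorSpec(shape=[None], dtype=tf.string),
  }])
  def __call__(self, features):
    return bert_pretrain_preprocess(self.vocab_table, features)

preprocessor = BertPreprocessModule(lookup_table)
tf.saved_model.save(preprocessor, "/tmp/bert_preprocess")  # illustrative path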
Related Tutorials
Classify text with BERT - A tutorial on how to use a pretrained BERT model to classify text. This is a nice follow-up now that you are familiar with how to preprocess the inputs used by the BERT model.
Tokenizing with TF Text - Tutorial detailing the different types of tokenizers that exist in TF.Text.
Handling Text with RaggedTensor - Detailed guide on how to create, use and manipulate RaggedTensors.