Text processing tools for TensorFlow

import tensorflow as tf
import tensorflow_text as tf_text

def preprocess(vocab_table, example_text):

  # Normalize text (the op returns a new tensor, so keep the result)
  example_text = tf_text.normalize_utf8(example_text)

  # Tokenize into words
  word_tokenizer = tf_text.WhitespaceTokenizer()
  tokens = word_tokenizer.tokenize(example_text)

  # Tokenize into subwords, using the vocabulary table passed to this function
  subword_tokenizer = tf_text.WordpieceTokenizer(
       vocab_table, token_out_type=tf.int64)
  subtokens = subword_tokenizer.tokenize(tokens).merge_dims(1, -1)

  # Apply padding; pad_model_inputs returns a (padded ids, mask) pair
  padded_inputs = tf_text.pad_model_inputs(subtokens, max_seq_length=16)
  return padded_inputs
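
To make the subword and padding steps above more concrete, here is a minimal pure-Python sketch of greedy longest-match-first WordPiece splitting followed by fixed-length padding. It is an illustration only, not how tensorflow_text implements these ops. The helper names (`wordpiece_tokenize`, `pad_ids`), the `##` continuation prefix, the `[UNK]` and `[PAD]` tokens, and the toy vocabulary are all assumptions made for the example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


def pad_ids(ids, max_seq_length, pad_id=0):
    """Truncate or right-pad ids to max_seq_length; also return a 0/1 mask."""
    ids = ids[:max_seq_length]
    n_pad = max_seq_length - len(ids)
    mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, mask


# Hypothetical toy vocabulary; a token's list index is its id.
vocab_list = ["[PAD]", "[UNK]", "un", "##believ", "##able", "token", "##s"]
vocab = {tok: i for i, tok in enumerate(vocab_list)}

words = "unbelievable tokens xyz".split()  # whitespace tokenization
pieces = [p for w in words for p in wordpiece_tokenize(w, vocab)]
ids = [vocab[p] for p in pieces]
padded, mask = pad_ids(ids, max_seq_length=8)

print(pieces)  # ['un', '##believ', '##able', 'token', '##s', '[UNK]']
print(padded)  # [2, 3, 4, 5, 6, 1, 0, 0]
print(mask)    # [1, 1, 1, 1, 1, 1, 0, 0]
```

The mask marks which positions hold real tokens, so a model can ignore the padding positions.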

TensorFlow provides a rich collection of ops and libraries for working with text input, such as raw text strings or documents. These libraries can perform the preprocessing regularly required by text-based models, and they include other features useful for sequence modeling.

You can extract powerful syntactic and semantic text features from inside the TensorFlow graph as input to your neural net.

Integrating preprocessing with the TensorFlow graph provides the following benefits:

  • Provides a large toolkit for working with text
  • Allows integration with a large suite of TensorFlow tools to support projects from problem definition through training, evaluation, and launch
  • Reduces complexity at serving time and prevents training-serving skew

In addition, because the same preprocessing ops run in the graph during both training and inference, you do not need to worry about tokenization differing between the two, or about managing separate preprocessing scripts.

Model Architectures
Learn how to perform end-to-end BERT preprocessing on text.
Learn how to generate subword vocabularies from text.
Learn how to classify text with the BERT model.
Classify text using Recurrent Neural Networks.
Use Transformer models to translate text.
Learn how to translate text with sequence-to-sequence models.