
Text classification with an RNN


This text classification tutorial trains a recurrent neural network on the IMDB large movie review dataset for sentiment analysis.

Setup

import numpy as np

import tensorflow_datasets as tfds
import tensorflow as tf

tfds.disable_progress_bar()

Import matplotlib and create a helper function to plot graphs:

import matplotlib.pyplot as plt


def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])

Setup input pipeline

The IMDB large movie review dataset is a binary classification dataset—all the reviews have either a positive or negative sentiment.

Download the dataset using TFDS. See the loading text tutorial for details on how to load this sort of data manually.

dataset, info = tfds.load('imdb_reviews', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

train_dataset.element_spec
(TensorSpec(shape=(), dtype=tf.string, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

Initially this returns a dataset of (text, label) pairs:

for example, label in train_dataset.take(1):
  print('text: ', example.numpy())
  print('label: ', label.numpy())
text:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
label:  0

Next shuffle the data for training and create batches of these (text, label) pairs:

BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
for example, label in train_dataset.take(1):
  print('texts: ', example.numpy()[:3])
  print()
  print('labels: ', label.numpy()[:3])
texts:  [b"I gotta say, Clive Barker's Undying is by far the best horror game to have ever been made. I've played Resident Evil, Silent Hill and the Evil Dead and Castlevania games but none of them have captured the pure glee with which this game tackles its horrific elements. Barker is good at what he does, which is attach the horror to our world, and it shows as his hand is clearly everywhere in this game. Heck, even his voice is in the game as one of the main characters. Full of lush visuals and enough atmosphere to shake a stick at, Undying is the game to beat in my books as the best horror title. I just wish that this had made it to a console system but alas poor PC sales nipped that one in the bud."
 b'This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful performances from Cher and Nicolas Cage (as always) gently row the plot along. There are no rapids to cross, no dangerous waters, just a warm and witty paddle through New York life at its best. A family film in every sense and one that deserves the praise it received.'
 b'In Stand By Me, Vern and Teddy discuss who was tougher, Superman or Mighty Mouse. My friends and I often discuss who would win a fight too. Sometimes we get absurd and compare guys like MacGyver and The Terminator or Rambo and Matrix. But now it seems that we discuss guys like Jackie Chan, Bruce Lee and Jet Li. It is a pointless comparison seeing that Lee is dead, but it is a fun one. And if you go by what we have seen from Jet Li in Lethal 4 and Black Mask, you have to at least say that he would match up well against Chan. In this film he comes across as a martial arts God.<br /><br />Black Mask is about a man that was created along with many other men, to be supreme fighting machines. Their only purpose is to win wars that other people lose. They are invincible in some ways. Now that is the premise for the film, but what that does is sets up all the amazingly choreographed fight scenes.<br /><br />Jet Li is a marvel. He can do things with and to his body that no human being should be able to do. And that is what makes watching him so fun.<br /><br />Besides the martial arts in the film, Black Mask is strong with humour and that is due to the chemistry that Jet has with his co-star, the police officer. They are great together. But to be honest. if anyone is reading this review, they want to know if the film is kick ass in the action department. And the answer to that is a resounding YES!!! Lots and lots of gory mindless action. You will love this film.']

labels:  [1 1 1]

Create the text encoder

The raw text loaded by tfds needs to be processed before it can be used in a model. The simplest way to process text for training is to use the experimental.preprocessing.TextVectorization layer. This layer has many capabilities, but this tutorial sticks to the default behavior.

Create the layer, and pass the dataset's text to the layer's .adapt method:

VOCAB_SIZE = 1000
encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))

The .adapt method sets the layer's vocabulary. Here are the first 20 tokens. After the padding and unknown tokens they're sorted by frequency:

vocab = np.array(encoder.get_vocabulary())
vocab[:20]
array(['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i',
       'this', 'that', 'br', 'was', 'as', 'for', 'with', 'movie', 'but'],
      dtype='<U14')

Once the vocabulary is set, the layer can encode text into indices. The tensors of indices are 0-padded to the longest sequence in the batch (unless you set a fixed output_sequence_length):

encoded_example = encoder(example)[:3].numpy()
encoded_example
array([[ 10,   1, 130, ...,   0,   0,   0],
       [ 11,   7,   2, ...,   0,   0,   0],
       [  8, 847,  33, ...,   0,   0,   0]])
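
If you need fixed-length sequences instead (for example, to feed a non-recurrent model), you can pass output_sequence_length when constructing the layer. Here is a minimal sketch, not used in the rest of this tutorial, that reuses the VOCAB_SIZE and train_dataset defined above:

# Sketch only: a second encoder that pads or truncates every example to 250 tokens.
fixed_length_encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_sequence_length=250)
fixed_length_encoder.adapt(train_dataset.map(lambda text, label: text))
print(fixed_length_encoder(example).shape)  # (batch_size, 250)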

With the default settings, the process is not completely reversible. There are two main reasons for that:

  1. The default value for preprocessing.TextVectorization's standardize argument is "lower_and_strip_punctuation".
  2. The limited vocabulary size and lack of a character-based fallback result in some unknown tokens.
for n in range(3):
  print("Original: ", example[n].numpy())
  print("Round-trip: ", " ".join(vocab[encoded_example[n]]))
  print()
Original:  b"I gotta say, Clive Barker's Undying is by far the best horror game to have ever been made. I've played Resident Evil, Silent Hill and the Evil Dead and Castlevania games but none of them have captured the pure glee with which this game tackles its horrific elements. Barker is good at what he does, which is attach the horror to our world, and it shows as his hand is clearly everywhere in this game. Heck, even his voice is in the game as one of the main characters. Full of lush visuals and enough atmosphere to shake a stick at, Undying is the game to beat in my books as the best horror title. I just wish that this had made it to a console system but alas poor PC sales nipped that one in the bud."
Round-trip:  i [UNK] say [UNK] [UNK] [UNK] is by far the best horror game to have ever been made ive played [UNK] evil [UNK] [UNK] and the evil dead and [UNK] [UNK] but none of them have [UNK] the [UNK] [UNK] with which this game [UNK] its [UNK] elements [UNK] is good at what he does which is [UNK] the horror to our world and it shows as his hand is clearly [UNK] in this game [UNK] even his voice is in the game as one of the main characters full of [UNK] [UNK] and enough atmosphere to [UNK] a [UNK] at [UNK] is the game to [UNK] in my [UNK] as the best horror title i just wish that this had made it to a [UNK] [UNK] but [UNK] poor [UNK] [UNK] [UNK] that one in the [UNK]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   

Original:  b'This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful performances from Cher and Nicolas Cage (as always) gently row the plot along. There are no rapids to cross, no dangerous waters, just a warm and witty paddle through New York life at its best. A family film in every sense and one that deserves the praise it received.'
Round-trip:  this is the kind of film for a [UNK] [UNK] [UNK] when the rest of the world can go [UNK] with its own business as you [UNK] into a big [UNK] and [UNK] for a couple of hours wonderful performances from [UNK] and [UNK] [UNK] as always [UNK] [UNK] the plot along there are no [UNK] to [UNK] no [UNK] [UNK] just a [UNK] and [UNK] [UNK] through new york life at its best a family film in every sense and one that deserves the [UNK] it [UNK]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     

Original:  b'In Stand By Me, Vern and Teddy discuss who was tougher, Superman or Mighty Mouse. My friends and I often discuss who would win a fight too. Sometimes we get absurd and compare guys like MacGyver and The Terminator or Rambo and Matrix. But now it seems that we discuss guys like Jackie Chan, Bruce Lee and Jet Li. It is a pointless comparison seeing that Lee is dead, but it is a fun one. And if you go by what we have seen from Jet Li in Lethal 4 and Black Mask, you have to at least say that he would match up well against Chan. In this film he comes across as a martial arts God.<br /><br />Black Mask is about a man that was created along with many other men, to be supreme fighting machines. Their only purpose is to win wars that other people lose. They are invincible in some ways. Now that is the premise for the film, but what that does is sets up all the amazingly choreographed fight scenes.<br /><br />Jet Li is a marvel. He can do things with and to his body that no human being should be able to do. And that is what makes watching him so fun.<br /><br />Besides the martial arts in the film, Black Mask is strong with humour and that is due to the chemistry that Jet has with his co-star, the police officer. They are great together. But to be honest. if anyone is reading this review, they want to know if the film is kick ass in the action department. And the answer to that is a resounding YES!!! Lots and lots of gory mindless action. You will love this film.'
Round-trip:  in stand by me [UNK] and [UNK] [UNK] who was [UNK] [UNK] or [UNK] [UNK] my friends and i often [UNK] who would [UNK] a fight too sometimes we get [UNK] and [UNK] guys like [UNK] and the [UNK] or [UNK] and [UNK] but now it seems that we [UNK] guys like [UNK] [UNK] [UNK] lee and [UNK] [UNK] it is a [UNK] [UNK] seeing that lee is dead but it is a fun one and if you go by what we have seen from [UNK] [UNK] in [UNK] 4 and black [UNK] you have to at least say that he would [UNK] up well against [UNK] in this film he comes across as a [UNK] [UNK] [UNK] br black [UNK] is about a man that was [UNK] along with many other men to be [UNK] fighting [UNK] their only [UNK] is to [UNK] [UNK] that other people [UNK] they are [UNK] in some ways now that is the premise for the film but what that does is sets up all the [UNK] [UNK] fight [UNK] br [UNK] [UNK] is a [UNK] he can do things with and to his body that no human being should be able to do and that is what makes watching him so [UNK] br [UNK] the [UNK] [UNK] in the film black [UNK] is strong with [UNK] and that is due to the [UNK] that [UNK] has with his [UNK] the police [UNK] they are great together but to be [UNK] if anyone is reading this review they want to know if the film is [UNK] [UNK] in the action [UNK] and the [UNK] to that is a [UNK] yes lots and lots of [UNK] [UNK] action you will love this film

Create the model

A drawing of the information flow in the model

Above is a diagram of the model.

  1. This model can be built as a tf.keras.Sequential.

  2. The first layer is the encoder, which converts the text to a sequence of token indices.

  3. After the encoder is an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.

    This index-lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a tf.keras.layers.Dense layer (see the small sketch after this list).

  4. A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the outputs from one timestep to their input on the next timestep.

    The tf.keras.layers.Bidirectional wrapper can also be used with an RNN layer. This propagates the input forward and backwards through the RNN layer and then concatenates the final output.

    • The main advantage of a bidirectional RNN is that the signal from the beginning of the input doesn't need to be processed all the way through every timestep to affect the output.

    • The main disadvantage of a bidirectional RNN is that you can't efficiently stream predictions as words are being added to the end.

  5. After the RNN has converted the sequence to a single vector, the two tf.keras.layers.Dense layers do some final processing and convert this vector representation to a single logit as the classification output.
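
To make point 3 concrete, here is a small standalone sketch (not part of the model below) showing that an embedding lookup matches multiplying a one-hot vector by the embedding matrix, just without materializing the one-hot tensor:

# Sketch only: embedding lookup vs. one-hot matmul with the same weights.
demo_embedding = tf.keras.layers.Embedding(input_dim=5, output_dim=3)
ids = tf.constant([2, 4])
looked_up = demo_embedding(ids)                            # direct index lookup, shape (2, 3)
one_hot = tf.one_hot(ids, depth=5)                         # shape (2, 5)
via_matmul = tf.matmul(one_hot, demo_embedding.embeddings) # shape (2, 3)
print(np.allclose(looked_up.numpy(), via_matmul.numpy()))  # True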

The code to implement this is below:

model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

Note that a Keras sequential model is used here since all the layers in the model have a single input and produce a single output. If you want to use a stateful RNN layer, you might want to build your model with the Keras functional API or model subclassing so that you can retrieve and reuse the RNN layer states. Check the Keras RNN guide for more details.
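
For reference, here is a hedged sketch of the same architecture written with the functional API; it is purely illustrative, and the rest of the tutorial keeps the Sequential model defined above:

# Sketch only: the same architecture with the Keras functional API, which makes it
# easier to keep a handle on individual layers (for example an RNN layer whose
# states you want to retrieve and reuse later).
inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
x = encoder(inputs)
x = tf.keras.layers.Embedding(len(encoder.get_vocabulary()), 64, mask_zero=True)(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)
x = tf.keras.layers.Dense(64, activation='relu')(x)
outputs = tf.keras.layers.Dense(1)(x)
functional_model = tf.keras.Model(inputs, outputs)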

The embedding layer uses masking to handle the varying sequence-lengths. All the layers after the Embedding support masking:

print([layer.supports_masking for layer in model.layers])
[False, True, True, True, True]

To confirm that this works as expected, evaluate a sentence twice. First, alone so there's no padding to mask:

# predict on a sample text without padding.

sample_text = ('The movie was cool. The animation and the graphics '
               'were out of this world. I would recommend this movie.')
predictions = model.predict(np.array([sample_text]))
print(predictions[0])
[-0.00750345]

Now, evaluate it again in a batch with a longer sentence. The result should be identical:

# predict on a sample text with padding

padding = "the " * 2000
predictions = model.predict(np.array([sample_text, padding]))
print(predictions[0])
[-0.00750345]

Compile the Keras model to configure the training process:

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

Train the model

history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)
Epoch 1/10
391/391 [==============================] - 39s 84ms/step - loss: 0.6487 - accuracy: 0.5628 - val_loss: 0.5179 - val_accuracy: 0.7068
Epoch 2/10
391/391 [==============================] - 32s 78ms/step - loss: 0.4195 - accuracy: 0.8019 - val_loss: 0.3714 - val_accuracy: 0.8349
Epoch 3/10
391/391 [==============================] - 31s 78ms/step - loss: 0.3451 - accuracy: 0.8479 - val_loss: 0.3427 - val_accuracy: 0.8495
Epoch 4/10
391/391 [==============================] - 31s 77ms/step - loss: 0.3238 - accuracy: 0.8577 - val_loss: 0.3328 - val_accuracy: 0.8396
Epoch 5/10
391/391 [==============================] - 32s 77ms/step - loss: 0.3144 - accuracy: 0.8638 - val_loss: 0.3377 - val_accuracy: 0.8328
Epoch 6/10
391/391 [==============================] - 31s 76ms/step - loss: 0.3105 - accuracy: 0.8655 - val_loss: 0.3301 - val_accuracy: 0.8630
Epoch 7/10
391/391 [==============================] - 32s 79ms/step - loss: 0.3061 - accuracy: 0.8672 - val_loss: 0.3756 - val_accuracy: 0.8562
Epoch 8/10
391/391 [==============================] - 31s 77ms/step - loss: 0.3056 - accuracy: 0.8694 - val_loss: 0.3190 - val_accuracy: 0.8583
Epoch 9/10
391/391 [==============================] - 31s 77ms/step - loss: 0.3019 - accuracy: 0.8694 - val_loss: 0.3194 - val_accuracy: 0.8526
Epoch 10/10
391/391 [==============================] - 32s 77ms/step - loss: 0.2989 - accuracy: 0.8725 - val_loss: 0.3235 - val_accuracy: 0.8557
test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)
391/391 [==============================] - 14s 36ms/step - loss: 0.3221 - accuracy: 0.8602
Test Loss: 0.3221438527107239
Test Accuracy: 0.8601599931716919
plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.ylim(None, 1)
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')
plt.ylim(0, None)
(0.0, 0.6662303507328033)

[Plot: training and validation accuracy (left) and loss (right) over epochs]

Run a prediction on a new sentence:

If the prediction is >= 0.0, it is positive; otherwise it is negative.

sample_text = ('The movie was cool. The animation and the graphics '
               'were out of this world. I would recommend this movie.')
predictions = model.predict(np.array([sample_text]))
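
The model outputs a raw logit, so mapping it to a probability and a readable label is up to you; a minimal sketch:

# Sketch only: turn the raw logit into a probability and a sentiment label.
probability = tf.sigmoid(predictions[0, 0]).numpy()
label = 'positive' if predictions[0, 0] >= 0 else 'negative'
print(label, probability)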

Stack two or more LSTM layers

Keras recurrent layers have two available modes that are controlled by the return_sequences constructor argument:

  • If False it returns only the last output for each input sequence (a 2D tensor of shape (batch_size, output_features)). This is the default, used in the previous model.

  • If True the full sequences of successive outputs for each timestep is returned (a 3D tensor of shape (batch_size, timesteps, output_features)).
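
The difference is easy to check directly; here is a small standalone sketch (independent of the models in this tutorial) comparing the output shapes of the two modes:

# Sketch only: compare output shapes with and without return_sequences.
dummy = tf.random.normal([8, 20, 16])                                   # (batch, timesteps, features)
last_only = tf.keras.layers.LSTM(32)(dummy)                             # shape (8, 32)
full_sequence = tf.keras.layers.LSTM(32, return_sequences=True)(dummy)  # shape (8, 20, 32)
print(last_only.shape, full_sequence.shape)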

Here is what the flow of information looks like with return_sequences=True:

A drawing of the information flow with return_sequences=True (stacked bidirectional RNN layers)

The interesting thing about using an RNN with return_sequences=True is that the output still has 3 axes, like the input, so it can be passed to another RNN layer, like this:

model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(len(encoder.get_vocabulary()), 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64,  return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)
Epoch 1/10
391/391 [==============================] - 71s 145ms/step - loss: 0.6427 - accuracy: 0.5680 - val_loss: 0.4593 - val_accuracy: 0.8021
Epoch 2/10
391/391 [==============================] - 54s 134ms/step - loss: 0.4064 - accuracy: 0.8222 - val_loss: 0.3657 - val_accuracy: 0.8339
Epoch 3/10
391/391 [==============================] - 52s 133ms/step - loss: 0.3467 - accuracy: 0.8529 - val_loss: 0.3382 - val_accuracy: 0.8536
Epoch 4/10
391/391 [==============================] - 53s 133ms/step - loss: 0.3238 - accuracy: 0.8604 - val_loss: 0.3312 - val_accuracy: 0.8469
Epoch 5/10
391/391 [==============================] - 54s 134ms/step - loss: 0.3182 - accuracy: 0.8647 - val_loss: 0.3220 - val_accuracy: 0.8568
Epoch 6/10
391/391 [==============================] - 53s 133ms/step - loss: 0.3140 - accuracy: 0.8648 - val_loss: 0.3193 - val_accuracy: 0.8609
Epoch 7/10
391/391 [==============================] - 53s 134ms/step - loss: 0.3068 - accuracy: 0.8683 - val_loss: 0.3268 - val_accuracy: 0.8620
Epoch 8/10
391/391 [==============================] - 54s 137ms/step - loss: 0.3066 - accuracy: 0.8687 - val_loss: 0.3237 - val_accuracy: 0.8625
Epoch 9/10
391/391 [==============================] - 53s 132ms/step - loss: 0.3057 - accuracy: 0.8686 - val_loss: 0.3200 - val_accuracy: 0.8526
Epoch 10/10
391/391 [==============================] - 54s 137ms/step - loss: 0.3006 - accuracy: 0.8719 - val_loss: 0.3247 - val_accuracy: 0.8583
test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)
391/391 [==============================] - 24s 62ms/step - loss: 0.3207 - accuracy: 0.8607
Test Loss: 0.3207028806209564
Test Accuracy: 0.8607199788093567
# predict on a sample text without padding.

sample_text = ('The movie was not good. The animation and the graphics '
               'were terrible. I would not recommend this movie.')
predictions = model.predict(np.array([sample_text]))
print(predictions)
[[-1.8613071]]
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')

[Plot: training and validation accuracy (left) and loss (right) for the stacked model]

Check out other existing recurrent layers such as GRU layers.
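
For example, here is a hedged sketch of the first model with the LSTM swapped for a GRU (hyperparameters kept the same, purely illustrative):

# Sketch only: the single-layer architecture from above with a GRU instead of an LSTM.
gru_model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(len(encoder.get_vocabulary()), 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
gru_model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  optimizer=tf.keras.optimizers.Adam(1e-4),
                  metrics=['accuracy'])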

If you're interested in building custom RNNs, see the Keras RNN Guide.
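
As a starting point, here is a minimal sketch of a custom cell wrapped in tf.keras.layers.RNN (a plain tanh recurrence, purely illustrative):

# Sketch only: a minimal custom recurrent cell used with tf.keras.layers.RNN.
class MinimalRNNCell(tf.keras.layers.Layer):

  def __init__(self, units, **kwargs):
    super().__init__(**kwargs)
    self.units = units
    self.state_size = units

  def build(self, input_shape):
    self.kernel = self.add_weight(shape=(input_shape[-1], self.units), name='kernel')
    self.recurrent_kernel = self.add_weight(shape=(self.units, self.units), name='recurrent_kernel')

  def call(self, inputs, states):
    prev_output = states[0]
    output = tf.tanh(tf.matmul(inputs, self.kernel) +
                     tf.matmul(prev_output, self.recurrent_kernel))
    return output, [output]

custom_rnn = tf.keras.layers.RNN(MinimalRNNCell(32))
print(custom_rnn(tf.random.normal([8, 20, 16])).shape)  # (8, 32)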