Text classification with an RNN

This text classification tutorial trains a recurrent neural network on the IMDB large movie review dataset for sentiment analysis.

Setup

pip install -q tensorflow_datasets
import numpy as np

import tensorflow_datasets as tfds
import tensorflow as tf

tfds.disable_progress_bar()

Import matplotlib and create a helper function to plot graphs:

import matplotlib.pyplot as plt

def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])

Setup input pipeline

The IMDB large movie review dataset is a binary classification dataset—all the reviews have either a positive or negative sentiment.

Download the dataset using TFDS. See the loading text tutorial for details on how to load this sort of data manually.
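
For reference, a minimal sketch of the manual route (assuming the raw aclImdb archive has been downloaded and extracted locally; the path below is hypothetical, and the 'unsup' folder would need to be removed first so only the pos/neg folders remain):

# Hypothetical manual alternative to tfds.load, built from an extracted
# aclImdb directory. Each remaining subdirectory (pos/neg) becomes a label.
manual_train = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',   # assumed extraction path
    batch_size=64)     # yields batches of (text, label) pairs

This tutorial uses the TFDS version: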

dataset, info = tfds.load('imdb_reviews', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

train_dataset.element_spec
Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /home/kbuilder/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...
Shuffling and writing examples to /home/kbuilder/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete6S34RL/imdb_reviews-train.tfrecord
Shuffling and writing examples to /home/kbuilder/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete6S34RL/imdb_reviews-test.tfrecord
Shuffling and writing examples to /home/kbuilder/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete6S34RL/imdb_reviews-unsupervised.tfrecord
Dataset imdb_reviews downloaded and prepared to /home/kbuilder/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.

(TensorSpec(shape=(), dtype=tf.string, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

Initially this returns a dataset of (text, label) pairs:

for example, label in train_dataset.take(1):
  print('text: ', example.numpy())
  print('label: ', label.numpy())
text:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
label:  0

Next shuffle the data for training and create batches of these (text, label) pairs:

BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)
for example, label in train_dataset.take(1):
  print('texts: ', example.numpy()[:3])
  print()
  print('labels: ', label.numpy()[:3])
texts:  [b"<br /><br />One would expect a movie with a famous comedian in the lead role, to be a funny movie. This is not the case here. I laughed out loud once throughout the whole movie, and that wasn't even during the final comedy-scene (which one would also expect to be the funniest). This is one you can watch when it comes to TV, don't spend any other money renting it."
 b"I LOVE this show, it's sure to be a winner. Jessica Alba does a great job, it's about time we have a kick-ass girl who's not the cutesy type. The entire cast is wonderful and all the episopes have good plots. Everything is layed out well, and thought over. To put it together must have taken a while, because it wasn't someone in a hurry that just slapped something together. It's a GREAT show altogether."
 b'SAKURA KILLERS (1+ outta 5 stars) Maybe in 1987 this movie might have seemed cool... if you had never ever seen a *good* ninja movie. Cheesy \'80s music... cheesy dialogue... cheesy acting... and way-beyond-cheesy martial arts sequences. The coolest scene is at the beginning... with an aged Chuck Connors playing golf on a beach... several black clad ninjas try to sneak up on him and it looks like he is too intent on hitting his ball to notice... suddenly he reaches into his golf bag and... naw, I won\'t spoil it for you... if you ever have the misfortune of seeing this movie you\'ll thank me. The story is a lot of nonsense about some stolen videotape or something. A bunch of dim-bulb Caucasian heroes are trained in the ways the ninja because "only a ninja can fight a ninja" or something like that. Strange, these guys don\'t seem to fight any better after their training than before... oh well, the movie does move along pretty briskly. The fight scenes may not be great.. but they are plentiful... and the overdone sound effects are good for a few chuckles.']

labels:  [0 1 0]

Create the text encoder

The raw text loaded by tfds needs to be processed before it can be used in a model. The simplest way to process text for training is using the experimental.preprocessing.TextVectorization layer. This layer has many capabilities, but this tutorial sticks to the default behavior.

Create the layer, and pass the dataset's text to the layer's .adapt method:

VOCAB_SIZE=1000
encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))

The .adapt method sets the layer's vocabulary. Here are the first 20 tokens. After the padding and unknown tokens they're sorted by frequency:

vocab = np.array(encoder.get_vocabulary())
vocab[:20]
array(['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i',
       'this', 'that', 'br', 'was', 'as', 'for', 'with', 'movie', 'but'],
      dtype='<U14')

Once the vocabulary is set, the layer can encode text into indices. The tensors of indices are 0-padded to the longest sequence in the batch (unless you set a fixed output_sequence_length):

encoded_example = encoder(example)[:3].numpy()
encoded_example
array([[ 13,  13,  29, ...,   0,   0,   0],
       [ 10, 116,  11, ...,   0,   0,   0],
       [  1,   1, 470, ...,   0,   0,   0]])
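
If you prefer fixed-length outputs, you can instead set output_sequence_length when constructing the layer. A small sketch (the length of 250 is an arbitrary illustration, not used in the rest of this tutorial):

# Illustrative: an encoder that pads or truncates every example to exactly
# 250 tokens instead of padding to the longest sequence in the batch.
fixed_len_encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_sequence_length=250)
fixed_len_encoder.adapt(train_dataset.map(lambda text, label: text))
print(fixed_len_encoder(example).shape)  # (batch_size, 250)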

With the default settings, the process is not completely reversible. There are two main reasons for that:

  1. The default value for preprocessing.TextVectorization's standardize argument is "lower_and_strip_punctuation".
  2. The limited vocabulary size and lack of character-based fallback results in some unknown tokens.
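
Both of these defaults can be adjusted when constructing the layer if you need a more faithful encoding. A minimal sketch (the settings shown are illustrative and not used in the rest of this tutorial):

# Illustrative: keep punctuation by only lower-casing in a custom standardize
# callable, and allow a larger vocabulary so fewer words map to [UNK].
def lowercase_only(text):
  return tf.strings.lower(text)

custom_encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=10000,              # illustrative: a larger vocabulary
    standardize=lowercase_only)    # custom standardization callable
custom_encoder.adapt(train_dataset.map(lambda text, label: text))

With the defaults, here is the round-trip for the three example reviews: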
for n in range(3):
  print("Original: ", example[n].numpy())
  print("Round-trip: ", " ".join(vocab[encoded_example[n]]))
  print()
Original:  b"<br /><br />One would expect a movie with a famous comedian in the lead role, to be a funny movie. This is not the case here. I laughed out loud once throughout the whole movie, and that wasn't even during the final comedy-scene (which one would also expect to be the funniest). This is one you can watch when it comes to TV, don't spend any other money renting it."
Round-trip:  br br one would expect a movie with a famous [UNK] in the lead role to be a funny movie this is not the case here i [UNK] out [UNK] once throughout the whole movie and that wasnt even during the final [UNK] which one would also expect to be the [UNK] this is one you can watch when it comes to tv dont [UNK] any other money [UNK] it                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          

Original:  b"I LOVE this show, it's sure to be a winner. Jessica Alba does a great job, it's about time we have a kick-ass girl who's not the cutesy type. The entire cast is wonderful and all the episopes have good plots. Everything is layed out well, and thought over. To put it together must have taken a while, because it wasn't someone in a hurry that just slapped something together. It's a GREAT show altogether."
Round-trip:  i love this show its sure to be a [UNK] [UNK] [UNK] does a great job its about time we have a [UNK] girl whos not the [UNK] type the entire cast is wonderful and all the [UNK] have good [UNK] everything is [UNK] out well and thought over to put it together must have taken a while because it wasnt someone in a [UNK] that just [UNK] something together its a great show [UNK]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     

Original:  b'SAKURA KILLERS (1+ outta 5 stars) Maybe in 1987 this movie might have seemed cool... if you had never ever seen a *good* ninja movie. Cheesy \'80s music... cheesy dialogue... cheesy acting... and way-beyond-cheesy martial arts sequences. The coolest scene is at the beginning... with an aged Chuck Connors playing golf on a beach... several black clad ninjas try to sneak up on him and it looks like he is too intent on hitting his ball to notice... suddenly he reaches into his golf bag and... naw, I won\'t spoil it for you... if you ever have the misfortune of seeing this movie you\'ll thank me. The story is a lot of nonsense about some stolen videotape or something. A bunch of dim-bulb Caucasian heroes are trained in the ways the ninja because "only a ninja can fight a ninja" or something like that. Strange, these guys don\'t seem to fight any better after their training than before... oh well, the movie does move along pretty briskly. The fight scenes may not be great.. but they are plentiful... and the overdone sound effects are good for a few chuckles.'
Round-trip:  [UNK] [UNK] 1 [UNK] 5 stars maybe in [UNK] this movie might have seemed cool if you had never ever seen a good [UNK] movie cheesy 80s music cheesy dialogue cheesy acting and [UNK] [UNK] [UNK] sequences the [UNK] scene is at the beginning with an [UNK] [UNK] [UNK] playing [UNK] on a [UNK] several black [UNK] [UNK] try to [UNK] up on him and it looks like he is too [UNK] on [UNK] his [UNK] to [UNK] [UNK] he [UNK] into his [UNK] [UNK] and [UNK] i wont [UNK] it for you if you ever have the [UNK] of seeing this movie youll [UNK] me the story is a lot of [UNK] about some [UNK] [UNK] or something a bunch of [UNK] [UNK] [UNK] are [UNK] in the ways the [UNK] because only a [UNK] can fight a [UNK] or something like that strange these guys dont seem to fight any better after their [UNK] than before oh well the movie does move along pretty [UNK] the fight scenes may not be great but they are [UNK] and the [UNK] sound effects are good for a few [UNK]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   


Create the model

A drawing of the information flow in the model

Above is a diagram of the model.

  1. This model can be built as a tf.keras.Sequential.

  2. The first layer is the encoder, which converts the text to a sequence of token indices.

  3. After the encoder is an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.

    This index-lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a tf.keras.layers.Dense layer (see the short check after this list).

  4. A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the outputs from one timestep to their input on the next timestep.

    The tf.keras.layers.Bidirectional wrapper can also be used with an RNN layer. This propagates the input forward and backwards through the RNN layer and then concatenates the final output.

    • The main advantage to a bidirectional RNN is that the signal from the beginning of the input doesn't need to be processed all the way through every timestep to affect the output.

    • The main disadvantage of a bidirectional RNN is that you can't efficiently stream predictions as words are being added to the end.

  5. After the RNN has converted the sequence to a single vector, the two tf.keras.layers.Dense layers do some final processing, converting this vector representation to a single logit as the classification output.
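
To make the efficiency point in step 3 concrete, here is a quick, illustrative check that an embedding lookup produces the same values as multiplying a one-hot encoding by the embedding matrix (the layer sizes are arbitrary):

# Illustrative only: an Embedding lookup vs. one-hot encoding followed by a
# matrix multiply with the same weights. The lookup avoids materializing the
# large one-hot tensors.
demo_embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=64)
token_ids = tf.constant([2, 5, 7])

looked_up = demo_embedding(token_ids)                       # shape (3, 64)
one_hot = tf.one_hot(token_ids, depth=1000)                 # shape (3, 1000)
multiplied = tf.matmul(one_hot, demo_embedding.embeddings)  # shape (3, 64)

print(tf.reduce_max(tf.abs(looked_up - multiplied)).numpy())  # expect 0.0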

The code to implement this is below:

model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

Note that a Keras Sequential model is used here since all the layers in the model only have a single input and produce a single output. If you want to use a stateful RNN layer, build your model with the Keras functional API or model subclassing instead, so that you can retrieve and reuse the RNN layer states. Check the Keras RNN guide for more details.
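
For reference, a minimal functional-API sketch of the same architecture (illustrative only; the Sequential model above is what the rest of this tutorial uses):

# Illustrative functional-API version of the same stack of layers.
text_input = tf.keras.Input(shape=(1,), dtype=tf.string)
x = encoder(text_input)
x = tf.keras.layers.Embedding(
    input_dim=len(encoder.get_vocabulary()), output_dim=64, mask_zero=True)(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)
x = tf.keras.layers.Dense(64, activation='relu')(x)
output = tf.keras.layers.Dense(1)(x)
functional_model = tf.keras.Model(text_input, output)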

The embedding layer uses masking to handle the varying sequence lengths. All the layers after the Embedding layer support masking:

print([layer.supports_masking for layer in model.layers])
[False, True, True, True, True]

To confirm that this works as expected, evaluate a sentence twice. First, alone so there's no padding to mask:

# predict on a sample text without padding.

sample_text = ('The movie was cool. The animation and the graphics '
               'were out of this world. I would recommend this movie.')
predictions = model.predict(np.array([sample_text]))
print(predictions[0])
[0.00017489]

Now, evaluate it again in a batch with a longer sentence. The result should be identical:

# predict on a sample text with padding

padding = "the " * 2000
predictions = model.predict(np.array([sample_text, padding]))
print(predictions[0])
[0.00017488]

Compile the Keras model to configure the training process:

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

Train the model

history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset, 
                    validation_steps=30)
Epoch 1/10
391/391 [==============================] - 31s 78ms/step - loss: 0.6277 - accuracy: 0.5881 - val_loss: 0.5300 - val_accuracy: 0.7318
Epoch 2/10
391/391 [==============================] - 29s 73ms/step - loss: 0.4763 - accuracy: 0.7584 - val_loss: 0.4421 - val_accuracy: 0.7724
Epoch 3/10
391/391 [==============================] - 28s 72ms/step - loss: 0.3821 - accuracy: 0.8255 - val_loss: 0.3749 - val_accuracy: 0.8313
Epoch 4/10
391/391 [==============================] - 29s 73ms/step - loss: 0.3460 - accuracy: 0.8482 - val_loss: 0.3658 - val_accuracy: 0.8349
Epoch 5/10
391/391 [==============================] - 29s 73ms/step - loss: 0.3257 - accuracy: 0.8582 - val_loss: 0.3374 - val_accuracy: 0.8500
Epoch 6/10
391/391 [==============================] - 28s 71ms/step - loss: 0.3143 - accuracy: 0.8644 - val_loss: 0.3333 - val_accuracy: 0.8495
Epoch 7/10
391/391 [==============================] - 28s 70ms/step - loss: 0.3079 - accuracy: 0.8686 - val_loss: 0.3254 - val_accuracy: 0.8542
Epoch 8/10
391/391 [==============================] - 29s 74ms/step - loss: 0.3048 - accuracy: 0.8699 - val_loss: 0.3237 - val_accuracy: 0.8573
Epoch 9/10
391/391 [==============================] - 29s 73ms/step - loss: 0.2998 - accuracy: 0.8704 - val_loss: 0.3242 - val_accuracy: 0.8573
Epoch 10/10
391/391 [==============================] - 29s 74ms/step - loss: 0.2996 - accuracy: 0.8711 - val_loss: 0.3243 - val_accuracy: 0.8573

test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
391/391 [==============================] - 14s 36ms/step - loss: 0.3200 - accuracy: 0.8628
Test Loss: 0.3200407922267914
Test Accuracy: 0.8627600073814392

plt.figure(figsize=(16,8))
plt.subplot(1,2,1)
plot_graphs(history, 'accuracy')
plt.ylim(None,1)
plt.subplot(1,2,2)
plot_graphs(history, 'loss')
plt.ylim(0,None)
(0.0, 0.6440628841519356)

Plots of the training and validation accuracy and loss over the ten epochs

Run a prediction on a new sentence:

If the prediction is >= 0.0, it is positive; otherwise it is negative.

sample_text = ('The movie was cool. The animation and the graphics '
               'were out of this world. I would recommend this movie.')
predictions = model.predict(np.array([sample_text]))
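
For example, you can apply that threshold (or a sigmoid, since the model outputs a raw logit) to turn the prediction into a label. A small sketch:

# The model outputs a logit; >= 0.0 means positive sentiment.
logit = float(predictions[0])
probability = tf.sigmoid(logit).numpy()
print('probability: {:.3f} -> {}'.format(
    probability, 'positive' if logit >= 0.0 else 'negative'))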

Stack two or more LSTM layers

Keras recurrent layers have two available modes that are controlled by the return_sequences constructor argument:

  • If False it returns only the last output for each input sequence (a 2D tensor of shape (batch_size, output_features)). This is the default, used in the previous model.

  • If True, the full sequence of successive outputs for each timestep is returned (a 3D tensor of shape (batch_size, timesteps, output_features)).
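
A quick shape check illustrates the difference (random data with arbitrary sizes, just for demonstration):

# Illustrative shape check with random data (sizes are arbitrary).
demo_input = tf.random.uniform(shape=(32, 10, 8))  # (batch, timesteps, features)

print(tf.keras.layers.LSTM(4)(demo_input).shape)                         # (32, 4)
print(tf.keras.layers.LSTM(4, return_sequences=True)(demo_input).shape)  # (32, 10, 4)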

Here is what the flow of information looks like with return_sequences=True:

A drawing of the information flow in a model with two stacked bidirectional RNN layers

The interesting thing about using an RNN with return_sequences=True is that the output still has 3 axes, like the input, so it can be passed to another RNN layer, like this:

model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(len(encoder.get_vocabulary()), 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64,  return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)
Epoch 1/10
391/391 [==============================] - 56s 144ms/step - loss: 0.6507 - accuracy: 0.5621 - val_loss: 0.5173 - val_accuracy: 0.6797
Epoch 2/10
391/391 [==============================] - 52s 133ms/step - loss: 0.4008 - accuracy: 0.8187 - val_loss: 0.3591 - val_accuracy: 0.8448
Epoch 3/10
391/391 [==============================] - 53s 135ms/step - loss: 0.3397 - accuracy: 0.8526 - val_loss: 0.3300 - val_accuracy: 0.8490
Epoch 4/10
391/391 [==============================] - 50s 129ms/step - loss: 0.3238 - accuracy: 0.8599 - val_loss: 0.3306 - val_accuracy: 0.8615
Epoch 5/10
391/391 [==============================] - 50s 127ms/step - loss: 0.3127 - accuracy: 0.8668 - val_loss: 0.3184 - val_accuracy: 0.8557
Epoch 6/10
391/391 [==============================] - 49s 126ms/step - loss: 0.3087 - accuracy: 0.8674 - val_loss: 0.3285 - val_accuracy: 0.8562
Epoch 7/10
391/391 [==============================] - 50s 127ms/step - loss: 0.3075 - accuracy: 0.8684 - val_loss: 0.3223 - val_accuracy: 0.8469
Epoch 8/10
391/391 [==============================] - 49s 126ms/step - loss: 0.3021 - accuracy: 0.8685 - val_loss: 0.3211 - val_accuracy: 0.8521
Epoch 9/10
391/391 [==============================] - 49s 125ms/step - loss: 0.3008 - accuracy: 0.8688 - val_loss: 0.3330 - val_accuracy: 0.8609
Epoch 10/10
391/391 [==============================] - 48s 123ms/step - loss: 0.2995 - accuracy: 0.8709 - val_loss: 0.3169 - val_accuracy: 0.8604

test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
391/391 [==============================] - 22s 57ms/step - loss: 0.3141 - accuracy: 0.8602
Test Loss: 0.3140539526939392
Test Accuracy: 0.8602399826049805

# predict on a sample text without padding.

sample_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = model.predict(np.array([sample_text]))
print(predictions)
[[-2.053952]]

plt.figure(figsize=(16,6))
plt.subplot(1,2,1)
plot_graphs(history, 'accuracy')
plt.subplot(1,2,2)
plot_graphs(history, 'loss')

Plots of the training and validation accuracy and loss for the stacked model

Check out other existing recurrent layers such as GRU layers.
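
For example, swapping the LSTM for a GRU in the first model is a small change. A sketch (not trained here):

# Illustrative variant of the first model using a GRU instead of an LSTM.
gru_model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(len(encoder.get_vocabulary()), 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])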

If you're interested in building custom RNNs, see the Keras RNN Guide.