Text classification with an RNN

This text classification tutorial trains a recurrent neural network on the IMDB large movie review dataset for sentiment analysis.

Setup

import tensorflow_datasets as tfds
import tensorflow as tf

Import matplotlib and create a helper function to plot graphs:

import matplotlib.pyplot as plt

def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])
  plt.show()

Setup input pipeline

The IMDB large movie review dataset is a binary classification dataset—all the reviews have either a positive or negative sentiment.

Download the dataset using TFDS.

dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']
WARNING:absl:TFDS datasets with text encoding are deprecated and will be removed in a future version. Instead, you should use the plain text version and tokenize the text using `tensorflow_text` (See: https://www.tensorflow.org/tutorials/tensorflow_text/intro#tfdata_example)

Downloading and preparing dataset imdb_reviews/subwords8k/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /home/kbuilder/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0...
Shuffling and writing examples to /home/kbuilder/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incomplete8K8ZTT/imdb_reviews-train.tfrecord
Shuffling and writing examples to /home/kbuilder/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incomplete8K8ZTT/imdb_reviews-test.tfrecord
Shuffling and writing examples to /home/kbuilder/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incomplete8K8ZTT/imdb_reviews-unsupervised.tfrecord
Dataset imdb_reviews downloaded and prepared to /home/kbuilder/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0. Subsequent calls will reuse this data.

The dataset info includes the encoder (a tfds.features.text.SubwordTextEncoder).

encoder = info.features['text'].encoder
print('Vocabulary size: {}'.format(encoder.vocab_size))
Vocabulary size: 8185

This text encoder will reversibly encode any string, falling back to byte-encoding if necessary.

sample_string = 'Hello TensorFlow.'

encoded_string = encoder.encode(sample_string)
print('Encoded string is {}'.format(encoded_string))

original_string = encoder.decode(encoded_string)
print('The original string: "{}"'.format(original_string))
Encoded string is [4025, 222, 6307, 2327, 4043, 2120, 7975]
The original string: "Hello TensorFlow."

assert original_string == sample_string
for index in encoded_string:
  print('{} ----> {}'.format(index, encoder.decode([index])))
4025 ----> Hell
222 ----> o 
6307 ----> Ten
2327 ----> sor
4043 ----> Fl
2120 ----> ow
7975 ----> .

Prepare the data for training

Next, create batches of these encoded sequences. Use the padded_batch method to zero-pad each sequence to the length of the longest sequence in its batch:

BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE)

test_dataset = test_dataset.padded_batch(BATCH_SIZE)
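
As an optional sanity check, a small sketch (using the train_dataset built above) that inspects a couple of batches; it shows that the padded length varies from batch to batch while the batch size stays fixed:

# Each batch is padded to the longest review in that batch, so the second
# dimension of example_batch differs between batches.
for example_batch, label_batch in train_dataset.take(2):
  print('Batch shape:', example_batch.shape)
  print('Label shape:', label_batch.shape)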

Create the model

Build a tf.keras.Sequential model and start with an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.

This index-lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a tf.keras.layers.Dense layer.
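
To make the lookup concrete, here is a tiny standalone sketch (the layer is randomly initialized, so the vector values themselves are not meaningful):

# One trainable 5-dimensional vector is stored per index; calling the layer
# replaces each index in the input with its vector.
embedding_layer = tf.keras.layers.Embedding(input_dim=1000, output_dim=5)
vectors = embedding_layer(tf.constant([[1, 2, 3]]))
print(vectors.shape)  # (1, 3, 5)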

A recurrent neural network (RNN) processes sequence input by iterating through the elements. At each timestep, an RNN passes its output into the input of the next timestep, so information flows along the sequence.

The tf.keras.layers.Bidirectional wrapper can also be used with an RNN layer. This propagates the input forwards and backwards through the RNN layer and then concatenates the outputs. This helps the RNN learn long-range dependencies.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

Note that a Keras sequential model is used here because all the layers in the model have a single input and produce a single output. If you want to use a stateful RNN layer, build the model with the Keras functional API or model subclassing instead, so that you can retrieve and reuse the RNN layer states; a minimal sketch follows. See the Keras RNN guide for more details.
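
For illustration only, a minimal functional-API sketch (not used in the rest of this tutorial) that exposes the LSTM states alongside its output, assuming the encoder defined above:

# return_state=True makes the LSTM also return its final hidden and cell states,
# which a stateful setup could feed back in on the next call.
inputs = tf.keras.Input(shape=(None,), dtype='int64')
x = tf.keras.layers.Embedding(encoder.vocab_size, 64)(inputs)
output, state_h, state_c = tf.keras.layers.LSTM(64, return_state=True)(x)
states_model = tf.keras.Model(inputs, [output, state_h, state_c])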

Compile the Keras model to configure the training process:

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

Train the model

history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset, 
                    validation_steps=30)
Epoch 1/10
391/391 [==============================] - 42s 107ms/step - loss: 0.6590 - accuracy: 0.5480 - val_loss: 0.5776 - val_accuracy: 0.6359
Epoch 2/10
391/391 [==============================] - 42s 107ms/step - loss: 0.3610 - accuracy: 0.8426 - val_loss: 0.3472 - val_accuracy: 0.8620
Epoch 3/10
391/391 [==============================] - 42s 108ms/step - loss: 0.2564 - accuracy: 0.9006 - val_loss: 0.3183 - val_accuracy: 0.8635
Epoch 4/10
391/391 [==============================] - 42s 106ms/step - loss: 0.2136 - accuracy: 0.9215 - val_loss: 0.3450 - val_accuracy: 0.8714
Epoch 5/10
391/391 [==============================] - 41s 106ms/step - loss: 0.1863 - accuracy: 0.9320 - val_loss: 0.3318 - val_accuracy: 0.8589
Epoch 6/10
391/391 [==============================] - 41s 106ms/step - loss: 0.1649 - accuracy: 0.9414 - val_loss: 0.3519 - val_accuracy: 0.8599
Epoch 7/10
391/391 [==============================] - 41s 106ms/step - loss: 0.1469 - accuracy: 0.9488 - val_loss: 0.3936 - val_accuracy: 0.8625
Epoch 8/10
391/391 [==============================] - 42s 106ms/step - loss: 0.1311 - accuracy: 0.9556 - val_loss: 0.4159 - val_accuracy: 0.8615
Epoch 9/10
391/391 [==============================] - 42s 106ms/step - loss: 0.1192 - accuracy: 0.9613 - val_loss: 0.4231 - val_accuracy: 0.8677
Epoch 10/10
391/391 [==============================] - 42s 107ms/step - loss: 0.1150 - accuracy: 0.9613 - val_loss: 0.4111 - val_accuracy: 0.8594

test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
391/391 [==============================] - 17s 43ms/step - loss: 0.4165 - accuracy: 0.8546
Test Loss: 0.4165024161338806
Test Accuracy: 0.8545600175857544

The above model does not mask the padding applied to the sequences. This can lead to skew if the model is trained on padded sequences and tested on un-padded sequences. Ideally you would use masking to avoid this (a sketch of one option follows), but as you can see from the predictions below it only has a small effect on the output.
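
One common way to add masking, shown here only as a hedged sketch and not applied in this tutorial, is to let the Embedding layer build a mask from the zero padding; downstream layers such as the LSTM then skip the padded timesteps:

# mask_zero=True treats index 0 as padding and propagates a mask to later layers.
masked_embedding = tf.keras.layers.Embedding(encoder.vocab_size, 64, mask_zero=True)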

The model outputs a single logit: if the prediction is >= 0.0 the sentiment is positive, otherwise it is negative (equivalently, apply a sigmoid and threshold at 0.5).

def pad_to_size(vec, size):
  # Zero-pad the encoded review up to `size` tokens.
  zeros = [0] * (size - len(vec))
  vec.extend(zeros)
  return vec

def sample_predict(sample_pred_text, pad):
  # Encode the raw text, optionally zero-pad it, and run the model on a batch of one.
  encoded_sample_pred_text = encoder.encode(sample_pred_text)

  if pad:
    encoded_sample_pred_text = pad_to_size(encoded_sample_pred_text, 64)
  encoded_sample_pred_text = tf.cast(encoded_sample_pred_text, tf.float32)
  predictions = model.predict(tf.expand_dims(encoded_sample_pred_text, 0))

  return predictions
# predict on a sample text without padding.

sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)
[[0.15098701]]

# predict on a sample text with padding

sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)
[[0.15792076]]
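
These raw values are logits; if you prefer probabilities, you can apply a sigmoid (a small optional sketch using the predictions variable from the cell above):

# tf.sigmoid maps the logit into [0, 1]; values above 0.5 indicate positive sentiment.
print(tf.sigmoid(predictions).numpy())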

plot_graphs(history, 'accuracy')

(Figure: training and validation accuracy per epoch.)

plot_graphs(history, 'loss')

(Figure: training and validation loss per epoch.)

Stack two or more LSTM layers

Keras recurrent layers have two available modes that are controlled by the return_sequences constructor argument:

  • If return_sequences=True: return the full sequence of outputs for every timestep (a 3D tensor of shape (batch_size, timesteps, output_features)).
  • If return_sequences=False (the default): return only the last output for each input sequence (a 2D tensor of shape (batch_size, output_features)). A small shape check follows this list.
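
To see the difference in output shape, here is a small illustrative check on random data:

# With return_sequences=True the time dimension is kept; without it only the
# final output per sequence is returned.
dummy = tf.random.uniform((1, 10, 8))  # (batch, timesteps, features)
print(tf.keras.layers.LSTM(4, return_sequences=True)(dummy).shape)  # (1, 10, 4)
print(tf.keras.layers.LSTM(4)(dummy).shape)  # (1, 4)
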
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64,  return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)
Epoch 1/10
391/391 [==============================] - 75s 191ms/step - loss: 0.6754 - accuracy: 0.5268 - val_loss: 0.5713 - val_accuracy: 0.6844
Epoch 2/10
391/391 [==============================] - 76s 193ms/step - loss: 0.3913 - accuracy: 0.8333 - val_loss: 0.3784 - val_accuracy: 0.8568
Epoch 3/10
391/391 [==============================] - 76s 194ms/step - loss: 0.2646 - accuracy: 0.9026 - val_loss: 0.3449 - val_accuracy: 0.8536
Epoch 4/10
391/391 [==============================] - 76s 195ms/step - loss: 0.2115 - accuracy: 0.9244 - val_loss: 0.3610 - val_accuracy: 0.8703
Epoch 5/10
391/391 [==============================] - 76s 195ms/step - loss: 0.1714 - accuracy: 0.9447 - val_loss: 0.3694 - val_accuracy: 0.8651
Epoch 6/10
391/391 [==============================] - 76s 196ms/step - loss: 0.1462 - accuracy: 0.9555 - val_loss: 0.4216 - val_accuracy: 0.8620
Epoch 7/10
391/391 [==============================] - 76s 195ms/step - loss: 0.1224 - accuracy: 0.9664 - val_loss: 0.4741 - val_accuracy: 0.8474
Epoch 8/10
391/391 [==============================] - 76s 195ms/step - loss: 0.1053 - accuracy: 0.9722 - val_loss: 0.5059 - val_accuracy: 0.8531
Epoch 9/10
391/391 [==============================] - 75s 193ms/step - loss: 0.1057 - accuracy: 0.9712 - val_loss: 0.4720 - val_accuracy: 0.8516
Epoch 10/10
391/391 [==============================] - 74s 190ms/step - loss: 0.0818 - accuracy: 0.9810 - val_loss: 0.5819 - val_accuracy: 0.8526

test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
391/391 [==============================] - 32s 82ms/step - loss: 0.5755 - accuracy: 0.8511
Test Loss: 0.5754830241203308
Test Accuracy: 0.8511199951171875

# predict on a sample text without padding.

sample_pred_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)
[[-2.8208718]]

# predict on a sample text with padding

sample_pred_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)
[[-2.6763365]]

plot_graphs(history, 'accuracy')

(Figure: training and validation accuracy per epoch for the stacked model.)

plot_graphs(history, 'loss')

(Figure: training and validation loss per epoch for the stacked model.)

Check out other existing recurrent layers such as GRU layers.
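
For example, swapping the LSTM for a GRU in the first model is a one-line change (a hypothetical variant, not trained in this tutorial):

# Same architecture as the first model, with a bidirectional GRU instead of an LSTM.
gru_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])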

If you're interested in building custom RNNs, see the Keras RNN guide.