Text classification with preprocessed text: Movie reviews


This notebook classifies movie reviews as positive or negative using the text of the review. This is an example of binary—or two-class—classification, an important and widely applicable kind of machine learning problem.

We'll use the IMDB dataset that contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

This notebook uses tf.keras, a high-level API to build and train models in TensorFlow. For a more advanced text classification tutorial using tf.keras, see the MLCC Text Classification Guide.

Setup

import tensorflow as tf
from tensorflow import keras

import tensorflow_datasets as tfds
tfds.disable_progress_bar()

import numpy as np

print(tf.__version__)
2.2.0

Download the IMDB dataset

The IMDB movie reviews dataset comes packaged in tfds. It has already been preprocessed so that the reviews (sequences of words) have been converted to sequences of integers, where each integer represents a specific word in a dictionary.

To encode your own text, see the Loading text tutorial.

The following code downloads the IMDB dataset to your machine (or uses a cached copy if you've already downloaded it):

(train_data, test_data), info = tfds.load(
    # Use the version pre-encoded with an ~8k vocabulary.
    'imdb_reviews/subwords8k', 
    # Return the train/test datasets as a tuple.
    split=(tfds.Split.TRAIN, tfds.Split.TEST),
    # Return (example, label) pairs from the dataset (instead of a dictionary).
    as_supervised=True,
    # Also return the `info` structure. 
    with_info=True)
WARNING:absl:TFDS datasets with text encoding are deprecated and will be removed in a future version. Instead, you should use the plain text version and tokenize the text using `tensorflow_text` (See: https://www.tensorflow.org/tutorials/tensorflow_text/intro#tfdata_example)

Downloading and preparing dataset imdb_reviews/subwords8k/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /home/kbuilder/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0...
Shuffling and writing examples to /home/kbuilder/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incompleteID3HUW/imdb_reviews-train.tfrecord
Shuffling and writing examples to /home/kbuilder/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incompleteID3HUW/imdb_reviews-test.tfrecord
Shuffling and writing examples to /home/kbuilder/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incompleteID3HUW/imdb_reviews-unsupervised.tfrecord
Dataset imdb_reviews downloaded and prepared to /home/kbuilder/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0. Subsequent calls will reuse this data.

Try the encoder

The dataset info includes the text encoder (a tfds.features.text.SubwordTextEncoder).

encoder = info.features['text'].encoder
print('Vocabulary size: {}'.format(encoder.vocab_size))
Vocabulary size: 8185

This text encoder will reversibly encode any string:

sample_string = 'Hello TensorFlow.'

encoded_string = encoder.encode(sample_string)
print('Encoded string is {}'.format(encoded_string))

original_string = encoder.decode(encoded_string)
print('The original string: "{}"'.format(original_string))

assert original_string == sample_string
Encoded string is [4025, 222, 6307, 2327, 4043, 2120, 7975]
The original string: "Hello TensorFlow."

The encoder encodes the string by breaking it into subwords or characters if the word is not in its dictionary. So the more a string resembles the dataset, the shorter the encoded representation will be.

for ts in encoded_string:
  print('{} ----> {}'.format(ts, encoder.decode([ts])))
4025 ----> Hell
222 ----> o 
6307 ----> Ten
2327 ----> sor
4043 ----> Fl
2120 ----> ow
7975 ----> .
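
The splitting behaviour can be illustrated with a minimal sketch of greedy longest-match tokenization. This is not the SubwordTextEncoder implementation, whose algorithm may differ; the function name and the toy vocabulary are hypothetical, chosen only to mirror the pieces printed above.

```python
def greedy_subword_encode(text, vocab):
    """Split `text` into the longest vocabulary pieces, left to right.

    Falls back to single characters when no vocabulary piece matches.
    """
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary piece starts here: emit a single character.
            pieces.append(text[i])
            i += 1
    return pieces


# Hypothetical toy vocabulary (an assumption for illustration).
toy_vocab = {"Hell", "o ", "Ten", "sor", "Fl", "ow", "."}

pieces = greedy_subword_encode("Hello TensorFlow.", toy_vocab)
print(pieces)
# Joining the pieces recovers the input, which is what makes the encoding reversible.
print("".join(pieces))
```

A string made of pieces the vocabulary already knows needs few tokens; unfamiliar text degrades toward one token per character.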

Explore the data

Let's take a moment to understand the format of the data. The dataset comes preprocessed: each example is an array of integers representing the words of the movie review.

The text of each review has been converted to integers, where each integer represents a specific word-piece in the dictionary.

Each label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review.

Here's what the first review looks like:

for train_example, train_label in train_data.take(1):
  print('Encoded text:', train_example[:10].numpy())
  print('Label:', train_label.numpy())
Encoded text: [  62   18   41  604  927   65    3  644 7968   21]
Label: 0

The info structure contains the encoder/decoder. The encoder can be used to recover the original text:

encoder.decode(train_example)
"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."

Prepare the data for training

You will want to create batches of training data for your model. The reviews are all different lengths, so use padded_batch to zero pad the sequences while batching:

BUFFER_SIZE = 1000

train_batches = (
    train_data
    .shuffle(BUFFER_SIZE)
    .padded_batch(32))

test_batches = (
    test_data
    .padded_batch(32))

Each batch has a shape of (batch_size, sequence_length). Because the padding is dynamic, each batch can have a different sequence length:

for example_batch, label_batch in train_batches.take(2):
  print("Batch shape:", example_batch.shape)
  print("label shape:", label_batch.shape)
  
Batch shape: (32, 1033)
label shape: (32,)
Batch shape: (32, 1008)
label shape: (32,)
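
To see why the sequence length differs between batches, here is a small NumPy sketch of per-batch zero padding. It imitates the effect of padded_batch rather than its implementation; the helper name pad_batch and the token ids are made up for illustration.

```python
import numpy as np


def pad_batch(sequences, pad_value=0):
    """Right-pad variable-length integer sequences to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len), pad_value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    return batch


# Two hypothetical mini-batches whose longest sequences differ.
batch_a = pad_batch([[62, 18, 41], [604, 927]])
batch_b = pad_batch([[65, 3, 644, 7968, 21], [62]])

print(batch_a)
print(batch_a.shape, batch_b.shape)
```

Each batch is only as wide as its own longest review, so short reviews are not padded out to the length of the longest review in the whole dataset.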

Build the model

The neural network is created by stacking layers—this requires two main architectural decisions:

  • How many layers to use in the model?
  • How many hidden units to use for each layer?

In this example, the input data consists of an array of word-indices. The labels to predict are either 0 or 1. Let's build a "Continuous bag of words" style model for this problem:

model = keras.Sequential([
  keras.layers.Embedding(encoder.vocab_size, 16),
  keras.layers.GlobalAveragePooling1D(),
  keras.layers.Dense(1)])

model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 16)          130960    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 17        
=================================================================
Total params: 130,977
Trainable params: 130,977
Non-trainable params: 0
_________________________________________________________________
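
The parameter counts in the summary can be checked by hand. The arithmetic below uses only the vocabulary size printed earlier and the layer sizes from the model definition.

```python
vocab_size = 8185      # encoder.vocab_size, as printed earlier
embedding_dim = 16     # second argument to keras.layers.Embedding

embedding_params = vocab_size * embedding_dim   # one 16-d vector per token id
pooling_params = 0                              # averaging has no weights
dense_params = embedding_dim * 1 + 1            # 16 weights plus 1 bias

total_params = embedding_params + pooling_params + dense_params
print(embedding_params, dense_params, total_params)
```

Almost all of the model's capacity sits in the embedding table; the classifier on top of it has only 17 parameters.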

The layers are stacked sequentially to build the classifier:

  1. The first layer is an Embedding layer. This layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: (batch, sequence, embedding). To learn more about embeddings, see the word embedding tutorial.
  2. Next, a GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.
  3. The last layer is densely connected (Dense) with a single output node, fed directly by this fixed-length output vector. This uses the default linear activation function that outputs logits for numerical stability. Another option is to use the sigmoid activation function that returns a float value between 0 and 1, representing a probability, or confidence level.
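
The shape changes through the stack described above can be traced with a NumPy sketch of the forward pass. Randomly initialised weights stand in for trained ones, and the batch size and sequence length are arbitrary; this mirrors the layer arithmetic, not the Keras internals.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 8185, 16
batch_size, sequence_length = 4, 10   # arbitrary sizes for illustration

# Stand-ins for trained weights (assumption: random values, zero bias).
embedding_table = rng.normal(size=(vocab_size, embedding_dim))
dense_kernel = rng.normal(size=(embedding_dim, 1))
dense_bias = np.zeros(1)

token_ids = rng.integers(0, vocab_size, size=(batch_size, sequence_length))

embedded = embedding_table[token_ids]         # lookup: (batch, sequence, embedding)
pooled = embedded.mean(axis=1)                # average over sequence: (batch, embedding)
logits = pooled @ dense_kernel + dense_bias   # one logit per example: (batch, 1)

print(embedded.shape, pooled.shape, logits.shape)
```

Because the average collapses the sequence axis, the same weights work for any sequence length.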

Hidden units

The above model has two intermediate or "hidden" layers (the embedding and pooling layers) between the input and output. The number of outputs (units, nodes, or neurons) is the dimension of the representational space for the layer. In other words, it is the amount of freedom the network is allowed when learning an internal representation.

If a model has more hidden units (a higher-dimensional representation space), and/or more layers, then the network can learn more complex representations. However, it makes the network more computationally expensive and may lead to learning unwanted patterns—patterns that improve performance on training data but not on the test data. This is called overfitting, and we'll explore it later.

Loss function and optimizer

A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs logits (a single-unit layer with a linear activation), we'll use the binary_crossentropy loss function.

This isn't the only choice for a loss function; you could, for instance, choose mean_squared_error. Generally, though, binary_crossentropy is better for dealing with probabilities: it measures the "distance" between probability distributions or, in our case, between the ground-truth distribution and the predictions.
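
As a rough illustration of why the loss is computed from logits, the sketch below compares the textbook cross-entropy with a commonly used numerically stable form that works directly on the logit. The function names are hypothetical, and this is a single-example sketch rather than the library's implementation.

```python
import math


def sigmoid(x):
    """Map a logit to a probability in (0, 1) without overflowing exp()."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_from_probability(p, label):
    """Textbook form: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def bce_from_logit(logit, label):
    """Same quantity computed from the logit: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


for logit, label in [(2.0, 1), (2.0, 0), (-1.5, 0)]:
    print(logit, label,
          bce_from_logit(logit, label),
          bce_from_probability(sigmoid(logit), label))

# For an extreme logit the probability rounds to exactly 1.0, so the textbook
# form would need log(0); the logit form stays finite.
print(bce_from_logit(1000.0, 0))
```

The two forms agree for moderate logits, and only the logit form remains usable when the model becomes very confident.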

Later, when we are exploring regression problems (say, to predict the price of a house), we will see how to use another loss function called mean squared error.

Now, configure the model to use an optimizer and a loss function:

model.compile(optimizer='adam',
              loss=tf.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

Train the model

Train the model by passing the Dataset object to the model's fit function. Set the number of epochs.

history = model.fit(train_batches,
                    epochs=30,
                    validation_data=test_batches,
                    validation_steps=30)
Epoch 1/30
782/782 [==============================] - 4s 5ms/step - loss: 0.6821 - accuracy: 0.5004 - val_loss: 0.6639 - val_accuracy: 0.5052
Epoch 2/30
782/782 [==============================] - 4s 5ms/step - loss: 0.6186 - accuracy: 0.5565 - val_loss: 0.5927 - val_accuracy: 0.5792
Epoch 3/30
782/782 [==============================] - 4s 5ms/step - loss: 0.5368 - accuracy: 0.6691 - val_loss: 0.5294 - val_accuracy: 0.6729
Epoch 4/30
782/782 [==============================] - 4s 5ms/step - loss: 0.4705 - accuracy: 0.7521 - val_loss: 0.4802 - val_accuracy: 0.7406
Epoch 5/30
782/782 [==============================] - 4s 5ms/step - loss: 0.4183 - accuracy: 0.8050 - val_loss: 0.4429 - val_accuracy: 0.7875
Epoch 6/30
782/782 [==============================] - 4s 5ms/step - loss: 0.3781 - accuracy: 0.8356 - val_loss: 0.4172 - val_accuracy: 0.7937
Epoch 7/30
782/782 [==============================] - 4s 5ms/step - loss: 0.3473 - accuracy: 0.8550 - val_loss: 0.3954 - val_accuracy: 0.8438
Epoch 8/30
782/782 [==============================] - 4s 5ms/step - loss: 0.3231 - accuracy: 0.8688 - val_loss: 0.3816 - val_accuracy: 0.8573
Epoch 9/30
782/782 [==============================] - 4s 5ms/step - loss: 0.3025 - accuracy: 0.8789 - val_loss: 0.3701 - val_accuracy: 0.8438
Epoch 10/30
782/782 [==============================] - 4s 5ms/step - loss: 0.2855 - accuracy: 0.8866 - val_loss: 0.3617 - val_accuracy: 0.8646
Epoch 11/30
782/782 [==============================] - 4s 5ms/step - loss: 0.2708 - accuracy: 0.8948 - val_loss: 0.3551 - val_accuracy: 0.8635
Epoch 12/30
782/782 [==============================] - 4s 5ms/step - loss: 0.2573 - accuracy: 0.8999 - val_loss: 0.3561 - val_accuracy: 0.8438
Epoch 13/30
782/782 [==============================] - 4s 5ms/step - loss: 0.2453 - accuracy: 0.9046 - val_loss: 0.3479 - val_accuracy: 0.8635
Epoch 14/30
782/782 [==============================] - 4s 5ms/step - loss: 0.2355 - accuracy: 0.9098 - val_loss: 0.3512 - val_accuracy: 0.8469
Epoch 15/30
782/782 [==============================] - 4s 5ms/step - loss: 0.2274 - accuracy: 0.9136 - val_loss: 0.3461 - val_accuracy: 0.8604
Epoch 16/30
782/782 [==============================] - 4s 5ms/step - loss: 0.2183 - accuracy: 0.9164 - val_loss: 0.3430 - val_accuracy: 0.8646
Epoch 17/30
782/782 [==============================] - 4s 5ms/step - loss: 0.2101 - accuracy: 0.9187 - val_loss: 0.3453 - val_accuracy: 0.8635
Epoch 18/30
782/782 [==============================] - 4s 5ms/step - loss: 0.2033 - accuracy: 0.9221 - val_loss: 0.3450 - val_accuracy: 0.8708
Epoch 19/30
782/782 [==============================] - 4s 5ms/step - loss: 0.1978 - accuracy: 0.9249 - val_loss: 0.3449 - val_accuracy: 0.8656
Epoch 20/30
782/782 [==============================] - 4s 5ms/step - loss: 0.1917 - accuracy: 0.9273 - val_loss: 0.3476 - val_accuracy: 0.8677
Epoch 21/30
782/782 [==============================] - 4s 5ms/step - loss: 0.1849 - accuracy: 0.9308 - val_loss: 0.3483 - val_accuracy: 0.8687
Epoch 22/30
782/782 [==============================] - 4s 5ms/step - loss: 0.1807 - accuracy: 0.9334 - val_loss: 0.3504 - val_accuracy: 0.8677
Epoch 23/30
782/782 [==============================] - 4s 5ms/step - loss: 0.1741 - accuracy: 0.9361 - val_loss: 0.3527 - val_accuracy: 0.8677
Epoch 24/30
782/782 [==============================] - 4s 5ms/step - loss: 0.1701 - accuracy: 0.9376 - val_loss: 0.3621 - val_accuracy: 0.8615
Epoch 25/30
782/782 [==============================] - 4s 5ms/step - loss: 0.1650 - accuracy: 0.9396 - val_loss: 0.3591 - val_accuracy: 0.8708
Epoch 26/30
782/782 [==============================] - 4s 5ms/step - loss: 0.1606 - accuracy: 0.9411 - val_loss: 0.3621 - val_accuracy: 0.8635
Epoch 27/30
782/782 [==============================] - 4s 5ms/step - loss: 0.1564 - accuracy: 0.9438 - val_loss: 0.3701 - val_accuracy: 0.8625
Epoch 28/30
782/782 [==============================] - 4s 5ms/step - loss: 0.1526 - accuracy: 0.9449 - val_loss: 0.3714 - val_accuracy: 0.8604
Epoch 29/30
782/782 [==============================] - 4s 5ms/step - loss: 0.1503 - accuracy: 0.9453 - val_loss: 0.3743 - val_accuracy: 0.8604
Epoch 30/30
782/782 [==============================] - 4s 5ms/step - loss: 0.1461 - accuracy: 0.9483 - val_loss: 0.3779 - val_accuracy: 0.8635
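
The step counts in this log follow from the dataset and batch sizes. The arithmetic below assumes the 25,000-example training split and the batch size of 32 used earlier.

```python
import math

num_train_examples = 25000   # training split size stated in the introduction
batch_size = 32              # argument passed to padded_batch
validation_steps = 30        # argument passed to model.fit

steps_per_epoch = math.ceil(num_train_examples / batch_size)
last_batch_size = num_train_examples - (steps_per_epoch - 1) * batch_size
validation_examples = validation_steps * batch_size

print(steps_per_epoch)       # batches per training epoch
print(last_batch_size)       # size of the final, partial batch
print(validation_examples)   # examples seen in each validation pass
```

Since each validation pass covers only 30 batches, the val_loss and val_accuracy figures above are estimates from 960 test reviews, which is one reason they can differ from a full evaluation on all 25,000.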

Evaluate the model

Now let's see how the model performs. Two values are returned: loss (a number that represents our error; lower values are better) and accuracy.

loss, accuracy = model.evaluate(test_batches)

print("Loss: ", loss)
print("Accuracy: ", accuracy)
782/782 [==============================] - 2s 3ms/step - loss: 0.3381 - accuracy: 0.8768
Loss:  0.3380794823169708
Accuracy:  0.8768399953842163

This fairly naive approach achieves an accuracy of about 88%. With more advanced approaches, the model should get closer to 95%.
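
Because the model outputs logits rather than probabilities, turning its outputs into class labels takes one extra step. The sketch below shows the idea with made-up logits and labels; the helper names are hypothetical, and Keras's built-in accuracy metric may apply its threshold to raw outputs differently, so treat this as an illustration of the concept rather than of that metric.

```python
def predict_label(logit):
    """Predict class 1 when sigmoid(logit) >= 0.5, which is equivalent to logit >= 0."""
    return 1 if logit >= 0.0 else 0


def accuracy_from_logits(logits, labels):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(predict_label(z) == y for z, y in zip(logits, labels))
    return correct / len(labels)


# Hypothetical logits and ground-truth labels, for illustration only.
example_logits = [2.3, -0.7, 0.4, -1.9]
example_labels = [1, 0, 0, 0]

print([predict_label(z) for z in example_logits])
print(accuracy_from_logits(example_logits, example_labels))
```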

Create a graph of accuracy and loss over time

model.fit() returns a History object that contains a dictionary with everything that happened during training:

history_dict = history.history
history_dict.keys()
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])

There are four entries: one for each monitored metric during training and validation. We can use these to plot the training and validation loss for comparison, as well as the training and validation accuracy:

import matplotlib.pyplot as plt

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

[Figure: training and validation loss by epoch]

plt.clf()   # clear figure

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')

plt.show()

[Figure: training and validation accuracy by epoch]

In these plots, the dots represent the training loss and accuracy, and the solid lines are the validation loss and accuracy.

Notice the training loss decreases with each epoch and the training accuracy increases with each epoch. This is expected when using a gradient descent optimization—it should minimize the desired quantity on every iteration.

This isn't the case for the validation loss and accuracy: they seem to peak after about twenty epochs (in the training log above, validation loss is lowest at epoch 16 and drifts upward afterwards). This is an example of overfitting: the model performs better on the training data than it does on data it has never seen before. After this point, the model over-optimizes and learns representations specific to the training data that do not generalize to test data.

For this particular case, we could prevent overfitting by simply stopping the training after twenty or so epochs. Later, you'll see how to do this automatically with a callback.
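
The stopping rule such a callback applies can be sketched in plain Python using the validation losses from the training log above. The patience value of 3 is an arbitrary assumption, and the function is a simplified illustration of patience-based stopping rather than the Keras callback itself (tf.keras.callbacks.EarlyStopping).

```python
# Validation loss per epoch, copied from the training log above.
val_loss = [0.6639, 0.5927, 0.5294, 0.4802, 0.4429, 0.4172, 0.3954, 0.3816,
            0.3701, 0.3617, 0.3551, 0.3561, 0.3479, 0.3512, 0.3461, 0.3430,
            0.3453, 0.3450, 0.3449, 0.3476, 0.3483, 0.3504, 0.3527, 0.3621,
            0.3591, 0.3621, 0.3701, 0.3714, 0.3743, 0.3779]


def early_stopping_epoch(losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve on the
    best loss seen so far; if that never happens, stop_epoch is the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(losses), best_epoch


stop_epoch, best_epoch = early_stopping_epoch(val_loss, patience=3)
print(stop_epoch, best_epoch)
```

With this patience, training on the logged run would have halted at epoch 19, keeping epoch 16 as the best checkpoint and skipping the last eleven epochs.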


#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.