How to solve a problem on Kaggle with TF-Hub


TF-Hub is a platform for sharing machine learning expertise packaged in reusable resources, notably pre-trained modules. In this tutorial, we will use a TF-Hub text embedding module to train a simple sentiment classifier with a reasonable baseline accuracy. We will then submit the predictions to Kaggle.
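
To make this concrete, here is a minimal sketch of what a text embedding module does, using the same nnlm-en-dim128 module the classifier below is built on (the example phrases are made up):

import tensorflow as tf
import tensorflow_hub as hub

# Load the pre-trained embedding module and map raw strings to
# 128-dimensional vectors.
embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128/1").signatures['default']
vectors = embed(tf.constant(["what a great movie", "a dull, lifeless mess"]))['default']
print(vectors.shape)  # (2, 128)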

For a more detailed tutorial on text classification with TF-Hub, and further steps for improving the accuracy, take a look at Text classification with TF-Hub.

Setup

!pip install -q kaggle

# This notebook uses features from TF 2.2.
!pip install tf-nightly
Successfully installed astunparse-1.6.3 gast-0.3.3 tb-nightly-2.3.0a20200430 tensorboard-plugin-wit-1.6.0.post3 tf-estimator-nightly-2.3.0.dev2020043001 tf-nightly-2.2.0.dev20200430

import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import zipfile

from sklearn import model_selection

Since this tutorial uses a dataset from Kaggle, it requires creating an API token for your Kaggle account and uploading it to the Colab environment.

import os
import pathlib

# Upload the API token.
def get_kaggle():
  try:
    import kaggle
    return kaggle
  except OSError:
    pass

  token_file = pathlib.Path("~/.kaggle/kaggle.json").expanduser()
  token_file.parent.mkdir(exist_ok=True, parents=True)

  try:
    from google.colab import files
  except ImportError:
    raise ValueError("Could not find kaggle token.")

  uploaded = files.upload()
  token_content = uploaded.get('kaggle.json', None)
  if token_content:
    token_file.write_bytes(token_content)
    token_file.chmod(0o600)
  else:
    raise ValueError('Need a file named "kaggle.json"')
  
  import kaggle
  return kaggle


kaggle = get_kaggle()
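
Note that get_kaggle() only knows how to upload the token through Colab's file widget. If you are running elsewhere, a minimal sketch for installing an already-downloaded token looks like this (the source path is a placeholder for wherever you saved kaggle.json):

import pathlib
import shutil

# Copy a previously downloaded token into the location the kaggle
# package expects, and restrict its permissions.
token_file = pathlib.Path("~/.kaggle/kaggle.json").expanduser()
token_file.parent.mkdir(exist_ok=True, parents=True)
shutil.copy("/path/to/kaggle.json", token_file)  # placeholder source path
token_file.chmod(0o600)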

Getting started

Data

We will try to solve the Sentiment Analysis on Movie Reviews task from Kaggle. The dataset consists of syntactic subphrases of Rotten Tomatoes movie reviews. The task is to label each phrase on a five-point sentiment scale, from 0 (negative) to 4 (positive).

You must accept the competition rules before you can use the API to download the data.

SENTIMENT_LABELS = [
    "negative", "somewhat negative", "neutral", "somewhat positive", "positive"
]

# Add a column with readable values representing the sentiment.
def add_readable_labels_column(df, sentiment_value_column):
  df["SentimentLabel"] = df[sentiment_value_column].replace(
      range(5), SENTIMENT_LABELS)
    
# Load a TSV file from inside a zip archive into a DataFrame.
def load_data_from_zip(path):
  with zipfile.ZipFile(path, "r") as zip_ref:
    name = zip_ref.namelist()[0]
    with zip_ref.open(name) as zf:
      return pd.read_csv(zf, sep="\t", index_col=0)


# The data does not come with a validation set so we'll create one from the
# training set.
def get_data(competition, train_file, test_file, validation_set_ratio=0.1):
  data_path = pathlib.Path("data")
  kaggle.api.competition_download_files(competition, data_path)
  competition_path = (data_path/competition)
  competition_path.mkdir(exist_ok=True, parents=True)
  competition_zip_path = competition_path.with_suffix(".zip")

  with zipfile.ZipFile(competition_zip_path, "r") as zip_ref:
    zip_ref.extractall(competition_path)
  
  train_df = load_data_from_zip(competition_path/train_file)
  test_df = load_data_from_zip(competition_path/test_file)

  # Add a human readable label.
  add_readable_labels_column(train_df, "Sentiment")

  # We split by sentence ids, because we don't want to have phrases belonging
  # to the same sentence in both training and validation set.
  train_indices, validation_indices = model_selection.train_test_split(
      np.unique(train_df["SentenceId"]),
      test_size=validation_set_ratio,
      random_state=0)

  validation_df = train_df[train_df["SentenceId"].isin(validation_indices)]
  train_df = train_df[train_df["SentenceId"].isin(train_indices)]
  print("Split the training data into %d training and %d validation examples." %
        (len(train_df), len(validation_df)))

  return train_df, validation_df, test_df


train_df, validation_df, test_df = get_data(
    "sentiment-analysis-on-movie-reviews",
    "train.tsv.zip", "test.tsv.zip")
Split the training data into 140315 training and 15745 validation examples.

train_df.head(20)

Training a Model

class MyModel(tf.keras.Model):
  def __init__(self, hub_url):
    super().__init__()
    self.hub_url = hub_url
    self.embed = hub.load(self.hub_url).signatures['default']
    self.sequential = tf.keras.Sequential([
      tf.keras.layers.Dense(500),
      tf.keras.layers.Dense(100),
      tf.keras.layers.Dense(5),
    ])

  def call(self, inputs):
    # Each DataFrame column arrives as a [batch, 1] tensor; take the
    # string column of phrases.
    phrases = inputs['Phrase'][:,0]
    # Embed the phrases and scale the embeddings up before the dense layers.
    embedding = 5*self.embed(phrases)['default']
    return self.sequential(embedding)

  def get_config(self):
    return {"hub_url":self.hub_url}


model = MyModel("https://tfhub.dev/google/nnlm-en-dim128/1")
model.compile(
    loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.optimizers.Adam(),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy(name="accuracy")])

history = model.fit(x=dict(train_df), y=train_df['Sentiment'],
                    validation_data=(dict(validation_df), validation_df['Sentiment']),
                    epochs=25)
Epoch 1/25
4385/4385 [==============================] - 13s 3ms/step - loss: 1.0241 - accuracy: 0.5854 - val_loss: 0.9934 - val_accuracy: 0.5891
Epoch 2/25
4385/4385 [==============================] - 13s 3ms/step - loss: 1.0000 - accuracy: 0.5942 - val_loss: 0.9950 - val_accuracy: 0.5982
Epoch 3/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9960 - accuracy: 0.5970 - val_loss: 0.9780 - val_accuracy: 0.6005
Epoch 4/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9930 - accuracy: 0.5980 - val_loss: 0.9817 - val_accuracy: 0.5971
Epoch 5/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9914 - accuracy: 0.5979 - val_loss: 0.9838 - val_accuracy: 0.5928
Epoch 6/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9899 - accuracy: 0.5981 - val_loss: 0.9786 - val_accuracy: 0.5991
Epoch 7/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9896 - accuracy: 0.5985 - val_loss: 0.9872 - val_accuracy: 0.5880
Epoch 8/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9886 - accuracy: 0.5993 - val_loss: 0.9841 - val_accuracy: 0.5969
Epoch 9/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9882 - accuracy: 0.5991 - val_loss: 0.9811 - val_accuracy: 0.5955
Epoch 10/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9876 - accuracy: 0.5983 - val_loss: 0.9801 - val_accuracy: 0.5954
Epoch 11/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9880 - accuracy: 0.5998 - val_loss: 0.9764 - val_accuracy: 0.6010
Epoch 12/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9875 - accuracy: 0.6001 - val_loss: 0.9788 - val_accuracy: 0.5947
Epoch 13/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9869 - accuracy: 0.6000 - val_loss: 0.9731 - val_accuracy: 0.6011
Epoch 14/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9869 - accuracy: 0.5992 - val_loss: 0.9792 - val_accuracy: 0.5956
Epoch 15/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9868 - accuracy: 0.5988 - val_loss: 0.9791 - val_accuracy: 0.5930
Epoch 16/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9863 - accuracy: 0.5999 - val_loss: 0.9741 - val_accuracy: 0.5987
Epoch 17/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9863 - accuracy: 0.5996 - val_loss: 0.9865 - val_accuracy: 0.5945
Epoch 18/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9860 - accuracy: 0.6001 - val_loss: 0.9816 - val_accuracy: 0.5931
Epoch 19/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9860 - accuracy: 0.6002 - val_loss: 0.9823 - val_accuracy: 0.5912
Epoch 20/25
4385/4385 [==============================] - 13s 3ms/step - loss: 0.9858 - accuracy: 0.6010 - val_loss: 0.9736 - val_accuracy: 0.5990

Prediction

Plot the training and validation accuracy over the epochs, then run evaluations and predictions for the training and validation sets.

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
[<matplotlib.lines.Line2D at 0x7f55b9b59240>]

[Figure: training and validation accuracy per epoch]

train_eval_result = model.evaluate(dict(train_df), train_df['Sentiment'])
validation_eval_result = model.evaluate(dict(validation_df), validation_df['Sentiment'])

print(f"Training set accuracy: {train_eval_result[1]}")
print(f"Validation set accuracy: {validation_eval_result[1]}")
4385/4385 [==============================] - 12s 3ms/step - loss: 0.9864 - accuracy: 0.5976
493/493 [==============================] - 1s 2ms/step - loss: 0.9853 - accuracy: 0.5922
Training set accuracy: 0.5976196527481079
Validation set accuracy: 0.5921880006790161

Confusion matrix

Another very interesting statistic, especially for multiclass problems, is the confusion matrix. The confusion matrix visualizes the proportion of correctly and incorrectly labelled examples, which makes it easy to see how biased the classifier is and whether the distribution of predicted labels makes sense. Ideally, the largest fraction of predictions should lie along the diagonal.

predictions = model.predict(dict(validation_df))
predictions = tf.argmax(predictions, axis=-1)
predictions
<tf.Tensor: shape=(15745,), dtype=int64, numpy=array([1, 1, 2, ..., 2, 2, 2])>
cm = tf.math.confusion_matrix(validation_df['Sentiment'], predictions)
# Normalize each row so it shows the fraction of examples per true class.
cm = cm/cm.numpy().sum(axis=1)[:, tf.newaxis]
sns.heatmap(
    cm, annot=True,
    xticklabels=SENTIMENT_LABELS,
    yticklabels=SENTIMENT_LABELS)
plt.xlabel("Predicted")
plt.ylabel("True")
Text(32.99999999999999, 0.5, 'True')

[Figure: confusion matrix heatmap of true vs. predicted sentiment]

We can easily submit the predictions back to Kaggle by pasting the following code into a code cell and executing it:

test_predictions = model.predict(dict(test_df))
test_predictions = np.argmax(test_predictions, axis=-1)

result_df = test_df.copy()

result_df["Predictions"] = test_predictions

result_df.to_csv(
    "predictions.csv",
    columns=["Predictions"],
    header=["Sentiment"])
kaggle.api.competition_submit("predictions.csv", "Submitted from Colab",
                              "sentiment-analysis-on-movie-reviews")

After submitting, check the leaderboard to see how you did.
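
If you want to verify the submission without leaving the notebook, the Kaggle API can also list your recent submissions. This is a sketch, assuming the installed kaggle package exposes competition_submissions (as current versions do):

# Print recent submissions for this competition, most recent first.
for submission in kaggle.api.competition_submissions(
    "sentiment-analysis-on-movie-reviews"):
  print(submission)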