FaceSSD Fairness Indicators Example Colab

View on TensorFlow.org Run in Google Colab View on GitHub Download notebook


In this activity, you'll use Fairness Indicators to explore the FaceSSD predictions on Labeled Faces in the Wild dataset. Fairness Indicators is a suite of tools built on top of TensorFlow Model Analysis that enable regular evaluation of fairness metrics in product pipelines.

About the Dataset

In this exercise, you'll work with the FaceSSD prediction dataset, approximately 200k different image predictions and groundtruths generated by FaceSSD API.

About the Tools

TensorFlow Model Analysis is a library for evaluating both TensorFlow and non-TensorFlow machine learning models. It allows users to evaluate their models on large amounts of data in a distributed manner, computing in-graph and other metrics over different slices of data and visualize in notebooks.

TensorFlow Data Validation is one tool you can use to analyze your data. You can use it to find potential problems in your data, such as missing values and data imbalances, that can lead to Fairness disparities.

With Fairness Indicators, users will be able to:

  • Evaluate model performance, sliced across defined groups of users
  • Feel confident about results with confidence intervals and evaluations at multiple thresholds


Run the following code to install the fairness_indicators library. This package contains the tools we'll be using in this exercise. Restart Runtime may be requested but is not necessary.

pip install apache_beam
pip install fairness-indicators
pip install witwidget
import os
import tempfile
import apache_beam as beam
import numpy as np
import pandas as pd
from datetime import datetime

import tensorflow_hub as hub
import tensorflow as tf
import tensorflow_model_analysis as tfma
import tensorflow_data_validation as tfdv
from tensorflow_model_analysis.addons.fairness.post_export_metrics import fairness_indicators
from tensorflow_model_analysis.addons.fairness.view import widget_view
from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_predict as agnostic_predict
from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_evaluate_graph
from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_extractor

from witwidget.notebook.visualization import WitConfigBuilder
from witwidget.notebook.visualization import WitWidget

Download and Understand the Data

Labeled Faces in the Wild is a public benchmark dataset for face verification, also known as pair matching. LFW contains more than 13,000 images of faces collected from the web.

We ran FaceSSD predictions on this dataset to predict whether a face is present in a given image. In this Colab, we will slice data according to gender to observe if there are any significant differences between model performance for different gender groups.

If there is more than one face in an image, gender is labeled as "MISSING".

We've hosted the dataset on Google Cloud Platform for convenience. Run the following code to download the data from GCP, the data will take about a minute to download and analyze.

data_location = tf.keras.utils.get_file('lfw_dataset.tf', 'https://storage.googleapis.com/facessd_dataset/lfw_dataset.tfrecord')

stats = tfdv.generate_statistics_from_tfrecord(data_location=data_location)

Defining Constants

BASE_DIR = tempfile.gettempdir()

tfma_eval_result_path = os.path.join(BASE_DIR, 'tfma_eval_result')

compute_confidence_intervals = True

slice_key = 'object/groundtruth/Gender'
label_key = 'object/groundtruth/face'
prediction_key = 'object/prediction/face'

feature_map = {
        tf.io.FixedLenFeature([], tf.string, default_value=['none']),
        tf.io.FixedLenFeature([], tf.float32, default_value=[0.0]),
        tf.io.FixedLenFeature([], tf.float32, default_value=[0.0]),

Model Agnostic Config for TFMA

model_agnostic_config = agnostic_predict.ModelAgnosticConfig(

model_agnostic_extractors = [
        model_agnostic_config=model_agnostic_config, desired_batch_size=3),

Fairness Callbacks and Computing Fairness Metrics

# Helper class for counting examples in beam PCollection
class CountExamples(beam.CombineFn):
    def __init__(self, message):
      self.message = message

    def create_accumulator(self):
      return 0

    def add_input(self, current_sum, element):
      return current_sum + 1

    def merge_accumulators(self, accumulators): 
      return sum(accumulators)

    def extract_output(self, final_sum):
      if final_sum:
        print("%s: %d"%(self.message, final_sum))
metrics_callbacks = [
      thresholds=[0.1, 0.3, 0.5, 0.7, 0.9],

eval_shared_model = tfma.types.EvalSharedModel(

with beam.Pipeline() as pipeline:
  # Read data.
  data = (
      | 'ReadData' >> beam.io.ReadFromTFRecord(data_location))

  # Count all examples.
  data_count = (
      data | 'Count number of examples' >> beam.CombineGlobally(
          CountExamples('Before filtering "Gender:MISSING"')))

  # If there are more than one face in image, the gender feature is 'MISSING'
  # and we are filtering that image out.
  def filter_missing_gender(element):
    example = tf.train.Example.FromString(element)
    if example.features.feature[slice_key].bytes_list.value[0] != b'MISSING':
      yield element

  filtered_data = (
      | 'Filter Missing Gender' >> beam.ParDo(filter_missing_gender))

  # Count after filtering "Gender:MISSING".
  filtered_data_count = (
      filtered_data | 'Count number of examples after filtering'
      >> beam.CombineGlobally(
          CountExamples('After filtering "Gender:MISSING"')))

  # Because LFW data set has always faces by default, we are adding
  # labels as 1.0 for all images.
  def add_face_groundtruth(element):
    example = tf.train.Example.FromString(element)
    example.features.feature[label_key].float_list.value[:] = [1.0]
    yield example.SerializeToString()

  final_data = (
      | 'Add Face Groundtruth' >> beam.ParDo(add_face_groundtruth))

  # Run TFMA.
  _ = (
      | 'ExtractEvaluateAndWriteResults' >>

eval_result = tfma.load_eval_result(output_path=tfma_eval_result_path)

Render Fairness Indicators

Render the Fairness Indicators widget with the exported evaluation results.

Below you will see bar charts displaying performance of each slice of the data on selected metrics. You can adjust the baseline comparison slice as well as the displayed threshold(s) using the drop down menus at the top of the visualization.

A relevant metric for this use case is true positive rate, also known as recall. Use the selector on the left hand side to choose the graph for true_positive_rate. These metric values match the values displayed on the model card.

For some photos, gender is labeled as young instead of male or female, if the person in the photo is too young to be accurately annotated.