Pandas DataFrame to Fairness Indicators Case Study

In this activity, you'll learn how to use Fairness Indicators with a Pandas DataFrame.

Case Study Overview

In this case study we will apply TensorFlow Model Analysis and Fairness Indicators to evaluate data stored as a Pandas DataFrame, where each row contains ground truth labels, various features, and a model prediction. We will show how this workflow can be used to spot potential fairness concerns, regardless of the framework used to build and train the model. As this case study shows, results from any machine learning framework (e.g., TensorFlow, JAX, etc.) can be analyzed once they are converted to a Pandas DataFrame.
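
As a minimal sketch of that conversion, assume the model outputs are available as plain Python sequences. The helper name `predictions_to_dataframe`, the feature names, and the values below are all made up for illustration; the `label` and `prediction` column names follow the Fairness Indicators defaults described later in this case study.

```python
import pandas as pd

def predictions_to_dataframe(features, labels, scores):
    """Combine features, ground-truth labels, and model scores row by row.

    `features` maps a column name to a sequence of values; `labels` and
    `scores` are sequences of the same length. All names are illustrative.
    """
    n = len(labels)
    if len(scores) != n or any(len(v) != n for v in features.values()):
        raise ValueError("features, labels, and scores must have equal length")
    df = pd.DataFrame(dict(features))
    df["label"] = list(labels)
    df["prediction"] = [float(s) for s in scores]
    return df

# Hypothetical outputs from a model trained in any framework.
example_df = predictions_to_dataframe(
    features={"group": ["a", "b", "a"], "score_in": [150.0, 162.0, 141.0]},
    labels=[1, 1, 0],
    scores=[0.91, 0.97, 0.42],
)
```

Keeping one row per example, with the label, the prediction, and the slicing features side by side, is all the rest of the workflow relies on.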

For this exercise, we will leverage the Deep Neural Network (DNN) model that was developed in the Shape Constraints for Ethics with TensorFlow Lattice case study using the Law School Admissions dataset from the Law School Admissions Council (LSAC). This classifier attempts to predict whether or not a student will pass the bar, based on their Law School Admission Test (LSAT) score and undergraduate GPA.

LSAC Dataset

The dataset used within this case study was originally collected for a study called 'LSAC National Longitudinal Bar Passage Study. LSAC Research Report Series' by Linda Wightman in 1998. The dataset is currently hosted here.

  • dnn_lsat_prediction: The LSAT prediction from the DNN model.
  • gender: Gender of the student.
  • lsat: LSAT score received by the student.
  • pass_bar: Ground truth label indicating whether or not the student eventually passed the bar.
  • race1: Race of the student.
  • ugpa: A student's undergraduate GPA.
!pip install -q -U \
  tensorflow-model-analysis==0.22.2 \
  tensorflow-data-validation==0.22.1 \
  tfx-bsl==0.22.1

Importing required packages:

import os
import tempfile
import pandas as pd
import six.moves.urllib as urllib

import tensorflow_model_analysis as tfma
from google.protobuf import text_format

Download the data and explore the initial dataset.

# Download the LSAT dataset and set up the required filepaths.
_DATA_ROOT = tempfile.mkdtemp(prefix='lsat-data')
# _DATA_PATH =  # URL of the hosted dataset CSV.
_DATA_FILEPATH = os.path.join(_DATA_ROOT, 'lsat_prediction.csv')

data = urllib.request.urlopen(_DATA_PATH)

_LSAT_DF = pd.read_csv(data)

# To simplify the case study, we will only use the columns that will be used for
# our model (the columns described in the list above).
_COLUMN_NAMES = [
  'dnn_lsat_prediction',
  'gender',
  'lsat',
  'pass_bar',
  'race1',
  'ugpa',
]

_LSAT_DF['gender'] = _LSAT_DF['gender'].astype(str)
_LSAT_DF['race1'] = _LSAT_DF['race1'].astype(str)
_LSAT_DF = _LSAT_DF[_COLUMN_NAMES]


Configure Fairness Indicators.

There are several parameters that you’ll need to take into account when using Fairness Indicators with a DataFrame:

  • Your input DataFrame must contain a prediction column and label column from your model. By default Fairness Indicators will look for a prediction column called prediction and a label column called label within your DataFrame.

    • If either of these columns is not found, a KeyError will be raised.
  • In addition to a DataFrame, you’ll also need to include an eval_config that should include the metrics to compute, slices to compute the metrics on, and the column names for example labels and predictions.

    • metrics_specs will set the metrics to compute. The FairnessIndicators metric will be required to render the fairness metrics and you can see a list of additional optional metrics here.

    • slicing_specs is an optional slicing parameter that specifies which feature you’re interested in investigating. This case study uses race1, but you can set it to another feature (for example, gender in the context of this DataFrame). If slicing_specs is not provided, all features will be included.

    • If your DataFrame includes a label or prediction column that is different from the default prediction or label, you can configure the label_key and prediction_key to a new value.

  • If output_path is not specified a temporary directory will be created.
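
Because a missing label or prediction column only surfaces as a KeyError once the analysis runs, it can help to check the DataFrame up front. The following is a rough sketch and not part of the Fairness Indicators API: `check_eval_columns` is a hypothetical helper, its defaults simply mirror the default column names mentioned above, and the toy values are invented.

```python
import pandas as pd

def check_eval_columns(df, label_key="label", prediction_key="prediction",
                       slice_keys=()):
    """Raise KeyError listing any required columns missing from `df`."""
    required = [label_key, prediction_key, *slice_keys]
    missing = [name for name in required if name not in df.columns]
    if missing:
        raise KeyError(f"DataFrame is missing required columns: {missing}")
    return df

# Toy frame using the column names from this case study.
toy_df = pd.DataFrame({
    "dnn_lsat_prediction": [0.9, 0.4],
    "pass_bar": [1, 0],
    "race1": ["white", "black"],
})
check_eval_columns(toy_df, label_key="pass_bar",
                   prediction_key="dnn_lsat_prediction", slice_keys=["race1"])
```

Calling the helper with its defaults on the same frame would raise, since the frame has no `label` or `prediction` column; that is the situation label_key and prediction_key are meant to handle.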

# Specify Fairness Indicators in eval_config.
eval_config = text_format.Parse("""
  model_specs {
    prediction_key: 'dnn_lsat_prediction',
    label_key: 'pass_bar'
  }
  metrics_specs {
    metrics {class_name: "AUC"}
    metrics {
      class_name: "FairnessIndicators"
      config: '{"thresholds": [0.90]}'
    }
  }
  slicing_specs {
    feature_keys: 'race1'
  }
  """, tfma.EvalConfig())

# Run TensorFlow Model Analysis.
eval_result = tfma.analyze_raw_data(
  data=_LSAT_DF,
  eval_config=eval_config,
  output_path=_DATA_ROOT)

Explore model performance with Fairness Indicators.

After running Fairness Indicators, we can visualize the metrics that we selected to analyze our model's performance. Within this case study we’ve included Fairness Indicators and arbitrarily picked AUC.

When we first look at the overall AUC for each race slice we can see a slight discrepancy in model performance, but nothing that is arguably alarming.

  • Asian: 0.58
  • Black: 0.58
  • Hispanic: 0.58
  • Other: 0.64
  • White: 0.6
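
To make the per-slice AUC figures concrete, here is a small sketch that computes an exact pairwise AUC for each slice from plain row dictionaries (for a DataFrame, rows could come from `df.to_dict('records')`). The function names and toy rows are illustrative assumptions, and TFMA's own AUC implementation may differ in detail, so small numerical differences from the values above would not be surprising.

```python
from collections import defaultdict

def auc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Returns NaN if either class is absent.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_slice(rows, slice_key, label_key, prediction_key):
    """Group rows by `slice_key` and compute AUC within each group."""
    groups = defaultdict(lambda: ([], []))
    for row in rows:
        labels, scores = groups[row[slice_key]]
        labels.append(row[label_key])
        scores.append(row[prediction_key])
    return {name: auc(l, s) for name, (l, s) in groups.items()}

# Invented rows; slice names "a" and "b" are placeholders.
toy_rows = [
    {"race1": "a", "pass_bar": 1, "pred": 0.9},
    {"race1": "a", "pass_bar": 0, "pred": 0.2},
    {"race1": "a", "pass_bar": 1, "pred": 0.1},
    {"race1": "b", "pass_bar": 1, "pred": 0.8},
    {"race1": "b", "pass_bar": 0, "pred": 0.3},
]
toy_auc = auc_by_slice(toy_rows, "race1", "pass_bar", "pred")
```

The pairwise loop is quadratic in slice size, which is fine for a sanity check but not for large evaluations.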

However, when we look at the false negative rates split by race, the model wrongly predicts that a student will not pass the bar at different rates for different groups, and this time the gap is large.

  • Asian: 0.01
  • Black: 0.05
  • Hispanic: 0.02
  • Other: 0.01
  • White: 0.01

Most notably the difference between Black and White students is about 380%, meaning that our model is nearly 4x more likely to incorrectly predict that a Black student will not pass the bar than a White student. If we were to continue with this effort, a practitioner could use these results as a signal that they should spend more time ensuring that their model works well for people from all backgrounds.
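
The comparison above can be sketched directly in pandas. With the rounded rates listed, (0.05 - 0.01) / 0.01 works out to 400%, so the roughly 380% figure presumably reflects unrounded values. The helper names and toy data below are assumptions for illustration, and the sketch treats a score at or above the threshold as a positive prediction, which may not match how TFMA handles the boundary.

```python
import pandas as pd

def false_negative_rate_by_slice(df, slice_key, label_key, prediction_key,
                                 threshold):
    """FNR per slice: share of actual positives scored below `threshold`."""
    positives = df[df[label_key] == 1]
    missed = positives[prediction_key] < threshold
    return missed.groupby(positives[slice_key]).mean()

def relative_difference(rate, baseline):
    """How much larger `rate` is than `baseline`, as a fraction of baseline."""
    return (rate - baseline) / baseline

# Invented data; slice names "a" and "b" are placeholders.
toy = pd.DataFrame({
    "race1": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "pass_bar": [1, 1, 1, 1, 1, 1, 0, 1],
    "pred": [0.95, 0.92, 0.99, 0.5, 0.95, 0.2, 0.1, 0.3],
})
toy_fnr = false_negative_rate_by_slice(toy, "race1", "pass_bar", "pred", 0.9)
toy_gap = relative_difference(toy_fnr["b"], toy_fnr["a"])
```

Reporting the gap relative to a baseline slice makes small absolute rates comparable, but it is sensitive to rounding when the baseline is close to zero.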

# Render Fairness Indicators.
tfma.addons.fairness.view.widget_view.render_fairness_indicator(eval_result)


Within this case study we imported a dataset into a Pandas DataFrame that we then analyzed with Fairness Indicators. Understanding the results of your model and underlying data is an important step in ensuring your model doesn't reflect harmful bias. In the context of this case study we examined the LSAC dataset and how predictions from this data could be impacted by a student's race. Definitions of “what is unfair and what is fair have been introduced in multiple disciplines for well over 50 years, including in education, hiring, and machine learning.” [1] Fairness Indicators is a tool to help mitigate fairness concerns in your machine learning model.

For more information on using Fairness Indicators and resources to learn more about fairness concerns see here.

  1. Hutchinson, B., Mitchell, M. (2018). 50 Years of Test (Un)fairness: Lessons for Machine Learning.


Below are a few functions to help convert the output of an ML model to a Pandas DataFrame.

# TensorFlow Estimator to Pandas DataFrame:

import numpy as np
import pandas as pd
import tensorflow as tf

# _X_VALUE =  # X value of binary estimator.
# _Y_VALUE =  # Y value of binary estimator.
# _GROUND_TRUTH_LABEL =  # Ground truth value of binary estimator.

def _get_predicted_probabilities(estimator, input_df, get_input_fn):
  predictions = estimator.predict(
      input_fn=get_input_fn(input_df=input_df, num_epochs=1))
  return [prediction['probabilities'][1] for prediction in predictions]

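def _get_input_fn_law(input_df, num_epochs, batch_size=None):
  return tf.compat.v1.estimator.inputs.pandas_input_fn(
      x=input_df[[_X_VALUE, _Y_VALUE]],
      y=input_df[_GROUND_TRUTH_LABEL],
      batch_size=batch_size or len(input_df),
      shuffle=False,  # Keep row order so predictions line up with input rows.
      num_epochs=num_epochs,
  )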

def estimator_to_dataframe(estimator, input_df, num_keypoints=20):
  x = np.linspace(min(input_df[_X_VALUE]), max(input_df[_X_VALUE]), num_keypoints)
  y = np.linspace(min(input_df[_Y_VALUE]), max(input_df[_Y_VALUE]), num_keypoints)

  x_grid, y_grid = np.meshgrid(x, y)

  positions = np.vstack([x_grid.ravel(), y_grid.ravel()])
  plot_df = pd.DataFrame(positions.T, columns=[_X_VALUE, _Y_VALUE])
  plot_df[_GROUND_TRUTH_LABEL] = np.ones(len(plot_df))
  predictions = _get_predicted_probabilities(
      estimator=estimator, input_df=plot_df, get_input_fn=_get_input_fn_law)
  return pd.DataFrame(
      data=np.array(np.reshape(predictions, x_grid.shape)).flatten())