What is Model Remediation?

Once you’ve performed sliced evaluation of a machine learning model’s performance, you might notice that your model underperforms on certain slices of data. This type of unequal performance can sometimes lead to unfair and potentially harmful outcomes for vulnerable subsets of the population. Generally, there are three primary types of technical interventions for addressing bias concerns:

  • Changing the input data: Collecting more data, generating synthetic data, adjusting the weights and sampling rates of different slices, etc. [1]
  • Intervening on the model: Changing the model itself by introducing or altering model objectives, adding constraints, etc. [2]
  • Post-processing the results: Modifying the outputs of the model or the interpretation of the outputs to improve performance across metrics. [3]
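
As one illustration of the first intervention (adjusting the weights of different slices), the sketch below computes per-example weights so that every slice contributes the same total weight to training. The function name `slice_balanced_weights` and the toy slice labels are hypothetical and are not part of any library; this is only one of many possible reweighting schemes.

```python
import numpy as np

def slice_balanced_weights(slice_ids):
    """Return per-example weights so each slice has the same total weight."""
    slice_ids = np.asarray(slice_ids)
    values, counts = np.unique(slice_ids, return_counts=True)
    # Each slice receives a total weight of n / k, split evenly among its examples,
    # so the weights still sum to n overall.
    per_slice = len(slice_ids) / (len(values) * counts)
    lookup = dict(zip(values.tolist(), per_slice.tolist()))
    return np.array([lookup[s] for s in slice_ids.tolist()])

# Toy data (hypothetical): slice "a" has three examples, slice "b" has one.
slice_ids = ["a", "a", "a", "b"]
weights = slice_balanced_weights(slice_ids)
print(weights)
```

Weights like these could then be passed as sample weights to a training routine that accepts them, so the under-represented slice is not drowned out by the larger one.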

from tensorflow_model_remediation import min_diff
import tensorflow as tf

# Start by defining a Keras model.
original_model = ...

# Set the MinDiff weight and choose a loss.
min_diff_loss = min_diff.losses.MMDLoss()
min_diff_weight = 1.0  # Hyperparameter to be tuned.

# Create a MinDiff model.
min_diff_model = min_diff.keras.MinDiffModel(
    original_model, min_diff_loss, min_diff_weight)

# Compile the MinDiff model normally.
min_diff_model.compile(...)

# Train the MinDiff model on a MinDiff dataset
# (min_diff_dataset is assumed to have been created beforehand).
min_diff_model.fit(min_diff_dataset, ...)

What is MinDiff?

MinDiff is a model remediation technique that seeks to equalize two distributions. In practice, it can be used to balance error rates across different slices of your data by penalizing distributional differences.

Typically, one applies MinDiff when trying to minimize the difference in either false positive rate (FPR) or false negative rate (FNR) between a slice of data belonging to a sensitive class and a better-performing slice. For in-depth discussion of fairness metrics, review the literature on this subject. [4] [5] [6]
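
To make the FPR gap concrete, here is a minimal NumPy sketch that computes the false positive rate on two slices and takes their difference. The labels, predictions, and slice membership are made-up toy values, and the helper `false_positive_rate` is a hypothetical name rather than a library function.

```python
import numpy as np

def false_positive_rate(y_true, y_pred):
    """FPR = false positives / all actual negatives."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    negatives = ~y_true
    if negatives.sum() == 0:
        return float("nan")  # FPR is undefined without negative examples.
    return float((y_pred & negatives).sum() / negatives.sum())

# Toy data (hypothetical): the first five examples belong to the sensitive slice.
y_true = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 0, 0, 1])
in_sensitive = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0]).astype(bool)

fpr_sensitive = false_positive_rate(y_true[in_sensitive], y_pred[in_sensitive])
fpr_other = false_positive_rate(y_true[~in_sensitive], y_pred[~in_sensitive])
fpr_gap = fpr_sensitive - fpr_other
print(fpr_sensitive, fpr_other, fpr_gap)
```

A positive gap here means the model raises false alarms more often on the sensitive slice, which is the kind of disparity MinDiff is typically used to reduce.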

How does MinDiff work?

Given two sets of examples from our dataset, MinDiff penalizes the model during training for differences in the distribution of scores between the two sets. The less distinguishable the two sets are based on prediction scores, the smaller the penalty that will be applied.
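
One way to measure how distinguishable two sets of scores are is the maximum mean discrepancy (MMD), the quantity that the `MMDLoss` name refers to. The sketch below is a simple NumPy estimate with a Gaussian kernel; the bandwidth value, function names, and random toy scores are illustrative assumptions, and this is not the library's implementation.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=0.1):
    """Pairwise Gaussian kernel between two 1-D arrays of scores."""
    diff = x[:, None] - y[None, :]
    return np.exp(-(diff ** 2) / (2.0 * bandwidth ** 2))

def mmd_squared(scores_a, scores_b, bandwidth=0.1):
    """Biased estimate of squared MMD between two sets of prediction scores."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    return float(
        gaussian_kernel(a, a, bandwidth).mean()
        + gaussian_kernel(b, b, bandwidth).mean()
        - 2.0 * gaussian_kernel(a, b, bandwidth).mean()
    )

# Toy scores (hypothetical): one set similar to group_a, one clearly shifted.
rng = np.random.default_rng(0)
group_a = rng.uniform(0.0, 0.5, size=200)
similar_group = rng.uniform(0.0, 0.5, size=200)
shifted_group = rng.uniform(0.5, 1.0, size=200)

penalty_similar = mmd_squared(group_a, similar_group)
penalty_shifted = mmd_squared(group_a, shifted_group)
print(penalty_similar, penalty_shifted)
```

The penalty for the shifted set is much larger than for the similar set, and it is zero when both sets are identical, which matches the behavior described above.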

The penalty is applied by adding a component to the loss that the model is trained on. It can be thought of as a measurement of the difference in distribution of model predictions. As the model trains, it will try to minimize the penalty by bringing the distributions closer together, as in the above graph.
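
The additive structure of the loss can be sketched as follows. For brevity, the penalty here is a deliberately simplified stand-in (the squared difference in mean scores between the two sets) rather than a full distributional measure such as MMD; all function names, the toy scores, and the weight value are hypothetical.

```python
import numpy as np

def binary_cross_entropy(y_true, scores, eps=1e-7):
    """Standard binary cross-entropy, standing in for the original task loss."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.clip(np.asarray(scores, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(scores)
                          + (1.0 - y_true) * np.log(1.0 - scores)))

def mean_gap_penalty(scores_a, scores_b):
    """Simplified stand-in penalty: squared gap between mean scores."""
    return float((np.mean(scores_a) - np.mean(scores_b)) ** 2)

def total_loss(y_true, scores, scores_a, scores_b, penalty_weight):
    """Task loss plus the weighted penalty component."""
    return (binary_cross_entropy(y_true, scores)
            + penalty_weight * mean_gap_penalty(scores_a, scores_b))

# Toy values (hypothetical).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.2, 0.6, 0.7, 0.9])
scores_a = np.array([0.6, 0.7])  # scores for the first set of examples
scores_b = np.array([0.2, 0.3])  # scores for the second set of examples

task_only = total_loss(y_true, scores, scores_a, scores_b, penalty_weight=0.0)
with_penalty = total_loss(y_true, scores, scores_a, scores_b, penalty_weight=1.5)
print(task_only, with_penalty)
```

With a weight of zero the model optimizes only the original task; increasing the weight makes closing the gap between the two score distributions matter more relative to task performance.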

Applying MinDiff may come with tradeoffs with respect to performance on the original task. In practice, we have often found MinDiff to be effective while not deteriorating performance beyond product needs, but this is application-dependent, and the decision should be made deliberately by the product owner. For examples showing how to implement MinDiff, see our notebook tutorial.

[1] Zhang, G., Bai, B., Zhang, J., Bai, K., Zhu, C., Zhao, T. (2020). Demographics Should Not Be the Reason of Toxicity: Mitigating Discrimination in Text Classifications with Instance Weighting.
[2] Prost, F., Qian, H., Chen, Q., Chi, E., Chen, J., Beutel, A. (2019). Toward a better trade-off between performance and fairness with kernel-based distribution matching.
[3] Alabdulmohsin, I. (2020). Fair Classification via Unconstrained Optimization.
[4] Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R. (2011). Fairness Through Awareness.
[5] Hardt, M., Price, E., Srebro, N. (2016). Equality of Opportunity in Supervised Learning.
[6] Chouldechova, A. (2016). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments.

Resources