Model Remediation | Responsible AI Toolkit

What is TensorFlow Model Remediation?

If you have identified fairness concerns with your machine learning model, there are three primary types of technical interventions available:

Training data pre-processing techniques: Collecting more data, generating synthetic data, adjusting the weights of examples and sampling rates of different slices.
Training-time modeling techniques: Changing the model itself by introducing or altering model objectives and adding constraints.
Post-training techniques: Modifying the outputs of the model or the interpretation of the outputs to improve performance across metrics.
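
As a small illustration of the first category, the sketch below computes per-example weights so that every slice contributes the same total weight during training. It is only one possible reweighting scheme; the function name `balancing_weights` and the slice labels are hypothetical and are not part of the TensorFlow Model Remediation library.

```python
from collections import Counter


def balancing_weights(slice_labels):
    """Return one weight per example so that each slice has equal total weight.

    Hypothetical helper for illustration only (inverse-frequency weighting).
    """
    counts = Counter(slice_labels)
    n_slices = len(counts)
    total = len(slice_labels)
    # An example in a slice with `count` members gets total / (n_slices * count),
    # so rarer slices receive larger per-example weights.
    return [total / (n_slices * counts[s]) for s in slice_labels]


# Three examples from slice "a" and one from slice "b" (made-up labels).
slices = ["a", "a", "a", "b"]
weights = balancing_weights(slices)
```

Weights like these would typically be passed to the training loop as per-example sample weights; the overall weight still sums to the number of examples, so the scale of the loss is unchanged.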

The TensorFlow Model Remediation library provides training-time techniques to intervene on the model.

Training-time Modeling

The TensorFlow Model Remediation library provides two techniques for addressing bias and fairness issues in your model: MinDiff and Counterfactual Logit Pairing (CLP). The table below describes them.

	MinDiff	CLP
When should you use this technique?	To ensure that a model predicts the preferred label equally well for all values of a sensitive attribute. To achieve group equality of opportunity.	To ensure that a model's prediction does not change between "counterfactual pairs" (where the sensitive attribute referenced in a feature is different). For example, in a toxicity classifier, examples such as "I am a man" and "I am a lesbian" should not have a different prediction. To achieve a form of counterfactual fairness.
How does it work?	Penalizes the model during training for differences in the distribution of scores between two sets of examples (for instance, two slices defined by a sensitive attribute).	Penalizes the model during training for output differences between counterfactual pairs of examples.
Input Modalities	The loss functions operate on model outputs, so they are, in theory, agnostic to the input modality and model architecture.	The loss functions operate on model outputs, so they are, in theory, agnostic to the input modality and model architecture.
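
The two penalties in the table can be sketched in plain Python to make the idea concrete. This is a minimal illustration under stated assumptions, not the library's implementation: it assumes the MinDiff-style penalty is measured with a Gaussian-kernel maximum mean discrepancy (MMD) between two sets of scalar scores, and that the CLP-style penalty is the mean absolute gap between the logits of each example and its counterfactual. The function names, the `bandwidth` and `weight` parameters, and the sample numbers are all hypothetical.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Similarity between two scalar scores; 1.0 when they are equal."""
    return math.exp(-((x - y) ** 2) / (2 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_penalty(scores_a, scores_b, bandwidth=0.1):
    """MinDiff-style penalty (assumed form): squared MMD between two score sets.

    It is zero when both sets contain the same scores and grows as the two
    score distributions move apart.
    """
    return (mean_kernel(scores_a, scores_a, bandwidth)
            + mean_kernel(scores_b, scores_b, bandwidth)
            - 2 * mean_kernel(scores_a, scores_b, bandwidth))


def clp_penalty(logits_original, logits_counterfactual):
    """CLP-style penalty (assumed form): mean absolute logit gap per pair."""
    pairs = list(zip(logits_original, logits_counterfactual))
    return sum(abs(o - c) for o, c in pairs) / len(pairs)


def total_loss(task_loss, penalty, weight=1.0):
    """Training objective: the original task loss plus a weighted penalty."""
    return task_loss + weight * penalty


# Made-up model scores for two groups of examples with the preferred label.
far_apart = mmd_penalty([0.1, 0.2], [0.8, 0.9])
close = mmd_penalty([0.1, 0.2], [0.15, 0.25])

# Made-up logits for two original examples and their counterfactuals.
pair_gap = clp_penalty([1.0, 2.0], [0.5, 3.0])  # (0.5 + 1.0) / 2 = 0.75
```

Both penalties depend only on model outputs, which is why the same approach applies regardless of input modality or architecture. The `weight` argument controls the trade-off between the original task loss and the remediation penalty.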
