
Overview

In our paper An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models, to be presented at ACL 2022, we investigate several techniques for mitigating social bias in pre-trained language models. To track progress on the intrinsic bias benchmarks we study in that work, we introduce the Bias Bench leaderboard.

Bias Bench tracks the effectiveness of bias mitigation techniques across three intrinsic bias benchmarks: StereoSet, CrowS-Pairs, and the Sentence Encoder Association Test (SEAT). Importantly, results on Bias Bench are ranked by StereoSet ICAT score, a metric that accounts for both a model's stereotype score and its language modeling score on StereoSet. We have included three of the strongest debiasing techniques we analyzed in our work as baselines: Self-Debias, CDA, and Dropout.
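
For reference, the ICAT score used for ranking combines a model's StereoSet language modeling score (LMS) and stereotype score (SS) as defined in the StereoSet paper (Nadeem et al., 2021). Below is a minimal Python sketch of the computation; the example values are taken from the BERT baseline row in the leaderboard below.

```python
def icat(ss: float, lms: float) -> float:
    """Idealized CAT (ICAT) score from the StereoSet paper (Nadeem et al., 2021).

    ss:  stereotype score in [0, 100]; 50 means no preference for
         stereotypical over anti-stereotypical associations.
    lms: language modeling score in [0, 100]; 100 means meaningful
         continuations are always preferred over meaningless ones.
    """
    # ICAT equals lms when ss == 50 and drops to 0 as ss approaches 0 or 100.
    return lms * min(ss, 100 - ss) / 50


# BERT baseline (overall): SS = 57.82, LMS = 84.36 -> ICAT ~ 71.2.
# The small difference from the reported 71.15 comes from rounding of the
# displayed SS and LMS values.
print(icat(57.82, 84.36))
```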

We hope this leaderboard helps to better track progress in social bias mitigation research in NLP!

Leaderboard

| Rank | Date | Model (Team) | StereoSet SS (Gender) | StereoSet SS (Race) | StereoSet SS (Religion) | StereoSet SS (Overall) | CrowS-Pairs SS (Gender) | CrowS-Pairs SS (Race) | CrowS-Pairs SS (Religion) | CrowS-Pairs SS (Overall) | SEAT (Overall) | StereoSet LMS (Overall) | StereoSet ICAT (Overall) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | April 4, 2022 | BERT + Self-Debias (Ensemble) (McGill / Mila) | 59.34 | 54.30 | 57.26 | 55.47 | 52.29 | 56.70 | 56.19 | 55.26 | - | 84.19 | 74.98 |
| 2 | April 4, 2022 | GPT-2 + Self-Debias (Ensemble) (McGill / Mila) | 60.84 | 57.33 | 60.45 | 58.20 | 56.11 | 53.29 | 58.10 | 54.69 | - | 89.63 | 74.92 |
| 3 | April 4, 2022 | GPT-2 + CDA (Ensemble) (McGill / Mila) | 64.02 | 57.31 | 63.55 | 59.05 | 56.87 | 60.66 | 51.43 | 58.43 | 0.259 | 90.24 | 73.98 |
| 4 | April 4, 2022 | GPT-2 + Dropout (McGill / Mila) | 63.35 | 57.50 | 64.17 | 59.04 | 57.63 | 60.47 | 52.38 | 58.66 | 0.266 | 90.18 | 73.86 |
| 5 | April 4, 2022 | GPT-2 (McGill / Mila) | 62.65 | 58.90 | 63.26 | 59.90 | 56.87 | 59.69 | 62.86 | 59.22 | 0.313 | 91.17 | 73.12 |
| 6 | April 4, 2022 | BERT (McGill / Mila) | 60.28 | 57.03 | 59.70 | 57.82 | 57.25 | 62.33 | 62.86 | 60.81 | 0.590 | 84.36 | 71.15 |
| 7 | April 4, 2022 | BERT + CDA (Ensemble) (McGill / Mila) | 59.61 | 57.73 | 58.37 | 57.39 | 56.11 | 56.70 | 60.00 | 56.85 | 0.557 | 83.17 | 70.87 |
| 8 | April 4, 2022 | BERT + Dropout (McGill / Mila) | 60.66 | 57.07 | 59.13 | 57.90 | 55.34 | 59.03 | 55.24 | 57.41 | 0.586 | 82.94 | 69.83 |

Submissions

To make a submission to Bias Bench, please contact nicholas.meade@mila.quebec.

FAQ

What metric is reported for SEAT in Bias Bench? We report the average absolute effect size across the gender, race, and religion SEAT tests we evaluate in our work.
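
For readers who want the exact computation, SEAT applies the WEAT effect size of Caliskan et al. (2017) to sentence embeddings (May et al., 2019). Below is a minimal Python sketch under the assumption that the target and attribute sentences have already been encoded; the function and variable names are our own, not identifiers from our codebase.

```python
import numpy as np


def seat_effect_size(X, Y, A, B):
    """WEAT/SEAT effect size for a single test.

    X, Y: arrays of target sentence embeddings (shape [n, d]).
    A, B: arrays of attribute sentence embeddings (shape [m, d]).
    """
    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    def association(w):
        # Differential association of one target embedding with A versus B.
        return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

    s_X = np.array([association(x) for x in X])
    s_Y = np.array([association(y) for y in Y])
    pooled = np.concatenate([s_X, s_Y])
    return (s_X.mean() - s_Y.mean()) / pooled.std(ddof=1)


# The SEAT column in the leaderboard is then the mean of the absolute
# effect sizes over the gender, race, and religion SEAT tests.
```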

What are the ensemble models reported in the leaderboard? The baseline ensemble models (e.g., BERT + CDA) aggregate results across three debiased models, one for each bias domain. For these models, the StereoSet language modeling and ICAT scores are computed by aggregating the results for each bias domain (e.g., gender bias) from the corresponding debiased model (e.g., the gender-debiased model).

How can I learn more about the intrinsic bias benchmarks used in Bias Bench? To learn more about StereoSet, CrowS-Pairs, and the Sentence Encoder Association Test (SEAT), we refer readers to their original respective works. Our work also summarizes each of these benchmarks.

Why are there no SEAT results reported for the Self-Debias models? There are no SEAT results for the Self-Debias models as Self-Debias is a post-hoc debiasing procedure which does not alter a model’s internal representations. For more information on this, refer to Section 3 of our work.

What do SS and LMS denote? SS and LMS denote stereotype score and language modeling score, respectively.

Ethical Considerations

In our work, we used a binary definition of gender while investigating gender bias in pre-trained language models. While we fully recognize gender as non-binary, our survey closely follows the original methodology of the techniques explored in this work. We believe it will be critical for future research in gender bias to use a more fluid definition of gender and we are encouraged by early work in this direction (Manzini et al., 2019; Dinan et al., 2020b). Similarly, our work makes use of a narrow definition of religious and racial bias.

We also note that we do not investigate the extrinsic harm caused by any of the studied pre-trained language models, nor any potential reduction in harm from using any of the studied debiasing techniques. In other words, we do not investigate how biases in pre-trained language models affect humans in real-world settings.

Finally, we highlight that all of the intrinsic bias benchmarks used in this work have only positive predictive power. In other words, they can identify models as biased, but cannot verify that a model is unbiased. For example, a stereotype score of 50% on StereoSet or CrowS-Pairs is not indicative of an unbiased model. Additionally, recent work has demonstrated the potential unreliability of the bias benchmarks used in this work (Blodgett et al., 2021). Because of this, we caution readers against making definitive claims about bias in pre-trained language models based on these benchmarks alone.

Citation

If you use our code, please cite the following paper:

@inproceedings{meade_empirical_2022,
  address = {Online},
  title = {An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models},
  booktitle = {Proceedings of the 60th {Annual} {Meeting} of the {Association} for {Computational} {Linguistics}},
  publisher = {Association for Computational Linguistics},
  author = {Meade, Nicholas and Poole-Dayan, Elinor and Reddy, Siva},
  month = may,
  year = {2022},
}