Paving the Road to Trustworthy AI: Bias in Large Language Models
Language models have become essential components of many NLP applications, such as machine translation, sentiment analysis, and question answering. However, there is growing concern that these models may perpetuate and amplify biases present in the data on which they were trained. Detecting bias in large language models is crucial because biased language models can amplify existing societal biases, such as racism and sexism, or perpetuate harmful stereotypes. When left uncontrolled, the use of biased language models in decision-making processes, such as loan approval, healthcare, hiring, and criminal justice, can lead to unfair outcomes.
Bias in NLP models can arise from historical patterns, biases, and stereotypes encoded in their training data. For example, one study found that classifiers trained to predict a person's occupation from their biography were more likely to predict “surgeon” for men’s biographies than for women’s. In this work, we focus on masked language models, such as BERT, that are fine-tuned to perform specific tasks, and we want to understand how pretraining and fine-tuning each contribute to the biases these models exhibit.
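To make this concrete, the snippet below is a minimal sketch of how such associations can be surfaced in a pretrained masked language model. It assumes the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint; the prompts are purely illustrative and are not drawn from our experiments.

```python
# Sketch: probe a masked language model for gendered occupation associations.
# Assumes Hugging Face `transformers` is installed; prompts are illustrative.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

templates = [
    "[MASK] worked as a surgeon at the hospital.",
    "[MASK] worked as a nurse at the hospital.",
]

for template in templates:
    print(template)
    # Compare the probabilities the model assigns to gendered pronouns
    # in the masked position.
    for pred in fill_mask(template, targets=["he", "she"]):
        print(f"  {pred['token_str']:>3s}: {pred['score']:.3f}")
```

Asymmetries in these pronoun probabilities across occupations are one simple symptom of the kind of encoded stereotypes discussed above.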
Such an understanding not only gives practitioners actionable guidance on how to responsibly use and fine-tune pretrained language models, but also highlights ‘infeasibility conditions’, i.e., situations in which it is impossible to fully mitigate the fairness-related risks of a model's downstream applications.