Detecting and Evaluating Bias in Large Language Models: Concepts, Methods, and Challenges
DOI: https://doi.org/10.35566/jbds/gao
Keywords: Large Language Models, Bias Evaluation, Fairness in NLP, Certification-based Methods, Reproducibility
Abstract
Large Language Models (LLMs) are increasingly deployed in sensitive real-world contexts, yet concerns remain about their biases and the harms they can cause. Existing surveys mostly discuss sources of bias and mitigation techniques, but give less systematic attention to how bias in LLMs should be detected, measured, and reported. This survey addresses that gap. We present a structured review of methods for detecting and evaluating bias in LLMs. We first introduce the conceptual foundations, including representational versus allocational harms and taxonomies of bias. We then discuss how to design evaluations in practice: specifying measurement targets, choosing datasets and metrics, and reasoning about validity and reliability. Building on this, we review intrinsic methods that probe representations and likelihoods, and extrinsic methods that assess bias in classification, question answering, open-ended generation, and dialogue. We further highlight recent advances in counterfactual and certification-based evaluation, which aim to provide stronger guarantees on fairness metrics. Beyond English-centric settings, we survey cross-lingual and application-specific evaluations, intersectional bias analysis, and meta-level issues such as evaluator reliability, metric robustness, reproducibility, and governance. The review concludes by synthesizing best practices and offering a practitioner-oriented checklist, providing both a conceptual map and a practical toolkit for evaluating bias in LLMs.
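To make the counterfactual and certification-based evaluations mentioned above concrete, the short Python sketch below pairs prompts that differ only in a demographic term, scores each response with a bounded metric, and attaches a Hoeffding-style confidence bound to the estimated gap. It is an illustrative sketch rather than an implementation from the surveyed work: the prompt template, group pairs, and the score_output placeholder are assumptions, and in practice score_output would wrap a real toxicity or sentiment classifier applied to the target model's generations.

import math
import random

# Hypothetical counterfactual pairs: prompts identical except for one group term.
TEMPLATE = "The {group} applicant explained the gap in their resume."
GROUP_PAIRS = [("male", "female"), ("young", "elderly"), ("native-born", "immigrant")]

def score_output(prompt: str) -> float:
    """Placeholder scorer returning a value in [0, 1].

    Stands in for a real metric (e.g., toxicity or negative-sentiment
    probability) applied to the target model's completion of `prompt`.
    """
    return random.Random(prompt).random()

def counterfactual_gap(pairs, n_samples: int = 500, delta: float = 0.05):
    """Mean absolute score gap over sampled counterfactual pairs, plus a
    two-sided Hoeffding bound that holds with probability at least 1 - delta."""
    rng = random.Random(0)
    gaps = []
    for _ in range(n_samples):
        group_a, group_b = rng.choice(pairs)
        score_a = score_output(TEMPLATE.format(group=group_a))
        score_b = score_output(TEMPLATE.format(group=group_b))
        gaps.append(abs(score_a - score_b))  # each gap lies in [0, 1]
    mean_gap = sum(gaps) / len(gaps)
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * len(gaps)))
    return mean_gap, eps

if __name__ == "__main__":
    gap, eps = counterfactual_gap(GROUP_PAIRS)
    print(f"Estimated counterfactual gap: {gap:.3f} +/- {eps:.3f} (95% confidence)")

Because each per-pair gap is bounded in [0, 1], the Hoeffding inequality gives a distribution-free bound on the true mean gap; this is the kind of statistical guarantee that certification-based evaluation seeks to provide.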
References
Abid, A., Farooqi, M., & Zou, J. (2021). Large language models associate Muslims with violence. Nature Machine Intelligence, 3(6), 461–463. doi: https://doi.org/10.1038/s42256-021-00359-2
An, J., Huang, D., Lin, C., & Tai, M. (2025, February). Measuring gender and racial biases in large language models: Intersectional evidence from automated resume evaluation. PNAS Nexus, 4(3). doi: https://doi.org/10.1093/pnasnexus/pgaf089
Anthis, J., Lum, K., Ekstrand, M., Feller, A., D’Amour, A., & Tan, C. (2024). The impossibility of fair LLMs. arXiv. Retrieved from https://arxiv.org/abs/2406.03198 doi: https://doi.org/10.18653/v1/2025.acl-long.5
Barocas, S., & Selbst, A. D. (2016). Big data’s disparate impact. California Law Review, 104(3), 671–732. doi: https://doi.org/10.2139/ssrn.2477899
Bartl, M., Nissim, M., & Gatt, A. (2020). Unmasking contextual stereotypes: Measuring and mitigating bert’s gender bias. In Proceedings of the second workshop on gender bias in natural language processing (pp. 1–16). Barcelona, Spain (Online): Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.gebnlp-1.1/
Bastani, O., Zhang, X., & Solar-Lezama, A. (2019). Probabilistic verification of fairness properties via concentration. Proceedings of the ACM on Programming Languages, 3(OOPSLA), 118:1–118:27. Retrieved from https://dl.acm.org/doi/10.1145/3360544 doi: https://doi.org/10.1145/3360544
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 acm conference on fairness, accountability, and transparency (facct) (pp. 610–623). doi: https://doi.org/10.1145/3442188.3445922
Blodgett, S. L., Barocas, S., Daumé III, H., & Wallach, H. (2020). Language (technology) is power: A critical survey of bias in NLP. In Proceedings of the 58th annual meeting of the association for computational linguistics (acl) (pp. 5454–5476). doi: https://doi.org/10.18653/v1/2020.acl-main.485
Bolukbasi, T., Chang, K., Zou, J., Saligrama, V., & Kalai, A. (2016). Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in neural information processing systems 29 (neurips 2016) (pp. 4349–4357).
Bordia, S., & Bowman, S. R. (2019). Identifying and reducing gender bias in word-level language models. arXiv. Retrieved from https://arxiv.org/abs/1904.03035 doi: https://doi.org/10.18653/v1/N19-3002
Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency (pp. 77–91).
Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186. doi: https://doi.org/10.1126/science.aal4230
Cao, Y., Pruksachatkun, Y., Chang, K., Gupta, R., Kumar, V., Dhamala, J., & Galstyan, A. (2022). On the intrinsic and extrinsic fairness evaluation metrics for contextualized language representations. In Proceedings of acl 2022 (short papers) (pp. 561–570). doi: https://doi.org/10.18653/v1/2022.acl-short.62
Chaudhary, I., Hu, Q., Kumar, M., Ziyadi, M., Gupta, R., & Singh, G. (2025). Certifying counterfactual bias in LLMs. In International conference on learning representations (iclr). (OpenReview)
Crenshaw, K. (1991, July). Mapping the margins: Intersectionality, identity politics, and violence against women of color. Stanford Law Review, 43(6), 1241–1299. doi: https://doi.org/10.2307/1229039
Cui, J., Chiang, W.-L., Stoica, I., & Hsieh, C.-J. (2025). OR-Bench: An over-refusal benchmark for large language models. In Proceedings of the 42nd international conference on machine learning (icml). (arXiv:2405.20947)
De-Arteaga, M., Romanov, A., Wallach, H., Chayes, J., Borgs, C., Kalai, A., & Crawford, K. (2019). Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the acm conference on fairness, accountability, and transparency (fat*). doi: https://doi.org/10.1145/3287560.3287572
Dev, S., & Phillips, J. M. (2019). Attenuating bias in word vectors. In Proceedings of the 22nd international conference on artificial intelligence and statistics (aistats) (pp. 879–887).
Dhamala, J., Sun, T., Kumar, V., Krishna, S., Pruksachatkun, Y., Chang, K., & Gupta, R. (2021). BOLD: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 acm conference on fairness, accountability, and transparency (facct). doi: https://doi.org/10.1145/3442188.3445924
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference (itcs) (pp. 214–226). (Preprint 2011) doi: https://doi.org/10.1145/2090236.2090255
Ethayarajh, K. (2019). How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 55–65). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1006/ doi: https://doi.org/10.18653/v1/D19-1006
European Parliament, & Council of the European Union. (2024). Regulation (EU) 2024/1689 of the european parliament and of the council of 13 june 2024 laying down harmonised rules on artificial intelligence and amending regulations (EC) no 300/2008, (EU) 2017/745, (EU) 2017/746, (EU) 2019/881 and (EU) 2022/2065 and directive 2009/125/ec (Artificial Intelligence Act). Official Journal of the European Union, L 2024/1689. Retrieved from https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng
Ferrara, E. (2023). Fairness and bias in artificial intelligence: A brief survey of sources, impacts, and mitigation strategies. arXiv. Retrieved from https://arxiv.org/abs/2304.07683 doi: https://doi.org/10.3390/sci6010003
Gallegos, I. O., Rossi, R. A., Barrow, J., Tanjim, M. M. M., Kim, S., Dernoncourt, F., … Ahmed, N. K. (2024). Bias and fairness in large language models: A survey. Computational Linguistics, 50(3), 1097–1158. doi: https://doi.org/10.1162/coli_a_00524
Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020). Realtoxicityprompts: Evaluating neural toxic degeneration in language models. In Findings of the association for computational linguistics: Emnlp 2020 (pp. 3356–3369). doi: https://doi.org/10.18653/v1/2020.findings-emnlp.301
Gumilar, K. E., Indraprasta, B. R., Hsu, Y.-C., Yu, Z.-Y., Chen, H., Irawan, B., … Tan, M. (2024, July). Disparities in medical recommendations from AI-based chatbots across different countries/regions. Scientific Reports, 14(1). doi: https://doi.org/10.1038/s41598-024-67689-0
Guo, Y., Guo, M., Su, J., Yang, Z., Zhu, M., Li, H., & Qiu, M. (2024). Bias in large language models: Origin, evaluation, and mitigation. arXiv. Retrieved from https://arxiv.org/abs/2411.10915
Hanu, L., & Unitary team. (2020). Detoxify. Retrieved from https://github.com/unitaryai/detoxify (GitHub repository)
Hofmann, V., Kalluri, P. R., Jurafsky, D., & King, S. (2024). Dialect prejudice predicts AI decisions about people’s character, employability, and criminality. In Proceedings of the 2024 acm conference on fairness, accountability, and transparency (facct) (pp. 1321–1340).
Huang, P., Zhang, H., Jiang, R., Stanforth, R., Welbl, J., Rae, J., … Kohli, P. (2019). Reducing sentiment bias in language models via counterfactual evaluation. arXiv. Retrieved from https://arxiv.org/abs/1911.03064 doi: https://doi.org/10.18653/v1/2020.findings-emnlp.7
Kearns, M., Neel, S., Roth, A., & Wu, Z. S. (2018, Jul). Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In J. Dy & A. Krause (Eds.), Proceedings of the 35th international conference on machine learning (Vol. 80, pp. 2564–2572). PMLR. Retrieved from https://proceedings.mlr.press/v80/kearns18a.html
Kleinberg, J., Mullainathan, S., & Raghavan, M. (2016). Inherent trade-offs in the fair determination of risk scores. Retrieved from https://arxiv.org/abs/1609.05807
Kotek, H., Dockum, R., & Sun, D. (2023). Gender bias and stereotypes in large language models. In Proceedings of the acm collective intelligence conference (ci 2023). doi: https://doi.org/10.1145/3582269.3615599
Krishna, S., Gupta, R., Verma, A., Dhamala, J., Pruksachatkun, Y., & Chang, K.-W. (2022, May). Measuring fairness of text classifiers via prediction sensitivity. In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 5830–5842). Dublin, Ireland: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.acl-long.401/ doi: https://doi.org/10.18653/v1/2022.acl-long.401
Kurita, K., Vyas, N., Pareek, A., Black, A. W., & Tsvetkov, Y. (2019). Measuring bias in contextualized word representations. In Proceedings of the first acl workshop on gender bias for nlp (pp. 166–172). doi: https://doi.org/10.18653/v1/W19-3823
Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (2017). Counterfactual fairness. In Advances in neural information processing systems 30 (neurips 2017) (pp. 4066–4076).
Lee, G., Hartmann, V., Park, J., Papailiopoulos, D., & Lee, K. (2023). Prompted llms as chatbot modules for long open-domain conversation. In Findings of the association for computational linguistics: Acl 2023 (pp. 4536–4554). Association for Computational Linguistics. Retrieved from https://aclanthology.org/2023.findings-acl.277/ doi: https://doi.org/10.18653/v1/2023.findings-acl.277
Li, Y., Du, M., Song, R., Wang, X., & Wang, Y. (2024). A survey on fairness in large language models. arXiv. Retrieved from https://arxiv.org/abs/2308.10149 (Version 2 (revised 2024-02-21)) doi: https://doi.org/10.48550/arXiv.2308.10149
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., … Koreeda, Y. (2023). Holistic evaluation of language models. Transactions on Machine Learning Research. Retrieved from https://arxiv.org/abs/2211.09110 (TMLR; arXiv:2211.09110) doi: https://doi.org/10.48550/arXiv.2211.09110
Liu, T., Luo, R., Chen, Q., Qin, Z., Sun, R., Yu, Y., & Zhang, C. (2024). Jailbreaking black-box large language models in twenty queries. In 33rd usenix security symposium (usenix security 24). Philadelphia, PA: USENIX Association. Retrieved from https://www.usenix.org/conference/usenixsecurity24/presentation/liu-tong (See also arXiv:2310.08419)
Liu, Y., Yang, T., Huang, S., Zhang, Z., Huang, H., Wei, F., … Zhang, Q. (2023). Calibrating llm-based evaluator. Retrieved from https://arxiv.org/abs/2309.13308
May, C., Wang, A., Bordia, S., Bowman, S. R., & Rudinger, R. (2019). On measuring social biases in sentence encoders. In Proceedings of naacl-hlt 2019 (pp. 622–628). doi: https://doi.org/10.18653/v1/N19-1063
Meade, N., Poole-Dayan, E., & Reddy, S. (2022). An empirical survey of the effectiveness of debiasing techniques for pre-trained language models. In Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 1878–1898). Dublin, Ireland: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.acl-long.132/ (Published at ACL 2022; widely cited in 2023 literature) doi: https://doi.org/10.18653/v1/2022.acl-long.132
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM Computing Surveys, 54(6), 1–35. doi: https://doi.org/10.1145/3457607
Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., … Gebru, T. (2019). Model cards for model reporting. In Proceedings of the 2019 conference on fairness, accountability, and transparency (fat*) (pp. 220–229). doi: https://doi.org/10.1145/3287560.3287596
Nadeem, M., Bethke, A., & Reddy, S. (2021). Stereoset: Measuring stereotypical bias in pretrained language models. In Proceedings of acl 2021 (long papers) (pp. 5356–5371). doi: https://doi.org/10.18653/v1/2021.acl-long.416
Nangia, N., Vania, C., Bhalerao, R., & Bowman, S. R. (2020). Crows-pairs: A challenge dataset for measuring social biases in masked language models. In Findings of the association for computational linguistics: Emnlp 2020 (pp. 227–239). doi: https://doi.org/10.18653/v1/2020.emnlp-main.154
Panickssery, A., Bowman, S. R., & Feng, S. (2024). Llm evaluators recognize and favor their own generations. Retrieved from https://arxiv.org/abs/2404.13076 doi: https://doi.org/10.52202/079017-2197
Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., … Bowman, S. (2022). BBQ: A hand-built bias benchmark for question answering. In Findings of the association for computational linguistics: Acl 2022 (pp. 2086–2105). doi: https://doi.org/10.18653/v1/2022.findings-acl.165
Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., … Irving, G. (2022). Red teaming language models with language models. Retrieved from https://arxiv.org/abs/2202.03286 doi: https://doi.org/10.18653/v1/2022.emnlp-main.225
Raji, I. D., Denton, E., Bender, E. M., Hanna, A., & Paullada, A. (2021). AI and the everything in the whole wide world benchmark: A critical analysis of the biggest benchmarks in AI. arXiv. Retrieved from https://arxiv.org/abs/2111.15366 (Metric validity discussion)
Ravfogel, S., Elazar, Y., Gonen, H., Twiton, M., & Goldberg, Y. (2020). Null it out: Debiasing text representations by iterative nullspace projection. In Proceedings of the 58th annual meeting of the association for computational linguistics (acl) (pp. 7237–7256). doi: https://doi.org/10.18653/v1/2020.acl-main.647
Rozado, D. (2025). Gender and positional biases in llm-based hiring decisions: Evidence from comparative cv/resume evaluations. Retrieved from https://arxiv.org/abs/2505.17049 doi: https://doi.org/10.7717/peerj-cs.3628
Rudinger, R., Naradowsky, J., Leonard, B., & Van Durme, B. (2018). Gender bias in coreference resolution. In Proceedings of the 2018 conference of the north american chapter of the association for computational linguistics: Human language technologies, volume 2 (short papers) (pp. 8–14). New Orleans, Louisiana: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N18-2002/ doi: https://doi.org/10.18653/v1/N18-2002
Rupprecht, J., Ahnert, G., & Strohmaier, M. (2025). Prompt perturbations reveal human-like biases in large language model survey responses. Retrieved from https://arxiv.org/abs/2507.07188
Sheng, E., Chang, K., Natarajan, P., & Peng, N. (2019). The woman worked as a babysitter: On biases in language generation. In Proceedings of emnlp-ijcnlp 2019 (pp. 3407–3412). doi: https://doi.org/10.18653/v1/D19-1339
Sherry, J. H. (1965). The civil rights act of 1964: Fair employment practices under title VII. Cornell Hotel and Restaurant Administration Quarterly, 6(2), 3–6. doi: https://doi.org/10.1177/001088046500600202
Sim, J., & Reid, N. (1999). Statistical inference by confidence intervals: Issues of interpretation and utilization. Physical Therapy, 79(2), 186–195. doi: https://doi.org/10.1093/ptj/79.2.186
Smith, E. M., Hall, M., Kambadur, M., Presani, E., & Williams, A. (2022). I’m sorry to hear that: Finding new biases in language models with a holistic descriptor dataset. arXiv. Retrieved from https://arxiv.org/abs/2201.11745 doi: https://doi.org/10.18653/v1/2022.emnlp-main.625
Solaiman, I., Brundage, M., Clark, J., Askell, A., Herbert-Voss, A., Wu, J., … Wang, J. (2019). Release strategies and the social impacts of language models. arXiv. Retrieved from https://arxiv.org/abs/1908.09203
Suresh, H., & Guttag, J. V. (2021). A framework for understanding sources of harm throughout the machine learning life cycle. In Proceedings of acm eaamo 2021. (Article 7) doi: https://doi.org/10.1145/3465416.3483305
Tabassi, E. (2023). Artificial intelligence risk management framework (AI RMF 1.0). doi: https://doi.org/10.6028/NIST.AI.100-1
Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., & Shieber, S. (2020). Investigating gender bias in language models using causal mediation analysis. In Advances in neural information processing systems 33 (neurips 2020).
Wang, B., Chen, W., Pei, H., Xie, C., Kang, M., Zhang, C., … Li, B. (2024). Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. arXiv. Retrieved from https://arxiv.org/abs/2306.11698
Zhang, Y., Huang, Y., Sun, Y., Liu, C., Zhao, Z., Fang, Z., … Zhu, J. (2024). Multitrust: A comprehensive benchmark towards trustworthy multimodal large language models. arXiv. Retrieved from https://arxiv.org/abs/2406.07057 doi: https://doi.org/10.52202/079017-1561
Zhao, J., Wang, T., Yatskar, M., Cotterell, R., Ordonez, V., & Chang, K. (2019). Gender bias in contextualized word embeddings. arXiv. Retrieved from https://arxiv.org/abs/1904.03310 doi: https://doi.org/10.18653/v1/N19-1064
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K. (2017a). Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 conference on empirical methods in natural language processing (emnlp).
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K. (2018). Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of naacl-hlt 2018 (pp. 15–20). doi: https://doi.org/10.18653/v1/N18-2003
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K.-W. (2017b). Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 conference on empirical methods in natural language processing (pp. 2979–2989). Association for Computational Linguistics. doi: https://doi.org/10.18653/v1/D17-1323
Zollo, T. P., Morrill, T., Deng, Z., Snell, J. C., Pitassi, T., & Zemel, R. (2024). Prompt risk control: A rigorous framework for responsible deployment of large language models. arXiv. Retrieved from https://arxiv.org/abs/2311.13628