Do LLMs Align Human Values Regarding Social Biases? Judging and Explaining Social Biases with LLMs (EMNLP2025: Findings)
- Do LLMs Align Human Values Regarding Social Biases? Judging and Explaining Social Biases with LLMs
- Catalogue
- Abstract
- 🛠️ Step1: Prepare all scenarios
- ⚖️ Step2: Judging
- 🗣️ Step3: Explaining
- 📚 More explanations of our work and repository
Large language models (LLMs) can lead to undesired consequences when misaligned with human values, especially in scenarios involving complex and sensitive social biases. Previous studies have revealed the misalignment of LLMs with human values using expert-designed or agent-based emulated bias scenarios. However, it remains unclear whether the alignment of LLMs with human values differs across different types of scenarios (e.g., scenarios containing negative vs. non-negative questions). In this study, we investigate the alignment of LLMs with human values regarding social biases (HVSB) in different types of bias scenarios. Through extensive analysis of 12 LLMs from four model families across four datasets, we demonstrate that LLMs with larger parameter scales do not necessarily have lower misalignment rates and attack success rates. Moreover, LLMs show a certain degree of alignment preference for specific types of scenarios, and LLMs from the same model family tend to have higher judgment consistency. In addition, we study the understanding capacity of LLMs through their explanations of HVSB. We find no significant differences in the understanding of HVSB across LLMs. We also find that LLMs prefer their own generated explanations. Additionally, we endow smaller language models (LMs) with the ability to explain HVSB. The generation results show that the explanations generated by the fine-tuned smaller LMs are more readable, but have relatively lower model agreeability.
Our scenarios come from four sources: the BBQ dataset (Q&A), the BiasDPO dataset (conversation), and the SS and CP datasets (emulated realistic scenarios). The scenarios are prepared using the following scripts. If you do not want to prepare all scenarios, you can skip this step and use the prepared scenarios in the data/ folder (go directly to ⚖️ Step2: Judging).
Before running the scripts, please set the following parameters in preprocess/preprocess_utils.py (see the sketch after this list):
- Set `bbq_dataset_path` to the path of the BBQ dataset.
- Set `ss_dataset_path` to the path of the SS dataset.
- Set `cp_dataset_path` to the path of the CP dataset.
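A minimal sketch of what these settings might look like in preprocess/preprocess_utils.py; the variable names follow the list above, while the path strings are placeholders you should replace with your local copies:

```python
# preprocess/preprocess_utils.py -- illustrative sketch only; adapt to the actual file.

# Local paths to the downloaded datasets (placeholder values).
bbq_dataset_path = "your/path/to/BBQ/{}.jsonl"       # '{}' is filled with a BBQ category name
ss_dataset_path = "your/path/to/SS/ss_dataset_file"  # single dataset file (placeholder name)
cp_dataset_path = "your/path/to/CP/cp_dataset_file"  # single dataset file (placeholder name)
```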
For BBQ dataset, please run the following script:
python preprocess/preprocess_bbq.pyFor BiasDPO dataset, please run the following script:
python preprocess/preprocess_biasdpo.pyThen, please set openai_api_key in preprocess/preprocess_utils.py to your OpenAI API key.
For SS dataset, please run the following script:
python preprocess/preprocess_ss.pyFor CP dataset, please run the following script:
python preprocess/preprocess_cp.pyMore details of our dataset can be found in Dataset explanation.
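If you want to sanity-check the preprocessing output, a few lines of Python are enough; the file name below is a placeholder for whichever JSON Lines file the preprocessing scripts write into the data/ folder, and the field names depend on the dataset:

```python
import json

# Print the first three prepared scenarios (placeholder path).
with open("data/bbq.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(json.loads(line))
        if i == 2:
            break
```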
Before running the judging script, please set your API keys as described in API keys.
To judge the BBQ/BiasDPO/SS/CP datasets using the ChatGPT or DeepSeek API, please run the following script:
```bash
python code/judgment/judge_api.py --dataset [bbq, biasdpo, ss, cp] --model_name_or_path [gpt-3.5-turbo, gpt-4o, gpt-4o-mini] --system_prompt ['', untargeted, targeted]
```
To judge the BBQ/BiasDPO/SS/CP datasets using the Qwen or Llama model family, please run the following script:
```bash
python code/judgment/judge_llm.py --dataset [bbq, biasdpo, ss, cp] --model_name_or_path [Qwen, Llama] --system_prompt ['', untargeted, targeted]
```
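For orientation, the API-based judging step boils down to one chat-completion call per scenario. The snippet below is an illustrative reconstruction, not the exact prompt or parsing logic in code/judgment/judge_api.py; the model name, key, and prompt wording are assumptions:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")  # for DeepSeek, pass base_url="https://api.deepseek.com"

def judge_scenario(scenario_text: str, system_prompt: str = "") -> str:
    """Ask the judge model whether a scenario reflects a social bias (illustrative prompt)."""
    messages = []
    if system_prompt:  # '', untargeted, or targeted, mirroring the --system_prompt options above
        messages.append({"role": "system", "content": system_prompt})
    messages.append({
        "role": "user",
        "content": f"{scenario_text}\n\nDoes this scenario reflect a social bias? Answer Yes or No.",
    })
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content
```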
Before running the explaining script, please use the following script to sample scenarios from the prepared scenarios in the data/ folder:
```bash
python preprocess/gather_all_then_sample.py
```
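Conceptually, this step draws a fixed-size random subset of the prepared scenarios for each dataset. The sketch below is only an illustration: the sample size of 500 is inferred from the *_tobe_exp_500 folder names used later, and the seed and file paths are assumptions rather than the repository's actual settings:

```python
import json
import random

random.seed(42)  # assumption: a fixed seed for reproducibility

# Placeholder paths; the repository script gathers all datasets and writes per-dataset samples.
with open("data/bbq.jsonl", encoding="utf-8") as f:
    scenarios = [json.loads(line) for line in f]

sampled = random.sample(scenarios, k=500)  # 500 inferred from the *_tobe_exp_500 folder names

with open("data/bbq_tobe_exp_500/bbq.jsonl", "w", encoding="utf-8") as f:
    for scenario in sampled:
        f.write(json.dumps(scenario, ensure_ascii=False) + "\n")
```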
Then, please set your API keys as described in API keys.
To generate explanations for the BBQ/BiasDPO/SS/CP datasets using the ChatGPT or DeepSeek API, please run the following script:
```bash
python code/explanation/explain_api.py \
    --dataset [bbq, biasdpo, ss, cp] \
    --model_name_or_path [gpt-3.5-turbo, gpt-4o, gpt-4o-mini] \
    --file_path [data/bbq_tobe_exp_500/bbq.jsonl, data/biasdpo_tobe_exp_500/biasdpo.jsonl, data/ss_tobe_exp_500/ss.jsonl, data/cp_tobe_exp_500/cp.jsonl]
```
To generate explanations for the BBQ/BiasDPO/SS/CP datasets using the Qwen or Llama model family, please run the following script:
```bash
python code/explanation/explain_llm.py \
    --dataset [bbq, biasdpo, ss, cp] \
    --model_name_or_path [Qwen, Llama] \
    --file_path [data/bbq_tobe_exp_500/bbq.jsonl, data/biasdpo_tobe_exp_500/biasdpo.jsonl, data/ss_tobe_exp_500/ss.jsonl, data/cp_tobe_exp_500/cp.jsonl]
```
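As a rough sketch of what the local-model explanation step involves, the snippet below loads a Hugging Face chat model and generates a free-text explanation for one scenario. The checkpoint name, prompt wording, and generation settings are illustrative assumptions; the repository script handles the actual prompting, batching, and output format:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder checkpoint; substitute the model you evaluate
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

scenario = "..."  # one scenario text loaded from, e.g., data/bbq_tobe_exp_500/bbq.jsonl
messages = [{"role": "user",
             "content": f"{scenario}\n\nExplain whether and why this scenario reflects a social bias."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
explanation = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(explanation)
```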
Before running the model agreeability script, please set your API keys as described in API keys.
To evaluate model agreeability on the BBQ/BiasDPO/SS/CP datasets using the ChatGPT or DeepSeek API, please run the following script:
```bash
python code/agreeability/agree_api.py \
    --dataset <DATASET_NAME> \
    --model_name_or_path [gpt-3.5-turbo, gpt-4o, gpt-4o-mini] \
    --file_path data/<DATASET_NAME>_explanation/<MODEL_NAME>/<DATASET_NAME>.jsonl
```
To evaluate model agreeability on the BBQ/BiasDPO/SS/CP datasets using the Qwen or Llama model family, please run the following script:
```bash
python code/agreeability/agree_llm.py \
    --dataset <DATASET_NAME> \
    --model_name_or_path [Qwen, Llama] \
    --file_path data/<DATASET_NAME>_explanation/<MODEL_NAME>/<DATASET_NAME>.jsonl
```
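Conceptually, the agreeability step asks a judge model whether it agrees with a previously generated explanation. The sketch below is a hedged illustration of such a call, not the exact prompt or scoring logic in code/agreeability/; the model name, key, and answer format are assumptions:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")  # placeholder key; DeepSeek works via base_url

def rate_agreement(scenario_text: str, explanation: str) -> str:
    """Ask a judge model whether it agrees with an explanation of a scenario (illustrative prompt)."""
    prompt = (
        f"Scenario:\n{scenario_text}\n\n"
        f"Explanation:\n{explanation}\n\n"
        "Do you agree with this explanation of the social bias in the scenario? Answer Agree or Disagree."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```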
We use the functions get_bbq_dataset_path(), get_ss_dataset_path(), and get_cp_dataset_path() to return the paths of the BBQ, SS, and CP datasets, so please set the dataset paths in preprocess/preprocess_utils.py.
For the BBQ dataset, the path format is your/path/to/BBQ/{}.jsonl; for the SS and CP datasets, the path points directly to the corresponding dataset file.
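A minimal sketch of what one of these helpers might look like, assuming they simply return the configured paths (the real implementations may differ):

```python
# Illustrative sketch of a path helper in preprocess/preprocess_utils.py.
bbq_dataset_path = "your/path/to/BBQ/{}.jsonl"  # '{}' is later filled with a BBQ category name

def get_bbq_dataset_path() -> str:
    """Return the BBQ path template; get_ss_dataset_path() and get_cp_dataset_path() are analogous."""
    return bbq_dataset_path
```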
Please set the following parameters in code/judgment/judge_utils.py, code/explanation/explain_utils.py, and code/agreeability/agree_utils.py (see the sketch after this list):
- Set `openai_api_key` to your OpenAI API key.
- Set `deepseek_api_key` to your DeepSeek API key.
- Set `hf_token` to your Hugging Face token.
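A hedged sketch of how these values might look in the *_utils.py files; the variable names follow the list above, and the string values are placeholders for your own credentials:

```python
# Illustrative sketch: credentials expected by the judging/explaining/agreeability utilities.
openai_api_key = "sk-..."    # placeholder OpenAI API key
deepseek_api_key = "sk-..."  # placeholder DeepSeek API key
hf_token = "hf_..."          # placeholder Hugging Face access token (e.g., for gated Llama/Qwen checkpoints)
```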
```bibtex
@article{liu2025llms,
  title={Do LLMs Align Human Values Regarding Social Biases? Judging and Explaining Social Biases with LLMs},
  author={Liu, Yang and Chu, Chenhui},
  journal={arXiv preprint arXiv:2509.13869},
  year={2025}
}
```
