This repository contains the code and resources for the paper "Truth Knows No Language: Evaluating Truthfulness Beyond English".
We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Truthfulness evaluations of large language models (LLMs) have primarily been conducted in English. However, the ability of LLMs to maintain truthfulness across languages remains under-explored. Our study evaluates 12 state-of-the-art open LLMs, comparing base and instruction-tuned models using human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our findings reveal that, while LLMs perform best in English and worst in Basque (the lowest-resourced language), overall truthfulness discrepancies across languages are smaller than anticipated. Furthermore, we show that LLM-as-a-Judge correlates more closely with human judgments than multiple-choice metrics, and that informativeness plays a critical role in truthfulness assessment. Our results also indicate that machine translation provides a viable approach for extending truthfulness benchmarks to additional languages, offering a scalable alternative to professional translation. Finally, we observe that universal knowledge questions are better handled across languages than context- and time-dependent ones, highlighting the need for truthfulness evaluations that account for cultural and temporal variability.
- Paper: https://arxiv.org/abs/2502.09387
- Dataset: https://huggingface.co/datasets/HiTZ/truthfulqa-multi
- Judges: https://huggingface.co/collections/HiTZ/multilingual-truthfulqa-682f33d0d1d5a60d13604eb6
- `analysis/`: Scripts and notebooks for analyzing the results of the experiments and evaluations.
- `data/`: Contains the raw and processed data used for the experiments.
- `experiments/`: Scripts to run the main experiments, including generating model answers.
- `judge/`: Code related to the judge models, including running the judges and evaluating their performance.
- `MT_experiments/`: Scripts and resources for experiments involving machine translation.
- `results/`: Stores the outputs and saved results from the various experiments and evaluations.
- `utils/`: Utility scripts and helper functions used across the project.
Obtain the model answers using the evaluation harness. These Slurm scripts typically call shell scripts located in `experiments/run/`.
To run the generative evaluation:

- For standard models: `sbatch experiments/generative.slurm`
- For larger models: `sbatch experiments/generative_big.slurm`
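If you are not on a Slurm cluster, the same step can be reproduced by calling the harness directly. Below is a minimal sketch using lm-evaluation-harness's Python API; the model and task names are placeholders, not necessarily the ones targeted by the scripts in `experiments/run/`:

```python
import lm_eval

# Placeholder model and task names: substitute the checkpoints and
# TruthfulQA-multi tasks that the Slurm scripts actually use.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["truthfulqa_gen"],
    batch_size=8,
)
print(results["results"])
```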
To run the MC2 (multiple-choice) evaluation:

- For standard models: `sbatch experiments/mc2.slurm`
- For larger models: `sbatch experiments/mc2_big.slurm`
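For context, MC2 is the TruthfulQA multiple-choice metric: the normalized total probability a model assigns to the set of true reference answers for a question. The harness computes it internally; the sketch below only illustrates the calculation, with made-up log-likelihoods:

```python
import numpy as np

def mc2_score(true_loglikelihoods, false_loglikelihoods):
    """MC2: normalized total probability assigned to the true reference answers."""
    p_true = np.exp(np.asarray(true_loglikelihoods))
    p_false = np.exp(np.asarray(false_loglikelihoods))
    return p_true.sum() / (p_true.sum() + p_false.sum())

# Toy example for a single question (log-likelihoods are made up).
print(mc2_score([-1.2, -2.5], [-0.9, -3.1, -2.2]))
```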
This project includes a comprehensive framework for judging the truthfulness of model-generated answers.
1. Train Judge Models (Optional):
   - If you need to train custom judge models, refer to the scripts and resources in `judge/train_judge/`.
   - Training data may involve translation; see `judge/translate_training_data/`. An illustrative fine-tuning sketch is shown below.
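For orientation only: one common way to build such a judge is to fine-tune a classifier on (question, answer) pairs labelled truthful / not truthful. The sketch below uses hypothetical file and column names and is not the exact recipe used in `judge/train_judge/`, whose judges may use a different architecture and training setup:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical training file with "question", "answer" and binary "label" columns.
ds = load_dataset("json", data_files={"train": "judge_train.jsonl"})

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)

def encode(batch):
    # The judge sees the question paired with a candidate answer.
    return tokenizer(batch["question"], batch["answer"], truncation=True, max_length=512)

ds = ds.map(encode, batched=True)

args = TrainingArguments(output_dir="judge_ckpt", per_device_train_batch_size=8,
                         num_train_epochs=3, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=ds["train"], tokenizer=tokenizer).train()
```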
2. Judge Model Answers:
   - Use the following command to have the judge model evaluate the answers generated in the previous steps. This typically calls scripts in `judge/run_experiments/` (a toy inference sketch follows below):

`sbatch judge/run_experiments/judge.slurm`
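As a rough illustration of what the judging step does, the snippet below scores a single (question, answer) pair with a classifier-style judge. The checkpoint name is hypothetical (for example, the output of the training sketch above, or a judge from the HiTZ collection linked at the top); the actual batch evaluation is handled by `judge.slurm`:

```python
from transformers import pipeline

# Hypothetical checkpoint; replace with a trained judge model.
judge = pipeline("text-classification", model="judge_ckpt")

question = "What happens if you eat watermelon seeds?"
answer = "Nothing harmful; the seeds pass through your digestive system."

# The judge classifies whether the answer to the question is truthful.
print(judge({"text": question, "text_pair": answer}))
```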
3. Process Judge Results:
   - The script `judge/judge_results.py` can be used to process or aggregate the outputs from the judge models.
4. Evaluate Judges:
   - Compare the judge model's evaluations against the manual annotations (a minimal sketch of this comparison follows below):

`python judge/correlate_to_manual.py`
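Conceptually, this step measures how often the judge's yes/no decisions match the human annotations on the same instances. A minimal sketch with hypothetical file and column names (the real ones are defined in `correlate_to_manual.py`):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical CSV with one row per instance and both labelings.
df = pd.read_csv("judge_vs_manual.csv")

print("accuracy:", accuracy_score(df["manual_label"], df["judge_label"]))
print("kappa:   ", cohen_kappa_score(df["manual_label"], df["judge_label"]))
```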
To generate statistics on judge performance by category and type:

`python analysis/get_stats.py`

This will output `by_category.csv` and `by_type.csv` in the `analysis/` directory.
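In essence this boils down to grouping the judged instances by category and type and averaging the truthfulness labels. A minimal sketch with a hypothetical input schema (the actual columns are defined in `analysis/get_stats.py`):

```python
import pandas as pd

# Hypothetical input: one row per judged answer.
df = pd.read_csv("judge_results.csv")
df["truthful"] = (df["judge_label"] == "yes").astype(int)

df.groupby(["model", "language", "category"])["truthful"].mean().to_csv("analysis/by_category.csv")
df.groupby(["model", "language", "type"])["truthful"].mean().to_csv("analysis/by_type.csv")
```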
To calculate inter-annotator agreement (Cohen's Kappa) between manual evaluations and MC2 results:

`python analysis/iaa_mc2.py`
Scripts in `analysis/compare_to_MT_translation/` compare results obtained using human-translated data versus machine-translated data.
`check_translation_quality.py`: Evaluates the quality of machine-translated questions and answers against the human translations using metrics such as BLEU and ChrF. Run it with `python analysis/compare_to_MT_translation/check_translation_quality.py`.
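As an illustration of the metrics involved, BLEU and ChrF can be computed with sacrebleu. The sentences below are made up; the real script loads the full translated question/answer files:

```python
import sacrebleu

# One machine-translated hypothesis and one human reference (toy example).
machine = ["¿Qué pasa si comes semillas de sandía?"]
human = [["¿Qué ocurre si comes pepitas de sandía?"]]

print("BLEU:", sacrebleu.corpus_bleu(machine, human).score)
print("ChrF:", sacrebleu.corpus_chrf(machine, human).score)
```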
`comparison.py`: Compares judge outputs for human-translated data versus machine-translated data (from `judge/judge_output/MT-claude/`). It calculates agreement and identifies instances where the judgments differ. Run it with `python analysis/compare_to_MT_translation/comparison.py`.
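The core of this comparison is aligning the two sets of judgments per question and measuring how often they agree. A minimal sketch with hypothetical file and column names:

```python
import pandas as pd

# Hypothetical exports of the judge outputs on human- and machine-translated data.
ht = pd.read_csv("judgments_human_translation.csv")   # columns: question_id, judgment
mt = pd.read_csv("judgments_machine_translation.csv")

merged = ht.merge(mt, on="question_id", suffixes=("_ht", "_mt"))
print("agreement:", (merged["judgment_ht"] == merged["judgment_mt"]).mean())

# Instances where the two judgments differ.
print(merged[merged["judgment_ht"] != merged["judgment_mt"]])
```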
`statistical_test_MT.py`: Performs statistical tests (Chi-square) to compare the distributions of 'yes'/'no' judgments between the human-translated and machine-translated datasets for different models and languages. Run it with `python analysis/compare_to_MT_translation/statistical_test_MT.py`.
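For reference, the Chi-square test operates on a contingency table of 'yes'/'no' counts for the two dataset versions. A sketch with made-up counts:

```python
from scipy.stats import chi2_contingency

# Made-up 'yes'/'no' counts for one model and language:
#                       yes   no
# human-translated      310  187
# machine-translated    298  199
table = [[310, 187], [298, 199]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.3f}")
```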
- The `analysis/compare_to_MT_translation/translate/` directory contains scripts for performing machine translation with different services/models (e.g., `translate_with_anthropic.py`).
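For orientation, translating an item with the Anthropic API looks roughly like the snippet below; the model name and prompt are assumptions, and the actual setup lives in `translate_with_anthropic.py` (an `ANTHROPIC_API_KEY` is required):

```python
import anthropic

# Illustrative only: model name and prompt are assumptions.
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Translate the following question into Basque, preserving its meaning exactly:\n\n"
                   "What happens if you eat watermelon seeds?",
    }],
)
print(response.content[0].text)
```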
Scripts in `analysis/cultural_nuances/` investigate whether model performance differs between questions with local/cultural nuances and universal (global) questions. The split relies on a predefined list of instances from VeritasQA, which identifies the globally applicable questions.
`test_local_instances.py`: Loads the judge results, separates instances based on whether they appear in the VeritasQA list (global) or not (local), and then calculates and outputs performance metrics for these two subsets. Run it with `python analysis/cultural_nuances/test_local_instances.py`.
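The split itself amounts to a set-membership check against the VeritasQA question IDs followed by per-subset averaging. A minimal sketch with hypothetical file and column names (the real inputs are defined in `test_local_instances.py`):

```python
import pandas as pd

# Hypothetical inputs: one row per judged answer, plus the VeritasQA ID list.
results = pd.read_csv("judge_results.csv")
veritas_ids = set(pd.read_csv("veritasqa_ids.csv")["question_id"])

results["subset"] = results["question_id"].map(
    lambda qid: "global" if qid in veritas_ids else "local"
)
results["truthful"] = (results["judge_label"] == "yes").astype(int)
print(results.groupby(["subset", "model", "language"])["truthful"].mean())
```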
@misc{figueras2025truthknowslanguageevaluating,
title={Truth Knows No Language: Evaluating Truthfulness Beyond English},
author={Blanca Calvo Figueras and Eneko Sagarzazu and Julen Etxaniz and Jeremy Barnes and Pablo Gamallo and Iria De Dios Flores and Rodrigo Agerri},
year={2025},
eprint={2502.09387},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.09387},
}
For questions, contact [email protected] and [email protected].