A Tool for Benchmarking Large Language Models' Robustness in Assessing the Realism of Driving Scenarios

Abstract

In recent years, autonomous driving systems (ADS) have made significant progress, yet ensuring their safety remains a key challenge. Scenario-based testing offers a practical solution, and simulation-based methods have gained traction due to the high cost and risk of real-world testing. However, evaluating the realism of simulated scenarios remains difficult, creating demand for effective assessment methods. Recent advances show that Large Language Models (LLMs) possess strong reasoning and generalization capabilities, suggesting their potential for assessing scenario realism through scenario-related textual prompts. Motivated by this, we propose DriveRLR, a benchmark tool for assessing the robustness of LLMs in evaluating the realism of driving scenarios. DriveRLR generates mutated scenario variants and constructs prompts from them, which are then used to assess a given LLM's ability and robustness in determining the realism of driving scenarios. We validate DriveRLR on the DeepScenario dataset using three state-of-the-art LLMs: GPT-5, Llama 4 Maverick, and Mistral Small 3.2. Results show that DriveRLR reveals differences in robustness across these LLMs, demonstrating its practical value in scenario realism assessment. Beyond LLM robustness evaluation, DriveRLR can serve as a practical component in other applications, for example as an objective function to guide scenario generation, supporting simulation-based ADS testing workflows.

Setup

Follow the instructions below to set up and configure the environment.

# 1) Enter the project directory
cd DriveRLR

# 2) Create and activate a conda environment
conda create -n DriveRLR python=3.9 -y
conda activate DriveRLR

# 3) Install dependencies
pip install -r requirements.txt

# 4) Install build tooling and build the wheel/sdist from source
python -m pip install --upgrade pip build
python -m build

# 5) Install the built wheel locally
pip install dist/driverlr-0.1.0-py3-none-any.whl
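
As an optional sanity check after step 5, the installed distribution can be queried from Python. This is a minimal sketch, not part of the tool: it assumes the distribution name is `driverlr` (inferred from the wheel filename above), and `installed_version` is a hypothetical helper name.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "driverlr" is an assumption taken from dist/driverlr-0.1.0-py3-none-any.whl
    version = installed_version("driverlr")
    print(version if version else "driverlr is not installed in this environment")
```

If the check reports that the package is missing, confirm that the DriveRLR conda environment is active before reinstalling the wheel.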

Usage

There are three ways to use this tool:

1. Run in Terminal

Run the tool directly without writing any code. Each parameter is set via command-line input, with default values shown. The final output location is also displayed.

python tool.py

2. Use in Python Script

Call specific functions from your own Python code. The tool provides several callable functions; example.py shows how to use them:

python example.py

3. Modify the Source Code

You can modify the source code to fit your needs. For example:

Change how scenario parameters are mutated
Modify prompt templates
Add new evaluation metrics

After modifying the source, rebuild and reinstall the package or run the scripts directly, as needed.
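
To make the three extension points concrete, the sketch below shows one possible shape for a mutation function, a prompt template, and a robustness metric. It is an illustration under stated assumptions, not DriveRLR's actual implementation: every name here (`mutate_scenario`, `build_prompt`, `consistency_rate`, the parameter keys, and the stub judge) is hypothetical, and the metric shown is a simple verdict-consistency rate chosen for illustration.

```python
import random
from typing import Callable, Dict, List

# Hypothetical scenario representation: parameter name -> numeric value.
Scenario = Dict[str, float]


def mutate_scenario(scenario: Scenario, rel_noise: float, rng: random.Random) -> Scenario:
    """Scale each parameter by a uniform relative offset in [-rel_noise, rel_noise]."""
    return {k: v * (1.0 + rng.uniform(-rel_noise, rel_noise)) for k, v in scenario.items()}


def build_prompt(scenario: Scenario) -> str:
    """Render a scenario as a textual prompt asking for a realism verdict."""
    lines = [f"- {name}: {value:.2f}" for name, value in sorted(scenario.items())]
    header = "Is the following driving scenario realistic? Answer 'realistic' or 'unrealistic'."
    return header + "\n" + "\n".join(lines)


def consistency_rate(
    original: Scenario, variants: List[Scenario], judge: Callable[[str], str]
) -> float:
    """Fraction of variants whose verdict matches the verdict on the original scenario."""
    if not variants:
        raise ValueError("at least one variant is required")
    baseline = judge(build_prompt(original))
    agree = sum(judge(build_prompt(v)) == baseline for v in variants)
    return agree / len(variants)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the mutations are reproducible
    base = {"ego_speed_mps": 12.0, "npc_distance_m": 25.0}  # hypothetical parameters
    variants = [mutate_scenario(base, 0.05, rng) for _ in range(5)]

    def stub_judge(prompt: str) -> str:
        # Placeholder for a real LLM call; always returns the same verdict.
        return "realistic"

    print(consistency_rate(base, variants, stub_judge))
```

Passing the judge as a plain callable keeps the sketch independent of any particular LLM client, so a real model call can be swapped in without changing the mutation or metric code.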

Project Structure

DriveRLR/
├── assets/                     # Images or other static resources
├── data/                       # Input/output data files
├── dist/                       # Built distributions (.whl, etc.)
├── src/                        # Source code directory
├── example.py                  # Example usage script
├── LICENSE                     # License file
├── pyproject.toml              # Build/configuration file for Python packaging
├── README.md                   # Project documentation
├── requirements.txt            # Python dependencies
├── scenario-toolset.tar.gz     # Archived toolset package
└── tool.py                     # Main script to run the tool

Simula-COMPLEX/DriveRLR
