The CVDP Benchmark is a framework for evaluating LLM and agent solutions on hardware verification challenges.
Dataset: The CVDP Benchmark dataset is available on Hugging Face at 🤗 nvidia/cvdp-benchmark-dataset.
Paper: For detailed methodology and evaluation results, see our arXiv preprint: Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification.
Dataset Notes: The dataset published on Hugging Face contains the vast majority of the evaluation problems discussed in the preprint. Twenty datapoints were omitted from the initial public release due to test harness issues and licensing restrictions. Additionally, we excluded the reference solutions ('output' for non-agentic problems and 'patch' for agentic problems) from the initial dataset release to help mitigate data contamination. We hope to make these available in the future; if you have an immediate need, please reach out to discuss.
New LLM Benchmarking Coalition: The Silicon Integration Initiative (Si2) is adopting CVDP into its newly launched LLM Benchmarking Coalition (LBC), where industry and academic members work together to improve benchmarking and expedite the development of high-quality large language models for semiconductor design problems. The coalition builds upon CVDP by extending problems to cover new categories and design domains, refining benchmarking metrics, and maintaining a leaderboard. Learn more at Si2 LLM Benchmarking Coalition.
Python 3.12 is recommended for optimal compatibility.
A recent version of Docker CE (Community Edition) is required:
- Install Docker CE from https://docs.docker.com/get-docker/
- Add your user to the docker group to run Docker without sudo permissions:
# Add current user to docker group
sudo usermod -aG docker $USER
# Log out and back in, or restart your session
# Verify Docker works without sudo:
docker --version
1. Create a virtual environment (recommended):
# Create virtual environment
python -m venv cvdp_env
# Activate virtual environment
# On Linux/macOS:
source cvdp_env/bin/activate
# On Windows:
cvdp_env\Scripts\activate
2. Install Python dependencies:
pip install -r requirements.txt
3. Configure environment variables:
cp .env.example .env
# Edit .env with your OpenAI API key:
# OPENAI_USER_KEY=your_api_key_here
Evaluate language models via direct API calls on example problems:
1. Test golden reference solutions first:
# Run golden solutions to verify test harness works
./run_benchmark.py -f example_dataset/cvdp_v1.0.1_example_nonagentic_code_generation_no_commercial_with_solutions.jsonl
# Check results - should show 100% pass rate
cat work/report.txt
2. Multi-sample evaluation (recommended - no-commercial dataset; a Pass@k note follows this step):
# Run 5 samples for statistical reliability with Pass@1 (requires OPENAI_USER_KEY)
./run_samples.py -f example_dataset/cvdp_v1.0.1_example_nonagentic_code_generation_no_commercial_with_solutions.jsonl -l -m gpt-4o-mini -n 5 -k 1 -p work_composite
# Check results in composite_report.txt
cat work_composite/composite_report.txt
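For context on the -n/-k flags above: Pass@k over n samples is typically computed with the standard unbiased estimator 1 - C(n-c, k)/C(n, k), where c is the number of passing samples. The Python sketch below illustrates that calculation; whether composite_report.txt aggregates scores in exactly this way is an assumption here, so use it only as a sanity check on reported numbers.
# pass_at_k.py - illustrative Pass@k calculation (not part of the benchmark scripts)
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes, given c passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 samples, 2 passed, Pass@1 = 0.4
print(pass_at_k(n=5, c=2, k=1))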
3. Single run with LLM (no-commercial dataset):
# Same dataset with LLM evaluation (requires OPENAI_USER_KEY)
./run_benchmark.py -f example_dataset/cvdp_v1.0.1_example_nonagentic_code_generation_no_commercial_with_solutions.jsonl -l -m gpt-4o-mini -p work_llm
# Check LLM results
cat work_llm/report.txt
4. Try other example datasets:
# Commercial dataset (5 samples - requires OPENAI_USER_KEY)
# NOTE: Requires commercial EDA tools (Cadence Xcelium), license network, and VERIF_EDA_IMAGE setup
# See README sections: "EDA License Network Setup" and "Custom Verification Images"
./run_samples.py -f example_dataset/cvdp_v1.0.1_example_nonagentic_code_generation_commercial_with_solutions.jsonl -l -m gpt-4o-mini -n 5 -k 1 -p work_commercial_composite
# Code comprehension dataset (5 samples)
# NOTE: Uses BLEU/ROUGE scoring + LLM-based subjective scoring (requires OPENAI_USER_KEY; an illustrative scoring sketch follows this step)
# Subjective scoring is automatically enabled - no additional configuration needed
./run_samples.py -f example_dataset/cvdp_v1.0.1_example_nonagentic_code_comprehension_with_solutions.jsonl -l -m gpt-4o-mini -n 5 -k 1 -p work_comprehension_composite
# Check results
cat work_commercial_composite/composite_report.txt
cat work_comprehension_composite/composite_report.txt
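As a rough illustration of the text-similarity half of code-comprehension scoring, the sketch below compares a candidate answer against a reference with BLEU and ROUGE-L using the common nltk and rouge-score packages. These library choices and the single-sentence setup are assumptions for illustration only; the benchmark's own scoring pipeline may tokenize and weight differently.
# comprehension_scoring_sketch.py - illustrative only; not the benchmark's scorer
# Assumes: pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "The FSM has three states: IDLE, LOAD, and DONE."
candidate = "The FSM contains three states named IDLE, LOAD and DONE."

bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)
rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-L F1: {rouge_l['rougeL'].fmeasure:.3f}")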
Evaluate custom agents running in Docker containers:
1. Test golden reference solutions first:
# Run golden solutions to verify test harness works
./run_benchmark.py -f example_dataset/cvdp_v1.0.1_example_agentic_code_generation_no_commercial_with_solutions.jsonl
# Check results - should show 100% pass rate
cat work/report.txt
2. Multi-sample evaluation with agent (recommended):
# Copy and build the example agent first
# NOTE: This is a dummy agent that will fail - it just replaces "input" with "loompa" in RTL
# For real evaluation, you'll need to implement proper agent logic in agent.py (a hedged sketch follows this step)
cp -r examples/agent/ ./my-agent/
cd my-agent/
./build_agent.sh
cd ..
# Run 5 samples for statistical reliability with Pass@1
./run_samples.py -f example_dataset/cvdp_v1.0.1_example_agentic_code_generation_no_commercial_with_solutions.jsonl -l -g example-agent -n 5 -k 1 -p work_composite
# Check results in composite_report.txt
cat work_composite/composite_report.txt
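If you replace the dummy "loompa" logic in my-agent/agent.py with something real, the general shape is: read the problem context from the mounted workspace, call a model, and write edited files back in place. The sketch below is a hedged outline of that flow; the mount point, prompt location, and environment variable names are assumptions, so check README_AGENTIC.md for the actual agent contract.
# agent.py sketch - hypothetical structure; file paths and env vars below are assumptions
import os
from pathlib import Path
from openai import OpenAI

workspace = Path(os.environ.get("WORKSPACE", "/code"))   # assumed mount point
client = OpenAI(api_key=os.environ["OPENAI_USER_KEY"])

prompt = (workspace / "prompt.json").read_text()          # hypothetical prompt file

for rtl_file in workspace.rglob("*.sv"):
    source = rtl_file.read_text()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You edit SystemVerilog to satisfy the task."},
            {"role": "user", "content": f"Task:\n{prompt}\n\nFile {rtl_file.name}:\n{source}"},
        ],
    )
    rtl_file.write_text(response.choices[0].message.content)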
3. Single run with agent (no-commercial dataset):
# Run agent on same dataset
# NOTE: Example agent is for demonstration only - expect low success rates
./run_benchmark.py -f example_dataset/cvdp_v1.0.1_example_agentic_code_generation_no_commercial_with_solutions.jsonl -l -g example-agent -p work_agent
# Check agent results
cat work_agent/report.txt
4. Try commercial dataset with multi-sampling:
# Agent run on commercial problems (5 samples)
# NOTE: Requires commercial EDA tools (Cadence Xcelium), license network, and VERIF_EDA_IMAGE setup
# NOTE: This is a dummy agent that will fail - it just replaces "input" with "loompa" in RTL
./run_samples.py -f example_dataset/cvdp_v1.0.1_example_agentic_code_generation_commercial_with_solutions.jsonl -l -g example-agent -n 5 -k 1 -p work_agentic_commercial_composite
# Check results
cat work_agentic_commercial_composite/composite_report.txt
5. Debug single problem:
# Run one problem to examine the agent workflow
# USEFUL: Good for understanding how agents interact with the environment and for debugging failures
./run_benchmark.py -f example_dataset/cvdp_v1.0.1_example_agentic_code_generation_no_commercial_with_solutions.jsonl -i cvdp_agentic_fixed_arbiter_0001 -l -g example-agent -p work_debug
# Navigate to the work directory to examine agent execution
cd work_debug/cvdp_agentic_fixed_arbiter_0001/harness/1/
# Manually run the agent for debugging
./run_docker_agent.sh -d
# Run evaluation manually
./run_docker_harness_direct.sh
Resources:
- Full Dataset: Download from 🤗 nvidia/cvdp-benchmark-dataset
- Custom Models: See Non-Agentic Guide for custom model development
- Custom Agents: See Agentic Guide for building your own agents
- Advanced Features: See Examples for model factories and custom implementations
📖 Complete Documentation: For comprehensive documentation including detailed installation instructions, configuration options, advanced usage patterns, command-line reference, and troubleshooting guides, see README_FULL.md.
- README_FULL.md - Complete documentation with detailed guides and configuration
- README_NON_AGENTIC.md - Complete guide for non-agentic evaluation workflow
- README_AGENTIC.md - Complete guide for Docker agent evaluation workflow
- examples/README.md - Custom model and agent development for end users
- README_DEVELOPER.md - Internal development and architecture documentation