LLM evaluation framework with benchmark environments, pluggable solvers, composable interceptor proxy, and multi-format reporting.
pip install -e . # core
pip install -e ".[scoring]" # + sympy for symbolic math
pip install -e ".[stats]" # + scipy (regression analysis)
pip install -e ".[scoring,stats]" # + sympy + scipy for confidence intervals
pip install -e ".[harbor]" # + Harbor agents (OpenHands, Terminus-2)
pip install -e ".[inspect]" # + Inspect AI log export
pip install -e ".[all]" # common runtime integrations
export NVIDIA_API_KEY="your-api-key-here"
# Run a benchmark from the CLI
nel eval run --bench mmlu \
--model-url https://integrate.api.nvidia.com/v1 \
--model-id nvidia/nemotron-3-super-120b-a12b \
--api-key $NVIDIA_API_KEY \
--repeats 3 --max-problems 100
# Run from a YAML config
nel eval run config.yaml
nel eval run config.yaml --resume
# Generate a report
nel eval report ./eval_results/ -f markdown -o report.md

17 built-in benchmarks plus external harness integrations:
| Benchmark | Type | Scoring |
|---|---|---|
| mmlu, mmlu_pro, gpqa | Multichoice | multichoice_regex |
| gsm8k, math500, mgsm | Math | numeric_match / answer_line |
| drop, triviaqa | QA | fuzzy_match |
| humaneval | Code | code_sandbox (Docker) |
| simpleqa, healthbench | Judge | needs_judge |
| pinchbench | Agentic | code_sandbox / needs_judge |
| xstest | Safety | needs_judge |
| terminal-bench-hard, terminal-bench-v1 | Terminal tasks | Task test harness |
| nmp_harbor | Agentic NMP | Harbor task tests |
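
As a rough illustration of what regex-based multiple-choice scoring can look like, the sketch below pulls the last answer letter out of a model response and compares it with the target. The pattern and the names `extract_choice` and `score_multichoice` are assumptions made for this example; they are not the framework's actual `multichoice_regex` implementation.

```python
import re
from typing import Optional

# Hypothetical pattern: matches "Answer: B", "The answer is (C)", and similar.
# Only the word "answer is" is case-insensitive; the letter must be uppercase A-J.
_ANSWER_RE = re.compile(
    r"(?i:answer(?:\s+is)?)\s*[:\-]?\s*\(?([A-J])(?![A-Za-z])\)?"
)


def extract_choice(response: str) -> Optional[str]:
    """Return the last answer letter found in the response, or None."""
    matches = _ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def score_multichoice(response: str, target: str) -> dict:
    """Compare the extracted letter with the target letter."""
    predicted = extract_choice(response)
    return {
        "correct": predicted == target.strip().upper(),
        "predicted": predicted,
    }
```

Taking the last match rather than the first is a design choice in this sketch: models that reason before answering often mention candidate letters before committing to a final one.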
External environments via URI schemes: lm-eval://, skills://, vlmevalkit://, gym://, harbor://, container://.
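
One way to picture scheme-based routing is a small helper that splits an environment reference into its scheme and remainder. This is only a sketch: the scheme names come from the list above, while `split_env_uri`, the `"builtin"` label, and the task name in the usage note are hypothetical and not part of the framework's API.

```python
from typing import Tuple

# Scheme names taken from the list above.
KNOWN_SCHEMES = ("lm-eval", "skills", "vlmevalkit", "gym", "harbor", "container")


def split_env_uri(ref: str) -> Tuple[str, str]:
    """Split 'scheme://rest' into (scheme, rest).

    A reference without '://' is treated as a built-in benchmark name
    (the 'builtin' label is an assumption of this sketch).
    """
    scheme, sep, rest = ref.partition("://")
    if not sep:
        return ("builtin", ref)
    if scheme not in KNOWN_SCHEMES:
        raise ValueError(f"unsupported environment scheme: {scheme!r}")
    return (scheme, rest)
```

For example, `split_env_uri("lm-eval://some_task")` returns `("lm-eval", "some_task")`, and `split_env_uri("mmlu")` returns `("builtin", "mmlu")`.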
Built-in local interceptor proxy for LLM traffic. It intercepts all agent-to-model requests for caching, logging, payload modification, turn limiting, and custom transformations, with no external dependencies required.
services:
nemotron:
type: api
url: https://integrate.api.nvidia.com/v1/chat/completions
protocol: chat_completions
model: nvidia/nemotron-3-super-120b-a12b
api_key: ${NVIDIA_API_KEY}
proxy:
request_timeout: 600
interceptors:
- name: turn_counter
config:
max_turns: 100
- name: drop_params
config:
params: [max_tokens]
verbose: true

Available interceptors:
| Interceptor | Stage | Description |
|---|---|---|
| endpoint | request→response | Async HTTP forwarding with retry, backoff, connection pooling |
| caching | request→response | Disk-backed SQLite cache with canonical keys |
| turn_counter | request | Per-session turn counting with budget injection |
| drop_params | request | Strip named parameters from requests |
| modify_tools | request | Add/remove properties in tool schemas |
| system_message | request | Inject/replace/prepend system messages |
| payload_modifier | request | Recursive parameter add/remove/rename |
| raise_client_errors | response | Convert 4xx to exceptions |
| log_tokens | response | Log token usage per request |
| response_stats | response | Aggregate timing and token statistics |
| reasoning | response | Normalize <think> blocks to reasoning_content |
| progress_tracking | response | Progress counter with optional webhook |
| logging | request + response | Request/response logging with body preview |
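
To make the idea of composable request-stage interceptors concrete, here is a minimal sketch of a chain that applies a turn limit and then strips parameters, loosely mirroring the `turn_counter` and `drop_params` entries in the YAML above. Everything here (`Interceptor`, `run_chain`, the factory functions) is hypothetical and simplified: the counter is global rather than per-session, and nothing is forwarded to a model.

```python
from typing import Callable, Dict, List

Request = Dict[str, object]
Interceptor = Callable[[Request], Request]


def drop_params(params: List[str]) -> Interceptor:
    """Return a step that strips the named top-level parameters."""
    def step(request: Request) -> Request:
        return {k: v for k, v in request.items() if k not in params}
    return step


def turn_counter(max_turns: int) -> Interceptor:
    """Return a step that counts calls and fails once the budget is exceeded."""
    turns = 0

    def step(request: Request) -> Request:
        nonlocal turns
        turns += 1
        if turns > max_turns:
            raise RuntimeError(f"turn budget of {max_turns} exceeded")
        return request
    return step


def run_chain(request: Request, chain: List[Interceptor]) -> Request:
    """Apply each request-stage interceptor in order."""
    for step in chain:
        request = step(request)
    return request


chain = [turn_counter(max_turns=2), drop_params(["max_tokens"])]
out = run_chain({"model": "m", "max_tokens": 16, "messages": []}, chain)
```

Because each step takes a request and returns a request, the order of the list is the order of execution, which is the property that makes such a chain composable.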
Solvers are configured via solver.type in each benchmark:
| Solver Type | Config type | Use Case |
|---|---|---|
| SimpleSolver | simple | Standard chat/completion/VLM (default) |
| HarborSolver | harbor | Harbor agents (OpenHands, Terminus-2, etc.) |
| ToolCallingSolver | tool_calling | Tool-use with Gym resource servers |
| GymDelegationSolver | gym_delegation | Delegate to nemo-gym server |
| OpenClawSolver | openclaw | OpenClaw CLI agent |
| ContainerSolver | container | Legacy container harness |
Evaluation results can be exported to experiment trackers and compatible formats:
output:
export: [inspect, wandb, mlflow]

- inspect: Produces inspect_ai-compatible EvalLog JSON files. Install with pip install -e ".[inspect]".
- wandb / mlflow: Push scores and artifacts to experiment trackers. Install with pip install -e ".[export]".
from nemo_evaluator import benchmark, scorer, ScorerInput, exact_match

@benchmark(name="my-bench", dataset="hf://my-org/data?split=test",
           prompt="Q: {question}\nA:", target_field="answer")
@scorer
def my_scorer(sample: ScorerInput) -> dict:
    return exact_match(sample)

Per-problem Docker/SLURM sandboxes for code execution and agentic evaluation. Two modes: stateful (shared sandbox for solve + verify) and stateless (separate agent and verification containers with a shared volume).
Pyxis/Enroot-based execution with auto-selected container images per URI scheme. Uses node_pools topology for flexible resource allocation across model, agent, and sandbox nodes.
| Tag suffix | Contents |
|---|---|
| :latest | Base + gym + vlmevalkit |
| :latest-lm-eval | + lm-evaluation-harness |
| :latest-skills | + NeMo Skills |
| :latest-full | All harnesses |
| Command | Purpose |
|---|---|
| nel eval run | Run evaluation (name or YAML) |
| nel eval merge <dir> | Merge sharded results |
| nel eval report <dir> | Generate reports |
| nel list | List benchmarks |
| nel serve -b <name> | Serve as HTTP endpoint |
| nel validate -b <name> | Sanity check |
| nel export <paths> --dest <exporter> | Export bundles |
| nel cache-sqsh <image> | Build a SLURM .sqsh cache image |
| nel report <dir> | Generate multi-benchmark reports |
| nel compare | Paired run comparison |
| nel gate | Multi-benchmark quality gate |
| nel config | Persistent user config |
| nel package | Containerize BYOB benchmark |
Use nel compare when you want to compare two runs of the same benchmark and inspect score deltas, flips, and statistical evidence.
nel compare ./results/baseline ./results/candidate --strict

Full tutorial: docs/tutorials/compare.md
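
For intuition about what a paired comparison involves, the sketch below computes a score delta, counts per-problem flips, and runs an exact two-sided sign test on the flipped problems (McNemar's exact test). It is an illustration under stated assumptions, not a description of what `nel compare` computes internally: the function name, the input format (one boolean per problem, same order in both runs), and the choice of test are all assumptions.

```python
from math import comb
from typing import Dict, List


def paired_compare(baseline: List[bool], candidate: List[bool]) -> Dict[str, float]:
    """Summarize per-problem differences between two runs of one benchmark."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("runs must be non-empty and cover the same problems")

    # A flip is a problem whose correctness differs between the two runs.
    gains = sum(1 for b, c in zip(baseline, candidate) if not b and c)
    losses = sum(1 for b, c in zip(baseline, candidate) if b and not c)
    flips = gains + losses

    # Exact two-sided sign test: under the null hypothesis, each flip is
    # equally likely to be a gain or a loss.
    if flips == 0:
        p_value = 1.0
    else:
        k = min(gains, losses)
        tail = sum(comb(flips, i) for i in range(k + 1)) / 2 ** flips
        p_value = min(1.0, 2 * tail)

    delta = (sum(candidate) - sum(baseline)) / len(baseline)
    return {"delta": delta, "gains": gains, "losses": losses, "p_value": p_value}
```

Only the flipped problems carry information about which run is better, which is why the test ignores problems that both runs got right or both got wrong.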
Use nel gate when you want one GO / NO-GO / INCONCLUSIVE decision across multiple benchmarks from an explicit policy file.
nel gate ./results/baseline ./results/candidate \
--policy gate_policy.yaml \
--strict \
--output gate_report.json

Full tutorial: docs/tutorials/quality-gate.md
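
The three-way outcome can be pictured with a small decision rule. The sketch below assumes a hypothetical policy that maps each benchmark to a maximum tolerated score drop, and assumes each benchmark's score delta comes with a confidence interval; neither the policy shape nor the rule is taken from the actual `gate_policy.yaml` format, which is documented in the tutorial.

```python
from typing import Dict, Tuple


def decide_benchmark(ci_low: float, ci_high: float, max_drop: float) -> str:
    """Classify one benchmark from the confidence interval of its score delta."""
    if ci_high < -max_drop:
        return "NO-GO"          # even the optimistic bound exceeds the allowed drop
    if ci_low >= -max_drop:
        return "GO"             # even the pessimistic bound stays within the allowed drop
    return "INCONCLUSIVE"       # the interval straddles the threshold


def gate(deltas: Dict[str, Tuple[float, float]], policy: Dict[str, float]) -> str:
    """Combine per-benchmark verdicts into a single decision."""
    verdicts = []
    for name, max_drop in policy.items():
        if name not in deltas:
            verdicts.append("INCONCLUSIVE")  # no evidence for a required benchmark
            continue
        ci_low, ci_high = deltas[name]
        verdicts.append(decide_benchmark(ci_low, ci_high, max_drop))

    if "NO-GO" in verdicts:
        return "NO-GO"
    if "INCONCLUSIVE" in verdicts:
        return "INCONCLUSIVE"
    return "GO"
```

In this sketch a single failing benchmark blocks the release, and missing or ambiguous evidence never produces a GO, which is a conservative choice rather than a documented behavior.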
See examples/configs/ for 25+ end-to-end configs covering all solver types, verification methods, and execution backends.