🤖 feat: upload Terminal-Bench results to BigQuery #1737

ammar-agent · 2026-01-17T20:03:05Z

Add a CI pipeline that persists benchmark results in BigQuery so performance can be tracked over time.

Changes

Add scripts/upload-tbench-results.py to parse Harbor output and upload one row per trial to benchmarks.tbench_results
Add an upload step to the terminal-bench.yml workflow (guarded so it runs only in the coder/mux repo)
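As a rough illustration of the "one row per trial" flattening described above, here is a minimal sketch. It is not the code in scripts/upload-tbench-results.py: the shape of the Harbor output (a run-level dict with a `trials` list), the `build_rows` helper, and the `context` argument are all assumptions made for the example. Only the column names come from this PR.

```python
import json
from datetime import datetime, timezone
from typing import Any


def build_rows(run_result: dict[str, Any], context: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten one benchmark run into one row per trial.

    ASSUMPTION: `run_result` carries run-level fields plus a "trials" list of
    per-task dicts. The real Harbor output layout may differ.
    `context` holds the GitHub and run-config columns (hypothetical argument).
    """
    # Run-level summary without the per-trial payload, stored as a raw JSON blob.
    run_summary = {k: v for k, v in run_result.items() if k != "trials"}
    ingested_at = datetime.now(timezone.utc).isoformat()

    rows = []
    for trial in run_result.get("trials", []):
        rows.append(
            {
                "run_id": run_result.get("run_id"),
                "task_id": trial.get("task_id"),
                **context,
                "accuracy": run_result.get("accuracy"),
                "n_resolved": run_result.get("n_resolved"),
                "n_unresolved": run_result.get("n_unresolved"),
                "passed": trial.get("passed"),
                "score": trial.get("score"),
                "n_input_tokens": trial.get("n_input_tokens"),
                "n_output_tokens": trial.get("n_output_tokens"),
                "run_result_json": json.dumps(run_summary, sort_keys=True),
                "task_result_json": json.dumps(trial, sort_keys=True),
                "ingested_at": ingested_at,
            }
        )
    return rows
```

Missing per-task fields become `None` rather than raising, so a partially populated trial still produces a row and the raw JSON columns keep whatever was actually reported.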

Schema

Single table mux-benchmarks.benchmarks.tbench_results:

Identity: run_id, task_id
GitHub context: github_run_id, github_workflow, github_sha, github_ref, github_actor, github_event_name
Run config: model_name, thinking_level, mode, dataset, experiments
Results: accuracy, n_resolved, n_unresolved
Per-task: passed, score, n_input_tokens, n_output_tokens
Raw JSON: run_result_json, task_result_json (future-proofing)
Ingestion: ingested_at
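For reference, the column list above could be written down as a BigQuery JSON schema roughly as follows. The column names are taken from this description; every column type, the `NULLABLE` mode, and the `to_schema_json` helper are assumptions for illustration, since the PR text does not state them.

```python
import json

# Column names come from the PR description; every type here is an ASSUMPTION.
COLUMNS = [
    # Identity
    ("run_id", "STRING"),
    ("task_id", "STRING"),
    # GitHub context
    ("github_run_id", "STRING"),
    ("github_workflow", "STRING"),
    ("github_sha", "STRING"),
    ("github_ref", "STRING"),
    ("github_actor", "STRING"),
    ("github_event_name", "STRING"),
    # Run config
    ("model_name", "STRING"),
    ("thinking_level", "STRING"),
    ("mode", "STRING"),
    ("dataset", "STRING"),
    ("experiments", "STRING"),
    # Results (run-level)
    ("accuracy", "FLOAT"),
    ("n_resolved", "INTEGER"),
    ("n_unresolved", "INTEGER"),
    # Per-task
    ("passed", "BOOLEAN"),
    ("score", "FLOAT"),
    ("n_input_tokens", "INTEGER"),
    ("n_output_tokens", "INTEGER"),
    # Raw JSON blobs
    ("run_result_json", "STRING"),
    ("task_result_json", "STRING"),
    # Ingestion
    ("ingested_at", "TIMESTAMP"),
]


def to_schema_json(columns: list[tuple[str, str]]) -> str:
    """Render the columns as a BigQuery JSON schema (list of name/type/mode objects)."""
    fields = [{"name": name, "type": type_, "mode": "NULLABLE"} for name, type_ in columns]
    return json.dumps(fields, indent=2)


if __name__ == "__main__":
    print(to_schema_json(COLUMNS))
```

Keeping the run-level results on every trial row means a single table can answer both per-run and per-task questions without a join, at the cost of repeating a few values per row.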

Test Plan

Run workflow_dispatch on this PR branch
Verify BQ table is populated correctly
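One way to do the verification step is to count the rows written for the workflow run that was just dispatched. The sketch below is an assumption about how that check might look, not part of this PR: `build_check_query` and `run_check` are hypothetical helpers, `github_run_id` is assumed to be a STRING column, and `passed` is assumed to be a BOOL. `run_check` needs the google-cloud-bigquery package and credentials with read access to the table.

```python
TABLE = "mux-benchmarks.benchmarks.tbench_results"


def build_check_query(table: str = TABLE) -> str:
    """Build a parameterized query that counts rows per run for one workflow run."""
    return (
        "SELECT run_id, COUNT(*) AS n_rows, COUNTIF(passed) AS n_passed\n"
        f"FROM `{table}`\n"
        "WHERE github_run_id = @github_run_id\n"
        "GROUP BY run_id"
    )


def run_check(github_run_id: str) -> list[dict]:
    """Run the check query and return one dict per run_id.

    Imported lazily so the query builder can be used without the client library.
    """
    from google.cloud import bigquery  # requires google-cloud-bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            # ASSUMPTION: github_run_id is stored as a STRING column.
            bigquery.ScalarQueryParameter("github_run_id", "STRING", github_run_id),
        ]
    )
    result = client.query(build_check_query(), job_config=job_config).result()
    return [dict(row.items()) for row in result]
```

Comparing `n_rows` with the number of trials in the run, and `n_passed` with `n_resolved`, gives a quick sanity check that each trial produced exactly one row.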

Generated with mux • Model: anthropic:claude-opus-4-5 • Thinking: high • Cost: $6.82


github-actions bot added the enhancement and ci labels on Jan 17, 2026

ammar-agent force-pushed the ci-telemetry-y83p branch 2 times, most recently from 9f16b47 to 1e0136e, on January 17, 2026 23:33

ammar-agent force-pushed the ci-telemetry-y83p branch from 1e0136e to 90ff717 on January 17, 2026 23:46

ammario merged commit 14e556b into main Jan 18, 2026
22 checks passed

ammario deleted the ci-telemetry-y83p branch January 18, 2026 00:01

2 participants
