@ammar-agent (Collaborator)

Add CI pipeline to persist benchmark results in BigQuery for tracking performance over time.

Changes

  • Add scripts/upload-tbench-results.py to parse Harbor output and upload one row per trial to benchmarks.tbench_results
  • Add an upload step to the terminal-bench.yml workflow (guarded so it runs only in the coder/mux repo)

Schema

Single table mux-benchmarks.benchmarks.tbench_results:

  • Identity: run_id, task_id
  • GitHub context: github_run_id, github_workflow, github_sha, github_ref, github_actor, github_event_name
  • Run config: model_name, thinking_level, mode, dataset, experiments
  • Results: accuracy, n_resolved, n_unresolved
  • Per-task: passed, score, n_input_tokens, n_output_tokens
  • Raw JSON: run_result_json, task_result_json (future-proofing)
  • Ingestion: ingested_at
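
As an illustration of the schema above, here is a minimal sketch of how one row per trial might be assembled. It is not the code in scripts/upload-tbench-results.py: the function names (`github_context`, `build_row`) and the shape of the `run_config`, `run_result`, and `task_result` dicts are assumptions, since the Harbor output format is not shown here. The column names come from the list above, and the GitHub values are read from the default GitHub Actions environment variables.

```python
import json
import os
from datetime import datetime, timezone

# Column groups copied from the schema list in the PR description.
GITHUB_COLUMNS = (
    "github_run_id", "github_workflow", "github_sha",
    "github_ref", "github_actor", "github_event_name",
)
RUN_CONFIG_COLUMNS = ("model_name", "thinking_level", "mode", "dataset", "experiments")
RUN_RESULT_COLUMNS = ("accuracy", "n_resolved", "n_unresolved")
TASK_COLUMNS = ("passed", "score", "n_input_tokens", "n_output_tokens")


def github_context(env=os.environ):
    """Map each github_* column to its GITHUB_* Actions variable (None if unset)."""
    return {col: env.get(col.upper()) for col in GITHUB_COLUMNS}


def build_row(run_id, run_config, run_result, task_result, env=os.environ, now=None):
    """Build one table row for a single trial.

    The input dicts are hypothetical stand-ins for parsed Harbor output;
    any missing key becomes None rather than raising.
    """
    now = now or datetime.now(timezone.utc)
    row = {"run_id": run_id, "task_id": task_result.get("task_id")}
    row.update(github_context(env))
    row.update({col: run_config.get(col) for col in RUN_CONFIG_COLUMNS})
    row.update({col: run_result.get(col) for col in RUN_RESULT_COLUMNS})
    row.update({col: task_result.get(col) for col in TASK_COLUMNS})
    # Keep the raw payloads so new fields can be extracted later.
    row["run_result_json"] = json.dumps(run_result, sort_keys=True)
    row["task_result_json"] = json.dumps(task_result, sort_keys=True)
    row["ingested_at"] = now.isoformat()
    return row
```

Storing the raw JSON next to the extracted columns means a field that is not yet a column can still be recovered from rows that were already ingested.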

Test Plan

  • Run workflow_dispatch on this PR branch
  • Verify BQ table is populated correctly
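
One way the second check could be done is with a per-run summary query. The sketch below only builds the SQL text; the helper name `verification_query` is hypothetical, and `COUNTIF(passed)` assumes the `passed` column is a BOOL, which the PR does not state. The table and column names are taken from the schema above.

```python
# Fully qualified table name as given in the Schema section.
TABLE = "mux-benchmarks.benchmarks.tbench_results"


def verification_query(table=TABLE):
    """Return SQL that summarizes the rows uploaded by one GitHub Actions run.

    @github_run_id is a named query parameter to be bound by the caller.
    """
    return (
        "SELECT run_id, COUNT(*) AS n_rows, COUNTIF(passed) AS n_passed, "
        "MAX(ingested_at) AS last_ingested "
        f"FROM `{table}` "
        "WHERE github_run_id = @github_run_id "
        "GROUP BY run_id"
    )


if __name__ == "__main__":
    print(verification_query())
```

For a workflow_dispatch run, `n_rows` would be expected to match the number of trials in that run, given that the script uploads one row per trial.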

Generated with mux • Model: anthropic:claude-opus-4-5 • Thinking: high • Cost: $6.82

@github-actions bot added the enhancement and ci labels Jan 17, 2026
@ammar-agent force-pushed the ci-telemetry-y83p branch 2 times, most recently from 9f16b47 to 1e0136e on January 17, 2026 23:33
Add CI pipeline to persist benchmark results in BigQuery for tracking
performance over time.

Changes:
- Add scripts/upload-tbench-results.py to parse Harbor output and upload
  one row per trial to benchmarks.tbench_results
- Add upload step to terminal-bench.yml workflow (guarded for coder/mux)

Schema: run_id, task_id, github context (6 cols), run config (5 cols),
results (accuracy, n_resolved, n_unresolved), per-task (passed, score,
tokens), raw JSON blobs for future-proofing.
@ammario merged commit 14e556b into main Jan 18, 2026
22 checks passed
@ammario deleted the ci-telemetry-y83p branch January 18, 2026 00:01