Add vLLM Buildkite CI failure report tool #8014
This might be better off living in the vllm/ repo in the long term. I will just add it to test-infra for now, and it can be moved elsewhere later.
ffcf29d to 02173aa
    required=True,
    help="Branch name, e.g. atalman:release_212_tests",
)
parser.add_argument("--token", required=True, help="Buildkite API token")
It's worth double-checking whether ClickHouse could be used for this, since we have all the Buildkite CI signals in ClickHouse. Using a Buildkite API token is fine, but I guess not many of us have access to one, while ClickHouse is more widely available.
This tool is mainly for investigating PyTorch releases. Potentially it can be used in https://github.com/vllm-project/vllm-dashboard for us to investigate release errors.
FYI, Kevin shared this WIP dashboard earlier: https://vllm-ci-dashboard.vercel.app. I can see the signals from @atalman's PR there. It doesn't have the logs, though.
Ah yes, I'm thinking of adding more features to the dashboard so it can serve as a release investigation tool.
Script to fetch the latest Buildkite CI build for a given branch, extract all failed steps with their failure reasons, and provide direct links to the relevant log lines. Authored with Claude.
02173aa to 7f54bf6
huydhn left a comment
Let's also create a skill for this in https://github.com/pytorch/test-infra/tree/main/.claude/skills
Script to fetch the latest vllm Buildkite CI build for a given branch, extract all failed steps with their failure reasons, and provide direct links to the relevant log lines.
You need to pass a Buildkite token for this (see the README). Simply log in as a Buildkite user and generate an access token:
https://buildkite.com/user/api-access-tokens
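For readers who want a feel for the flow described above, here is a minimal sketch. It is not the script added in this PR. It assumes the Buildkite REST API v2 builds endpoint with a Bearer token, and the job fields `type`, `state`, `name`, `exit_status` and `web_url` as I understand them from the Buildkite docs; please verify those before relying on it. The function names, the `FAILED_STATES` set, and the `vllm` / `ci` slugs are hypothetical placeholders.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Buildkite REST API v2 root.
API_ROOT = "https://api.buildkite.com/v2"

# Assumption: job states treated as failures in this sketch.
FAILED_STATES = {"failed", "timed_out"}


def latest_build_url(org, pipeline, branch):
    """Build the URL that lists only the most recent build of a branch."""
    query = urllib.parse.urlencode({"branch": branch, "per_page": 1})
    return f"{API_ROOT}/organizations/{org}/pipelines/{pipeline}/builds?{query}"


def fetch_latest_build(org, pipeline, branch, token):
    """Return the newest build as a dict, or None if the branch has no builds."""
    request = urllib.request.Request(
        latest_build_url(org, pipeline, branch),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        builds = json.load(response)
    return builds[0] if builds else None


def failed_jobs(build):
    """Pick out command ("script") jobs whose state counts as a failure."""
    failed = []
    for job in build.get("jobs", []):
        if job.get("type") == "script" and job.get("state") in FAILED_STATES:
            failed.append(
                {
                    "name": job.get("name"),
                    "exit_status": job.get("exit_status"),
                    "web_url": job.get("web_url"),
                }
            )
    return failed


def report(org, pipeline, branch, token):
    """Print one line per failed job of the latest build on the branch."""
    build = fetch_latest_build(org, pipeline, branch, token)
    if build is None:
        print(f"No builds found for {branch}")
        return
    for job in failed_jobs(build):
        print(f"{job['name']} (exit {job['exit_status']}): {job['web_url']}")
```

Usage would look like `report("vllm", "ci", "atalman:release_212_tests", token)`, with the org and pipeline slugs replaced by the real ones. Keeping `failed_jobs` free of network calls makes it easy to unit test, and it could be reused if the data came from ClickHouse instead of the Buildkite API.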
This can potentially be used to auto-detect failures after a release run. Ideal route:
example local result