feat: [Collector] Kernel power measurement support [1/N] #104
Conversation
collector/helper.py
Outdated
| """ | ||
| # Map device name to system file | ||
| device_upper = device_name.upper() | ||
| if "H100" in device_upper: |
Can we reduce the number of if/else branches with a prefix match?
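For illustration, the prefix-match idea could look roughly like the sketch below (the table keys and file names are made up, not the collector's actual mapping):

```python
# Illustrative sketch of the prefix-match suggestion above; keys and file
# names are hypothetical, not the collector's actual mapping.
DEVICE_FILE_MAP = {
    "H100": "h100_system.yaml",
    "A100": "a100_system.yaml",
    "L40S": "l40s_system.yaml",
}

def resolve_device_file(device_name: str) -> str:
    """Map a device name to its system file via a substring/prefix match."""
    device_upper = device_name.upper()
    for key, system_file in DEVICE_FILE_MAP.items():
        if key in device_upper:
            return system_file
    raise ValueError(f"Unknown device: {device_name}")
```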
--backend trtllm \
--ops gemm attention_context attention_generation \
--measure-power \
--power-limits 700 500 300 \
What are the typical power limits? Shall we provide them as defaults so the collected data points can cover the majority of cases?
GPUs are shipped in float-clock-cap-power mode, so we have a default power limit (e.g., 700W for H100), which is on the product spec sheet.
Also, different GPU models have quite different default power limits. 400W for A100, 700W for H100, etc. If not specified, we could query the GPU's default (max) power limit and just use that as the default.
I think the challenging part is whether or not to make power measurement the default. It does inflate benchmarking time, since each memory-bound kernel is measured for three seconds (configurable). Though if someone decides to benchmark anyway, spending some extra time for a more complete measurement could make sense.
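As a sketch of the "query the default limit" idea (assuming pynvml is available; this is not code from the PR):

```python
# Sketch: fall back to the GPU's factory default power limit when the user
# does not pass --power-limits. Assumes pynvml is installed; not PR code.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
# Default power management limit, reported in milliwatts (e.g., ~700000 on H100 SXM).
default_limit_mw = pynvml.nvmlDeviceGetPowerManagementDefaultLimit(handle)
print(f"Default power limit: {default_limit_mw / 1000:.0f} W")
pynvml.nvmlShutdown()
```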
How would you like to use compute_bound in aiconfigurator.sdk? We have a SOL_FULL mode of the database for the query_op functions. E.g., you can get sol_time, sol_math, sol_mem = db.query_gemm(SOL_MODE=SOL_FULL).
You can then check whether it's compute bound or memory bound by checking whether sol_time equals sol_math or sol_mem.
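A minimal sketch of that check, assuming the three SOL values were already obtained from the database query mentioned above (the exact query_gemm signature lives in aiconfigurator.sdk and is not reproduced here):

```python
# Assumes: sol_time, sol_math, sol_mem = db.query_gemm(SOL_MODE=SOL_FULL)
def classify_boundness(sol_time: float, sol_math: float, sol_mem: float) -> str:
    """Return 'compute' if the SOL time is set by math throughput, else 'memory'."""
    return "compute" if sol_time == sol_math else "memory"
```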
If we already have a way of deciding boundness, I do agree that we want to re-use the code.
Yeah I will use the database query! I wasn't aware that this existed.
Thank you for the comments! Just a quick update -- I think there's some chance I could get some A100 GPU time tomorrow or Wed, so I will apply suggestions, test it out, and ping you again.
collector/README.md
Outdated
### Requirements

- Zeus (`pip install zeus`) for measurement & GPU power limit control
@jaywonchung, I took a brief look at the Zeus codebase. Zeus seems to be an abstraction layer on top of NVML. Conceptually, I kind of agree with the extra abstraction layer because it makes it easier to add other vendors' device support. My key question here: do we commit to maintaining this lib over a long-term horizon? On the other hand, NVML is part of Nvidia's official library suite that enables customers to interact with the GPU. Maybe we want to use NVML directly here?
@tianhaox for viz and suggestions on introducing a new dependency here.
Yeah, for GPUs, it's an abstraction layer over NVML/AMDSMI plus CUDA_VISIBLE_DEVICES/HIP_VISIBLE_DEVICES index remapping (NVML sits below the CUDA application layer, so it ignores those variables). Zeus's measurement API hasn't changed in a very long time, since we know that's the API people use most often.
But I totally understand the concerns around adding a new dependency. I'd be happy to go with whatever the maintainers decide. I'll pause testing in case we decide to switch to pure NVML, since that requires some extra code.
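For reference, a rough side-by-side of the two options under discussion (a hedged sketch, not this PR's code; assumes one visible GPU and that zeus and pynvml are installed):

```python
# Zeus path: energy over a measurement window, honoring CUDA_VISIBLE_DEVICES.
from zeus.monitor import ZeusMonitor

def run_kernel():
    # Placeholder for the op being benchmarked.
    pass

monitor = ZeusMonitor(gpu_indices=[0])
monitor.begin_window("kernel")
run_kernel()
measurement = monitor.end_window("kernel")
avg_power_w = measurement.total_energy / measurement.time  # joules / seconds

# Pure-NVML path: instantaneous power draw, indexed in NVML (not CUDA) order.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # milliwatts -> watts
pynvml.nvmlShutdown()
```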
I just came up with a patchset to migrate to NVML. I ran some simple tests on my end and it seems to work fine.
`git apply kaim-zeus2nvml.patch` on top of this branch will apply the changes.
Overview:
The broader context is enabling power-aware planning in power-constrained scenarios. For that, we want to start with kernel power benchmarking.
This PR is the first step towards kernel power measurement support. It begins with the trtllm backend and the basic GEMM and attention ops, plus NCCL kernels. Apologies for the delay; the OSDI deadline and other work left little bandwidth for me :(
Details:
I'm hoping the changes in collector/README.md provide a bit more detail. I'm using Zeus because it's convenient (e.g., it obeys CUDA_VISIBLE_DEVICES), it's not a very heavy dependency, and it works out of the box on AMD GPUs, which is nice to have as an open-source project. This is a cleaned-up version of what I ran before, but I no longer have the GPU node to test the refactored version properly. Since it's unclear when I'll get GPUs again, I decided to post this as a draft PR anyway to get feedback on the general structure.
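To make the measurement flow concrete, here is a hedged sketch of the per-kernel power loop discussed in this thread (the helper name is hypothetical; the collector's actual code may differ): launch the op repeatedly for a fixed window (~3 s by default, per the discussion above) and report energy divided by time.

```python
import time
from zeus.monitor import ZeusMonitor

def measure_kernel_power(launch_op, duration_s: float = 3.0, gpu_index: int = 0) -> float:
    """Average power (W) while repeatedly launching `launch_op` for `duration_s` seconds."""
    monitor = ZeusMonitor(gpu_indices=[gpu_index])
    monitor.begin_window("op_power")
    start = time.time()
    while time.time() - start < duration_s:
        launch_op()  # real code would also synchronize the GPU stream
    measurement = monitor.end_window("op_power")
    return measurement.total_energy / measurement.time
```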
Where should the reviewer start?
README.md -> Structure changes in collector.py -> Each op. cc @jasonqinzhou