
Conversation


@jaywonchung jaywonchung commented Nov 13, 2025

Overview:

The broader context is enabling power-aware planning in power-constrained scenarios. For that, we want to start with kernel power benchmarking.

This PR is the first step towards kernel power measurement support. It begins with trtllm backend and the basic GEMM and attention ops, and NCCL kernels.

Apologies for the delay. OSDI deadline and other work left little bandwidth for me :(

Details:

I hope the changes in collector/README.md provide a bit more detail. I'm using Zeus because it's convenient (e.g., it obeys CUDA_VISIBLE_DEVICES), it's not a very heavy dependency, and it works out of the box for AMD GPUs, which is nice to have as an open-source project.

This is a cleaned-up version of code that did run before, but I no longer have the GPU node to properly test this refactored version. Since it's unclear when I'll get GPUs again, I decided to post this anyway as a draft PR to receive feedback on the general structure.

Where should the reviewer start?

README.md -> Structure changes in collector.py -> Each op.

cc. @jasonqinzhou


copy-pr-bot bot commented Nov 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jaywonchung jaywonchung changed the title [Collector] Kernel power measurement support [1/N] feat: [Collector] Kernel power measurement support [1/N] Nov 13, 2025
@github-actions github-actions bot added the feat label Nov 13, 2025
"""
# Map device name to system file
device_upper = device_name.upper()
if "H100" in device_upper:
Contributor

Can we reduce the number of if/else branches with a prefix match?
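The reviewer's suggestion could look something like the following sketch, which replaces the if/else chain with a lookup table keyed by device-name substrings. The keys and file names here are illustrative assumptions, not the actual mapping in collector.py.

```python
# Sketch: map device-name substrings to system files via a single lookup
# table instead of a chain of "if 'H100' in device_upper" branches.
# The keys and file names below are hypothetical.
DEVICE_SYSTEM_FILES = {
    "H200": "h200.yaml",
    "H100": "h100.yaml",
    "A100": "a100.yaml",
}


def system_file_for(device_name: str) -> str:
    """Return the system file whose key appears in the device name."""
    device_upper = device_name.upper()
    for key, system_file in DEVICE_SYSTEM_FILES.items():
        if key in device_upper:
            return system_file
    raise ValueError(f"Unsupported device: {device_name}")
```

Adding support for a new GPU then becomes a one-line table entry rather than another branch.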

--backend trtllm \
--ops gemm attention_context attention_generation \
--measure-power \
--power-limits 700 500 300 \
Contributor

What are the typical power limits? Shall we provide them as defaults so the collected data points can cover the majority of cases?


GPUs are shipped in float-clock-cap-power mode, so we have a default power limit (e.g., 700 W for H100), which is on the product spec sheet.

Author

Also, different GPU models have quite different default power limits. 400W for A100, 700W for H100, etc. If not specified, we could query the GPU's default (max) power limit and just use that as the default.

I think the challenging part is whether or not to make power measurement the default. It does inflate benchmarking time since each memory-bound kernel is measured for three seconds (configurable). Though if someone decided to benchmark anyway, spending some extra time for a more complete measurement could make sense.

Contributor

How would you like to use compute_bound in aiconfigurator.sdk? We have a SOL_FULL mode of the database for the query_op functions. E.g., you can get sol_time, sol_math, sol_mem = db.query_gemm(SOL_MODE=SOL_FULL), and then check whether it's compute bound or memory bound by checking whether sol_time equals sol_math or sol_mem.
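The check described above could be sketched as a small helper. The db.query_gemm(SOL_MODE=SOL_FULL) call comes from the comment; the helper name and the floating-point tolerance are illustrative assumptions.

```python
import math


# Hedged sketch of the boundness check: a kernel is compute bound when its
# SOL time is determined by the math term, and memory bound when it is
# determined by the memory term. The tolerance is an assumption to guard
# against floating-point noise in the database values.
def is_compute_bound(sol_time: float, sol_math: float, sol_mem: float,
                     rel_tol: float = 1e-9) -> bool:
    """Return True if sol_time is (approximately) equal to sol_math."""
    return math.isclose(sol_time, sol_math, rel_tol=rel_tol)
```

Usage would then follow the comment's example, e.g. `sol_time, sol_math, sol_mem = db.query_gemm(SOL_MODE=SOL_FULL)` followed by `is_compute_bound(sol_time, sol_math, sol_mem)`.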


If we already have a way of deciding boundness, I do agree that we want to re-use the code.

Author

Yeah I will use the database query! I wasn't aware that this existed.

@jaywonchung
Author

Thank you for the comments! Just a quick update -- I think there's some chance I could get some A100 GPU time tomorrow or Wed, so I will apply suggestions, test it out, and ping you again.


### Requirements

- Zeus (`pip install zeus`) for measurement & GPU power limit control


@jaywonchung, I took a brief look at the Zeus codebase. Zeus seems to be an abstraction layer on top of NVML. Conceptually, I kind of agree with this extra abstraction-layer idea because it makes it easier to add other vendors' device support. My key question here: do we commit to maintaining this lib over a long-term horizon? On the other hand, NVML is part of NVIDIA's official library suite that enables customers to interact with GPUs. Maybe we want to use NVML directly here?
cc @tianhaox for visibility and suggestions on introducing a new dependency here.

Author

Yeah, for GPUs, it's an abstraction layer over NVML/AMDSMI plus CUDA_VISIBLE_DEVICES/HIP_VISIBLE_DEVICES index remapping (because NVML sits below the CUDA application layer, it ignores those variables). Zeus's measurement API hasn't changed in a very long time, since we know that's the API people use most often.

But I totally understand the concerns around adding a new dependency. I'd be happy to go with whatever the maintainers decide. I'll pause testing in case we decide to switch to pure NVML, since that requires some extra code.

@kaim-eng kaim-eng Nov 17, 2025

I just came up with a patchset to migrate to NVML. I ran some simple tests and it seems to work fine on my end.
`git apply kaim-zeus2nvml.patch` to replay the changes on top of this branch will do.

kaim-zeus2nvml.patch
zeus2nvml_MIGRATION_NOTES.md

@jaywonchung jaywonchung force-pushed the jw-collector-power-measurement-support branch from 03fa5d1 to c1295dc Compare November 20, 2025 15:41
jaywonchung and others added 10 commits November 20, 2025 08:34
Signed-off-by: Jae-Won Chung <[email protected]>
Signed-off-by: Kai Ma <[email protected]>
@kaim-eng kaim-eng force-pushed the jw-collector-power-measurement-support branch from 7e7494c to d63c47d Compare November 20, 2025 16:36