
Conversation


@jaywonchung jaywonchung commented Nov 13, 2025

Overview:

The broader context is enabling power-aware planning in power-constrained scenarios. For that, we want to start with kernel power benchmarking.

This PR is the first step towards kernel power measurement support. It begins with trtllm backend and the basic GEMM and attention ops, and NCCL kernels.

Apologies for the delay. OSDI deadline and other work left little bandwidth for me :(

Details:

I hope the changes in collector/README.md provide a bit more detail. I'm using Zeus because it's convenient (e.g., it obeys CUDA_VISIBLE_DEVICES), it's not a very heavy dependency, and it works out of the box for AMD GPUs, which is nice to have as an open-source project.

This is a cleaned-up version of code that did run before, but I no longer have the GPU node to properly test this refactored version. Since it's unclear when I'll get GPUs again, I decided to post this anyway as a draft PR to receive feedback on the general structure.

Where should the reviewer start?

README.md -> Structure changes in collector.py -> Each op.

cc. @jasonqinzhou


copy-pr-bot bot commented Nov 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jaywonchung jaywonchung changed the title [Collector] Kernel power measurement support [1/N] feat: [Collector] Kernel power measurement support [1/N] Nov 13, 2025
@github-actions github-actions bot added the feat label Nov 13, 2025
"""
# Map device name to system file
device_upper = device_name.upper()
if "H100" in device_upper:
Contributor

Can we reduce the number of if/else branches with a prefix match?
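The reviewer's suggestion could look something like the following sketch, which replaces the if/else chain with a lookup table keyed by device-name substrings. The keys and file names here are illustrative assumptions, not the actual mapping in collector.py.

```python
# Sketch: map device-name substrings to system files via a single lookup
# table instead of a chain of "if 'H100' in device_upper" branches.
# The keys and file names below are hypothetical.
DEVICE_SYSTEM_FILES = {
    "H200": "h200.yaml",
    "H100": "h100.yaml",
    "A100": "a100.yaml",
}


def system_file_for(device_name: str) -> str:
    """Return the system file whose key appears in the device name."""
    device_upper = device_name.upper()
    for key, system_file in DEVICE_SYSTEM_FILES.items():
        if key in device_upper:
            return system_file
    raise ValueError(f"Unsupported device: {device_name}")
```

Adding support for a new GPU then becomes a one-line table entry rather than another branch.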

--backend trtllm \
--ops gemm attention_context attention_generation \
--measure-power \
--power-limits 700 500 300 \
Contributor

What are the typical power limits? Shall we provide them as defaults so the collected data points can cover the majority of cases?


GPUs are shipped in float-clock-cap-power mode, so we have a default power limit (e.g., 700 W for H100), which is on the product spec sheet.

Author

Also, different GPU models have quite different default power limits. 400W for A100, 700W for H100, etc. If not specified, we could query the GPU's default (max) power limit and just use that as the default.

I think the challenging part is whether or not to make power measurement the default. It does inflate benchmarking time since each memory-bound kernel is measured for three seconds (configurable). Though if someone decided to benchmark anyway, spending some extra time for a more complete measurement could make sense.

Contributor

How would you like to use compute_bound in aiconfigurator.sdk? We have a SOL_FULL mode of the database for the query_op functions. E.g., you can get sol_time, sol_math, sol_mem = db.query_gemm(SOL_MODE=SOL_FULL), and then check whether it's compute bound or memory bound by checking whether sol_time equals sol_math or sol_mem.
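The check described above could be sketched as a small helper. The db.query_gemm(SOL_MODE=SOL_FULL) call comes from the comment; the helper name and the floating-point tolerance are illustrative assumptions.

```python
import math


# Hedged sketch of the boundness check: a kernel is compute bound when its
# SOL time is determined by the math term, and memory bound when it is
# determined by the memory term. The tolerance is an assumption to guard
# against floating-point noise in the database values.
def is_compute_bound(sol_time: float, sol_math: float, sol_mem: float,
                     rel_tol: float = 1e-9) -> bool:
    """Return True if sol_time is (approximately) equal to sol_math."""
    return math.isclose(sol_time, sol_math, rel_tol=rel_tol)
```

Usage would then follow the comment's example, e.g. `sol_time, sol_math, sol_mem = db.query_gemm(SOL_MODE=SOL_FULL)` followed by `is_compute_bound(sol_time, sol_math, sol_mem)`.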


If we already have a way of deciding boundness, I do agree that we want to re-use the code.

Author

Yeah I will use the database query! I wasn't aware that this existed.

@jaywonchung
Author

Thank you for the comments! Just a quick update -- I think there's some chance I could get some A100 GPU time tomorrow or Wed, so I will apply suggestions, test it out, and ping you again.


### Requirements

- Zeus (`pip install zeus`) for measurement & GPU power limit control


@jaywonchung, I took a brief look at the Zeus codebase. Zeus seems to be an abstraction layer on top of NVML. Conceptually, I kind of agree with this extra abstraction-layer idea because it makes it easier to add other vendors' device support. My key question here: do we commit to maintaining this lib over a long-term horizon? On the other hand, NVML is part of NVIDIA's official library suite that enables customers to interact with GPUs. Maybe we want to use NVML directly here?
cc @tianhaox for visibility and suggestions on introducing a new dependency here.

Author

Yeah, for GPUs, it's an abstraction layer over NVML/AMDSMI plus CUDA_VISIBLE_DEVICES/HIP_VISIBLE_DEVICES index remapping (because NVML sits below the CUDA application layer, it ignores those variables). Zeus's measurement API hasn't changed in a very long time, since we know that's the API people use most often.

But I totally understand the concerns around adding a new dependency. I'd be happy to go with whatever the maintainers decide. I'll pause testing in case we decide to switch to pure NVML, since that requires some extra code.

@kaim-eng kaim-eng Nov 17, 2025

I just came up with a patchset to migrate to NVML. I ran some simple tests and it seems to work fine on my end.
`git apply kaim-zeus2nvml.patch` to replay the changes on top of this branch will do.

kaim-zeus2nvml.patch
zeus2nvml_MIGRATION_NOTES.md

@jaywonchung jaywonchung force-pushed the jw-collector-power-measurement-support branch from 03fa5d1 to c1295dc Compare November 20, 2025 15:41
jaywonchung and others added 10 commits November 20, 2025 08:34
Signed-off-by: Jae-Won Chung <[email protected]>
Signed-off-by: Kai Ma <[email protected]>
@kaim-eng kaim-eng force-pushed the jw-collector-power-measurement-support branch from 7e7494c to d63c47d Compare November 20, 2025 16:36