Compute-utilization monitoring is abnormal with the NVIDIA open-source driver #54

Open
Mrxuexi opened this issue Feb 27, 2025 · 3 comments

Mrxuexi commented Feb 27, 2025

Environment

Driver version

Driver Version: 535.161.07

CUDA version

CUDA Version: 12.2

GPU model

NVIDIA GeForce RTX 4090

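Problem

With the NVIDIA open-source driver, compute-utilization monitoring behaves abnormally: the reported utilization is 0, so the compute limit is not enforced.

Workload

The benchmark from the github.com/tensorflow/benchmarks project: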

python3 /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --forward_only=True --allow_growth=True --model=resnet50 --num_batches=100000 --batch_size=64

Container image:

nvidia/tensorflow:21.08-tf1-py3

Monitoring metrics context

```
Device_last_kernel_of_container{ctrname="container-0",deviceuuid="GPU-460f0295-3254-3747-1174-571caee60d31",podname="20-l76fr",podnamespace="default",vdeviceid="0",zone="vGPU"} 0
Device_last_kernel_of_container{ctrname="container-0",deviceuuid="GPU-460f0295-3254-3747-1174-571caee60d31",podname="30-q6qcq",podnamespace="default",vdeviceid="0",zone="vGPU"} 0
Device_last_kernel_of_container{ctrname="container-0",deviceuuid="GPU-460f0295-3254-3747-1174-571caee60d31",podname="50-c97x2",podnamespace="default",vdeviceid="0",zone="vGPU"} 0
```
#53

## Troubleshooting notes
### Key log excerpts from HAMi-core
```bash
[HAMI-core Debug(1497:140544248379136:multiprocess_utilization_watcher.c:224)]: userutil=0 currentcores=5214590 total=6291456 limit=5 share=6291456

[HAMI-core Info(2623:139981076604736:multiprocess_memory_limit.c:296)]: dev=0 pid=2623 host pid=2623 i=0
```
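When scanning many such log lines, it can help to pull the `key=value` fields out programmatically. The following is a minimal sketch, not part of HAMi-core: `parse_fields` and the `suspicious` check are hypothetical helpers, and the check is only meaningful while a GPU workload is known to be running, since an idle process can legitimately report zero utilization.

```python
import re

# Sample line copied from the log excerpt in this issue.
LINE = (
    "[HAMI-core Debug(1497:140544248379136:multiprocess_utilization_watcher.c:224)]: "
    "userutil=0 currentcores=5214590 total=6291456 limit=5 share=6291456"
)


def parse_fields(line):
    """Return the integer key=value fields found in one log line (hypothetical helper)."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


fields = parse_fields(LINE)

# A non-zero limit together with userutil=0 is the symptom reported here,
# assuming the benchmark was busy when the line was logged.
suspicious = fields["limit"] > 0 and fields["userutil"] == 0
print(fields)
print("suspicious:", suspicious)
```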

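Analysis

userutil=0 turns out to be caused by `proc = find_proc_by_hostpid(processes_sample[i].pid);` returning NULL.

Combined with the second log line above, pid=2623 and host pid=2623 are both the in-container PID.

From this we found that, with our self-built open-source NVIDIA driver, nvmlDeviceGetComputeRunningProcesses returns what appears to be the in-container PID of the benchmark application.
nvmlDeviceGetProcessUtilization, on the other hand, returns what we confirmed to be the host PID.
As a result, in the get_used_gpu_utilization function in HAMi-core's src/multiprocess/multiprocess_utilization_watcher.c, the sm_utils obtained via nvmlDeviceGetProcessUtilization cannot be matched to any process, and the final computed sm_util is 0.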
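The failure mode described above can be illustrated with a small model. This is a sketch in Python, not the actual HAMi-core C code: `ProcSlot`, `UtilSample`, and `sum_sm_util` are hypothetical names, the body of `find_proc_by_hostpid` is a guess at its behavior based only on its name, and the host PID 48211 and the 97% utilization are made-up values.

```python
from dataclasses import dataclass


@dataclass
class ProcSlot:
    """Hypothetical stand-in for a per-process record kept by the limiter."""
    pid: int      # PID as seen inside the container
    hostpid: int  # PID the limiter believes is the host PID


@dataclass
class UtilSample:
    """Hypothetical stand-in for one per-process utilization sample."""
    pid: int      # PID reported by the driver for this sample
    sm_util: int  # SM utilization in percent


def find_proc_by_hostpid(slots, hostpid):
    """Return the slot whose hostpid matches, or None if there is none."""
    return next((slot for slot in slots if slot.hostpid == hostpid), None)


def sum_sm_util(slots, samples):
    """Add up sm_util only for samples that map to a known slot."""
    total = 0
    for sample in samples:
        if find_proc_by_hostpid(slots, sample.pid) is None:
            continue  # unmatched sample is skipped and contributes nothing
        total += sample.sm_util
    return total


# Made-up numbers: the benchmark is PID 2623 in the container, 48211 on the host.
samples = [UtilSample(pid=48211, sm_util=97)]

# Case 1: hostpid was filled with the container PID, as the log line suggests.
mismatched = [ProcSlot(pid=2623, hostpid=2623)]
# Case 2: hostpid holds the real host PID.
matched = [ProcSlot(pid=2623, hostpid=48211)]

print(sum_sm_util(mismatched, samples))
print(sum_sm_util(matched, samples))
```

In this model the busy process is invisible to the sum whenever the two NVML calls disagree about which PID namespace they report, which is consistent with the userutil=0 seen in the log.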

@chaunceyjiang
Contributor

/assign @chaunceyjiang

I will test with driver 535.161.07. I currently don't have a 4090, only a T4 card.

@Mrxuexi
Author

Mrxuexi commented Mar 3, 2025

According to a colleague's analysis, this may be related to the following commit:
NVIDIA/open-gpu-kernel-modules@f59818b#diff-508f4f46a83e851c6a5ffcad176811af0859d5b9190af474bd21119cd0a5b14a

@jiangxiaobin96

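There is one thing I don't quite understand: with --pid=host set, a process's PID inside the container should be the same as its PID on the host, so why don't they match here?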
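One way to check whether a container process really shares the host PID namespace is the `NSpid` line in `/proc/<pid>/status` on Linux, which lists the process's PID in each PID namespace it belongs to, outermost first. The sketch below only parses that line; `parse_nspid` is a hypothetical helper and both status excerpts are made-up examples, not output from the affected machine.

```python
def parse_nspid(status_text):
    """Return the PIDs listed on the NSpid line of a /proc/<pid>/status dump."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(part) for part in line.split()[1:]]
    return []  # no NSpid line found (for example, on an older kernel)


# Made-up excerpts. With a private PID namespace the line has two entries
# (outer namespace first); with a shared host namespace it has one.
private_ns = "Name:\tpython3\nPid:\t48211\nNSpid:\t48211\t2623\n"
shared_ns = "Name:\tpython3\nPid:\t2623\nNSpid:\t2623\n"

print(parse_nspid(private_ns))
print(parse_nspid(shared_ns))
```

On the host, the same function could be fed the contents of `/proc/<pid>/status` for the benchmark process. A single entry would suggest the container does share the host PID namespace; two entries would suggest it does not.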
