## Environment

- Driver version: 535.161.07
- CUDA version: 12.2
- GPU model: NVIDIA GeForce RTX 4090

## Problem

With the NVIDIA open-source GPU kernel driver, compute-utilization monitoring becomes abnormal: the reported utilization is 0, so the compute (SM utilization) limit no longer takes effect.

### Workload

Benchmark from the github.com/tensorflow/benchmarks project:

python3 /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --forward_only=True --allow_growth=True --model=resnet50 --num_batches=100000 --batch_size=64

### Container image

nvidia/tensorflow:21.08-tf1-py3
### Monitoring metric context

```
Device_last_kernel_of_container{ctrname="container-0",deviceuuid="GPU-460f0295-3254-3747-1174-571caee60d31",podname="20-l76fr",podnamespace="default",vdeviceid="0",zone="vGPU"} 0
Device_last_kernel_of_container{ctrname="container-0",deviceuuid="GPU-460f0295-3254-3747-1174-571caee60d31",podname="30-q6qcq",podnamespace="default",vdeviceid="0",zone="vGPU"} 0
Device_last_kernel_of_container{ctrname="container-0",deviceuuid="GPU-460f0295-3254-3747-1174-571caee60d31",podname="50-c97x2",podnamespace="default",vdeviceid="0",zone="vGPU"} 0
```

## Troubleshooting record

### Key HAMi-core logs

```bash
[HAMI-core Debug(1497:140544248379136:multiprocess_utilization_watcher.c:224)]: userutil=0 currentcores=5214590 total=6291456 limit=5 share=6291456
[HAMI-core Info(2623:139981076604736:multiprocess_memory_limit.c:296)]: dev=0 pid=2623 host pid=2623 i=0
```
### Analysis

userutil=0 turns out to be caused by `proc = find_proc_by_hostpid(processes_sample[i].pid);` returning an empty result.

Combined with the second log line above, both pid=2623 and host pid=2623 are in-container PIDs.

It then became clear that with the self-built open-source NVIDIA driver, nvmlDeviceGetComputeRunningProcesses returns what appear to be the in-container PIDs of the benchmark application, while nvmlDeviceGetProcessUtilization is confirmed to return host PIDs. As a result, in the get_used_gpu_utilization function in HAMi-core's src/multiprocess/multiprocess_utilization_watcher.c, the sm_utils samples obtained via nvmlDeviceGetProcessUtilization cannot be matched against any tracked process, and the computed sm_util ends up being 0.
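To make the failure mode concrete, here is a minimal sketch (not HAMi-core's actual get_used_gpu_utilization; the name summed_sm_util, the tracked_pids parameter, and the sample cap are assumptions) of how per-process SM utilization is typically accumulated by joining nvmlDeviceGetProcessUtilization samples against the PIDs the limiter tracks. If one side reports host PIDs and the other side was recorded with container PIDs, the join never matches and the sum stays 0:

```c
#include <nvml.h>

/* Hypothetical sketch: sum the SM utilization of "our" processes by joining
 * the samples from nvmlDeviceGetProcessUtilization against a list of PIDs we
 * track (e.g. the PIDs previously seen via nvmlDeviceGetComputeRunningProcesses). */
#define MAX_SAMPLES 256 /* assumed cap, purely for this sketch */

static unsigned int summed_sm_util(nvmlDevice_t dev,
                                   const unsigned int *tracked_pids,
                                   unsigned int tracked_count,
                                   unsigned long long last_seen_ts)
{
    nvmlProcessUtilizationSample_t samples[MAX_SAMPLES];
    unsigned int sample_count = MAX_SAMPLES;

    if (nvmlDeviceGetProcessUtilization(dev, samples, &sample_count,
                                        last_seen_ts) != NVML_SUCCESS)
        return 0;

    unsigned int sm_util = 0;
    for (unsigned int i = 0; i < sample_count; i++) {
        /* samples[i].pid is a host PID; in the case described above the
         * tracked PIDs were recorded as container PIDs, so this join finds
         * no entry and nothing is added. */
        for (unsigned int j = 0; j < tracked_count; j++) {
            if (samples[i].pid == tracked_pids[j]) {
                sm_util += samples[i].smUtil;
                break;
            }
        }
    }
    return sm_util; /* stays 0 whenever the two PID views disagree */
}
```

Whatever fix is chosen would therefore need to bring both PID views into the same namespace before this join (translating container PIDs to host PIDs, or vice versa, at registration time).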
/assign @chaunceyjiang
I will test with 535.161.07. Currently, I don't have a 4090, only a T4 card.
From a colleague's analysis, this may be related to the following commit: NVIDIA/open-gpu-kernel-modules@f59818b#diff-508f4f46a83e851c6a5ffcad176811af0859d5b9190af474bd21119cd0a5b14a
One thing I don't quite understand: with --pid=host set, a process's PID inside the container should be identical to its PID on the host, so why is there a mismatch here?
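One way to check this on the node (a hedged illustration, not part of HAMi-core): the NSpid field in /proc/&lt;pid&gt;/status lists a process's PID in every PID namespace it belongs to, outermost first. A process started with --pid=host should show a single value equal to its host PID, while a process in its own PID namespace shows two values (host PID first, container PID second):

```c
#include <stdio.h>
#include <string.h>

/* Print the NSpid line of /proc/<pid>/status for the PID given on the
 * command line (defaults to "self"). NSpid is available since Linux 4.1. */
int main(int argc, char **argv)
{
    const char *pid = argc > 1 ? argv[1] : "self";
    char path[64], line[256];
    snprintf(path, sizeof(path), "/proc/%s/status", pid);

    FILE *f = fopen(path, "r");
    if (!f) { perror("fopen"); return 1; }

    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "NSpid:", 6) == 0) {
            fputs(line, stdout);  /* e.g. "NSpid:\t2623" or "NSpid:\t123456\t2623" */
            break;
        }
    }
    fclose(f);
    return 0;
}
```

Running it (or simply `grep NSpid /proc/<pid>/status`) against the benchmark process, both from inside the container and on the node, would show whether the two PID views really coincide for these pods.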