test: add test for DCGM basic functionality #114

yacao wants to merge 1 commit into linux-system-roles:main from
Conversation
Reviewer's Guide

Adds an Ansible-managed bash test script for validating NVIDIA DCGM installation and basic functionality, and wires it into the role so it is deployed as an executable test on systems with NVIDIA GPUs.

Sequence diagram for running the DCGM validation test script:

```mermaid
sequenceDiagram
    actor Tester
    participant TestScript as test_dcgm_sh
    participant Shell
    participant DCGMI as dcgmi_binary
    participant DCGMService as nvidia_dcgm_service
    participant GPU as Nvidia_GPU
    Tester->>TestScript: Execute test-dcgm.sh
    TestScript->>Shell: Check dcgmi in PATH
    Shell-->>TestScript: dcgmi found or missing
    TestScript->>DCGMService: Query service status
    DCGMService-->>TestScript: active/inactive
    TestScript->>DCGMI: discovery -l
    DCGMI->>GPU: Probe devices
    GPU-->>DCGMI: List of GPUs
    DCGMI-->>TestScript: Discovery output
    TestScript->>DCGMI: diag -r 1
    DCGMI->>GPU: Run quick diagnostics
    GPU-->>DCGMI: Diagnostic result
    DCGMI-->>TestScript: Success or failure code
    TestScript-->>Tester: Print PASS/FAIL summary
```
Flow diagram for Ansible deployment of the DCGM test script:

```mermaid
flowchart TD
    A["Ansible_playbook_run"] --> B["Ansible_role_hpc_azure"]
    B --> C{"hpc_install_nvidia_dcgm"}
    C -- true --> D["Install_nvidia_dcgm_package"]
    D --> E["Enable_and_start_nvidia_dcgm_service"]
    E --> F["Template_test_dcgm_sh"]
    F --> G["/usr/local/hpc/tests/test-dcgm.sh (example __hpc_azure_tests_dir)"]
    C -- false --> H["Skip_DCGM_install_and_test_script"]
```
Hey - I've left some high level feedback:

- The `dcgmi diag -r 1` check treats any occurrence of `fail|error` as a failure, which will also match benign phrases like `No errors found`; consider tightening the pattern (e.g., anchoring to status fields or specific failure lines) so successful diagnostics aren't falsely reported as failed.
- Given that the quick diagnostic is known to fail on some GPUs (RHELHPC-185), you may want to make `test_dcgm_diag` non-fatal or optionally skippable (e.g., via a flag) so the whole script doesn't hard-fail on configurations where this is an accepted limitation.
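To illustrate the false-positive concern, a minimal sketch; the sample strings below are illustrative, not actual `dcgmi` output:

```shell
# The broad pattern matches benign text such as "No errors found",
# because "errors" contains "error".
line="No errors found"
echo "$line" | grep -qiE 'fail|error' && echo "broad: flagged as failure"

# A tightened pattern anchored to a result column only matches explicit
# failure statuses, assuming dcgmi prints table-style result lines such
# as "Denylist | Fail" (hypothetical sample).
status_line="Denylist | Fail"
for s in "$line" "$status_line"; do
    if echo "$s" | grep -qE '\| *(Fail|Error)'; then
        echo "anchored: flagged: $s"
    fi
done
```

The broad pattern flags both the benign line and a real failure line; the anchored variant flags only the latter.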
```yaml
- name: Install Diagnostics test script
  template:
    src: test-dcgm.sh.j2
    dest: "{{ __hpc_azure_tests_dir }}/test-dcgm.sh"
```
The directory `__hpc_azure_tests_dir` is created above in this block:

```yaml
- name: Install Azure-specific platform packages
  when: ansible_facts["system_vendor"] == "Microsoft Corporation"
```

So either add that `when` condition to the block that contains the task "Install Diagnostics test script", or refactor the code so that `__hpc_azure_tests_dir` is created somewhere else without that condition.
Is it possible to install and run DCGM on a non-Microsoft system?
There was a problem hiding this comment.
Yes, DCGM can be installed on non-Microsoft platforms, as long as an NVIDIA GPU exists. So if we have another folder to store this kind of general test script, we can put it there.
@yacao The only folders defined in this role are specific to azure:

```yaml
__hpc_install_prefix: /opt
__hpc_azure_resource_dir: "{{ __hpc_install_prefix }}/hpc/azure"
__hpc_azure_tools_dir: "{{ __hpc_azure_resource_dir }}/tools"
__hpc_azure_tests_dir: "{{ __hpc_azure_resource_dir }}/tests"
__hpc_azure_runtime_dir: /var/hpc/azure
```

@dgchinner any idea where the DCGM stuff should go?
Our specific test cases that will be run by e2e-tests should be placed in `__hpc_azure_tests_dir`.

In the case of azure installation constraints, the dcgm package control variable should only be set to true in tests/tests_azure.yml, and all the other tests/tests_*.yml files should set it to false. That way we are only installing DCGM on azure-based builds and not during ansible CI image builds, where the presence of DCGM has no relevance to what the CI images are actually testing.
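As a sketch of that control-variable arrangement (the variable name is taken from the flow diagram above; the file contents shown are illustrative):

```yaml
# tests/tests_azure.yml — only Azure-based builds install DCGM
hpc_install_nvidia_dcgm: true

# all other tests/tests_*.yml — ansible CI image builds skip DCGM
hpc_install_nvidia_dcgm: false
```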
Medium term, we really should get rid of all the open-coded `when: ansible_facts["system_vendor"] == "Microsoft Corporation"` checks and put them under installation control variables that are only set to true for the Azure-specific build. This would allow ansible CI to then run on azure-based machines in exactly the same way it runs on non-azure-based machines...
I'll also add the point that the failing CI test is because the DCGM tests are trying to install on a non-azure machine. We've specifically set up the system-role to only create the azure dirs on azure machines to catch issues like this in the ansible CI. i.e. the CI failure here is meaningful, it should not be ignored, and it indicates that the DCGM installation is not set up properly (as per my previous comment).
Updated the related test yml files and the CI passes now. As for the `__hpc_azure_tests_dir` dependency, I think we can use it for now since we are only focusing on Azure; we can adjust it later. Thanks all!
This commit introduces a new test script `templates/test-dcgm.sh.j2` to verify the installation and basic functionality of NVIDIA Data Center GPU Manager (DCGM). The script performs checks for the `dcgmi` binary, `nvidia-dcgm` service status, GPU discovery, and quick diagnostics. This script is only supposed to run on systems with NVIDIA GPUs.

Signed-off-by: Yaju Cao <yacao@redhat.com>
Enhancement:
This commit introduces a new test script `templates/test-dcgm.sh.j2` to verify the installation and basic functionality of NVIDIA Data Center GPU Manager (DCGM). The script performs checks for the `dcgmi` binary, `nvidia-dcgm` service status, GPU discovery, and quick diagnostics. This script is only supposed to run on systems with NVIDIA GPUs.

Reason:

Validate that the DCGM package is installed and that its basic functionality works.
Result:

```text
[2026-03-27 07:33:45] ========================================
[2026-03-27 07:33:45] NVIDIA DCGM Test
[2026-03-27 07:33:45] ========================================
[2026-03-27 07:33:45] Test: Checking for dcgmi binary
[PASS] Checking for dcgmi binary
[2026-03-27 07:33:45] Test: Checking DCGM service status
[PASS] DCGM service is active
[2026-03-27 07:33:45] Test: Running 'dcgmi discovery -l'
[PASS] Discovery found 1 GPU(s)
[2026-03-27 07:33:45] Test: Running 'dcgmi diag -r 1' (quick diagnostic)
[FAIL] Diagnostic returned error/failure
```
The last failure is a known issue; it fails on some GPUs like the NC4as T4, and it is recorded in https://redhat.atlassian.net/browse/RHELHPC-185
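One way to handle this known failure, as suggested in the review, is to make the diagnostic check skippable. A minimal sketch; the flag name `SKIP_DCGM_DIAG` and the function below are hypothetical, not taken from the role:

```shell
# Sketch: allow skipping the quick-diagnostic check via an environment
# flag so known-affected GPUs (RHELHPC-185) don't hard-fail the script.
run_diag() {
    if [ "${SKIP_DCGM_DIAG:-0}" = "1" ]; then
        echo "[SKIP] quick diagnostic (known issue RHELHPC-185)"
        return 0
    fi
    # The real script would run `dcgmi diag -r 1` here and inspect its
    # output; this stub just models the failing branch.
    echo "[FAIL] Diagnostic returned error/failure"
    return 1
}

SKIP_DCGM_DIAG=1
run_diag
```

With the flag set, the check reports a skip and returns success, so the overall PASS/FAIL summary is not dominated by an accepted limitation.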
Issue Tracker Tickets (Jira or BZ if any):
https://redhat.atlassian.net/browse/RHELHPC-130