test: add test for DCGM basic functionality #114

yacao wants to merge 1 commit into linux-system-roles:main from
Conversation
Reviewer's Guide

Adds an Ansible-managed bash test script for validating NVIDIA DCGM installation and basic functionality, and wires it into the role so it is deployed as an executable test on systems with NVIDIA GPUs.

Sequence diagram for running the DCGM validation test script:

```mermaid
sequenceDiagram
    actor Tester
    participant TestScript as test_dcgm_sh
    participant Shell
    participant DCGMI as dcgmi_binary
    participant DCGMService as nvidia_dcgm_service
    participant GPU as Nvidia_GPU
    Tester->>TestScript: Execute test-dcgm.sh
    TestScript->>Shell: Check dcgmi in PATH
    Shell-->>TestScript: dcgmi found or missing
    TestScript->>DCGMService: Query service status
    DCGMService-->>TestScript: active/inactive
    TestScript->>DCGMI: discovery -l
    DCGMI->>GPU: Probe devices
    GPU-->>DCGMI: List of GPUs
    DCGMI-->>TestScript: Discovery output
    TestScript->>DCGMI: diag -r 1
    DCGMI->>GPU: Run quick diagnostics
    GPU-->>DCGMI: Diagnostic result
    DCGMI-->>TestScript: Success or failure code
    TestScript-->>Tester: Print PASS/FAIL summary
```
Flow diagram for Ansible deployment of the DCGM test script:

```mermaid
flowchart TD
    A["Ansible_playbook_run"] --> B["Ansible_role_hpc_azure"]
    B --> C{"hpc_install_nvidia_dcgm"}
    C -- true --> D["Install_nvidia_dcgm_package"]
    D --> E["Enable_and_start_nvidia_dcgm_service"]
    E --> F["Template_test_dcgm_sh"]
    F --> G["/usr/local/hpc/tests/test-dcgm.sh (example __hpc_azure_tests_dir)"]
    C -- false --> H["Skip_DCGM_install_and_test_script"]
```
Hey - I've left some high level feedback:

- The `dcgmi diag -r 1` check treats any occurrence of `fail|error` as a failure, which will also match benign phrases like `No errors found`; consider tightening the pattern (e.g., anchoring to status fields or specific failure lines) so successful diagnostics aren't falsely reported as failed.
- Given that the quick diagnostic is known to fail on some GPUs (RHELHPC-185), you may want to make `test_dcgm_diag` non-fatal or optionally skippable (e.g., via a flag) so the whole script doesn't hard-fail on configurations where this is an accepted limitation.
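To illustrate the false-positive concern, a minimal sketch; the sample strings below are illustrative, not actual `dcgmi` output:

```shell
# The broad pattern matches benign text such as "No errors found",
# because "errors" contains "error".
line="No errors found"
echo "$line" | grep -qiE 'fail|error' && echo "broad: flagged as failure"

# A tightened pattern anchored to a result column only matches explicit
# failure statuses, assuming dcgmi prints table-style result lines such
# as "Denylist | Fail" (hypothetical sample).
status_line="Denylist | Fail"
for s in "$line" "$status_line"; do
    if echo "$s" | grep -qE '\| *(Fail|Error)'; then
        echo "anchored: flagged: $s"
    fi
done
```

The broad pattern flags both the benign line and a real failure line; the anchored variant flags only the latter.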
```yaml
- name: Install Diagnostics test script
  template:
    src: test-dcgm.sh.j2
    dest: "{{ __hpc_azure_tests_dir }}/test-dcgm.sh"
```
The directory `__hpc_azure_tests_dir` is created above in this block:

```yaml
- name: Install Azure-specific platform packages
  when: ansible_facts["system_vendor"] == "Microsoft Corporation"
```

So either add that `when` condition to the block that contains the task "Install Diagnostics test script", or refactor the code so that `__hpc_azure_tests_dir` is created somewhere else without that condition.
Is it possible to install and run DCGM on a non-Microsoft system?
There was a problem hiding this comment.
Yes, DCGM can be installed on non-Microsoft platforms, as long as an NVIDIA GPU exists. So if we have another folder to store this kind of general test script, we can put it there.
@yacao The only folders defined in this role are specific to azure:

```yaml
__hpc_install_prefix: /opt
__hpc_azure_resource_dir: "{{ __hpc_install_prefix }}/hpc/azure"
__hpc_azure_tools_dir: "{{ __hpc_azure_resource_dir }}/tools"
__hpc_azure_tests_dir: "{{ __hpc_azure_resource_dir }}/tests"
__hpc_azure_runtime_dir: /var/hpc/azure
```

@dgchinner any idea where the DCGM stuff should go?
Our specific test cases that will be run by e2e-tests should be placed in `__hpc_azure_tests_dir`.

In the case of azure installation constraints, the dcgm package control variable should only be set to true in tests/tests_azure.yml, and all the other tests/tests_*.yml files should set it to false. That way we are only installing DCGM on azure-based builds and not during ansible CI image builds, where the presence of DCGM has no relevance to what the CI images are actually testing.
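As a sketch of that control-variable arrangement (the variable name is taken from the flow diagram above; the file contents shown are illustrative):

```yaml
# tests/tests_azure.yml — only Azure-based builds install DCGM
hpc_install_nvidia_dcgm: true

# all other tests/tests_*.yml — ansible CI image builds skip DCGM
hpc_install_nvidia_dcgm: false
```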
Medium term, we really should get rid of all the open-coded `when: ansible_facts["system_vendor"] == "Microsoft Corporation"` checks and put them under installation control variables that are only set to true for the Azure-specific build. This would allow ansible CI to then run on azure-based machines in exactly the same way it runs on non-azure-based machines...
I'll also add the point that the failing CI test is because the DCGM tests are trying to install on a non-azure machine. We've specifically set up the system-role to only create the azure dirs on azure machines to catch issues like this in the ansible CI. i.e. the CI failure here is meaningful, it should not be ignored, and it indicates that the DCGM installation is not set up properly (as per my previous comment).
Updated the related test yml files and the CI passes now. As for the `__hpc_azure_tests_dir` dependency, I think we can use it for now since we are only focusing on Azure; we can adjust it later. Thanks all!
This commit introduces a new test script `templates/test-dcgm.sh.j2` to verify the installation and basic functionality of NVIDIA Data Center GPU Manager (DCGM). The script performs checks for the `dcgmi` binary, `nvidia-dcgm` service status, GPU discovery, and quick diagnostics. This script is only supposed to run on systems with NVIDIA GPUs.

Signed-off-by: Yaju Cao <yacao@redhat.com>
Enhancement:
This commit introduces a new test script `templates/test-dcgm.sh.j2` to verify the installation and basic functionality of NVIDIA Data Center GPU Manager (DCGM). The script performs checks for the `dcgmi` binary, `nvidia-dcgm` service status, GPU discovery, and quick diagnostics. This script is only supposed to run on systems with NVIDIA GPUs.

Reason:

Validate that the DCGM package is installed and that its basic functionality works.
Result:

```text
[2026-03-27 07:33:45] ========================================
[2026-03-27 07:33:45] NVIDIA DCGM Test
[2026-03-27 07:33:45] ========================================
[2026-03-27 07:33:45] Test: Checking for dcgmi binary
[PASS] Checking for dcgmi binary
[2026-03-27 07:33:45] Test: Checking DCGM service status
[PASS] DCGM service is active
[2026-03-27 07:33:45] Test: Running 'dcgmi discovery -l'
[PASS] Discovery found 1 GPU(s)
[2026-03-27 07:33:45] Test: Running 'dcgmi diag -r 1' (quick diagnostic)
[FAIL] Diagnostic returned error/failure
```
The last failure is a known issue; it fails on some GPUs like the NC4as T4, and it is recorded in https://redhat.atlassian.net/browse/RHELHPC-185
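One way to handle this known failure, as suggested in the review, is to make the diagnostic check skippable. A minimal sketch; the flag name `SKIP_DCGM_DIAG` and the function below are hypothetical, not taken from the role:

```shell
# Sketch: allow skipping the quick-diagnostic check via an environment
# flag so known-affected GPUs (RHELHPC-185) don't hard-fail the script.
run_diag() {
    if [ "${SKIP_DCGM_DIAG:-0}" = "1" ]; then
        echo "[SKIP] quick diagnostic (known issue RHELHPC-185)"
        return 0
    fi
    # The real script would run `dcgmi diag -r 1` here and inspect its
    # output; this stub just models the failing branch.
    echo "[FAIL] Diagnostic returned error/failure"
    return 1
}

SKIP_DCGM_DIAG=1
run_diag
```

With the flag set, the check reports a skip and returns success, so the overall PASS/FAIL summary is not dominated by an accepted limitation.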
Issue Tracker Tickets (Jira or BZ if any):
https://redhat.atlassian.net/browse/RHELHPC-130