
test: add test for DCGM basic functionality#114

Open
yacao wants to merge 1 commit into linux-system-roles:main from yacao:test-dcgm

Conversation

@yacao
Collaborator

@yacao yacao commented Mar 27, 2026

Enhancement:
This commit introduces a new test script templates/test-dcgm.sh.j2 to verify the installation and basic functionality of NVIDIA Data Center GPU Manager (DCGM). The script checks for the dcgmi binary, the nvidia-dcgm service status, GPU discovery, and quick diagnostics. It is intended to run only on systems with NVIDIA GPUs.

Reason:
Validate that the DCGM package is installed and that its basic functionality works.

Result:
[2026-03-27 07:33:45] ========================================
[2026-03-27 07:33:45] NVIDIA DCGM Test
[2026-03-27 07:33:45] ========================================
[2026-03-27 07:33:45] Test: Checking for dcgmi binary
[PASS] Checking for dcgmi binary
[2026-03-27 07:33:45] Test: Checking DCGM service status
[PASS] DCGM service is active
[2026-03-27 07:33:45] Test: Running 'dcgmi discovery -l'
[PASS] Discovery found 1 GPU(s)
[2026-03-27 07:33:45] Test: Running 'dcgmi diag -r 1' (quick diagnostic)
[FAIL] Diagnostic returned error/failure
The last failure is a known issue: it occurs on some GPUs such as NC4as T4, and it is recorded in https://redhat.atlassian.net/browse/RHELHPC-185

Issue Tracker Tickets (Jira or BZ if any):
https://redhat.atlassian.net/browse/RHELHPC-130

Summary by Sourcery

Tests:

  • Introduce a DCGM test script that checks dcgmi presence, DCGM service status, GPU discovery, and runs a quick diagnostic, and install it into the HPC Azure tests directory.

@yacao yacao requested review from richm and spetrosi as code owners March 27, 2026 08:34
@sourcery-ai

sourcery-ai bot commented Mar 27, 2026

Reviewer's Guide

Adds an Ansible-managed bash test script for validating NVIDIA DCGM installation and basic functionality, and wires it into the role so it is deployed as an executable test on systems with NVIDIA GPUs.

Sequence diagram for running DCGM validation test script

sequenceDiagram
    actor Tester
    participant TestScript as test_dcgm_sh
    participant Shell
    participant DCGMI as dcgmi_binary
    participant DCGMService as nvidia_dcgm_service
    participant GPU as Nvidia_GPU

    Tester->>TestScript: Execute test-dcgm.sh
    TestScript->>Shell: Check dcgmi in PATH
    Shell-->>TestScript: dcgmi found or missing
    TestScript->>DCGMService: Query service status
    DCGMService-->>TestScript: active/inactive

    TestScript->>DCGMI: discovery -l
    DCGMI->>GPU: Probe devices
    GPU-->>DCGMI: List of GPUs
    DCGMI-->>TestScript: Discovery output

    TestScript->>DCGMI: diag -r 1
    DCGMI->>GPU: Run quick diagnostics
    GPU-->>DCGMI: Diagnostic result
    DCGMI-->>TestScript: Success or failure code

    TestScript-->>Tester: Print PASS/FAIL summary

Flow diagram for Ansible deployment of DCGM test script

flowchart TD
    A["Ansible_playbook_run"] --> B["Ansible_role_hpc_azure"]
    B --> C{"hpc_install_nvidia_dcgm"}
    C -- true --> D["Install_nvidia_dcgm_package"]
    D --> E["Enable_and_start_nvidia_dcgm_service"]
    E --> F["Template_test_dcgm_sh"]
    F --> G["/usr/local/hpc/tests/test-dcgm.sh (example __hpc_azure_tests_dir)"]
    C -- false --> H["Skip_DCGM_install_and_test_script"]

File-Level Changes

Change Details Files
Install a DCGM diagnostics test script via Ansible template task.
  • Adds a new Ansible task to install a DCGM test script from a Jinja2 template.
  • Places the rendered script under the configured HPC Azure tests directory with root ownership and executable permissions.
  • Ensures the script is deployed as part of the existing DCGM installation/service configuration block.
tasks/main.yml
Introduce a bash test script to validate DCGM presence, service status, GPU discovery, and quick diagnostics.
  • Implements a parameterized bash script template with ansible_managed header and SPDX license for DCGM testing.
  • Parses -v (verbose) and -h (help) flags, with structured logging utilities and pass/fail helpers that increment a counter or exit non‑zero.
  • Adds a test to verify the dcgmi binary exists in PATH, logging its path when verbose is enabled.
  • Adds a test to verify the nvidia-dcgm systemd service is active, logging full service status on failure.
  • Adds a GPU discovery test using 'dcgmi discovery -l', counting GPUs via a simple regex and failing if none are reported.
  • Adds a quick diagnostic test using 'dcgmi diag -r 1', treating any output containing 'fail' or 'error' (case-insensitive) as a failure.
  • Defines a main() wrapper that runs all tests sequentially, logs a summary with total passed tests, and exits with success only if all checks pass.
templates/test-dcgm.sh.j2
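The GPU-discovery check described above might look roughly like this (a minimal sketch; the sample output, the regex, and the message format are assumptions for illustration, not taken from the actual template):

```shell
#!/bin/sh
# Sketch of counting GPUs from 'dcgmi discovery -l' style output.
# sample_discovery is invented for illustration, not captured from a real system.
sample_discovery='1 GPU found.
+--------+---------------------------+
| GPU ID | Device Information        |
+--------+---------------------------+
| 0      | Name: Tesla T4            |
+--------+---------------------------+'

# Count table rows whose first column is a numeric GPU ID.
gpu_count=$(echo "$sample_discovery" | grep -cE '^[|] [0-9]+')

if [ "$gpu_count" -gt 0 ]; then
    echo "[PASS] Discovery found $gpu_count GPU(s)"
else
    echo "[FAIL] Discovery reported no GPUs"
fi
```

On a real system the output of `dcgmi discovery -l` would be piped in instead of the hard-coded sample, and a zero count would make the script exit non-zero.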


@sourcery-ai sourcery-ai bot left a comment


Hey - I've left some high level feedback:

  • The dcgmi diag -r 1 check treats any occurrence of fail|error as a failure, which will also match benign phrases like No errors found; consider tightening the pattern (e.g., anchoring to status fields or specific failure lines) so successful diagnostics aren’t falsely reported as failed.
  • Given that the quick diagnostic is known to fail on some GPUs (RHELHPC-185), you may want to make test_dcgm_diag non-fatal or optionally skippable (e.g., via a flag) so the whole script doesn’t hard-fail on configurations where this is an accepted limitation.
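The false-positive risk in the first point can be illustrated with a small sketch (the sample diagnostic lines are invented, and the tightened pattern is one possible approach, not the actual fix):

```shell
#!/bin/sh
# Illustration only: shows how a broad 'fail|error' match misreports a
# successful diagnostic whose output contains the phrase "No errors found".
sample_output='Deployment: Pass
No errors found'

# Broad match (current behavior): "No errors found" also matches,
# so a clean run is reported as a failure.
if echo "$sample_output" | grep -qiE 'fail|error'; then
    broad_result=FAIL
else
    broad_result=PASS
fi

# Tighter match (one option): anchor to explicit status fields
# such as "...: Fail" or "...: Error".
if echo "$sample_output" | grep -qiE ':[[:space:]]*(fail|error)'; then
    tight_result=FAIL
else
    tight_result=PASS
fi

echo "broad pattern verdict: $broad_result"
echo "tight pattern verdict: $tight_result"
```

With the sample input, the broad pattern reports a failure while the anchored pattern correctly reports a pass.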

- name: Install Diagnostics test script
  template:
    src: test-dcgm.sh.j2
    dest: "{{ __hpc_azure_tests_dir }}/test-dcgm.sh"
Contributor


The directory __hpc_azure_tests_dir is created above in this block:

- name: Install Azure-specific platform packages
  when: ansible_facts["system_vendor"] == "Microsoft Corporation"

So either add that when condition to the block that contains the task "Install Diagnostics test script", or refactor the code so that __hpc_azure_tests_dir is created somewhere else without that condition.
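One way to follow the first suggestion would be to guard the task with the same condition (a sketch based on the snippet quoted above; the surrounding block structure is assumed):

```yaml
# Sketch: reuse the vendor check that guards the directory creation.
- name: Install Diagnostics test script
  template:
    src: test-dcgm.sh.j2
    dest: "{{ __hpc_azure_tests_dir }}/test-dcgm.sh"
  when: ansible_facts["system_vendor"] == "Microsoft Corporation"
```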

Is it possible to install and run DCGM on a non-Microsoft system?

Collaborator Author

@yacao yacao Mar 31, 2026


Yes, DCGM can be installed on non-Microsoft platforms, as long as an NVIDIA GPU exists. So if we have another folder to store this kind of general test script, we can put it there.

Contributor


@yacao The only folders defined in this role are specific to azure:

__hpc_install_prefix: /opt
__hpc_azure_resource_dir: "{{ __hpc_install_prefix }}/hpc/azure"
__hpc_azure_tools_dir: "{{ __hpc_azure_resource_dir }}/tools"
__hpc_azure_tests_dir: "{{ __hpc_azure_resource_dir }}/tests"
__hpc_azure_runtime_dir: /var/hpc/azure

@dgchinner any idea where the DCGM stuff should go?

Collaborator


Our specific test cases that will be run by e2e-tests should be placed in __hpc_azure_tests_dir.

In the case of azure installation constraints, the dcgm package control variable should only be set to true in tests/tests_azure.yml, and all the other tests/tests_*.yml files should set it to false. That way we only install DCGM on azure-based builds and not during ansible CI image builds, where the presence of DCGM has no relevance to what the CI images are actually testing.

Medium term, we really should get rid of all the open coded when: ansible_facts["system_vendor"] == "Microsoft Corporation" checks and put them under installation control variables that are only set to true for the Azure specific build. This would allow ansible CI to then run on azure based machines in exactly the same way it runs on non-azure based machines...
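The control-variable approach suggested above could look like this in the test playbooks (the variable name hpc_install_nvidia_dcgm is taken from the flow diagram earlier in the review and may not match the role's actual variable):

```yaml
# tests/tests_azure.yml — DCGM is relevant on Azure-based builds
hpc_install_nvidia_dcgm: true

# every other tests/tests_*.yml — skip DCGM during CI image builds
hpc_install_nvidia_dcgm: false
```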

Collaborator


I'll also add the point that the failing CI test is because the DCGM tests are trying to install on a non-azure machine. We've specifically set up the system-role to only create the azure dirs on azure machines to catch issues like this in the ansible CI. i.e., the CI failure here is meaningful, it should not be ignored, and it indicates that the DCGM installation is not set up properly (as per my previous comment).

Collaborator Author


Updated related test yml files and the CI pass now. As for the __hpc_azure_tests_dir dependency, I think we can use it for now since we are only focusing on Azure, we can adjust it later. Thanks all!

This commit introduces a new test script `templates/test-dcgm.sh.j2` to
verify the installation and basic functionality of NVIDIA Data Center
GPU Manager (DCGM). The script performs checks for `dcgmi` binary,
`nvidia-dcgm` service status, GPU discovery, quick diagnostics. This
script is only supposed to run on systems with NVIDIA GPUs.

Signed-off-by: Yaju Cao <yacao@redhat.com>
