fix: workaround for cuda13 plugin load failure by yacao · Pull Request #118 · linux-system-roles/hpc

yacao · 2026-04-03T03:31:01Z

Enhancement:
DCGM GPU diagnostic (dcgmi diag) may fail on CUDA 12 systems with:

Error: Cannot load plugins. Unable to change to the plugin dir
'/usr/libexec/datacenter-gpu-manager-4/plugins/cuda13/':
'No such file or directory'

This change adds a workaround to create a symlink from cuda13 to
cuda12 when cuda13 plugins are not present.

Reason:
Current DCGM (4.5.x) may attempt to load plugins from a cuda13 path
even when only CUDA12 plugins are installed. This results in diagnostic
failures on some VM sizes (e.g. NC4as_T4_v3).

Result:
A symlink is created:
/usr/libexec/datacenter-gpu-manager-4/plugins/cuda13 ->
/usr/libexec/datacenter-gpu-manager-4/plugins/cuda12

The task is idempotent and only applies when:

- CUDA12 plugins are present
- CUDA13 plugin path does not exist

This workaround can be removed once CUDA13 plugins are available or
DCGM behavior is corrected.
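
For readers following along, the task sequence described above might look roughly like the sketch below. The paths and register names are taken from this PR. The fully qualified module names (`ansible.builtin.stat`, `ansible.builtin.file`) and the exact form of the second `when` condition are assumptions, so treat this as an illustration rather than the merged code.

```yaml
# Sketch only: mirrors the behavior described in this PR, not the exact diff.
- name: Check if cuda12 plugin exists
  ansible.builtin.stat:
    path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
  register: dcgm_cuda12_plugin

- name: Check if cuda13 plugin exists
  ansible.builtin.stat:
    path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
    follow: false  # inspect the path itself rather than a link target
  register: dcgm_cuda13_plugin

- name: Create symlink as workaround for cuda13 plugins load issue
  ansible.builtin.file:
    src: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
    dest: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
    state: link
  when:
    - dcgm_cuda12_plugin.stat.exists       # CUDA12 plugins are present
    - not dcgm_cuda13_plugin.stat.exists   # nothing exists at the cuda13 path yet
```

On a second run the cuda13 path already exists as a symlink, so the last task is skipped and nothing changes.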

Issue Tracker:
https://redhat.atlassian.net/browse/RHELHPC-185

Summary by Sourcery

Bug Fixes:

Prevent dcgmi diagnostics from failing by creating a cuda13-to-cuda12 plugin symlink when only CUDA 12 plugins are installed and the CUDA 13 path is absent.

sourcery-ai · 2026-04-03T03:31:07Z

Reviewer's Guide

Implements an Ansible-based workaround for DCGM on CUDA 12 systems by conditionally creating a cuda13 -> cuda12 plugin directory symlink so diagnostics continue to work when CUDA 13 plugins are missing.

Flow diagram for conditional cuda13_to_cuda12_symlink_creation

flowchart TD
  A[Run Ansible role tasks/main] --> B[stat path:/usr/libexec/datacenter-gpu-manager-4/plugins/cuda12]
  B -->|dcgm_cuda12_plugin.stat.exists = true| C[stat path:/usr/libexec/datacenter-gpu-manager-4/plugins/cuda13 follow:no]
  B -->|dcgm_cuda12_plugin.stat.exists = false| F[Skip symlink creation]
  C -->|dcgm_cuda13_plugin.stat.exists = true| F
  C -->|dcgm_cuda13_plugin.stat.exists = false| D[file src:/usr/libexec/datacenter-gpu-manager-4/plugins/cuda12 dest:/usr/libexec/datacenter-gpu-manager-4/plugins/cuda13 state:link]
  D --> E[Symlink cuda13 -> cuda12 present]
  F --> G[No change made]

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Add an idempotent Ansible task sequence to detect CUDA plugin directories and create a cuda13 -> cuda12 symlink when needed to work around DCGM plugin loading failures. | Add a stat task to detect presence of the CUDA 12 DCGM plugin directory and register its state. Add a stat task (with symlink following disabled) to detect presence of the CUDA 13 DCGM plugin path and register its state. Add a conditional file task that creates a symlink at the CUDA 13 plugin path pointing to the CUDA 12 plugin directory, only when CUDA 12 exists and CUDA 13 does not. | `tasks/main.yml` |


sourcery-ai

Hey - I've found 2 issues, and left some high level feedback:

The follow: no option for the second stat task should be nested under stat: (alongside path) rather than at the task’s top level to ensure it is passed to the module correctly.
In the symlink creation task, state: link is mis-indented relative to src and dest, which will result in invalid YAML or an incorrect task structure; align it under file: with the other options.
Consider extracting the repeated DCGM plugin base path (/usr/libexec/datacenter-gpu-manager-4/plugins) into a variable to avoid duplication and make future path changes easier.
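
One way the three points above could be combined is sketched below. The variable name `dcgm_plugin_dir` and the wrapping `block` are hypothetical and not part of this PR. The fully qualified module names are likewise an assumption, and the role may prefer a different place (for example a vars file) for the base path.

```yaml
# Sketch only: dcgm_plugin_dir is a hypothetical variable name.
- name: Work around DCGM looking for cuda13 plugins
  vars:
    dcgm_plugin_dir: /usr/libexec/datacenter-gpu-manager-4/plugins
  block:
    - name: Check if cuda12 plugin exists
      ansible.builtin.stat:
        path: "{{ dcgm_plugin_dir }}/cuda12"
      register: dcgm_cuda12_plugin

    - name: Check if cuda13 plugin exists
      ansible.builtin.stat:
        path: "{{ dcgm_plugin_dir }}/cuda13"
        follow: false  # module argument, nested under the module key
      register: dcgm_cuda13_plugin

    - name: Create symlink as workaround for cuda13 plugins load issue
      ansible.builtin.file:
        src: "{{ dcgm_plugin_dir }}/cuda12"
        dest: "{{ dcgm_plugin_dir }}/cuda13"
        state: link  # aligned with src and dest
      when:
        - dcgm_cuda12_plugin.stat.exists
        - not dcgm_cuda13_plugin.stat.exists
```

Defining the base path once means a future DCGM major version or packaging change only needs a single edit.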

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The `follow: no` option for the second `stat` task should be nested under `stat:` (alongside `path`) rather than at the task’s top level to ensure it is passed to the module correctly.
- In the symlink creation task, `state: link` is mis-indented relative to `src` and `dest`, which will result in invalid YAML or an incorrect task structure; align it under `file:` with the other options.
- Consider extracting the repeated DCGM plugin base path (`/usr/libexec/datacenter-gpu-manager-4/plugins`) into a variable to avoid duplication and make future path changes easier.

## Individual Comments

### Comment 1
<location path="tasks/main.yml" line_range="615-619" />
<code_context>
+        path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
+      register: dcgm_cuda12_plugin
+
+    - name: Check if cuda13 plugin exists
+      stat:
+        path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
+      register: dcgm_cuda13_plugin
+      follow: no
+
+    - name: Create symlink as workaround for cuda13 plugins load issue
</code_context>
<issue_to_address>
**issue (bug_risk):** The `follow: no` option is being set at the task level, so it will be ignored by the `stat` module.

`follow` must be passed as an argument to the `stat` module, not as a task-level key. Please move it under `stat:` with `path:` so it’s honored, e.g.:

```yaml
    - name: Check if cuda13 plugin exists
      stat:
        path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
        follow: no
      register: dcgm_cuda13_plugin
```
</issue_to_address>

### Comment 2
<location path="tasks/main.yml" line_range="625" />
<code_context>
+      file:
+        src: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
+        dest: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
+       state: link
+      when:
+        - dcgm_cuda12_plugin.stat.exists
</code_context>
<issue_to_address>
**issue (bug_risk):** The indentation of `state: link` is inconsistent and may break YAML parsing or task execution.

`state` needs to be indented to align with `src` and `dest` under `file`; otherwise it may be parsed at the wrong level. It should be:

```yaml
    - name: Create symlink as workaround for cuda13 plugins load issue
      file:
        src: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
        dest: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
        state: link
```
</issue_to_address>


tasks/main.yml

Current DCGM may attempt to load plugins from a 'cuda13' path even when only CUDA12 plugins are installed, causing dcgmi diag to fail with "Cannot load plugins" errors on some VM sizes like NC4as_T4_v3.

Add a workaround to create a symlink from cuda13 to cuda12 when cuda13 plugins are not present. This can be removed when we update to CUDA 13.

Signed-off-by: Yaju Cao <yacao@redhat.com>

yacao requested review from richm and spetrosi as code owners April 3, 2026 03:31

sourcery-ai bot reviewed Apr 3, 2026


yacao force-pushed the fix-dcgm-cuda branch from cd35d38 to 372d702 April 3, 2026 07:31

yacao wants to merge 1 commit into linux-system-roles:main from yacao:fix-dcgm-cuda
