
fix: workaround for cuda13 plugin load failure #118

Open
yacao wants to merge 1 commit into linux-system-roles:main from yacao:fix-dcgm-cuda

yacao (Collaborator) commented Apr 3, 2026

Enhancement:
DCGM GPU diagnostic (dcgmi diag) may fail on CUDA 12 systems with:

Error: Cannot load plugins. Unable to change to the plugin dir
'/usr/libexec/datacenter-gpu-manager-4/plugins/cuda13/':
'No such file or directory'

This change adds a workaround to create a symlink from cuda13 to
cuda12 when cuda13 plugins are not present.

Reason:
Current DCGM (4.5.x) may attempt to load plugins from a cuda13 path
even when only CUDA 12 plugins are installed. This causes dcgmi diag
to fail on some VM sizes (e.g. NC4as_T4_v3).

Result:
A symlink is created:
/usr/libexec/datacenter-gpu-manager-4/plugins/cuda13 ->
/usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
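
One way to confirm this result on a host where the workaround has been applied is a pair of check tasks. This is only a sketch and not part of the change: the register name dcgm_cuda13_link is hypothetical, and the comparison assumes the link was created with the absolute target shown above.

```yaml
# Verification sketch (not part of this PR).
- name: Inspect the cuda13 plugin path without following links
  stat:
    path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
    follow: false
  register: dcgm_cuda13_link  # hypothetical register name

- name: Confirm that cuda13 is a symlink pointing at cuda12
  assert:
    that:
      - dcgm_cuda13_link.stat.exists
      - dcgm_cuda13_link.stat.islnk
      # lnk_target is the link's stored target, so this assumes an absolute src.
      - dcgm_cuda13_link.stat.lnk_target == "/usr/libexec/datacenter-gpu-manager-4/plugins/cuda12"
```

The assertion is expected to fail on hosts where a real cuda13 plugin directory is installed, so it only makes sense where the workaround is known to apply.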

The task is idempotent and only applies when:

  • CUDA 12 plugins are present
  • CUDA 13 plugin path does not exist
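
These two conditions map onto a three-task sequence. The following is a minimal sketch rather than the exact diff: it reuses the register names that appear in the review context (dcgm_cuda12_plugin, dcgm_cuda13_plugin), and the second `when` condition is inferred from the description above, not copied from the change.

```yaml
# Sketch of the workaround; module arguments are nested under the module key.
- name: Check if cuda12 plugin exists
  stat:
    path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
  register: dcgm_cuda12_plugin

- name: Check if cuda13 plugin exists
  stat:
    path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
    follow: false  # inspect the path itself, not a link target
  register: dcgm_cuda13_plugin

- name: Create symlink as workaround for cuda13 plugins load issue
  file:
    src: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
    dest: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
    state: link
  when:
    - dcgm_cuda12_plugin.stat.exists
    - not dcgm_cuda13_plugin.stat.exists  # assumed form of the second condition
```

On a second run the cuda13 path already exists as a symlink, so the last task is skipped. The cuda13 check mainly protects a real cuda13 directory, should one be installed later, from being touched by the file task.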

This workaround can be removed once CUDA 13 plugins are available or
DCGM behavior is corrected.
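
If the workaround is later retired, hosts that already have the symlink may need it removed. A possible cleanup is sketched below under stated assumptions: it is not part of this change, the register name dcgm_cuda13_path is hypothetical, and it removes the path only when it is a symlink so that a real cuda13 plugin directory is left alone.

```yaml
# Cleanup sketch (not part of this PR).
- name: Check whether the cuda13 plugin path is the workaround symlink
  stat:
    path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
    follow: false
  register: dcgm_cuda13_path  # hypothetical register name

- name: Remove the cuda13 workaround symlink
  file:
    path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
    state: absent  # unlinks the symlink; the cuda12 target is not touched
  when:
    - dcgm_cuda13_path.stat.exists
    - dcgm_cuda13_path.stat.islnk
```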

Issue Tracker:
https://redhat.atlassian.net/browse/RHELHPC-185

Summary by Sourcery

Bug Fixes:

  • Prevent dcgmi diagnostics from failing by creating a cuda13-to-cuda12 plugin symlink when only CUDA 12 plugins are installed and the CUDA 13 path is absent.

yacao requested review from richm and spetrosi as code owners April 3, 2026 03:31

sourcery-ai bot commented Apr 3, 2026

Reviewer's Guide

Implements an Ansible-based workaround for DCGM on CUDA 12 systems by conditionally creating a cuda13 -> cuda12 plugin directory symlink so diagnostics continue to work when CUDA 13 plugins are missing.

Flow diagram for conditional cuda13-to-cuda12 symlink creation

flowchart TD
  A[Run Ansible role tasks/main] --> B[stat path:/usr/libexec/datacenter-gpu-manager-4/plugins/cuda12]
  B -->|dcgm_cuda12_plugin.stat.exists = true| C[stat path:/usr/libexec/datacenter-gpu-manager-4/plugins/cuda13 follow:no]
  B -->|dcgm_cuda12_plugin.stat.exists = false| F[Skip symlink creation]
  C -->|dcgm_cuda13_plugin.stat.exists = true| F
  C -->|dcgm_cuda13_plugin.stat.exists = false| D[file src:/usr/libexec/datacenter-gpu-manager-4/plugins/cuda12 dest:/usr/libexec/datacenter-gpu-manager-4/plugins/cuda13 state:link]
  D --> E[Symlink cuda13 -> cuda12 present]
  F --> G[No changes made]

File-Level Changes

Change: Add an idempotent Ansible task sequence that detects the CUDA plugin directories and creates a cuda13 -> cuda12 symlink when needed, working around DCGM plugin loading failures.
Details:
  • Add a stat task to detect the CUDA 12 DCGM plugin directory and register its state.
  • Add a stat task (with symlink following disabled) to detect the CUDA 13 DCGM plugin path and register its state.
  • Add a conditional file task that creates a symlink at the CUDA 13 plugin path pointing to the CUDA 12 plugin directory, only when CUDA 12 exists and CUDA 13 does not.
Files: tasks/main.yml



sourcery-ai bot left a comment

Hey - I've found 2 issues, and left some high-level feedback:

  • The `follow: no` option for the second `stat` task should be nested under `stat:` (alongside `path`) rather than at the task’s top level to ensure it is passed to the module correctly.
  • In the symlink creation task, `state: link` is mis-indented relative to `src` and `dest`, which will result in invalid YAML or an incorrect task structure; align it under `file:` with the other options.
  • Consider extracting the repeated DCGM plugin base path (`/usr/libexec/datacenter-gpu-manager-4/plugins`) into a variable to avoid duplication and make future path changes easier.
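
The path extraction suggested in the last point could look roughly like the sketch below. The variable name __dcgm_plugin_dir is hypothetical and is not defined by the role; scoping it with `vars:` on a block is just one option, and a role-level vars file would work as well. The second `when` condition is assumed from the PR description.

```yaml
- name: Work around cuda13 plugin load failure
  vars:
    # Hypothetical variable name, introduced only for this sketch.
    __dcgm_plugin_dir: /usr/libexec/datacenter-gpu-manager-4/plugins
  block:
    - name: Check if cuda12 plugin exists
      stat:
        path: "{{ __dcgm_plugin_dir }}/cuda12"
      register: dcgm_cuda12_plugin

    - name: Check if cuda13 plugin exists
      stat:
        path: "{{ __dcgm_plugin_dir }}/cuda13"
        follow: false
      register: dcgm_cuda13_plugin

    - name: Create symlink as workaround for cuda13 plugins load issue
      file:
        src: "{{ __dcgm_plugin_dir }}/cuda12"
        dest: "{{ __dcgm_plugin_dir }}/cuda13"
        state: link
      when:
        - dcgm_cuda12_plugin.stat.exists
        - not dcgm_cuda13_plugin.stat.exists  # assumed second condition
```

With the base path in one place, a future change of the DCGM major version in the directory name would touch a single line.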

## Individual Comments

### Comment 1
<location path="tasks/main.yml" line_range="615-619" />
<code_context>
+        path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
+      register: dcgm_cuda12_plugin
+
+    - name: Check if cuda13 plugin exists
+      stat:
+        path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
+      register: dcgm_cuda13_plugin
+      follow: no
+
+    - name: Create symlink as workaround for cuda13 plugins load issue
</code_context>
<issue_to_address>
**issue (bug_risk):** The `follow: no` option is being set at the task level, so it will be ignored by the `stat` module.

`follow` must be passed as an argument to the `stat` module, not as a task-level key. Please move it under `stat:` with `path:` so it’s honored, e.g.:

```yaml
    - name: Check if cuda13 plugin exists
      stat:
        path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
        follow: no
      register: dcgm_cuda13_plugin
```
</issue_to_address>

### Comment 2
<location path="tasks/main.yml" line_range="625" />
<code_context>
+      file:
+        src: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
+        dest: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
+       state: link
+      when:
+        - dcgm_cuda12_plugin.stat.exists
</code_context>
<issue_to_address>
**issue (bug_risk):** The indentation of `state: link` is inconsistent and may break YAML parsing or task execution.

`state` needs to be indented to align with `src` and `dest` under `file`; otherwise it may be parsed at the wrong level. It should be:

```yaml
    - name: Create symlink as workaround for cuda13 plugins load issue
      file:
        src: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
        dest: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
        state: link
```
</issue_to_address>


Current DCGM may attempt to load plugins from a 'cuda13' path even
when only CUDA12 plugins are installed, causing dcgmi diag to fail
with "Cannot load plugins" errors on some VM sizes like NC4as_T4_v3.

Add a workaround to create a symlink from cuda13 to cuda12 when
cuda13 plugins are not present. This can be removed when we update to
CUDA 13.

Signed-off-by: Yaju Cao <yacao@redhat.com>