fix: workaround for cuda13 plugin load failure #118
Open
yacao wants to merge 1 commit into linux-system-roles:main from
Reviewer's Guide
Implements an Ansible-based workaround for DCGM on CUDA 12 systems by conditionally creating a cuda13 -> cuda12 plugin directory symlink so diagnostics continue to work when CUDA 13 plugins are missing.
Flow diagram for conditional cuda13_to_cuda12_symlink_creation
flowchart TD
A[Run Ansible role tasks/main] --> B[stat path:/usr/libexec/datacenter-gpu-manager-4/plugins/cuda12]
B -->|dcgm_cuda12_plugin.stat.exists = true| C[stat path:/usr/libexec/datacenter-gpu-manager-4/plugins/cuda13 follow:no]
B -->|dcgm_cuda12_plugin.stat.exists = false| F[Skip symlink creation]
C -->|dcgm_cuda13_plugin.stat.exists = true| F
C -->|dcgm_cuda13_plugin.stat.exists = false| D[file src:/usr/libexec/datacenter-gpu-manager-4/plugins/cuda12 dest:/usr/libexec/datacenter-gpu-manager-4/plugins/cuda13 state:link]
D --> E[Symlink cuda13 -> cuda12 present]
F --> E
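The flow above could be written as the following task sequence. This is a minimal sketch rather than the PR's exact diff: the paths and register names come from the diagram, while the fully qualified module names, the task names, and the placement of `follow: false` under the module arguments are assumptions made here.

```yaml
# Sketch of the three tasks the diagram describes; not the PR's exact diff.
- name: Check if cuda12 plugin directory exists
  ansible.builtin.stat:
    path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
  register: dcgm_cuda12_plugin

- name: Check if cuda13 plugin path exists
  ansible.builtin.stat:
    path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
    follow: false  # inspect the path itself rather than a symlink target
  register: dcgm_cuda13_plugin

- name: Create cuda13 -> cuda12 symlink as a workaround
  ansible.builtin.file:
    src: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
    dest: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
    state: link
  when:
    - dcgm_cuda12_plugin.stat.exists
    - not dcgm_cuda13_plugin.stat.exists
```

In this sketch the second `stat` runs unconditionally, so both registered results are always defined when the `when` conditions are evaluated. On a second run the cuda13 path already exists, so the symlink task is skipped.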
Hey - I've found 2 issues, and left some high level feedback:
- The `follow: no` option for the second `stat` task should be nested under `stat:` (alongside `path`) rather than at the task’s top level to ensure it is passed to the module correctly.
- In the symlink creation task, `state: link` is mis-indented relative to `src` and `dest`, which will result in invalid YAML or an incorrect task structure; align it under `file:` with the other options.
- Consider extracting the repeated DCGM plugin base path (`/usr/libexec/datacenter-gpu-manager-4/plugins`) into a variable to avoid duplication and make future path changes easier.
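The suggestion to extract the repeated base path might look like the sketch below. The variable name `__dcgm_plugin_dir` is hypothetical and does not appear in the PR; it is defined at task level here only to keep the example self-contained, and the register names are those used in the tasks under review.

```yaml
# Hypothetical variable name; the PR does not define one.
- name: Create cuda13 -> cuda12 symlink as a workaround
  ansible.builtin.file:
    src: "{{ __dcgm_plugin_dir }}/cuda12"
    dest: "{{ __dcgm_plugin_dir }}/cuda13"
    state: link
  vars:
    __dcgm_plugin_dir: /usr/libexec/datacenter-gpu-manager-4/plugins
  when:
    - dcgm_cuda12_plugin.stat.exists
    - not dcgm_cuda13_plugin.stat.exists
```

In a role, such a variable would more likely live in the role's vars file so that the two `stat` tasks can share it as well.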
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `follow: no` option for the second `stat` task should be nested under `stat:` (alongside `path`) rather than at the task’s top level to ensure it is passed to the module correctly.
- In the symlink creation task, `state: link` is mis-indented relative to `src` and `dest`, which will result in invalid YAML or an incorrect task structure; align it under `file:` with the other options.
- Consider extracting the repeated DCGM plugin base path (`/usr/libexec/datacenter-gpu-manager-4/plugins`) into a variable to avoid duplication and make future path changes easier.
## Individual Comments
### Comment 1
<location path="tasks/main.yml" line_range="615-619" />
<code_context>
+     path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
+   register: dcgm_cuda12_plugin
+
+ - name: Check if cuda13 plugin exists
+   stat:
+     path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
+   register: dcgm_cuda13_plugin
+   follow: no
+
+ - name: Create symlink as workaround for cuda13 plugins load issue
</code_context>
<issue_to_address>
**issue (bug_risk):** The `follow: no` option is being set at the task level, so it will be ignored by the `stat` module.
`follow` must be passed as an argument to the `stat` module, not as a task-level key. Please move it under `stat:` with `path:` so it’s honored, e.g.:
```yaml
- name: Check if cuda13 plugin exists
  stat:
    path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
    follow: no
  register: dcgm_cuda13_plugin
```
</issue_to_address>
### Comment 2
<location path="tasks/main.yml" line_range="625" />
<code_context>
+ file:
+ src: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
+ dest: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
+ state: link
+ when:
+ - dcgm_cuda12_plugin.stat.exists
</code_context>
<issue_to_address>
**issue (bug_risk):** The indentation of `state: link` is inconsistent and may break YAML parsing or task execution.
`state` needs to be indented to align with `src` and `dest` under `file`; otherwise it may be parsed at the wrong level. It should be:
```yaml
- name: Create symlink as workaround for cuda13 plugins load issue
  file:
    src: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
    dest: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
    state: link
```
</issue_to_address>
Current DCGM may attempt to load plugins from a 'cuda13' path even when only CUDA12 plugins are installed, causing dcgmi diag to fail with "Cannot load plugins" errors on some VM sizes like NC4as_T4_v3. Add a workaround to create a symlink from cuda13 to cuda12 when cuda13 plugins are not present. This can be removed when we update to CUDA 13.

Signed-off-by: Yaju Cao <yacao@redhat.com>
Enhancement:
DCGM GPU diagnostic (`dcgmi diag`) may fail on CUDA 12 systems with "Cannot load plugins" errors. This change adds a workaround to create a symlink from `cuda13` to `cuda12` when cuda13 plugins are not present.
Reason:
Current DCGM (4.5.x) may attempt to load plugins from a `cuda13` path even when only CUDA12 plugins are installed. This results in diagnostic failures on some VM sizes (e.g. NC4as_T4_v3).
Result:
A symlink is created:
/usr/libexec/datacenter-gpu-manager-4/plugins/cuda13 ->
/usr/libexec/datacenter-gpu-manager-4/plugins/cuda12
The task is idempotent and only applies when:
- the cuda12 plugin directory exists
- the cuda13 plugin path does not exist
This workaround can be removed once CUDA13 plugins are available or
DCGM behavior is corrected.
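One way to confirm the result on a host where the workaround was applied is a pair of verification tasks like the sketch below. These tasks are not part of the PR, and the register name `__dcgm_cuda13_check` is hypothetical. The assertion would fail by design on hosts that have real cuda13 plugins installed, since the path is then not a symlink.

```yaml
# Verification sketch; not part of the PR. Register name is hypothetical.
- name: Inspect the cuda13 plugin path without following symlinks
  ansible.builtin.stat:
    path: /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13
    follow: false
  register: __dcgm_cuda13_check

- name: Assert that cuda13 is a symlink to the cuda12 plugin directory
  ansible.builtin.assert:
    that:
      - __dcgm_cuda13_check.stat.exists
      - __dcgm_cuda13_check.stat.islnk
      - __dcgm_cuda13_check.stat.lnk_target == "/usr/libexec/datacenter-gpu-manager-4/plugins/cuda12"
```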
Issue Tracker:
https://redhat.atlassian.net/browse/RHELHPC-185