test: Add comprehensive CUDA and GPU validation test script#117
Open
ggoklani wants to merge 1 commit intolinux-system-roles:mainfrom
Open
test: Add comprehensive CUDA and GPU validation test script#117ggoklani wants to merge 1 commit intolinux-system-roles:mainfrom
ggoklani wants to merge 1 commit intolinux-system-roles:mainfrom
Conversation
Add test-cuda-gpu.sh to validate CUDA toolkit and GPU functionality
on HPC nodes. This addresses high-priority gap in GPU computing
validation.
Features:
- CUDA driver installation verification
- CUDA toolkit version check (12.9)
- nvidia-smi validation
- GPU memory test
- CUDA compiler (nvcc) availability and functionality
- CUDA libraries verification
- NVIDIA driver and persistence daemon checks
- GPU properties validation (count, compute capability, memory)
The script follows project conventions:
- Matches style of test-diagnostics.sh and test-nvidia-docker.sh
- Supports verbose mode (-v) for detailed output
- Gracefully skips on non-GPU systems (exit code 77)
- Exits with proper status codes (0=pass, 1=fail, 77=skip)
Deployed to: {{ __hpc_azure_tests_dir }}/test-cuda-gpu.sh
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: Gaurav Goklani <ggoklani@redhat.com>
Reviewer's GuideAdds a new CUDA/GPU validation bash test script and wires it into the Ansible role so it is deployed to the HPC Azure tests directory, providing comprehensive checks for CUDA installation, nvcc, NVIDIA drivers, persistence daemon, GPU hardware, and environment configuration with proper exit codes and verbose mode support. Sequence diagram for CUDA/GPU validation script executionsequenceDiagram
actor Admin
participant HPCNode
participant TestScript as test_cuda_gpu_sh
participant OS as System_tools
participant GPU as GPU_hardware
Admin->>HPCNode: ssh to node
Admin->>TestScript: execute test-cuda-gpu.sh [options]
TestScript->>TestScript: parse_arguments
TestScript->>OS: check_nvidia_smi
OS-->>TestScript: nvidia_smi_status
alt nvidia_smi_not_found_or_no_gpu
TestScript-->>Admin: exit 77 (skip non GPU system)
else gpu_detected
TestScript->>OS: check_cuda_driver_and_toolkit
OS-->>TestScript: cuda_versions
TestScript->>OS: check_nvcc_compiler
OS-->>TestScript: nvcc_status
TestScript->>OS: validate_cuda_libraries
OS-->>TestScript: library_status
TestScript->>OS: check_nvidia_driver_and_persistence
OS-->>TestScript: driver_daemon_status
TestScript->>GPU: query_gpu_properties
GPU-->>TestScript: count_compute_capability_memory
alt all_checks_pass
TestScript-->>Admin: exit 0 (all validations passed)
else any_check_fails
TestScript-->>Admin: exit 1 (validation failed)
end
end
Flow diagram for CUDA/GPU validation logic in test-cuda-gpu.shflowchart TD
A[Start] --> B[Parse CLI arguments
support verbose flag]
B --> C[Check nvidia_smi presence
and query GPUs]
C -->|nvidia_smi missing or no GPUs| D[Exit 77
skip non GPU system]
C -->|GPU detected| E[Check CUDA driver and toolkit
verify expected version 12_9]
E --> F[Check nvcc compiler
availability and simple compile]
F --> G[Verify CUDA libraries
required shared libs present]
G --> H[Check NVIDIA driver version
and persistence daemon]
H --> I[Validate GPU properties
count compute capability memory]
I --> J{Any check failed}
J -->|yes| K[Print failure details
set appropriate error messages]
K --> L[Exit 1
validation failed]
J -->|no| M[Print success summary]
M --> N[Exit 0
all validations passed]
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- The CUDA version is hard-coded in multiple places (EXPECTED_CUDA_VERSION, /usr/local/cuda-12.9, and the cuda-toolkit-12-9 package name); consider centralizing this in a single variable or making it configurable so future version bumps only require one change.
- The script mixes privilege models by calling
sudowithin a test script that is likely already run as root via automation; consider removing the internalsudocalls and assuming the script is invoked with appropriate privileges to avoid unexpected prompts or failures. - The temporary directory used for the nvcc compilation test is only cleaned up on the success path; add a
trapto ensurerm -rf "$TEMP_DIR"runs on exit or failure to avoid leaking temp directories.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The CUDA version is hard-coded in multiple places (EXPECTED_CUDA_VERSION, /usr/local/cuda-12.9, and the cuda-toolkit-12-9 package name); consider centralizing this in a single variable or making it configurable so future version bumps only require one change.
- The script mixes privilege models by calling `sudo` within a test script that is likely already run as root via automation; consider removing the internal `sudo` calls and assuming the script is invoked with appropriate privileges to avoid unexpected prompts or failures.
- The temporary directory used for the nvcc compilation test is only cleaned up on the success path; add a `trap` to ensure `rm -rf "$TEMP_DIR"` runs on exit or failure to avoid leaking temp directories.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Add test-cuda-gpu.sh to validate CUDA toolkit and GPU functionality on HPC nodes. This addresses high-priority gap in GPU computing validation.
Features:
The script follows project conventions:
Deployed to: {{ __hpc_azure_tests_dir }}/test-cuda-gpu.sh
🤖 Generated with Claude Code
Issue Tracker Tickets (Jira or BZ if any):https://redhat.atlassian.net/browse/RHELHPC-197
Summary by Sourcery
Add a new CUDA and GPU validation test script and ensure it is installed on HPC nodes.
New Features:
Tests: