-
Notifications
You must be signed in to change notification settings - Fork 744
feat: add mark health device function #1241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Pei PeiDong <[email protected]>
Signed-off-by: Pei PeiDong <[email protected]>
Signed-off-by: Pei PeiDong <[email protected]>
|
This PR is stale because it has been open 90 days with no activity. This PR will be closed in 30 days unless new comments are made or the stale label is removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds functionality to listen for GPU recovery events using the new NVML event type nvmlEventTypeGpuRecoveryAction introduced in NVIDIA driver version 565. The implementation extends the health checking system to support both healthy and unhealthy device state transitions.
- Modified the health check system to support device recovery events and bidirectional health state changes
- Added new device event types and structures to represent both healthy and unhealthy states
- Updated the plugin server to handle both device health degradation and recovery scenarios
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/rm/devices.go | Introduces DeviceEvent structure and DeviceEventType enum to support both healthy/unhealthy states |
| internal/rm/health.go | Adds support for GPU recovery and unavailable events, updates event handling logic |
| internal/rm/nvml_manager.go | Updates CheckHealth signature to use DeviceEvent instead of Device |
| internal/rm/rm.go | Updates ResourceManager interface to use DeviceEvent in CheckHealth method |
| internal/rm/tegra_manager.go | Updates CheckHealth signature for consistency with interface changes |
| internal/plugin/server.go | Modifies health event handling to support both healthy and unhealthy device transitions |
| type DeviceEventType int | ||
|
|
||
| const ( | ||
| DeviceUnHalthy DeviceEventType = iota |
Copilot
AI
Aug 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constant name 'DeviceUnHalthy' contains a spelling error. It should be 'DeviceUnhealthy' (lowercase 'h').
| DeviceUnHalthy DeviceEventType = iota | |
| DeviceUnhealthy DeviceEventType = iota |
| unhealthy <- d | ||
| unhealthy <- &DeviceEvent{ | ||
| Device: d, | ||
| Event: DeviceUnHalthy, |
Copilot
AI
Aug 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constant name 'DeviceUnHalthy' contains a spelling error. It should be 'DeviceUnhealthy' (lowercase 'h').
| unhealthy <- d | ||
| unhealthy <- &DeviceEvent{ | ||
| Device: d, | ||
| Event: DeviceUnHalthy, |
Copilot
AI
Aug 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constant name 'DeviceUnHalthy' contains a spelling error. It should be 'DeviceUnhealthy' (lowercase 'h').
| unhealthy <- d | ||
| unhealthy <- &DeviceEvent{ | ||
| Device: d, | ||
| Event: DeviceUnHalthy, |
Copilot
AI
Aug 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constant name 'DeviceUnHalthy' contains a spelling error. It should be 'DeviceUnhealthy' (lowercase 'h').
| unhealthy <- d | ||
| unhealthy <- &DeviceEvent{ | ||
| Device: d, | ||
| Event: DeviceUnHalthy, |
Copilot
AI
Aug 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constant name 'DeviceUnHalthy' contains a spelling error. It should be 'DeviceUnhealthy' (lowercase 'h').
| unhealthy <- d | ||
| unhealthy <- &DeviceEvent{ | ||
| Device: d, | ||
| Event: DeviceUnHalthy, |
Copilot
AI
Aug 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constant name 'DeviceUnHalthy' contains a spelling error. It should be 'DeviceUnhealthy' (lowercase 'h').
| unhealthy <- d | ||
| unhealthy <- &DeviceEvent{ | ||
| Device: d, | ||
| Event: DeviceUnHalthy, |
Copilot
AI
Aug 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constant name 'DeviceUnHalthy' contains a spelling error. It should be 'DeviceUnhealthy' (lowercase 'h').
| Event: DeviceUnHalthy, | |
| Event: DeviceUnhealthy, |
| klog.Infof("Gpu unavailable event: %+v", e) | ||
| unhealthy <- &DeviceEvent{ | ||
| Device: d, | ||
| Event: DeviceUnHalthy, |
Copilot
AI
Aug 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constant name 'DeviceUnHalthy' contains a spelling error. It should be 'DeviceUnhealthy' (lowercase 'h').
| unhealthy <- d | ||
| unhealthy <- &DeviceEvent{ | ||
| Device: d, | ||
| Event: DeviceUnHalthy, |
Copilot
AI
Aug 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constant name 'DeviceUnHalthy' contains a spelling error. It should be 'DeviceUnhealthy' (lowercase 'h').
| klog.Infof("'%s' device marked unhealthy: %s", plugin.rm.Resource(), d.ID) | ||
| if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: plugin.apiDevices()}); err != nil { | ||
| return nil | ||
| if d.Event == rm.DeviceUnHalthy { |
Copilot
AI
Aug 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constant name 'DeviceUnHalthy' contains a spelling error. It should be 'DeviceUnhealthy' (lowercase 'h').
| if d.Event == rm.DeviceUnHalthy { | |
| if d.Event == rm.DeviceUnhealthy { |
Is it possible to listen to nvmlEventTypeGpuRecoveryAction to determine if a device is available?
Changes between v560 and v565
The following new functionality is exposed on NVIDIA display drivers version 565 Production or later.
• Added new event type nvmlEventTypeGpuRecoveryAction.
• Added new fieldId to query GPU recovery action NVML_FI_DEV_GET_GPU_RECOVERY_ACTION.