@ppd324 ppd324 commented Apr 24, 2025

Is it possible to listen to nvmlEventTypeGpuRecoveryAction to determine if a device is available?

Changes between v560 and v565
The following new functionality is exposed on NVIDIA display drivers version 565 Production or later.
• Added new event type nvmlEventTypeGpuRecoveryAction.
• Added new fieldId to query GPU recovery action NVML_FI_DEV_GET_GPU_RECOVERY_ACTION.

copy-pr-bot bot commented Apr 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Pei PeiDong <[email protected]>
@github-actions

This PR is stale because it has been open 90 days with no activity. This PR will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 24, 2025
@ArangoGutierrez ArangoGutierrez self-assigned this Aug 12, 2025
@ArangoGutierrez ArangoGutierrez requested a review from elezar August 12, 2025 09:46
Copilot AI left a comment

Pull Request Overview

This PR adds functionality to listen for GPU recovery events using the new NVML event type nvmlEventTypeGpuRecoveryAction introduced in NVIDIA driver version 565. The implementation extends the health checking system to support both healthy and unhealthy device state transitions.

  • Modified the health check system to support device recovery events and bidirectional health state changes
  • Added new device event types and structures to represent both healthy and unhealthy states
  • Updated the plugin server to handle both device health degradation and recovery scenarios
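
The bidirectional state change summarized above can be sketched as follows. This is a minimal, self-contained illustration rather than the PR's actual code: `DeviceEvent` and `DeviceEventType` are named in the review, but the `DeviceHealthy` constant, the simplified `Device` struct, the health strings, and the `Apply` helper are assumptions introduced here.

```go
package main

import "fmt"

// DeviceEventType says which way a device's health changed.
type DeviceEventType int

const (
	DeviceUnhealthy DeviceEventType = iota
	DeviceHealthy // assumed name; the review only shows the unhealthy constant
)

// Health strings used by this sketch (hypothetical local constants).
const (
	healthy   = "Healthy"
	unhealthy = "Unhealthy"
)

// Device is a hypothetical, simplified stand-in for the plugin's device type.
type Device struct {
	ID     string
	Health string
}

// DeviceEvent pairs a device with the transition reported for it.
type DeviceEvent struct {
	Device *Device
	Event  DeviceEventType
}

// Apply updates the device's health from the event and reports whether the
// state actually changed, so a caller can skip redundant updates.
func Apply(e *DeviceEvent) bool {
	next := healthy
	if e.Event == DeviceUnhealthy {
		next = unhealthy
	}
	if e.Device.Health == next {
		return false
	}
	e.Device.Health = next
	return true
}

func main() {
	d := &Device{ID: "GPU-0", Health: healthy}
	// A failure, a repeated failure report, then a recovery.
	for _, ev := range []DeviceEventType{DeviceUnhealthy, DeviceUnhealthy, DeviceHealthy} {
		changed := Apply(&DeviceEvent{Device: d, Event: ev})
		fmt.Printf("%s is %s (changed=%t)\n", d.ID, d.Health, changed)
	}
}
```

Returning a changed flag is one possible design: it would let a server resend its device list only when a transition really happened.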

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| internal/rm/devices.go | Introduces DeviceEvent structure and DeviceEventType enum to support both healthy/unhealthy states |
| internal/rm/health.go | Adds support for GPU recovery and unavailable events, updates event handling logic |
| internal/rm/nvml_manager.go | Updates CheckHealth signature to use DeviceEvent instead of Device |
| internal/rm/rm.go | Updates ResourceManager interface to use DeviceEvent in CheckHealth method |
| internal/rm/tegra_manager.go | Updates CheckHealth signature for consistency with interface changes |
| internal/plugin/server.go | Modifies health event handling to support both healthy and unhealthy device transitions |
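
To make the signature change in the summary more concrete, here is a rough sketch of a `CheckHealth`-style method that sends `*DeviceEvent` values on a channel, together with a consumer that tracks the latest state per device. The `HealthChecker` interface, the `scriptedChecker` type, the `Collect` helper, the `DeviceHealthy` constant, and the `chan struct{}` stop channel are all assumptions made for illustration; the real `ResourceManager` interface may differ in its parameter types.

```go
package main

import "fmt"

type DeviceEventType int

const (
	DeviceUnhealthy DeviceEventType = iota
	DeviceHealthy // assumed name for the recovery case
)

// Device is a hypothetical, simplified stand-in for the plugin's device type.
type Device struct{ ID string }

type DeviceEvent struct {
	Device *Device
	Event  DeviceEventType
}

// HealthChecker mirrors the shape of a CheckHealth method that reports
// events instead of bare devices (hypothetical interface).
type HealthChecker interface {
	CheckHealth(stop <-chan struct{}, events chan<- *DeviceEvent) error
}

// scriptedChecker replays a fixed list of events; it stands in for a real
// NVML-backed implementation.
type scriptedChecker struct{ script []*DeviceEvent }

func (c scriptedChecker) CheckHealth(stop <-chan struct{}, events chan<- *DeviceEvent) error {
	for _, e := range c.script {
		select {
		case <-stop:
			return nil
		case events <- e:
		}
	}
	return nil
}

// Collect runs the checker until it returns and records the last reported
// state for each device ID (true means healthy).
func Collect(c HealthChecker) (map[string]bool, error) {
	stop := make(chan struct{})
	defer close(stop)
	events := make(chan *DeviceEvent)
	errc := make(chan error, 1)
	go func() {
		errc <- c.CheckHealth(stop, events)
		close(events)
	}()
	state := map[string]bool{}
	for e := range events {
		state[e.Device.ID] = e.Event == DeviceHealthy
	}
	return state, <-errc
}

func main() {
	gpu0, gpu1 := &Device{ID: "GPU-0"}, &Device{ID: "GPU-1"}
	state, err := Collect(scriptedChecker{script: []*DeviceEvent{
		{Device: gpu0, Event: DeviceUnhealthy},
		{Device: gpu1, Event: DeviceUnhealthy},
		{Device: gpu0, Event: DeviceHealthy},
	}})
	fmt.Println(state, err)
}
```

In this sketch a single channel carries both kinds of transition, so the consumer needs no second channel for recoveries.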

type DeviceEventType int

const (
DeviceUnHalthy DeviceEventType = iota

Copilot AI Aug 12, 2025

The constant name 'DeviceUnHalthy' is misspelled: it drops the 'e' and capitalizes the 'H'. It should be 'DeviceUnhealthy'.

Suggested change
- DeviceUnHalthy DeviceEventType = iota
+ DeviceUnhealthy DeviceEventType = iota

- unhealthy <- d
+ unhealthy <- &DeviceEvent{
+ Device: d,
+ Event: DeviceUnHalthy,

Copilot AI Aug 12, 2025

The constant name 'DeviceUnHalthy' is misspelled: it drops the 'e' and capitalizes the 'H'. It should be 'DeviceUnhealthy'.

- unhealthy <- d
+ unhealthy <- &DeviceEvent{
+ Device: d,
+ Event: DeviceUnHalthy,

Copilot AI Aug 12, 2025

The constant name 'DeviceUnHalthy' is misspelled: it drops the 'e' and capitalizes the 'H'. It should be 'DeviceUnhealthy'.

Suggested change
- Event: DeviceUnHalthy,
+ Event: DeviceUnhealthy,

klog.Infof("Gpu unavailable event: %+v", e)
unhealthy <- &DeviceEvent{
Device: d,
Event: DeviceUnHalthy,

Copilot AI Aug 12, 2025

The constant name 'DeviceUnHalthy' is misspelled: it drops the 'e' and capitalizes the 'H'. It should be 'DeviceUnhealthy'.

klog.Infof("'%s' device marked unhealthy: %s", plugin.rm.Resource(), d.ID)
if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: plugin.apiDevices()}); err != nil {
return nil
if d.Event == rm.DeviceUnHalthy {

Copilot AI Aug 12, 2025

The constant name 'DeviceUnHalthy' is misspelled: it drops the 'e' and capitalizes the 'H'. It should be 'DeviceUnhealthy'.

Suggested change
- if d.Event == rm.DeviceUnHalthy {
+ if d.Event == rm.DeviceUnhealthy {

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 13, 2025