Skip to content

Conversation

@ArangoGutierrez
Copy link
Collaborator

Fixes #1340

@ArangoGutierrez ArangoGutierrez self-assigned this Aug 12, 2025
Copilot

This comment was marked as outdated.

Copilot

This comment was marked as outdated.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Copilot

This comment was marked as outdated.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds advanced health check configuration to the config file system, addressing issue #1340. It enhances the health checking mechanism to support configuration-based control over device health monitoring.

  • Introduces a new Health configuration type with support for event types, ignored XIDs, and critical XID definitions
  • Maintains backward compatibility with the existing DP_DISABLE_HEALTHCHECKS environment variable
  • Refactors existing health check code to use the new configuration system

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
api/config/v1/health.go Implements the new Health configuration struct with JSON/YAML marshaling support
api/config/v1/health_test.go Comprehensive tests for the Health configuration functionality
api/config/v1/config.go Integrates Health configuration into the main Config struct
internal/rm/health.go Refactors health checking logic to use the new configuration system
internal/rm/health_config_test.go Tests for the refactored health configuration integration
internal/rm/health_test.go Updates existing test function names for clarity

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.

if other == nil {
return
}

Copy link

Copilot AI Aug 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The Merge method documentation explains this behavior, but the unconditional assignment of Disabled might be surprising. Consider adding an inline comment to clarify that this is intentional even when other.Disabled is false.

Suggested change
// Intentionally override Disabled unconditionally, even if other.Disabled is false.
// See method documentation above for rationale.

Copilot uses AI. Check for mistakes.

@elezar
Copy link
Member

elezar commented Aug 13, 2025

Before continuing on this, it would be good to iterate on the config API (as well as the envvars as discussed in #1335 (comment)) so that we can provide the best user experience.

The questions that I would like to see answered:

  • What is the precedence between the different config options?
  • How would we add additional options to the the config for future more advanced health checks / management?


// Validate event types
validEventTypes := map[string]bool{
"EventTypeXidCriticalError": true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArangoGutierrez rn, we only look at XID and ECC errors should we also consider GPU power and temp?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good discussion, but out of scope for this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add advanced health check configuration to config file

3 participants