-
Notifications
You must be signed in to change notification settings - Fork 744
Add advanced health check configuration to config file #1360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ffed560 to
619da22
Compare
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
619da22 to
0fff5ef
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds advanced health check configuration to the config file system, addressing issue #1340. It enhances the health checking mechanism to support configuration-based control over device health monitoring.
- Introduces a new
Healthconfiguration type with support for event types, ignored XIDs, and critical XID definitions - Maintains backward compatibility with the existing
DP_DISABLE_HEALTHCHECKSenvironment variable - Refactors existing health check code to use the new configuration system
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
api/config/v1/health.go |
Implements the new Health configuration struct with JSON/YAML marshaling support |
api/config/v1/health_test.go |
Comprehensive tests for the Health configuration functionality |
api/config/v1/config.go |
Integrates Health configuration into the main Config struct |
internal/rm/health.go |
Refactors health checking logic to use the new configuration system |
internal/rm/health_config_test.go |
Tests for the refactored health configuration integration |
internal/rm/health_test.go |
Updates existing test function names for clarity |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
| if other == nil { | ||
| return | ||
| } | ||
|
|
Copilot
AI
Aug 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The Merge method documentation explains this behavior, but the unconditional assignment of Disabled might be surprising. Consider adding an inline comment to clarify that this is intentional even when other.Disabled is false.
| // Intentionally override Disabled unconditionally, even if other.Disabled is false. | |
| // See method documentation above for rationale. |
Copilot uses AI. Check for mistakes.
|
Before continuing on this, it would be good to iterate on the config API (as well as the envvars as discussed in #1335 (comment)) so that we can provide the best user experience. The questions that I would like to see answered:
|
|
|
||
| // Validate event types | ||
| validEventTypes := map[string]bool{ | ||
| "EventTypeXidCriticalError": true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ArangoGutierrez rn, we only look at XID and ECC errors should we also consider GPU power and temp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good discussion, but out of scope for this PR
Fixes #1340