k8s datastore, CRD generation from proto bufs#882
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds CRD support and generation: new protoc-gen-crd tooling pin, extended health_event.proto (status/operation/CRD types), generated CRD YAML and Helm chart, Makefile targets for CRD/plugin/vendor protos, widespread go.mod bumps, and migration of internal status types to protobuf-backed representations across code and tests. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Kubernetes
participant Operator
participant Datastore
Client->>Kubernetes: create/update HealthEventResource (CRD)
Kubernetes->>Operator: watch/notify resource change
Operator->>Datastore: read/write HealthEvent (protos.HealthEvent / protos.HealthEventStatus)
Datastore->>Datastore: protojson marshal -> JSON map for persistence
Operator->>Operator: evaluate NodeQuarantined / userPodsEvictionStatus
Operator->>Datastore: persist updated status (proto-backed JSON)
Kubernetes-->>Client: status reflected on resource
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
cae5408 to
a780576
Compare
a780576 to
ea71515
Compare
| @@ -15,11 +15,46 @@ | |||
| syntax = "proto3"; | |||
There was a problem hiding this comment.
the reason we moved this to a proto is so that it can be used with the proto crd generator right?
There was a problem hiding this comment.
Correct
We are using https://github.com/yandex/protoc-gen-crd to generate the CRD from the proto file.
Moved HealthEventWithStatus into proto from health_event_extentions.go.
// Status types are now generated from proto definitions for single source of truth.
// See data-models/protobufs/health_event.proto for HealthEventStatus and related types.
type HealthEventWithStatus struct {
CreatedAt time.Time bson:"createdAt"
HealthEvent *protos.HealthEvent bson:"healthevent,omitempty"
HealthEventStatus *protos.HealthEventStatus bson:"healtheventstatus"
}
6f6a6f7 to
0ba16c0
Compare
| expectJSONInclude: []string{"userpodsevictionstatus"}, | ||
| expectJSONOmit: []string{"nodeQuarantined", "faultRemediated", "lastRemediationTimestamp"}, | ||
| expectJSONInclude: []string{"userPodsEvictionStatus"}, | ||
| expectBSONInclude: []string{"userpodsevictionstatus"}, |
There was a problem hiding this comment.
mango DB tags are by default, all lowercase.
Need some input if this has any impact, after moving the healthEventStatus to proto's, they dont have bson tags.
There was a problem hiding this comment.
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (11)
labeler/go.mod (2)
12-16:⚠️ Potential issue | 🟠 Major
k8s.io/kubernetes v1.32.10is three minor versions behindk8s.io/api v0.35.1— replace directives are missing.
k8s.io/apiuses a0.X.Yversioning scheme that maps directly tok8s.io/kubernetes v1.X.Y. Herek8s.io/api v0.35.1(= Kubernetes 1.35) whilek8s.io/kubernetes v1.32.10(= Kubernetes 1.32) creates a 3-minor-version skew:
k8s.io/kubernetesis a "super-module" whose owngo.modcontains replace directives that pin all staging packages (k8s.io/api,k8s.io/apimachinery,k8s.io/client-go, …) tov0.32.x. Those replace directives do not propagate to downstream consumers — they must be replicated in thisgo.mod.- Without them, Go's MVS selects
k8s.io/api v0.35.1(the higher explicit requirement), butk8s.io/kubernetes v1.32.10was built againstk8s.io/api v0.32.10, causing API-surface mismatches at link or runtime.- Importing
k8s.io/kubernetesas a direct dependency is widely discouraged; the recommended approach is to depend only onk8s.io/api,k8s.io/client-go, andk8s.io/apimachinery.Either align
k8s.io/kubernetestov1.35.x(matchingk8s.io/api v0.35.1) with the corresponding replace directives, or remove the directk8s.io/kubernetesdependency entirely and use only the staging modules.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@labeler/go.mod` around lines 12 - 16, The go.mod currently lists k8s.io/kubernetes v1.32.10 which skews three minor versions from k8s.io/api v0.35.1; fix by removing the direct k8s.io/kubernetes dependency (preferred): delete the k8s.io/kubernetes line, ensure no code imports k8s.io/kubernetes, run go mod tidy so only k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go remain; alternatively, if you must keep k8s.io/kubernetes, update it to the matching v1.35.x (e.g., v1.35.1) and add the replace directives from the k8s.io/kubernetes go.mod that pin staging modules (k8s.io/api, k8s.io/apimachinery, k8s.io/client-go, etc.) to the corresponding v0.35.x versions so MVS stays consistent.
71-71:⚠️ Potential issue | 🟡 Minor
k8s.io/apiextensions-apiserver v0.34.3is one minor version behind the otherk8s.io/*dependencies at v0.35.1.All
k8s.io/staging modules should track the same minor version. Since this PR introduces CRD support, runningv0.34.3alongsidek8s.io/api@v0.35.1,k8s.io/apimachinery@v0.35.1, andk8s.io/client-go@v0.35.1risks subtle type incompatibilities. Bump tov0.35.1to align with the rest of the Kubernetes dependencies.🔧 Proposed fix
- k8s.io/apiextensions-apiserver v0.34.3 // indirect + k8s.io/apiextensions-apiserver v0.35.1 // indirectThen run
go mod tidyinsidelabeler/to updatego.sum.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@labeler/go.mod` at line 71, Update the k8s.io staging module version in labeler/go.mod: change the k8s.io/apiextensions-apiserver dependency from v0.34.3 to v0.35.1 so it matches the other k8s.io/* deps (k8s.io/api, k8s.io/apimachinery, k8s.io/client-go), then run go mod tidy inside the labeler/ module to refresh go.sum; look for the k8s.io/apiextensions-apiserver entry in labeler/go.mod to make the edit.tilt/simple-health-client/go.mod (1)
22-24:⚠️ Potential issue | 🟡 MinorOrphaned
replacedirective forstore-client— no correspondingrequireentry.
github.com/nvidia/nvsentinel/store-clientappears in neither the direct nor indirectrequireblock of this module. Areplacedirective for a module that is not required is silently ignored by Go toolchain and has no effect. This suggests either:
- A
// indirectrequire entry forstore-clientis missing (i.e.,go mod tidywas not run), or- The directive was copy-pasted from another service's
go.modand is not needed here.Please run
go mod tidyintilt/simple-health-client/to reconcile, and remove the directive ifstore-clientis truly not a transitive dependency.🛠️ Proposed fix if `store-client` is not actually needed
replace github.com/nvidia/nvsentinel/data-models => ../../data-models - -replace github.com/nvidia/nvsentinel/store-client => ../../store-client🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tilt/simple-health-client/go.mod` around lines 22 - 24, The replace directive for module github.com/nvidia/nvsentinel/store-client in the go.mod is orphaned because there is no corresponding require; run `go mod tidy` in tilt/simple-health-client to let Go reconcile missing requires and then remove the `replace github.com/nvidia/nvsentinel/store-client => ../../store-client` entry if tidy shows it is not a transitive dependency. If tidy indicates the module is required indirectly, add the resulting `require github.com/nvidia/nvsentinel/store-client vX.Y.Z // indirect` (as produced by tidy) instead of leaving a stray replace; otherwise delete the replace line referencing store-client.node-drainer/pkg/reconciler/reconciler.go (2)
909-917: 🛠️ Refactor suggestion | 🟠 MajorInconsistent serialization: proto struct used directly instead of protojson round-trip.
updateNodeUserPodsEvictedStatus(line 584) usesprotojson.Marshal→json.Unmarshalto convert the proto message to amap[string]interface{}before storing in MongoDB. Here,protos.OperationStatusis embedded directly as a struct value. This produces different serialization behavior — the BSON encoder will serialize exported Go fields and their struct tags, which may differ from protojson output (e.g.,omitemptyhandling, field naming for nested types). Additionally, proto messages should be passed by pointer, not by value.Consider reusing
updateNodeUserPodsEvictedStatusor applying the same protojson round-trip here for consistency.♻️ Proposed fix — use protojson round-trip or call the shared helper
- update := map[string]any{ - "$set": map[string]any{ - "healtheventstatus.userpodsevictionstatus": protos.OperationStatus{ - Status: &protos.Status{Value: string(model.StatusFailed)}, - Message: reason, - }, - }, - } + failedStatus := &protos.OperationStatus{ + Status: &protos.Status{Value: string(model.StatusFailed)}, + Message: reason, + } + + jsonBytes, err := protojson.Marshal(failedStatus) + if err != nil { + return fmt.Errorf("failed to marshal drain failed status: %w", err) + } + + var statusMap map[string]interface{} + if err := json.Unmarshal(jsonBytes, &statusMap); err != nil { + return fmt.Errorf("failed to unmarshal drain failed status: %w", err) + } + + update := map[string]any{ + "$set": map[string]any{ + "healtheventstatus.userpodsevictionstatus": statusMap, + }, + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 909 - 917, The update embeds a protos.OperationStatus value directly into the BSON update, causing inconsistent serialization and using a non-pointer proto; change this to perform the same protojson round-trip used by updateNodeUserPodsEvictedStatus (protojson.Marshal the *protos.OperationStatus, then json.Unmarshal into map[string]interface{}) or invoke the existing helper that returns the marshaled map, and ensure you pass a pointer to protos.OperationStatus (e.g., &protos.OperationStatus{...}) so the stored document uses the same JSON field names and omitempty behavior as elsewhere.
148-157:⚠️ Potential issue | 🔴 CriticalGuard the nil chain when accessing
UserPodsEvictionStatus.Status.Value.Line 151 dereferences through two nested pointer fields without nil checks. If an event unmarshals without the
userpodsevictionstatusfield set, this will panic. The parser initializesNodeQuarantinedwith a default value when nil, but does not do the same forUserPodsEvictionStatus, creating an asymmetric nil-guard situation.Add defensive checks before accessing nested Status fields:
- if isTerminalStatus(model.Status(healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status.Value)) { + evictionStatus := healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus + if evictionStatus != nil && evictionStatus.Status != nil && + isTerminalStatus(model.Status(evictionStatus.Status.Value)) {Apply the same guard to other accesses of
UserPodsEvictionStatusat lines 293–295 (handleEventCancellation), and 377, 510, 520, 761 (status mutation methods).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 148 - 157, The code currently dereferences healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status.Value without nil checks which can panic if UserPodsEvictionStatus or its Status is nil; update the guard in the reconciler pre-processing (the block that calls isTerminalStatus with model.Status(healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status.Value)) to first check that healthEventWithStatus.HealthEventStatus != nil, that UserPodsEvictionStatus != nil, and that UserPodsEvictionStatus.Status != nil before reading Value, and treat missing status as non-terminal; apply the same defensive nil-check pattern to all other locations mentioned (handleEventCancellation and the status mutation methods referenced) to avoid panics when UserPodsEvictionStatus is absent.fault-remediation/go.mod (1)
3-5:⚠️ Potential issue | 🟡 MinorUpdate the
toolchaindirective to the latest Go 1.25 patch for security fixes.
go1.25.7(released 2026-02-04) includes security fixes to thegocommand andcrypto/tls, whilego1.25.6(released 2026-01-15) included fixes toarchive/zip,crypto/tls, andnet/url. The declaredtoolchain go1.25.3predates both and leaves those vulnerabilities in the toolchain used for builds.🛡️ Proposed fix
-toolchain go1.25.3 +toolchain go1.25.7🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/go.mod` around lines 3 - 5, Update the toolchain directive to the latest Go 1.25 patch: replace the current "toolchain go1.25.3" entry in the go.mod (alongside the existing "go 1.25.0" directive) with "toolchain go1.25.7" so builds use the Go 1.25.7 toolchain that includes the security fixes; ensure no other toolchain lines remain conflicting and run go env or a quick build to verify the new toolchain is picked up.Makefile (1)
279-306:⚠️ Potential issue | 🟠 Major
protos-generatewithout a clean step can leave stale generated files. If any.protois removed/renamed, old*.pb.go/*_pb2.pycan stick around and create confusing diffs or even compile-time collisions. Consider either restoringprotos-cleanas a prerequisite or making clean opt-in via a flag.One option: opt-in clean
.PHONY: protos-generate protos-generate: ## Generate protobuf files from .proto sources `@echo` "Generating protobuf files in data-models (Go) and gpu-health-monitor (Python)..." + `@if` [ "$(PROTO_CLEAN)" = "1" ]; then \ + $(MAKE) protos-clean; \ + fi `@echo` "=== Tool Versions ===" `@echo` "Go: $$(go version)" `@echo` "protoc: $$(protoc --version)" `@echo` "protoc-gen-go: $$(protoc-gen-go --version)" `@echo` "protoc-gen-go-grpc: $$(protoc-gen-go-grpc --version)"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 279 - 306, The protos-generate Makefile target can leave stale generated files; update the protos-generate target to either depend on an existing protos-clean target or honor an opt-in CLEAN flag: add protos-clean as a prerequisite to protos-generate (so protos-clean runs automatically) or implement a conditional that runs the protos-clean recipe when CLEAN=true (check the protos-generate recipe and reference the protos-clean target and the CLEAN variable), ensuring generated outputs are removed before regenerating to avoid leftover *.pb.go and *_pb2.py artifacts.platform-connectors/go.mod (1)
24-36:⚠️ Potential issue | 🟠 MajorRemove
protoc-gen-crdor pin it intools.goif CRD generation is performed. This module has no direct usage of or reference toprotoc-gen-crd, notools.gofile, and only carries it as a transitive indirect dependency. If platform-connectors does not run CRD generation, remove this dependency to keep the module manifest clean and explicit. If generation does occur, create atools.gofile with appropriate// +build toolsdirective to document the tool dependency.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform-connectors/go.mod` around lines 24 - 36, The go.mod currently carries github.com/yandex/protoc-gen-crd v1.1.0 as an indirect dependency but the module has no CRD generation code or a tools declaration; either remove the protoc-gen-crd line from go.mod to drop the unused transitive dependency, or if you do generate CRDs, add a tools.go file (with the // +build tools build tag) that imports github.com/yandex/protoc-gen-crd so the tool is explicitly pinned and referenced; update go.mod via go mod tidy after making the change to ensure the module graph is consistent.node-drainer/pkg/evaluator/evaluator.go (1)
62-74:⚠️ Potential issue | 🔴 CriticalUse safe getters or nil guards to prevent panic on unset proto fields. Direct dereference of
healthEvent.HealthEventStatus.UserPodsEvictionStatus.Status.Valuewill panic if either intermediate pointer is nil. Proto3 fields are optional and can be unset in production; tests demonstrate this can occur (e.g., emptyHealthEventStatus{}). Safe proto getters (GetUserPodsEvictionStatus(),GetStatus()) are available in the generated code.Proposed fix using safe getters
statusPtr := healthEvent.HealthEventStatus.NodeQuarantined if statusPtr != nil && statusPtr.Value == string(model.UnQuarantined) { return &DrainActionResult{Action: ActionSkip}, nil } - if isTerminalStatus(model.Status(healthEvent.HealthEventStatus.UserPodsEvictionStatus.Status.Value)) { + evictionStatusValue := "" + if u := healthEvent.HealthEventStatus.GetUserPodsEvictionStatus(); u != nil { + evictionStatusValue = u.GetStatus().GetValue() + } + if isTerminalStatus(model.Status(evictionStatusValue)) { slog.Info("Event for node is in terminal state, skipping", "node", nodeName) return &DrainActionResult{ Action: ActionSkip, }, nil }Also applies to: 88-108
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/evaluator/evaluator.go` around lines 62 - 74, The code dereferences proto fields directly (e.g., healthEvent.HealthEventStatus.UserPodsEvictionStatus.Status.Value and HealthEventStatus.NodeQuarantined) which can be nil and cause panics; replace those direct accesses with the generated safe getters (e.g., GetHealthEventStatus(), GetNodeQuarantined(), GetUserPodsEvictionStatus(), GetStatus(), GetValue()) and add nil-check guards before calling isTerminalStatus or returning a DrainActionResult; update both the initial NodeQuarantined check and the later isTerminalStatus invocation (and the similar occurrences around lines 88–108) to use these getters so the function (including isTerminalStatus/DrainActionResult/ActionSkip paths) will not panic on unset proto fields.metadata-collector/go.mod (1)
12-56:⚠️ Potential issue | 🟠 MajorAddress version misalignment across go.mod files.
The repository has significant dependency version inconsistencies that need resolution:
- grpc: Most modules use v1.79.1, but
tilt/csp-api-mock/go.modpins v1.77.0- genproto/googleapis/rpc: Six different versions detected across modules (v0.0.0-20260209200024-4cfbd4190f57, v0.0.0-20251202230838-ff82c1b0f217, v0.0.0-20260128011058-8636f8732409, v0.0.0-20250603155806-513f23925822, v0.0.0-20251124214823-79d6a2a48846, v0.0.0-20250303144028-a0af3efb3deb)
- golang.org/x packages: Version drift in
api/,plugins/,tests/scale-tests/, andtilt/csp-api-mock/modules;health-monitors/csp-health-monitor/uses oauth2 v0.35.0 vs v0.34.0 elsewhereStandardize versions across all services to follow the semantic versioning principle outlined in the guidelines.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@metadata-collector/go.mod` around lines 12 - 56, Multiple go.mod files have inconsistent dependency versions (notably google.golang.org/grpc, google.golang.org/genproto/googleapis/rpc, and golang.org/x/* packages); standardize by choosing canonical versions (e.g., grpc v1.79.1 and the agreed genproto googleapis/rpc commit like v0.0.0-20260209200024-4cfbd4190f57 or whatever the repo policy picks) and update all go.mod files that deviate (example mismatches: tilt/csp-api-mock pins grpc v1.77.0, health-monitors/csp-health-monitor pins oauth2 v0.35.0). For each module with a mismatch, run go get <module>@<canonical-version> (e.g., go get google.golang.org/grpc@v1.79.1, go get google.golang.org/genproto/googleapis/rpc@<chosen-version>, go get golang.org/x/oauth2@<chosen-version>), then go mod tidy and run tests/CI to verify; repeat until all go.mod files list the canonical versions.fault-quarantine/pkg/eventwatcher/event_watcher.go (1)
240-246:⚠️ Potential issue | 🔴 CriticalCritical: MongoDB filters compare plain strings against nested
protos.Statussubdocuments — queries will never match.After migrating
NodeQuarantinedfrom*model.Status(a string alias) to*protos.Status(a struct with aValuefield), the BSON-stored format changed from a plain string to a nested object:healtheventstatus.nodequarantined: "Quarantined" → healtheventstatus.nodequarantined: {"value": "Quarantined"}However, the filters on lines 243–246 and 300–305 still compare against raw strings using the
$inoperator, which will never match the nested structure.Reading code (line 292) correctly accesses the nested field:
NodeQuarantined.Value, and the update (line 310) correctly serializes the struct, but the filters are misaligned.Proposed fix — use the nested `.value` field in filters
filter := map[string]interface{}{ "healthevent.nodename": nodeName, - "healtheventstatus.nodequarantined": map[string]interface{}{ + "healtheventstatus.nodequarantined.value": map[string]interface{}{ "$in": []interface{}{model.Quarantined, model.UnQuarantined}, }, }And similarly for the
updateFilter:- "healtheventstatus.nodequarantined": map[string]interface{}{ + "healtheventstatus.nodequarantined.value": map[string]interface{}{ "$in": []interface{}{model.Quarantined, model.AlreadyQuarantined}, },🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-quarantine/pkg/eventwatcher/event_watcher.go` around lines 240 - 246, The MongoDB filters compare raw strings against the protos.Status subdocument, so they never match; update the filters to target the nested field (healtheventstatus.nodequarantined.value) instead of the whole subdocument. Specifically, change the initial filter (the map assigned to filter referencing healtheventstatus.nodequarantined) and the updateFilter used later to use "healtheventstatus.nodequarantined.value" with $in (or direct equality) comparing to the string constants model.Quarantined and model.UnQuarantined; locate the variables named filter and updateFilter and align them with how NodeQuarantined.Value is read elsewhere (NodeQuarantined.Value) so queries match the stored {"value":"..."} shape.
🧹 Nitpick comments (14)
janitor/go.mod (1)
107-108: Aligngoogleapis/apiandgoogleapis/rpcto the same version snapshot.Both sibling submodules from the go-genproto monorepo should ideally track the same upstream snapshot. Currently,
googleapis/rpcis atv0.0.0-20260209200024-4cfbd4190f57(Feb 9, 2026) whilegoogleapis/apiremains atv0.0.0-20251202230838-ff82c1b0f217(Nov 25, 2025). Although dependencies resolve without conflict, bumpinggoogleapis/apito the matching Feb 2026 timestamp improves consistency and reduces the risk of type identity mismatches since these modules can exchange protobuf types.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@janitor/go.mod` around lines 107 - 108, The go.mod currently pins google.golang.org/genproto/googleapis/api at v0.0.0-20251202230838-ff82c1b0f217 while googleapis/rpc is at v0.0.0-20260209200024-4cfbd4190f57; update the googleapis/api dependency to the matching snapshot version v0.0.0-20260209200024-4cfbd4190f57 so both genproto submodules use the same upstream snapshot (change the module entry for google.golang.org/genproto/googleapis/api to the Feb 9, 2026 pseudo-version).janitor-provider/go.mod (1)
107-109: Thegenproto/googleapis/rpcsub-module version trails the basegenprotoby ~8 months — consider runninggo mod tidyto verify the dependency graph resolves cleanly.The pseudo-version timestamps diverge noticeably across these modules:
google.golang.org/genproto: 2025-06-03google.golang.org/genproto/googleapis/api: 2025-12-02google.golang.org/genproto/googleapis/rpc: 2026-02-09Since the base
genprotoand its sub-modules are independently versioned, this alone is not incorrect. However, the significant date gap could surface subtle incompatibilities if any transitively imported code relies on types defined in both modules.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@janitor-provider/go.mod` around lines 107 - 109, The go.mod shows mismatched pseudo-versions for google.golang.org/genproto and its sub-modules (googleapis/api and googleapis/rpc); run `go mod tidy` to refresh the dependency graph and ensure the transitive requirements resolve, then, if tidy doesn't align them, add an explicit `require` for google.golang.org/genproto@<latest-consistent-pseudo-version>` (and/or pin google.golang.org/genproto/googleapis/rpc to the same timestamped pseudo-version) so the base genproto and its googleapis/* sub-modules are consistent, regenerate go.sum, and verify `go build`/`go test` pass.node-drainer/pkg/reconciler/reconciler_integration_test.go (1)
1393-1396: Proto message used by value — consider using a pointer for consistency.
protos.HealthEventStatus{...}is a proto message stored by value in the map. Proto messages contain internal state fields (protoimpl.MessageState) that are not designed to be copied by value. In this test context it works because the struct is only used for JSON round-tripping, but using&protos.HealthEventStatus{...}would be more idiomatic and consistent with proto conventions.♻️ Suggested change
- "healtheventstatus": protos.HealthEventStatus{ + "healtheventstatus": &protos.HealthEventStatus{ NodeQuarantined: &protos.Status{Value: string(opts.nodeQuarantined)}, UserPodsEvictionStatus: &protos.OperationStatus{Status: &protos.Status{Value: string(model.StatusInProgress)}}, },🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler_integration_test.go` around lines 1393 - 1396, The test creates a proto value with protos.HealthEventStatus stored by value in the map which can incorrectly copy internal proto state; change the map entry to store a pointer by replacing protos.HealthEventStatus{...} with &protos.HealthEventStatus{...} and keep inner fields (NodeQuarantined, UserPodsEvictionStatus and their Status/OperationStatus values) the same so the map holds *protos.HealthEventStatus instead of a copied struct.node-drainer/pkg/reconciler/reconciler.go (1)
293-295: Unsafe cast takes address of protoValuefield.Line 295 does
(*model.Status)(&nodeQuarantinedStatus.Value)— this creates a pointer directly into the proto message's string field. While type-safe (both arestring), ifNodeQuarantinedis nil (line 293 assigns the pointer), the.Valueaccess on line 295 panics. The upstream parser reportedly validates for nilNodeQuarantined, but this is fragile and differs from line 166 which has the same pattern.Consider converting via a local variable instead of taking the address of the proto field:
♻️ Proposed fix — use a local copy
nodeQuarantinedStatus := healthEventWithStatus.HealthEventStatus.NodeQuarantined - - if r.isEventCancelled(eventID, nodeName, (*model.Status)(&nodeQuarantinedStatus.Value)) { + var nodeQuarantinedPtr *model.Status + if nodeQuarantinedStatus != nil { + s := model.Status(nodeQuarantinedStatus.Value) + nodeQuarantinedPtr = &s + } + + if r.isEventCancelled(eventID, nodeName, nodeQuarantinedPtr) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 293 - 295, The code dereferences a proto field address unsafely by passing (*model.Status)(&nodeQuarantinedStatus.Value) into r.isEventCancelled, which panics if NodeQuarantined is nil and creates a pointer into the proto's internal memory; fix by nil-checking nodeQuarantinedStatus and creating a local model.Status variable from nodeQuarantinedStatus.Value (e.g., v := model.Status(nodeQuarantinedStatus.Value)) then pass &v to r.isEventCancelled (use the symbols nodeQuarantinedStatus, isEventCancelled, eventID, nodeName to locate the call).fault-remediation/pkg/reconciler/reconciler.go (1)
141-143: TODO comment should reference a GitHub issue, not a developer name.
// TODO - Avinash add a helper function to check for supported actionsnames a developer rather than tracking via an issue. As per coding guidelines, TODO comments should reference issues in Go code.-// TODO - Avinash add a helper function to check for supported actions +// TODO(#<issue-number>): extract a helper to check for supported actions🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/pkg/reconciler/reconciler.go` around lines 141 - 143, Replace the developer-named TODO with an issue-reference TODO in the health event handling block: change the comment "// TODO - Avinash add a helper function to check for supported actions" to reference the GitHub issue that tracks this work (e.g. "TODO(#<issue-number>): add a helper function to check for supported actions") near the healthEventWithStatus check in reconciler.go so the note follows the project's TODO guideline and is discoverable; no code behavior changes required.fault-remediation/pkg/reconciler/reconciler_test.go (1)
538-611:TestShouldSkipEventis missing a table-driven case for the newFaultRemediated.GetValue() == trueskip path.The new guard added at
reconciler.go:142–144creates a new skip path: whenHealthEventStatus.FaultRemediatedis a non-nilBoolValuewith valuetrue. None of the four existing test cases or the table-drivenTestShouldSkipEventtest cover this scenario.// Add to the TestShouldSkipEvent table: { name: "Skip when FaultRemediated is true", nodeName: "test-node-remediated", recommendedAction: protos.RecommendedAction_RESTART_BM, groupConfig: getGroupConfig("restart", nil), shouldSkip: true, description: "Already-remediated events should be skipped", }, // and construct healthEventWithStatus with: // HealthEventStatus: &protos.HealthEventStatus{ // NodeQuarantined: &protos.Status{Value: string(model.Quarantined)}, // FaultRemediated: wrapperspb.Bool(true), // }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/pkg/reconciler/reconciler_test.go` around lines 538 - 611, Add a table-driven case to TestShouldSkipEvent that exercises the new skip path in reconciler.shouldSkipEvent when HealthEventStatus.FaultRemediated is a non-nil true BoolValue: add a test entry named like "Skip when FaultRemediated is true" with RecommendedAction RESTART_BM (or other non-NONE), groupConfig from getGroupConfig("restart", nil), shouldSkip=true, and build the healthEventWithStatus so HealthEventStatus includes FaultRemediated: wrapperspb.Bool(true) (and a NodeQuarantined status if required by the guard) before calling r.shouldSkipEvent; this ensures the new FaultRemediated==true branch is covered.store-client/pkg/datastore/providers/postgresql/health_events_serialization_test.go (1)
68-68: Simplify — take address of composite literal directly.The IIFE on line 68 (and line 80) can be replaced with a direct composite literal address:
♻️ Suggested simplification
- NodeQuarantined: func() *protos.Status { s := protos.Status{Value: string(model.Quarantined)}; return &s }(), + NodeQuarantined: &protos.Status{Value: string(model.Quarantined)},Same for line 80:
- NodeQuarantined: func() *protos.Status { s := protos.Status{Value: string(model.UnQuarantined)}; return &s }(), + NodeQuarantined: &protos.Status{Value: string(model.UnQuarantined)},🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@store-client/pkg/datastore/providers/postgresql/health_events_serialization_test.go` at line 68, Replace the IIFE used to set NodeQuarantined (and the similar pattern at the other occurrence) with a direct composite literal address; e.g. change the current NodeQuarantined: func() *protos.Status { s := protos.Status{Value: string(model.Quarantined)}; return &s }() to NodeQuarantined: &protos.Status{Value: string(model.Quarantined)} (and do the analogous replacement for the other status field).fault-quarantine/pkg/eventwatcher/event_watcher.go (1)
308-311: The$setupdate correctly uses&protos.Status{Value: "Cancelled"}, but prefer using themodel.Cancelledconstant for consistency.Line 310 uses the hardcoded string
"Cancelled"instead ofstring(model.Cancelled), which is used elsewhere in the codebase for this value.♻️ Suggested fix
- "healtheventstatus.nodequarantined": &protos.Status{Value: "Cancelled"}, + "healtheventstatus.nodequarantined": &protos.Status{Value: string(model.Cancelled)},🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-quarantine/pkg/eventwatcher/event_watcher.go` around lines 308 - 311, The update map sets healtheventstatus.nodequarantined using &protos.Status{Value: "Cancelled"}; change that to &protos.Status{Value: string(model.Cancelled)} for consistency with the rest of the codebase, update the map construction in event_watcher.go (the variable named update) and ensure the model package is imported where the change is made (or use the correct package alias if already imported).health-events-analyzer/pkg/parser/parser_test.go (2)
419-419: Remove commented-out code.Line 419 contains a commented-out field assignment that appears to be leftover from development. As per coding guidelines, commented-out code should be cleaned up.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-events-analyzer/pkg/parser/parser_test.go` at line 419, Remove the leftover commented-out field assignment in the test: delete the commented line referencing UserPodsEvictionStatus (the "//\tUserPodsEvictionStatus: &protos.OperationStatus{Status: &protos.Status{Value: string(model.StatusSucceeded)}},") from parser_test.go so the test file contains no commented-out dead code; ensure surrounding test data structures remain syntactically valid after removal (e.g., commas for preceding/next struct fields).
486-489: Useproto.Equalfor comparing protobuf messages in this test instead ofreflect.DeepEqual.The test compares
*protos.Statusvalues at lines 125 and 431 usingreflect.DeepEqual. Protobuf messages contain internal fields (sizeCache,unknownFields) that can cause false negatives. The reference test inhealth_events_serialization_test.gocorrectly usesproto.Equalfor this purpose.Import
google.golang.org/protobuf/protoand replace the comparison with a type-aware check, or use a helper that appliesproto.Equalspecifically for proto message types.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-events-analyzer/pkg/parser/parser_test.go` around lines 486 - 489, The test's equality check should use proto.Equal for protobuf messages instead of reflect.DeepEqual: import "google.golang.org/protobuf/proto" and in the test where getValueFromPath() returns got and you compare to tt.want (which can be a *protos.Status), replace the reflect.DeepEqual branch with a type-aware check that if both got and tt.want are proto.Message (or specifically *protos.Status) use proto.Equal(got.(proto.Message), tt.want.(proto.Message)); otherwise fall back to the existing DeepEqual logic and keep the same t.Errorf message when they differ.data-models/Makefile (3)
63-63: DuplicatePROTO_OPTIONSdeclaration.
PROTO_OPTIONSis assigned the same value on both Line 63 and Line 79. The second assignment overwrites the first with an identical value, making one of them redundant.Remove the duplicate declaration
-PROTO_OPTIONS = vendor/github.com/yandex/protoc-gen-crd/library/go/k8s/protoc_gen_crd/proto/crd.proto - - $(PROTO_OPTIONS):(Remove lines 79-80, keeping only the first declaration at Line 63.)
Also applies to: 79-79
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@data-models/Makefile` at line 63, Remove the redundant duplicate PROTO_OPTIONS declaration in the Makefile: keep the first assignment of PROTO_OPTIONS (the one at/near the top) and delete the second identical assignment (the later occurrence) so PROTO_OPTIONS is defined only once; locate the symbol PROTO_OPTIONS in the Makefile and remove the later duplicate line.
93-97: PHONY prerequisite forces CRD regeneration on every build.
install-plugin-remoteis declared.PHONY, so Make always considers it out of date. Because it's a prerequisite of the file target$(CRD_FILE), the CRD YAML will be regenerated (and the plugin re-downloaded viago install) on every invocation, even when the output is already up-to-date. This negates the benefit of using a file target.Consider either:
- Making
install-plugin-remotean order-only prerequisite (| install-plugin-remote), or- Using a stamp file for the plugin installation, or
- Checking if the binary already exists before installing.
Option 1: Use an order-only prerequisite
-$(CRD_FILE): vendor-proto protobufs/health_event.proto install-plugin-remote +$(CRD_FILE): vendor-proto protobufs/health_event.proto | install-plugin-remoteThis ensures the plugin is installed before the CRD is generated, but won't force regeneration if the CRD file is already newer than its real prerequisites.
Option 3: Guard the install with a binary check
install-plugin-remote: + `@if` command -v protoc-gen-crd >/dev/null 2>&1; then \ + echo "protoc-gen-crd already installed, skipping."; \ + else \ `@echo` "Installing protoc-gen-crd from github..." GO111MODULE=on go install github.com/yandex/protoc-gen-crd/cmd/protoc-gen-crd@$(PROTOC_GEN_CRD_VERSION) + fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@data-models/Makefile` around lines 93 - 97, The Makefile rule for generating $(CRD_FILE) lists the phony target install-plugin-remote as a normal prerequisite which forces re-generation on every run; change the dependency so the plugin installation does not mark the rule out-of-date by either making install-plugin-remote an order-only prerequisite (use | install-plugin-remote) in the $(CRD_FILE) rule, or replace the phony with a stamp/installed-file checked by the rule (e.g., create and depend on a plugin-install-stamp file produced after verifying the binary exists), or guard the install-plugin-remote recipe to skip go install when the plugin binary is already present; update the rule that references $(CRD_FILE) and the install-plugin-remote target accordingly.
125-125:protos-cleanshould not depend onvendor-proto.A clean target triggering
go mod vendoris semantically backwards — cleaning should only remove artifacts. Additionally, this dependency is already transitively satisfied through theprotos-generate → compile-crd-server → $(CRD_FILE) → vendor-protochain, so it's also redundant here.Remove `vendor-proto` from `protos-clean` dependencies
-protos-clean: vendor-proto crd-clean +protos-clean: crd-clean🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@data-models/Makefile` at line 125, The Makefile target "protos-clean" incorrectly depends on "vendor-proto" which runs go mod vendor; update the dependency list for the "protos-clean" target to remove "vendor-proto" so it only depends on cleanup targets (e.g., "crd-clean"); keep "vendor-proto" in the generation chain (it remains reachable via "protos-generate → compile-crd-server → $(CRD_FILE) → vendor-proto") and ensure the rule name "protos-clean" no longer lists "vendor-proto" as a prerequisite.data-models/protobufs/health_event.proto (1)
51-56: Clarify the relationship betweenHealthEventWithStatusandHealthEventResource.
HealthEventWithStatusbundlescreatedAt+HealthEvent+HealthEventStatus, whileHealthEventResource(the CRD type) holds onlyHealthEventas spec andHealthEventStatusas status — withoutcreatedAt. A brief comment onHealthEventWithStatusexplaining its intended use case (e.g., database/non-CRD storage) and how it relates to the CRD type would help future maintainers.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@data-models/protobufs/health_event.proto` around lines 51 - 56, Add a clear comment above the HealthEventWithStatus message that states this type is a convenience/DB-side wrapper (includes createdAt + HealthEvent + HealthEventStatus) intended for non-CRD storage and processing, and explicitly note that the CRD representation is HealthEventResource which stores HealthEvent under spec and HealthEventStatus under status and therefore does NOT include createdAt; mention any mapping expectations between HealthEventWithStatus and HealthEventResource (e.g., drop createdAt when converting to the CRD spec/status pair) so maintainers understand their relationship and intended uses.
| // Status represents operation and quarantine states for health events. | ||
|
|
||
| // Status represents operation and quarantine states for health events. | ||
| // Use string values to match Go constants exactly. | ||
| message Status { | ||
| string value = 1; | ||
| } |
There was a problem hiding this comment.
Duplicate comment and missing field comment on Status.
Lines 26 and 28 contain the same comment: "Status represents operation and quarantine states for health events." Remove the duplicate.
Additionally, Status.value (Line 31) lacks a field-level comment explaining what values are expected, their semantics, and constraints. The comment block on lines 34-35 listing valid constants should be placed directly above the value field (or the field should reference the constants inline). As per coding guidelines: "Include comprehensive comments for all fields in Protocol Buffer messages."
Proposed fix
-// Status represents operation and quarantine states for health events.
-
// Status represents operation and quarantine states for health events.
// Use string values to match Go constants exactly.
message Status {
+ // value holds the status string. Valid values: NotStarted, InProgress,
+ // Failed, Succeeded, AlreadyDrained, UnQuarantined, Quarantined,
+ // AlreadyQuarantined, Cancelled.
string value = 1;
}
-
-// Common status string constants (for documentation/reference):
-// NotStarted, InProgress, Failed, Succeeded, AlreadyDrained, UnQuarantined, Quarantined, AlreadyQuarantined, Cancelled🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@data-models/protobufs/health_event.proto` around lines 26 - 32, Remove the
duplicate top-line comment for the Status message and add a comprehensive
field-level comment for Status.value describing expected string values,
semantics, and constraints; move or repeat the list of valid constants (the
values referenced in the existing comment block) directly above or inline with
the value field so it's clear which constants Status.value may hold; update the
Status message comment and the value field comment to reference the exact Go
constants by name and clarify whether the field is required/optional and any
canonical forms.
| message OperationStatus { | ||
| Status status = 1; | ||
| string message = 2; | ||
| } | ||
|
|
||
| // HealthEventStatus represents the status fields for a health event. | ||
| message HealthEventStatus { | ||
| Status nodeQuarantined = 1; | ||
| OperationStatus userPodsEvictionStatus = 2; | ||
| google.protobuf.BoolValue faultRemediated = 3; | ||
| google.protobuf.Timestamp lastRemediationTimestamp = 4; | ||
| } | ||
|
|
||
| // HealthEventWithStatus combines HealthEvent and HealthEventStatus for CRD generation. | ||
| message HealthEventWithStatus { | ||
| google.protobuf.Timestamp createdAt = 1; | ||
| HealthEvent healthEvent = 2; | ||
| HealthEventStatus healthEventStatus = 3; | ||
| } |
There was a problem hiding this comment.
New messages are missing per-field comments.
The following new messages have fields without any documentation:
OperationStatus:status,messageHealthEventStatus:nodeQuarantined,userPodsEvictionStatus,faultRemediated,lastRemediationTimestampHealthEventWithStatus:createdAt,healthEvent,healthEventStatusHealthEventResource:spec,status
These fields are part of a CRD surface that will be visible to Kubernetes users (e.g., via kubectl describe). Document each field with its purpose, expected values, and semantics. As per coding guidelines: "Include comprehensive comments for all fields in Protocol Buffer messages."
Example for HealthEventStatus
// HealthEventStatus represents the status fields for a health event.
message HealthEventStatus {
+ // nodeQuarantined indicates the quarantine state of the node (e.g., Quarantined, UnQuarantined).
Status nodeQuarantined = 1;
+ // userPodsEvictionStatus tracks the eviction progress for user pods on the affected node.
OperationStatus userPodsEvictionStatus = 2;
+ // faultRemediated indicates whether the underlying fault has been remediated.
+ // Uses BoolValue wrapper to distinguish "not set" from "false".
google.protobuf.BoolValue faultRemediated = 3;
+ // lastRemediationTimestamp records when the last remediation attempt occurred.
google.protobuf.Timestamp lastRemediationTimestamp = 4;
}Also applies to: 125-135
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@data-models/protobufs/health_event.proto` around lines 38 - 56, Add
comprehensive per-field comments for the newly added protobuf messages so CRD
users can see clear documentation in kubectl; specifically update message
OperationStatus (document fields status — meaning and allowed values, and
message — human-readable details), HealthEventStatus (document nodeQuarantined —
what true/false means, userPodsEvictionStatus — semantics of OperationStatus
here, faultRemediated — when true and what it implies, lastRemediationTimestamp
— timestamp semantics), HealthEventWithStatus (document createdAt, healthEvent
and healthEventStatus fields), and the HealthEventResource message fields spec
and status (describe the desired state vs observed state and any invariants).
Put each comment immediately above the corresponding field using the same style
as other proto comments so they appear in generated CRD docs.
| spec: | ||
| nullable: true |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's find and examine the specific CRD file
find . -name "health_event.crd.yaml" -type f 2>/dev/null | head -5Repository: NVIDIA/NVSentinel
Length of output: 140
🏁 Script executed:
# Look for the proto definition mentioned
find . -name "health_event.proto" -type f 2>/dev/null | head -5Repository: NVIDIA/NVSentinel
Length of output: 103
🏁 Script executed:
# Check for CRD generator configuration
find . -name "*crd*" -o -name "*protobuf*" -o -name "*generate*" | grep -E "\.(yaml|yml|json|go)$" | head -20Repository: NVIDIA/NVSentinel
Length of output: 512
🏁 Script executed:
# Check file sizes to know how much to read
wc -l ./distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml ./data-models/protobufs/health_event.proto ./data-models/pkg/model/health_event_crd.goRepository: NVIDIA/NVSentinel
Length of output: 257
🏁 Script executed:
# Read the full CRD file to see the context
cat -n ./distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yamlRepository: NVIDIA/NVSentinel
Length of output: 5692
🏁 Script executed:
# Read the proto definition
cat -n ./data-models/protobufs/health_event.protoRepository: NVIDIA/NVSentinel
Length of output: 5090
🏁 Script executed:
# Read the CRD generator code
cat -n ./data-models/pkg/model/health_event_crd.goRepository: NVIDIA/NVSentinel
Length of output: 1772
🏁 Script executed:
# Check other CRDs for comparison
cat -n ./node-drainer/pkg/customdrain/testdata/drainrequest_crd.yaml | head -50Repository: NVIDIA/NVSentinel
Length of output: 1880
🏁 Script executed:
# Look for any CRD generation scripts or configuration
find . -name "*.go" -o -name "Makefile" -o -name "*.sh" | xargs grep -l "protoc-gen-crd\|health_event" | head -10Repository: NVIDIA/NVSentinel
Length of output: 520
🏁 Script executed:
# Check the Makefile for CRD generation commands
cat -n ./data-models/Makefile | grep -A 10 -B 5 "crd\|protoc"Repository: NVIDIA/NVSentinel
Length of output: 4583
🏁 Script executed:
# Check for any proto extensions or options that control CRD generation
rg -A 5 "protoc_gen_crd\|k8s_crd" ./data-models/protobufs/Repository: NVIDIA/NVSentinel
Length of output: 43
🏁 Script executed:
# Look for other generated CRDs to see if this pattern is consistent
find . -name "*.crd.yaml" -type f -exec grep -l "nullable: true" {} \;Repository: NVIDIA/NVSentinel
Length of output: 140
spec is marked as nullable: true — likely an artifact of the protoc-gen-crd generator and proto3's implicit field optionality.
Making spec nullable allows creating a HealthEventResource with no spec at all (spec: null). This is generated from proto3 message fields without explicit optional markers, which the generator converts to nullable OpenAPI schemas. Kubernetes conventions typically enforce spec as required.
Add required: ["spec"] (and status, which has the same issue on line 123) at the top-level properties in the OpenAPI schema to enforce that these fields must be present. Alternatively, consider updating the proto definition or generator configuration if regenerating the CRD.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`
around lines 30 - 31, The CRD's top-level schema currently allows spec (and
similarly status) to be null; update the HealthEventResource CRD
(health_event.crd.yaml) so the root OpenAPI schema declares required:
["spec","status"] under the top-level properties and remove or override
nullable: true for the spec and status properties (or at minimum keep nullable
false) to prevent creating objects with spec: null/status: null; target the
top-level schema entries named "spec" and "status" in the YAML and add the
required array there (or adjust the proto/gen config if you prefer
regenerating).
| generatedTimestamp: | ||
| type: string | ||
| format: RFC3339 |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
Kubernetes CRD openAPIV3Schema valid string formats date-time RFC3339
💡 Result:
In a Kubernetes CRD openAPIV3Schema, an RFC3339 timestamp is represented as an OpenAPI/JSON-Schema string with format: date-time:
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
startsAt:
type: string
format: date-time- The OpenAPI
date-timestring format is defined as RFC3339 “date-time” (with timezone). [1][2] - Kubernetes CRD validation guidance explicitly uses
type: string+format: date-timefor date-time fields (and then optionally compares them via CEL rules). [3] - Kubernetes’ built-in
metav1.Timeis published as OpenAPIstringwith formatdate-timeand marshals in RFC3339. [4]
Example valid values: 2026-02-20T13:45:30Z, 2026-02-20T13:45:30+00:00 (note +0000 is not RFC3339). [1]
Sources:
[1] OpenAPI Format Registry: date-time (RFC3339)
[2] OpenAPI Spec v3.0.4 (date-time per RFC3339)
[3] Kubernetes blog: CRD validation rules (example uses format: date-time)
[4] Kubernetes apimachinery Time OpenAPI schema format and RFC3339 formatting
🏁 Script executed:
find . -path "*/distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml" -type fRepository: NVIDIA/NVSentinel
Length of output: 140
🏁 Script executed:
cat -n ./distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml | sed -n '80,100p'Repository: NVIDIA/NVSentinel
Length of output: 890
🏁 Script executed:
cat -n ./distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml | sed -n '140,160p'Repository: NVIDIA/NVSentinel
Length of output: 652
format: RFC3339 is not a standard OpenAPI format — use date-time for proper Kubernetes validation.
Kubernetes CRD validation uses the OpenAPI 3.0 specification, which defines RFC 3339 timestamps with the format identifier date-time, not RFC3339. Kubernetes will silently ignore the unrecognized RFC3339 value and skip schema validation on these fields.
The same issue applies to lastRemediationTimestamp on line 149.
♻️ Proposed fix
generatedTimestamp:
type: string
- format: RFC3339
+ format: date-time lastRemediationTimestamp:
type: string
- format: RFC3339
+ format: date-timeSince this YAML is generated by protoc-gen-crd, this should be fixed in the CRD generator configuration.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| generatedTimestamp: | |
| type: string | |
| format: RFC3339 | |
| generatedTimestamp: | |
| type: string | |
| format: date-time |
| generatedTimestamp: | |
| type: string | |
| format: RFC3339 | |
| lastRemediationTimestamp: | |
| type: string | |
| format: date-time |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`
around lines 89 - 91, The CRD uses a nonstandard OpenAPI format value "RFC3339"
which Kubernetes will ignore; change the schema entries for generatedTimestamp
and lastRemediationTimestamp to use format: date-time instead of RFC3339, and
update the protoc-gen-crd generator configuration/templates so future generated
YAML emits "date-time" for timestamp fields (search for generatedTimestamp and
lastRemediationTimestamp in health_event.crd.yaml and in the generator templates
to apply the fix).
| nodeQuarantined := healthEventWithStatus.HealthEventStatus.NodeQuarantined | ||
|
|
||
| if nodeQuarantined != nil { | ||
| if *nodeQuarantined == model.UnQuarantined || *nodeQuarantined == model.Cancelled { | ||
| return r.handleCancellationEvent(ctx, nodeName, *nodeQuarantined, r.Watcher, event.ResumeToken) | ||
| if nodeQuarantined.Value == string(model.UnQuarantined) || nodeQuarantined.Value == string(model.Cancelled) { | ||
| return r.handleCancellationEvent(ctx, nodeName, model.Status(nodeQuarantined.Value), r.Watcher, event.ResumeToken) | ||
| } | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for ParseHealthEventFromEvent implementation
rg -n "func ParseHealthEventFromEvent" --type go -A 30Repository: NVIDIA/NVSentinel
Length of output: 2344
🏁 Script executed:
#!/bin/bash
# Check HealthEventWithStatus and HealthEventStatus definitions
rg -n "type HealthEventWithStatus\|type HealthEventStatus" --type go -A 5 -B 2Repository: NVIDIA/NVSentinel
Length of output: 43
🏁 Script executed:
#!/bin/bash
# Look at the actual code around line 118-124 in reconciler.go
cat -n fault-remediation/pkg/reconciler/reconciler.go | sed -n '110,150p'Repository: NVIDIA/NVSentinel
Length of output: 1820
🏁 Script executed:
#!/bin/bash
# Search for the nil guard at line 142 mentioned in review
rg -n "HealthEventStatus != nil" fault-remediation/pkg/reconciler/reconciler.go -A 3 -B 1Repository: NVIDIA/NVSentinel
Length of output: 360
🏁 Script executed:
#!/bin/bash
# Find HealthEventWithStatus struct definition
fd . data-models/ -name "*.go" | xargs grep -l "HealthEventWithStatus" | head -5Repository: NVIDIA/NVSentinel
Length of output: 291
🏁 Script executed:
#!/bin/bash
# Search more broadly for HealthEventWithStatus definition
rg -n "type HealthEventWithStatus" --type goRepository: NVIDIA/NVSentinel
Length of output: 305
🏁 Script executed:
#!/bin/bash
# Get full ParseHealthEventFromEvent implementation
cat -n commons/pkg/eventutil/parser.go | head -100Repository: NVIDIA/NVSentinel
Length of output: 3641
🏁 Script executed:
#!/bin/bash
# Check where HealthEventWithStatus is instantiated
rg -n "HealthEventWithStatus{" --type go -A 3Repository: NVIDIA/NVSentinel
Length of output: 168
🏁 Script executed:
#!/bin/bash
# Check HealthEventWithStatus struct definition in data-models
cat -n data-models/pkg/model/health_event_extentions.go | sed -n '40,60p'Repository: NVIDIA/NVSentinel
Length of output: 533
🏁 Script executed:
#!/bin/bash
# Also check the protos definition
cat -n data-models/pkg/protos/health_event.pb.go | sed -n '330,345p'Repository: NVIDIA/NVSentinel
Length of output: 1011
🏁 Script executed:
#!/bin/bash
# Check store-client definition
cat -n store-client/pkg/datastore/types.go | sed -n '93,110p'Repository: NVIDIA/NVSentinel
Length of output: 937
🏁 Script executed:
#!/bin/bash
# Check the HealthEventStatus type to understand if it's nil by default
cat -n data-models/pkg/protos/health_event.proto | head -50Repository: NVIDIA/NVSentinel
Length of output: 134
🏁 Script executed:
#!/bin/bash
# Verify if there's documentation about when HealthEventStatus is populated
rg -n "HealthEventStatus" data-models/pkg/protos/health_event.proto -A 2 -B 2Repository: NVIDIA/NVSentinel
Length of output: 142
🏁 Script executed:
#!/bin/bash
# Find proto files
find . -name "*.proto" -type f | head -20Repository: NVIDIA/NVSentinel
Length of output: 181
🏁 Script executed:
#!/bin/bash
# Look for health_event definitions in proto directory
fd "health_event" --type fRepository: NVIDIA/NVSentinel
Length of output: 1234
🏁 Script executed:
#!/bin/bash
# Check line 73 context in parser.go more carefully
cat -n commons/pkg/eventutil/parser.go | sed -n '70,80p'Repository: NVIDIA/NVSentinel
Length of output: 445
🏁 Script executed:
#!/bin/bash
# Check what's in the health_event.proto file
cat -n data-models/protobufs/health_event.proto | head -100Repository: NVIDIA/NVSentinel
Length of output: 3867
🏁 Script executed:
#!/bin/bash
# Search for where HealthEventStatus is created/initialized
rg -n "HealthEventStatus\s*=" --type go -B 2 -A 2 | head -50Repository: NVIDIA/NVSentinel
Length of output: 43
🏁 Script executed:
#!/bin/bash
# Check the ParseHealthEventFromEvent more carefully - is the HealthEventStatus dereference safe?
cat -n commons/pkg/eventutil/parser.go | sed -n '60,85p'Repository: NVIDIA/NVSentinel
Length of output: 909
🏁 Script executed:
#!/bin/bash
# Verify if there are other code paths creating HealthEventWithStatus without going through ParseHealthEventFromEvent
rg -n "HealthEventWithStatus\s*{" --type go -B 2 -A 5Repository: NVIDIA/NVSentinel
Length of output: 174
🏁 Script executed:
#!/bin/bash
# Verify if there are other code paths creating HealthEventWithStatus
rg -n "HealthEventWithStatus" --type go | grep -E ":\s*=" | head -20Repository: NVIDIA/NVSentinel
Length of output: 2315
🏁 Script executed:
#!/bin/bash
# Check if HealthEventStatus is ever guaranteed to be non-nil elsewhere in reconciler.go
cat -n fault-remediation/pkg/reconciler/reconciler.go | sed -n '100,130p'Repository: NVIDIA/NVSentinel
Length of output: 1372
Nil pointer dereference risk on HealthEventStatus at line 118.
Line 118 dereferences healthEventWithStatus.HealthEventStatus.NodeQuarantined without checking if HealthEventStatus itself is nil. Although the nil check at line 120 guards nodeQuarantined, it does not guard the HealthEventStatus field. Since HealthEventStatus is a pointer in the protobuf definition and can be nil when unmarshaled from events lacking that field, this dereference will panic if HealthEventStatus is nil.
This is confirmed by the guard at line 142 (if healthEventWithStatus.HealthEventStatus != nil), which proves HealthEventStatus can be nil in the codebase post-parse.
🛡️ Proposed fix
- nodeQuarantined := healthEventWithStatus.HealthEventStatus.NodeQuarantined
- if nodeQuarantined != nil {
- if nodeQuarantined.Value == string(model.UnQuarantined) || nodeQuarantined.Value == string(model.Cancelled) {
- return r.handleCancellationEvent(ctx, nodeName, model.Status(nodeQuarantined.Value), r.Watcher, event.ResumeToken)
- }
- }
+ if healthEventWithStatus.HealthEventStatus != nil {
+ nodeQuarantined := healthEventWithStatus.HealthEventStatus.NodeQuarantined
+ if nodeQuarantined != nil {
+ if nodeQuarantined.Value == string(model.UnQuarantined) || nodeQuarantined.Value == string(model.Cancelled) {
+ return r.handleCancellationEvent(ctx, nodeName, model.Status(nodeQuarantined.Value), r.Watcher, event.ResumeToken)
+ }
+ }
+ }Also, the TODO comment at line 141 names a developer ("Avinash") instead of referencing an issue. Replace with // TODO - #<issue-number> add a helper function... per coding guidelines.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| nodeQuarantined := healthEventWithStatus.HealthEventStatus.NodeQuarantined | |
| if nodeQuarantined != nil { | |
| if *nodeQuarantined == model.UnQuarantined || *nodeQuarantined == model.Cancelled { | |
| return r.handleCancellationEvent(ctx, nodeName, *nodeQuarantined, r.Watcher, event.ResumeToken) | |
| if nodeQuarantined.Value == string(model.UnQuarantined) || nodeQuarantined.Value == string(model.Cancelled) { | |
| return r.handleCancellationEvent(ctx, nodeName, model.Status(nodeQuarantined.Value), r.Watcher, event.ResumeToken) | |
| } | |
| } | |
| if healthEventWithStatus.HealthEventStatus != nil { | |
| nodeQuarantined := healthEventWithStatus.HealthEventStatus.NodeQuarantined | |
| if nodeQuarantined != nil { | |
| if nodeQuarantined.Value == string(model.UnQuarantined) || nodeQuarantined.Value == string(model.Cancelled) { | |
| return r.handleCancellationEvent(ctx, nodeName, model.Status(nodeQuarantined.Value), r.Watcher, event.ResumeToken) | |
| } | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@fault-remediation/pkg/reconciler/reconciler.go` around lines 118 - 124,
healthEventWithStatus.HealthEventStatus is dereferenced without a nil check
causing a potential panic; before reading HealthEventStatus.NodeQuarantined,
guard that healthEventWithStatus.HealthEventStatus != nil (e.g., check and
assign nodeQuarantined only after that nil check) and then keep the existing
logic that calls r.handleCancellationEvent(ctx, nodeName,
model.Status(nodeQuarantined.Value), r.Watcher, event.ResumeToken) when
nodeQuarantined is non-nil and its Value equals UnQuarantined or Cancelled;
additionally replace the TODO that names a developer with the required issue
reference format (// TODO - #<issue-number> add a helper function...) per
guidelines.
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
Outdated
Show resolved
Hide resolved
| golang.org/x/sys v0.38.0 // indirect | ||
| golang.org/x/text v0.31.0 // indirect | ||
| google.golang.org/genproto/googleapis/rpc v0.0.0-20251124214823-79d6a2a48846 // indirect | ||
| github.com/yandex/protoc-gen-crd v1.1.0 // indirect |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Explore repository structure
ls -la | head -20
find . -maxdepth 2 -type f -name "go.mod" | head -10Repository: NVIDIA/NVSentinel
Length of output: 1467
🏁 Script executed:
#!/bin/bash
# Check the specific nccl-loopback go.mod file
cat preflight-checks/nccl-loopback/go.modRepository: NVIDIA/NVSentinel
Length of output: 890
🏁 Script executed:
#!/bin/bash
# Check data-models go.mod file
cat data-models/go.modRepository: NVIDIA/NVSentinel
Length of output: 1379
🏁 Script executed:
#!/bin/bash
# Search for protoc-gen-crd imports in generated .pb.go files
find . -name "*.pb.go" -type f | head -20
rg "yandex/protoc-gen-crd" --type=go -B2 -A2Repository: NVIDIA/NVSentinel
Length of output: 1458
🏁 Script executed:
#!/bin/bash
# Search for CRD annotations in proto files
find . -name "*.proto" -type f | xargs grep -l "k8s_crd" 2>/dev/null | head -10Repository: NVIDIA/NVSentinel
Length of output: 103
🏁 Script executed:
#!/bin/bash
# Check health_event.proto for k8s_crd option usage
cat data-models/protobufs/health_event.proto | head -60Repository: NVIDIA/NVSentinel
Length of output: 2209
🏁 Script executed:
#!/bin/bash
# Check nccl-loopback's actual imports from data-models
find preflight-checks/nccl-loopback -name "*.go" -type f | xargs grep -l "data-models" | head -5Repository: NVIDIA/NVSentinel
Length of output: 166
🏁 Script executed:
#!/bin/bash
# Check what's imported from data-models in nccl-loopback
grep -r "github.com/nvidia/nvsentinel/data-models" preflight-checks/nccl-loopback --include="*.go"Repository: NVIDIA/NVSentinel
Length of output: 282
protoc-gen-crd build-tool package has leaked into a runtime preflight check module.
The dependency chain is confirmed: nccl-loopback imports types from data-models/pkg/protos, which contains health_event.pb.go. This generated file imports github.com/yandex/protoc-gen-crd/library/go/k8s/protoc_gen_crd/proto because the original health_event.proto includes the CRD proto descriptor to support k8s_crd message options.
Since nccl-loopback is a runtime preflight check tool with no CRD generation responsibility, carrying a transitive dependency on a protobuf compiler plugin package:
- Unnecessarily couples the binary to Kubernetes CRD tooling.
- Increases supply-chain exposure.
- Violates module minimality principles.
The root cause is that CRD-annotated protos and base data-model protos coexist in the same data-models module, making it impossible for consumers to import runtime types without pulling in CRD generation infrastructure.
Factor CRD-specific proto options into a dedicated sub-package or separate module within data-models so runtime consumers like nccl-loopback can import base types without the CRD generation dependency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@preflight-checks/nccl-loopback/go.mod` at line 15, The build-tool package
github.com/yandex/protoc-gen-crd leaked into the runtime binary because
data-models’ generated file health_event.pb.go includes CRD proto options; fix
by separating CRD-annotated protos into a dedicated package/module so runtime
consumers (nccl-loopback) can depend only on base types: create a new
data-models submodule (e.g., data-models/base or data-models/proto-runtime) that
contains the plain health_event proto (without k8s_crd imports/options),
regenerate protobufs for that module so health_event.pb.go no longer imports
github.com/yandex/protoc-gen-crd, update nccl-loopback to import the new base
module/package, and remove the transitive protoc-gen-crd dependency from the
runtime module’s go.mod.
ac6d456 to
7298063
Compare
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
store-client/pkg/testutils/event_builder.go (1)
15-15:⚠️ Potential issue | 🟡 MinorAdd a package-level godoc comment.
The
testutilspackage has no doc comment. As per coding guidelines, package-level godoc is required for all Go packages.📝 Proposed fix
+// Package testutils provides database-agnostic event construction utilities for use in tests. package testutilsAs per coding guidelines: "Package-level godoc required for all Go packages".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@store-client/pkg/testutils/event_builder.go` at line 15, Add a package-level godoc comment for the testutils package by inserting a descriptive comment block immediately above the "package testutils" declaration; the comment should briefly describe the purpose of the testutils package (what helpers/utilities it provides and intended audience/usage) so it satisfies the package-level documentation requirement.node-drainer/pkg/reconciler/reconciler.go (1)
151-157:⚠️ Potential issue | 🔴 CriticalNew nil-pointer dereference on
UserPodsEvictionStatusinPreprocessAndEnqueueEvent
healthEventWithStatusis populated viajson.Unmarshalfrom a raw MongoDB document. Ifuserpodsevictionstatusis absent ornullin that document,healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatuswill be a nil*protos.OperationStatus. Accessing.Statusat lines 151 and 154 will then panic.This specific occurrence in
PreprocessAndEnqueueEventis not covered by the existing review comment on the same pattern inexecuteSkip,executeMarkAlreadyDrained,executeUpdateStatus, andhandleCancelledEvent.🐛 Proposed fix
- if isTerminalStatus(model.Status(healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status)) { - slog.Debug("Skipping event - already in terminal state", - "node", healthEventWithStatus.HealthEvent.NodeName, - "status", healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status) - - return nil - } + if evictionStatus := healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus; evictionStatus != nil && + isTerminalStatus(model.Status(evictionStatus.Status)) { + slog.Debug("Skipping event - already in terminal state", + "node", healthEventWithStatus.HealthEvent.NodeName, + "status", evictionStatus.Status) + + return nil + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 151 - 157, PreprocessAndEnqueueEvent dereferences healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus without guarding for nil, which can panic when the MongoDB document omits userpodsevictionstatus; update PreprocessAndEnqueueEvent to check that healthEventWithStatus.HealthEventStatus != nil and healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus != nil before calling isTerminalStatus or accessing .Status, and only call slog.Debug with the status when the pointer is non-nil (otherwise treat as non-terminal or log that the eviction status is missing) so the function no longer panics on null/missing UserPodsEvictionStatus.
♻️ Duplicate comments (11)
tilt/simple-health-client/go.mod (1)
14-18: Verifyprotoc-gen-crdis truly required at runtimeSame concern as prior: if this is only for CRD annotations in generated protos, consider keeping it build-only to avoid leaking tooling into the client module.
#!/bin/bash # Check where protoc-gen-crd is imported or referenced. rg -n "protoc-gen-crd" -g '*.go' rg -n "yandex/protoc-gen-crd" -g 'go.mod'Based on learnings: Keep Go dependencies minimal and up-to-date.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tilt/simple-health-client/go.mod` around lines 14 - 18, The go.mod entry for github.com/yandex/protoc-gen-crd appears to be a tooling dependency (protoc plugin) rather than a runtime import; confirm there are no imports of "protoc-gen-crd" in Go sources (search for protoc-gen-crd or yandex/protoc-gen-crd), and if it is only used for code generation remove it from the module's runtime deps—either delete the indirect line from go.mod or move it into a dedicated tooling declaration (create a tools.go with a build tag like // +build tools and a blank import of github.com/yandex/protoc-gen-crd, or put the plugin into a separate tools/go.mod) so the client module does not carry the generator as a runtime dependency; update go.sum accordingly and run go mod tidy to finalize.data-models/protobufs/health_event.proto (1)
30-44:⚠️ Potential issue | 🟡 MinorAdd per-field comments for the new proto messages.
These fields are part of the CRD surface but are undocumented. Please add field-level comments for OperationStatus, HealthEventStatus, HealthEventWithStatus, and HealthEventResource.
✍️ Suggested field comments
message OperationStatus { + // status indicates the operation state (e.g., NotStarted, InProgress, Succeeded, Failed). string status = 1; + // message provides human-readable details for the operation status. string message = 2; } message HealthEventStatus { + // nodeQuarantined captures the node quarantine state (e.g., Quarantined, UnQuarantined). string nodeQuarantined = 1; + // quarantineFinishTimestamp records when quarantine completed. google.protobuf.Timestamp quarantineFinishTimestamp = 2; + // userPodsEvictionStatus tracks eviction progress for user pods. OperationStatus userPodsEvictionStatus = 3; + // drainFinishTimestamp records when drain completed. google.protobuf.Timestamp drainFinishTimestamp = 4; + // faultRemediated indicates whether the fault was remediated (BoolValue distinguishes unset). google.protobuf.BoolValue faultRemediated = 5; + // lastRemediationTimestamp records the last remediation attempt time. google.protobuf.Timestamp lastRemediationTimestamp = 6; } message HealthEventWithStatus { + // createdAt records when the health event was created. google.protobuf.Timestamp createdAt = 1; + // healthEvent is the event spec payload. HealthEvent healthEvent = 2; + // healthEventStatus is the observed status for the event. HealthEventStatus healthEventStatus = 3; } message HealthEventResource { option (protoc_gen_crd.k8s_crd) = { api_group: "healthevents.dgxc.nvidia.com", kind: "HealthEventResource", plural: "healtheventresources", singular: "healtheventresource", categories: ["nvidia", "gpu"] }; + // spec defines the desired state for the health event. HealthEvent spec = 1; + // status reflects the observed state for the health event. HealthEventStatus status = 2; }As per coding guidelines, "Include comprehensive comments for all fields in Protocol Buffer messages."
Also applies to: 46-51, 118-129
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@data-models/protobufs/health_event.proto` around lines 30 - 44, The new protobuf messages lack per-field comments; add clear field-level comments for OperationStatus, HealthEventStatus, HealthEventWithStatus, and HealthEventResource describing purpose and semantics of each field (e.g., what "status" and "message" in OperationStatus represent, meaning of nodeQuarantined, quarantineFinishTimestamp, userPodsEvictionStatus, drainFinishTimestamp, faultRemediated, lastRemediationTimestamp in HealthEventStatus, and each field in HealthEventWithStatus and HealthEventResource). Locate the messages by name in the proto and add concise, guideline-compliant comments above each field to document expected values, units (timestamps), and any nullable/optional semantics.distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml (2)
89-91:⚠️ Potential issue | 🟡 Minor
format: RFC3339is not a recognized OpenAPI 3.0 format — should bedate-time.Kubernetes CRD validation uses OpenAPI 3.0 spec, where RFC 3339 timestamps use the format identifier
date-time. The unrecognizedRFC3339value will be silently ignored, bypassing schema validation ongeneratedTimestamp(line 91),quarantineFinishTimestamp(line 130),drainFinishTimestamp(line 141), andlastRemediationTimestamp(line 147). Fix this in the protoc-gen-crd generator configuration.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml` around lines 89 - 91, The CRD uses an invalid OpenAPI format "RFC3339" for timestamp fields; update the protoc-gen-crd generator configuration so those fields emit format: "date-time" instead. Locate the generator template/config that produces the fields named generatedTimestamp, quarantineFinishTimestamp, drainFinishTimestamp, and lastRemediationTimestamp and replace the RFC3339 format token with the OpenAPI 3.0 compliant "date-time" token so schema validation is enforced.
29-31:⚠️ Potential issue | 🟡 Minor
specandstatusare bothnullable: true— allows creating resources withspec: null.Kubernetes conventions typically require
specto be present. Thenullable: trueon bothspec(line 31) andstatus(line 123) comes from proto3 implicit optionality, but should be overridden to enforce resource validity. Consider addingrequired: ["spec"]at the top-level schema or adjusting the generator config.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml` around lines 29 - 31, The CRD schema currently marks properties.spec and properties.status as nullable, allowing creation of resources with spec: null; update the top-level OpenAPI schema so that spec is required (add required: ["spec"] at the top-level schema) and remove or set nullable: false for properties.spec (and optionally properties.status) to enforce presence and validity; adjust the CRD generator configuration if needed so proto3 optionality does not emit nullable: true for spec/status.health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)
27-28:⚠️ Potential issue | 🔴 CriticalSame missing
crd_pb2dependency as in the dcgm-diag module — runtime import failure.Identical to the issue in
preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py: the Go-stylecrd_pb2import path does not resolve in Python. All downstream code importing this module will fail at runtime.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py` around lines 27 - 28, The generated import for crd_pb2 in health_event_pb2.py uses an invalid Go-style path (github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto.crd_pb2) which causes runtime ImportError; update the import in health_event_pb2.py to the correct Python package/module name used in this repo (or add a guarded try/except that falls back to the local crd_pb2 module), i.e., replace or wrap the failing import line that references github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2 so that the module imports the actual crd_pb2 Python module available at runtime.preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi (1)
5-6:⚠️ Potential issue | 🟡 Minor
crd_pb2import in.pyistub will break type-checking tools.Same root cause as the
.pyfiles — the Go-style import path doesn't resolve in Python. While this won't block runtime, it will cause errors in mypy/pyright if type checking is run in CI. This will be fixed when the.pyimport issue is resolved.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi` around lines 5 - 6, The .pyi stub imports a Go-style path "github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto.crd_pb2" (bound to _crd_pb2) which breaks type-checkers; change the import in health_event_pb2.pyi to match the valid Python module path used by the corresponding .py (or a relative import) so that _crd_pb2 resolves for mypy/pyright, or remove/replace the annotation with a forward-declared type if the module isn’t needed for type checking; update the import that defines _crd_pb2 in the top-level of the stub to the correct Python package name consistent with the generated .py files.preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py (1)
27-28:⚠️ Potential issue | 🔴 CriticalMissing
crd_pb2Python module — runtimeModuleNotFoundErroron import.Line 28 imports
crd_pb2via a Go-style module path (github.com.yandex.protoc_gen_crd...) that does not exist in the Python environment. This will cause aModuleNotFoundErrorat runtime, blocking all modules that importhealth_event_pb2(e.g.,health.py,errors.py,config.py,__main__.py).Either vendor the compiled
crd_pb2.pyPython module under the expected package path and add it topyproject.toml, or regenerate the protobuf stubs with a protoc configuration that avoids the Go-only CRD import for Python targets.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py` around lines 27 - 28, The generated health_event_pb2.py imports a Go-style module name (github...crd_pb2) that doesn't exist in Python and causes ModuleNotFoundError; fix by either vendoring a Python crd_pb2 module under the expected package path so the import in health_event_pb2.py resolves (and add it to pyproject.toml/packaging), or re-run protoc to regenerate the protobuf stubs for Python without the Go CRD import (adjust proto import paths/options so crd_pb2 is a proper Python package import); update health_event_pb2.py only by replacing the invalid import with the correct Python package path if you vendor the module.api/Makefile (1)
102-116: Missingcleanphony target — add an alias for convention compliance.The
protos-cleantarget exists but there's no top-levelcleantarget, which is expected by the repo's Makefile conventions (flagged by checkmake). Additionally,test(line 112) runs without-race, diverging from other modules likedata-modelsthat usegotestsumwith-raceand coverage.♻️ Proposed additions
.PHONY: lint-test lint-test: lint test ## Run lint and tests. `@echo` "Lint and tests completed." + +.PHONY: clean +clean: protos-clean ## Remove generated code.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/Makefile` around lines 102 - 116, Add a top-level phony clean target as an alias to the existing protos-clean to satisfy checkmake conventions: declare .PHONY: clean and implement clean: protos-clean (so callers can run make clean). Also update the test target to run tests with the race detector and match other modules—replace the plain go test invocation in the test target with a gotestsum (or go test) invocation that includes -race (e.g., gotestsum -- -race ./... or go test -race ./...) and keep verbose output; reference the existing targets/names protos-clean and test and tools gotestsum and the -race flag when making these edits.preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py (1)
27-28:⚠️ Potential issue | 🔴 CriticalSame missing
crd_pb2dependency — identical to dcgm-diag and gpu-health-monitor.All three Python modules share the same generated
health_event_pb2.pywith the unresolvable Go-stylecrd_pb2import. Fix this at the protobuf generation level so all Python targets are corrected simultaneously.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py` around lines 27 - 28, The generated health_event_pb2.py contains a broken Go-style import for crd_pb2; fix the .proto generation so Python outputs a proper python import for the CRD proto: add or set option python_package = "desired.python.module.name" (or ensure the CRD proto defines a python_package) in the CRD proto and regenerate so that health_event_pb2 imports crd_pb2 as a normal Python module (not the Go path), or adjust the protobuf generation rule (protoc invocation) to emit crd_pb2 into the same Python package/module namespace used by health_event_pb2; target the health_event_pb2.py file and the CRD proto (crd.proto / crd_pb2) when making this change.Makefile (1)
281-281:⚠️ Potential issue | 🟡 Minor
protos-generateno longer depends onprotos-clean— stale files may persist.Removing the
protos-cleanprerequisite means orphaned.pb.gofiles from renamed or deleted.protomessages won't be cleaned before regeneration. The author has acknowledged this needs refactoring due to vendor dependencies. Ensure this is addressed before merging to avoid subtle build issues from stale generated code.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` at line 281, The Makefile target protos-generate currently no longer depends on protos-clean, which allows stale .pb.go files to persist; restore cleaning by reintroducing protos-clean as a prerequisite of protos-generate or by invoking the cleaning logic at the start of the protos-generate recipe (e.g., call the protos-clean target or run the same cleanup commands before generation), ensuring any renamed or deleted .proto-derived files are removed prior to running the generate steps so generated outputs remain consistent.node-drainer/pkg/reconciler/reconciler.go (1)
375-380: Nil-pointer dereferences onUserPodsEvictionStatus— already flaggedThe patterns at lines 376–377, 509–510, 519–520, and 780–781 (
podsEvictionStatus := healthEvent.HealthEventStatus.UserPodsEvictionStatusfollowed bypodsEvictionStatus.Status = ...) all dereference a nil*protos.OperationStatusif the field is unset.Also applies to: 509-513, 519-520, 780-784
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 375 - 380, The code dereferences healthEvent.HealthEventStatus.UserPodsEvictionStatus (assigned to podsEvictionStatus) without checking for nil, causing possible nil-pointer panics; before setting podsEvictionStatus.Status (in the blocks around model.UnQuarantined and the other occurrences), add a nil check and initialize a new protos.OperationStatus when nil (so podsEvictionStatus = &protos.OperationStatus{Status: ...} or set Status after ensuring non-nil), then pass the non-nil podsEvictionStatus into r.updateNodeUserPodsEvictedStatus; apply this pattern to all occurrences referenced (the podsEvictionStatus assignments near updateNodeUserPodsEvictedStatus calls and similar blocks).
🧹 Nitpick comments (5)
store-client/pkg/testutils/event_builder.go (1)
36-36: Optional: drop themodel.Statusdependency in favour of a plainstring.
quarantineStatusis typed asmodel.Statusacross the struct, the builder field, the default value inNewTestEventBuilder, and theWithQuarantineStatusmethod, butprotos.HealthEventStatus.NodeQuarantinedis a plainstring. Sincemodel.Statusis itself atype … string, this creates an avoidable coupling to the legacy model package in a pure test-utility file.♻️ Proposed refactor
type TestEventBuilder struct { ... - quarantineStatus model.Status + quarantineStatus string } func NewTestEventBuilder() *TestEventBuilder { return &TestEventBuilder{ - quarantineStatus: model.Status("NotQuarantined"), + quarantineStatus: "NotQuarantined", ... } } -func (b *TestEventBuilder) WithQuarantineStatus(status model.Status) *TestEventBuilder { +func (b *TestEventBuilder) WithQuarantineStatus(status string) *TestEventBuilder { b.quarantineStatus = status return b } type testEvent struct { ... - quarantineStatus model.Status + quarantineStatus string }The
UnmarshalDocumentcall on line 133 then becomes simplyNodeQuarantined: e.quarantineStatuswithout the explicit cast.Also applies to: 82-85, 108-108
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@store-client/pkg/testutils/event_builder.go` at line 36, Change the quarantineStatus field and its API from model.Status to plain string: update the struct field quarantineStatus, the NewTestEventBuilder default value, and the WithQuarantineStatus(builder) method signature to use string instead of model.Status, and adjust usages (e.g. the UnmarshalDocument call where NodeQuarantined is set) to assign e.quarantineStatus directly without casting; update any other references in this file (including the builder field and default at NewTestEventBuilder and the WithQuarantineStatus implementation) to remove the legacy model dependency.health-events-analyzer/pkg/parser/parser_test.go (1)
419-419: Remove stale commented-out code.It adds noise and can drift from actual behavior.
🧹 Suggested cleanup
-// UserPodsEvictionStatus: &protos.OperationStatus{Status: &protos.Status{Value: string(model.StatusSucceeded)}},🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-events-analyzer/pkg/parser/parser_test.go` at line 419, Remove the stale commented-out line containing "UserPodsEvictionStatus: &protos.OperationStatus{Status: &protos.Status{Value: string(model.StatusSucceeded)}},” from parser_test.go (it’s dead/commented test data), keeping the surrounding test code intact and ensuring the test still compiles and passes; specifically remove that commented entry referencing UserPodsEvictionStatus, protos.OperationStatus, protos.Status and model.StatusSucceeded to eliminate noise.distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml (1)
149-149: EmptyadditionalPrinterColumns— consider adding useful columns forkubectl get.With no additional printer columns,
kubectl get healtheventresourceswill only showNAMEandAGE. Adding columns likenodeName,checkName,isHealthy, orcomponentClassfrom.specwould significantly improve operational usability.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml` at line 149, The CRD currently sets additionalPrinterColumns: [] so kubectl get healtheventresources only shows NAME and AGE; update the CRD's additionalPrinterColumns array to declare useful columns (e.g., nodeName mapping to .spec.nodeName, checkName to .spec.checkName, isHealthy to .spec.isHealthy, componentClass to .spec.componentClass) with appropriate JSONPaths, types (string/boolean), and description entries so those fields appear in `kubectl get` output; modify the additionalPrinterColumns block in the health_event.crd.yaml (the additionalPrinterColumns entry) to include these column definitions.data-models/Makefile (1)
68-71:install-plugin-remoteis a PHONY target used as a prerequisite of$(CRD_FILE)— it re-runsgo installon every build.Since
install-plugin-remoteis.PHONY, it will always execute when$(CRD_FILE)is built, even if the plugin is already installed at the correct version. Consider using a stamp file or checking the installed version first.♻️ Proposed improvement
-.PHONY: install-plugin-remote -install-plugin-remote: +# Stamp file to avoid re-installing on every build +PROTOC_GEN_CRD_STAMP = .protoc-gen-crd-$(PROTOC_GEN_CRD_VERSION).stamp + +$(PROTOC_GEN_CRD_STAMP): `@echo` "Installing protoc-gen-crd from github..." GO111MODULE=on go install github.com/yandex/protoc-gen-crd/cmd/protoc-gen-crd@$(PROTOC_GEN_CRD_VERSION) + `@touch` $@ + +.PHONY: install-plugin-remote +install-plugin-remote: $(PROTOC_GEN_CRD_STAMP)Then update the
$(CRD_FILE)prerequisite to use$(PROTOC_GEN_CRD_STAMP)instead.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@data-models/Makefile` around lines 68 - 71, The install-plugin-remote PHONY target always re-runs go install for github.com/yandex/protoc-gen-crd@$(PROTOC_GEN_CRD_VERSION); change this to create and update a stamp file (e.g. $(PROTOC_GEN_CRD_STAMP)) only when installation/version changes: implement install-plugin-remote to check the installed protoc-gen-crd version (or run go install and write the current $(PROTOC_GEN_CRD_VERSION) into $(PROTOC_GEN_CRD_STAMP) on success) and then update the $(CRD_FILE) prerequisite to depend on $(PROTOC_GEN_CRD_STAMP) instead of the PHONY target so the plugin is not reinstalled on every build.node-drainer/pkg/reconciler/reconciler.go (1)
929-934:setDrainFailedStatusbypasses the protojson+map serialization pattern used everywhere else
updateNodeUserPodsEvictedStatusmarshals the status viaprotojson→json.Unmarshal→map[string]interface{}before writing to MongoDB, butsetDrainFailedStatuswrites a rawprotos.OperationStatus{}struct literal directly into the update payload. For the currentStatus/Messagestring-only struct the BSON driver produces the same{status: ..., message: ...}output, but this will silently diverge ifOperationStatusgains any proto well-known-type fields (e.g., aTimestamp,Duration, oroneof) that the BSON driver doesn't know how to encode correctly.♻️ Proposed refactor — extract a shared helper
- update := map[string]any{ - "$set": map[string]any{ - "healtheventstatus.userpodsevictionstatus": protos.OperationStatus{ - Status: string(model.StatusFailed), - Message: reason, - }, - }, - } + failedStatus := &protos.OperationStatus{ + Status: string(model.StatusFailed), + Message: reason, + } + jsonBytes, err := protojson.Marshal(failedStatus) + if err != nil { + return fmt.Errorf("failed to marshal drain failed status: %w", err) + } + var evictionStatusMap map[string]interface{} + if err := json.Unmarshal(jsonBytes, &evictionStatusMap); err != nil { + return fmt.Errorf("failed to unmarshal drain failed status JSON: %w", err) + } + update := map[string]any{ + "$set": map[string]any{ + "healtheventstatus.userpodsevictionstatus": evictionStatusMap, + }, + }Alternatively, if the suggestion on the
protojsoncomment above is adopted (using explicit field paths), this becomes trivial:"$set": map[string]any{ "healtheventstatus.userpodsevictionstatus.status": string(model.StatusFailed), "healtheventstatus.userpodsevictionstatus.message": reason, },🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 929 - 934, The setDrainFailedStatus update writes a raw protos.OperationStatus struct into the MongoDB "$set" payload, bypassing the protojson→json.Unmarshal→map pattern used by updateNodeUserPodsEvictedStatus and risking future encoding mismatches for proto well-known types; change setDrainFailedStatus to produce the same map-encoded value as updateNodeUserPodsEvictedStatus by marshaling the protos.OperationStatus via protojson, json.Unmarshal into a map[string]interface{}, and place that map under "healtheventstatus.userpodsevictionstatus" in the "$set" payload (or alternatively set explicit scalar paths like "healtheventstatus.userpodsevictionstatus.status" and ".message" if you prefer field-path updates).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@data-models/Makefile`:
- Line 63: Remove the duplicate PROTO_OPTIONS assignment in the Makefile: locate
the two identical declarations of PROTO_OPTIONS (one at the current start and
the other later in the file) and delete the redundant one so that PROTO_OPTIONS
is declared only once pointing to
vendor/github.com/yandex/protoc-gen-crd/library/go/k8s/protoc_gen_crd/proto/crd.proto;
ensure no other rules rely on the second declaration before removing it.
- Around line 124-125: The make target protos-clean should not depend on
vendor-proto because clean targets must not trigger go mod vendor; remove
vendor-proto from the protos-clean prerequisites and instead ensure vendor-proto
is a prerequisite of protos-generate (it’s safe since compile-crd-server ->
$(CRD_FILE) already transitively requires vendor-proto), so update the Makefile
to drop vendor-proto from the protos-clean dependency list and add it to
protos-generate if not already present.
In `@distros/kubernetes/nvsentinel/values.yaml`:
- Around line 110-111: Add an inline comment documenting the new Helm value
k8sdatastoreCrds.enabled (global.k8sdatastoreCrds.enabled) in values.yaml,
explaining that setting it to true enables CRD installation and providing a
short usage example or note (e.g., "set to true to install CRDs; leave false if
CRDs are managed externally"). Place this comment immediately above or beside
the k8sdatastoreCrds.enabled entry so readers see the enablement requirement and
example when scanning values.yaml.
In `@node-drainer/pkg/evaluator/evaluator.go`:
- Around line 68-69: The code dereferences
healthEvent.HealthEventStatus.UserPodsEvictionStatus without checking for nil
before accessing .Status; update the check inside the evaluator (around the
isTerminalStatus call) to first ensure
healthEvent.HealthEventStatus.UserPodsEvictionStatus != nil and only then call
isTerminalStatus(model.Status(healthEvent.HealthEventStatus.UserPodsEvictionStatus.Status));
retain the existing slog.Info("Event for node is in terminal state, skipping",
"node", nodeName) behavior when the guard indicates a terminal state.
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 573-586: The code marshals userPodsEvictionStatus with
protojson.Marshal then unmarshals into evictionStatusMap and uses that in
updateFields, but protojson output is unstable (camelCase keys, omitempty) which
can cause mismatched MongoDB field names and silent omitted clears; instead,
construct the update map explicitly from the OperationStatus fields (e.g., build
a map[string]any with explicit keys like "status" and "message" under
"healtheventstatus.userpodsevictionstatus") and assign that to updateFields so
updates always set/clear the exact fields predictably (modify the logic around
userPodsEvictionStatus, evictionStatusMap and updateFields to populate explicit
keys rather than relying on protojson).
- Around line 615-632: In observeEvictionDurationIfSucceeded add a nil check for
the userPodsEvictionStatus parameter before dereferencing its Status field:
return early if userPodsEvictionStatus == nil to avoid panics when callers pass
healthEvent.HealthEventStatus.UserPodsEvictionStatus (which is a
*protos.OperationStatus and can be nil); keep the rest of the logic (parsing
healthEvent, checking QuarantineFinishTimestamp, computing evictionDuration and
observing the metric) unchanged and only proceed when the pointer is non-nil and
Status equals model.StatusSucceeded.
---
Outside diff comments:
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 151-157: PreprocessAndEnqueueEvent dereferences
healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus without guarding
for nil, which can panic when the MongoDB document omits userpodsevictionstatus;
update PreprocessAndEnqueueEvent to check that
healthEventWithStatus.HealthEventStatus != nil and
healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus != nil before
calling isTerminalStatus or accessing .Status, and only call slog.Debug with the
status when the pointer is non-nil (otherwise treat as non-terminal or log that
the eviction status is missing) so the function no longer panics on null/missing
UserPodsEvictionStatus.
In `@store-client/pkg/testutils/event_builder.go`:
- Line 15: Add a package-level godoc comment for the testutils package by
inserting a descriptive comment block immediately above the "package testutils"
declaration; the comment should briefly describe the purpose of the testutils
package (what helpers/utilities it provides and intended audience/usage) so it
satisfies the package-level documentation requirement.
---
Duplicate comments:
In `@api/Makefile`:
- Around line 102-116: Add a top-level phony clean target as an alias to the
existing protos-clean to satisfy checkmake conventions: declare .PHONY: clean
and implement clean: protos-clean (so callers can run make clean). Also update
the test target to run tests with the race detector and match other
modules—replace the plain go test invocation in the test target with a gotestsum
(or go test) invocation that includes -race (e.g., gotestsum -- -race ./... or
go test -race ./...) and keep verbose output; reference the existing
targets/names protos-clean and test and tools gotestsum and the -race flag when
making these edits.
In `@data-models/protobufs/health_event.proto`:
- Around line 30-44: The new protobuf messages lack per-field comments; add
clear field-level comments for OperationStatus, HealthEventStatus,
HealthEventWithStatus, and HealthEventResource describing purpose and semantics
of each field (e.g., what "status" and "message" in OperationStatus represent,
meaning of nodeQuarantined, quarantineFinishTimestamp, userPodsEvictionStatus,
drainFinishTimestamp, faultRemediated, lastRemediationTimestamp in
HealthEventStatus, and each field in HealthEventWithStatus and
HealthEventResource). Locate the messages by name in the proto and add concise,
guideline-compliant comments above each field to document expected values, units
(timestamps), and any nullable/optional semantics.
In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`:
- Around line 89-91: The CRD uses an invalid OpenAPI format "RFC3339" for
timestamp fields; update the protoc-gen-crd generator configuration so those
fields emit format: "date-time" instead. Locate the generator template/config
that produces the fields named generatedTimestamp, quarantineFinishTimestamp,
drainFinishTimestamp, and lastRemediationTimestamp and replace the RFC3339
format token with the OpenAPI 3.0 compliant "date-time" token so schema
validation is enforced.
- Around line 29-31: The CRD schema currently marks properties.spec and
properties.status as nullable, allowing creation of resources with spec: null;
update the top-level OpenAPI schema so that spec is required (add required:
["spec"] at the top-level schema) and remove or set nullable: false for
properties.spec (and optionally properties.status) to enforce presence and
validity; adjust the CRD generator configuration if needed so proto3 optionality
does not emit nullable: true for spec/status.
In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py`:
- Around line 27-28: The generated import for crd_pb2 in health_event_pb2.py
uses an invalid Go-style path
(github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto.crd_pb2)
which causes runtime ImportError; update the import in health_event_pb2.py to
the correct Python package/module name used in this repo (or add a guarded
try/except that falls back to the local crd_pb2 module), i.e., replace or wrap
the failing import line that references
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
so that the module imports the actual crd_pb2 Python module available at
runtime.
In `@Makefile`:
- Line 281: The Makefile target protos-generate currently no longer depends on
protos-clean, which allows stale .pb.go files to persist; restore cleaning by
reintroducing protos-clean as a prerequisite of protos-generate or by invoking
the cleaning logic at the start of the protos-generate recipe (e.g., call the
protos-clean target or run the same cleanup commands before generation),
ensuring any renamed or deleted .proto-derived files are removed prior to
running the generate steps so generated outputs remain consistent.
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 375-380: The code dereferences
healthEvent.HealthEventStatus.UserPodsEvictionStatus (assigned to
podsEvictionStatus) without checking for nil, causing possible nil-pointer
panics; before setting podsEvictionStatus.Status (in the blocks around
model.UnQuarantined and the other occurrences), add a nil check and initialize a
new protos.OperationStatus when nil (so podsEvictionStatus =
&protos.OperationStatus{Status: ...} or set Status after ensuring non-nil), then
pass the non-nil podsEvictionStatus into r.updateNodeUserPodsEvictedStatus;
apply this pattern to all occurrences referenced (the podsEvictionStatus
assignments near updateNodeUserPodsEvictedStatus calls and similar blocks).
In `@preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py`:
- Around line 27-28: The generated health_event_pb2.py imports a Go-style module
name (github...crd_pb2) that doesn't exist in Python and causes
ModuleNotFoundError; fix by either vendoring a Python crd_pb2 module under the
expected package path so the import in health_event_pb2.py resolves (and add it
to pyproject.toml/packaging), or re-run protoc to regenerate the protobuf stubs
for Python without the Go CRD import (adjust proto import paths/options so
crd_pb2 is a proper Python package import); update health_event_pb2.py only by
replacing the invalid import with the correct Python package path if you vendor
the module.
In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py`:
- Around line 27-28: The generated health_event_pb2.py contains a broken
Go-style import for crd_pb2; fix the .proto generation so Python outputs a
proper python import for the CRD proto: add or set option python_package =
"desired.python.module.name" (or ensure the CRD proto defines a python_package)
in the CRD proto and regenerate so that health_event_pb2 imports crd_pb2 as a
normal Python module (not the Go path), or adjust the protobuf generation rule
(protoc invocation) to emit crd_pb2 into the same Python package/module
namespace used by health_event_pb2; target the health_event_pb2.py file and the
CRD proto (crd.proto / crd_pb2) when making this change.
In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi`:
- Around line 5-6: The .pyi stub imports a Go-style path
"github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto.crd_pb2"
(bound to _crd_pb2) which breaks type-checkers; change the import in
health_event_pb2.pyi to match the valid Python module path used by the
corresponding .py (or a relative import) so that _crd_pb2 resolves for
mypy/pyright, or remove/replace the annotation with a forward-declared type if
the module isn’t needed for type checking; update the import that defines
_crd_pb2 in the top-level of the stub to the correct Python package name
consistent with the generated .py files.
In `@tilt/simple-health-client/go.mod`:
- Around line 14-18: The go.mod entry for github.com/yandex/protoc-gen-crd
appears to be a tooling dependency (protoc plugin) rather than a runtime import;
confirm there are no imports of "protoc-gen-crd" in Go sources (search for
protoc-gen-crd or yandex/protoc-gen-crd), and if it is only used for code
generation remove it from the module's runtime deps—either delete the indirect
line from go.mod or move it into a dedicated tooling declaration (create a
tools.go with a build tag like // +build tools and a blank import of
github.com/yandex/protoc-gen-crd, or put the plugin into a separate
tools/go.mod) so the client module does not carry the generator as a runtime
dependency; update go.sum accordingly and run go mod tidy to finalize.
---
Nitpick comments:
In `@data-models/Makefile`:
- Around line 68-71: The install-plugin-remote PHONY target always re-runs go
install for github.com/yandex/protoc-gen-crd@$(PROTOC_GEN_CRD_VERSION); change
this to create and update a stamp file (e.g. $(PROTOC_GEN_CRD_STAMP)) only when
installation/version changes: implement install-plugin-remote to check the
installed protoc-gen-crd version (or run go install and write the current
$(PROTOC_GEN_CRD_VERSION) into $(PROTOC_GEN_CRD_STAMP) on success) and then
update the $(CRD_FILE) prerequisite to depend on $(PROTOC_GEN_CRD_STAMP) instead
of the PHONY target so the plugin is not reinstalled on every build.
In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`:
- Line 149: The CRD currently sets additionalPrinterColumns: [] so kubectl get
healtheventresources only shows NAME and AGE; update the CRD's
additionalPrinterColumns array to declare useful columns (e.g., nodeName mapping
to .spec.nodeName, checkName to .spec.checkName, isHealthy to .spec.isHealthy,
componentClass to .spec.componentClass) with appropriate JSONPaths, types
(string/boolean), and description entries so those fields appear in `kubectl
get` output; modify the additionalPrinterColumns block in the
health_event.crd.yaml (the additionalPrinterColumns entry) to include these
column definitions.
In `@health-events-analyzer/pkg/parser/parser_test.go`:
- Line 419: Remove the stale commented-out line containing
"UserPodsEvictionStatus: &protos.OperationStatus{Status: &protos.Status{Value:
string(model.StatusSucceeded)}},” from parser_test.go (it’s dead/commented test
data), keeping the surrounding test code intact and ensuring the test still
compiles and passes; specifically remove that commented entry referencing
UserPodsEvictionStatus, protos.OperationStatus, protos.Status and
model.StatusSucceeded to eliminate noise.
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 929-934: The setDrainFailedStatus update writes a raw
protos.OperationStatus struct into the MongoDB "$set" payload, bypassing the
protojson→json.Unmarshal→map pattern used by updateNodeUserPodsEvictedStatus and
risking future encoding mismatches for proto well-known types; change
setDrainFailedStatus to produce the same map-encoded value as
updateNodeUserPodsEvictedStatus by marshaling the protos.OperationStatus via
protojson, json.Unmarshal into a map[string]interface{}, and place that map
under "healtheventstatus.userpodsevictionstatus" in the "$set" payload (or
alternatively set explicit scalar paths like
"healtheventstatus.userpodsevictionstatus.status" and ".message" if you prefer
field-path updates).
In `@store-client/pkg/testutils/event_builder.go`:
- Line 36: Change the quarantineStatus field and its API from model.Status to
plain string: update the struct field quarantineStatus, the NewTestEventBuilder
default value, and the WithQuarantineStatus(builder) method signature to use
string instead of model.Status, and adjust usages (e.g. the UnmarshalDocument
call where NodeQuarantined is set) to assign e.quarantineStatus directly without
casting; update any other references in this file (including the builder field
and default at NewTestEventBuilder and the WithQuarantineStatus implementation)
to remove the legacy model dependency.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (22)
commons/go.sumis excluded by!**/*.sumdata-models/go.sumis excluded by!**/*.sumdata-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.goevent-exporter/go.sumis excluded by!**/*.sumfault-quarantine/go.sumis excluded by!**/*.sumfault-remediation/go.sumis excluded by!**/*.sumhealth-events-analyzer/go.sumis excluded by!**/*.sumhealth-monitors/csp-health-monitor/go.sumis excluded by!**/*.sumhealth-monitors/kubernetes-object-monitor/go.sumis excluded by!**/*.sumhealth-monitors/syslog-health-monitor/go.sumis excluded by!**/*.sumjanitor-provider/go.sumis excluded by!**/*.sumjanitor/go.sumis excluded by!**/*.sumlabeler/go.sumis excluded by!**/*.summetadata-collector/go.sumis excluded by!**/*.sumnode-drainer/go.sumis excluded by!**/*.sumplatform-connectors/go.sumis excluded by!**/*.sumpreflight-checks/nccl-loopback/go.sumis excluded by!**/*.sumpreflight/go.sumis excluded by!**/*.sumstore-client/go.sumis excluded by!**/*.sumtests/go.sumis excluded by!**/*.sumtests/scale-tests/event-generator/go.sumis excluded by!**/*.sumtilt/simple-health-client/go.sumis excluded by!**/*.sum
📒 Files selected for processing (59)
.gitignore.versions.yamlMakefileapi/Makefilecommons/go.modcommons/pkg/eventutil/parser.gocommons/pkg/eventutil/parser_test.godata-models/Makefiledata-models/go.moddata-models/pkg/model/health_event_crd.godata-models/pkg/model/health_event_extentions.godata-models/protobufs/health_event.protodata-models/tools.godistros/kubernetes/nvsentinel/Chart.yamldistros/kubernetes/nvsentinel/charts/k8s-datastore/Chart.yamldistros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yamldistros/kubernetes/nvsentinel/charts/k8s-datastore/values.yamldistros/kubernetes/nvsentinel/values.yamlevent-exporter/go.modfault-quarantine/go.modfault-quarantine/pkg/eventwatcher/event_watcher.gofault-quarantine/pkg/reconciler/reconciler_e2e_test.gofault-remediation/go.modfault-remediation/pkg/reconciler/reconciler.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler_test.gofault-remediation/pkg/remediation/remediation.gohealth-events-analyzer/go.modhealth-events-analyzer/pkg/parser/parser_test.gohealth-events-analyzer/pkg/reconciler/reconciler_test.gohealth-monitors/csp-health-monitor/go.modhealth-monitors/gpu-health-monitor/Makefilehealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyihealth-monitors/kubernetes-object-monitor/go.modhealth-monitors/syslog-health-monitor/go.modjanitor-provider/go.modjanitor/go.modlabeler/go.modmetadata-collector/go.modnode-drainer/go.modnode-drainer/pkg/evaluator/evaluator.gonode-drainer/pkg/reconciler/reconciler.gonode-drainer/pkg/reconciler/reconciler_integration_test.goplatform-connectors/go.modpreflight-checks/dcgm-diag/Makefilepreflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pypreflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyipreflight-checks/nccl-allreduce/Makefilepreflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pypreflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyipreflight-checks/nccl-loopback/go.modpreflight/go.modstore-client/go.modstore-client/pkg/datastore/providers/postgresql/health_events_serialization_test.gostore-client/pkg/testutils/event_builder.gotests/go.modtests/scale-tests/event-generator/go.modtilt/simple-health-client/go.mod
✅ Files skipped from review due to trivial changes (1)
- distros/kubernetes/nvsentinel/charts/k8s-datastore/values.yaml
🚧 Files skipped from review as they are similar to previous changes (26)
- .gitignore
- .versions.yaml
- platform-connectors/go.mod
- health-monitors/gpu-health-monitor/Makefile
- fault-remediation/pkg/reconciler/reconciler_test.go
- preflight/go.mod
- tests/scale-tests/event-generator/go.mod
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- health-monitors/kubernetes-object-monitor/go.mod
- janitor/go.mod
- labeler/go.mod
- commons/go.mod
- fault-remediation/pkg/reconciler/reconciler.go
- preflight-checks/nccl-loopback/go.mod
- distros/kubernetes/nvsentinel/Chart.yaml
- janitor-provider/go.mod
- health-monitors/syslog-health-monitor/go.mod
- data-models/tools.go
- distros/kubernetes/nvsentinel/charts/k8s-datastore/Chart.yaml
- node-drainer/pkg/reconciler/reconciler_integration_test.go
- store-client/go.mod
- metadata-collector/go.mod
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
- fault-remediation/go.mod
- health-events-analyzer/go.mod
- commons/pkg/eventutil/parser.go
| k8sdatastoreCrds: | ||
| enabled: false |
There was a problem hiding this comment.
Document the new Helm value with an inline comment (incl. enablement note/example).
global.k8sdatastoreCrds.enabled is introduced without inline documentation, so it’s unclear that setting it to true is required to enable CRD installation.
💡 Suggested inline comment
k8sdatastoreCrds:
- enabled: false
+ # Enable k8s datastore CRD chart installation (set to true to install CRDs; e.g., enabled: true)
+ enabled: falseAs per coding guidelines "Document all values in Helm chart values.yaml with inline comments" and "Note truthy value requirements in Helm chart documentation where applicable".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| k8sdatastoreCrds: | |
| enabled: false | |
| k8sdatastoreCrds: | |
| # Enable k8s datastore CRD chart installation (set to true to install CRDs; e.g., enabled: true) | |
| enabled: false |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@distros/kubernetes/nvsentinel/values.yaml` around lines 110 - 111, Add an
inline comment documenting the new Helm value k8sdatastoreCrds.enabled
(global.k8sdatastoreCrds.enabled) in values.yaml, explaining that setting it to
true enables CRD installation and providing a short usage example or note (e.g.,
"set to true to install CRDs; leave false if CRDs are managed externally").
Place this comment immediately above or beside the k8sdatastoreCrds.enabled
entry so readers see the enablement requirement and example when scanning
values.yaml.
7298063 to
712d208
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
fault-quarantine/pkg/reconciler/reconciler_e2e_test.go (1)
5257-5262:⚠️ Potential issue | 🟡 MinorMissed
string()conversion fornodequarantinedinTestMetrics_NodeQuarantineDuration.This is the only remaining inline event construction that still stores
model.StatusInProgressas the rawmodel.Statustype instead ofstring(model.StatusInProgress). Every other occurrence updated in this PR (lines 213, 1457, 1497, 2237, 2318, 3419) applies thestring()cast explicitly.While
UnmarshalDocumentuses a JSON round-trip (making both forms produce identical output in the current test flow), any code path that performs a direct.(string)type assertion onhealtheventstatus.nodequarantinedwould panic with the rawmodel.Statusvalue but succeed with a plainstring.🛠️ Proposed fix
- "nodequarantined": model.StatusInProgress, + "nodequarantined": string(model.StatusInProgress),🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-quarantine/pkg/reconciler/reconciler_e2e_test.go` around lines 5257 - 5262, In TestMetrics_NodeQuarantineDuration update the inline event construction that sets healtheventstatus.nodequarantined (for eventID1) to store a string rather than the raw model.Status: replace the raw model.Status value with string(model.StatusInProgress) so the healtheventstatus map contains a string and not a model.Status, ensuring downstream code that does .(string) assertions on healtheventstatus.nodequarantined (and the TestMetrics_NodeQuarantineDuration test) won't panic.commons/pkg/eventutil/parser.go (1)
67-74:⚠️ Potential issue | 🔴 CriticalNil dereference:
HealthEventStatusis not guarded before accessing.NodeQuarantined.
HealthEventStatusis*protos.HealthEventStatus. If the incoming document lacks thehealtheventstatuskey entirely,json.Unmarshalleaves the pointernil. Line 72 then panics. The existing nil guard at lines 67–69 only coversHealthEvent;HealthEventStatusneeds the same treatment.🛠️ Proposed fix
// Safety check - ensure HealthEvent is not nil if healthEventWithStatus.HealthEvent == nil { return healthEventWithStatus, fmt.Errorf("health event is nil after unmarshaling") } + // Safety check - initialize HealthEventStatus if absent from document + if healthEventWithStatus.HealthEventStatus == nil { + healthEventWithStatus.HealthEventStatus = &protos.HealthEventStatus{} + } + // Set default value for NodeQuarantined if nil (e.g., for new events) if healthEventWithStatus.HealthEventStatus.NodeQuarantined == "" {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commons/pkg/eventutil/parser.go` around lines 67 - 74, The code dereferences healthEventWithStatus.HealthEventStatus without a nil check; before accessing or setting NodeQuarantined, verify if healthEventWithStatus.HealthEventStatus is nil and if so initialize it (e.g., new protos.HealthEventStatus{}), then set NodeQuarantined to string(model.StatusNotStarted) when empty; update the block around HealthEvent/HealthEventStatus checks to guard HealthEventStatus the same way as HealthEvent to avoid panics.node-drainer/pkg/reconciler/reconciler.go (1)
151-155:⚠️ Potential issue | 🟠 MajorGuard terminal-status check against nil eviction status.
If
UserPodsEvictionStatus(or evenHealthEventStatus) is nil, this dereference panics during preprocessing and will abort ingestion. Add a nil guard and only treat the status as terminal when present.🐛 Proposed fix
- if isTerminalStatus(model.Status(healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status)) { + if healthEventWithStatus.HealthEventStatus != nil { + if status := healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus; status != nil && + isTerminalStatus(model.Status(status.Status)) { slog.Debug("Skipping event - already in terminal state", "node", healthEventWithStatus.HealthEvent.NodeName, - "status", healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status) + "status", status.Status) - return nil + return nil + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 151 - 155, The nil-guard is missing before dereferencing HealthEventStatus.UserPodsEvictionStatus.Status which can panic; update the precondition in the reconciler so you first check that healthEventWithStatus.HealthEventStatus != nil and healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus != nil and only then call isTerminalStatus(model.Status(...Status)); leave the existing slog.Debug path unchanged but ensure you only treat the event as terminal when those two fields are present (use the same healthEventWithStatus, HealthEventStatus, UserPodsEvictionStatus and isTerminalStatus identifiers to locate the code).
♻️ Duplicate comments (11)
fault-remediation/pkg/reconciler/reconciler.go (2)
118-121:⚠️ Potential issue | 🔴 CriticalGuard
HealthEventStatusbefore dereferencingNodeQuarantined.Line 118 can panic if
healthEventWithStatus.HealthEventStatusis nil;shouldSkipEventalready treats it as possibly nil, so this is still a risk.🛡️ Proposed fix
- nodeQuarantined := healthEventWithStatus.HealthEventStatus.NodeQuarantined - - if nodeQuarantined == string(model.UnQuarantined) || nodeQuarantined == string(model.Cancelled) { - return r.handleCancellationEvent(ctx, nodeName, model.Status(nodeQuarantined), r.Watcher, event.ResumeToken) - } + if healthEventWithStatus.HealthEventStatus != nil { + nodeQuarantined := healthEventWithStatus.HealthEventStatus.NodeQuarantined + if nodeQuarantined == string(model.UnQuarantined) || nodeQuarantined == string(model.Cancelled) { + return r.handleCancellationEvent(ctx, nodeName, model.Status(nodeQuarantined), r.Watcher, event.ResumeToken) + } + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/pkg/reconciler/reconciler.go` around lines 118 - 121, The code dereferences healthEventWithStatus.HealthEventStatus.NodeQuarantined without guarding that HealthEventStatus may be nil; update the conditional in the reconciler to first check healthEventWithStatus.HealthEventStatus != nil before reading NodeQuarantined, then compare the value against model.UnQuarantined and model.Cancelled and call r.handleCancellationEvent only when the status exists and matches; reference the HealthEventStatus field and NodeQuarantined value, and keep the call to r.handleCancellationEvent(ctx, nodeName, model.Status(nodeQuarantined), r.Watcher, event.ResumeToken) unchanged when invoked.
139-141:⚠️ Potential issue | 🟡 MinorReplace the TODO with an issue reference.
The TODO names a developer instead of a tracking issue, which violates the repo guideline.
✍️ Suggested update
- // TODO - Avinash add a helper function to check for supported actions + // TODO - #<issue-number> add a helper function to check for supported actionsAs per coding guidelines, "TODO comments should reference issues in Go code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/pkg/reconciler/reconciler.go` around lines 139 - 141, Replace the developer-named TODO above the healthEventWithStatus check with a TODO that references the tracking issue (e.g. "TODO(#<issue-number>): ..." or a short issue URL) per the repo guideline; remove the name "Avinash" and ensure the comment succinctly states the needed work (e.g., add helper to check supported actions) so reviewers can find the issue when seeing the TODO near healthEventWithStatus.HealthEventStatus.FaultRemediated.preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi (1)
5-6: Duplicate: dependency verification covered inhealth_event_pb2.py.No extra action here; see the runtime import check in the generated
.pyfile for ensuring the CRD proto module is available.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi` around lines 5 - 6, This .pyi duplicate declaration can be removed: delete the redundant import "from github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto import crd_pb2 as _crd_pb2" from health_event_pb2.pyi and rely on the runtime import/verification already present in health_event_pb2.py (the existing runtime check will ensure the CRD proto module is available).distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml (2)
30-31:specis stillnullable: true— prior unresolved comment.
nullable: trueonspec(andstatusat line 123) allows creating aHealthEventResourcewithspec: null, which is non-idiomatic for Kubernetes CRDs. This was flagged in a prior review and remains unaddressed.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml` around lines 30 - 31, The CRD currently allows spec and status to be null for the HealthEventResource by setting "nullable: true" (see the spec and status schema entries); change those entries so spec and status are non-nullable (remove "nullable: true" or set to "nullable: false") in the HealthEventResource CRD schema to prevent creating resources with spec: null or status: null, ensuring the spec/status schema blocks in the CRD are explicitly required and validated.
89-91:format: RFC3339is still non-standard OpenAPI — prior unresolved comment.
format: RFC3339is not a recognized OpenAPI 3.0 format identifier (the correct one isdate-time). Kubernetes ignores unrecognized format values, so timestamp validation is skipped. This was flagged in a prior review and remains unaddressed. Same issue on lines 129–131, 140–142, and 145–147.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml` around lines 89 - 91, The CRD uses the non‑standard OpenAPI format value "RFC3339" for timestamp properties (e.g. generatedTimestamp); replace every occurrence of format: RFC3339 with the OpenAPI‑recognized format: date-time while keeping type: string so Kubernetes/OpenAPI will validate timestamps correctly (apply this change for generatedTimestamp and the other fields that currently use format: RFC3339).health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
5-6:⚠️ Potential issue | 🟠 Majorcrd_pb2 import likely missing from the Python package.
The stub mirrors the same missing dependency as the generated
.pymodule; keep the stubs consistent with the fix you apply there.#!/bin/bash # Verify whether crd_pb2 exists in the repo/package. fd "crd_pb2.py" health-monitors/gpu-health-monitor rg -n "crd_pb2" health-monitors/gpu-health-monitor -g '*.py'🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi` around lines 5 - 6, The stub import for crd_pb2 in health_event_pb2.pyi is pointing to a non-existent package path; update the stub to match the actual Python package used by the generated .py (or add the missing crd_pb2 module to the package) so both the .py and .pyi are consistent: locate the import in health_event_pb2.pyi (currently referencing github.com.yandex...crd_pb2) and replace it with the correct Python module name (or a relative/package import that matches the generated health_event_pb2.py), and ensure the corresponding crd_pb2.py exists in the package so type stubs and runtime imports resolve.api/Makefile (1)
104-116:⚠️ Potential issue | 🟡 MinorAdd a
cleanphony target to satisfy checkmake.
checkmakestill requirescleanin this module. Consider aliasing toprotos-cleanor a no‑op target.♻️ Proposed fix
.PHONY: lint lint: ## Run golangci-lint on all Go files. `@echo` "Running golangci-lint..." `@golangci-lint` run ./... +.PHONY: clean +clean: protos-clean ## Remove generated code. + .PHONY: test test: ## Run Go unit tests. `@echo` "Running tests..." `@go` test -v ./...🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/Makefile` around lines 104 - 116, Add a PHONY clean target to the Makefile so checkmake passes: create a "clean" target (and add it to .PHONY) that either calls the existing "protos-clean" target or is a no-op placeholder, ensuring it coexists with the current "lint", "test", and "lint-test" targets; update the .PHONY list to include "clean" and make "clean" simply delegate to "protos-clean" or echo a harmless message.node-drainer/pkg/reconciler/reconciler.go (2)
637-655:⚠️ Potential issue | 🟡 MinorAdd a nil guard before reading
userPodsEvictionStatus.The helper dereferences the pointer unconditionally; if any caller passes nil this will panic.
🐛 Proposed fix
func (r *Reconciler) observeEvictionDurationIfSucceeded(event datastore.Event, userPodsEvictionStatus *protos.OperationStatus) { + if userPodsEvictionStatus == nil { + return + } if userPodsEvictionStatus.Status != string(model.StatusSucceeded) { return }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 637 - 655, The function observeEvictionDurationIfSucceeded dereferences userPodsEvictionStatus without a nil check which can panic if a caller passes nil; add a nil guard at the top of observeEvictionDurationIfSucceeded (e.g., if userPodsEvictionStatus == nil { return }) and then perform the existing Status comparison (userPodsEvictionStatus.Status != string(model.StatusSucceeded)) only after that guard so the function safely returns when the pointer is nil.
28-28:⚠️ Potential issue | 🟠 MajorAvoid
protojson→ map for persistence (unstable keys / omitempty).
protojson.Marshaloutput is explicitly unstable and omits zero values, which can silently skip clears on$set. Prefer building an explicit update map.♻️ Proposed fix
- // Marshal the proto message to JSON, then unmarshal to map[string]interface{} - jsonBytes, err := protojson.Marshal(userPodsEvictionStatus) - if err != nil { - return fmt.Errorf("failed to marshal userPodsEvictionStatus: %w", err) - } - - var evictionStatusMap map[string]interface{} - if err := json.Unmarshal(jsonBytes, &evictionStatusMap); err != nil { - return fmt.Errorf("failed to unmarshal eviction status JSON: %w", err) - } - - updateFields := map[string]any{ - "healtheventstatus.userpodsevictionstatus": evictionStatusMap, - } + updateFields := map[string]any{ + "healtheventstatus.userpodsevictionstatus.status": userPodsEvictionStatus.Status, + "healtheventstatus.userpodsevictionstatus.message": userPodsEvictionStatus.Message, + }protojson Marshal output stability and default field name / omitempty behaviorAlso applies to: 587-609, 612-613, 627-627
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` at line 28, You are currently using protojson.Marshal to convert protobuf messages into maps for persistence (the import "google.golang.org/protobuf/encoding/protojson" and its uses around the indicated locations); protojson output is unstable and omits zero-values, so replace each protojson->map conversion with an explicit update map constructed from the protobuf message fields (use the message's getters or direct field access) so zero/empty values are represented when you intend to clear fields in the datastore; locate the code that calls protojson.Marshal and the subsequent map construction around the referenced areas and build a deterministic map (e.g., map[string]interface{}) by enumerating each field you want persisted/cleared, then pass that explicit map to your persistence/update function instead of the protojson-derived map.health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)
27-28:⚠️ Potential issue | 🟠 Majorcrd_pb2 import likely missing from the Python package.
github.com.yandex...crd_pb2is not typically available in the gpu‑health‑monitor environment; this will fail at runtime unless it’s vendored and declared. Regenerate the protobufs without the CRD import or vendor/add the dependency.#!/bin/bash # Verify whether crd_pb2 exists in the repo/package. fd "crd_pb2.py" health-monitors/gpu-health-monitor rg -n "crd_pb2" health-monitors/gpu-health-monitor -g '*.py'🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py` around lines 27 - 28, The generated protobuf file health_event_pb2.py imports an unavailable module named github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2 (the long alias from the line importing crd_pb2); either remove that CRD dependency by regenerating the protos without the CRD import or vendor/add the missing crd_pb2 Python package into the project so the import resolves at runtime. Locate health_event_pb2.py and the import statement for github.com.yandex...crd_pb2, then (a) re-run your protobuf generation with the CRD proto excluded/omitted or (b) add the generated crd_pb2.py into the package (and update packaging/import paths), and verify with the suggested fd/rg checks that crd_pb2.py is present.data-models/protobufs/health_event.proto (1)
31-51:⚠️ Potential issue | 🟡 MinorAdd field-level comments for the new CRD-facing messages.
OperationStatus, HealthEventStatus, HealthEventWithStatus, and HealthEventResource fields are undocumented; CRD users will see these in generated docs. Please add per‑field comments describing semantics and expected values.
📝 Example (apply similarly to all new fields)
message OperationStatus { - string status = 1; - string message = 2; + // status holds the operation status string (e.g., InProgress, Succeeded, Failed). + string status = 1; + // message provides a human‑readable status detail. + string message = 2; } message HealthEventStatus { - string nodeQuarantined = 1; + // nodeQuarantined indicates the quarantine state of the node. + string nodeQuarantined = 1; google.protobuf.Timestamp quarantineFinishTimestamp = 2; - OperationStatus userPodsEvictionStatus = 3; + // userPodsEvictionStatus tracks user‑pod eviction progress. + OperationStatus userPodsEvictionStatus = 3; google.protobuf.Timestamp drainFinishTimestamp = 4; - google.protobuf.BoolValue faultRemediated = 5; + // faultRemediated indicates whether the underlying fault was remediated. + google.protobuf.BoolValue faultRemediated = 5; google.protobuf.Timestamp lastRemediationTimestamp = 6; }As per coding guidelines, "Include comprehensive comments for all fields in Protocol Buffer messages."
Also applies to: 118-129
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@data-models/protobufs/health_event.proto` around lines 31 - 51, Add descriptive per-field comments for the new CRD-facing protobuf messages: OperationStatus, HealthEventStatus, HealthEventWithStatus (and any HealthEventResource fields referenced). For each field (OperationStatus.status, OperationStatus.message, HealthEventStatus.nodeQuarantined, HealthEventStatus.quarantineFinishTimestamp, HealthEventStatus.userPodsEvictionStatus, HealthEventStatus.drainFinishTimestamp, HealthEventStatus.faultRemediated, HealthEventStatus.lastRemediationTimestamp, HealthEventWithStatus.createdAt, HealthEventWithStatus.healthEvent, HealthEventWithStatus.healthEventStatus) add a one-line comment directly above the field that explains its semantics, allowed/typical values or formats (e.g., enum-like strings, timestamps are RFC3339/google.protobuf.Timestamp, boolean meaning), and any unit or scope expectations so generated CRD docs are clear. Ensure comments follow existing proto style (single-line comments) and mirror the example used elsewhere in the repo.
🧹 Nitpick comments (2)
preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi (1)
5-6:_crd_pb2is imported but unused in the type stubs.The
_crd_pb2import (line 6) isn't referenced by any type annotation in this.pyifile — it's only needed at the descriptor level in the companion.pyfile. Since this is auto-generated, no action needed here, but the same runtime availability concern for thegithub.com.yandex.protoc_gen_crd...package applies (flagged in the.pyfile review).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi` around lines 5 - 6, Remove the unused import _crd_pb2 from the health_event_pb2.pyi type stub to avoid a spurious unused-import; update the top of the file to delete the line importing "from github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto import crd_pb2 as _crd_pb2" (or alternatively mark it with a "# type: ignore" if the import must remain for tooling), and ensure any runtime dependency checks or notes already raised for the companion .py (the github.com.yandex.protoc_gen_crd package) are addressed separately so the .pyi stays minimal.commons/pkg/eventutil/parser_test.go (1)
27-157: Missing test case for completely absenthealtheventstatusfield.Every test case provides
"healtheventstatus": {...}(some with an empty map). There is no case where the key is entirely absent from the document. That path leavesHealthEventStatusnil afterjson.Unmarshaland triggers the nil dereference inParseHealthEventFromEventat line 72 ofparser.go. Adding a test for this scenario would catch the bug (and validate the fix once applied).♻️ Proposed additional test case
{ name: "handle completely absent healtheventstatus", event: datastore.Event{ "operationType": "insert", "fullDocument": map[string]interface{}{ "healthevent": map[string]interface{}{ "nodename": "absent-status-node", "checkname": "GpuXidError", "componentclass": "GPU", }, }, }, expectError: false, checkResult: func(t *testing.T, result model.HealthEventWithStatus) { assert.Equal(t, "absent-status-node", result.HealthEvent.NodeName) assert.EqualValues(t, model.StatusNotStarted, result.HealthEventStatus.NodeQuarantined) }, },🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commons/pkg/eventutil/parser_test.go` around lines 27 - 157, Add a unit test that covers a document/event where "healtheventstatus" is entirely absent (e.g., name "handle completely absent healtheventstatus") and assert that ParseHealthEventFromEvent returns a HealthEventWithStatus with HealthEvent.NodeName set and HealthEventStatus.NodeQuarantined defaulted to model.StatusNotStarted; then fix ParseHealthEventFromEvent in parser.go to guard against nil HealthEventStatus after json.Unmarshal by initializing a default model.HealthEventStatus (with NodeQuarantined = model.StatusNotStarted) when the parsed status is nil or its NodeQuarantined is empty, ensuring no nil dereference occurs in the code paths that access result.HealthEventStatus.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`:
- Around line 49-70: The schema's enum constraints on fields recommendedAction
(and similarly processingStrategy) are ineffective because
x-kubernetes-int-or-string: true disables enum validation and format: enum is
invalid; remove the enum and format: enum entries or replace them with a
Kubernetes CEL validation using x-kubernetes-validations that enforces allowed
integer/string values for recommendedAction/processingStrategy, and ensure the
CRD uses either the advisory enum removal or an explicit
x-kubernetes-validations CEL rule to enforce the proto enum semantics.
In `@node-drainer/pkg/evaluator/evaluator.go`:
- Around line 62-66: Guard the pointer before dereferencing: check
healthEvent.HealthEventStatus != nil before reading NodeQuarantined and only
perform the comparison to string(model.UnQuarantined) when the pointer is
non-nil; e.g. wrap the existing statusStr logic in an if
healthEvent.HealthEventStatus != nil { ... } block (referencing
HealthEventStatus and NodeQuarantined) so you don't panic and keep the later
UserPodsEvictionStatus nil guard and existing DrainActionResult/ActionSkip
behavior intact.
In `@preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py`:
- Around line 27-28: The generated Python file health_event_pb2.py erroneously
imports
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
(the crd_pb2 symbol) which is unused and causes ModuleNotFoundError; edit
data-models/protobufs/health_event.proto to remove the import
"github.com/yandex/protoc-gen-crd/library/go/k8s/protoc_gen_crd/proto/crd.proto"
(used only for Go CRD generation), then re-run protoc to regenerate
health_event_pb2.py so the offending import line is gone and the module no
longer references
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2.
In `@preflight-checks/dcgm-diag/Makefile`:
- Line 35: protos-clean should be side‑effect free: remove vendor-proto from the
protos-clean target dependency so it does not fetch or vendor code; ensure
vendor-proto remains only as a dependency of protos-generate (or add
vendor-proto to the protos-generate prerequisites if not already present) and
verify protos-clean only removes generated artifacts without invoking vendoring.
In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py`:
- Line 28: The generated health_event_pb2.py imports a non-existent module
github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto.crd_pb2
which will raise ImportError; regenerate or provide the missing Python CRD proto
package so that crd_pb2 and any intermediate packages (with __init__.py) exist
on the Python path, or adjust the generated import to the correct Python package
name and location; locate the import in health_event_pb2.py (the failing import
line referencing crd_pb2), re-run the protobuf compilation for the CRD proto
target to produce crd_pb2.py (or vendor the compiled crd_pb2 into the
nccl_allreduce package) and ensure the package is installed or added to
PYTHONPATH so the import resolves at runtime.
---
Outside diff comments:
In `@commons/pkg/eventutil/parser.go`:
- Around line 67-74: The code dereferences
healthEventWithStatus.HealthEventStatus without a nil check; before accessing or
setting NodeQuarantined, verify if healthEventWithStatus.HealthEventStatus is
nil and if so initialize it (e.g., new protos.HealthEventStatus{}), then set
NodeQuarantined to string(model.StatusNotStarted) when empty; update the block
around HealthEvent/HealthEventStatus checks to guard HealthEventStatus the same
way as HealthEvent to avoid panics.
In `@fault-quarantine/pkg/reconciler/reconciler_e2e_test.go`:
- Around line 5257-5262: In TestMetrics_NodeQuarantineDuration update the inline
event construction that sets healtheventstatus.nodequarantined (for eventID1) to
store a string rather than the raw model.Status: replace the raw model.Status
value with string(model.StatusInProgress) so the healtheventstatus map contains
a string and not a model.Status, ensuring downstream code that does .(string)
assertions on healtheventstatus.nodequarantined (and the
TestMetrics_NodeQuarantineDuration test) won't panic.
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 151-155: The nil-guard is missing before dereferencing
HealthEventStatus.UserPodsEvictionStatus.Status which can panic; update the
precondition in the reconciler so you first check that
healthEventWithStatus.HealthEventStatus != nil and
healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus != nil and only
then call isTerminalStatus(model.Status(...Status)); leave the existing
slog.Debug path unchanged but ensure you only treat the event as terminal when
those two fields are present (use the same healthEventWithStatus,
HealthEventStatus, UserPodsEvictionStatus and isTerminalStatus identifiers to
locate the code).
---
Duplicate comments:
In `@api/Makefile`:
- Around line 104-116: Add a PHONY clean target to the Makefile so checkmake
passes: create a "clean" target (and add it to .PHONY) that either calls the
existing "protos-clean" target or is a no-op placeholder, ensuring it coexists
with the current "lint", "test", and "lint-test" targets; update the .PHONY list
to include "clean" and make "clean" simply delegate to "protos-clean" or echo a
harmless message.
In `@data-models/protobufs/health_event.proto`:
- Around line 31-51: Add descriptive per-field comments for the new CRD-facing
protobuf messages: OperationStatus, HealthEventStatus, HealthEventWithStatus
(and any HealthEventResource fields referenced). For each field
(OperationStatus.status, OperationStatus.message,
HealthEventStatus.nodeQuarantined, HealthEventStatus.quarantineFinishTimestamp,
HealthEventStatus.userPodsEvictionStatus,
HealthEventStatus.drainFinishTimestamp, HealthEventStatus.faultRemediated,
HealthEventStatus.lastRemediationTimestamp, HealthEventWithStatus.createdAt,
HealthEventWithStatus.healthEvent, HealthEventWithStatus.healthEventStatus) add
a one-line comment directly above the field that explains its semantics,
allowed/typical values or formats (e.g., enum-like strings, timestamps are
RFC3339/google.protobuf.Timestamp, boolean meaning), and any unit or scope
expectations so generated CRD docs are clear. Ensure comments follow existing
proto style (single-line comments) and mirror the example used elsewhere in the
repo.
In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`:
- Around line 30-31: The CRD currently allows spec and status to be null for the
HealthEventResource by setting "nullable: true" (see the spec and status schema
entries); change those entries so spec and status are non-nullable (remove
"nullable: true" or set to "nullable: false") in the HealthEventResource CRD
schema to prevent creating resources with spec: null or status: null, ensuring
the spec/status schema blocks in the CRD are explicitly required and validated.
- Around line 89-91: The CRD uses the non‑standard OpenAPI format value
"RFC3339" for timestamp properties (e.g. generatedTimestamp); replace every
occurrence of format: RFC3339 with the OpenAPI‑recognized format: date-time
while keeping type: string so Kubernetes/OpenAPI will validate timestamps
correctly (apply this change for generatedTimestamp and the other fields that
currently use format: RFC3339).
In `@fault-remediation/pkg/reconciler/reconciler.go`:
- Around line 118-121: The code dereferences
healthEventWithStatus.HealthEventStatus.NodeQuarantined without guarding that
HealthEventStatus may be nil; update the conditional in the reconciler to first
check healthEventWithStatus.HealthEventStatus != nil before reading
NodeQuarantined, then compare the value against model.UnQuarantined and
model.Cancelled and call r.handleCancellationEvent only when the status exists
and matches; reference the HealthEventStatus field and NodeQuarantined value,
and keep the call to r.handleCancellationEvent(ctx, nodeName,
model.Status(nodeQuarantined), r.Watcher, event.ResumeToken) unchanged when
invoked.
- Around line 139-141: Replace the developer-named TODO above the
healthEventWithStatus check with a TODO that references the tracking issue (e.g.
"TODO(#<issue-number>): ..." or a short issue URL) per the repo guideline;
remove the name "Avinash" and ensure the comment succinctly states the needed
work (e.g., add helper to check supported actions) so reviewers can find the
issue when seeing the TODO near
healthEventWithStatus.HealthEventStatus.FaultRemediated.
In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py`:
- Around line 27-28: The generated protobuf file health_event_pb2.py imports an
unavailable module named
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
(the long alias from the line importing crd_pb2); either remove that CRD
dependency by regenerating the protos without the CRD import or vendor/add the
missing crd_pb2 Python package into the project so the import resolves at
runtime. Locate health_event_pb2.py and the import statement for
github.com.yandex...crd_pb2, then (a) re-run your protobuf generation with the
CRD proto excluded/omitted or (b) add the generated crd_pb2.py into the package
(and update packaging/import paths), and verify with the suggested fd/rg checks
that crd_pb2.py is present.
In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi`:
- Around line 5-6: The stub import for crd_pb2 in health_event_pb2.pyi is
pointing to a non-existent package path; update the stub to match the actual
Python package used by the generated .py (or add the missing crd_pb2 module to
the package) so both the .py and .pyi are consistent: locate the import in
health_event_pb2.pyi (currently referencing github.com.yandex...crd_pb2) and
replace it with the correct Python module name (or a relative/package import
that matches the generated health_event_pb2.py), and ensure the corresponding
crd_pb2.py exists in the package so type stubs and runtime imports resolve.
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 637-655: The function observeEvictionDurationIfSucceeded
dereferences userPodsEvictionStatus without a nil check which can panic if a
caller passes nil; add a nil guard at the top of
observeEvictionDurationIfSucceeded (e.g., if userPodsEvictionStatus == nil {
return }) and then perform the existing Status comparison
(userPodsEvictionStatus.Status != string(model.StatusSucceeded)) only after that
guard so the function safely returns when the pointer is nil.
- Line 28: You are currently using protojson.Marshal to convert protobuf
messages into maps for persistence (the import
"google.golang.org/protobuf/encoding/protojson" and its uses around the
indicated locations); protojson output is unstable and omits zero-values, so
replace each protojson->map conversion with an explicit update map constructed
from the protobuf message fields (use the message's getters or direct field
access) so zero/empty values are represented when you intend to clear fields in
the datastore; locate the code that calls protojson.Marshal and the subsequent
map construction around the referenced areas and build a deterministic map
(e.g., map[string]interface{}) by enumerating each field you want
persisted/cleared, then pass that explicit map to your persistence/update
function instead of the protojson-derived map.
In `@preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi`:
- Around line 5-6: This .pyi duplicate declaration can be removed: delete the
redundant import "from
github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto import
crd_pb2 as _crd_pb2" from health_event_pb2.pyi and rely on the runtime
import/verification already present in health_event_pb2.py (the existing runtime
check will ensure the CRD proto module is available).
---
Nitpick comments:
In `@commons/pkg/eventutil/parser_test.go`:
- Around line 27-157: Add a unit test that covers a document/event where
"healtheventstatus" is entirely absent (e.g., name "handle completely absent
healtheventstatus") and assert that ParseHealthEventFromEvent returns a
HealthEventWithStatus with HealthEvent.NodeName set and
HealthEventStatus.NodeQuarantined defaulted to model.StatusNotStarted; then fix
ParseHealthEventFromEvent in parser.go to guard against nil HealthEventStatus
after json.Unmarshal by initializing a default model.HealthEventStatus (with
NodeQuarantined = model.StatusNotStarted) when the parsed status is nil or its
NodeQuarantined is empty, ensuring no nil dereference occurs in the code paths
that access result.HealthEventStatus.
In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi`:
- Around line 5-6: Remove the unused import _crd_pb2 from the
health_event_pb2.pyi type stub to avoid a spurious unused-import; update the top
of the file to delete the line importing "from
github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto import
crd_pb2 as _crd_pb2" (or alternatively mark it with a "# type: ignore" if the
import must remain for tooling), and ensure any runtime dependency checks or
notes already raised for the companion .py (the github.com.yandex.protoc_gen_crd
package) are addressed separately so the .pyi stays minimal.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
data-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.go
📒 Files selected for processing (31)
Makefileapi/Makefilecommons/pkg/eventutil/parser.gocommons/pkg/eventutil/parser_test.godata-models/Makefiledata-models/pkg/model/health_event_extentions.godata-models/protobufs/health_event.protodistros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yamldistros/kubernetes/nvsentinel/charts/k8s-datastore/values.yamldistros/kubernetes/nvsentinel/values.yamlfault-quarantine/pkg/eventwatcher/event_watcher.gofault-quarantine/pkg/reconciler/reconciler_e2e_test.gofault-remediation/go.modfault-remediation/pkg/reconciler/reconciler.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler_test.gofault-remediation/pkg/remediation/remediation.gohealth-events-analyzer/pkg/parser/parser_test.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyinode-drainer/pkg/evaluator/evaluator.gonode-drainer/pkg/reconciler/reconciler.gonode-drainer/pkg/reconciler/reconciler_integration_test.gopreflight-checks/dcgm-diag/Makefilepreflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pypreflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyipreflight-checks/nccl-allreduce/Makefilepreflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pypreflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyistore-client/pkg/datastore/providers/postgresql/health_events_serialization_test.gostore-client/pkg/testutils/event_builder.go
🚧 Files skipped from review as they are similar to previous changes (8)
- fault-remediation/pkg/remediation/remediation.go
- fault-quarantine/pkg/eventwatcher/event_watcher.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- distros/kubernetes/nvsentinel/charts/k8s-datastore/values.yaml
- node-drainer/pkg/reconciler/reconciler_integration_test.go
- distros/kubernetes/nvsentinel/values.yaml
- Makefile
- health-events-analyzer/pkg/parser/parser_test.go
| recommendedAction: | ||
| enum: | ||
| - 0 | ||
| - "NONE" | ||
| - 2 | ||
| - "COMPONENT_RESET" | ||
| - 5 | ||
| - "CONTACT_SUPPORT" | ||
| - 6 | ||
| - "RUN_FIELDDIAG" | ||
| - 15 | ||
| - "RESTART_VM" | ||
| - 24 | ||
| - "RESTART_BM" | ||
| - 25 | ||
| - "REPLACE_VM" | ||
| - 26 | ||
| - "RUN_DCGMEUD" | ||
| - 99 | ||
| - "UNKNOWN" | ||
| format: enum | ||
| x-kubernetes-int-or-string: true |
There was a problem hiding this comment.
enum constraint combined with x-kubernetes-int-or-string: true is silently ignored by Kubernetes.
Kubernetes structural schema validation treats x-kubernetes-int-or-string: true as a special marker and does not apply enum constraints to such fields. The recommendedAction and processingStrategy enums will not be enforced — any arbitrary integer or string can be stored in these fields. Additionally, format: enum (line 69) is not a recognized OpenAPI format and has no validation effect.
If proto enum validation is important, consider a CEL validation rule (x-kubernetes-validations) as an alternative, or document that these fields are advisory-only.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`
around lines 49 - 70, The schema's enum constraints on fields recommendedAction
(and similarly processingStrategy) are ineffective because
x-kubernetes-int-or-string: true disables enum validation and format: enum is
invalid; remove the enum and format: enum entries or replace them with a
Kubernetes CEL validation using x-kubernetes-validations that enforces allowed
integer/string values for recommendedAction/processingStrategy, and ensure the
CRD uses either the advisory enum removal or an explicit
x-kubernetes-validations CEL rule to enforce the proto enum semantics.
| from google.protobuf import wrappers_pb2 as google_dot_protobuf_dot_wrappers__pb2 | ||
| from github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto import crd_pb2 as github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2 |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate any generated CRD Python modules
fd -a "crd_pb2.py"
# Check for a matching package hierarchy (e.g., github/com/...)
fd -a "__init__.py" -g "github/**" || true
# Verify dependency declaration in Poetry config
fd -a "pyproject.toml" -x rg -n "crd|yandex|protoc_gen_crd" {} || trueRepository: NVIDIA/NVSentinel
Length of output: 157
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate any generated CRD Python modules
echo "=== Searching for crd_pb2.py files ==="
fd -a "crd_pb2.py" || echo "No crd_pb2.py found"
# Check for github package directory at repo root or subdirs
echo "=== Checking for github directory structure ==="
fd -a -t d "github" || echo "No github directory found"
# Look for pyproject.toml
echo "=== Locating pyproject.toml ==="
fd -a "pyproject.toml"
# Check repo root structure
echo "=== Repository root structure ==="
ls -la | head -20
# Search for crd references in pyproject.toml if it exists
echo "=== Searching for crd/yandex/protoc_gen_crd in pyproject.toml ==="
fd -a "pyproject.toml" -x cat {} | rg -i "crd|yandex|protoc_gen_crd" || echo "No matches found"Repository: NVIDIA/NVSentinel
Length of output: 1727
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check the dcgm-diag pyproject.toml
echo "=== dcgm-diag pyproject.toml ==="
cat /home/jailuser/git/preflight-checks/dcgm-diag/pyproject.toml
echo -e "\n=== Checking if health_event_pb2.py exists ==="
ls -la /home/jailuser/git/preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py 2>/dev/null || echo "File not found"
echo -e "\n=== First 50 lines of health_event_pb2.py ==="
head -50 /home/jailuser/git/preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py 2>/dev/null || echo "File not found"
# Search for any references to github.com.yandex or crd in dcgm-diag
echo -e "\n=== Searching for crd/yandex references in dcgm-diag ==="
rg -i "github.com.yandex|crd|yandex" /home/jailuser/git/preflight-checks/dcgm-diag/ || echo "No matches"Repository: NVIDIA/NVSentinel
Length of output: 11249
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Locating health_event.proto ==="
fd -a "health_event.proto"
echo -e "\n=== Content of health_event.proto ==="
cat /home/jailuser/git/preflight-checks/dcgm-diag/dcgm_diag/protos/health_event.proto 2>/dev/null || echo "Not found"
echo -e "\n=== Checking if crd.proto exists anywhere ==="
fd -a "crd.proto"
echo -e "\n=== Looking for go.mod or other dependency files ==="
fd -a "go.mod|go.sum" /home/jailuser/git/preflight-checks/dcgm-diag/ || echo "None found"Repository: NVIDIA/NVSentinel
Length of output: 306
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Content of health_event.proto ==="
cat /home/jailuser/git/data-models/protobufs/health_event.proto
echo -e "\n=== Data-models directory structure ==="
ls -la /home/jailuser/git/data-models/protobufs/ | head -20
echo -e "\n=== Checking for crd dependency in data-models ==="
rg -i "crd|yandex" /home/jailuser/git/data-models/ --max-count=10 || echo "None found"
echo -e "\n=== Looking for go.mod in data-models ==="
find /home/jailuser/git/data-models -name "go.mod" -o -name "go.sum" -o -name "Makefile" | head -10Repository: NVIDIA/NVSentinel
Length of output: 7766
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Checking if crd_pb2 symbol is used in health_event_pb2.py ==="
grep -n "crd_pb2\|_crd_pb2" /home/jailuser/git/preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py | head -20
echo -e "\n=== Checking if health_event_pb2 is imported by Python code ==="
rg -r "import.*health_event_pb2|from.*health_event_pb2" /home/jailuser/git/preflight-checks/dcgm-diag/ || echo "No imports found"
echo -e "\n=== Full grep for crd in generated _pb2.py ==="
grep -c "github_dot_com_dot_yandex" /home/jailuser/git/preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py || echo "0 matches"
echo -e "\n=== Check if the import statement will fail by trying to parse it ==="
python3 -c "import sys; sys.path.insert(0, '/home/jailuser/git/preflight-checks/dcgm-diag'); from dcgm_diag.protos import health_event_pb2" 2>&1 || echo "Import failed as expected"Repository: NVIDIA/NVSentinel
Length of output: 1016
Remove the unused crd_pb2 import from the proto file and regenerate Python code.
The generated import on line 28 (from github.com.yandex...crd_pb2) causes a ModuleNotFoundError because the crd module is a Go dependency, not available in Python. The symbol is imported but never used in the generated Python code. Remove import "github.com/yandex/protoc-gen-crd/library/go/k8s/protoc_gen_crd/proto/crd.proto"; from data-models/protobufs/health_event.proto (the proto imports it for Go CRD generation, not Python), then regenerate the _pb2.py files to resolve this runtime blocker.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py` around lines
27 - 28, The generated Python file health_event_pb2.py erroneously imports
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
(the crd_pb2 symbol) which is unused and causes ModuleNotFoundError; edit
data-models/protobufs/health_event.proto to remove the import
"github.com/yandex/protoc-gen-crd/library/go/k8s/protoc_gen_crd/proto/crd.proto"
(used only for Go CRD generation), then re-run protoc to regenerate
health_event_pb2.py so the offending import line is gone and the module no
longer references
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2.
preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py
Outdated
Show resolved
Hide resolved
712d208 to
aa8342c
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
store-client/pkg/testutils/event_builder.go (2)
15-16:⚠️ Potential issue | 🟡 MinorAdd a package-level doc comment.
The
testutilspackage lacks a// Package testutils ...doc comment. As per coding guidelines, all Go packages require package-level godoc.📝 Proposed fix
+// Package testutils provides database-agnostic test helpers for constructing +// and populating HealthEvent fixtures without coupling tests to a specific datastore backend. package testutilsAs per coding guidelines: "Package-level godoc required for all Go packages."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@store-client/pkg/testutils/event_builder.go` around lines 15 - 16, Add a package-level godoc comment immediately above the "package testutils" declaration: start the comment with "// Package testutils ..." and provide a brief description of the package’s purpose and what exported helpers (e.g., event builders, test utilities) it provides, following Go godoc conventions (short summary sentence, optional additional details). Ensure the comment is a plain // comment block describing the package rather than a block comment, and keep it concise and relevant to identifiers in this package.
43-43:⚠️ Potential issue | 🟡 MinorUse the
model.UnQuarantinedconstant instead of the hardcoded string literal.The
modelpackage defines Status constants (e.g.,model.UnQuarantined,model.Quarantined), but the code usesmodel.Status("NotQuarantined"), which doesn't match any defined constant. Usemodel.UnQuarantinedto align with the established quarantine status values.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@store-client/pkg/testutils/event_builder.go` at line 43, Replace the hardcoded status literal used when constructing the event (the field named quarantineStatus in the event builder) with the predefined constant from the model package: use model.UnQuarantined instead of model.Status("NotQuarantined"), ensuring the builder sets the established Status constant rather than an unmatched string.node-drainer/pkg/reconciler/reconciler.go (1)
145-167:⚠️ Potential issue | 🔴 CriticalGuard against nil UserPodsEvictionStatus before terminal-status check.
HealthEventStatusorUserPodsEvictionStatuscan be nil; dereferencing will panic. Use the generated getter and guard before accessingStatus.Proposed fix
- if isTerminalStatus(model.Status(healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status)) { + podsEvictionStatus := healthEventWithStatus.HealthEventStatus.GetUserPodsEvictionStatus() + if podsEvictionStatus != nil && + isTerminalStatus(model.Status(podsEvictionStatus.Status)) { slog.Debug("Skipping event - already in terminal state", "node", healthEventWithStatus.HealthEvent.NodeName, - "status", healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status) + "status", podsEvictionStatus.Status) return nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 145 - 167, The terminal-status check currently dereferences HealthEventStatus.UserPodsEvictionStatus and can panic if either is nil; before calling isTerminalStatus, use the generated getters (e.g. healthEventWithStatus.GetHealthEventStatus(), GetUserPodsEvictionStatus(), GetStatus()) and guard for nil so you only call isTerminalStatus(model.Status(...)) when UserPodsEvictionStatus and its Status are non-nil; similarly replace direct field access in the subsequent slog.Debug and the handleEventCancellation call with the safe getters or nil-checked values to avoid panics.
♻️ Duplicate comments (8)
api/Makefile (1)
102-116:⚠️ Potential issue | 🟡 MinorAdd a
cleanphony alias to satisfy checkmake.
checkmake still expects acleantarget; add a minimal alias toprotos-clean.Proposed patch
.PHONY: lint-test lint-test: lint test ## Run lint and tests. `@echo` "Lint and tests completed." + +.PHONY: clean +clean: protos-clean ## Remove generated code.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/Makefile` around lines 102 - 116, The Makefile is missing a clean phony target required by checkmake; add a `.PHONY: clean` entry and a minimal `clean` target that delegates to the existing `protos-clean` target (so add `clean: protos-clean`) near the other phony targets (next to `lint`, `test`, `lint-test`) to satisfy the check without changing behavior.health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
5-7:⚠️ Potential issue | 🟡 Minor
crd_pb2stub import is unresolved.
Type-checkers/stub loading will fail unless this module is vendored/declared. Regenerate stubs without the CRD import or add the dependency.#!/bin/bash # Verify whether crd_pb2 exists and is declared as a dependency. fd -a "crd_pb2.py" fd -a "pyproject.toml" -x rg -n "protoc_gen_crd|crd_pb2|yandex" {} || true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi` around lines 5 - 7, The stub import for crd_pb2 in health_event_pb2.pyi is unresolved; either remove/regenerate the stubs so they don't import github.com.yandex...crd_pb2 or vendor/declare that module as a Python dependency. Locate the import line referencing "crd_pb2 as _crd_pb2" in health_event_pb2.pyi and fix by regenerating the .pyi without the CRD import (so protoc doesn't emit the github import) or add a vendored stub/package for the module (crd_pb2) to the repo and ensure it's listed in packaging/pyproject so type-checkers can resolve it.node-drainer/pkg/reconciler/reconciler.go (1)
28-28:⚠️ Potential issue | 🟠 MajorAvoid protojson for persistence updates; build the map explicitly.
protojson.Marshaloutput is documented as unstable and omits zero-values, which can leave stale fields (e.g.,message) in MongoDB. Build the map directly to ensure deterministic keys and field clearing.Proposed fix
- // Marshal the proto message to JSON, then unmarshal to map[string]interface{} - jsonBytes, err := protojson.Marshal(userPodsEvictionStatus) - if err != nil { - return fmt.Errorf("failed to marshal userPodsEvictionStatus: %w", err) - } - - var evictionStatusMap map[string]interface{} - if err := json.Unmarshal(jsonBytes, &evictionStatusMap); err != nil { - return fmt.Errorf("failed to unmarshal eviction status JSON: %w", err) - } + evictionStatusMap := map[string]any{ + "status": userPodsEvictionStatus.Status, + "message": userPodsEvictionStatus.Message, + }protojson Marshal output stability omitempty behaviorAlso applies to: 588-615
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` at line 28, The code is using protojson.Marshal to generate update payloads for persistence which is unstable and omits zero-values (risking stale fields like message); replace the protojson.Marshal-based update with an explicit map construction that sets each persisted key deterministically (including empty/zero values to clear fields) before calling the MongoDB update path used in reconciler.go (search for protojson.Marshal usage and the MongoDB update call in the reconciler's update/save functions); ensure fields like "message" are always present in the map (set to "" or the zero value when absent) and remove the protojson import once replaced.preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py (1)
27-31:⚠️ Potential issue | 🔴 CriticalRuntime import failure risk from
crd_pb2.
This CRD Python module isn’t typically available; without vendoring/declaring it, imports will fail. Regenerate without the CRD import or vendor/declare the dependency.#!/bin/bash # Verify whether crd_pb2 exists and is declared as a dependency. fd -a "crd_pb2.py" fd -a "pyproject.toml" -x rg -n "protoc_gen_crd|crd_pb2|yandex" {} || true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py` around lines 27 - 31, The generated health_event_pb2.py includes an unconditional import of the CRD module (alias github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2) which may not be present at runtime; either re-generate the protobuf sources without that CRD dependency or remove/guard that import and any references to it in the DESCRIPTOR so the module does not require crd_pb2 at import time (locate the import statement for crd_pb2 and the DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(...) usage in health_event_pb2.py and eliminate or conditionalize the CRD-related pieces, or vendor/declare the crd_pb2 dependency if regeneration isn't possible).health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)
27-31:⚠️ Potential issue | 🔴 CriticalRuntime import failure:
crd_pb2is not available in the Python env.
This generated import will raiseModuleNotFoundErrorunless the CRD Python module is vendored and declared. Regenerate without the CRD import or vendor/declare the dependency in Poetry.#!/bin/bash # Verify whether crd_pb2 exists and is declared as a dependency. fd -a "crd_pb2.py" fd -a "pyproject.toml" -x rg -n "protoc_gen_crd|crd_pb2|yandex" {} || true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py` around lines 27 - 31, The generated health_event_pb2.py imports a non-existent module github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2 which causes ModuleNotFoundError; either regenerate the protobufs without referencing crd.proto so the import is removed (re-run proto generation for HealthEvent without the CRD include), or vendor the CRD Python module and declare it in pyproject.toml so that github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2 becomes importable; look for the import line and the DESCRIPTOR AddSerializedFile block in health_event_pb2.py to validate the change.data-models/protobufs/health_event.proto (1)
30-50:⚠️ Potential issue | 🟡 MinorAdd per-field comments for new CRD-facing fields.
The new message fields lack field-level documentation (OperationStatus, HealthEventStatus, HealthEventWithStatus, HealthEventResource). These are surfaced in generated CRD docs, so each field should describe meaning, expected values, and semantics.
As per coding guidelines, "Include comprehensive comments for all fields in Protocol Buffer messages."📝 Example pattern
message OperationStatus { + // status is the operation state (e.g., NotStarted, InProgress, Failed, Succeeded, AlreadyDrained). string status = 1; + // message provides a human-readable explanation for the current status. string message = 2; }Also applies to: 120-129
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@data-models/protobufs/health_event.proto` around lines 30 - 50, The new protobuf message fields (OperationStatus, HealthEventStatus, HealthEventWithStatus and HealthEventResource) lack per-field documentation used by generated CRD docs; update each field declaration to include a brief comment describing the field's purpose, expected values/range/semantics, and any important invariants or lifecycle notes (e.g., whether a timestamp is set by server, what values "status" can take, meaning of boolean wrappers like faultRemediated, and ownership/validation rules for nested messages). Locate the messages OperationStatus, HealthEventStatus, HealthEventWithStatus and the HealthEventResource message and add concise comments for every field (e.g., status, message, nodeQuarantined, quarantineFinishTimestamp, userPodsEvictionStatus, drainFinishTimestamp, faultRemediated, lastRemediationTimestamp, createdAt, healthEvent, healthEventStatus) following the existing comment style so CRD docs clearly describe meaning and expected semantics.node-drainer/pkg/evaluator/evaluator.go (1)
62-62:⚠️ Potential issue | 🔴 CriticalGuard
healthEvent.HealthEventStatusbefore dereference.Line 62 dereferences
healthEvent.HealthEventStatusdirectly; if the status is missing this will panic before your laterUserPodsEvictionStatusguard. Add a nil check and short-circuit with a safe action.🛠️ Proposed fix
- statusStr := healthEvent.HealthEventStatus.NodeQuarantined + if healthEvent.HealthEventStatus == nil { + slog.Error("HealthEventStatus is nil, cannot evaluate event", + "node", nodeName) + return &DrainActionResult{ + Action: ActionWait, + WaitDelay: time.Minute, + }, nil + } + statusStr := healthEvent.HealthEventStatus.NodeQuarantined🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/evaluator/evaluator.go` at line 62, The code dereferences healthEvent.HealthEventStatus when assigning statusStr (line with statusStr := healthEvent.HealthEventStatus.NodeQuarantined); add a nil guard on healthEvent.HealthEventStatus before accessing NodeQuarantined and short-circuit to a safe default action (e.g., treat as non-quarantined or return early) so the later UserPodsEvictionStatus logic doesn't panic; update the block around the statusStr assignment in evaluator.go to check healthEvent.HealthEventStatus != nil, handle the nil case deterministically, then continue to use UserPodsEvictionStatus as planned.preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py (1)
27-31:⚠️ Potential issue | 🔴 CriticalEnsure
crd_pb2is generated and packaged for Python.Line 28 imports
github.com.yandex.protoc_gen_crd...crd_pb2. If that module isn’t generated into the Python package (including intermediate__init__.pyfiles), importinghealth_event_pb2will raiseImportErrorat runtime. Please generate the CRD proto Python module (or vendor it) and ensure it’s on the module path.🔎 Verification script
#!/bin/bash # Check whether crd_pb2.py exists in the repo and where it lives fd -a 'crd_pb2.py' --type f # Check for expected package path in nccl_allreduce or data-models vendor outputs fd -a '__init__.py' --type f | rg -n 'github\.com/yandex/protoc_gen_crd'🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py` around lines 27 - 31, The generated health_event_pb2 import references github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2 (crd_pb2) which must exist on the Python path; generate (or vendor) the Python CRD proto module and ensure package __init__.py files create the github.com.yandex.protoc_gen_crd... package namespace so the import in health_event_pb2 (the google_dot_protobuf_wrappers import and the github...crd_pb2 import) resolves at runtime; regenerate the CRD proto with protoc (or copy the generated crd_pb2.py and accompanying __init__.py files) into the repository and add the package directory to packaging/setup so it’s installed with the module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 145-167: The terminal-status check currently dereferences
HealthEventStatus.UserPodsEvictionStatus and can panic if either is nil; before
calling isTerminalStatus, use the generated getters (e.g.
healthEventWithStatus.GetHealthEventStatus(), GetUserPodsEvictionStatus(),
GetStatus()) and guard for nil so you only call
isTerminalStatus(model.Status(...)) when UserPodsEvictionStatus and its Status
are non-nil; similarly replace direct field access in the subsequent slog.Debug
and the handleEventCancellation call with the safe getters or nil-checked values
to avoid panics.
In `@store-client/pkg/testutils/event_builder.go`:
- Around line 15-16: Add a package-level godoc comment immediately above the
"package testutils" declaration: start the comment with "// Package testutils
..." and provide a brief description of the package’s purpose and what exported
helpers (e.g., event builders, test utilities) it provides, following Go godoc
conventions (short summary sentence, optional additional details). Ensure the
comment is a plain // comment block describing the package rather than a block
comment, and keep it concise and relevant to identifiers in this package.
- Line 43: Replace the hardcoded status literal used when constructing the event
(the field named quarantineStatus in the event builder) with the predefined
constant from the model package: use model.UnQuarantined instead of
model.Status("NotQuarantined"), ensuring the builder sets the established Status
constant rather than an unmatched string.
---
Duplicate comments:
In `@api/Makefile`:
- Around line 102-116: The Makefile is missing a clean phony target required by
checkmake; add a `.PHONY: clean` entry and a minimal `clean` target that
delegates to the existing `protos-clean` target (so add `clean: protos-clean`)
near the other phony targets (next to `lint`, `test`, `lint-test`) to satisfy
the check without changing behavior.
In `@data-models/protobufs/health_event.proto`:
- Around line 30-50: The new protobuf message fields (OperationStatus,
HealthEventStatus, HealthEventWithStatus and HealthEventResource) lack per-field
documentation used by generated CRD docs; update each field declaration to
include a brief comment describing the field's purpose, expected
values/range/semantics, and any important invariants or lifecycle notes (e.g.,
whether a timestamp is set by server, what values "status" can take, meaning of
boolean wrappers like faultRemediated, and ownership/validation rules for nested
messages). Locate the messages OperationStatus, HealthEventStatus,
HealthEventWithStatus and the HealthEventResource message and add concise
comments for every field (e.g., status, message, nodeQuarantined,
quarantineFinishTimestamp, userPodsEvictionStatus, drainFinishTimestamp,
faultRemediated, lastRemediationTimestamp, createdAt, healthEvent,
healthEventStatus) following the existing comment style so CRD docs clearly
describe meaning and expected semantics.
In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py`:
- Around line 27-31: The generated health_event_pb2.py imports a non-existent
module
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
which causes ModuleNotFoundError; either regenerate the protobufs without
referencing crd.proto so the import is removed (re-run proto generation for
HealthEvent without the CRD include), or vendor the CRD Python module and
declare it in pyproject.toml so that
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
becomes importable; look for the import line and the DESCRIPTOR
AddSerializedFile block in health_event_pb2.py to validate the change.
In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi`:
- Around line 5-7: The stub import for crd_pb2 in health_event_pb2.pyi is
unresolved; either remove/regenerate the stubs so they don't import
github.com.yandex...crd_pb2 or vendor/declare that module as a Python
dependency. Locate the import line referencing "crd_pb2 as _crd_pb2" in
health_event_pb2.pyi and fix by regenerating the .pyi without the CRD import (so
protoc doesn't emit the github import) or add a vendored stub/package for the
module (crd_pb2) to the repo and ensure it's listed in packaging/pyproject so
type-checkers can resolve it.
In `@node-drainer/pkg/evaluator/evaluator.go`:
- Line 62: The code dereferences healthEvent.HealthEventStatus when assigning
statusStr (line with statusStr :=
healthEvent.HealthEventStatus.NodeQuarantined); add a nil guard on
healthEvent.HealthEventStatus before accessing NodeQuarantined and short-circuit
to a safe default action (e.g., treat as non-quarantined or return early) so the
later UserPodsEvictionStatus logic doesn't panic; update the block around the
statusStr assignment in evaluator.go to check healthEvent.HealthEventStatus !=
nil, handle the nil case deterministically, then continue to use
UserPodsEvictionStatus as planned.
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Line 28: The code is using protojson.Marshal to generate update payloads for
persistence which is unstable and omits zero-values (risking stale fields like
message); replace the protojson.Marshal-based update with an explicit map
construction that sets each persisted key deterministically (including
empty/zero values to clear fields) before calling the MongoDB update path used
in reconciler.go (search for protojson.Marshal usage and the MongoDB update call
in the reconciler's update/save functions); ensure fields like "message" are
always present in the map (set to "" or the zero value when absent) and remove
the protojson import once replaced.
In `@preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py`:
- Around line 27-31: The generated health_event_pb2.py includes an unconditional
import of the CRD module (alias
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2)
which may not be present at runtime; either re-generate the protobuf sources
without that CRD dependency or remove/guard that import and any references to it
in the DESCRIPTOR so the module does not require crd_pb2 at import time (locate
the import statement for crd_pb2 and the DESCRIPTOR =
_descriptor_pool.Default().AddSerializedFile(...) usage in health_event_pb2.py
and eliminate or conditionalize the CRD-related pieces, or vendor/declare the
crd_pb2 dependency if regeneration isn't possible).
In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py`:
- Around line 27-31: The generated health_event_pb2 import references
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
(crd_pb2) which must exist on the Python path; generate (or vendor) the Python
CRD proto module and ensure package __init__.py files create the
github.com.yandex.protoc_gen_crd... package namespace so the import in
health_event_pb2 (the google_dot_protobuf_wrappers import and the
github...crd_pb2 import) resolves at runtime; regenerate the CRD proto with
protoc (or copy the generated crd_pb2.py and accompanying __init__.py files)
into the repository and add the package directory to packaging/setup so it’s
installed with the module.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
data-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.go
📒 Files selected for processing (31)
Makefileapi/Makefilecommons/pkg/eventutil/parser.gocommons/pkg/eventutil/parser_test.godata-models/Makefiledata-models/pkg/model/health_event_extentions.godata-models/protobufs/health_event.protodistros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yamldistros/kubernetes/nvsentinel/charts/k8s-datastore/values.yamldistros/kubernetes/nvsentinel/values.yamlfault-quarantine/pkg/eventwatcher/event_watcher.gofault-quarantine/pkg/reconciler/reconciler_e2e_test.gofault-remediation/go.modfault-remediation/pkg/reconciler/reconciler.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler_test.gofault-remediation/pkg/remediation/remediation.gohealth-events-analyzer/pkg/parser/parser_test.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyinode-drainer/pkg/evaluator/evaluator.gonode-drainer/pkg/reconciler/reconciler.gonode-drainer/pkg/reconciler/reconciler_integration_test.gopreflight-checks/dcgm-diag/Makefilepreflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pypreflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyipreflight-checks/nccl-allreduce/Makefilepreflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pypreflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyistore-client/pkg/datastore/providers/postgresql/health_events_serialization_test.gostore-client/pkg/testutils/event_builder.go
🚧 Files skipped from review as they are similar to previous changes (14)
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-quarantine/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/reconciler/reconciler.go
- preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi
- fault-remediation/pkg/reconciler/reconciler_test.go
- node-drainer/pkg/reconciler/reconciler_integration_test.go
- distros/kubernetes/nvsentinel/values.yaml
- preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi
- commons/pkg/eventutil/parser.go
- distros/kubernetes/nvsentinel/charts/k8s-datastore/values.yaml
- preflight-checks/dcgm-diag/Makefile
- distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml
- fault-remediation/go.mod
- commons/pkg/eventutil/parser_test.go
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
node-drainer/pkg/reconciler/reconciler.go (2)
150-156:⚠️ Potential issue | 🔴 CriticalNil dereference panic if
UserPodsEvictionStatusis absent from the document
UserPodsEvictionStatusis a*protos.OperationStatusfield. Documents that have never been processed bysetInitialStatusAndEnqueuewill have this field absent, causingunmarshalGenericEventto leave the pointer nil. Accessing.Statusat line 150 without a preceding nil check will panic.Every other call site in this file guards the pointer first (lines 516, 534, 797). This newly added path does not.
🐛 Proposed fix
- if isTerminalStatus(model.Status(healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status)) { + evictionStatus := healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus + if evictionStatus != nil && isTerminalStatus(model.Status(evictionStatus.Status)) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 150 - 156, The code dereferences healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status without checking for nil; update the early-return block (the one calling isTerminalStatus(...)) to first guard that healthEventWithStatus.HealthEventStatus != nil and healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus != nil before accessing .Status (mirror the nil checks used at other call sites in this file), and only call isTerminalStatus when the pointer is non-nil; if the pointer is nil, skip the terminal-state check and continue (i.e., do not dereference).
953-960:⚠️ Potential issue | 🟠 Major
setDrainFailedStatusbypasses the explicit-map pattern, risking BSON field name drift
updateNodeUserPodsEvictedStatuswas specifically refactored to use an explicitmap[string]interface{}{"status": ..., "message": ...}to guarantee predictable, lowercase field names in MongoDB.setDrainFailedStatusembeds aprotos.OperationStatus{}struct value directly into the$setpayload.The BSON encoder uses lowercased field names when no
bsontag is present, so it matches today — but any future exported field added toprotos.OperationStatuswill be serialized silently, and the inconsistency makes the intent opaque.♻️ Proposed fix — use the same explicit map approach
update := map[string]any{ "$set": map[string]any{ - "healtheventstatus.userpodsevictionstatus": protos.OperationStatus{ - Status: string(model.StatusFailed), - Message: reason, - }, + "healtheventstatus.userpodsevictionstatus": map[string]interface{}{ + "status": string(model.StatusFailed), + "message": reason, + }, }, }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 953 - 960, The update payload in setDrainFailedStatus currently embeds a protos.OperationStatus struct directly, risking implicit BSON field names; change it to mirror updateNodeUserPodsEvictedStatus by constructing an explicit map[string]any for the "$set" value (e.g., "$set": map[string]any{"healtheventstatus.userpodsevictionstatus": map[string]any{"status": string(model.StatusFailed), "message": reason}}) so only the intended "status" and "message" lowercase fields are written; locate setDrainFailedStatus and replace the protos.OperationStatus literal with the explicit map form referencing protos.OperationStatus fields for values.
♻️ Duplicate comments (1)
node-drainer/pkg/reconciler/reconciler.go (1)
631-649: Nil guard foruserPodsEvictionStatusstill absentThe nil check requested in the previous review was reportedly addressed, but the current function body at line 634 still accesses
userPodsEvictionStatus.Statusunconditionally. While all current callers pass a non-nil pointer (enforced by guards in the calling functions), the defensive guard is absent and the function signature accepts a pointer.🛡️ Proposed guard
func (r *Reconciler) observeEvictionDurationIfSucceeded(event datastore.Event, userPodsEvictionStatus *protos.OperationStatus) { + if userPodsEvictionStatus == nil { + return + } if userPodsEvictionStatus.Status != string(model.StatusSucceeded) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 631 - 649, Add a defensive nil guard at the start of observeEvictionDurationIfSucceeded to return immediately if the pointer userPodsEvictionStatus is nil before accessing userPodsEvictionStatus.Status; locate the function observeEvictionDurationIfSucceeded in reconciler.go and insert a check for userPodsEvictionStatus == nil (return) so subsequent code can safely read userPodsEvictionStatus.Status and avoid panics.
🧹 Nitpick comments (2)
api/Makefile (1)
53-53:make allnow implies lint + tests — consider whether that's intentional for local use.Running
lint-testas part ofallis great for CI but means a plainmakeinvocation also blocks on linting and testing, which can slow local iteration. If the intent is CI-first, this is fine as-is. A common pattern is to keepall: protos-generate buildfor local builds and add a separateci: all lint-testtarget for pipeline use.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/Makefile` at line 53, The current Makefile target "all" includes "lint-test" which forces linting and tests on every plain `make`; change the Makefile so "all" depends only on "protos-generate" and "build" (remove "lint-test" from the "all" prerequisite list) and add a new CI-specific target (e.g., `ci`) that depends on "all" and "lint-test" so CI runs linting/tests but local `make` stays fast; update targets "all", "lint-test", "protos-generate", and "build" accordingly.node-drainer/pkg/reconciler/reconciler.go (1)
163-169: Prefer a localmodel.Statusvariable over the unsafe pointer cast
(*model.Status)(&healthEventWithStatus.HealthEventStatus.NodeQuarantined)works only becausemodel.Statusistype Status string, but the cast will silently break if that type alias changes. The same pattern appears at line 294.♻️ Suggested refactor (applies to both line 165 and line 294)
+ nodeQuarantinedStatus := model.Status(healthEventWithStatus.HealthEventStatus.NodeQuarantined) if shouldSkip := r.handleEventCancellation( - documentID, nodeName, (*model.Status)(&healthEventWithStatus.HealthEventStatus.NodeQuarantined)); shouldSkip { + documentID, nodeName, &nodeQuarantinedStatus); shouldSkip {And at line 294:
+ nodeQuarantinedModelStatus := model.Status(nodeQuarantinedStatus) - if r.isEventCancelled(eventID, nodeName, (*model.Status)(&nodeQuarantinedStatus)) { + if r.isEventCancelled(eventID, nodeName, &nodeQuarantinedModelStatus) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 163 - 169, Replace the unsafe pointer cast to model.Status by creating a local variable of type model.Status, assign it the string value from healthEventWithStatus.HealthEventStatus.NodeQuarantined (and the analogous value used at the other site), and pass a pointer to that local variable into r.handleEventCancellation; update both call sites (the call using healthEventWithStatus.HealthEventStatus.NodeQuarantined and the similar call around line 294) so they reference the new local model.Status variable instead of using (*model.Status)(...).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/Makefile`:
- Around line 104-107: The lint target currently invokes golangci-lint from
PATH; add a version-pinned variable (e.g., GOLANGCI_LINT_VERSION set to v2.5.0)
and a download/install target (e.g., download-golangci-lint or
install-golangci-lint) that uses go-install-tool to install
golangci-lint@$(GOLANGCI_LINT_VERSION) into $(LOCALBIN), then update the lint
target to call $(LOCALBIN)/golangci-lint (or the variable holding that path)
instead of the bare golangci-lint; ensure references use the existing patterns
for other tools like protoc-gen-go/protoc-gen-go-grpc and that the version
matches .versions.yaml.
---
Outside diff comments:
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 150-156: The code dereferences
healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status without
checking for nil; update the early-return block (the one calling
isTerminalStatus(...)) to first guard that
healthEventWithStatus.HealthEventStatus != nil and
healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus != nil before
accessing .Status (mirror the nil checks used at other call sites in this file),
and only call isTerminalStatus when the pointer is non-nil; if the pointer is
nil, skip the terminal-state check and continue (i.e., do not dereference).
- Around line 953-960: The update payload in setDrainFailedStatus currently
embeds a protos.OperationStatus struct directly, risking implicit BSON field
names; change it to mirror updateNodeUserPodsEvictedStatus by constructing an
explicit map[string]any for the "$set" value (e.g., "$set":
map[string]any{"healtheventstatus.userpodsevictionstatus":
map[string]any{"status": string(model.StatusFailed), "message": reason}}) so
only the intended "status" and "message" lowercase fields are written; locate
setDrainFailedStatus and replace the protos.OperationStatus literal with the
explicit map form referencing protos.OperationStatus fields for values.
---
Duplicate comments:
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 631-649: Add a defensive nil guard at the start of
observeEvictionDurationIfSucceeded to return immediately if the pointer
userPodsEvictionStatus is nil before accessing userPodsEvictionStatus.Status;
locate the function observeEvictionDurationIfSucceeded in reconciler.go and
insert a check for userPodsEvictionStatus == nil (return) so subsequent code can
safely read userPodsEvictionStatus.Status and avoid panics.
---
Nitpick comments:
In `@api/Makefile`:
- Line 53: The current Makefile target "all" includes "lint-test" which forces
linting and tests on every plain `make`; change the Makefile so "all" depends
only on "protos-generate" and "build" (remove "lint-test" from the "all"
prerequisite list) and add a new CI-specific target (e.g., `ci`) that depends on
"all" and "lint-test" so CI runs linting/tests but local `make` stays fast;
update targets "all", "lint-test", "protos-generate", and "build" accordingly.
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 163-169: Replace the unsafe pointer cast to model.Status by
creating a local variable of type model.Status, assign it the string value from
healthEventWithStatus.HealthEventStatus.NodeQuarantined (and the analogous value
used at the other site), and pass a pointer to that local variable into
r.handleEventCancellation; update both call sites (the call using
healthEventWithStatus.HealthEventStatus.NodeQuarantined and the similar call
around line 294) so they reference the new local model.Status variable instead
of using (*model.Status)(...).
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
api/Makefilenode-drainer/pkg/evaluator/evaluator.gonode-drainer/pkg/reconciler/reconciler.go
🚧 Files skipped from review as they are similar to previous changes (1)
- node-drainer/pkg/evaluator/evaluator.go
| .PHONY: lint | ||
| lint: ## Run golangci-lint on all Go files. | ||
| @echo "Running golangci-lint..." | ||
| @golangci-lint run ./... |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check .versions.yaml for golangci-lint entry
echo "=== Checking .versions.yaml for golangci-lint ==="
if [ -f ".versions.yaml" ]; then
cat .versions.yaml
else
echo ".versions.yaml not found in repo root"
fiRepository: NVIDIA/NVSentinel
Length of output: 2502
🏁 Script executed:
#!/bin/bash
# Check api/Makefile for tool management patterns
echo "=== Checking api/Makefile ==="
if [ -f "api/Makefile" ]; then
cat -n api/Makefile | head -150
else
echo "api/Makefile not found"
fiRepository: NVIDIA/NVSentinel
Length of output: 5507
🏁 Script executed:
#!/bin/bash
# Search for golangci-lint references in Makefiles
echo "=== Searching for golangci-lint in Makefiles ==="
rg "golangci" --type makefile -C 3Repository: NVIDIA/NVSentinel
Length of output: 159
Add version-pinned golangci-lint to api/Makefile following the same pattern as other Go tools.
golangci-lint is defined in .versions.yaml at v2.5.0 but is invoked directly from PATH in the lint target without version pinning. All other Go tools (protoc-gen-go, protoc-gen-go-grpc) are pinned and installed via go-install-tool into $(LOCALBIN), ensuring reproducible builds across machines and CI runners. The lint target should follow the same pattern.
Add the variable and download target:
♻️ Proposed changes
# Tool Binaries
PROTOC_GEN_GO ?= $(LOCALBIN)/protoc-gen-go
PROTOC_GEN_GO_GRPC ?= $(LOCALBIN)/protoc-gen-go-grpc
+GOLANGCI_LINT ?= $(LOCALBIN)/golangci-lint
# Tool Versions (load from ../.versions.yaml)
# Requires yq to be installed: brew install yq (macOS) or see https://github.com/mikefarah/yq
YQ := $(shell command -v yq 2> /dev/null)
ifndef YQ
$(error yq is required but not found. Install it with: brew install yq (macOS) or see https://github.com/mikefarah/yq)
endif
VERSIONS_FILE := ../.versions.yaml
PROTOC_GEN_GO_VERSION := $(shell $(YQ) '.protobuf.protoc_gen_go' $(VERSIONS_FILE))
PROTOC_GEN_GO_GRPC_VERSION := $(shell $(YQ) '.protobuf.protoc_gen_go_grpc' $(VERSIONS_FILE))
+GOLANGCI_LINT_VERSION := $(shell $(YQ) '.go_tools.golangci_lint' $(VERSIONS_FILE)) .PHONY: lint
-lint: ## Run golangci-lint on all Go files.
+lint: $(GOLANGCI_LINT) ## Run golangci-lint on all Go files.
`@echo` "Running golangci-lint..."
- `@golangci-lint` run ./...
+ @$(GOLANGCI_LINT) run ./... $(PROTOC_GEN_GO_GRPC):
$(call go-install-tool,$(PROTOC_GEN_GO_GRPC),google.golang.org/grpc/cmd/protoc-gen-go-grpc,$(PROTOC_GEN_GO_GRPC_VERSION))
+$(GOLANGCI_LINT):
+ $(call go-install-tool,$(GOLANGCI_LINT),github.com/golangci/golangci-lint/cmd/golangci-lint,$(GOLANGCI_LINT_VERSION))
+
# go-install-tool macro🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/Makefile` around lines 104 - 107, The lint target currently invokes
golangci-lint from PATH; add a version-pinned variable (e.g.,
GOLANGCI_LINT_VERSION set to v2.5.0) and a download/install target (e.g.,
download-golangci-lint or install-golangci-lint) that uses go-install-tool to
install golangci-lint@$(GOLANGCI_LINT_VERSION) into $(LOCALBIN), then update the
lint target to call $(LOCALBIN)/golangci-lint (or the variable holding that
path) instead of the bare golangci-lint; ensure references use the existing
patterns for other tools like protoc-gen-go/protoc-gen-go-grpc and that the
version matches .versions.yaml.
| // from the change stream before passing to the reconciler | ||
| // NOTE: Always use a pointer for proto structs (e.g., *protos.HealthEventStatus)! | ||
| // Proto-generated structs may embed sync.Mutex or other non-copyable fields. | ||
| // Using a value type can cause unsafe copies and subtle bugs (see Go linter: copylocksdefault). |
There was a problem hiding this comment.
Had to change protos.HealthEventStatus to be a ptr, earlier it was value
eaba40f to
310f037
Compare
|
/ok to test 310f037 |
310f037 to
fa86eed
Compare
506b9cb to
dc66871
Compare
|
@ivelichkovich @lalitadithya Sorry, rebase was messy, had to squash all the commits for a clean rebase. |
|
sorry @yavinash007 , we merged in a bunch of dependabot updates earlier today. Can you rebase again? The next depedenbot updates aren't for a week, so we can get this merged in before that. sorry again |
| expected := bson.M{ | ||
| "healtheventstatus.faultremediated": true, | ||
| "healtheventstatus.lastremediationtimestamp": now, | ||
| "healtheventstatus.faultremediated": map[string]interface{}{ |
There was a problem hiding this comment.
shouldn't this be a string with the value true?
Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
dc66871 to
d0a73ef
Compare
|
Yes, I'm running on MAC/arm, likely those build flags has something to do with it. I reverted it to base line. |
|
@lalitadithya Could take a look a the latest commit pls, made couple of changes while debugging e2e tests. All the e2e tests are passing now. Can we run tests on this MR pls ? If you dont have any major concerns |
|
/ok to test d0a73ef |
Merging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
|
/ok to test 352df82 |
Merging this branch will decrease overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
|
/ok to test fae77ef |
Summary
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit