Skip to content

k8s datastore, CRD generation from proto bufs#882

Merged
lalitadithya merged 4 commits intoNVIDIA:mainfrom
yavinash007:k8s-data-store
Mar 4, 2026
Merged

k8s datastore, CRD generation from proto bufs#882
lalitadithya merged 4 commits intoNVIDIA:mainfrom
yavinash007:k8s-data-store

Conversation

@yavinash007
Copy link
Contributor

@yavinash007 yavinash007 commented Feb 18, 2026

Summary

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features
    • Added a HealthEventResource CRD and Helm chart to expose health events in Kubernetes (spec/status, new enums).
  • Chores
    • Added CRD/protobuf generation and vendor tooling targets; updated build and lint workflows; updated .gitignore and Go/protobuf/grpc toolchain versions.
  • Tests
    • Updated tests and test data to use protobuf-backed HealthEvent and status representations.
  • Bug Fixes
    • Improved nil-safety, serialization/marshaling, and status/timestamp handling (string-based statuses).

@copy-pr-bot
Copy link

copy-pr-bot bot commented Feb 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds CRD support and generation: new protoc-gen-crd tooling pin, extended health_event.proto (status/operation/CRD types), generated CRD YAML and Helm chart, Makefile targets for CRD/plugin/vendor protos, widespread go.mod bumps, and migration of internal status types to protobuf-backed representations across code and tests.

Changes

Cohort / File(s) Summary
Repo config & top-level build
\.gitignore, .versions.yaml, Makefile
Unignored vendor/; pinned protoc_gen_crd: 'v1.1.0'; added vendor-proto and lint-test-api Make targets; updated protos-generate deps and license linter ignores.
Data-models proto & tooling
data-models/protobufs/health_event.proto, data-models/tools.go, data-models/Makefile
Extended proto with OperationStatus, HealthEventStatus, HealthEventWithStatus, HealthEventResource (CRD) and imports; added tools file for CRD proto; Makefile: CRD generation targets, plugin install, vendor-proto, crd-clean, compile-crd-server, and added -I vendor.
Go model types
data-models/pkg/model/health_event_crd.go, data-models/pkg/model/health_event_extentions.go
Added CRD resource types; removed prior value types; switched HealthEventWithStatus.HealthEventStatus to *protos.HealthEventStatus.
Makefiles across modules
api/Makefile, health-monitors/*/Makefile, preflight-checks/*/Makefile, data-models/Makefile, ...
Introduced/propagated vendor-proto targets, wired protoc includes to data-models/vendor, integrated CRD generation into protos flow, and added lint/test orchestration in api.
Helm chart & CRD manifests
distros/kubernetes/nvsentinel/Chart.yaml, distros/kubernetes/nvsentinel/values.yaml, .../charts/k8s-datastore/*, .../crds/health_event.crd.yaml
Added k8sdatastore-crds chart and values flag; committed generated HealthEventResource CRD YAML into Helm chart.
Dependency updates (many go.mod)
commons/go.mod, data-models/go.mod, */go.mod (multiple modules)
Added indirect github.com/yandex/protoc-gen-crd v1.1.0; bumped numerous golang.org/x/, google.golang.org/, and protobuf/grpc-related versions across modules.
Status migration (code & tests)
commons/pkg/*, store-client/*, node-drainer/*, fault-*/*, health-events-analyzer/*, store-client/pkg/testutils/*, tests/...`
Replaced internal/status value types with protobuf types/pointers (*protos.HealthEventStatus, *protos.OperationStatus); NodeQuarantined changed to string; timestamps/wrappers updated; tests and serialization adapted.
Reconciler / evaluator changes
node-drainer/pkg/reconciler/*, node-drainer/pkg/evaluator/*, fault-remediation/*, fault-quarantine/*
Updated DB payloads and update functions to proto-backed JSON/maps, changed comparisons to string-based checks, added nil-guards for wrappers/timestamps, and updated some function signatures to accept proto types.
Generated protobuf outputs (Python/pyi)
*/protos/health_event_pb2.py, */protos/health_event_pb2.pyi
Regenerated Python protobuf modules and stubs to include new OperationStatus, HealthEventStatus, HealthEventWithStatus, HealthEventResource, and CRD imports/descriptor updates.
Tests & serialization
multiple *_test.go, store-client/pkg/datastore/providers/postgresql/*, store-client/pkg/testutils/*
Updated tests to construct/use protobuf-backed statuses (wrapperspb/timestamppb), changed assertions to proto.Equal or value-based checks, and adapted expected JSON/BSON forms.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Kubernetes
    participant Operator
    participant Datastore

    Client->>Kubernetes: create/update HealthEventResource (CRD)
    Kubernetes->>Operator: watch/notify resource change
    Operator->>Datastore: read/write HealthEvent (protos.HealthEvent / protos.HealthEventStatus)
    Datastore->>Datastore: protojson marshal -> JSON map for persistence
    Operator->>Operator: evaluate NodeQuarantined / userPodsEvictionStatus
    Operator->>Datastore: persist updated status (proto-backed JSON)
    Kubernetes-->>Client: status reflected on resource
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐇 I nibbled at protos through the night,

CRDs unfurled in YAML light,
Pointers turned to strings and time,
Charts and vendors joined the rhyme,
Hop — the cluster hums in proto-dream.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'k8s datastore, CRD generation from proto bufs' clearly and specifically summarizes the main changes: introducing Kubernetes datastore functionality and CRD generation from protobuf definitions, which aligns with the comprehensive changeset involving CRD YAML generation, proto definitions, Helm chart additions, and related build infrastructure updates.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@@ -15,11 +15,46 @@
syntax = "proto3";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason we moved this to a proto is so that it can be used with the proto crd generator right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

We are using https://github.com/yandex/protoc-gen-crd to generate the CRD from the proto file.

Moved HealthEventWithStatus into proto from health_event_extentions.go.

// Status types are now generated from proto definitions for single source of truth.
// See data-models/protobufs/health_event.proto for HealthEventStatus and related types.
type HealthEventWithStatus struct {
CreatedAt time.Time bson:"createdAt"
HealthEvent *protos.HealthEvent bson:"healthevent,omitempty"
HealthEventStatus *protos.HealthEventStatus bson:"healtheventstatus"
}

expectJSONInclude: []string{"userpodsevictionstatus"},
expectJSONOmit: []string{"nodeQuarantined", "faultRemediated", "lastRemediationTimestamp"},
expectJSONInclude: []string{"userPodsEvictionStatus"},
expectBSONInclude: []string{"userpodsevictionstatus"},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mango DB tags are by default, all lowercase.

Need some input if this has any impact, after moving the healthEventStatus to proto's, they dont have bson tags.

@yavinash007 yavinash007 marked this pull request as ready for review February 20, 2026 03:13
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (11)
labeler/go.mod (2)

12-16: ⚠️ Potential issue | 🟠 Major

k8s.io/kubernetes v1.32.10 is three minor versions behind k8s.io/api v0.35.1 — replace directives are missing.

k8s.io/api uses a 0.X.Y versioning scheme that maps directly to k8s.io/kubernetes v1.X.Y. Here k8s.io/api v0.35.1 (= Kubernetes 1.35) while k8s.io/kubernetes v1.32.10 (= Kubernetes 1.32) creates a 3-minor-version skew:

  1. k8s.io/kubernetes is a "super-module" whose own go.mod contains replace directives that pin all staging packages (k8s.io/api, k8s.io/apimachinery, k8s.io/client-go, …) to v0.32.x. Those replace directives do not propagate to downstream consumers — they must be replicated in this go.mod.
  2. Without them, Go's MVS selects k8s.io/api v0.35.1 (the higher explicit requirement), but k8s.io/kubernetes v1.32.10 was built against k8s.io/api v0.32.10, causing API-surface mismatches at link or runtime.
  3. Importing k8s.io/kubernetes as a direct dependency is widely discouraged; the recommended approach is to depend only on k8s.io/api, k8s.io/client-go, and k8s.io/apimachinery.

Either align k8s.io/kubernetes to v1.35.x (matching k8s.io/api v0.35.1) with the corresponding replace directives, or remove the direct k8s.io/kubernetes dependency entirely and use only the staging modules.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@labeler/go.mod` around lines 12 - 16, The go.mod currently lists
k8s.io/kubernetes v1.32.10 which skews three minor versions from k8s.io/api
v0.35.1; fix by removing the direct k8s.io/kubernetes dependency (preferred):
delete the k8s.io/kubernetes line, ensure no code imports k8s.io/kubernetes, run
go mod tidy so only k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go
remain; alternatively, if you must keep k8s.io/kubernetes, update it to the
matching v1.35.x (e.g., v1.35.1) and add the replace directives from the
k8s.io/kubernetes go.mod that pin staging modules (k8s.io/api,
k8s.io/apimachinery, k8s.io/client-go, etc.) to the corresponding v0.35.x
versions so MVS stays consistent.

71-71: ⚠️ Potential issue | 🟡 Minor

k8s.io/apiextensions-apiserver v0.34.3 is one minor version behind the other k8s.io/* dependencies at v0.35.1.

All k8s.io/ staging modules should track the same minor version. Since this PR introduces CRD support, running v0.34.3 alongside k8s.io/api@v0.35.1, k8s.io/apimachinery@v0.35.1, and k8s.io/client-go@v0.35.1 risks subtle type incompatibilities. Bump to v0.35.1 to align with the rest of the Kubernetes dependencies.

🔧 Proposed fix
-	k8s.io/apiextensions-apiserver v0.34.3 // indirect
+	k8s.io/apiextensions-apiserver v0.35.1 // indirect

Then run go mod tidy inside labeler/ to update go.sum.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@labeler/go.mod` at line 71, Update the k8s.io staging module version in
labeler/go.mod: change the k8s.io/apiextensions-apiserver dependency from
v0.34.3 to v0.35.1 so it matches the other k8s.io/* deps (k8s.io/api,
k8s.io/apimachinery, k8s.io/client-go), then run go mod tidy inside the labeler/
module to refresh go.sum; look for the k8s.io/apiextensions-apiserver entry in
labeler/go.mod to make the edit.
tilt/simple-health-client/go.mod (1)

22-24: ⚠️ Potential issue | 🟡 Minor

Orphaned replace directive for store-client — no corresponding require entry.

github.com/nvidia/nvsentinel/store-client appears in neither the direct nor indirect require block of this module. A replace directive for a module that is not required is silently ignored by Go toolchain and has no effect. This suggests either:

  1. A // indirect require entry for store-client is missing (i.e., go mod tidy was not run), or
  2. The directive was copy-pasted from another service's go.mod and is not needed here.

Please run go mod tidy in tilt/simple-health-client/ to reconcile, and remove the directive if store-client is truly not a transitive dependency.

🛠️ Proposed fix if `store-client` is not actually needed
 replace github.com/nvidia/nvsentinel/data-models => ../../data-models
-
-replace github.com/nvidia/nvsentinel/store-client => ../../store-client
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tilt/simple-health-client/go.mod` around lines 22 - 24, The replace directive
for module github.com/nvidia/nvsentinel/store-client in the go.mod is orphaned
because there is no corresponding require; run `go mod tidy` in
tilt/simple-health-client to let Go reconcile missing requires and then remove
the `replace github.com/nvidia/nvsentinel/store-client => ../../store-client`
entry if tidy shows it is not a transitive dependency. If tidy indicates the
module is required indirectly, add the resulting `require
github.com/nvidia/nvsentinel/store-client vX.Y.Z // indirect` (as produced by
tidy) instead of leaving a stray replace; otherwise delete the replace line
referencing store-client.
node-drainer/pkg/reconciler/reconciler.go (2)

909-917: 🛠️ Refactor suggestion | 🟠 Major

Inconsistent serialization: proto struct used directly instead of protojson round-trip.

updateNodeUserPodsEvictedStatus (line 584) uses protojson.Marshaljson.Unmarshal to convert the proto message to a map[string]interface{} before storing in MongoDB. Here, protos.OperationStatus is embedded directly as a struct value. This produces different serialization behavior — the BSON encoder will serialize exported Go fields and their struct tags, which may differ from protojson output (e.g., omitempty handling, field naming for nested types). Additionally, proto messages should be passed by pointer, not by value.

Consider reusing updateNodeUserPodsEvictedStatus or applying the same protojson round-trip here for consistency.

♻️ Proposed fix — use protojson round-trip or call the shared helper
-	update := map[string]any{
-		"$set": map[string]any{
-			"healtheventstatus.userpodsevictionstatus": protos.OperationStatus{
-				Status:  &protos.Status{Value: string(model.StatusFailed)},
-				Message: reason,
-			},
-		},
-	}
+	failedStatus := &protos.OperationStatus{
+		Status:  &protos.Status{Value: string(model.StatusFailed)},
+		Message: reason,
+	}
+
+	jsonBytes, err := protojson.Marshal(failedStatus)
+	if err != nil {
+		return fmt.Errorf("failed to marshal drain failed status: %w", err)
+	}
+
+	var statusMap map[string]interface{}
+	if err := json.Unmarshal(jsonBytes, &statusMap); err != nil {
+		return fmt.Errorf("failed to unmarshal drain failed status: %w", err)
+	}
+
+	update := map[string]any{
+		"$set": map[string]any{
+			"healtheventstatus.userpodsevictionstatus": statusMap,
+		},
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 909 - 917, The update
embeds a protos.OperationStatus value directly into the BSON update, causing
inconsistent serialization and using a non-pointer proto; change this to perform
the same protojson round-trip used by updateNodeUserPodsEvictedStatus
(protojson.Marshal the *protos.OperationStatus, then json.Unmarshal into
map[string]interface{}) or invoke the existing helper that returns the marshaled
map, and ensure you pass a pointer to protos.OperationStatus (e.g.,
&protos.OperationStatus{...}) so the stored document uses the same JSON field
names and omitempty behavior as elsewhere.

148-157: ⚠️ Potential issue | 🔴 Critical

Guard the nil chain when accessing UserPodsEvictionStatus.Status.Value.

Line 151 dereferences through two nested pointer fields without nil checks. If an event unmarshals without the userpodsevictionstatus field set, this will panic. The parser initializes NodeQuarantined with a default value when nil, but does not do the same for UserPodsEvictionStatus, creating an asymmetric nil-guard situation.

Add defensive checks before accessing nested Status fields:

-	if isTerminalStatus(model.Status(healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status.Value)) {
+	evictionStatus := healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus
+	if evictionStatus != nil && evictionStatus.Status != nil &&
+		isTerminalStatus(model.Status(evictionStatus.Status.Value)) {

Apply the same guard to other accesses of UserPodsEvictionStatus at lines 293–295 (handleEventCancellation), and 377, 510, 520, 761 (status mutation methods).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 148 - 157, The code
currently dereferences
healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status.Value
without nil checks which can panic if UserPodsEvictionStatus or its Status is
nil; update the guard in the reconciler pre-processing (the block that calls
isTerminalStatus with
model.Status(healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status.Value))
to first check that healthEventWithStatus.HealthEventStatus != nil, that
UserPodsEvictionStatus != nil, and that UserPodsEvictionStatus.Status != nil
before reading Value, and treat missing status as non-terminal; apply the same
defensive nil-check pattern to all other locations mentioned
(handleEventCancellation and the status mutation methods referenced) to avoid
panics when UserPodsEvictionStatus is absent.
fault-remediation/go.mod (1)

3-5: ⚠️ Potential issue | 🟡 Minor

Update the toolchain directive to the latest Go 1.25 patch for security fixes.

go1.25.7 (released 2026-02-04) includes security fixes to the go command and crypto/tls, while go1.25.6 (released 2026-01-15) included fixes to archive/zip, crypto/tls, and net/url. The declared toolchain go1.25.3 predates both and leaves those vulnerabilities in the toolchain used for builds.

🛡️ Proposed fix
-toolchain go1.25.3
+toolchain go1.25.7
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/go.mod` around lines 3 - 5, Update the toolchain directive
to the latest Go 1.25 patch: replace the current "toolchain go1.25.3" entry in
the go.mod (alongside the existing "go 1.25.0" directive) with "toolchain
go1.25.7" so builds use the Go 1.25.7 toolchain that includes the security
fixes; ensure no other toolchain lines remain conflicting and run go env or a
quick build to verify the new toolchain is picked up.
Makefile (1)

279-306: ⚠️ Potential issue | 🟠 Major

protos-generate without a clean step can leave stale generated files. If any .proto is removed/renamed, old *.pb.go / *_pb2.py can stick around and create confusing diffs or even compile-time collisions. Consider either restoring protos-clean as a prerequisite or making clean opt-in via a flag.

One option: opt-in clean
 .PHONY: protos-generate
 protos-generate: ## Generate protobuf files from .proto sources
 	`@echo` "Generating protobuf files in data-models (Go) and gpu-health-monitor (Python)..."
+	`@if` [ "$(PROTO_CLEAN)" = "1" ]; then \
+		$(MAKE) protos-clean; \
+	fi
 	`@echo` "=== Tool Versions ==="
 	`@echo` "Go: $$(go version)"
 	`@echo` "protoc: $$(protoc --version)"
 	`@echo` "protoc-gen-go: $$(protoc-gen-go --version)"
 	`@echo` "protoc-gen-go-grpc: $$(protoc-gen-go-grpc --version)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 279 - 306, The protos-generate Makefile target can
leave stale generated files; update the protos-generate target to either depend
on an existing protos-clean target or honor an opt-in CLEAN flag: add
protos-clean as a prerequisite to protos-generate (so protos-clean runs
automatically) or implement a conditional that runs the protos-clean recipe when
CLEAN=true (check the protos-generate recipe and reference the protos-clean
target and the CLEAN variable), ensuring generated outputs are removed before
regenerating to avoid leftover *.pb.go and *_pb2.py artifacts.
platform-connectors/go.mod (1)

24-36: ⚠️ Potential issue | 🟠 Major

Remove protoc-gen-crd or pin it in tools.go if CRD generation is performed. This module has no direct usage of or reference to protoc-gen-crd, no tools.go file, and only carries it as a transitive indirect dependency. If platform-connectors does not run CRD generation, remove this dependency to keep the module manifest clean and explicit. If generation does occur, create a tools.go file with appropriate // +build tools directive to document the tool dependency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform-connectors/go.mod` around lines 24 - 36, The go.mod currently
carries github.com/yandex/protoc-gen-crd v1.1.0 as an indirect dependency but
the module has no CRD generation code or a tools declaration; either remove the
protoc-gen-crd line from go.mod to drop the unused transitive dependency, or if
you do generate CRDs, add a tools.go file (with the // +build tools build tag)
that imports github.com/yandex/protoc-gen-crd so the tool is explicitly pinned
and referenced; update go.mod via go mod tidy after making the change to ensure
the module graph is consistent.
node-drainer/pkg/evaluator/evaluator.go (1)

62-74: ⚠️ Potential issue | 🔴 Critical

Use safe getters or nil guards to prevent panic on unset proto fields. Direct dereference of healthEvent.HealthEventStatus.UserPodsEvictionStatus.Status.Value will panic if either intermediate pointer is nil. Proto3 fields are optional and can be unset in production; tests demonstrate this can occur (e.g., empty HealthEventStatus{}). Safe proto getters (GetUserPodsEvictionStatus(), GetStatus()) are available in the generated code.

Proposed fix using safe getters
 	statusPtr := healthEvent.HealthEventStatus.NodeQuarantined
 
 	if statusPtr != nil && statusPtr.Value == string(model.UnQuarantined) {
 		return &DrainActionResult{Action: ActionSkip}, nil
 	}
 
-	if isTerminalStatus(model.Status(healthEvent.HealthEventStatus.UserPodsEvictionStatus.Status.Value)) {
+	evictionStatusValue := ""
+	if u := healthEvent.HealthEventStatus.GetUserPodsEvictionStatus(); u != nil {
+		evictionStatusValue = u.GetStatus().GetValue()
+	}
+	if isTerminalStatus(model.Status(evictionStatusValue)) {
 		slog.Info("Event for node is in terminal state, skipping", "node", nodeName)
 
 		return &DrainActionResult{
 			Action: ActionSkip,
 		}, nil
 	}

Also applies to: 88-108

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/evaluator/evaluator.go` around lines 62 - 74, The code
dereferences proto fields directly (e.g.,
healthEvent.HealthEventStatus.UserPodsEvictionStatus.Status.Value and
HealthEventStatus.NodeQuarantined) which can be nil and cause panics; replace
those direct accesses with the generated safe getters (e.g.,
GetHealthEventStatus(), GetNodeQuarantined(), GetUserPodsEvictionStatus(),
GetStatus(), GetValue()) and add nil-check guards before calling
isTerminalStatus or returning a DrainActionResult; update both the initial
NodeQuarantined check and the later isTerminalStatus invocation (and the similar
occurrences around lines 88–108) to use these getters so the function (including
isTerminalStatus/DrainActionResult/ActionSkip paths) will not panic on unset
proto fields.
metadata-collector/go.mod (1)

12-56: ⚠️ Potential issue | 🟠 Major

Address version misalignment across go.mod files.

The repository has significant dependency version inconsistencies that need resolution:

  • grpc: Most modules use v1.79.1, but tilt/csp-api-mock/go.mod pins v1.77.0
  • genproto/googleapis/rpc: Six different versions detected across modules (v0.0.0-20260209200024-4cfbd4190f57, v0.0.0-20251202230838-ff82c1b0f217, v0.0.0-20260128011058-8636f8732409, v0.0.0-20250603155806-513f23925822, v0.0.0-20251124214823-79d6a2a48846, v0.0.0-20250303144028-a0af3efb3deb)
  • golang.org/x packages: Version drift in api/, plugins/, tests/scale-tests/, and tilt/csp-api-mock/ modules; health-monitors/csp-health-monitor/ uses oauth2 v0.35.0 vs v0.34.0 elsewhere

Standardize versions across all services to follow the semantic versioning principle outlined in the guidelines.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@metadata-collector/go.mod` around lines 12 - 56, Multiple go.mod files have
inconsistent dependency versions (notably google.golang.org/grpc,
google.golang.org/genproto/googleapis/rpc, and golang.org/x/* packages);
standardize by choosing canonical versions (e.g., grpc v1.79.1 and the agreed
genproto googleapis/rpc commit like v0.0.0-20260209200024-4cfbd4190f57 or
whatever the repo policy picks) and update all go.mod files that deviate
(example mismatches: tilt/csp-api-mock pins grpc v1.77.0,
health-monitors/csp-health-monitor pins oauth2 v0.35.0). For each module with a
mismatch, run go get <module>@<canonical-version> (e.g., go get
google.golang.org/grpc@v1.79.1, go get
google.golang.org/genproto/googleapis/rpc@<chosen-version>, go get
golang.org/x/oauth2@<chosen-version>), then go mod tidy and run tests/CI to
verify; repeat until all go.mod files list the canonical versions.
fault-quarantine/pkg/eventwatcher/event_watcher.go (1)

240-246: ⚠️ Potential issue | 🔴 Critical

Critical: MongoDB filters compare plain strings against nested protos.Status subdocuments — queries will never match.

After migrating NodeQuarantined from *model.Status (a string alias) to *protos.Status (a struct with a Value field), the BSON-stored format changed from a plain string to a nested object:

healtheventstatus.nodequarantined: "Quarantined"  → healtheventstatus.nodequarantined: {"value": "Quarantined"}

However, the filters on lines 243–246 and 300–305 still compare against raw strings using the $in operator, which will never match the nested structure.

Reading code (line 292) correctly accesses the nested field: NodeQuarantined.Value, and the update (line 310) correctly serializes the struct, but the filters are misaligned.

Proposed fix — use the nested `.value` field in filters
 	filter := map[string]interface{}{
 		"healthevent.nodename": nodeName,
-		"healtheventstatus.nodequarantined": map[string]interface{}{
+		"healtheventstatus.nodequarantined.value": map[string]interface{}{
 			"$in": []interface{}{model.Quarantined, model.UnQuarantined},
 		},
 	}

And similarly for the updateFilter:

-		"healtheventstatus.nodequarantined": map[string]interface{}{
+		"healtheventstatus.nodequarantined.value": map[string]interface{}{
 			"$in": []interface{}{model.Quarantined, model.AlreadyQuarantined},
 		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-quarantine/pkg/eventwatcher/event_watcher.go` around lines 240 - 246,
The MongoDB filters compare raw strings against the protos.Status subdocument,
so they never match; update the filters to target the nested field
(healtheventstatus.nodequarantined.value) instead of the whole subdocument.
Specifically, change the initial filter (the map assigned to filter referencing
healtheventstatus.nodequarantined) and the updateFilter used later to use
"healtheventstatus.nodequarantined.value" with $in (or direct equality)
comparing to the string constants model.Quarantined and model.UnQuarantined;
locate the variables named filter and updateFilter and align them with how
NodeQuarantined.Value is read elsewhere (NodeQuarantined.Value) so queries match
the stored {"value":"..."} shape.
🧹 Nitpick comments (14)
janitor/go.mod (1)

107-108: Align googleapis/api and googleapis/rpc to the same version snapshot.

Both sibling submodules from the go-genproto monorepo should ideally track the same upstream snapshot. Currently, googleapis/rpc is at v0.0.0-20260209200024-4cfbd4190f57 (Feb 9, 2026) while googleapis/api remains at v0.0.0-20251202230838-ff82c1b0f217 (Nov 25, 2025). Although dependencies resolve without conflict, bumping googleapis/api to the matching Feb 2026 timestamp improves consistency and reduces the risk of type identity mismatches since these modules can exchange protobuf types.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@janitor/go.mod` around lines 107 - 108, The go.mod currently pins
google.golang.org/genproto/googleapis/api at v0.0.0-20251202230838-ff82c1b0f217
while googleapis/rpc is at v0.0.0-20260209200024-4cfbd4190f57; update the
googleapis/api dependency to the matching snapshot version
v0.0.0-20260209200024-4cfbd4190f57 so both genproto submodules use the same
upstream snapshot (change the module entry for
google.golang.org/genproto/googleapis/api to the Feb 9, 2026 pseudo-version).
janitor-provider/go.mod (1)

107-109: The genproto/googleapis/rpc sub-module version trails the base genproto by ~8 months — consider running go mod tidy to verify the dependency graph resolves cleanly.

The pseudo-version timestamps diverge noticeably across these modules:

  • google.golang.org/genproto: 2025-06-03
  • google.golang.org/genproto/googleapis/api: 2025-12-02
  • google.golang.org/genproto/googleapis/rpc: 2026-02-09

Since the base genproto and its sub-modules are independently versioned, this alone is not incorrect. However, the significant date gap could surface subtle incompatibilities if any transitively imported code relies on types defined in both modules.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@janitor-provider/go.mod` around lines 107 - 109, The go.mod shows mismatched
pseudo-versions for google.golang.org/genproto and its sub-modules
(googleapis/api and googleapis/rpc); run `go mod tidy` to refresh the dependency
graph and ensure the transitive requirements resolve, then, if tidy doesn't
align them, add an explicit `require` for
google.golang.org/genproto@<latest-consistent-pseudo-version>` (and/or pin
google.golang.org/genproto/googleapis/rpc to the same timestamped
pseudo-version) so the base genproto and its googleapis/* sub-modules are
consistent, regenerate go.sum, and verify `go build`/`go test` pass.
node-drainer/pkg/reconciler/reconciler_integration_test.go (1)

1393-1396: Proto message used by value — consider using a pointer for consistency.

protos.HealthEventStatus{...} is a proto message stored by value in the map. Proto messages contain internal state fields (protoimpl.MessageState) that are not designed to be copied by value. In this test context it works because the struct is only used for JSON round-tripping, but using &protos.HealthEventStatus{...} would be more idiomatic and consistent with proto conventions.

♻️ Suggested change
-		"healtheventstatus": protos.HealthEventStatus{
+		"healtheventstatus": &protos.HealthEventStatus{
 			NodeQuarantined:        &protos.Status{Value: string(opts.nodeQuarantined)},
 			UserPodsEvictionStatus: &protos.OperationStatus{Status: &protos.Status{Value: string(model.StatusInProgress)}},
 		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler_integration_test.go` around lines 1393
- 1396, The test creates a proto value with protos.HealthEventStatus stored by
value in the map which can incorrectly copy internal proto state; change the map
entry to store a pointer by replacing protos.HealthEventStatus{...} with
&protos.HealthEventStatus{...} and keep inner fields (NodeQuarantined,
UserPodsEvictionStatus and their Status/OperationStatus values) the same so the
map holds *protos.HealthEventStatus instead of a copied struct.
node-drainer/pkg/reconciler/reconciler.go (1)

293-295: Unsafe cast takes address of proto Value field.

Line 295 does (*model.Status)(&nodeQuarantinedStatus.Value) — this creates a pointer directly into the proto message's string field. While type-safe (both are string), if NodeQuarantined is nil (line 293 assigns the pointer), the .Value access on line 295 panics. The upstream parser reportedly validates for nil NodeQuarantined, but this is fragile and differs from line 166 which has the same pattern.

Consider converting via a local variable instead of taking the address of the proto field:

♻️ Proposed fix — use a local copy
 	nodeQuarantinedStatus := healthEventWithStatus.HealthEventStatus.NodeQuarantined
-
-	if r.isEventCancelled(eventID, nodeName, (*model.Status)(&nodeQuarantinedStatus.Value)) {
+	var nodeQuarantinedPtr *model.Status
+	if nodeQuarantinedStatus != nil {
+		s := model.Status(nodeQuarantinedStatus.Value)
+		nodeQuarantinedPtr = &s
+	}
+
+	if r.isEventCancelled(eventID, nodeName, nodeQuarantinedPtr) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 293 - 295, The code
dereferences a proto field address unsafely by passing
(*model.Status)(&nodeQuarantinedStatus.Value) into r.isEventCancelled, which
panics if NodeQuarantined is nil and creates a pointer into the proto's internal
memory; fix by nil-checking nodeQuarantinedStatus and creating a local
model.Status variable from nodeQuarantinedStatus.Value (e.g., v :=
model.Status(nodeQuarantinedStatus.Value)) then pass &v to r.isEventCancelled
(use the symbols nodeQuarantinedStatus, isEventCancelled, eventID, nodeName to
locate the call).
fault-remediation/pkg/reconciler/reconciler.go (1)

141-143: TODO comment should reference a GitHub issue, not a developer name.

// TODO - Avinash add a helper function to check for supported actions names a developer rather than tracking via an issue. As per coding guidelines, TODO comments should reference issues in Go code.

-// TODO - Avinash add a helper function to check for supported actions
+// TODO(#<issue-number>): extract a helper to check for supported actions
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/reconciler/reconciler.go` around lines 141 - 143,
Replace the developer-named TODO with an issue-reference TODO in the health
event handling block: change the comment "// TODO - Avinash add a helper
function to check for supported actions" to reference the GitHub issue that
tracks this work (e.g. "TODO(#<issue-number>): add a helper function to check
for supported actions") near the healthEventWithStatus check in reconciler.go so
the note follows the project's TODO guideline and is discoverable; no code
behavior changes required.
fault-remediation/pkg/reconciler/reconciler_test.go (1)

538-611: TestShouldSkipEvent is missing a table-driven case for the new FaultRemediated.GetValue() == true skip path.

The new guard added at reconciler.go:142–144 creates a new skip path: when HealthEventStatus.FaultRemediated is a non-nil BoolValue with value true. None of the four existing test cases or the table-driven TestShouldSkipEvent test cover this scenario.

// Add to the TestShouldSkipEvent table:
{
    name:              "Skip when FaultRemediated is true",
    nodeName:          "test-node-remediated",
    recommendedAction: protos.RecommendedAction_RESTART_BM,
    groupConfig:       getGroupConfig("restart", nil),
    shouldSkip:        true,
    description:       "Already-remediated events should be skipped",
},
// and construct healthEventWithStatus with:
// HealthEventStatus: &protos.HealthEventStatus{
//     NodeQuarantined: &protos.Status{Value: string(model.Quarantined)},
//     FaultRemediated: wrapperspb.Bool(true),
// }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/reconciler/reconciler_test.go` around lines 538 - 611,
Add a table-driven case to TestShouldSkipEvent that exercises the new skip path
in reconciler.shouldSkipEvent when HealthEventStatus.FaultRemediated is a
non-nil true BoolValue: add a test entry named like "Skip when FaultRemediated
is true" with RecommendedAction RESTART_BM (or other non-NONE), groupConfig from
getGroupConfig("restart", nil), shouldSkip=true, and build the
healthEventWithStatus so HealthEventStatus includes FaultRemediated:
wrapperspb.Bool(true) (and a NodeQuarantined status if required by the guard)
before calling r.shouldSkipEvent; this ensures the new FaultRemediated==true
branch is covered.
store-client/pkg/datastore/providers/postgresql/health_events_serialization_test.go (1)

68-68: Simplify — take address of composite literal directly.

The IIFE on line 68 (and line 80) can be replaced with a direct composite literal address:

♻️ Suggested simplification
-			NodeQuarantined: func() *protos.Status { s := protos.Status{Value: string(model.Quarantined)}; return &s }(),
+			NodeQuarantined: &protos.Status{Value: string(model.Quarantined)},

Same for line 80:

-			NodeQuarantined: func() *protos.Status { s := protos.Status{Value: string(model.UnQuarantined)}; return &s }(),
+			NodeQuarantined: &protos.Status{Value: string(model.UnQuarantined)},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@store-client/pkg/datastore/providers/postgresql/health_events_serialization_test.go`
at line 68, Replace the IIFE used to set NodeQuarantined (and the similar
pattern at the other occurrence) with a direct composite literal address; e.g.
change the current NodeQuarantined: func() *protos.Status { s :=
protos.Status{Value: string(model.Quarantined)}; return &s }() to
NodeQuarantined: &protos.Status{Value: string(model.Quarantined)} (and do the
analogous replacement for the other status field).
fault-quarantine/pkg/eventwatcher/event_watcher.go (1)

308-311: The $set update correctly uses &protos.Status{Value: "Cancelled"}, but prefer using the model.Cancelled constant for consistency.

Line 310 uses the hardcoded string "Cancelled" instead of string(model.Cancelled), which is used elsewhere in the codebase for this value.

♻️ Suggested fix
-			"healtheventstatus.nodequarantined": &protos.Status{Value: "Cancelled"},
+			"healtheventstatus.nodequarantined": &protos.Status{Value: string(model.Cancelled)},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-quarantine/pkg/eventwatcher/event_watcher.go` around lines 308 - 311,
The update map sets healtheventstatus.nodequarantined using
&protos.Status{Value: "Cancelled"}; change that to &protos.Status{Value:
string(model.Cancelled)} for consistency with the rest of the codebase, update
the map construction in event_watcher.go (the variable named update) and ensure
the model package is imported where the change is made (or use the correct
package alias if already imported).
health-events-analyzer/pkg/parser/parser_test.go (2)

419-419: Remove commented-out code.

Line 419 contains a commented-out field assignment that appears to be leftover from development. As per coding guidelines, commented-out code should be cleaned up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-events-analyzer/pkg/parser/parser_test.go` at line 419, Remove the
leftover commented-out field assignment in the test: delete the commented line
referencing UserPodsEvictionStatus (the "//\tUserPodsEvictionStatus:
&protos.OperationStatus{Status: &protos.Status{Value:
string(model.StatusSucceeded)}},") from parser_test.go so the test file contains
no commented-out dead code; ensure surrounding test data structures remain
syntactically valid after removal (e.g., commas for preceding/next struct
fields).

486-489: Use proto.Equal for comparing protobuf messages in this test instead of reflect.DeepEqual.

The test compares *protos.Status values at lines 125 and 431 using reflect.DeepEqual. Protobuf messages contain internal fields (sizeCache, unknownFields) that can cause false negatives. The reference test in health_events_serialization_test.go correctly uses proto.Equal for this purpose.

Import google.golang.org/protobuf/proto and replace the comparison with a type-aware check, or use a helper that applies proto.Equal specifically for proto message types.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-events-analyzer/pkg/parser/parser_test.go` around lines 486 - 489, The
test's equality check should use proto.Equal for protobuf messages instead of
reflect.DeepEqual: import "google.golang.org/protobuf/proto" and in the test
where getValueFromPath() returns got and you compare to tt.want (which can be a
*protos.Status), replace the reflect.DeepEqual branch with a type-aware check
that if both got and tt.want are proto.Message (or specifically *protos.Status)
use proto.Equal(got.(proto.Message), tt.want.(proto.Message)); otherwise fall
back to the existing DeepEqual logic and keep the same t.Errorf message when
they differ.
data-models/Makefile (3)

63-63: Duplicate PROTO_OPTIONS declaration.

PROTO_OPTIONS is assigned the same value on both Line 63 and Line 79. The second assignment overwrites the first with an identical value, making one of them redundant.

Remove the duplicate declaration
-PROTO_OPTIONS = vendor/github.com/yandex/protoc-gen-crd/library/go/k8s/protoc_gen_crd/proto/crd.proto
-
-
 $(PROTO_OPTIONS):

(Remove lines 79-80, keeping only the first declaration at Line 63.)

Also applies to: 79-79

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@data-models/Makefile` at line 63, Remove the redundant duplicate
PROTO_OPTIONS declaration in the Makefile: keep the first assignment of
PROTO_OPTIONS (the one at/near the top) and delete the second identical
assignment (the later occurrence) so PROTO_OPTIONS is defined only once; locate
the symbol PROTO_OPTIONS in the Makefile and remove the later duplicate line.

93-97: PHONY prerequisite forces CRD regeneration on every build.

install-plugin-remote is declared .PHONY, so Make always considers it out of date. Because it's a prerequisite of the file target $(CRD_FILE), the CRD YAML will be regenerated (and the plugin re-downloaded via go install) on every invocation, even when the output is already up-to-date. This negates the benefit of using a file target.

Consider either:

  1. Making install-plugin-remote an order-only prerequisite (| install-plugin-remote), or
  2. Using a stamp file for the plugin installation, or
  3. Checking if the binary already exists before installing.
Option 1: Use an order-only prerequisite
-$(CRD_FILE): vendor-proto protobufs/health_event.proto install-plugin-remote
+$(CRD_FILE): vendor-proto protobufs/health_event.proto | install-plugin-remote

This ensures the plugin is installed before the CRD is generated, but won't force regeneration if the CRD file is already newer than its real prerequisites.

Option 3: Guard the install with a binary check
 install-plugin-remote:
+	`@if` command -v protoc-gen-crd >/dev/null 2>&1; then \
+		echo "protoc-gen-crd already installed, skipping."; \
+	else \
 	`@echo` "Installing protoc-gen-crd from github..."
 	GO111MODULE=on go install github.com/yandex/protoc-gen-crd/cmd/protoc-gen-crd@$(PROTOC_GEN_CRD_VERSION)
+	fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@data-models/Makefile` around lines 93 - 97, The Makefile rule for generating
$(CRD_FILE) lists the phony target install-plugin-remote as a normal
prerequisite which forces re-generation on every run; change the dependency so
the plugin installation does not mark the rule out-of-date by either making
install-plugin-remote an order-only prerequisite (use | install-plugin-remote)
in the $(CRD_FILE) rule, or replace the phony with a stamp/installed-file
checked by the rule (e.g., create and depend on a plugin-install-stamp file
produced after verifying the binary exists), or guard the install-plugin-remote
recipe to skip go install when the plugin binary is already present; update the
rule that references $(CRD_FILE) and the install-plugin-remote target
accordingly.

125-125: protos-clean should not depend on vendor-proto.

A clean target triggering go mod vendor is semantically backwards — cleaning should only remove artifacts. Additionally, this dependency is already transitively satisfied through the protos-generate → compile-crd-server → $(CRD_FILE) → vendor-proto chain, so it's also redundant here.

Remove `vendor-proto` from `protos-clean` dependencies
-protos-clean: vendor-proto crd-clean
+protos-clean: crd-clean
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@data-models/Makefile` at line 125, The Makefile target "protos-clean"
incorrectly depends on "vendor-proto" which runs go mod vendor; update the
dependency list for the "protos-clean" target to remove "vendor-proto" so it
only depends on cleanup targets (e.g., "crd-clean"); keep "vendor-proto" in the
generation chain (it remains reachable via "protos-generate → compile-crd-server
→ $(CRD_FILE) → vendor-proto") and ensure the rule name "protos-clean" no longer
lists "vendor-proto" as a prerequisite.
data-models/protobufs/health_event.proto (1)

51-56: Clarify the relationship between HealthEventWithStatus and HealthEventResource.

HealthEventWithStatus bundles createdAt + HealthEvent + HealthEventStatus, while HealthEventResource (the CRD type) holds only HealthEvent as spec and HealthEventStatus as status — without createdAt. A brief comment on HealthEventWithStatus explaining its intended use case (e.g., database/non-CRD storage) and how it relates to the CRD type would help future maintainers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@data-models/protobufs/health_event.proto` around lines 51 - 56, Add a clear
comment above the HealthEventWithStatus message that states this type is a
convenience/DB-side wrapper (includes createdAt + HealthEvent +
HealthEventStatus) intended for non-CRD storage and processing, and explicitly
note that the CRD representation is HealthEventResource which stores HealthEvent
under spec and HealthEventStatus under status and therefore does NOT include
createdAt; mention any mapping expectations between HealthEventWithStatus and
HealthEventResource (e.g., drop createdAt when converting to the CRD spec/status
pair) so maintainers understand their relationship and intended uses.

Comment on lines +26 to +32
// Status represents operation and quarantine states for health events.

// Status represents operation and quarantine states for health events.
// Use string values to match Go constants exactly.
message Status {
string value = 1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Duplicate comment and missing field comment on Status.

Lines 26 and 28 contain the same comment: "Status represents operation and quarantine states for health events." Remove the duplicate.

Additionally, Status.value (Line 31) lacks a field-level comment explaining what values are expected, their semantics, and constraints. The comment block on lines 34-35 listing valid constants should be placed directly above the value field (or the field should reference the constants inline). As per coding guidelines: "Include comprehensive comments for all fields in Protocol Buffer messages."

Proposed fix
-// Status represents operation and quarantine states for health events.
-
 // Status represents operation and quarantine states for health events.
 // Use string values to match Go constants exactly.
 message Status {
+  // value holds the status string. Valid values: NotStarted, InProgress,
+  // Failed, Succeeded, AlreadyDrained, UnQuarantined, Quarantined,
+  // AlreadyQuarantined, Cancelled.
   string value = 1;
 }
-
-// Common status string constants (for documentation/reference):
-// NotStarted, InProgress, Failed, Succeeded, AlreadyDrained, UnQuarantined, Quarantined, AlreadyQuarantined, Cancelled
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@data-models/protobufs/health_event.proto` around lines 26 - 32, Remove the
duplicate top-line comment for the Status message and add a comprehensive
field-level comment for Status.value describing expected string values,
semantics, and constraints; move or repeat the list of valid constants (the
values referenced in the existing comment block) directly above or inline with
the value field so it's clear which constants Status.value may hold; update the
Status message comment and the value field comment to reference the exact Go
constants by name and clarify whether the field is required/optional and any
canonical forms.

Comment on lines +38 to +56
message OperationStatus {
Status status = 1;
string message = 2;
}

// HealthEventStatus represents the status fields for a health event.
message HealthEventStatus {
Status nodeQuarantined = 1;
OperationStatus userPodsEvictionStatus = 2;
google.protobuf.BoolValue faultRemediated = 3;
google.protobuf.Timestamp lastRemediationTimestamp = 4;
}

// HealthEventWithStatus combines HealthEvent and HealthEventStatus for CRD generation.
message HealthEventWithStatus {
google.protobuf.Timestamp createdAt = 1;
HealthEvent healthEvent = 2;
HealthEventStatus healthEventStatus = 3;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

New messages are missing per-field comments.

The following new messages have fields without any documentation:

  • OperationStatus: status, message
  • HealthEventStatus: nodeQuarantined, userPodsEvictionStatus, faultRemediated, lastRemediationTimestamp
  • HealthEventWithStatus: createdAt, healthEvent, healthEventStatus
  • HealthEventResource: spec, status

These fields are part of a CRD surface that will be visible to Kubernetes users (e.g., via kubectl describe). Document each field with its purpose, expected values, and semantics. As per coding guidelines: "Include comprehensive comments for all fields in Protocol Buffer messages."

Example for HealthEventStatus
 // HealthEventStatus represents the status fields for a health event.
 message HealthEventStatus {
+  // nodeQuarantined indicates the quarantine state of the node (e.g., Quarantined, UnQuarantined).
   Status nodeQuarantined = 1;
+  // userPodsEvictionStatus tracks the eviction progress for user pods on the affected node.
   OperationStatus userPodsEvictionStatus = 2;
+  // faultRemediated indicates whether the underlying fault has been remediated.
+  // Uses BoolValue wrapper to distinguish "not set" from "false".
   google.protobuf.BoolValue faultRemediated = 3;
+  // lastRemediationTimestamp records when the last remediation attempt occurred.
   google.protobuf.Timestamp lastRemediationTimestamp = 4;
 }

Also applies to: 125-135

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@data-models/protobufs/health_event.proto` around lines 38 - 56, Add
comprehensive per-field comments for the newly added protobuf messages so CRD
users can see clear documentation in kubectl; specifically update message
OperationStatus (document fields status — meaning and allowed values, and
message — human-readable details), HealthEventStatus (document nodeQuarantined —
what true/false means, userPodsEvictionStatus — semantics of OperationStatus
here, faultRemediated — when true and what it implies, lastRemediationTimestamp
— timestamp semantics), HealthEventWithStatus (document createdAt, healthEvent
and healthEventStatus fields), and the HealthEventResource message fields spec
and status (describe the desired state vs observed state and any invariants).
Put each comment immediately above the corresponding field using the same style
as other proto comments so they appear in generated CRD docs.

Comment on lines +30 to +31
spec:
nullable: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's find and examine the specific CRD file
find . -name "health_event.crd.yaml" -type f 2>/dev/null | head -5

Repository: NVIDIA/NVSentinel

Length of output: 140


🏁 Script executed:

# Look for the proto definition mentioned
find . -name "health_event.proto" -type f 2>/dev/null | head -5

Repository: NVIDIA/NVSentinel

Length of output: 103


🏁 Script executed:

# Check for CRD generator configuration
find . -name "*crd*" -o -name "*protobuf*" -o -name "*generate*" | grep -E "\.(yaml|yml|json|go)$" | head -20

Repository: NVIDIA/NVSentinel

Length of output: 512


🏁 Script executed:

# Check file sizes to know how much to read
wc -l ./distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml ./data-models/protobufs/health_event.proto ./data-models/pkg/model/health_event_crd.go

Repository: NVIDIA/NVSentinel

Length of output: 257


🏁 Script executed:

# Read the full CRD file to see the context
cat -n ./distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml

Repository: NVIDIA/NVSentinel

Length of output: 5692


🏁 Script executed:

# Read the proto definition
cat -n ./data-models/protobufs/health_event.proto

Repository: NVIDIA/NVSentinel

Length of output: 5090


🏁 Script executed:

# Read the CRD generator code
cat -n ./data-models/pkg/model/health_event_crd.go

Repository: NVIDIA/NVSentinel

Length of output: 1772


🏁 Script executed:

# Check other CRDs for comparison
cat -n ./node-drainer/pkg/customdrain/testdata/drainrequest_crd.yaml | head -50

Repository: NVIDIA/NVSentinel

Length of output: 1880


🏁 Script executed:

# Look for any CRD generation scripts or configuration
find . -name "*.go" -o -name "Makefile" -o -name "*.sh" | xargs grep -l "protoc-gen-crd\|health_event" | head -10

Repository: NVIDIA/NVSentinel

Length of output: 520


🏁 Script executed:

# Check the Makefile for CRD generation commands
cat -n ./data-models/Makefile | grep -A 10 -B 5 "crd\|protoc"

Repository: NVIDIA/NVSentinel

Length of output: 4583


🏁 Script executed:

# Check for any proto extensions or options that control CRD generation
rg -A 5 "protoc_gen_crd\|k8s_crd" ./data-models/protobufs/

Repository: NVIDIA/NVSentinel

Length of output: 43


🏁 Script executed:

# Look for other generated CRDs to see if this pattern is consistent
find . -name "*.crd.yaml" -type f -exec grep -l "nullable: true" {} \;

Repository: NVIDIA/NVSentinel

Length of output: 140


spec is marked as nullable: true — likely an artifact of the protoc-gen-crd generator and proto3's implicit field optionality.

Making spec nullable allows creating a HealthEventResource with no spec at all (spec: null). This is generated from proto3 message fields without explicit optional markers, which the generator converts to nullable OpenAPI schemas. Kubernetes conventions typically enforce spec as required.

Add required: ["spec"] (and status, which has the same issue on line 123) at the top-level properties in the OpenAPI schema to enforce that these fields must be present. Alternatively, consider updating the proto definition or generator configuration if regenerating the CRD.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`
around lines 30 - 31, The CRD's top-level schema currently allows spec (and
similarly status) to be null; update the HealthEventResource CRD
(health_event.crd.yaml) so the root OpenAPI schema declares required:
["spec","status"] under the top-level properties and remove or override
nullable: true for the spec and status properties (or at minimum keep nullable
false) to prevent creating objects with spec: null/status: null; target the
top-level schema entries named "spec" and "status" in the YAML and add the
required array there (or adjust the proto/gen config if you prefer
regenerating).

Comment on lines +89 to +91
generatedTimestamp:
type: string
format: RFC3339
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

Kubernetes CRD openAPIV3Schema valid string formats date-time RFC3339

💡 Result:

In a Kubernetes CRD openAPIV3Schema, an RFC3339 timestamp is represented as an OpenAPI/JSON-Schema string with format: date-time:

schema:
  openAPIV3Schema:
    type: object
    properties:
      spec:
        type: object
        properties:
          startsAt:
            type: string
            format: date-time
  • The OpenAPI date-time string format is defined as RFC3339 “date-time” (with timezone). [1][2]
  • Kubernetes CRD validation guidance explicitly uses type: string + format: date-time for date-time fields (and then optionally compares them via CEL rules). [3]
  • Kubernetes’ built-in metav1.Time is published as OpenAPI string with format date-time and marshals in RFC3339. [4]

Example valid values: 2026-02-20T13:45:30Z, 2026-02-20T13:45:30+00:00 (note +0000 is not RFC3339). [1]

Sources:
[1] OpenAPI Format Registry: date-time (RFC3339)
[2] OpenAPI Spec v3.0.4 (date-time per RFC3339)
[3] Kubernetes blog: CRD validation rules (example uses format: date-time)
[4] Kubernetes apimachinery Time OpenAPI schema format and RFC3339 formatting


🏁 Script executed:

find . -path "*/distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml" -type f

Repository: NVIDIA/NVSentinel

Length of output: 140


🏁 Script executed:

cat -n ./distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml | sed -n '80,100p'

Repository: NVIDIA/NVSentinel

Length of output: 890


🏁 Script executed:

cat -n ./distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml | sed -n '140,160p'

Repository: NVIDIA/NVSentinel

Length of output: 652


format: RFC3339 is not a standard OpenAPI format — use date-time for proper Kubernetes validation.

Kubernetes CRD validation uses the OpenAPI 3.0 specification, which defines RFC 3339 timestamps with the format identifier date-time, not RFC3339. Kubernetes will silently ignore the unrecognized RFC3339 value and skip schema validation on these fields.

The same issue applies to lastRemediationTimestamp on line 149.

♻️ Proposed fix
                generatedTimestamp:
                  type: string
-                  format: RFC3339
+                  format: date-time
                lastRemediationTimestamp:
                  type: string
-                  format: RFC3339
+                  format: date-time

Since this YAML is generated by protoc-gen-crd, this should be fixed in the CRD generator configuration.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
generatedTimestamp:
type: string
format: RFC3339
generatedTimestamp:
type: string
format: date-time
Suggested change
generatedTimestamp:
type: string
format: RFC3339
lastRemediationTimestamp:
type: string
format: date-time
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`
around lines 89 - 91, The CRD uses a nonstandard OpenAPI format value "RFC3339"
which Kubernetes will ignore; change the schema entries for generatedTimestamp
and lastRemediationTimestamp to use format: date-time instead of RFC3339, and
update the protoc-gen-crd generator configuration/templates so future generated
YAML emits "date-time" for timestamp fields (search for generatedTimestamp and
lastRemediationTimestamp in health_event.crd.yaml and in the generator templates
to apply the fix).

Comment on lines 118 to 124
nodeQuarantined := healthEventWithStatus.HealthEventStatus.NodeQuarantined

if nodeQuarantined != nil {
if *nodeQuarantined == model.UnQuarantined || *nodeQuarantined == model.Cancelled {
return r.handleCancellationEvent(ctx, nodeName, *nodeQuarantined, r.Watcher, event.ResumeToken)
if nodeQuarantined.Value == string(model.UnQuarantined) || nodeQuarantined.Value == string(model.Cancelled) {
return r.handleCancellationEvent(ctx, nodeName, model.Status(nodeQuarantined.Value), r.Watcher, event.ResumeToken)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for ParseHealthEventFromEvent implementation
rg -n "func ParseHealthEventFromEvent" --type go -A 30

Repository: NVIDIA/NVSentinel

Length of output: 2344


🏁 Script executed:

#!/bin/bash
# Check HealthEventWithStatus and HealthEventStatus definitions
rg -n "type HealthEventWithStatus\|type HealthEventStatus" --type go -A 5 -B 2

Repository: NVIDIA/NVSentinel

Length of output: 43


🏁 Script executed:

#!/bin/bash
# Look at the actual code around line 118-124 in reconciler.go
cat -n fault-remediation/pkg/reconciler/reconciler.go | sed -n '110,150p'

Repository: NVIDIA/NVSentinel

Length of output: 1820


🏁 Script executed:

#!/bin/bash
# Search for the nil guard at line 142 mentioned in review
rg -n "HealthEventStatus != nil" fault-remediation/pkg/reconciler/reconciler.go -A 3 -B 1

Repository: NVIDIA/NVSentinel

Length of output: 360


🏁 Script executed:

#!/bin/bash
# Find HealthEventWithStatus struct definition
fd . data-models/ -name "*.go" | xargs grep -l "HealthEventWithStatus" | head -5

Repository: NVIDIA/NVSentinel

Length of output: 291


🏁 Script executed:

#!/bin/bash
# Search more broadly for HealthEventWithStatus definition
rg -n "type HealthEventWithStatus" --type go

Repository: NVIDIA/NVSentinel

Length of output: 305


🏁 Script executed:

#!/bin/bash
# Get full ParseHealthEventFromEvent implementation
cat -n commons/pkg/eventutil/parser.go | head -100

Repository: NVIDIA/NVSentinel

Length of output: 3641


🏁 Script executed:

#!/bin/bash
# Check where HealthEventWithStatus is instantiated
rg -n "HealthEventWithStatus{" --type go -A 3

Repository: NVIDIA/NVSentinel

Length of output: 168


🏁 Script executed:

#!/bin/bash
# Check HealthEventWithStatus struct definition in data-models
cat -n data-models/pkg/model/health_event_extentions.go | sed -n '40,60p'

Repository: NVIDIA/NVSentinel

Length of output: 533


🏁 Script executed:

#!/bin/bash
# Also check the protos definition
cat -n data-models/pkg/protos/health_event.pb.go | sed -n '330,345p'

Repository: NVIDIA/NVSentinel

Length of output: 1011


🏁 Script executed:

#!/bin/bash
# Check store-client definition
cat -n store-client/pkg/datastore/types.go | sed -n '93,110p'

Repository: NVIDIA/NVSentinel

Length of output: 937


🏁 Script executed:

#!/bin/bash
# Check the HealthEventStatus type to understand if it's nil by default
cat -n data-models/pkg/protos/health_event.proto | head -50

Repository: NVIDIA/NVSentinel

Length of output: 134


🏁 Script executed:

#!/bin/bash
# Verify if there's documentation about when HealthEventStatus is populated
rg -n "HealthEventStatus" data-models/pkg/protos/health_event.proto -A 2 -B 2

Repository: NVIDIA/NVSentinel

Length of output: 142


🏁 Script executed:

#!/bin/bash
# Find proto files
find . -name "*.proto" -type f | head -20

Repository: NVIDIA/NVSentinel

Length of output: 181


🏁 Script executed:

#!/bin/bash
# Look for health_event definitions in proto directory
fd "health_event" --type f

Repository: NVIDIA/NVSentinel

Length of output: 1234


🏁 Script executed:

#!/bin/bash
# Check line 73 context in parser.go more carefully
cat -n commons/pkg/eventutil/parser.go | sed -n '70,80p'

Repository: NVIDIA/NVSentinel

Length of output: 445


🏁 Script executed:

#!/bin/bash
# Check what's in the health_event.proto file
cat -n data-models/protobufs/health_event.proto | head -100

Repository: NVIDIA/NVSentinel

Length of output: 3867


🏁 Script executed:

#!/bin/bash
# Search for where HealthEventStatus is created/initialized
rg -n "HealthEventStatus\s*=" --type go -B 2 -A 2 | head -50

Repository: NVIDIA/NVSentinel

Length of output: 43


🏁 Script executed:

#!/bin/bash
# Check the ParseHealthEventFromEvent more carefully - is the HealthEventStatus dereference safe?
cat -n commons/pkg/eventutil/parser.go | sed -n '60,85p'

Repository: NVIDIA/NVSentinel

Length of output: 909


🏁 Script executed:

#!/bin/bash
# Verify if there are other code paths creating HealthEventWithStatus without going through ParseHealthEventFromEvent
rg -n "HealthEventWithStatus\s*{" --type go -B 2 -A 5

Repository: NVIDIA/NVSentinel

Length of output: 174


🏁 Script executed:

#!/bin/bash
# Verify if there are other code paths creating HealthEventWithStatus
rg -n "HealthEventWithStatus" --type go | grep -E ":\s*=" | head -20

Repository: NVIDIA/NVSentinel

Length of output: 2315


🏁 Script executed:

#!/bin/bash
# Check if HealthEventStatus is ever guaranteed to be non-nil elsewhere in reconciler.go
cat -n fault-remediation/pkg/reconciler/reconciler.go | sed -n '100,130p'

Repository: NVIDIA/NVSentinel

Length of output: 1372


Nil pointer dereference risk on HealthEventStatus at line 118.

Line 118 dereferences healthEventWithStatus.HealthEventStatus.NodeQuarantined without checking if HealthEventStatus itself is nil. Although the nil check at line 120 guards nodeQuarantined, it does not guard the HealthEventStatus field. Since HealthEventStatus is a pointer in the protobuf definition and can be nil when unmarshaled from events lacking that field, this dereference will panic if HealthEventStatus is nil.

This is confirmed by the guard at line 142 (if healthEventWithStatus.HealthEventStatus != nil), which proves HealthEventStatus can be nil in the codebase post-parse.

🛡️ Proposed fix
- nodeQuarantined := healthEventWithStatus.HealthEventStatus.NodeQuarantined

- if nodeQuarantined != nil {
-     if nodeQuarantined.Value == string(model.UnQuarantined) || nodeQuarantined.Value == string(model.Cancelled) {
-         return r.handleCancellationEvent(ctx, nodeName, model.Status(nodeQuarantined.Value), r.Watcher, event.ResumeToken)
-     }
- }
+ if healthEventWithStatus.HealthEventStatus != nil {
+     nodeQuarantined := healthEventWithStatus.HealthEventStatus.NodeQuarantined
+     if nodeQuarantined != nil {
+         if nodeQuarantined.Value == string(model.UnQuarantined) || nodeQuarantined.Value == string(model.Cancelled) {
+             return r.handleCancellationEvent(ctx, nodeName, model.Status(nodeQuarantined.Value), r.Watcher, event.ResumeToken)
+         }
+     }
+ }

Also, the TODO comment at line 141 names a developer ("Avinash") instead of referencing an issue. Replace with // TODO - #<issue-number> add a helper function... per coding guidelines.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
nodeQuarantined := healthEventWithStatus.HealthEventStatus.NodeQuarantined
if nodeQuarantined != nil {
if *nodeQuarantined == model.UnQuarantined || *nodeQuarantined == model.Cancelled {
return r.handleCancellationEvent(ctx, nodeName, *nodeQuarantined, r.Watcher, event.ResumeToken)
if nodeQuarantined.Value == string(model.UnQuarantined) || nodeQuarantined.Value == string(model.Cancelled) {
return r.handleCancellationEvent(ctx, nodeName, model.Status(nodeQuarantined.Value), r.Watcher, event.ResumeToken)
}
}
if healthEventWithStatus.HealthEventStatus != nil {
nodeQuarantined := healthEventWithStatus.HealthEventStatus.NodeQuarantined
if nodeQuarantined != nil {
if nodeQuarantined.Value == string(model.UnQuarantined) || nodeQuarantined.Value == string(model.Cancelled) {
return r.handleCancellationEvent(ctx, nodeName, model.Status(nodeQuarantined.Value), r.Watcher, event.ResumeToken)
}
}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/reconciler/reconciler.go` around lines 118 - 124,
healthEventWithStatus.HealthEventStatus is dereferenced without a nil check
causing a potential panic; before reading HealthEventStatus.NodeQuarantined,
guard that healthEventWithStatus.HealthEventStatus != nil (e.g., check and
assign nodeQuarantined only after that nil check) and then keep the existing
logic that calls r.handleCancellationEvent(ctx, nodeName,
model.Status(nodeQuarantined.Value), r.Watcher, event.ResumeToken) when
nodeQuarantined is non-nil and its Value equals UnQuarantined or Cancelled;
additionally replace the TODO that names a developer with the required issue
reference format (// TODO - #<issue-number> add a helper function...) per
guidelines.

golang.org/x/sys v0.38.0 // indirect
golang.org/x/text v0.31.0 // indirect
google.golang.org/genproto/googleapis/rpc v0.0.0-20251124214823-79d6a2a48846 // indirect
github.com/yandex/protoc-gen-crd v1.1.0 // indirect
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Explore repository structure
ls -la | head -20
find . -maxdepth 2 -type f -name "go.mod" | head -10

Repository: NVIDIA/NVSentinel

Length of output: 1467


🏁 Script executed:

#!/bin/bash
# Check the specific nccl-loopback go.mod file
cat preflight-checks/nccl-loopback/go.mod

Repository: NVIDIA/NVSentinel

Length of output: 890


🏁 Script executed:

#!/bin/bash
# Check data-models go.mod file
cat data-models/go.mod

Repository: NVIDIA/NVSentinel

Length of output: 1379


🏁 Script executed:

#!/bin/bash
# Search for protoc-gen-crd imports in generated .pb.go files
find . -name "*.pb.go" -type f | head -20
rg "yandex/protoc-gen-crd" --type=go -B2 -A2

Repository: NVIDIA/NVSentinel

Length of output: 1458


🏁 Script executed:

#!/bin/bash
# Search for CRD annotations in proto files
find . -name "*.proto" -type f | xargs grep -l "k8s_crd" 2>/dev/null | head -10

Repository: NVIDIA/NVSentinel

Length of output: 103


🏁 Script executed:

#!/bin/bash
# Check health_event.proto for k8s_crd option usage
cat data-models/protobufs/health_event.proto | head -60

Repository: NVIDIA/NVSentinel

Length of output: 2209


🏁 Script executed:

#!/bin/bash
# Check nccl-loopback's actual imports from data-models
find preflight-checks/nccl-loopback -name "*.go" -type f | xargs grep -l "data-models" | head -5

Repository: NVIDIA/NVSentinel

Length of output: 166


🏁 Script executed:

#!/bin/bash
# Check what's imported from data-models in nccl-loopback
grep -r "github.com/nvidia/nvsentinel/data-models" preflight-checks/nccl-loopback --include="*.go"

Repository: NVIDIA/NVSentinel

Length of output: 282


protoc-gen-crd build-tool package has leaked into a runtime preflight check module.

The dependency chain is confirmed: nccl-loopback imports types from data-models/pkg/protos, which contains health_event.pb.go. This generated file imports github.com/yandex/protoc-gen-crd/library/go/k8s/protoc_gen_crd/proto because the original health_event.proto includes the CRD proto descriptor to support k8s_crd message options.

Since nccl-loopback is a runtime preflight check tool with no CRD generation responsibility, carrying a transitive dependency on a protobuf compiler plugin package:

  • Unnecessarily couples the binary to Kubernetes CRD tooling.
  • Increases supply-chain exposure.
  • Violates module minimality principles.

The root cause is that CRD-annotated protos and base data-model protos coexist in the same data-models module, making it impossible for consumers to import runtime types without pulling in CRD generation infrastructure.

Factor CRD-specific proto options into a dedicated sub-package or separate module within data-models so runtime consumers like nccl-loopback can import base types without the CRD generation dependency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-loopback/go.mod` at line 15, The build-tool package
github.com/yandex/protoc-gen-crd leaked into the runtime binary because
data-models’ generated file health_event.pb.go includes CRD proto options; fix
by separating CRD-annotated protos into a dedicated package/module so runtime
consumers (nccl-loopback) can depend only on base types: create a new
data-models submodule (e.g., data-models/base or data-models/proto-runtime) that
contains the plain health_event proto (without k8s_crd imports/options),
regenerate protobufs for that module so health_event.pb.go no longer imports
github.com/yandex/protoc-gen-crd, update nccl-loopback to import the new base
module/package, and remove the transitive protoc-gen-crd dependency from the
runtime module’s go.mod.

@yavinash007 yavinash007 force-pushed the k8s-data-store branch 2 times, most recently from ac6d456 to 7298063 Compare February 24, 2026 07:08
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
store-client/pkg/testutils/event_builder.go (1)

15-15: ⚠️ Potential issue | 🟡 Minor

Add a package-level godoc comment.

The testutils package has no doc comment. As per coding guidelines, package-level godoc is required for all Go packages.

📝 Proposed fix
+// Package testutils provides database-agnostic event construction utilities for use in tests.
 package testutils

As per coding guidelines: "Package-level godoc required for all Go packages".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@store-client/pkg/testutils/event_builder.go` at line 15, Add a package-level
godoc comment for the testutils package by inserting a descriptive comment block
immediately above the "package testutils" declaration; the comment should
briefly describe the purpose of the testutils package (what helpers/utilities it
provides and intended audience/usage) so it satisfies the package-level
documentation requirement.
node-drainer/pkg/reconciler/reconciler.go (1)

151-157: ⚠️ Potential issue | 🔴 Critical

New nil-pointer dereference on UserPodsEvictionStatus in PreprocessAndEnqueueEvent

healthEventWithStatus is populated via json.Unmarshal from a raw MongoDB document. If userpodsevictionstatus is absent or null in that document, healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus will be a nil *protos.OperationStatus. Accessing .Status at lines 151 and 154 will then panic.

This specific occurrence in PreprocessAndEnqueueEvent is not covered by the existing review comment on the same pattern in executeSkip, executeMarkAlreadyDrained, executeUpdateStatus, and handleCancelledEvent.

🐛 Proposed fix
-	if isTerminalStatus(model.Status(healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status)) {
-		slog.Debug("Skipping event - already in terminal state",
-			"node", healthEventWithStatus.HealthEvent.NodeName,
-			"status", healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status)
-
-		return nil
-	}
+	if evictionStatus := healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus; evictionStatus != nil &&
+		isTerminalStatus(model.Status(evictionStatus.Status)) {
+		slog.Debug("Skipping event - already in terminal state",
+			"node", healthEventWithStatus.HealthEvent.NodeName,
+			"status", evictionStatus.Status)
+
+		return nil
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 151 - 157,
PreprocessAndEnqueueEvent dereferences
healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus without guarding
for nil, which can panic when the MongoDB document omits userpodsevictionstatus;
update PreprocessAndEnqueueEvent to check that
healthEventWithStatus.HealthEventStatus != nil and
healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus != nil before
calling isTerminalStatus or accessing .Status, and only call slog.Debug with the
status when the pointer is non-nil (otherwise treat as non-terminal or log that
the eviction status is missing) so the function no longer panics on null/missing
UserPodsEvictionStatus.
♻️ Duplicate comments (11)
tilt/simple-health-client/go.mod (1)

14-18: Verify protoc-gen-crd is truly required at runtime

Same concern as prior: if this is only for CRD annotations in generated protos, consider keeping it build-only to avoid leaking tooling into the client module.

#!/bin/bash
# Check where protoc-gen-crd is imported or referenced.
rg -n "protoc-gen-crd" -g '*.go'
rg -n "yandex/protoc-gen-crd" -g 'go.mod'

Based on learnings: Keep Go dependencies minimal and up-to-date.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tilt/simple-health-client/go.mod` around lines 14 - 18, The go.mod entry for
github.com/yandex/protoc-gen-crd appears to be a tooling dependency (protoc
plugin) rather than a runtime import; confirm there are no imports of
"protoc-gen-crd" in Go sources (search for protoc-gen-crd or
yandex/protoc-gen-crd), and if it is only used for code generation remove it
from the module's runtime deps—either delete the indirect line from go.mod or
move it into a dedicated tooling declaration (create a tools.go with a build tag
like // +build tools and a blank import of github.com/yandex/protoc-gen-crd, or
put the plugin into a separate tools/go.mod) so the client module does not carry
the generator as a runtime dependency; update go.sum accordingly and run go mod
tidy to finalize.
data-models/protobufs/health_event.proto (1)

30-44: ⚠️ Potential issue | 🟡 Minor

Add per-field comments for the new proto messages.

These fields are part of the CRD surface but are undocumented. Please add field-level comments for OperationStatus, HealthEventStatus, HealthEventWithStatus, and HealthEventResource.

✍️ Suggested field comments
 message OperationStatus {
+  // status indicates the operation state (e.g., NotStarted, InProgress, Succeeded, Failed).
   string status = 1;
+  // message provides human-readable details for the operation status.
   string message = 2;
 }

 message HealthEventStatus {
+  // nodeQuarantined captures the node quarantine state (e.g., Quarantined, UnQuarantined).
   string nodeQuarantined = 1;
+  // quarantineFinishTimestamp records when quarantine completed.
   google.protobuf.Timestamp quarantineFinishTimestamp = 2;
+  // userPodsEvictionStatus tracks eviction progress for user pods.
   OperationStatus userPodsEvictionStatus = 3;
+  // drainFinishTimestamp records when drain completed.
   google.protobuf.Timestamp drainFinishTimestamp = 4;
+  // faultRemediated indicates whether the fault was remediated (BoolValue distinguishes unset).
   google.protobuf.BoolValue faultRemediated = 5;
+  // lastRemediationTimestamp records the last remediation attempt time.
   google.protobuf.Timestamp lastRemediationTimestamp = 6;
 }

 message HealthEventWithStatus {
+  // createdAt records when the health event was created.
   google.protobuf.Timestamp createdAt = 1;
+  // healthEvent is the event spec payload.
   HealthEvent healthEvent = 2;
+  // healthEventStatus is the observed status for the event.
   HealthEventStatus healthEventStatus = 3;
 }

 message HealthEventResource {
   option (protoc_gen_crd.k8s_crd) = {
     api_group: "healthevents.dgxc.nvidia.com",
     kind: "HealthEventResource",
     plural: "healtheventresources",
     singular: "healtheventresource",
     categories: ["nvidia", "gpu"]
   };
+  // spec defines the desired state for the health event.
   HealthEvent spec = 1;
+  // status reflects the observed state for the health event.
   HealthEventStatus status = 2;
 }

As per coding guidelines, "Include comprehensive comments for all fields in Protocol Buffer messages."

Also applies to: 46-51, 118-129

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@data-models/protobufs/health_event.proto` around lines 30 - 44, The new
protobuf messages lack per-field comments; add clear field-level comments for
OperationStatus, HealthEventStatus, HealthEventWithStatus, and
HealthEventResource describing purpose and semantics of each field (e.g., what
"status" and "message" in OperationStatus represent, meaning of nodeQuarantined,
quarantineFinishTimestamp, userPodsEvictionStatus, drainFinishTimestamp,
faultRemediated, lastRemediationTimestamp in HealthEventStatus, and each field
in HealthEventWithStatus and HealthEventResource). Locate the messages by name
in the proto and add concise, guideline-compliant comments above each field to
document expected values, units (timestamps), and any nullable/optional
semantics.
distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml (2)

89-91: ⚠️ Potential issue | 🟡 Minor

format: RFC3339 is not a recognized OpenAPI 3.0 format — should be date-time.

Kubernetes CRD validation uses OpenAPI 3.0 spec, where RFC 3339 timestamps use the format identifier date-time. The unrecognized RFC3339 value will be silently ignored, bypassing schema validation on generatedTimestamp (line 91), quarantineFinishTimestamp (line 130), drainFinishTimestamp (line 141), and lastRemediationTimestamp (line 147). Fix this in the protoc-gen-crd generator configuration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`
around lines 89 - 91, The CRD uses an invalid OpenAPI format "RFC3339" for
timestamp fields; update the protoc-gen-crd generator configuration so those
fields emit format: "date-time" instead. Locate the generator template/config
that produces the fields named generatedTimestamp, quarantineFinishTimestamp,
drainFinishTimestamp, and lastRemediationTimestamp and replace the RFC3339
format token with the OpenAPI 3.0 compliant "date-time" token so schema
validation is enforced.

29-31: ⚠️ Potential issue | 🟡 Minor

spec and status are both nullable: true — allows creating resources with spec: null.

Kubernetes conventions typically require spec to be present. The nullable: true on both spec (line 31) and status (line 123) comes from proto3 implicit optionality, but should be overridden to enforce resource validity. Consider adding required: ["spec"] at the top-level schema or adjusting the generator config.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`
around lines 29 - 31, The CRD schema currently marks properties.spec and
properties.status as nullable, allowing creation of resources with spec: null;
update the top-level OpenAPI schema so that spec is required (add required:
["spec"] at the top-level schema) and remove or set nullable: false for
properties.spec (and optionally properties.status) to enforce presence and
validity; adjust the CRD generator configuration if needed so proto3 optionality
does not emit nullable: true for spec/status.
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

27-28: ⚠️ Potential issue | 🔴 Critical

Same missing crd_pb2 dependency as in the dcgm-diag module — runtime import failure.

Identical to the issue in preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py: the Go-style crd_pb2 import path does not resolve in Python. All downstream code importing this module will fail at runtime.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py`
around lines 27 - 28, The generated import for crd_pb2 in health_event_pb2.py
uses an invalid Go-style path
(github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto.crd_pb2)
which causes runtime ImportError; update the import in health_event_pb2.py to
the correct Python package/module name used in this repo (or add a guarded
try/except that falls back to the local crd_pb2 module), i.e., replace or wrap
the failing import line that references
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
so that the module imports the actual crd_pb2 Python module available at
runtime.
preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi (1)

5-6: ⚠️ Potential issue | 🟡 Minor

crd_pb2 import in .pyi stub will break type-checking tools.

Same root cause as the .py files — the Go-style import path doesn't resolve in Python. While this won't block runtime, it will cause errors in mypy/pyright if type checking is run in CI. This will be fixed when the .py import issue is resolved.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi`
around lines 5 - 6, The .pyi stub imports a Go-style path
"github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto.crd_pb2"
(bound to _crd_pb2) which breaks type-checkers; change the import in
health_event_pb2.pyi to match the valid Python module path used by the
corresponding .py (or a relative import) so that _crd_pb2 resolves for
mypy/pyright, or remove/replace the annotation with a forward-declared type if
the module isn’t needed for type checking; update the import that defines
_crd_pb2 in the top-level of the stub to the correct Python package name
consistent with the generated .py files.
preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py (1)

27-28: ⚠️ Potential issue | 🔴 Critical

Missing crd_pb2 Python module — runtime ModuleNotFoundError on import.

Line 28 imports crd_pb2 via a Go-style module path (github.com.yandex.protoc_gen_crd...) that does not exist in the Python environment. This will cause a ModuleNotFoundError at runtime, blocking all modules that import health_event_pb2 (e.g., health.py, errors.py, config.py, __main__.py).

Either vendor the compiled crd_pb2.py Python module under the expected package path and add it to pyproject.toml, or regenerate the protobuf stubs with a protoc configuration that avoids the Go-only CRD import for Python targets.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py` around lines
27 - 28, The generated health_event_pb2.py imports a Go-style module name
(github...crd_pb2) that doesn't exist in Python and causes ModuleNotFoundError;
fix by either vendoring a Python crd_pb2 module under the expected package path
so the import in health_event_pb2.py resolves (and add it to
pyproject.toml/packaging), or re-run protoc to regenerate the protobuf stubs for
Python without the Go CRD import (adjust proto import paths/options so crd_pb2
is a proper Python package import); update health_event_pb2.py only by replacing
the invalid import with the correct Python package path if you vendor the
module.
api/Makefile (1)

102-116: Missing clean phony target — add an alias for convention compliance.

The protos-clean target exists but there's no top-level clean target, which is expected by the repo's Makefile conventions (flagged by checkmake). Additionally, test (line 112) runs without -race, diverging from other modules like data-models that use gotestsum with -race and coverage.

♻️ Proposed additions
 .PHONY: lint-test
 lint-test: lint test ## Run lint and tests.
 	`@echo` "Lint and tests completed."
+
+.PHONY: clean
+clean: protos-clean ## Remove generated code.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/Makefile` around lines 102 - 116, Add a top-level phony clean target as
an alias to the existing protos-clean to satisfy checkmake conventions: declare
.PHONY: clean and implement clean: protos-clean (so callers can run make clean).
Also update the test target to run tests with the race detector and match other
modules—replace the plain go test invocation in the test target with a gotestsum
(or go test) invocation that includes -race (e.g., gotestsum -- -race ./... or
go test -race ./...) and keep verbose output; reference the existing
targets/names protos-clean and test and tools gotestsum and the -race flag when
making these edits.
preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py (1)

27-28: ⚠️ Potential issue | 🔴 Critical

Same missing crd_pb2 dependency — identical to dcgm-diag and gpu-health-monitor.

All three Python modules share the same generated health_event_pb2.py with the unresolvable Go-style crd_pb2 import. Fix this at the protobuf generation level so all Python targets are corrected simultaneously.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py`
around lines 27 - 28, The generated health_event_pb2.py contains a broken
Go-style import for crd_pb2; fix the .proto generation so Python outputs a
proper python import for the CRD proto: add or set option python_package =
"desired.python.module.name" (or ensure the CRD proto defines a python_package)
in the CRD proto and regenerate so that health_event_pb2 imports crd_pb2 as a
normal Python module (not the Go path), or adjust the protobuf generation rule
(protoc invocation) to emit crd_pb2 into the same Python package/module
namespace used by health_event_pb2; target the health_event_pb2.py file and the
CRD proto (crd.proto / crd_pb2) when making this change.
Makefile (1)

281-281: ⚠️ Potential issue | 🟡 Minor

protos-generate no longer depends on protos-clean — stale files may persist.

Removing the protos-clean prerequisite means orphaned .pb.go files from renamed or deleted .proto messages won't be cleaned before regeneration. The author has acknowledged this needs refactoring due to vendor dependencies. Ensure this is addressed before merging to avoid subtle build issues from stale generated code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` at line 281, The Makefile target protos-generate currently no
longer depends on protos-clean, which allows stale .pb.go files to persist;
restore cleaning by reintroducing protos-clean as a prerequisite of
protos-generate or by invoking the cleaning logic at the start of the
protos-generate recipe (e.g., call the protos-clean target or run the same
cleanup commands before generation), ensuring any renamed or deleted
.proto-derived files are removed prior to running the generate steps so
generated outputs remain consistent.
node-drainer/pkg/reconciler/reconciler.go (1)

375-380: Nil-pointer dereferences on UserPodsEvictionStatus — already flagged

The patterns at lines 376–377, 509–510, 519–520, and 780–781 (podsEvictionStatus := healthEvent.HealthEventStatus.UserPodsEvictionStatus followed by podsEvictionStatus.Status = ...) all dereference a nil *protos.OperationStatus if the field is unset.

Also applies to: 509-513, 519-520, 780-784

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 375 - 380, The code
dereferences healthEvent.HealthEventStatus.UserPodsEvictionStatus (assigned to
podsEvictionStatus) without checking for nil, causing possible nil-pointer
panics; before setting podsEvictionStatus.Status (in the blocks around
model.UnQuarantined and the other occurrences), add a nil check and initialize a
new protos.OperationStatus when nil (so podsEvictionStatus =
&protos.OperationStatus{Status: ...} or set Status after ensuring non-nil), then
pass the non-nil podsEvictionStatus into r.updateNodeUserPodsEvictedStatus;
apply this pattern to all occurrences referenced (the podsEvictionStatus
assignments near updateNodeUserPodsEvictedStatus calls and similar blocks).
🧹 Nitpick comments (5)
store-client/pkg/testutils/event_builder.go (1)

36-36: Optional: drop the model.Status dependency in favour of a plain string.

quarantineStatus is typed as model.Status across the struct, the builder field, the default value in NewTestEventBuilder, and the WithQuarantineStatus method, but protos.HealthEventStatus.NodeQuarantined is a plain string. Since model.Status is itself a type … string, this creates an avoidable coupling to the legacy model package in a pure test-utility file.

♻️ Proposed refactor
 type TestEventBuilder struct {
     ...
-    quarantineStatus model.Status
+    quarantineStatus string
 }

 func NewTestEventBuilder() *TestEventBuilder {
     return &TestEventBuilder{
-        quarantineStatus: model.Status("NotQuarantined"),
+        quarantineStatus: "NotQuarantined",
         ...
     }
 }

-func (b *TestEventBuilder) WithQuarantineStatus(status model.Status) *TestEventBuilder {
+func (b *TestEventBuilder) WithQuarantineStatus(status string) *TestEventBuilder {
     b.quarantineStatus = status
     return b
 }

 type testEvent struct {
     ...
-    quarantineStatus model.Status
+    quarantineStatus string
 }

The UnmarshalDocument call on line 133 then becomes simply NodeQuarantined: e.quarantineStatus without the explicit cast.

Also applies to: 82-85, 108-108

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@store-client/pkg/testutils/event_builder.go` at line 36, Change the
quarantineStatus field and its API from model.Status to plain string: update the
struct field quarantineStatus, the NewTestEventBuilder default value, and the
WithQuarantineStatus(builder) method signature to use string instead of
model.Status, and adjust usages (e.g. the UnmarshalDocument call where
NodeQuarantined is set) to assign e.quarantineStatus directly without casting;
update any other references in this file (including the builder field and
default at NewTestEventBuilder and the WithQuarantineStatus implementation) to
remove the legacy model dependency.
health-events-analyzer/pkg/parser/parser_test.go (1)

419-419: Remove stale commented-out code.

It adds noise and can drift from actual behavior.

🧹 Suggested cleanup
-//	UserPodsEvictionStatus: &protos.OperationStatus{Status: &protos.Status{Value: string(model.StatusSucceeded)}},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-events-analyzer/pkg/parser/parser_test.go` at line 419, Remove the
stale commented-out line containing "UserPodsEvictionStatus:
&protos.OperationStatus{Status: &protos.Status{Value:
string(model.StatusSucceeded)}},” from parser_test.go (it’s dead/commented test
data), keeping the surrounding test code intact and ensuring the test still
compiles and passes; specifically remove that commented entry referencing
UserPodsEvictionStatus, protos.OperationStatus, protos.Status and
model.StatusSucceeded to eliminate noise.
distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml (1)

149-149: Empty additionalPrinterColumns — consider adding useful columns for kubectl get.

With no additional printer columns, kubectl get healtheventresources will only show NAME and AGE. Adding columns like nodeName, checkName, isHealthy, or componentClass from .spec would significantly improve operational usability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`
at line 149, The CRD currently sets additionalPrinterColumns: [] so kubectl get
healtheventresources only shows NAME and AGE; update the CRD's
additionalPrinterColumns array to declare useful columns (e.g., nodeName mapping
to .spec.nodeName, checkName to .spec.checkName, isHealthy to .spec.isHealthy,
componentClass to .spec.componentClass) with appropriate JSONPaths, types
(string/boolean), and description entries so those fields appear in `kubectl
get` output; modify the additionalPrinterColumns block in the
health_event.crd.yaml (the additionalPrinterColumns entry) to include these
column definitions.
data-models/Makefile (1)

68-71: install-plugin-remote is a PHONY target used as a prerequisite of $(CRD_FILE) — it re-runs go install on every build.

Since install-plugin-remote is .PHONY, it will always execute when $(CRD_FILE) is built, even if the plugin is already installed at the correct version. Consider using a stamp file or checking the installed version first.

♻️ Proposed improvement
-.PHONY: install-plugin-remote
-install-plugin-remote:
+# Stamp file to avoid re-installing on every build
+PROTOC_GEN_CRD_STAMP = .protoc-gen-crd-$(PROTOC_GEN_CRD_VERSION).stamp
+
+$(PROTOC_GEN_CRD_STAMP):
 	`@echo` "Installing protoc-gen-crd from github..."
 	GO111MODULE=on go install github.com/yandex/protoc-gen-crd/cmd/protoc-gen-crd@$(PROTOC_GEN_CRD_VERSION)
+	`@touch` $@
+
+.PHONY: install-plugin-remote
+install-plugin-remote: $(PROTOC_GEN_CRD_STAMP)

Then update the $(CRD_FILE) prerequisite to use $(PROTOC_GEN_CRD_STAMP) instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@data-models/Makefile` around lines 68 - 71, The install-plugin-remote PHONY
target always re-runs go install for
github.com/yandex/protoc-gen-crd@$(PROTOC_GEN_CRD_VERSION); change this to
create and update a stamp file (e.g. $(PROTOC_GEN_CRD_STAMP)) only when
installation/version changes: implement install-plugin-remote to check the
installed protoc-gen-crd version (or run go install and write the current
$(PROTOC_GEN_CRD_VERSION) into $(PROTOC_GEN_CRD_STAMP) on success) and then
update the $(CRD_FILE) prerequisite to depend on $(PROTOC_GEN_CRD_STAMP) instead
of the PHONY target so the plugin is not reinstalled on every build.
node-drainer/pkg/reconciler/reconciler.go (1)

929-934: setDrainFailedStatus bypasses the protojson+map serialization pattern used everywhere else

updateNodeUserPodsEvictedStatus marshals the status via protojsonjson.Unmarshalmap[string]interface{} before writing to MongoDB, but setDrainFailedStatus writes a raw protos.OperationStatus{} struct literal directly into the update payload. For the current Status/Message string-only struct the BSON driver produces the same {status: ..., message: ...} output, but this will silently diverge if OperationStatus gains any proto well-known-type fields (e.g., a Timestamp, Duration, or oneof) that the BSON driver doesn't know how to encode correctly.

♻️ Proposed refactor — extract a shared helper
-	update := map[string]any{
-		"$set": map[string]any{
-			"healtheventstatus.userpodsevictionstatus": protos.OperationStatus{
-				Status:  string(model.StatusFailed),
-				Message: reason,
-			},
-		},
-	}
+	failedStatus := &protos.OperationStatus{
+		Status:  string(model.StatusFailed),
+		Message: reason,
+	}
+	jsonBytes, err := protojson.Marshal(failedStatus)
+	if err != nil {
+		return fmt.Errorf("failed to marshal drain failed status: %w", err)
+	}
+	var evictionStatusMap map[string]interface{}
+	if err := json.Unmarshal(jsonBytes, &evictionStatusMap); err != nil {
+		return fmt.Errorf("failed to unmarshal drain failed status JSON: %w", err)
+	}
+	update := map[string]any{
+		"$set": map[string]any{
+			"healtheventstatus.userpodsevictionstatus": evictionStatusMap,
+		},
+	}

Alternatively, if the suggestion on the protojson comment above is adopted (using explicit field paths), this becomes trivial:

"$set": map[string]any{
    "healtheventstatus.userpodsevictionstatus.status":  string(model.StatusFailed),
    "healtheventstatus.userpodsevictionstatus.message": reason,
},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 929 - 934, The
setDrainFailedStatus update writes a raw protos.OperationStatus struct into the
MongoDB "$set" payload, bypassing the protojson→json.Unmarshal→map pattern used
by updateNodeUserPodsEvictedStatus and risking future encoding mismatches for
proto well-known types; change setDrainFailedStatus to produce the same
map-encoded value as updateNodeUserPodsEvictedStatus by marshaling the
protos.OperationStatus via protojson, json.Unmarshal into a
map[string]interface{}, and place that map under
"healtheventstatus.userpodsevictionstatus" in the "$set" payload (or
alternatively set explicit scalar paths like
"healtheventstatus.userpodsevictionstatus.status" and ".message" if you prefer
field-path updates).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@data-models/Makefile`:
- Line 63: Remove the duplicate PROTO_OPTIONS assignment in the Makefile: locate
the two identical declarations of PROTO_OPTIONS (one at the current start and
the other later in the file) and delete the redundant one so that PROTO_OPTIONS
is declared only once pointing to
vendor/github.com/yandex/protoc-gen-crd/library/go/k8s/protoc_gen_crd/proto/crd.proto;
ensure no other rules rely on the second declaration before removing it.
- Around line 124-125: The make target protos-clean should not depend on
vendor-proto because clean targets must not trigger go mod vendor; remove
vendor-proto from the protos-clean prerequisites and instead ensure vendor-proto
is a prerequisite of protos-generate (it’s safe since compile-crd-server ->
$(CRD_FILE) already transitively requires vendor-proto), so update the Makefile
to drop vendor-proto from the protos-clean dependency list and add it to
protos-generate if not already present.

In `@distros/kubernetes/nvsentinel/values.yaml`:
- Around line 110-111: Add an inline comment documenting the new Helm value
k8sdatastoreCrds.enabled (global.k8sdatastoreCrds.enabled) in values.yaml,
explaining that setting it to true enables CRD installation and providing a
short usage example or note (e.g., "set to true to install CRDs; leave false if
CRDs are managed externally"). Place this comment immediately above or beside
the k8sdatastoreCrds.enabled entry so readers see the enablement requirement and
example when scanning values.yaml.

In `@node-drainer/pkg/evaluator/evaluator.go`:
- Around line 68-69: The code dereferences
healthEvent.HealthEventStatus.UserPodsEvictionStatus without checking for nil
before accessing .Status; update the check inside the evaluator (around the
isTerminalStatus call) to first ensure
healthEvent.HealthEventStatus.UserPodsEvictionStatus != nil and only then call
isTerminalStatus(model.Status(healthEvent.HealthEventStatus.UserPodsEvictionStatus.Status));
retain the existing slog.Info("Event for node is in terminal state, skipping",
"node", nodeName) behavior when the guard indicates a terminal state.

In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 573-586: The code marshals userPodsEvictionStatus with
protojson.Marshal then unmarshals into evictionStatusMap and uses that in
updateFields, but protojson output is unstable (camelCase keys, omitempty) which
can cause mismatched MongoDB field names and silent omitted clears; instead,
construct the update map explicitly from the OperationStatus fields (e.g., build
a map[string]any with explicit keys like "status" and "message" under
"healtheventstatus.userpodsevictionstatus") and assign that to updateFields so
updates always set/clear the exact fields predictably (modify the logic around
userPodsEvictionStatus, evictionStatusMap and updateFields to populate explicit
keys rather than relying on protojson).
- Around line 615-632: In observeEvictionDurationIfSucceeded add a nil check for
the userPodsEvictionStatus parameter before dereferencing its Status field:
return early if userPodsEvictionStatus == nil to avoid panics when callers pass
healthEvent.HealthEventStatus.UserPodsEvictionStatus (which is a
*protos.OperationStatus and can be nil); keep the rest of the logic (parsing
healthEvent, checking QuarantineFinishTimestamp, computing evictionDuration and
observing the metric) unchanged and only proceed when the pointer is non-nil and
Status equals model.StatusSucceeded.

---

Outside diff comments:
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 151-157: PreprocessAndEnqueueEvent dereferences
healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus without guarding
for nil, which can panic when the MongoDB document omits userpodsevictionstatus;
update PreprocessAndEnqueueEvent to check that
healthEventWithStatus.HealthEventStatus != nil and
healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus != nil before
calling isTerminalStatus or accessing .Status, and only call slog.Debug with the
status when the pointer is non-nil (otherwise treat as non-terminal or log that
the eviction status is missing) so the function no longer panics on null/missing
UserPodsEvictionStatus.

In `@store-client/pkg/testutils/event_builder.go`:
- Line 15: Add a package-level godoc comment for the testutils package by
inserting a descriptive comment block immediately above the "package testutils"
declaration; the comment should briefly describe the purpose of the testutils
package (what helpers/utilities it provides and intended audience/usage) so it
satisfies the package-level documentation requirement.

---

Duplicate comments:
In `@api/Makefile`:
- Around line 102-116: Add a top-level phony clean target as an alias to the
existing protos-clean to satisfy checkmake conventions: declare .PHONY: clean
and implement clean: protos-clean (so callers can run make clean). Also update
the test target to run tests with the race detector and match other
modules—replace the plain go test invocation in the test target with a gotestsum
(or go test) invocation that includes -race (e.g., gotestsum -- -race ./... or
go test -race ./...) and keep verbose output; reference the existing
targets/names protos-clean and test and tools gotestsum and the -race flag when
making these edits.

In `@data-models/protobufs/health_event.proto`:
- Around line 30-44: The new protobuf messages lack per-field comments; add
clear field-level comments for OperationStatus, HealthEventStatus,
HealthEventWithStatus, and HealthEventResource describing purpose and semantics
of each field (e.g., what "status" and "message" in OperationStatus represent,
meaning of nodeQuarantined, quarantineFinishTimestamp, userPodsEvictionStatus,
drainFinishTimestamp, faultRemediated, lastRemediationTimestamp in
HealthEventStatus, and each field in HealthEventWithStatus and
HealthEventResource). Locate the messages by name in the proto and add concise,
guideline-compliant comments above each field to document expected values, units
(timestamps), and any nullable/optional semantics.

In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`:
- Around line 89-91: The CRD uses an invalid OpenAPI format "RFC3339" for
timestamp fields; update the protoc-gen-crd generator configuration so those
fields emit format: "date-time" instead. Locate the generator template/config
that produces the fields named generatedTimestamp, quarantineFinishTimestamp,
drainFinishTimestamp, and lastRemediationTimestamp and replace the RFC3339
format token with the OpenAPI 3.0 compliant "date-time" token so schema
validation is enforced.
- Around line 29-31: The CRD schema currently marks properties.spec and
properties.status as nullable, allowing creation of resources with spec: null;
update the top-level OpenAPI schema so that spec is required (add required:
["spec"] at the top-level schema) and remove or set nullable: false for
properties.spec (and optionally properties.status) to enforce presence and
validity; adjust the CRD generator configuration if needed so proto3 optionality
does not emit nullable: true for spec/status.

In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py`:
- Around line 27-28: The generated import for crd_pb2 in health_event_pb2.py
uses an invalid Go-style path
(github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto.crd_pb2)
which causes runtime ImportError; update the import in health_event_pb2.py to
the correct Python package/module name used in this repo (or add a guarded
try/except that falls back to the local crd_pb2 module), i.e., replace or wrap
the failing import line that references
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
so that the module imports the actual crd_pb2 Python module available at
runtime.

In `@Makefile`:
- Line 281: The Makefile target protos-generate currently no longer depends on
protos-clean, which allows stale .pb.go files to persist; restore cleaning by
reintroducing protos-clean as a prerequisite of protos-generate or by invoking
the cleaning logic at the start of the protos-generate recipe (e.g., call the
protos-clean target or run the same cleanup commands before generation),
ensuring any renamed or deleted .proto-derived files are removed prior to
running the generate steps so generated outputs remain consistent.

In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 375-380: The code dereferences
healthEvent.HealthEventStatus.UserPodsEvictionStatus (assigned to
podsEvictionStatus) without checking for nil, causing possible nil-pointer
panics; before setting podsEvictionStatus.Status (in the blocks around
model.UnQuarantined and the other occurrences), add a nil check and initialize a
new protos.OperationStatus when nil (so podsEvictionStatus =
&protos.OperationStatus{Status: ...} or set Status after ensuring non-nil), then
pass the non-nil podsEvictionStatus into r.updateNodeUserPodsEvictedStatus;
apply this pattern to all occurrences referenced (the podsEvictionStatus
assignments near updateNodeUserPodsEvictedStatus calls and similar blocks).

In `@preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py`:
- Around line 27-28: The generated health_event_pb2.py imports a Go-style module
name (github...crd_pb2) that doesn't exist in Python and causes
ModuleNotFoundError; fix by either vendoring a Python crd_pb2 module under the
expected package path so the import in health_event_pb2.py resolves (and add it
to pyproject.toml/packaging), or re-run protoc to regenerate the protobuf stubs
for Python without the Go CRD import (adjust proto import paths/options so
crd_pb2 is a proper Python package import); update health_event_pb2.py only by
replacing the invalid import with the correct Python package path if you vendor
the module.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py`:
- Around line 27-28: The generated health_event_pb2.py contains a broken
Go-style import for crd_pb2; fix the .proto generation so Python outputs a
proper python import for the CRD proto: add or set option python_package =
"desired.python.module.name" (or ensure the CRD proto defines a python_package)
in the CRD proto and regenerate so that health_event_pb2 imports crd_pb2 as a
normal Python module (not the Go path), or adjust the protobuf generation rule
(protoc invocation) to emit crd_pb2 into the same Python package/module
namespace used by health_event_pb2; target the health_event_pb2.py file and the
CRD proto (crd.proto / crd_pb2) when making this change.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi`:
- Around line 5-6: The .pyi stub imports a Go-style path
"github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto.crd_pb2"
(bound to _crd_pb2) which breaks type-checkers; change the import in
health_event_pb2.pyi to match the valid Python module path used by the
corresponding .py (or a relative import) so that _crd_pb2 resolves for
mypy/pyright, or remove/replace the annotation with a forward-declared type if
the module isn’t needed for type checking; update the import that defines
_crd_pb2 in the top-level of the stub to the correct Python package name
consistent with the generated .py files.

In `@tilt/simple-health-client/go.mod`:
- Around line 14-18: The go.mod entry for github.com/yandex/protoc-gen-crd
appears to be a tooling dependency (protoc plugin) rather than a runtime import;
confirm there are no imports of "protoc-gen-crd" in Go sources (search for
protoc-gen-crd or yandex/protoc-gen-crd), and if it is only used for code
generation remove it from the module's runtime deps—either delete the indirect
line from go.mod or move it into a dedicated tooling declaration (create a
tools.go with a build tag like // +build tools and a blank import of
github.com/yandex/protoc-gen-crd, or put the plugin into a separate
tools/go.mod) so the client module does not carry the generator as a runtime
dependency; update go.sum accordingly and run go mod tidy to finalize.

---

Nitpick comments:
In `@data-models/Makefile`:
- Around line 68-71: The install-plugin-remote PHONY target always re-runs go
install for github.com/yandex/protoc-gen-crd@$(PROTOC_GEN_CRD_VERSION); change
this to create and update a stamp file (e.g. $(PROTOC_GEN_CRD_STAMP)) only when
installation/version changes: implement install-plugin-remote to check the
installed protoc-gen-crd version (or run go install and write the current
$(PROTOC_GEN_CRD_VERSION) into $(PROTOC_GEN_CRD_STAMP) on success) and then
update the $(CRD_FILE) prerequisite to depend on $(PROTOC_GEN_CRD_STAMP) instead
of the PHONY target so the plugin is not reinstalled on every build.

In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`:
- Line 149: The CRD currently sets additionalPrinterColumns: [] so kubectl get
healtheventresources only shows NAME and AGE; update the CRD's
additionalPrinterColumns array to declare useful columns (e.g., nodeName mapping
to .spec.nodeName, checkName to .spec.checkName, isHealthy to .spec.isHealthy,
componentClass to .spec.componentClass) with appropriate JSONPaths, types
(string/boolean), and description entries so those fields appear in `kubectl
get` output; modify the additionalPrinterColumns block in the
health_event.crd.yaml (the additionalPrinterColumns entry) to include these
column definitions.

In `@health-events-analyzer/pkg/parser/parser_test.go`:
- Line 419: Remove the stale commented-out line containing
"UserPodsEvictionStatus: &protos.OperationStatus{Status: &protos.Status{Value:
string(model.StatusSucceeded)}},” from parser_test.go (it’s dead/commented test
data), keeping the surrounding test code intact and ensuring the test still
compiles and passes; specifically remove that commented entry referencing
UserPodsEvictionStatus, protos.OperationStatus, protos.Status and
model.StatusSucceeded to eliminate noise.

In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 929-934: The setDrainFailedStatus update writes a raw
protos.OperationStatus struct into the MongoDB "$set" payload, bypassing the
protojson→json.Unmarshal→map pattern used by updateNodeUserPodsEvictedStatus and
risking future encoding mismatches for proto well-known types; change
setDrainFailedStatus to produce the same map-encoded value as
updateNodeUserPodsEvictedStatus by marshaling the protos.OperationStatus via
protojson, json.Unmarshal into a map[string]interface{}, and place that map
under "healtheventstatus.userpodsevictionstatus" in the "$set" payload (or
alternatively set explicit scalar paths like
"healtheventstatus.userpodsevictionstatus.status" and ".message" if you prefer
field-path updates).

In `@store-client/pkg/testutils/event_builder.go`:
- Line 36: Change the quarantineStatus field and its API from model.Status to
plain string: update the struct field quarantineStatus, the NewTestEventBuilder
default value, and the WithQuarantineStatus(builder) method signature to use
string instead of model.Status, and adjust usages (e.g. the UnmarshalDocument
call where NodeQuarantined is set) to assign e.quarantineStatus directly without
casting; update any other references in this file (including the builder field
and default at NewTestEventBuilder and the WithQuarantineStatus implementation)
to remove the legacy model dependency.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ba16c0 and 7298063.

⛔ Files ignored due to path filters (22)
  • commons/go.sum is excluded by !**/*.sum
  • data-models/go.sum is excluded by !**/*.sum
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
  • event-exporter/go.sum is excluded by !**/*.sum
  • fault-quarantine/go.sum is excluded by !**/*.sum
  • fault-remediation/go.sum is excluded by !**/*.sum
  • health-events-analyzer/go.sum is excluded by !**/*.sum
  • health-monitors/csp-health-monitor/go.sum is excluded by !**/*.sum
  • health-monitors/kubernetes-object-monitor/go.sum is excluded by !**/*.sum
  • health-monitors/syslog-health-monitor/go.sum is excluded by !**/*.sum
  • janitor-provider/go.sum is excluded by !**/*.sum
  • janitor/go.sum is excluded by !**/*.sum
  • labeler/go.sum is excluded by !**/*.sum
  • metadata-collector/go.sum is excluded by !**/*.sum
  • node-drainer/go.sum is excluded by !**/*.sum
  • platform-connectors/go.sum is excluded by !**/*.sum
  • preflight-checks/nccl-loopback/go.sum is excluded by !**/*.sum
  • preflight/go.sum is excluded by !**/*.sum
  • store-client/go.sum is excluded by !**/*.sum
  • tests/go.sum is excluded by !**/*.sum
  • tests/scale-tests/event-generator/go.sum is excluded by !**/*.sum
  • tilt/simple-health-client/go.sum is excluded by !**/*.sum
📒 Files selected for processing (59)
  • .gitignore
  • .versions.yaml
  • Makefile
  • api/Makefile
  • commons/go.mod
  • commons/pkg/eventutil/parser.go
  • commons/pkg/eventutil/parser_test.go
  • data-models/Makefile
  • data-models/go.mod
  • data-models/pkg/model/health_event_crd.go
  • data-models/pkg/model/health_event_extentions.go
  • data-models/protobufs/health_event.proto
  • data-models/tools.go
  • distros/kubernetes/nvsentinel/Chart.yaml
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/Chart.yaml
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/values.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • event-exporter/go.mod
  • fault-quarantine/go.mod
  • fault-quarantine/pkg/eventwatcher/event_watcher.go
  • fault-quarantine/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/go.mod
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • health-events-analyzer/go.mod
  • health-events-analyzer/pkg/parser/parser_test.go
  • health-events-analyzer/pkg/reconciler/reconciler_test.go
  • health-monitors/csp-health-monitor/go.mod
  • health-monitors/gpu-health-monitor/Makefile
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • health-monitors/kubernetes-object-monitor/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • janitor-provider/go.mod
  • janitor/go.mod
  • labeler/go.mod
  • metadata-collector/go.mod
  • node-drainer/go.mod
  • node-drainer/pkg/evaluator/evaluator.go
  • node-drainer/pkg/reconciler/reconciler.go
  • node-drainer/pkg/reconciler/reconciler_integration_test.go
  • platform-connectors/go.mod
  • preflight-checks/dcgm-diag/Makefile
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi
  • preflight-checks/nccl-allreduce/Makefile
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi
  • preflight-checks/nccl-loopback/go.mod
  • preflight/go.mod
  • store-client/go.mod
  • store-client/pkg/datastore/providers/postgresql/health_events_serialization_test.go
  • store-client/pkg/testutils/event_builder.go
  • tests/go.mod
  • tests/scale-tests/event-generator/go.mod
  • tilt/simple-health-client/go.mod
✅ Files skipped from review due to trivial changes (1)
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/values.yaml
🚧 Files skipped from review as they are similar to previous changes (26)
  • .gitignore
  • .versions.yaml
  • platform-connectors/go.mod
  • health-monitors/gpu-health-monitor/Makefile
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • preflight/go.mod
  • tests/scale-tests/event-generator/go.mod
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • health-monitors/kubernetes-object-monitor/go.mod
  • janitor/go.mod
  • labeler/go.mod
  • commons/go.mod
  • fault-remediation/pkg/reconciler/reconciler.go
  • preflight-checks/nccl-loopback/go.mod
  • distros/kubernetes/nvsentinel/Chart.yaml
  • janitor-provider/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • data-models/tools.go
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/Chart.yaml
  • node-drainer/pkg/reconciler/reconciler_integration_test.go
  • store-client/go.mod
  • metadata-collector/go.mod
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • fault-remediation/go.mod
  • health-events-analyzer/go.mod
  • commons/pkg/eventutil/parser.go

Comment on lines +110 to +111
k8sdatastoreCrds:
enabled: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Document the new Helm value with an inline comment (incl. enablement note/example).

global.k8sdatastoreCrds.enabled is introduced without inline documentation, so it’s unclear that setting it to true is required to enable CRD installation.

💡 Suggested inline comment
   k8sdatastoreCrds:
-    enabled: false
+    # Enable k8s datastore CRD chart installation (set to true to install CRDs; e.g., enabled: true)
+    enabled: false

As per coding guidelines "Document all values in Helm chart values.yaml with inline comments" and "Note truthy value requirements in Helm chart documentation where applicable".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
k8sdatastoreCrds:
enabled: false
k8sdatastoreCrds:
# Enable k8s datastore CRD chart installation (set to true to install CRDs; e.g., enabled: true)
enabled: false
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@distros/kubernetes/nvsentinel/values.yaml` around lines 110 - 111, Add an
inline comment documenting the new Helm value k8sdatastoreCrds.enabled
(global.k8sdatastoreCrds.enabled) in values.yaml, explaining that setting it to
true enables CRD installation and providing a short usage example or note (e.g.,
"set to true to install CRDs; leave false if CRDs are managed externally").
Place this comment immediately above or beside the k8sdatastoreCrds.enabled
entry so readers see the enablement requirement and example when scanning
values.yaml.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
fault-quarantine/pkg/reconciler/reconciler_e2e_test.go (1)

5257-5262: ⚠️ Potential issue | 🟡 Minor

Missed string() conversion for nodequarantined in TestMetrics_NodeQuarantineDuration.

This is the only remaining inline event construction that still stores model.StatusInProgress as the raw model.Status type instead of string(model.StatusInProgress). Every other occurrence updated in this PR (lines 213, 1457, 1497, 2237, 2318, 3419) applies the string() cast explicitly.

While UnmarshalDocument uses a JSON round-trip (making both forms produce identical output in the current test flow), any code path that performs a direct .(string) type assertion on healtheventstatus.nodequarantined would panic with the raw model.Status value but succeed with a plain string.

🛠️ Proposed fix
-                "nodequarantined": model.StatusInProgress,
+                "nodequarantined": string(model.StatusInProgress),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-quarantine/pkg/reconciler/reconciler_e2e_test.go` around lines 5257 -
5262, In TestMetrics_NodeQuarantineDuration update the inline event construction
that sets healtheventstatus.nodequarantined (for eventID1) to store a string
rather than the raw model.Status: replace the raw model.Status value with
string(model.StatusInProgress) so the healtheventstatus map contains a string
and not a model.Status, ensuring downstream code that does .(string) assertions
on healtheventstatus.nodequarantined (and the TestMetrics_NodeQuarantineDuration
test) won't panic.
commons/pkg/eventutil/parser.go (1)

67-74: ⚠️ Potential issue | 🔴 Critical

Nil dereference: HealthEventStatus is not guarded before accessing .NodeQuarantined.

HealthEventStatus is *protos.HealthEventStatus. If the incoming document lacks the healtheventstatus key entirely, json.Unmarshal leaves the pointer nil. Line 72 then panics. The existing nil guard at lines 67–69 only covers HealthEvent; HealthEventStatus needs the same treatment.

🛠️ Proposed fix
 	// Safety check - ensure HealthEvent is not nil
 	if healthEventWithStatus.HealthEvent == nil {
 		return healthEventWithStatus, fmt.Errorf("health event is nil after unmarshaling")
 	}

+	// Safety check - initialize HealthEventStatus if absent from document
+	if healthEventWithStatus.HealthEventStatus == nil {
+		healthEventWithStatus.HealthEventStatus = &protos.HealthEventStatus{}
+	}
+
 	// Set default value for NodeQuarantined if nil (e.g., for new events)
 	if healthEventWithStatus.HealthEventStatus.NodeQuarantined == "" {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/pkg/eventutil/parser.go` around lines 67 - 74, The code dereferences
healthEventWithStatus.HealthEventStatus without a nil check; before accessing or
setting NodeQuarantined, verify if healthEventWithStatus.HealthEventStatus is
nil and if so initialize it (e.g., new protos.HealthEventStatus{}), then set
NodeQuarantined to string(model.StatusNotStarted) when empty; update the block
around HealthEvent/HealthEventStatus checks to guard HealthEventStatus the same
way as HealthEvent to avoid panics.
node-drainer/pkg/reconciler/reconciler.go (1)

151-155: ⚠️ Potential issue | 🟠 Major

Guard terminal-status check against nil eviction status.

If UserPodsEvictionStatus (or even HealthEventStatus) is nil, this dereference panics during preprocessing and will abort ingestion. Add a nil guard and only treat the status as terminal when present.

🐛 Proposed fix
-	if isTerminalStatus(model.Status(healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status)) {
+	if healthEventWithStatus.HealthEventStatus != nil {
+		if status := healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus; status != nil &&
+			isTerminalStatus(model.Status(status.Status)) {
 			slog.Debug("Skipping event - already in terminal state",
 				"node", healthEventWithStatus.HealthEvent.NodeName,
-				"status", healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status)
+				"status", status.Status)
 
-		return nil
+			return nil
+		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 151 - 155, The
nil-guard is missing before dereferencing
HealthEventStatus.UserPodsEvictionStatus.Status which can panic; update the
precondition in the reconciler so you first check that
healthEventWithStatus.HealthEventStatus != nil and
healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus != nil and only
then call isTerminalStatus(model.Status(...Status)); leave the existing
slog.Debug path unchanged but ensure you only treat the event as terminal when
those two fields are present (use the same healthEventWithStatus,
HealthEventStatus, UserPodsEvictionStatus and isTerminalStatus identifiers to
locate the code).
♻️ Duplicate comments (11)
fault-remediation/pkg/reconciler/reconciler.go (2)

118-121: ⚠️ Potential issue | 🔴 Critical

Guard HealthEventStatus before dereferencing NodeQuarantined.

Line 118 can panic if healthEventWithStatus.HealthEventStatus is nil; shouldSkipEvent already treats it as possibly nil, so this is still a risk.

🛡️ Proposed fix
- nodeQuarantined := healthEventWithStatus.HealthEventStatus.NodeQuarantined
-
- if nodeQuarantined == string(model.UnQuarantined) || nodeQuarantined == string(model.Cancelled) {
-     return r.handleCancellationEvent(ctx, nodeName, model.Status(nodeQuarantined), r.Watcher, event.ResumeToken)
- }
+ if healthEventWithStatus.HealthEventStatus != nil {
+     nodeQuarantined := healthEventWithStatus.HealthEventStatus.NodeQuarantined
+     if nodeQuarantined == string(model.UnQuarantined) || nodeQuarantined == string(model.Cancelled) {
+         return r.handleCancellationEvent(ctx, nodeName, model.Status(nodeQuarantined), r.Watcher, event.ResumeToken)
+     }
+ }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/reconciler/reconciler.go` around lines 118 - 121, The
code dereferences healthEventWithStatus.HealthEventStatus.NodeQuarantined
without guarding that HealthEventStatus may be nil; update the conditional in
the reconciler to first check healthEventWithStatus.HealthEventStatus != nil
before reading NodeQuarantined, then compare the value against
model.UnQuarantined and model.Cancelled and call r.handleCancellationEvent only
when the status exists and matches; reference the HealthEventStatus field and
NodeQuarantined value, and keep the call to r.handleCancellationEvent(ctx,
nodeName, model.Status(nodeQuarantined), r.Watcher, event.ResumeToken) unchanged
when invoked.

139-141: ⚠️ Potential issue | 🟡 Minor

Replace the TODO with an issue reference.

The TODO names a developer instead of a tracking issue, which violates the repo guideline.

✍️ Suggested update
- // TODO - Avinash add a helper function to check for supported actions
+ // TODO - #<issue-number> add a helper function to check for supported actions

As per coding guidelines, "TODO comments should reference issues in Go code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/reconciler/reconciler.go` around lines 139 - 141,
Replace the developer-named TODO above the healthEventWithStatus check with a
TODO that references the tracking issue (e.g. "TODO(#<issue-number>): ..." or a
short issue URL) per the repo guideline; remove the name "Avinash" and ensure
the comment succinctly states the needed work (e.g., add helper to check
supported actions) so reviewers can find the issue when seeing the TODO near
healthEventWithStatus.HealthEventStatus.FaultRemediated.
preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi (1)

5-6: Duplicate: dependency verification covered in health_event_pb2.py.

No extra action here; see the runtime import check in the generated .py file for ensuring the CRD proto module is available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi` around
lines 5 - 6, This .pyi duplicate declaration can be removed: delete the
redundant import "from
github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto import
crd_pb2 as _crd_pb2" from health_event_pb2.pyi and rely on the runtime
import/verification already present in health_event_pb2.py (the existing runtime
check will ensure the CRD proto module is available).
distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml (2)

30-31: spec is still nullable: true — prior unresolved comment.

nullable: true on spec (and status at line 123) allows creating a HealthEventResource with spec: null, which is non-idiomatic for Kubernetes CRDs. This was flagged in a prior review and remains unaddressed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`
around lines 30 - 31, The CRD currently allows spec and status to be null for
the HealthEventResource by setting "nullable: true" (see the spec and status
schema entries); change those entries so spec and status are non-nullable
(remove "nullable: true" or set to "nullable: false") in the HealthEventResource
CRD schema to prevent creating resources with spec: null or status: null,
ensuring the spec/status schema blocks in the CRD are explicitly required and
validated.

89-91: format: RFC3339 is still non-standard OpenAPI — prior unresolved comment.

format: RFC3339 is not a recognized OpenAPI 3.0 format identifier (the correct one is date-time). Kubernetes ignores unrecognized format values, so timestamp validation is skipped. This was flagged in a prior review and remains unaddressed. Same issue on lines 129–131, 140–142, and 145–147.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`
around lines 89 - 91, The CRD uses the non‑standard OpenAPI format value
"RFC3339" for timestamp properties (e.g. generatedTimestamp); replace every
occurrence of format: RFC3339 with the OpenAPI‑recognized format: date-time
while keeping type: string so Kubernetes/OpenAPI will validate timestamps
correctly (apply this change for generatedTimestamp and the other fields that
currently use format: RFC3339).
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

5-6: ⚠️ Potential issue | 🟠 Major

crd_pb2 import likely missing from the Python package.

The stub mirrors the same missing dependency as the generated .py module; keep the stubs consistent with the fix you apply there.

#!/bin/bash
# Verify whether crd_pb2 exists in the repo/package.
fd "crd_pb2.py" health-monitors/gpu-health-monitor
rg -n "crd_pb2" health-monitors/gpu-health-monitor -g '*.py'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi`
around lines 5 - 6, The stub import for crd_pb2 in health_event_pb2.pyi is
pointing to a non-existent package path; update the stub to match the actual
Python package used by the generated .py (or add the missing crd_pb2 module to
the package) so both the .py and .pyi are consistent: locate the import in
health_event_pb2.pyi (currently referencing github.com.yandex...crd_pb2) and
replace it with the correct Python module name (or a relative/package import
that matches the generated health_event_pb2.py), and ensure the corresponding
crd_pb2.py exists in the package so type stubs and runtime imports resolve.
api/Makefile (1)

104-116: ⚠️ Potential issue | 🟡 Minor

Add a clean phony target to satisfy checkmake.

checkmake still requires clean in this module. Consider aliasing to protos-clean or a no‑op target.

♻️ Proposed fix
 .PHONY: lint
 lint: ## Run golangci-lint on all Go files.
 	`@echo` "Running golangci-lint..."
 	`@golangci-lint` run ./...
 
+.PHONY: clean
+clean: protos-clean ## Remove generated code.
+
 .PHONY: test
 test: ## Run Go unit tests.
 	`@echo` "Running tests..."
 	`@go` test -v ./...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/Makefile` around lines 104 - 116, Add a PHONY clean target to the
Makefile so checkmake passes: create a "clean" target (and add it to .PHONY)
that either calls the existing "protos-clean" target or is a no-op placeholder,
ensuring it coexists with the current "lint", "test", and "lint-test" targets;
update the .PHONY list to include "clean" and make "clean" simply delegate to
"protos-clean" or echo a harmless message.
node-drainer/pkg/reconciler/reconciler.go (2)

637-655: ⚠️ Potential issue | 🟡 Minor

Add a nil guard before reading userPodsEvictionStatus.

The helper dereferences the pointer unconditionally; if any caller passes nil this will panic.

🐛 Proposed fix
 func (r *Reconciler) observeEvictionDurationIfSucceeded(event datastore.Event,
 	userPodsEvictionStatus *protos.OperationStatus) {
+	if userPodsEvictionStatus == nil {
+		return
+	}
 	if userPodsEvictionStatus.Status != string(model.StatusSucceeded) {
 		return
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 637 - 655, The
function observeEvictionDurationIfSucceeded dereferences userPodsEvictionStatus
without a nil check which can panic if a caller passes nil; add a nil guard at
the top of observeEvictionDurationIfSucceeded (e.g., if userPodsEvictionStatus
== nil { return }) and then perform the existing Status comparison
(userPodsEvictionStatus.Status != string(model.StatusSucceeded)) only after that
guard so the function safely returns when the pointer is nil.

28-28: ⚠️ Potential issue | 🟠 Major

Avoid protojson → map for persistence (unstable keys / omitempty).

protojson.Marshal output is explicitly unstable and omits zero values, which can silently skip clears on $set. Prefer building an explicit update map.

♻️ Proposed fix
-	// Marshal the proto message to JSON, then unmarshal to map[string]interface{}
-	jsonBytes, err := protojson.Marshal(userPodsEvictionStatus)
-	if err != nil {
-		return fmt.Errorf("failed to marshal userPodsEvictionStatus: %w", err)
-	}
-
-	var evictionStatusMap map[string]interface{}
-	if err := json.Unmarshal(jsonBytes, &evictionStatusMap); err != nil {
-		return fmt.Errorf("failed to unmarshal eviction status JSON: %w", err)
-	}
-
-	updateFields := map[string]any{
-		"healtheventstatus.userpodsevictionstatus": evictionStatusMap,
-	}
+	updateFields := map[string]any{
+		"healtheventstatus.userpodsevictionstatus.status":  userPodsEvictionStatus.Status,
+		"healtheventstatus.userpodsevictionstatus.message": userPodsEvictionStatus.Message,
+	}
protojson Marshal output stability and default field name / omitempty behavior

Also applies to: 587-609, 612-613, 627-627

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` at line 28, You are currently
using protojson.Marshal to convert protobuf messages into maps for persistence
(the import "google.golang.org/protobuf/encoding/protojson" and its uses around
the indicated locations); protojson output is unstable and omits zero-values, so
replace each protojson->map conversion with an explicit update map constructed
from the protobuf message fields (use the message's getters or direct field
access) so zero/empty values are represented when you intend to clear fields in
the datastore; locate the code that calls protojson.Marshal and the subsequent
map construction around the referenced areas and build a deterministic map
(e.g., map[string]interface{}) by enumerating each field you want
persisted/cleared, then pass that explicit map to your persistence/update
function instead of the protojson-derived map.
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

27-28: ⚠️ Potential issue | 🟠 Major

crd_pb2 import likely missing from the Python package.

github.com.yandex...crd_pb2 is not typically available in the gpu‑health‑monitor environment; this will fail at runtime unless it’s vendored and declared. Regenerate the protobufs without the CRD import or vendor/add the dependency.

#!/bin/bash
# Verify whether crd_pb2 exists in the repo/package.
fd "crd_pb2.py" health-monitors/gpu-health-monitor
rg -n "crd_pb2" health-monitors/gpu-health-monitor -g '*.py'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py`
around lines 27 - 28, The generated protobuf file health_event_pb2.py imports an
unavailable module named
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
(the long alias from the line importing crd_pb2); either remove that CRD
dependency by regenerating the protos without the CRD import or vendor/add the
missing crd_pb2 Python package into the project so the import resolves at
runtime. Locate health_event_pb2.py and the import statement for
github.com.yandex...crd_pb2, then (a) re-run your protobuf generation with the
CRD proto excluded/omitted or (b) add the generated crd_pb2.py into the package
(and update packaging/import paths), and verify with the suggested fd/rg checks
that crd_pb2.py is present.
data-models/protobufs/health_event.proto (1)

31-51: ⚠️ Potential issue | 🟡 Minor

Add field-level comments for the new CRD-facing messages.

OperationStatus, HealthEventStatus, HealthEventWithStatus, and HealthEventResource fields are undocumented; CRD users will see these in generated docs. Please add per‑field comments describing semantics and expected values.

📝 Example (apply similarly to all new fields)
 message OperationStatus {
-  string status = 1;
-  string message = 2;
+  // status holds the operation status string (e.g., InProgress, Succeeded, Failed).
+  string status = 1;
+  // message provides a human‑readable status detail.
+  string message = 2;
 }
 
 message HealthEventStatus {
-  string nodeQuarantined = 1;
+  // nodeQuarantined indicates the quarantine state of the node.
+  string nodeQuarantined = 1;
   google.protobuf.Timestamp quarantineFinishTimestamp = 2;
-  OperationStatus userPodsEvictionStatus = 3;
+  // userPodsEvictionStatus tracks user‑pod eviction progress.
+  OperationStatus userPodsEvictionStatus = 3;
   google.protobuf.Timestamp drainFinishTimestamp = 4;
-  google.protobuf.BoolValue faultRemediated = 5;
+  // faultRemediated indicates whether the underlying fault was remediated.
+  google.protobuf.BoolValue faultRemediated = 5;
   google.protobuf.Timestamp lastRemediationTimestamp = 6;
 }

As per coding guidelines, "Include comprehensive comments for all fields in Protocol Buffer messages."

Also applies to: 118-129

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@data-models/protobufs/health_event.proto` around lines 31 - 51, Add
descriptive per-field comments for the new CRD-facing protobuf messages:
OperationStatus, HealthEventStatus, HealthEventWithStatus (and any
HealthEventResource fields referenced). For each field (OperationStatus.status,
OperationStatus.message, HealthEventStatus.nodeQuarantined,
HealthEventStatus.quarantineFinishTimestamp,
HealthEventStatus.userPodsEvictionStatus,
HealthEventStatus.drainFinishTimestamp, HealthEventStatus.faultRemediated,
HealthEventStatus.lastRemediationTimestamp, HealthEventWithStatus.createdAt,
HealthEventWithStatus.healthEvent, HealthEventWithStatus.healthEventStatus) add
a one-line comment directly above the field that explains its semantics,
allowed/typical values or formats (e.g., enum-like strings, timestamps are
RFC3339/google.protobuf.Timestamp, boolean meaning), and any unit or scope
expectations so generated CRD docs are clear. Ensure comments follow existing
proto style (single-line comments) and mirror the example used elsewhere in the
repo.
🧹 Nitpick comments (2)
preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi (1)

5-6: _crd_pb2 is imported but unused in the type stubs.

The _crd_pb2 import (line 6) isn't referenced by any type annotation in this .pyi file — it's only needed at the descriptor level in the companion .py file. Since this is auto-generated, no action needed here, but the same runtime availability concern for the github.com.yandex.protoc_gen_crd... package applies (flagged in the .py file review).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi`
around lines 5 - 6, Remove the unused import _crd_pb2 from the
health_event_pb2.pyi type stub to avoid a spurious unused-import; update the top
of the file to delete the line importing "from
github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto import
crd_pb2 as _crd_pb2" (or alternatively mark it with a "# type: ignore" if the
import must remain for tooling), and ensure any runtime dependency checks or
notes already raised for the companion .py (the github.com.yandex.protoc_gen_crd
package) are addressed separately so the .pyi stays minimal.
commons/pkg/eventutil/parser_test.go (1)

27-157: Missing test case for completely absent healtheventstatus field.

Every test case provides "healtheventstatus": {...} (some with an empty map). There is no case where the key is entirely absent from the document. That path leaves HealthEventStatus nil after json.Unmarshal and triggers the nil dereference in ParseHealthEventFromEvent at line 72 of parser.go. Adding a test for this scenario would catch the bug (and validate the fix once applied).

♻️ Proposed additional test case
{
    name: "handle completely absent healtheventstatus",
    event: datastore.Event{
        "operationType": "insert",
        "fullDocument": map[string]interface{}{
            "healthevent": map[string]interface{}{
                "nodename":       "absent-status-node",
                "checkname":      "GpuXidError",
                "componentclass": "GPU",
            },
        },
    },
    expectError: false,
    checkResult: func(t *testing.T, result model.HealthEventWithStatus) {
        assert.Equal(t, "absent-status-node", result.HealthEvent.NodeName)
        assert.EqualValues(t, model.StatusNotStarted, result.HealthEventStatus.NodeQuarantined)
    },
},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/pkg/eventutil/parser_test.go` around lines 27 - 157, Add a unit test
that covers a document/event where "healtheventstatus" is entirely absent (e.g.,
name "handle completely absent healtheventstatus") and assert that
ParseHealthEventFromEvent returns a HealthEventWithStatus with
HealthEvent.NodeName set and HealthEventStatus.NodeQuarantined defaulted to
model.StatusNotStarted; then fix ParseHealthEventFromEvent in parser.go to guard
against nil HealthEventStatus after json.Unmarshal by initializing a default
model.HealthEventStatus (with NodeQuarantined = model.StatusNotStarted) when the
parsed status is nil or its NodeQuarantined is empty, ensuring no nil
dereference occurs in the code paths that access result.HealthEventStatus.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`:
- Around line 49-70: The schema's enum constraints on fields recommendedAction
(and similarly processingStrategy) are ineffective because
x-kubernetes-int-or-string: true disables enum validation and format: enum is
invalid; remove the enum and format: enum entries or replace them with a
Kubernetes CEL validation using x-kubernetes-validations that enforces allowed
integer/string values for recommendedAction/processingStrategy, and ensure the
CRD uses either the advisory enum removal or an explicit
x-kubernetes-validations CEL rule to enforce the proto enum semantics.

In `@node-drainer/pkg/evaluator/evaluator.go`:
- Around line 62-66: Guard the pointer before dereferencing: check
healthEvent.HealthEventStatus != nil before reading NodeQuarantined and only
perform the comparison to string(model.UnQuarantined) when the pointer is
non-nil; e.g. wrap the existing statusStr logic in an if
healthEvent.HealthEventStatus != nil { ... } block (referencing
HealthEventStatus and NodeQuarantined) so you don't panic and keep the later
UserPodsEvictionStatus nil guard and existing DrainActionResult/ActionSkip
behavior intact.

In `@preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py`:
- Around line 27-28: The generated Python file health_event_pb2.py erroneously
imports
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
(the crd_pb2 symbol) which is unused and causes ModuleNotFoundError; edit
data-models/protobufs/health_event.proto to remove the import
"github.com/yandex/protoc-gen-crd/library/go/k8s/protoc_gen_crd/proto/crd.proto"
(used only for Go CRD generation), then re-run protoc to regenerate
health_event_pb2.py so the offending import line is gone and the module no
longer references
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2.

In `@preflight-checks/dcgm-diag/Makefile`:
- Line 35: protos-clean should be side‑effect free: remove vendor-proto from the
protos-clean target dependency so it does not fetch or vendor code; ensure
vendor-proto remains only as a dependency of protos-generate (or add
vendor-proto to the protos-generate prerequisites if not already present) and
verify protos-clean only removes generated artifacts without invoking vendoring.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py`:
- Line 28: The generated health_event_pb2.py imports a non-existent module
github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto.crd_pb2
which will raise ImportError; regenerate or provide the missing Python CRD proto
package so that crd_pb2 and any intermediate packages (with __init__.py) exist
on the Python path, or adjust the generated import to the correct Python package
name and location; locate the import in health_event_pb2.py (the failing import
line referencing crd_pb2), re-run the protobuf compilation for the CRD proto
target to produce crd_pb2.py (or vendor the compiled crd_pb2 into the
nccl_allreduce package) and ensure the package is installed or added to
PYTHONPATH so the import resolves at runtime.

---

Outside diff comments:
In `@commons/pkg/eventutil/parser.go`:
- Around line 67-74: The code dereferences
healthEventWithStatus.HealthEventStatus without a nil check; before accessing or
setting NodeQuarantined, verify if healthEventWithStatus.HealthEventStatus is
nil and if so initialize it (e.g., new protos.HealthEventStatus{}), then set
NodeQuarantined to string(model.StatusNotStarted) when empty; update the block
around HealthEvent/HealthEventStatus checks to guard HealthEventStatus the same
way as HealthEvent to avoid panics.

In `@fault-quarantine/pkg/reconciler/reconciler_e2e_test.go`:
- Around line 5257-5262: In TestMetrics_NodeQuarantineDuration update the inline
event construction that sets healtheventstatus.nodequarantined (for eventID1) to
store a string rather than the raw model.Status: replace the raw model.Status
value with string(model.StatusInProgress) so the healtheventstatus map contains
a string and not a model.Status, ensuring downstream code that does .(string)
assertions on healtheventstatus.nodequarantined (and the
TestMetrics_NodeQuarantineDuration test) won't panic.

In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 151-155: The nil-guard is missing before dereferencing
HealthEventStatus.UserPodsEvictionStatus.Status which can panic; update the
precondition in the reconciler so you first check that
healthEventWithStatus.HealthEventStatus != nil and
healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus != nil and only
then call isTerminalStatus(model.Status(...Status)); leave the existing
slog.Debug path unchanged but ensure you only treat the event as terminal when
those two fields are present (use the same healthEventWithStatus,
HealthEventStatus, UserPodsEvictionStatus and isTerminalStatus identifiers to
locate the code).

---

Duplicate comments:
In `@api/Makefile`:
- Around line 104-116: Add a PHONY clean target to the Makefile so checkmake
passes: create a "clean" target (and add it to .PHONY) that either calls the
existing "protos-clean" target or is a no-op placeholder, ensuring it coexists
with the current "lint", "test", and "lint-test" targets; update the .PHONY list
to include "clean" and make "clean" simply delegate to "protos-clean" or echo a
harmless message.

In `@data-models/protobufs/health_event.proto`:
- Around line 31-51: Add descriptive per-field comments for the new CRD-facing
protobuf messages: OperationStatus, HealthEventStatus, HealthEventWithStatus
(and any HealthEventResource fields referenced). For each field
(OperationStatus.status, OperationStatus.message,
HealthEventStatus.nodeQuarantined, HealthEventStatus.quarantineFinishTimestamp,
HealthEventStatus.userPodsEvictionStatus,
HealthEventStatus.drainFinishTimestamp, HealthEventStatus.faultRemediated,
HealthEventStatus.lastRemediationTimestamp, HealthEventWithStatus.createdAt,
HealthEventWithStatus.healthEvent, HealthEventWithStatus.healthEventStatus) add
a one-line comment directly above the field that explains its semantics,
allowed/typical values or formats (e.g., enum-like strings, timestamps are
RFC3339/google.protobuf.Timestamp, boolean meaning), and any unit or scope
expectations so generated CRD docs are clear. Ensure comments follow existing
proto style (single-line comments) and mirror the example used elsewhere in the
repo.

In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`:
- Around line 30-31: The CRD currently allows spec and status to be null for the
HealthEventResource by setting "nullable: true" (see the spec and status schema
entries); change those entries so spec and status are non-nullable (remove
"nullable: true" or set to "nullable: false") in the HealthEventResource CRD
schema to prevent creating resources with spec: null or status: null, ensuring
the spec/status schema blocks in the CRD are explicitly required and validated.
- Around line 89-91: The CRD uses the non‑standard OpenAPI format value
"RFC3339" for timestamp properties (e.g. generatedTimestamp); replace every
occurrence of format: RFC3339 with the OpenAPI‑recognized format: date-time
while keeping type: string so Kubernetes/OpenAPI will validate timestamps
correctly (apply this change for generatedTimestamp and the other fields that
currently use format: RFC3339).

In `@fault-remediation/pkg/reconciler/reconciler.go`:
- Around line 118-121: The code dereferences
healthEventWithStatus.HealthEventStatus.NodeQuarantined without guarding that
HealthEventStatus may be nil; update the conditional in the reconciler to first
check healthEventWithStatus.HealthEventStatus != nil before reading
NodeQuarantined, then compare the value against model.UnQuarantined and
model.Cancelled and call r.handleCancellationEvent only when the status exists
and matches; reference the HealthEventStatus field and NodeQuarantined value,
and keep the call to r.handleCancellationEvent(ctx, nodeName,
model.Status(nodeQuarantined), r.Watcher, event.ResumeToken) unchanged when
invoked.
- Around line 139-141: Replace the developer-named TODO above the
healthEventWithStatus check with a TODO that references the tracking issue (e.g.
"TODO(#<issue-number>): ..." or a short issue URL) per the repo guideline;
remove the name "Avinash" and ensure the comment succinctly states the needed
work (e.g., add helper to check supported actions) so reviewers can find the
issue when seeing the TODO near
healthEventWithStatus.HealthEventStatus.FaultRemediated.

In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py`:
- Around line 27-28: The generated protobuf file health_event_pb2.py imports an
unavailable module named
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
(the long alias from the line importing crd_pb2); either remove that CRD
dependency by regenerating the protos without the CRD import or vendor/add the
missing crd_pb2 Python package into the project so the import resolves at
runtime. Locate health_event_pb2.py and the import statement for
github.com.yandex...crd_pb2, then (a) re-run your protobuf generation with the
CRD proto excluded/omitted or (b) add the generated crd_pb2.py into the package
(and update packaging/import paths), and verify with the suggested fd/rg checks
that crd_pb2.py is present.

In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi`:
- Around line 5-6: The stub import for crd_pb2 in health_event_pb2.pyi is
pointing to a non-existent package path; update the stub to match the actual
Python package used by the generated .py (or add the missing crd_pb2 module to
the package) so both the .py and .pyi are consistent: locate the import in
health_event_pb2.pyi (currently referencing github.com.yandex...crd_pb2) and
replace it with the correct Python module name (or a relative/package import
that matches the generated health_event_pb2.py), and ensure the corresponding
crd_pb2.py exists in the package so type stubs and runtime imports resolve.

In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 637-655: The function observeEvictionDurationIfSucceeded
dereferences userPodsEvictionStatus without a nil check which can panic if a
caller passes nil; add a nil guard at the top of
observeEvictionDurationIfSucceeded (e.g., if userPodsEvictionStatus == nil {
return }) and then perform the existing Status comparison
(userPodsEvictionStatus.Status != string(model.StatusSucceeded)) only after that
guard so the function safely returns when the pointer is nil.
- Line 28: You are currently using protojson.Marshal to convert protobuf
messages into maps for persistence (the import
"google.golang.org/protobuf/encoding/protojson" and its uses around the
indicated locations); protojson output is unstable and omits zero-values, so
replace each protojson->map conversion with an explicit update map constructed
from the protobuf message fields (use the message's getters or direct field
access) so zero/empty values are represented when you intend to clear fields in
the datastore; locate the code that calls protojson.Marshal and the subsequent
map construction around the referenced areas and build a deterministic map
(e.g., map[string]interface{}) by enumerating each field you want
persisted/cleared, then pass that explicit map to your persistence/update
function instead of the protojson-derived map.

In `@preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi`:
- Around line 5-6: This .pyi duplicate declaration can be removed: delete the
redundant import "from
github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto import
crd_pb2 as _crd_pb2" from health_event_pb2.pyi and rely on the runtime
import/verification already present in health_event_pb2.py (the existing runtime
check will ensure the CRD proto module is available).

---

Nitpick comments:
In `@commons/pkg/eventutil/parser_test.go`:
- Around line 27-157: Add a unit test that covers a document/event where
"healtheventstatus" is entirely absent (e.g., name "handle completely absent
healtheventstatus") and assert that ParseHealthEventFromEvent returns a
HealthEventWithStatus with HealthEvent.NodeName set and
HealthEventStatus.NodeQuarantined defaulted to model.StatusNotStarted; then fix
ParseHealthEventFromEvent in parser.go to guard against nil HealthEventStatus
after json.Unmarshal by initializing a default model.HealthEventStatus (with
NodeQuarantined = model.StatusNotStarted) when the parsed status is nil or its
NodeQuarantined is empty, ensuring no nil dereference occurs in the code paths
that access result.HealthEventStatus.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi`:
- Around line 5-6: Remove the unused import _crd_pb2 from the
health_event_pb2.pyi type stub to avoid a spurious unused-import; update the top
of the file to delete the line importing "from
github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto import
crd_pb2 as _crd_pb2" (or alternatively mark it with a "# type: ignore" if the
import must remain for tooling), and ensure any runtime dependency checks or
notes already raised for the companion .py (the github.com.yandex.protoc_gen_crd
package) are addressed separately so the .pyi stays minimal.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7298063 and 712d208.

⛔ Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (31)
  • Makefile
  • api/Makefile
  • commons/pkg/eventutil/parser.go
  • commons/pkg/eventutil/parser_test.go
  • data-models/Makefile
  • data-models/pkg/model/health_event_extentions.go
  • data-models/protobufs/health_event.proto
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/values.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-quarantine/pkg/eventwatcher/event_watcher.go
  • fault-quarantine/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/go.mod
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • health-events-analyzer/pkg/parser/parser_test.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • node-drainer/pkg/evaluator/evaluator.go
  • node-drainer/pkg/reconciler/reconciler.go
  • node-drainer/pkg/reconciler/reconciler_integration_test.go
  • preflight-checks/dcgm-diag/Makefile
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi
  • preflight-checks/nccl-allreduce/Makefile
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi
  • store-client/pkg/datastore/providers/postgresql/health_events_serialization_test.go
  • store-client/pkg/testutils/event_builder.go
🚧 Files skipped from review as they are similar to previous changes (8)
  • fault-remediation/pkg/remediation/remediation.go
  • fault-quarantine/pkg/eventwatcher/event_watcher.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/values.yaml
  • node-drainer/pkg/reconciler/reconciler_integration_test.go
  • distros/kubernetes/nvsentinel/values.yaml
  • Makefile
  • health-events-analyzer/pkg/parser/parser_test.go

Comment on lines +49 to +70
recommendedAction:
enum:
- 0
- "NONE"
- 2
- "COMPONENT_RESET"
- 5
- "CONTACT_SUPPORT"
- 6
- "RUN_FIELDDIAG"
- 15
- "RESTART_VM"
- 24
- "RESTART_BM"
- 25
- "REPLACE_VM"
- 26
- "RUN_DCGMEUD"
- 99
- "UNKNOWN"
format: enum
x-kubernetes-int-or-string: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

enum constraint combined with x-kubernetes-int-or-string: true is silently ignored by Kubernetes.

Kubernetes structural schema validation treats x-kubernetes-int-or-string: true as a special marker and does not apply enum constraints to such fields. The recommendedAction and processingStrategy enums will not be enforced — any arbitrary integer or string can be stored in these fields. Additionally, format: enum (line 69) is not a recognized OpenAPI format and has no validation effect.

If proto enum validation is important, consider a CEL validation rule (x-kubernetes-validations) as an alternative, or document that these fields are advisory-only.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`
around lines 49 - 70, The schema's enum constraints on fields recommendedAction
(and similarly processingStrategy) are ineffective because
x-kubernetes-int-or-string: true disables enum validation and format: enum is
invalid; remove the enum and format: enum entries or replace them with a
Kubernetes CEL validation using x-kubernetes-validations that enforces allowed
integer/string values for recommendedAction/processingStrategy, and ensure the
CRD uses either the advisory enum removal or an explicit
x-kubernetes-validations CEL rule to enforce the proto enum semantics.

Comment on lines +27 to +28
from google.protobuf import wrappers_pb2 as google_dot_protobuf_dot_wrappers__pb2
from github.com.yandex.protoc_gen_crd.library.go.k8s.protoc_gen_crd.proto import crd_pb2 as github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate any generated CRD Python modules
fd -a "crd_pb2.py"

# Check for a matching package hierarchy (e.g., github/com/...)
fd -a "__init__.py" -g "github/**" || true

# Verify dependency declaration in Poetry config
fd -a "pyproject.toml" -x rg -n "crd|yandex|protoc_gen_crd" {} || true

Repository: NVIDIA/NVSentinel

Length of output: 157


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate any generated CRD Python modules
echo "=== Searching for crd_pb2.py files ==="
fd -a "crd_pb2.py" || echo "No crd_pb2.py found"

# Check for github package directory at repo root or subdirs
echo "=== Checking for github directory structure ==="
fd -a -t d "github" || echo "No github directory found"

# Look for pyproject.toml
echo "=== Locating pyproject.toml ==="
fd -a "pyproject.toml"

# Check repo root structure
echo "=== Repository root structure ==="
ls -la | head -20

# Search for crd references in pyproject.toml if it exists
echo "=== Searching for crd/yandex/protoc_gen_crd in pyproject.toml ==="
fd -a "pyproject.toml" -x cat {} | rg -i "crd|yandex|protoc_gen_crd" || echo "No matches found"

Repository: NVIDIA/NVSentinel

Length of output: 1727


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check the dcgm-diag pyproject.toml
echo "=== dcgm-diag pyproject.toml ==="
cat /home/jailuser/git/preflight-checks/dcgm-diag/pyproject.toml

echo -e "\n=== Checking if health_event_pb2.py exists ==="
ls -la /home/jailuser/git/preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py 2>/dev/null || echo "File not found"

echo -e "\n=== First 50 lines of health_event_pb2.py ==="
head -50 /home/jailuser/git/preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py 2>/dev/null || echo "File not found"

# Search for any references to github.com.yandex or crd in dcgm-diag
echo -e "\n=== Searching for crd/yandex references in dcgm-diag ==="
rg -i "github.com.yandex|crd|yandex" /home/jailuser/git/preflight-checks/dcgm-diag/ || echo "No matches"

Repository: NVIDIA/NVSentinel

Length of output: 11249


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Locating health_event.proto ==="
fd -a "health_event.proto"

echo -e "\n=== Content of health_event.proto ==="
cat /home/jailuser/git/preflight-checks/dcgm-diag/dcgm_diag/protos/health_event.proto 2>/dev/null || echo "Not found"

echo -e "\n=== Checking if crd.proto exists anywhere ==="
fd -a "crd.proto"

echo -e "\n=== Looking for go.mod or other dependency files ==="
fd -a "go.mod|go.sum" /home/jailuser/git/preflight-checks/dcgm-diag/ || echo "None found"

Repository: NVIDIA/NVSentinel

Length of output: 306


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Content of health_event.proto ==="
cat /home/jailuser/git/data-models/protobufs/health_event.proto

echo -e "\n=== Data-models directory structure ==="
ls -la /home/jailuser/git/data-models/protobufs/ | head -20

echo -e "\n=== Checking for crd dependency in data-models ==="
rg -i "crd|yandex" /home/jailuser/git/data-models/ --max-count=10 || echo "None found"

echo -e "\n=== Looking for go.mod in data-models ==="
find /home/jailuser/git/data-models -name "go.mod" -o -name "go.sum" -o -name "Makefile" | head -10

Repository: NVIDIA/NVSentinel

Length of output: 7766


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Checking if crd_pb2 symbol is used in health_event_pb2.py ==="
grep -n "crd_pb2\|_crd_pb2" /home/jailuser/git/preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py | head -20

echo -e "\n=== Checking if health_event_pb2 is imported by Python code ==="
rg -r "import.*health_event_pb2|from.*health_event_pb2" /home/jailuser/git/preflight-checks/dcgm-diag/ || echo "No imports found"

echo -e "\n=== Full grep for crd in generated _pb2.py ==="
grep -c "github_dot_com_dot_yandex" /home/jailuser/git/preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py || echo "0 matches"

echo -e "\n=== Check if the import statement will fail by trying to parse it ==="
python3 -c "import sys; sys.path.insert(0, '/home/jailuser/git/preflight-checks/dcgm-diag'); from dcgm_diag.protos import health_event_pb2" 2>&1 || echo "Import failed as expected"

Repository: NVIDIA/NVSentinel

Length of output: 1016


Remove the unused crd_pb2 import from the proto file and regenerate Python code.

The generated import on line 28 (from github.com.yandex...crd_pb2) causes a ModuleNotFoundError because the crd module is a Go dependency, not available in Python. The symbol is imported but never used in the generated Python code. Remove import "github.com/yandex/protoc-gen-crd/library/go/k8s/protoc_gen_crd/proto/crd.proto"; from data-models/protobufs/health_event.proto (the proto imports it for Go CRD generation, not Python), then regenerate the _pb2.py files to resolve this runtime blocker.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py` around lines
27 - 28, The generated Python file health_event_pb2.py erroneously imports
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
(the crd_pb2 symbol) which is unused and causes ModuleNotFoundError; edit
data-models/protobufs/health_event.proto to remove the import
"github.com/yandex/protoc-gen-crd/library/go/k8s/protoc_gen_crd/proto/crd.proto"
(used only for Go CRD generation), then re-run protoc to regenerate
health_event_pb2.py so the offending import line is gone and the module no
longer references
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
store-client/pkg/testutils/event_builder.go (2)

15-16: ⚠️ Potential issue | 🟡 Minor

Add a package-level doc comment.

The testutils package lacks a // Package testutils ... doc comment. As per coding guidelines, all Go packages require package-level godoc.

📝 Proposed fix
+// Package testutils provides database-agnostic test helpers for constructing
+// and populating HealthEvent fixtures without coupling tests to a specific datastore backend.
 package testutils

As per coding guidelines: "Package-level godoc required for all Go packages."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@store-client/pkg/testutils/event_builder.go` around lines 15 - 16, Add a
package-level godoc comment immediately above the "package testutils"
declaration: start the comment with "// Package testutils ..." and provide a
brief description of the package’s purpose and what exported helpers (e.g.,
event builders, test utilities) it provides, following Go godoc conventions
(short summary sentence, optional additional details). Ensure the comment is a
plain // comment block describing the package rather than a block comment, and
keep it concise and relevant to identifiers in this package.

43-43: ⚠️ Potential issue | 🟡 Minor

Use the model.UnQuarantined constant instead of the hardcoded string literal.

The model package defines Status constants (e.g., model.UnQuarantined, model.Quarantined), but the code uses model.Status("NotQuarantined"), which doesn't match any defined constant. Use model.UnQuarantined to align with the established quarantine status values.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@store-client/pkg/testutils/event_builder.go` at line 43, Replace the
hardcoded status literal used when constructing the event (the field named
quarantineStatus in the event builder) with the predefined constant from the
model package: use model.UnQuarantined instead of
model.Status("NotQuarantined"), ensuring the builder sets the established Status
constant rather than an unmatched string.
node-drainer/pkg/reconciler/reconciler.go (1)

145-167: ⚠️ Potential issue | 🔴 Critical

Guard against nil UserPodsEvictionStatus before terminal-status check.
HealthEventStatus or UserPodsEvictionStatus can be nil; dereferencing will panic. Use the generated getter and guard before accessing Status.

Proposed fix
-	if isTerminalStatus(model.Status(healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status)) {
+	podsEvictionStatus := healthEventWithStatus.HealthEventStatus.GetUserPodsEvictionStatus()
+	if podsEvictionStatus != nil &&
+		isTerminalStatus(model.Status(podsEvictionStatus.Status)) {
 		slog.Debug("Skipping event - already in terminal state",
 			"node", healthEventWithStatus.HealthEvent.NodeName,
-			"status", healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status)
+			"status", podsEvictionStatus.Status)
 
 		return nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 145 - 167, The
terminal-status check currently dereferences
HealthEventStatus.UserPodsEvictionStatus and can panic if either is nil; before
calling isTerminalStatus, use the generated getters (e.g.
healthEventWithStatus.GetHealthEventStatus(), GetUserPodsEvictionStatus(),
GetStatus()) and guard for nil so you only call
isTerminalStatus(model.Status(...)) when UserPodsEvictionStatus and its Status
are non-nil; similarly replace direct field access in the subsequent slog.Debug
and the handleEventCancellation call with the safe getters or nil-checked values
to avoid panics.
♻️ Duplicate comments (8)
api/Makefile (1)

102-116: ⚠️ Potential issue | 🟡 Minor

Add a clean phony alias to satisfy checkmake.
checkmake still expects a clean target; add a minimal alias to protos-clean.

Proposed patch
 .PHONY: lint-test
 lint-test: lint test ## Run lint and tests.
 	`@echo` "Lint and tests completed."
+
+.PHONY: clean
+clean: protos-clean ## Remove generated code.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/Makefile` around lines 102 - 116, The Makefile is missing a clean phony
target required by checkmake; add a `.PHONY: clean` entry and a minimal `clean`
target that delegates to the existing `protos-clean` target (so add `clean:
protos-clean`) near the other phony targets (next to `lint`, `test`,
`lint-test`) to satisfy the check without changing behavior.
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

5-7: ⚠️ Potential issue | 🟡 Minor

crd_pb2 stub import is unresolved.
Type-checkers/stub loading will fail unless this module is vendored/declared. Regenerate stubs without the CRD import or add the dependency.

#!/bin/bash
# Verify whether crd_pb2 exists and is declared as a dependency.
fd -a "crd_pb2.py"
fd -a "pyproject.toml" -x rg -n "protoc_gen_crd|crd_pb2|yandex" {} || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi`
around lines 5 - 7, The stub import for crd_pb2 in health_event_pb2.pyi is
unresolved; either remove/regenerate the stubs so they don't import
github.com.yandex...crd_pb2 or vendor/declare that module as a Python
dependency. Locate the import line referencing "crd_pb2 as _crd_pb2" in
health_event_pb2.pyi and fix by regenerating the .pyi without the CRD import (so
protoc doesn't emit the github import) or add a vendored stub/package for the
module (crd_pb2) to the repo and ensure it's listed in packaging/pyproject so
type-checkers can resolve it.
node-drainer/pkg/reconciler/reconciler.go (1)

28-28: ⚠️ Potential issue | 🟠 Major

Avoid protojson for persistence updates; build the map explicitly.
protojson.Marshal output is documented as unstable and omits zero-values, which can leave stale fields (e.g., message) in MongoDB. Build the map directly to ensure deterministic keys and field clearing.

Proposed fix
-	// Marshal the proto message to JSON, then unmarshal to map[string]interface{}
-	jsonBytes, err := protojson.Marshal(userPodsEvictionStatus)
-	if err != nil {
-		return fmt.Errorf("failed to marshal userPodsEvictionStatus: %w", err)
-	}
-
-	var evictionStatusMap map[string]interface{}
-	if err := json.Unmarshal(jsonBytes, &evictionStatusMap); err != nil {
-		return fmt.Errorf("failed to unmarshal eviction status JSON: %w", err)
-	}
+	evictionStatusMap := map[string]any{
+		"status":  userPodsEvictionStatus.Status,
+		"message": userPodsEvictionStatus.Message,
+	}
protojson Marshal output stability omitempty behavior

Also applies to: 588-615

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` at line 28, The code is using
protojson.Marshal to generate update payloads for persistence which is unstable
and omits zero-values (risking stale fields like message); replace the
protojson.Marshal-based update with an explicit map construction that sets each
persisted key deterministically (including empty/zero values to clear fields)
before calling the MongoDB update path used in reconciler.go (search for
protojson.Marshal usage and the MongoDB update call in the reconciler's
update/save functions); ensure fields like "message" are always present in the
map (set to "" or the zero value when absent) and remove the protojson import
once replaced.
preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py (1)

27-31: ⚠️ Potential issue | 🔴 Critical

Runtime import failure risk from crd_pb2.
This CRD Python module isn’t typically available; without vendoring/declaring it, imports will fail. Regenerate without the CRD import or vendor/declare the dependency.

#!/bin/bash
# Verify whether crd_pb2 exists and is declared as a dependency.
fd -a "crd_pb2.py"
fd -a "pyproject.toml" -x rg -n "protoc_gen_crd|crd_pb2|yandex" {} || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py` around lines
27 - 31, The generated health_event_pb2.py includes an unconditional import of
the CRD module (alias
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2)
which may not be present at runtime; either re-generate the protobuf sources
without that CRD dependency or remove/guard that import and any references to it
in the DESCRIPTOR so the module does not require crd_pb2 at import time (locate
the import statement for crd_pb2 and the DESCRIPTOR =
_descriptor_pool.Default().AddSerializedFile(...) usage in health_event_pb2.py
and eliminate or conditionalize the CRD-related pieces, or vendor/declare the
crd_pb2 dependency if regeneration isn't possible).
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

27-31: ⚠️ Potential issue | 🔴 Critical

Runtime import failure: crd_pb2 is not available in the Python env.
This generated import will raise ModuleNotFoundError unless the CRD Python module is vendored and declared. Regenerate without the CRD import or vendor/declare the dependency in Poetry.

#!/bin/bash
# Verify whether crd_pb2 exists and is declared as a dependency.
fd -a "crd_pb2.py"
fd -a "pyproject.toml" -x rg -n "protoc_gen_crd|crd_pb2|yandex" {} || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py`
around lines 27 - 31, The generated health_event_pb2.py imports a non-existent
module
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
which causes ModuleNotFoundError; either regenerate the protobufs without
referencing crd.proto so the import is removed (re-run proto generation for
HealthEvent without the CRD include), or vendor the CRD Python module and
declare it in pyproject.toml so that
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
becomes importable; look for the import line and the DESCRIPTOR
AddSerializedFile block in health_event_pb2.py to validate the change.
data-models/protobufs/health_event.proto (1)

30-50: ⚠️ Potential issue | 🟡 Minor

Add per-field comments for new CRD-facing fields.

The new message fields lack field-level documentation (OperationStatus, HealthEventStatus, HealthEventWithStatus, HealthEventResource). These are surfaced in generated CRD docs, so each field should describe meaning, expected values, and semantics.

📝 Example pattern
 message OperationStatus {
+  // status is the operation state (e.g., NotStarted, InProgress, Failed, Succeeded, AlreadyDrained).
   string status = 1;
+  // message provides a human-readable explanation for the current status.
   string message = 2;
 }
As per coding guidelines, "Include comprehensive comments for all fields in Protocol Buffer messages."

Also applies to: 120-129

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@data-models/protobufs/health_event.proto` around lines 30 - 50, The new
protobuf message fields (OperationStatus, HealthEventStatus,
HealthEventWithStatus and HealthEventResource) lack per-field documentation used
by generated CRD docs; update each field declaration to include a brief comment
describing the field's purpose, expected values/range/semantics, and any
important invariants or lifecycle notes (e.g., whether a timestamp is set by
server, what values "status" can take, meaning of boolean wrappers like
faultRemediated, and ownership/validation rules for nested messages). Locate the
messages OperationStatus, HealthEventStatus, HealthEventWithStatus and the
HealthEventResource message and add concise comments for every field (e.g.,
status, message, nodeQuarantined, quarantineFinishTimestamp,
userPodsEvictionStatus, drainFinishTimestamp, faultRemediated,
lastRemediationTimestamp, createdAt, healthEvent, healthEventStatus) following
the existing comment style so CRD docs clearly describe meaning and expected
semantics.
node-drainer/pkg/evaluator/evaluator.go (1)

62-62: ⚠️ Potential issue | 🔴 Critical

Guard healthEvent.HealthEventStatus before dereference.

Line 62 dereferences healthEvent.HealthEventStatus directly; if the status is missing this will panic before your later UserPodsEvictionStatus guard. Add a nil check and short-circuit with a safe action.

🛠️ Proposed fix
-	statusStr := healthEvent.HealthEventStatus.NodeQuarantined
+	if healthEvent.HealthEventStatus == nil {
+		slog.Error("HealthEventStatus is nil, cannot evaluate event",
+			"node", nodeName)
+		return &DrainActionResult{
+			Action:    ActionWait,
+			WaitDelay: time.Minute,
+		}, nil
+	}
+	statusStr := healthEvent.HealthEventStatus.NodeQuarantined
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/evaluator/evaluator.go` at line 62, The code dereferences
healthEvent.HealthEventStatus when assigning statusStr (line with statusStr :=
healthEvent.HealthEventStatus.NodeQuarantined); add a nil guard on
healthEvent.HealthEventStatus before accessing NodeQuarantined and short-circuit
to a safe default action (e.g., treat as non-quarantined or return early) so the
later UserPodsEvictionStatus logic doesn't panic; update the block around the
statusStr assignment in evaluator.go to check healthEvent.HealthEventStatus !=
nil, handle the nil case deterministically, then continue to use
UserPodsEvictionStatus as planned.
preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py (1)

27-31: ⚠️ Potential issue | 🔴 Critical

Ensure crd_pb2 is generated and packaged for Python.

Line 28 imports github.com.yandex.protoc_gen_crd...crd_pb2. If that module isn’t generated into the Python package (including intermediate __init__.py files), importing health_event_pb2 will raise ImportError at runtime. Please generate the CRD proto Python module (or vendor it) and ensure it’s on the module path.

🔎 Verification script
#!/bin/bash
# Check whether crd_pb2.py exists in the repo and where it lives
fd -a 'crd_pb2.py' --type f

# Check for expected package path in nccl_allreduce or data-models vendor outputs
fd -a '__init__.py' --type f | rg -n 'github\.com/yandex/protoc_gen_crd'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py`
around lines 27 - 31, The generated health_event_pb2 import references
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
(crd_pb2) which must exist on the Python path; generate (or vendor) the Python
CRD proto module and ensure package __init__.py files create the
github.com.yandex.protoc_gen_crd... package namespace so the import in
health_event_pb2 (the google_dot_protobuf_wrappers import and the
github...crd_pb2 import) resolves at runtime; regenerate the CRD proto with
protoc (or copy the generated crd_pb2.py and accompanying __init__.py files)
into the repository and add the package directory to packaging/setup so it’s
installed with the module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 145-167: The terminal-status check currently dereferences
HealthEventStatus.UserPodsEvictionStatus and can panic if either is nil; before
calling isTerminalStatus, use the generated getters (e.g.
healthEventWithStatus.GetHealthEventStatus(), GetUserPodsEvictionStatus(),
GetStatus()) and guard for nil so you only call
isTerminalStatus(model.Status(...)) when UserPodsEvictionStatus and its Status
are non-nil; similarly replace direct field access in the subsequent slog.Debug
and the handleEventCancellation call with the safe getters or nil-checked values
to avoid panics.

In `@store-client/pkg/testutils/event_builder.go`:
- Around line 15-16: Add a package-level godoc comment immediately above the
"package testutils" declaration: start the comment with "// Package testutils
..." and provide a brief description of the package’s purpose and what exported
helpers (e.g., event builders, test utilities) it provides, following Go godoc
conventions (short summary sentence, optional additional details). Ensure the
comment is a plain // comment block describing the package rather than a block
comment, and keep it concise and relevant to identifiers in this package.
- Line 43: Replace the hardcoded status literal used when constructing the event
(the field named quarantineStatus in the event builder) with the predefined
constant from the model package: use model.UnQuarantined instead of
model.Status("NotQuarantined"), ensuring the builder sets the established Status
constant rather than an unmatched string.

---

Duplicate comments:
In `@api/Makefile`:
- Around line 102-116: The Makefile is missing a clean phony target required by
checkmake; add a `.PHONY: clean` entry and a minimal `clean` target that
delegates to the existing `protos-clean` target (so add `clean: protos-clean`)
near the other phony targets (next to `lint`, `test`, `lint-test`) to satisfy
the check without changing behavior.

In `@data-models/protobufs/health_event.proto`:
- Around line 30-50: The new protobuf message fields (OperationStatus,
HealthEventStatus, HealthEventWithStatus and HealthEventResource) lack per-field
documentation used by generated CRD docs; update each field declaration to
include a brief comment describing the field's purpose, expected
values/range/semantics, and any important invariants or lifecycle notes (e.g.,
whether a timestamp is set by server, what values "status" can take, meaning of
boolean wrappers like faultRemediated, and ownership/validation rules for nested
messages). Locate the messages OperationStatus, HealthEventStatus,
HealthEventWithStatus and the HealthEventResource message and add concise
comments for every field (e.g., status, message, nodeQuarantined,
quarantineFinishTimestamp, userPodsEvictionStatus, drainFinishTimestamp,
faultRemediated, lastRemediationTimestamp, createdAt, healthEvent,
healthEventStatus) following the existing comment style so CRD docs clearly
describe meaning and expected semantics.

In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py`:
- Around line 27-31: The generated health_event_pb2.py imports a non-existent
module
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
which causes ModuleNotFoundError; either regenerate the protobufs without
referencing crd.proto so the import is removed (re-run proto generation for
HealthEvent without the CRD include), or vendor the CRD Python module and
declare it in pyproject.toml so that
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
becomes importable; look for the import line and the DESCRIPTOR
AddSerializedFile block in health_event_pb2.py to validate the change.

In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi`:
- Around line 5-7: The stub import for crd_pb2 in health_event_pb2.pyi is
unresolved; either remove/regenerate the stubs so they don't import
github.com.yandex...crd_pb2 or vendor/declare that module as a Python
dependency. Locate the import line referencing "crd_pb2 as _crd_pb2" in
health_event_pb2.pyi and fix by regenerating the .pyi without the CRD import (so
protoc doesn't emit the github import) or add a vendored stub/package for the
module (crd_pb2) to the repo and ensure it's listed in packaging/pyproject so
type-checkers can resolve it.

In `@node-drainer/pkg/evaluator/evaluator.go`:
- Line 62: The code dereferences healthEvent.HealthEventStatus when assigning
statusStr (line with statusStr :=
healthEvent.HealthEventStatus.NodeQuarantined); add a nil guard on
healthEvent.HealthEventStatus before accessing NodeQuarantined and short-circuit
to a safe default action (e.g., treat as non-quarantined or return early) so the
later UserPodsEvictionStatus logic doesn't panic; update the block around the
statusStr assignment in evaluator.go to check healthEvent.HealthEventStatus !=
nil, handle the nil case deterministically, then continue to use
UserPodsEvictionStatus as planned.

In `@node-drainer/pkg/reconciler/reconciler.go`:
- Line 28: The code is using protojson.Marshal to generate update payloads for
persistence which is unstable and omits zero-values (risking stale fields like
message); replace the protojson.Marshal-based update with an explicit map
construction that sets each persisted key deterministically (including
empty/zero values to clear fields) before calling the MongoDB update path used
in reconciler.go (search for protojson.Marshal usage and the MongoDB update call
in the reconciler's update/save functions); ensure fields like "message" are
always present in the map (set to "" or the zero value when absent) and remove
the protojson import once replaced.

In `@preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py`:
- Around line 27-31: The generated health_event_pb2.py includes an unconditional
import of the CRD module (alias
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2)
which may not be present at runtime; either re-generate the protobuf sources
without that CRD dependency or remove/guard that import and any references to it
in the DESCRIPTOR so the module does not require crd_pb2 at import time (locate
the import statement for crd_pb2 and the DESCRIPTOR =
_descriptor_pool.Default().AddSerializedFile(...) usage in health_event_pb2.py
and eliminate or conditionalize the CRD-related pieces, or vendor/declare the
crd_pb2 dependency if regeneration isn't possible).

In `@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py`:
- Around line 27-31: The generated health_event_pb2 import references
github_dot_com_dot_yandex_dot_protoc__gen__crd_dot_library_dot_go_dot_k8s_dot_protoc__gen__crd_dot_proto_dot_crd__pb2
(crd_pb2) which must exist on the Python path; generate (or vendor) the Python
CRD proto module and ensure package __init__.py files create the
github.com.yandex.protoc_gen_crd... package namespace so the import in
health_event_pb2 (the google_dot_protobuf_wrappers import and the
github...crd_pb2 import) resolves at runtime; regenerate the CRD proto with
protoc (or copy the generated crd_pb2.py and accompanying __init__.py files)
into the repository and add the package directory to packaging/setup so it’s
installed with the module.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 712d208 and aa8342c.

⛔ Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (31)
  • Makefile
  • api/Makefile
  • commons/pkg/eventutil/parser.go
  • commons/pkg/eventutil/parser_test.go
  • data-models/Makefile
  • data-models/pkg/model/health_event_extentions.go
  • data-models/protobufs/health_event.proto
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/values.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-quarantine/pkg/eventwatcher/event_watcher.go
  • fault-quarantine/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/go.mod
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • health-events-analyzer/pkg/parser/parser_test.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • node-drainer/pkg/evaluator/evaluator.go
  • node-drainer/pkg/reconciler/reconciler.go
  • node-drainer/pkg/reconciler/reconciler_integration_test.go
  • preflight-checks/dcgm-diag/Makefile
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi
  • preflight-checks/nccl-allreduce/Makefile
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi
  • store-client/pkg/datastore/providers/postgresql/health_events_serialization_test.go
  • store-client/pkg/testutils/event_builder.go
🚧 Files skipped from review as they are similar to previous changes (14)
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-quarantine/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • node-drainer/pkg/reconciler/reconciler_integration_test.go
  • distros/kubernetes/nvsentinel/values.yaml
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi
  • commons/pkg/eventutil/parser.go
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/values.yaml
  • preflight-checks/dcgm-diag/Makefile
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml
  • fault-remediation/go.mod
  • commons/pkg/eventutil/parser_test.go

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
node-drainer/pkg/reconciler/reconciler.go (2)

150-156: ⚠️ Potential issue | 🔴 Critical

Nil dereference panic if UserPodsEvictionStatus is absent from the document

UserPodsEvictionStatus is a *protos.OperationStatus field. Documents that have never been processed by setInitialStatusAndEnqueue will have this field absent, causing unmarshalGenericEvent to leave the pointer nil. Accessing .Status at line 150 without a preceding nil check will panic.

Every other call site in this file guards the pointer first (lines 516, 534, 797). This newly added path does not.

🐛 Proposed fix
-	if isTerminalStatus(model.Status(healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status)) {
+	evictionStatus := healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus
+	if evictionStatus != nil && isTerminalStatus(model.Status(evictionStatus.Status)) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 150 - 156, The code
dereferences
healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status without
checking for nil; update the early-return block (the one calling
isTerminalStatus(...)) to first guard that
healthEventWithStatus.HealthEventStatus != nil and
healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus != nil before
accessing .Status (mirror the nil checks used at other call sites in this file),
and only call isTerminalStatus when the pointer is non-nil; if the pointer is
nil, skip the terminal-state check and continue (i.e., do not dereference).

953-960: ⚠️ Potential issue | 🟠 Major

setDrainFailedStatus bypasses the explicit-map pattern, risking BSON field name drift

updateNodeUserPodsEvictedStatus was specifically refactored to use an explicit map[string]interface{}{"status": ..., "message": ...} to guarantee predictable, lowercase field names in MongoDB. setDrainFailedStatus embeds a protos.OperationStatus{} struct value directly into the $set payload.

The BSON encoder uses lowercased field names when no bson tag is present, so it matches today — but any future exported field added to protos.OperationStatus will be serialized silently, and the inconsistency makes the intent opaque.

♻️ Proposed fix — use the same explicit map approach
 	update := map[string]any{
 		"$set": map[string]any{
-			"healtheventstatus.userpodsevictionstatus": protos.OperationStatus{
-				Status:  string(model.StatusFailed),
-				Message: reason,
-			},
+			"healtheventstatus.userpodsevictionstatus": map[string]interface{}{
+				"status":  string(model.StatusFailed),
+				"message": reason,
+			},
 		},
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 953 - 960, The update
payload in setDrainFailedStatus currently embeds a protos.OperationStatus struct
directly, risking implicit BSON field names; change it to mirror
updateNodeUserPodsEvictedStatus by constructing an explicit map[string]any for
the "$set" value (e.g., "$set":
map[string]any{"healtheventstatus.userpodsevictionstatus":
map[string]any{"status": string(model.StatusFailed), "message": reason}}) so
only the intended "status" and "message" lowercase fields are written; locate
setDrainFailedStatus and replace the protos.OperationStatus literal with the
explicit map form referencing protos.OperationStatus fields for values.
♻️ Duplicate comments (1)
node-drainer/pkg/reconciler/reconciler.go (1)

631-649: Nil guard for userPodsEvictionStatus still absent

The nil check requested in the previous review was reportedly addressed, but the current function body at line 634 still accesses userPodsEvictionStatus.Status unconditionally. While all current callers pass a non-nil pointer (enforced by guards in the calling functions), the defensive guard is absent and the function signature accepts a pointer.

🛡️ Proposed guard
 func (r *Reconciler) observeEvictionDurationIfSucceeded(event datastore.Event,
 	userPodsEvictionStatus *protos.OperationStatus) {
+	if userPodsEvictionStatus == nil {
+		return
+	}
 	if userPodsEvictionStatus.Status != string(model.StatusSucceeded) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 631 - 649, Add a
defensive nil guard at the start of observeEvictionDurationIfSucceeded to return
immediately if the pointer userPodsEvictionStatus is nil before accessing
userPodsEvictionStatus.Status; locate the function
observeEvictionDurationIfSucceeded in reconciler.go and insert a check for
userPodsEvictionStatus == nil (return) so subsequent code can safely read
userPodsEvictionStatus.Status and avoid panics.
🧹 Nitpick comments (2)
api/Makefile (1)

53-53: make all now implies lint + tests — consider whether that's intentional for local use.

Running lint-test as part of all is great for CI but means a plain make invocation also blocks on linting and testing, which can slow local iteration. If the intent is CI-first, this is fine as-is. A common pattern is to keep all: protos-generate build for local builds and add a separate ci: all lint-test target for pipeline use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/Makefile` at line 53, The current Makefile target "all" includes
"lint-test" which forces linting and tests on every plain `make`; change the
Makefile so "all" depends only on "protos-generate" and "build" (remove
"lint-test" from the "all" prerequisite list) and add a new CI-specific target
(e.g., `ci`) that depends on "all" and "lint-test" so CI runs linting/tests but
local `make` stays fast; update targets "all", "lint-test", "protos-generate",
and "build" accordingly.
node-drainer/pkg/reconciler/reconciler.go (1)

163-169: Prefer a local model.Status variable over the unsafe pointer cast

(*model.Status)(&healthEventWithStatus.HealthEventStatus.NodeQuarantined) works only because model.Status is type Status string, but the cast will silently break if that type alias changes. The same pattern appears at line 294.

♻️ Suggested refactor (applies to both line 165 and line 294)
+	nodeQuarantinedStatus := model.Status(healthEventWithStatus.HealthEventStatus.NodeQuarantined)
 	if shouldSkip := r.handleEventCancellation(
-		documentID, nodeName, (*model.Status)(&healthEventWithStatus.HealthEventStatus.NodeQuarantined)); shouldSkip {
+		documentID, nodeName, &nodeQuarantinedStatus); shouldSkip {

And at line 294:

+	nodeQuarantinedModelStatus := model.Status(nodeQuarantinedStatus)
-	if r.isEventCancelled(eventID, nodeName, (*model.Status)(&nodeQuarantinedStatus)) {
+	if r.isEventCancelled(eventID, nodeName, &nodeQuarantinedModelStatus) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 163 - 169, Replace
the unsafe pointer cast to model.Status by creating a local variable of type
model.Status, assign it the string value from
healthEventWithStatus.HealthEventStatus.NodeQuarantined (and the analogous value
used at the other site), and pass a pointer to that local variable into
r.handleEventCancellation; update both call sites (the call using
healthEventWithStatus.HealthEventStatus.NodeQuarantined and the similar call
around line 294) so they reference the new local model.Status variable instead
of using (*model.Status)(...).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/Makefile`:
- Around line 104-107: The lint target currently invokes golangci-lint from
PATH; add a version-pinned variable (e.g., GOLANGCI_LINT_VERSION set to v2.5.0)
and a download/install target (e.g., download-golangci-lint or
install-golangci-lint) that uses go-install-tool to install
golangci-lint@$(GOLANGCI_LINT_VERSION) into $(LOCALBIN), then update the lint
target to call $(LOCALBIN)/golangci-lint (or the variable holding that path)
instead of the bare golangci-lint; ensure references use the existing patterns
for other tools like protoc-gen-go/protoc-gen-go-grpc and that the version
matches .versions.yaml.

---

Outside diff comments:
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 150-156: The code dereferences
healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus.Status without
checking for nil; update the early-return block (the one calling
isTerminalStatus(...)) to first guard that
healthEventWithStatus.HealthEventStatus != nil and
healthEventWithStatus.HealthEventStatus.UserPodsEvictionStatus != nil before
accessing .Status (mirror the nil checks used at other call sites in this file),
and only call isTerminalStatus when the pointer is non-nil; if the pointer is
nil, skip the terminal-state check and continue (i.e., do not dereference).
- Around line 953-960: The update payload in setDrainFailedStatus currently
embeds a protos.OperationStatus struct directly, risking implicit BSON field
names; change it to mirror updateNodeUserPodsEvictedStatus by constructing an
explicit map[string]any for the "$set" value (e.g., "$set":
map[string]any{"healtheventstatus.userpodsevictionstatus":
map[string]any{"status": string(model.StatusFailed), "message": reason}}) so
only the intended "status" and "message" lowercase fields are written; locate
setDrainFailedStatus and replace the protos.OperationStatus literal with the
explicit map form referencing protos.OperationStatus fields for values.

---

Duplicate comments:
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 631-649: Add a defensive nil guard at the start of
observeEvictionDurationIfSucceeded to return immediately if the pointer
userPodsEvictionStatus is nil before accessing userPodsEvictionStatus.Status;
locate the function observeEvictionDurationIfSucceeded in reconciler.go and
insert a check for userPodsEvictionStatus == nil (return) so subsequent code can
safely read userPodsEvictionStatus.Status and avoid panics.

---

Nitpick comments:
In `@api/Makefile`:
- Line 53: The current Makefile target "all" includes "lint-test" which forces
linting and tests on every plain `make`; change the Makefile so "all" depends
only on "protos-generate" and "build" (remove "lint-test" from the "all"
prerequisite list) and add a new CI-specific target (e.g., `ci`) that depends on
"all" and "lint-test" so CI runs linting/tests but local `make` stays fast;
update targets "all", "lint-test", "protos-generate", and "build" accordingly.

In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 163-169: Replace the unsafe pointer cast to model.Status by
creating a local variable of type model.Status, assign it the string value from
healthEventWithStatus.HealthEventStatus.NodeQuarantined (and the analogous value
used at the other site), and pass a pointer to that local variable into
r.handleEventCancellation; update both call sites (the call using
healthEventWithStatus.HealthEventStatus.NodeQuarantined and the similar call
around line 294) so they reference the new local model.Status variable instead
of using (*model.Status)(...).

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aa8342c and eaba40f.

📒 Files selected for processing (3)
  • api/Makefile
  • node-drainer/pkg/evaluator/evaluator.go
  • node-drainer/pkg/reconciler/reconciler.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • node-drainer/pkg/evaluator/evaluator.go

Comment on lines +104 to +107
.PHONY: lint
lint: ## Run golangci-lint on all Go files.
@echo "Running golangci-lint..."
@golangci-lint run ./...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check .versions.yaml for golangci-lint entry
echo "=== Checking .versions.yaml for golangci-lint ==="
if [ -f ".versions.yaml" ]; then
  cat .versions.yaml
else
  echo ".versions.yaml not found in repo root"
fi

Repository: NVIDIA/NVSentinel

Length of output: 2502


🏁 Script executed:

#!/bin/bash
# Check api/Makefile for tool management patterns
echo "=== Checking api/Makefile ==="
if [ -f "api/Makefile" ]; then
  cat -n api/Makefile | head -150
else
  echo "api/Makefile not found"
fi

Repository: NVIDIA/NVSentinel

Length of output: 5507


🏁 Script executed:

#!/bin/bash
# Search for golangci-lint references in Makefiles
echo "=== Searching for golangci-lint in Makefiles ==="
rg "golangci" --type makefile -C 3

Repository: NVIDIA/NVSentinel

Length of output: 159


Add version-pinned golangci-lint to api/Makefile following the same pattern as other Go tools.

golangci-lint is defined in .versions.yaml at v2.5.0 but is invoked directly from PATH in the lint target without version pinning. All other Go tools (protoc-gen-go, protoc-gen-go-grpc) are pinned and installed via go-install-tool into $(LOCALBIN), ensuring reproducible builds across machines and CI runners. The lint target should follow the same pattern.

Add the variable and download target:

♻️ Proposed changes
 # Tool Binaries
 PROTOC_GEN_GO           ?= $(LOCALBIN)/protoc-gen-go
 PROTOC_GEN_GO_GRPC      ?= $(LOCALBIN)/protoc-gen-go-grpc
+GOLANGCI_LINT           ?= $(LOCALBIN)/golangci-lint
 
 # Tool Versions (load from ../.versions.yaml)
 # Requires yq to be installed: brew install yq (macOS) or see https://github.com/mikefarah/yq
 YQ := $(shell command -v yq 2> /dev/null)
 ifndef YQ
 $(error yq is required but not found. Install it with: brew install yq (macOS) or see https://github.com/mikefarah/yq)
 endif
 
 VERSIONS_FILE := ../.versions.yaml
 
 PROTOC_GEN_GO_VERSION           := $(shell $(YQ) '.protobuf.protoc_gen_go' $(VERSIONS_FILE))
 PROTOC_GEN_GO_GRPC_VERSION      := $(shell $(YQ) '.protobuf.protoc_gen_go_grpc' $(VERSIONS_FILE))
+GOLANGCI_LINT_VERSION           := $(shell $(YQ) '.go_tools.golangci_lint' $(VERSIONS_FILE))
 .PHONY: lint
-lint: ## Run golangci-lint on all Go files.
+lint: $(GOLANGCI_LINT) ## Run golangci-lint on all Go files.
 	`@echo` "Running golangci-lint..."
-	`@golangci-lint` run ./...
+	@$(GOLANGCI_LINT) run ./...
 $(PROTOC_GEN_GO_GRPC):
 	$(call go-install-tool,$(PROTOC_GEN_GO_GRPC),google.golang.org/grpc/cmd/protoc-gen-go-grpc,$(PROTOC_GEN_GO_GRPC_VERSION))
 
+$(GOLANGCI_LINT):
+	$(call go-install-tool,$(GOLANGCI_LINT),github.com/golangci/golangci-lint/cmd/golangci-lint,$(GOLANGCI_LINT_VERSION))
+
 # go-install-tool macro
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/Makefile` around lines 104 - 107, The lint target currently invokes
golangci-lint from PATH; add a version-pinned variable (e.g.,
GOLANGCI_LINT_VERSION set to v2.5.0) and a download/install target (e.g.,
download-golangci-lint or install-golangci-lint) that uses go-install-tool to
install golangci-lint@$(GOLANGCI_LINT_VERSION) into $(LOCALBIN), then update the
lint target to call $(LOCALBIN)/golangci-lint (or the variable holding that
path) instead of the bare golangci-lint; ensure references use the existing
patterns for other tools like protoc-gen-go/protoc-gen-go-grpc and that the
version matches .versions.yaml.

// from the change stream before passing to the reconciler
// NOTE: Always use a pointer for proto structs (e.g., *protos.HealthEventStatus)!
// Proto-generated structs may embed sync.Mutex or other non-copyable fields.
// Using a value type can cause unsafe copies and subtle bugs (see Go linter: copylocksdefault).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to change protos.HealthEventStatus to be a ptr, earlier it was value

@lalitadithya
Copy link
Collaborator

/ok to test 310f037

@yavinash007
Copy link
Contributor Author

@ivelichkovich @lalitadithya Sorry, rebase was messy, had to squash all the commits for a clean rebase.

@lalitadithya
Copy link
Collaborator

sorry @yavinash007 , we merged in a bunch of dependabot updates earlier today. Can you rebase again? The next depedenbot updates aren't for a week, so we can get this merged in before that. sorry again

expected := bson.M{
"healtheventstatus.faultremediated": true,
"healtheventstatus.lastremediationtimestamp": now,
"healtheventstatus.faultremediated": map[string]interface{}{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be a string with the value true?

Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
@yavinash007
Copy link
Contributor Author

#882 (comment)

Yes, I'm running on MAC/arm, likely those build flags has something to do with it.

I reverted it to base line.

@yavinash007
Copy link
Contributor Author

@lalitadithya Could take a look a the latest commit pls, made couple of changes while debugging e2e tests.

All the e2e tests are passing now.

Can we run tests on this MR pls ? If you dont have any major concerns

@lalitadithya
Copy link
Collaborator

/ok to test d0a73ef

@github-actions
Copy link

github-actions bot commented Mar 4, 2026

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler 20.90% (+0.02%) 👍
github.com/nvidia/nvsentinel/store-client/pkg/client 5.98% (ø)
github.com/nvidia/nvsentinel/store-client/pkg/datastore 3.80% (ø)
github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/postgresql 5.13% (ø)
github.com/nvidia/nvsentinel/tests 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler.go 20.90% (+0.02%) 1306 (-6) 273 (-1) 1033 (-5) 👍
github.com/nvidia/nvsentinel/store-client/pkg/client/mongodb_client.go 3.43% (ø) 3115 107 3008
github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/postgresql/health_events.go 0.04% (ø) 2761 1 2760
github.com/nvidia/nvsentinel/store-client/pkg/datastore/types.go 8.20% (ø) 122 10 112

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/store-client/pkg/client/mongodb_client_granular_test.go
  • github.com/nvidia/nvsentinel/tests/kubernetes_object_monitor_test.go

@lalitadithya
Copy link
Collaborator

/ok to test 352df82

@github-actions
Copy link

github-actions bot commented Mar 4, 2026

Merging this branch will decrease overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/labeler 0.00% (ø)
github.com/nvidia/nvsentinel/labeler/pkg/initializer 0.00% (ø)
github.com/nvidia/nvsentinel/labeler/pkg/labeler 80.41% (-1.69%) 👎

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/labeler/main.go 0.00% (ø) 86 0 86
github.com/nvidia/nvsentinel/labeler/pkg/initializer/init.go 0.00% (ø) 33 0 33
github.com/nvidia/nvsentinel/labeler/pkg/labeler/labeler.go 80.41% (-1.69%) 296 238 (-5) 58 (+5) 👎

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/labeler/pkg/labeler/labeler_test.go

@lalitadithya
Copy link
Collaborator

/ok to test fae77ef

@lalitadithya lalitadithya merged commit e6aa959 into NVIDIA:main Mar 4, 2026
65 of 66 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants