feat: add fabric-manager-monitor for GPU infrastructure health checks #891

Open

dmvevents wants to merge 3 commits into NVIDIA:main from dmvevents:feature/fabric-manager-monitor
Conversation

@dmvevents commented Feb 20, 2026

Summary

Adds a new health monitor that detects Fabric Manager failures, PCIe link degradation, NVLink fabric issues, GPU clock throttling, and CUDA context failures — failure modes invisible to DCGM-based monitoring.

Closes #883 (NVSentinel not detecting fabric health on H100s)
Related: #889, #890

Motivation

A customer's NVIDIA Fabric Manager failed to start after maintenance and remained down, undetected, for 20 days. NVSentinel's existing monitors (DCGM-based, syslog-based) missed it because individual GPUs appear healthy to DCGM even when Fabric Manager is down. This PR fills the gap with service-level and infrastructure health checks.

What's Included

1. Native Integration (health-monitors/fabric-manager-monitor/)

Follows the same architecture as gpu-health-monitor:

  • Click CLI with all configuration as CLI flags
  • Structlog JSON logging with module/version injection
  • gRPC HealthEvent protobuf to platform-connector via Unix domain socket
  • State caching — only state transitions generate events
  • Exponential backoff retries (10 retries, 1.5x delay)
  • Poetry project with multi-stage Dockerfile
  • CallbackInterface ABC + ThreadPoolExecutor for async callbacks
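The state-caching and retry bullets above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the names (`StateCachingPublisher`, `publish_if_changed`) are invented, and the real monitor publishes gRPC HealthEvent protobufs over a Unix domain socket rather than dicts.

```python
import time

# Hypothetical sketch of "only state transitions generate events" plus
# exponential backoff retries (10 retries, 1.5x delay). Names are illustrative.
class StateCachingPublisher:
    def __init__(self, send_fn, max_retries=10, base_delay=1.0, backoff=1.5):
        self._send_fn = send_fn      # e.g. a gRPC call over the Unix socket
        self._max_retries = max_retries
        self._base_delay = base_delay
        self._backoff = backoff
        self._last_state = {}        # check name -> last observed health state

    def publish_if_changed(self, check_name, healthy):
        """Publish only on a state transition; return True if an event was sent."""
        if self._last_state.get(check_name) == healthy:
            return False             # steady state: suppress duplicate events
        self._publish_with_retries({"check": check_name, "healthy": healthy})
        self._last_state[check_name] = healthy
        return True

    def _publish_with_retries(self, event):
        delay = self._base_delay
        for attempt in range(self._max_retries):
            try:
                self._send_fn(event)
                return
            except ConnectionError:
                if attempt == self._max_retries - 1:
                    raise            # retries exhausted; surface the error
                time.sleep(delay)
                delay *= self._backoff   # 1.5x exponential backoff
```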

2. Standalone Demo (demos/fabric-manager-monitor/)

Lightweight version exposing Prometheus metrics and alerts without requiring the full NVSentinel pipeline. Useful for evaluating the checks independently.

Check Categories

| # | Check | Fatal | Detection Method |
|---|-------|-------|------------------|
| 1 | Fabric Manager service down | Yes | nsenter + systemctl show |
| 2 | GPU services (persistenced, DCGM) | No | nsenter + systemctl show |
| 3 | PCIe link downtraining (Gen5→Gen3, x16→x8) | Yes | nsenter + nvidia-smi |
| 4 | NVLink fabric (CRC errors, bandwidth) | Yes | DCGM exporter HTTP |
| 5 | CUDA context validation | Yes | Subprocess torch test |
| 6 | GPU clock throttling | No | nsenter + nvidia-smi |
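The nsenter + systemctl show detection method used by checks 1 and 2 can be approximated as below. This is a sketch under stated assumptions: the helper names and the exact property list are illustrative, and the real checkers may query additional properties (e.g. NRestarts, with the systemd 252 fallback noted later).

```python
import subprocess

def parse_systemctl_show(output: str) -> dict:
    # systemctl show emits KEY=VALUE lines; split only on the first '='.
    return dict(line.split("=", 1) for line in output.splitlines() if "=" in line)

def systemd_service_props(service: str, host_pid: int = 1) -> dict:
    # Enter the host's mount and PID namespaces so systemctl talks to the
    # host's systemd rather than the container's (requires root + hostPID).
    cmd = [
        "nsenter", "-t", str(host_pid), "-m", "-p", "--",
        "systemctl", "show", service,
        "--property=ActiveState,SubState,MainPID",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_systemctl_show(result.stdout)
```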

False-Positive Mitigations

  • Boot grace period: Suppresses alerts during node startup (default 300s)
  • Flap detection: Tracks service restart frequency within a configurable window
  • GPU Idle filter: Ignores benign idle throttle reason (bitmask 0x0000000000000001)
  • NVLink correlation: Bandwidth-zero is only flagged unhealthy when correlated with Fabric Manager being down
  • State caching: Only state transitions generate HealthEvents
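The flap-detection mitigation above amounts to counting restarts inside a sliding window. A minimal sketch, assuming a hypothetical threshold of 3 restarts; the class name and defaults are illustrative, not the monitor's actual configuration:

```python
import time
from collections import deque

# Illustrative flap detector: a service is "flapping" when it restarts
# max_restarts or more times within window_seconds. clock is injectable
# for testing; the real monitor uses time.monotonic().
class FlapDetector:
    def __init__(self, window_seconds=600.0, max_restarts=3, clock=time.monotonic):
        self._window = window_seconds
        self._max_restarts = max_restarts
        self._clock = clock
        self._restarts = deque()  # monotonic timestamps of observed restarts

    def record_restart(self):
        self._restarts.append(self._clock())

    def is_flapping(self):
        # Drop restarts that aged out of the window, then compare the count.
        now = self._clock()
        while self._restarts and now - self._restarts[0] > self._window:
            self._restarts.popleft()
        return len(self._restarts) >= self._max_restarts
```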

Validated On

  • 2x p4d.24xlarge (8x A100-SXM4-40GB each) — Amazon Linux 2023, EKS 1.32
  • All 6 check categories produce correct metrics
  • GPU Idle downclocking correctly filtered as benign
  • 25 unit tests passing

Bug Fixes Discovered During Live Testing

  1. nsenter requires root — non-root USER in Dockerfile has empty CapEff even with privileged: true
  2. NRestarts property not recognized by systemd 252 (Amazon Linux 2023) — now queried separately with fallback
  3. GPU Idle (0x0000000000000001) false positive on clock throttle detection — added benign reasons filter
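The GPU Idle fix (item 3) reduces to a bitmask filter over nvidia-smi's active throttle reasons. A minimal sketch: only the 0x1 (GPU Idle) bit is taken from the PR text; the mask constant and function name are illustrative.

```python
# Benign throttle reasons that should not raise alerts. Only the GPU Idle
# bit (0x0000000000000001) comes from the PR; other benign bits could be
# OR'ed into this mask if needed.
BENIGN_REASONS_MASK = 0x0000000000000001  # GPU Idle

def is_real_throttle(active_reasons: int) -> bool:
    # Clear the benign bits; any remaining set bit is a genuine throttle cause.
    return (active_reasons & ~BENIGN_REASONS_MASK) != 0
```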

Test Plan

  • Unit tests (25 tests): pytest tests/
  • Live deployment on p4d.24xlarge nodes
  • Verified all Prometheus metrics correct via curl localhost:9101/metrics
  • Verified Fabric Manager detection: systemctl stop nvidia-fabricmanager causes fabric_manager_up to drop to 0
  • CI pipeline integration
  • Integration test with platform-connector gRPC endpoint
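The curl spot-check in the test plan can also be scripted. The helper below is an illustrative sketch: the metric name fabric_manager_up comes from the PR, but the parser assumes simple exposition lines without timestamps or spaces inside label values.

```python
def read_gauge(exposition_text: str, metric_name: str) -> float:
    """Return the first sample value for metric_name from /metrics text."""
    for line in exposition_text.splitlines():
        parts = line.split()
        # Skip comment lines (# HELP / # TYPE) and malformed lines.
        if len(parts) < 2 or parts[0].startswith("#"):
            continue
        # parts[0] is metric_name{labels}; strip the label block to compare.
        if parts[0].split("{")[0] == metric_name:
            return float(parts[1])
    raise KeyError(metric_name)
```

For example, per the test plan above, after stopping nvidia-fabricmanager, reading fabric_manager_up from the text fetched at localhost:9101/metrics should yield 0.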

Summary by CodeRabbit

  • New Features

    • Fabric Manager Monitor demo: standalone GPU node health validator exposing Prometheus metrics (port 9101) with service, PCIe, NVLink, clock throttling, and CUDA validation checks; deployable as a DaemonSet with ConfigMap and RBAC.
  • Monitoring / Alerts

    • ServiceMonitor and PrometheusRule bundle with seven alerts covering fabric manager, NVLink, PCIe, GPU clocks, CUDA validation, GPU services, and flapping.
  • Packaging

    • Container images, CLI entrypoint, and gRPC publishing support for health events.
  • Documentation

    • New READMEs with quick-start, metrics, alerts, and deployment guidance.
  • Tests

    • Unit tests for checks and monitor logic.

copy-pr-bot commented Feb 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


coderabbitai (Contributor) commented Feb 20, 2026

📝 Walkthrough

Adds a Fabric Manager Monitor demo and a production-grade fabric-manager-monitor health monitor implementing multiple GPU health checks, Prometheus metrics, Kubernetes manifests, container builds, gRPC platform-connector publishing, and unit tests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Docs & READMEs**<br>demos/README.md, demos/fabric-manager-monitor/README.md, health-monitors/fabric-manager-monitor/README.md | New demo entry and comprehensive READMEs describing the Fabric Manager Monitor, deployment, metrics, alerts, and operational guidance. |
| **Demo Container & Runtime**<br>demos/fabric-manager-monitor/Dockerfile, demos/fabric-manager-monitor/requirements.txt | Added demo Dockerfile and Python requirements to build/run the demo monitor container. |
| **Demo Config & K8s Manifests**<br>demos/fabric-manager-monitor/k8s/configmap.yaml, .../daemonset.yaml, .../rbac.yaml, .../servicemonitor.yaml | Added ConfigMap, privileged DaemonSet (hostPID, nsenter), RBAC, ServiceMonitor, and PrometheusRule with seven alert rules and a headless Service. |
| **Demo Monitor Core**<br>demos/fabric-manager-monitor/monitor.py, demos/fabric-manager-monitor/config.py, demos/fabric-manager-monitor/metrics.py | New demo monitor entrypoint, environment-driven MonitorConfig, and Prometheus metric definitions used by the demo. |
| **Demo Checkers & Tests**<br>demos/fabric-manager-monitor/checks/*, demos/fabric-manager-monitor/tests/* | Multiple demo check modules (service, pcie, clock, fabric, cuda) plus unit tests for config and check logic. |
| **Production Packaging & Build**<br>health-monitors/fabric-manager-monitor/Dockerfile, health-monitors/fabric-manager-monitor/pyproject.toml | Multi-stage build and Poetry project metadata to build/package the production monitor as a wheel and runtime image. |
| **Production Checker Types & Watcher**<br>health-monitors/.../checkers/types.py, .../checkers/watcher.py | Added CheckResult dataclass, CallbackInterface, and FabricManagerWatcher orchestrator that runs checks, collects results, updates metrics, and dispatches callbacks. |
| **Production Checkers**<br>health-monitors/.../checkers/service_check.py, .../pcie_check.py, .../clock_check.py, .../fabric_check.py, .../cuda_validation.py | Implemented robust production checkers: systemd/journal service checks with flap detection, PCIe link checks, clock throttling detection, NVLink fabric via DCGM metrics, and subprocess CUDA validation. |
| **Platform Connector Integration**<br>health-monitors/.../platform_connector/event_processor.py, .../platform_connector/metrics.py, .../protos/*, .../protos/*_grpc.py | PlatformConnectorEventProcessor to convert CheckResult → HealthEvent protobufs and publish over UDS gRPC with retries, caching, and publishing metrics; added protobuf stubs and gRPC client/server code. |
| **CLI, Logger & Metrics (prod)**<br>health-monitors/.../cli.py, .../logger.py, .../metrics.py | New Click CLI to run the watcher, structured JSON logger initializer, and centralized Prometheus metric declarations for the production monitor. |
| **Packaging, init & license headers**<br>demos/fabric-manager-monitor/checks/__init__.py, multiple __init__.py files | Added package initializers, license headers, and packaging scaffolding across demo and production packages. |

Sequence Diagram(s)

sequenceDiagram
    participant Watcher as FabricManagerWatcher
    participant ServiceChk as ServiceChecker
    participant PCIeChk as PCIeChecker
    participant ClockChk as ClockChecker
    participant NVLinkChk as NVLinkFabricChecker
    participant CUDAVal as CUDAValidator
    participant Processor as PlatformConnectorEventProcessor
    participant gRPC as gRPC_UDS

    Watcher->>Watcher: start poll loop
    par Parallel checks
        Watcher->>ServiceChk: check_fabric_manager()
        ServiceChk-->>Watcher: FabricManagerStatus
        Watcher->>PCIeChk: check()
        PCIeChk-->>Watcher: List[PCIeStatus]
        Watcher->>ClockChk: check()
        ClockChk-->>Watcher: List[ClockStatus]
        Watcher->>NVLinkChk: check()
        NVLinkChk-->>Watcher: NVLinkStatus
        opt CUDA enabled
            Watcher->>CUDAVal: check()
            CUDAVal-->>Watcher: CUDAValidationResult
        end
    end
    Watcher->>Watcher: aggregate CheckResult[]
    Watcher->>Processor: health_check_completed(CheckResult[])
    Processor->>Processor: build HealthEvent(s), dedupe via cache
    Processor->>gRPC: send with retries
    gRPC-->>Processor: ack / error
    Processor->>Processor: update cache & metrics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop through logs and NVLink lanes so bright,
I sniff the clocks and chase the PCIe light,
I count restarts, parse journals, and hum,
Prometheus listens while gRPC goes drum,
A carrot for healthy GPUs — sleep well, little chum.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 53.04%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and specifically describes the main change: adding a fabric-manager-monitor component for GPU infrastructure health checks. It is concise and directly related to the primary objective. |
| Linked Issues check | ✅ Passed | The PR objectives and implementation align with linked issue #883: detects Fabric Manager service failures, NVLink fabric degradation, PCIe downtraining, CUDA validation failures, and clock throttling via multiple checks (systemctl, nvidia-smi, DCGM exporter, torch). Implements monitoring for the fabric.state and fabric.status scenarios described in the issue. |
| Out of Scope Changes check | ✅ Passed | All changes are in-scope: a new health-monitors/fabric-manager-monitor/ directory with CLI, logging, gRPC integration, checkers, and metrics, plus demos/fabric-manager-monitor/ with Dockerfile, K8s manifests, and standalone Prometheus metrics. No modifications to existing NVSentinel core components or unrelated systems. |




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🟡 Minor comments (13)
demos/fabric-manager-monitor/checks/__init__.py-1-1 (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Missing Apache 2.0 license header.

Every other __init__.py in this PR includes the license header. This file should follow the same pattern.

🔧 Proposed fix
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """Health check modules for GPU Node Health Validator."""

Based on learnings: "All files must include Apache 2.0 license header".

demos/fabric-manager-monitor/README.md-38-39 (1)

38-39: ⚠️ Potential issue | 🟡 Minor

port-forward to a DaemonSet selects an arbitrary pod.

On a multi-node cluster, kubectl port-forward ds/fabric-manager-monitor routes to one pod (not necessarily the node exhibiting the health issue). A more useful Quick Start instruction would forward to the specific pod on the target node:

# Forward to the pod running on a specific node
NODE=<node-name>
POD=$(kubectl get pod -n nvsentinel -o wide --field-selector spec.nodeName=${NODE} \
  -l app=fabric-manager-monitor -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n nvsentinel pod/${POD} 9101:9101
demos/fabric-manager-monitor/k8s/rbac.yaml-24-26 (1)

24-26: ⚠️ Potential issue | 🟡 Minor

Remove cluster-wide pods access—the monitor does not use Kubernetes pod API.

The monitor code uses only nsenter-based systemd checks, HTTP calls to the DCGM exporter, and direct subprocess calls. No Kubernetes client library or pod API calls are present anywhere in the codebase. The get, list verbs on pods violate least privilege.

Either:

  1. Remove the pods resource entirely if not needed, or
  2. If pod access is genuinely required (e.g., to locate the DCGM exporter dynamically), replace ClusterRole with a namespace-scoped Role + RoleBinding for the nvsentinel namespace.
demos/fabric-manager-monitor/checks/fabric_check.py-72-79 (1)

72-79: ⚠️ Potential issue | 🟡 Minor

Return type annotation is incorrect — actual return is Dict[str, list].

_fetch_metrics is annotated as -> Dict[str, float], but _parse_prometheus_text (Line 83) builds and returns a Dict[str, list] (metric name → list of float values). The downstream _sum_metric correctly treats the values as lists. The annotation should match.

Proposed fix
-    def _fetch_metrics(self) -> Dict[str, float]:
+    def _fetch_metrics(self) -> Dict[str, list]:
demos/fabric-manager-monitor/checks/fabric_check.py-81-96 (1)

81-96: ⚠️ Potential issue | 🟡 Minor

Return type annotation on _parse_prometheus_text is also Dict[str, float] but returns Dict[str, list].

Same mismatch — the docstring says {metric_name: [values]} (list), but the signature says Dict[str, float].

Proposed fix
-    def _parse_prometheus_text(self, text: str) -> Dict[str, float]:
+    def _parse_prometheus_text(self, text: str) -> Dict[str, list]:
demos/README.md-27-37 (1)

27-37: ⚠️ Potential issue | 🟡 Minor

Missing **Time:** field — inconsistent with existing demo entries.

The other two demos both include a **Time:** line (e.g., "5-10 minutes"). The new entry omits it, breaking the established README format. Add an estimated time or note that a GPU cluster is required for accurate timing.

📝 Suggested addition
 **Requirements:** Docker, kubectl, Kubernetes cluster with GPU nodes, Prometheus Operator
+
+**Time:** 10-20 minutes (requires GPU node access)
health-monitors/fabric-manager-monitor/Dockerfile-18-18 (1)

18-18: ⚠️ Potential issue | 🟡 Minor

ARG default does not support shell command substitution.

Docker's ARG directive treats the default value as a literal string. $(date -u +%FT%TZ) will not be evaluated — the default will be the literal string $(date -u +%FT%TZ). Since the comment says CI passes this value, consider using a simple static fallback instead.

Proposed fix
-ARG BUILD_DATE=$(date -u +%FT%TZ)
+ARG BUILD_DATE="unknown"
demos/fabric-manager-monitor/k8s/servicemonitor.yaml-84-95 (1)

84-95: ⚠️ Potential issue | 🟡 Minor

PCIe link degraded alert uses severity: warning, but the check is classified as fatal.

The README and check categories table classify PCIe link degradation as fatal (is_fatal=True), which triggers RESTART_BM remediation in the event processor. A warning severity alert is inconsistent with that classification — consider critical to match.

Proposed fix
           labels:
-            severity: warning
+            severity: critical
             component: fabric-manager-monitor
health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/fabric_check.py-96-111 (1)

96-111: ⚠️ Potential issue | 🟡 Minor

Prometheus text parser mishandles lines with optional timestamps.

line.rsplit(" ", 1) will parse metric{labels} 123.0 1609459200 as value_str = "1609459200" (the timestamp), silently discarding the actual value. While DCGM exporter typically omits timestamps, this is a latent correctness issue if the exporter configuration changes or a proxy adds them.

Proposed fix: split on whitespace and take the second token
     def _parse_prometheus_text(self, text: str) -> Dict[str, list]:
         """Parse Prometheus exposition format into {metric_name: [values]}."""
         metrics: Dict[str, list] = {}
         for line in text.splitlines():
             line = line.strip()
             if not line or line.startswith("#"):
                 continue
-            # Lines look like: DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0",...} 1234.0
             try:
-                name_and_labels, value_str = line.rsplit(" ", 1)
-                name = name_and_labels.split("{")[0]
-                value = float(value_str)
+                # Format: metric_name{labels} value [timestamp]
+                parts = line.split()
+                if len(parts) < 2:
+                    continue
+                name = parts[0].split("{")[0]
+                value = float(parts[-2] if len(parts) >= 3 else parts[1])
                 metrics.setdefault(name, []).append(value)
             except (ValueError, IndexError):
                 continue
         return metrics

The simpler and more correct approach is to always take parts[1] as the value. Per the Prometheus exposition format (metric_name{labels} value [timestamp]), the DCGM exporter's label block contains no spaces, so parts[0] is always metric_name{labels} and parts[1] is always the value:

Simpler fix
             try:
-                name_and_labels, value_str = line.rsplit(" ", 1)
-                name = name_and_labels.split("{")[0]
-                value = float(value_str)
+                parts = line.split()
+                name = parts[0].split("{")[0]
+                value = float(parts[1])
                 metrics.setdefault(name, []).append(value)
             except (ValueError, IndexError):
                 continue

health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/clock_check.py-69-79 (1)

69-79: ⚠️ Potential issue | 🟡 Minor

Exact-match benign check may not handle compound throttle reason strings.

nvidia-smi --query-gpu=clocks_throttle_reasons.active can return compound strings when multiple reasons are active (e.g., pipe-separated reasons). An exact match against _BENIGN_REASONS will fail for a string like "GPU Idle | Applications Clocks Setting", causing it to fall through to elif reason_str:throttled = True.

This errs on the safe side (false-positive, not false-negative), but it undermines the stated GPU Idle false-positive mitigation. Consider splitting the reason string and checking whether all active reasons are benign:

Proposed fix
             # GPU Idle causes low clock ratio but isn't a real throttle.
             # Only flag as throttled for non-benign reasons.
-            if reason_str in self._BENIGN_REASONS:
-                status.throttled = False
-            elif reason_str:
-                status.throttled = True
+            if reason_str:
+                # Split on common delimiters (pipe, comma) and strip whitespace
+                individual_reasons = [r.strip() for r in reason_str.replace("|", ",").split(",") if r.strip()]
+                non_benign = [r for r in individual_reasons if r not in self._BENIGN_REASONS]
+                status.throttled = len(non_benign) > 0
+            else:
+                # No reason data available; keep ratio-based throttled flag
+                pass
health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py-103-107 (1)

103-107: ⚠️ Potential issue | 🟡 Minor

log.exception() outside an except block produces a misleading traceback.

done_callback is invoked by the thread pool executor, not inside an except block. log.exception() at line 106 will append NoneType: None as the traceback, which is confusing. Use log.error() instead:

Proposed fix
         def done_callback(class_name: str, future):
             e = future.exception()
             if e is not None:
-                log.exception(f"Callback failed: {e}")
+                log.error("Callback %s failed: %s", class_name, e, exc_info=e)
                 metrics.callback_failures.labels(class_name, "health_check_completed").inc()
health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py-146-153 (1)

146-153: ⚠️ Potential issue | 🟡 Minor

int() conversion of MainPID can raise ValueError.

If MainPID contains an unexpected non-numeric value (e.g., empty string from a parsing edge case), int(props.get("MainPID", "0")) will raise ValueError. This is caught by the outer except Exception block, which returns active=False — a misleading result when only the PID field is malformed while the service is actually active.

Proposed fix
             return ServiceStatus(
                 name=service_name,
                 active=(active_state == "active"),
                 sub_state=props.get("SubState", ""),
-                main_pid=int(props.get("MainPID", "0")),
+                main_pid=int(props.get("MainPID", "0") or "0"),
                 n_restarts=n_restarts,
                 start_timestamp=props.get("ExecMainStartTimestamp", ""),
             )

This handles the empty-string case. For full robustness, wrap in a try/except:

try:
    main_pid = int(props.get("MainPID", "0") or "0")
except ValueError:
    main_pid = 0
health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py-184-203 (1)

184-203: ⚠️ Potential issue | 🟡 Minor

Flap detection misses restarts if the systemd NRestarts counter resets.

If the counter resets to 0 (e.g., after daemon-reload or unit file change), the comparison current_restarts > last_count will be false until the counter surpasses the previously stored value, silently missing restart events. Consider also detecting counter decreases as an indication of a reset.

Proposed fix
         last_count = self._last_restart_count[service_name]
-        if current_restarts > last_count:
+        if current_restarts < last_count:
+            # Counter reset detected (daemon-reload, unit file change, etc.)
+            log.info("Restart counter reset for %s: %d -> %d", service_name, last_count, current_restarts)
+            self._last_restart_count[service_name] = current_restarts
+        elif current_restarts > last_count:
             # New restarts detected -- record timestamp for each
             now = time.monotonic()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py`
around lines 184 - 203, _update_flap_tracking currently only records new
restarts when current_restarts > last_count and thus misses events if the
systemd NRestarts counter resets (current_restarts < last_count). Modify
_update_flap_tracking to detect a decrease in current_restarts vs
_last_restart_count (counter reset) and in that case treat the current_restarts
value as new restart events: record a timestamp for each of the current_restarts
(append now current_restarts times to self._restart_history[service_name]) and
then update _last_restart_count[service_name] = current_restarts; keep the
existing logic for the normal increase case and the pruning logic unchanged (use
time.monotonic() and self._flap_window and deque operations as already used).
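
The reset-aware tracking described above can be sketched as a standalone class (an illustrative stand-in, not the PR's actual ServiceChecker; the window/threshold names mirror the review but are assumptions here):

```python
import time
from collections import deque


class FlapTracker:
    """Restart-flap tracking that tolerates NRestarts counter resets."""

    def __init__(self, flap_window: float = 300.0, flap_threshold: int = 3) -> None:
        self.flap_window = flap_window
        self.flap_threshold = flap_threshold
        self._restart_history: dict[str, deque] = {}
        self._last_restart_count: dict[str, int] = {}

    def update(self, service_name: str, current_restarts: int) -> None:
        history = self._restart_history.setdefault(service_name, deque())
        last_count = self._last_restart_count.get(service_name, 0)
        now = time.monotonic()
        if current_restarts < last_count:
            # Counter reset (daemon-reload, unit file change): treat the new
            # counter value as fresh restart events instead of ignoring it.
            history.extend([now] * current_restarts)
        elif current_restarts > last_count:
            history.extend([now] * (current_restarts - last_count))
        self._last_restart_count[service_name] = current_restarts
        # Prune timestamps that have aged out of the flap window.
        while history and now - history[0] > self.flap_window:
            history.popleft()

    def is_flapping(self, service_name: str) -> bool:
        return len(self._restart_history.get(service_name, ())) >= self.flap_threshold
```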

Comment on lines +1 to +3
prometheus_client>=0.20.0
requests>=2.31.0
pyyaml>=6.0
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find demos/fabric-manager-monitor -type f -name "Dockerfile" -o -name "README*" | head -20

Repository: NVIDIA/NVSentinel

Length of output: 139


🏁 Script executed:

cat demos/fabric-manager-monitor/Dockerfile 2>/dev/null || echo "No Dockerfile found in root of demos/fabric-manager-monitor"

Repository: NVIDIA/NVSentinel

Length of output: 598


🏁 Script executed:

find demos/fabric-manager-monitor -type f -name "*.md" | xargs grep -l "CUDA\|torch\|pytorch" 2>/dev/null

Repository: NVIDIA/NVSentinel

Length of output: 99


🏁 Script executed:

grep -r "subprocess.*torch\|torch.*test" demos/fabric-manager-monitor --include="*.py" --include="*.md" 2>/dev/null

Repository: NVIDIA/NVSentinel

Length of output: 43


🏁 Script executed:

grep -r "import torch" demos/fabric-manager-monitor --include="*.py" 2>/dev/null

Repository: NVIDIA/NVSentinel

Length of output: 136


🏁 Script executed:

cat demos/fabric-manager-monitor/README.md

Repository: NVIDIA/NVSentinel

Length of output: 3299


🏁 Script executed:

cat demos/fabric-manager-monitor/checks/cuda_validation.py

Repository: NVIDIA/NVSentinel

Length of output: 3287


Add PyTorch to requirements.txt.

The CUDA validation check (check 5 in README) requires PyTorch to execute. While the code gracefully handles missing PyTorch by returning a failed validation result, PyTorch should be explicitly listed in requirements.txt so the check can function as intended. Currently, if the Dockerfile installs only from this file, the CUDA validation will always report failure with "PyTorch not available" rather than performing actual CUDA context and memory tests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/requirements.txt` around lines 1 - 3, Add
PyTorch to requirements.txt so the README's CUDA validation check (check 5) can
actually run; update the file (the listed lines with prometheus_client,
requests, pyyaml) to include the PyTorch package (e.g., add a "torch>=2.0.0"
entry) and ensure Dockerfile/pip install uses that requirements.txt so the
validation will import torch instead of failing with "PyTorch not available"
during the CUDA validation check.
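
For context, a graceful-degradation check of the kind the review describes might look like the following sketch (not the demo's actual cuda_validation.py; the return-dict keys are assumptions):

```python
def validate_cuda_context() -> dict:
    """Exercise CUDA context creation, degrading gracefully without PyTorch."""
    try:
        import torch  # heavyweight optional dependency; see the requirements.txt note
    except ImportError:
        # Without torch in requirements.txt this branch always fires.
        return {"healthy": False, "reason": "PyTorch not available"}

    if not torch.cuda.is_available():
        return {"healthy": False, "reason": "CUDA not available"}

    try:
        # Allocate and free a small tensor to exercise context creation
        # and device memory on GPU 0.
        x = torch.ones(1024, device="cuda:0")
        del x
        torch.cuda.synchronize()
        return {"healthy": True, "reason": "CUDA context OK"}
    except RuntimeError as exc:
        return {"healthy": False, "reason": f"CUDA context failure: {exc}"}
```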

default="EXECUTE_REMEDIATION",
help="Event processing strategy: EXECUTE_REMEDIATION or STORE_ONLY",
)
@click.option("--verbose", type=bool, default=False, help="Enable debug logging")
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Click python library type=bool vs is_flag=True documentation

💡 Result:

In Click, type=bool and is_flag=True mean different things:

  • type=bool (or type=click.BOOL): the option expects a value (e.g. --debug true, --debug false) and Click will parse common truthy/falsey strings ("1"/"0", "true"/"false", "yes"/"no", etc.). Click documents this as the bool/click.BOOL parameter type. [1]

  • is_flag=True: the option becomes a boolean flag that takes no value; it’s False when absent and True when present (e.g. --verbose). Click documents this under “Boolean flags”. [2]

Related: defining a pair like --shout/--no-shout makes Click treat it as a boolean flag automatically (implicit is_flag=True). [2]

Sources:
[1] Click “Parameter Types” (bool / click.BOOL) documentation. (click.palletsprojects.com)
[2] Click “Options” documentation (Boolean flags, --flag/--no-flag, is_flag=True). (pocoo-click.readthedocs.io)

Citations:


🏁 Script executed:

fd cli.py | grep -E "fabric-manager-monitor"

Repository: NVIDIA/NVSentinel

Length of output: 129


🏁 Script executed:

cat -n health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py | head -150

Repository: NVIDIA/NVSentinel

Length of output: 6862


🏁 Script executed:

# Let me also check if there are any type hints on the cli function parameters
cat -n health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py | sed -n '77,94p'

Repository: NVIDIA/NVSentinel

Length of output: 545


--verbose with type=bool requires an explicit argument value; use is_flag=True instead.

With type=bool, Click expects --verbose true or --verbose false. Running --verbose alone (the natural usage) will fail because Click expects a value. Use is_flag=True to make it a boolean flag.

Proposed fix
-@click.option("--verbose", type=bool, default=False, help="Enable debug logging")
+@click.option("--verbose", is_flag=True, default=False, help="Enable debug logging")

Additionally, the cli() function parameters lack type hints, which violates the coding guideline requiring type hints for all functions.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@click.option("--verbose", type=bool, default=False, help="Enable debug logging")
@click.option("--verbose", is_flag=True, default=False, help="Enable debug logging")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py` at line
76, The Click option for verbose currently uses type=bool which requires an
explicit value; change the `@click.option` call for "--verbose" to use
is_flag=True (and keep default=False) so the flag can be passed without a value,
and update any help text if needed. Also add missing type hints to the cli()
function signature (e.g., annotate parameters like verbose: bool and any others
present) to comply with the coding guideline; ensure the function body and any
callers accept the annotated types.
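
The distinction is easy to verify with Click's test runner; a minimal hypothetical command (not the PR's cli()):

```python
import click
from click.testing import CliRunner


@click.command()
@click.option("--verbose", is_flag=True, default=False, help="Enable debug logging")
def demo(verbose: bool) -> None:
    click.echo(f"verbose={verbose}")


# With is_flag=True, `--verbose` alone works; with type=bool, Click would
# instead demand an explicit value like `--verbose true`.
runner = CliRunner()
result = runner.invoke(demo, ["--verbose"])
assert result.output.strip() == "verbose=True"
```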

protobuf = ">=6.31.1,<7.0.0"
googleapis-common-protos = ">=1.56.0"
structlog = "^25.1.0"
requests = "^2.31.0"
Contributor

⚠️ Potential issue | 🟠 Major

requests = "^2.31.0" allows versions vulnerable to CVE-2024-35195 and CVE-2024-47081.

CVE-2024-35195 (Medium) affects requests prior to 2.32.2 — if the first request in a Session uses verify=False, all subsequent requests to the same host will continue ignoring cert verification. CVE-2024-47081 is also present in earlier releases, affecting netrc credential handling with a maliciously crafted URL. The minimum version 2.31.0 is below both fix points. The lower bound should be raised to ^2.32.3 (the current stable release).

🔒 Proposed fix
-requests = "^2.31.0"
+requests = "^2.32.3"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
requests = "^2.31.0"
requests = "^2.32.3"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-monitors/fabric-manager-monitor/pyproject.toml` at line 17, Update the
requests dependency in pyproject.toml to require a fixed minimum that includes
CVE fixes: replace the current spec "requests = \"^2.31.0\"" with a constraint
that forces at least 2.32.3 (e.g., "^2.32.3") so the project pulls versions that
include fixes for CVE-2024-35195 and CVE-2024-47081; ensure you then run your
dependency install/lock command to regenerate the lockfile (poetry lock / poetry
update) so the new safe version is recorded.

Comment on lines +28 to +29
[tool.poetry.scripts]
fabric_manager_monitor = "fabric_manager_monitor.cli:cli"
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if a Helm chart exists for fabric-manager-monitor
echo "=== Searching for fabric-manager-monitor Helm chart ==="
fd "Chart.yaml" distros/kubernetes/nvsentinel/charts/ 2>/dev/null | grep -i fabric || echo "No fabric-manager chart found"

echo ""
echo "=== All Helm charts in distros/kubernetes/nvsentinel/charts/ ==="
fd "Chart.yaml" distros/kubernetes/nvsentinel/charts/ 2>/dev/null

echo ""
echo "=== Health monitors in health-monitors/ ==="
ls -d health-monitors/*/ 2>/dev/null || echo "Directory not found"

echo ""
echo "=== Checking fabric-manager-monitor pyproject.toml ==="
if [ -f "health-monitors/fabric-manager-monitor/pyproject.toml" ]; then
  grep -A 2 "\[tool.poetry.scripts\]" health-monitors/fabric-manager-monitor/pyproject.toml
else
  echo "File not found"
fi

Repository: NVIDIA/NVSentinel

Length of output: 2088


Add Helm chart for fabric-manager-monitor native NVSentinel integration.

The health monitor under health-monitors/fabric-manager-monitor/ exposes a console-scripts entry point (fabric_manager_monitor) but lacks a corresponding Helm chart under distros/kubernetes/nvsentinel/charts/. All other health monitors in the repository include one: csp-health-monitor, gpu-health-monitor, kubernetes-object-monitor, and syslog-health-monitor.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-monitors/fabric-manager-monitor/pyproject.toml` around lines 28 - 29,
The project defines a console script entry point "fabric_manager_monitor"
(specified as fabric_manager_monitor.cli:cli) but there is no corresponding Helm
chart for native NVSentinel integration; add a Helm chart named to match the
monitor (e.g., fabric-manager-monitor) mirroring the structure of the other
monitors' charts: include Chart.yaml, values.yaml, templates (Deployment,
Service, ServiceAccount, RBAC, ConfigMap as needed) and any required hooks, and
update the charts index/packaging so the new chart is published alongside the
existing monitor charts for NVSentinel.

…NVIDIA#883)

Add a new health monitor that detects Fabric Manager failures, PCIe link
degradation, NVLink fabric issues, GPU clock throttling, and CUDA context
failures — failure modes invisible to DCGM-based monitoring.

Two implementations included:

1. **demos/fabric-manager-monitor/** — Standalone DaemonSet exposing
   Prometheus metrics and alerts. Validated on 2x P4d.24xlarge
   (A100-SXM4-40GB) with Amazon Linux 2023, EKS 1.32.

2. **health-monitors/fabric-manager-monitor/** — Native NVSentinel
   integration using the same patterns as gpu-health-monitor: click CLI,
   structlog JSON logging, gRPC HealthEvent protobuf to platform-connector
   via Unix domain socket, state caching, exponential backoff retries.

Check categories:
- Fabric Manager service status via nsenter + systemctl
- Critical GPU services (persistenced, DCGM) via nsenter + systemctl
- PCIe link downtraining (Gen5→Gen3, x16→x8) via nsenter + nvidia-smi
- NVLink fabric health via DCGM exporter metrics (with FM correlation)
- CUDA context validation via subprocess
- GPU clock throttling with idle-state false-positive filtering

Bug fixes discovered during live testing:
- nsenter requires root (non-root USER has empty CapEff)
- systemd NRestarts property missing on Amazon Linux 2023 (systemd 252)
- GPU Idle (0x0000000000000001) false positive on clock throttle detection

Closes NVIDIA#883
Related: NVIDIA#889, NVIDIA#890

Signed-off-by: Anton Alexander <dmvevents@gmail.com>
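
The exponential-backoff behaviour named above (10 retries, 1.5x delay) amounts to a policy like the following sketch. This is illustrative only; the monitor's actual gRPC retry wrapper may differ, and the injectable sleep parameter is an assumption added for testability:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 10,
    initial_delay: float = 1.0,
    multiplier: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry fn up to max_retries times, growing the delay 1.5x per failure."""
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sleep(delay)
            delay *= multiplier
    raise RuntimeError("unreachable")
```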
@dmvevents dmvevents force-pushed the feature/fabric-manager-monitor branch from d54c7ca to c4c7789 Compare February 20, 2026 17:24
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 15

🧹 Nitpick comments (35)
health-monitors/fabric-manager-monitor/Dockerfile (2)

53-58: No WORKDIR set in the runtime stage

Without an explicit WORKDIR, the COPY --from=build instructions deposit fabric_manager_monitor*.whl and constraints.txt into / (the filesystem root). While functionally correct (pip resolves ./fabric_manager_monitor*.whl relative to the working directory which defaults to /), it is unconventional and leaves artefacts in the root.

♻️ Proposed fix
 FROM public.ecr.aws/docker/library/python:3.10-slim-bookworm AS runtime

+WORKDIR /app
+
 # util-linux provides nsenter ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-monitors/fabric-manager-monitor/Dockerfile` around lines 53 - 58, The
runtime stage lacks a WORKDIR so the COPY --from=build deposits files at /; add
an explicit WORKDIR (e.g., WORKDIR /app or /opt/fabric_manager_monitor) in the
runtime stage before the COPY and RUN lines so that COPY --from=build
/app/dist/*.whl ./ and COPY --from=build /app/constraints.txt ./ place artifacts
into that directory and the subsequent pip install ./fabric_manager_monitor*.whl
--constraint constraints.txt runs from that working directory; ensure the chosen
directory is used consistently for any later file references and leaves the
filesystem root clean.

24-25: Upgrade Poetry to 2.3.2 and install the required export plugin

Poetry 1.8.2 is over a year outdated; the current stable version is 2.3.2. Upgrading requires addressing two breaking changes in Poetry 2.x:

  • poetry export (line 39) was moved to a separate plugin and will fail with "command does not exist" unless poetry-plugin-export is installed.
  • The [tool.poetry] format used in all pyproject.toml files is backward-compatible with Poetry 2.x, but verify your pyproject.toml does not have any 1.x-specific config that Poetry 2.x no longer supports.
♻️ Proposed update for Poetry 2.x
 RUN --mount=type=cache,target=/root/.cache/pip \
-    pip install poetry==1.8.2
+    pip install poetry==2.3.2 poetry-plugin-export
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-monitors/fabric-manager-monitor/Dockerfile` around lines 24 - 25,
Update the Dockerfile to install Poetry 2.3.2 instead of 1.8.2 and also install
the export plugin so `poetry export` works: change the pip install target from
poetry==1.8.2 to poetry==2.3.2 and add installation of the export plugin
(poetry-plugin-export) in the same RUN step; then verify the repository's
pyproject.toml files are compatible with Poetry 2.x (no deprecated 1.x-only
settings) and adjust any 1.x-specific config before relying on `poetry export`.
demos/fabric-manager-monitor/k8s/daemonset.yaml (1)

46-51: Consider adding a readinessProbe alongside the liveness probe.

Without a readiness probe, the Service starts routing traffic (and Prometheus starts scraping) before the monitor completes its initial health check pass and populates metrics. A simple HTTP readiness probe on /metrics (or a dedicated /ready endpoint) would prevent scraping stale or empty metric series during startup.

✨ Proposed addition
           livenessProbe:
             httpGet:
               path: /
               port: metrics
             initialDelaySeconds: 10
             periodSeconds: 30
+          readinessProbe:
+            httpGet:
+              path: /metrics
+              port: metrics
+            initialDelaySeconds: 15
+            periodSeconds: 30
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/k8s/daemonset.yaml` around lines 46 - 51, Add a
readinessProbe next to the existing livenessProbe in the DaemonSet spec so the
Service/Prometheus don't scrape before the monitor is ready; specifically, under
the same container that defines livenessProbe (look for livenessProbe and
httpGet path: / port: metrics) add a readinessProbe that uses httpGet to hit
/metrics (or /ready if you have a dedicated endpoint) with conservative
initialDelaySeconds and periodSeconds values similar to the livenessProbe so the
pod only becomes ready after metrics are populated.
demos/fabric-manager-monitor/checks/cuda_validation.py (1)

91-92: Move import json to the top-level import block.

Per PEP 8, all imports should be at the top of the file.

♻️ Proposed fix
+import json
 import logging
 import subprocess
 import sys
 import textwrap
-            import json
             data = json.loads(result.stdout.strip())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/checks/cuda_validation.py` around lines 91 - 92,
Move the local import json inside the CUDA validation check to the module-level
imports: remove the inline "import json" and add "import json" at the top import
block of the file so json.loads(result.stdout.strip()) in the CUDA validation
code (where result and data are used) uses the top-level import per PEP8.
demos/fabric-manager-monitor/tests/test_monitor.py (2)

101-101: Rename unused mock arguments to _ prefix to satisfy Ruff ARG002.

MockPCIe, MockClk, and MockNVL are needed as parameters (the @patch decorators inject them) but are never referenced in the test body. Prefix with _ to make the intentional non-use explicit.

♻️ Proposed fix (both affected test methods)
-    def test_check_cycle_all_healthy(self, MockSvc, MockPCIe, MockClk, MockNVL):
+    def test_check_cycle_all_healthy(self, MockSvc, _MockPCIe, _MockClk, _MockNVL):
-    def test_check_cycle_fabric_manager_down(self, MockSvc, MockPCIe, MockClk, MockNVL):
+    def test_check_cycle_fabric_manager_down(self, MockSvc, _MockPCIe, _MockClk, _MockNVL):

Also applies to: 126-126

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/tests/test_monitor.py` at line 101, Rename the
unused patched parameters in the test functions (e.g.,
test_check_cycle_all_healthy and the other affected test method) by prefixing
them with an underscore so Ruff ARG002 is satisfied; specifically change
MockPCIe, MockClk, and MockNVL to _MockPCIe, _MockClk, and _MockNVL (or _pcie,
_clk, _nvl) in the test function signatures so the `@patch-injected` mocks remain
but are clearly marked as intentionally unused.

81-95: _make_monitor is missing type annotations (coding guideline).

♻️ Proposed fix
+from typing import Any
 ...
-    def _make_monitor(self, **overrides):
+    def _make_monitor(self, **overrides: Any) -> FabricManagerMonitor:

As per coding guidelines: "Type hints required for all functions in Python code."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/tests/test_monitor.py` around lines 81 - 95, The
helper function _make_monitor lacks type annotations; add annotations for the
parameter and return type (e.g., def _make_monitor(self, **overrides: Any) ->
FabricManagerMonitor) and type the locals where appropriate (declare defaults:
Dict[str, Any] or annotate the config variable as MonitorConfig) and ensure
necessary typing imports (from typing import Any, Dict) are added so
_make_monitor, MonitorConfig usage and the returned FabricManagerMonitor are
fully type hinted.
demos/fabric-manager-monitor/checks/service_check.py (3)

142-153: Silent except Exception: pass in _get_restart_count swallows unexpected errors.

If result.stdout is malformed and int(val) raises ValueError, the failure is invisible. This makes it impossible to distinguish "NRestarts not supported" from a parse error. Log the exception at DEBUG level.

♻️ Proposed fix
         except Exception:
-            pass
+            logger.debug("Failed to parse NRestarts for %s", service_name, exc_info=True)
         return 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/checks/service_check.py` around lines 142 - 153,
The try/except in _get_restart_count silently swallows all errors; modify the
except block in the _get_restart_count method to catch Exception as e and log
the exception at DEBUG (e.g., using self._logger.debug or the existing logger)
including contextual info (service_name and result.stdout if available), then
return 0 as before; ensure you still only parse when result.returncode == 0 and
result.stdout is non-empty and do not change the return behavior.

76-84: Prefer list unpacking over concatenation (Ruff RUF005).

♻️ Proposed fix
-        full_cmd = ["nsenter", "-t", "1", "-m", "--"] + cmd
+        full_cmd = ["nsenter", "-t", "1", "-m", "--", *cmd]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/checks/service_check.py` around lines 76 - 84,
Replace the list concatenation in _run_host_cmd with list unpacking to satisfy
Ruff RUF005: instead of building full_cmd via ["nsenter", "-t", "1", "-m", "--"]
+ cmd, construct the argument list using unpacking (e.g., ["nsenter", "-t", "1",
"-m", "--", *cmd]) or pass the literal prefix plus *cmd directly into
subprocess.run so the command list is created with unpacking; update the symbol
_run_host_cmd accordingly to use the unpacked list when calling subprocess.run.

218-222: except (subprocess.TimeoutExpired, Exception)TimeoutExpired is redundant.

subprocess.TimeoutExpired is a subclass of Exception, so the first element of the tuple is always shadowed by the second. Simplify to except Exception.

♻️ Proposed fix
-        except (subprocess.TimeoutExpired, Exception) as e:
+        except Exception as e:
             logger.warning("Journal parsing failed for %s: %s", service_name, e)
             return []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/checks/service_check.py` around lines 218 - 222,
The except clause in the journal parsing block uses a redundant exception tuple;
change "except (subprocess.TimeoutExpired, Exception) as e" to a single "except
Exception as e" in the function/method containing the journal parse (the block
that logs "Journal parsing failed for %s: %s" and returns []), keeping the
logger.warning("Journal parsing failed for %s: %s", service_name, e) and the
return [] intact.
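
The subclass relationship is quick to confirm:

```python
import subprocess

# TimeoutExpired -> SubprocessError -> Exception, so `except Exception`
# already catches it and the tuple form adds nothing.
assert issubclass(subprocess.TimeoutExpired, subprocess.SubprocessError)
assert issubclass(subprocess.SubprocessError, Exception)
```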
health-monitors/fabric-manager-monitor/fabric_manager_monitor/logger.py (1)

56-63: Structlog processor arguments logger and method_name are required by the protocol but unused — prefix with _.

Ruff ARG001 is correct here. The structlog processor protocol mandates the three-argument signature, but the arguments aren't used inside inject_module_version. Using a _ prefix suppresses the linter warning and clearly signals intentional non-use.

♻️ Proposed fix
     def inject_module_version(
-        logger: logging.Logger | None,
-        method_name: str,
+        _logger: logging.Logger | None,
+        _method_name: str,
         event_dict: dict[str, Any],
     ) -> dict[str, Any]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/logger.py`
around lines 56 - 63, The structlog processor function inject_module_version
currently declares parameters logger and method_name but doesn't use them,
triggering Ruff ARG001; rename those parameters to _logger and _method_name in
the inject_module_version signature to mark them as intentionally unused while
leaving event_dict unchanged and still setting event_dict["module"] = module and
event_dict["version"] = version before returning it.
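
For reference, a processor with the corrected signature might look like this sketch (the MODULE/VERSION values are placeholders, and the real logger.py may differ):

```python
from __future__ import annotations

import logging
from typing import Any

MODULE = "fabric-manager-monitor"  # placeholder values for illustration
VERSION = "0.0.0"


def inject_module_version(
    _logger: logging.Logger | None,
    _method_name: str,
    event_dict: dict[str, Any],
) -> dict[str, Any]:
    # The first two arguments are mandated by structlog's processor protocol
    # but unused here; the underscore prefix satisfies Ruff ARG001.
    event_dict["module"] = MODULE
    event_dict["version"] = VERSION
    return event_dict


# A processor is just a callable, so it can be exercised without structlog:
out = inject_module_version(None, "info", {"event": "service check passed"})
assert out["module"] == MODULE and out["version"] == VERSION
```

In the real module it would be registered via structlog.configure(processors=[...]).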
health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/types.py (1)

34-40: Consider default factories for required list fields.

error_codes and entities_impacted must be supplied explicitly (even as []) by every caller. Giving them field(default_factory=list) would reduce boilerplate at call sites while keeping the type contract intact. Note that fields with defaults must follow fields without, so the two list fields also need to move below message:

♻️ Proposed refactor
+import dataclasses
 from typing import Dict, List, Optional

 @dataclasses.dataclass
 class CheckResult:
     ...
-    error_codes: List[str]
     message: str
-    entities_impacted: List[Dict[str, str]]
+    error_codes: List[str] = dataclasses.field(default_factory=list)
+    entities_impacted: List[Dict[str, str]] = dataclasses.field(default_factory=list)
     metadata: Optional[Dict[str, str]] = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/types.py`
around lines 34 - 40, The fields error_codes and entities_impacted are currently
required to be passed by every caller; change them to use
dataclasses.field(default_factory=list) so callers can omit them and get empty
lists by default, and ensure dataclasses.field is imported where the dataclass
(the one that declares error_codes and entities_impacted) is defined; keep the
type hints as List[str] and List[Dict[str,str]] respectively to preserve the
contract.
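
A quick way to see how default_factory and field ordering interact (hypothetical CheckResultSketch, mirroring but not reproducing the PR's CheckResult):

```python
import dataclasses
from typing import Dict, List, Optional


@dataclasses.dataclass
class CheckResultSketch:
    # Fields with defaults must come after fields without, otherwise the
    # dataclass raises TypeError at class-definition time.
    message: str
    error_codes: List[str] = dataclasses.field(default_factory=list)
    entities_impacted: List[Dict[str, str]] = dataclasses.field(default_factory=list)
    metadata: Optional[Dict[str, str]] = None


a = CheckResultSketch(message="fabric manager down")
a.error_codes.append("FM_DOWN")
b = CheckResultSketch(message="ok")
# default_factory gives each instance a fresh list -- no shared mutable default.
assert b.error_codes == [] and b.entities_impacted == []
```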
demos/fabric-manager-monitor/README.md (1)

30-34: Helm chart for production integration is not mentioned.

The demo uses raw k8s/ manifests. Per established repo conventions, the production counterpart under health-monitors/fabric-manager-monitor/ should also include a Helm chart under distros/kubernetes/nvsentinel/charts/. Consider adding a note (or a TODO) here pointing readers toward the Helm chart for the integrated version.

Based on learnings: "Health monitors should include Helm chart in distros/kubernetes/nvsentinel/charts/."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/README.md` around lines 30 - 34, The README's
"Deploy (assumes nvsentinel namespace exists)" section currently lists raw
manifests (rbac.yaml, configmap.yaml, daemonset.yaml, servicemonitor.yaml); add
a brief note or TODO immediately after that block pointing readers to the
production Helm chart for the integrated version
(distros/kubernetes/nvsentinel/charts/) and, if present, the specific chart for
this monitor (e.g., fabric-manager-monitor) so users know where to find the
Helm-based deployment alternative for production integration.
demos/fabric-manager-monitor/tests/test_service_check.py (4)

156-159: Unused loop variable name — rename to _name.

Per Ruff B007, the loop variable name on line 158 is unused. Rename to _name or use _ to indicate it's intentionally ignored.

-        for name, status in results.items():
+        for _name, status in results.items():
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/tests/test_service_check.py` around lines 156 -
159, The loop in the test `test_service_check.py` iterates over
`results.items()` but the `name` variable is never used; rename the unused loop
variable `name` to `_name` (or `_`) in the `for name, status in results.items()`
line so it conforms to Ruff B007 and signals the variable is intentionally
ignored while leaving `status` unchanged.

8-12: sys.path manipulation is fragile — prefer a proper package setup.

Inserting parent directories into sys.path at lines 11-12 is brittle and can cause import shadowing. Since the coding guidelines require Poetry for dependency management, consider adding a [tool.pytest.ini_options] pythonpath setting in pyproject.toml or installing the package in editable mode (poetry install) so imports resolve naturally.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/tests/test_service_check.py` around lines 8 -
12, The test currently manipulates sys.path via sys.path.insert(0,
os.path.join(os.path.dirname(__file__), "..")) which is brittle; remove that
sys.path manipulation from tests and instead ensure imports resolve via proper
packaging: either install the project in editable mode with Poetry (run poetry
install) or add a pytest pythonpath configuration under
[tool.pytest.ini_options] in pyproject.toml so tests can import the package
normally (remove the sys.path.insert call and rely on the package name used in
imports).

104-116: time.sleep(1.1) makes this test slow and potentially flaky.

Real sleeps in unit tests add latency and can cause intermittent failures under CI load. Consider mocking time.monotonic to simulate the passage of time deterministically.

♻️ Proposed refactor
+    @patch("checks.service_check.time")
-    def test_flap_detection_window_expiry(self):
+    def test_flap_detection_window_expiry(self, mock_time):
+        current_time = 1000.0
+        mock_time.monotonic.return_value = current_time
         checker = ServiceChecker(flap_window=1, flap_threshold=3)
         checker._restart_history["test"] = deque()
         checker._last_restart_count["test"] = 0
 
         # Add restarts
         checker._update_flap_tracking("test", 3)
         assert checker.is_flapping("test")
 
-        # Wait for window to expire
-        time.sleep(1.1)
+        # Simulate window expiry
+        mock_time.monotonic.return_value = current_time + 1.1
         checker._update_flap_tracking("test", 3)  # same count, triggers prune
         assert not checker.is_flapping("test")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/tests/test_service_check.py` around lines 104 -
116, The test test_flap_detection_window_expiry uses time.sleep and should
instead simulate time progression by mocking time.monotonic so it runs fast and
deterministically; update the test to patch time.monotonic (via
unittest.mock.patch or the test framework's monkeypatch) to return an initial
timestamp for the first _update_flap_tracking("test", 3) call and then a
timestamp > flap_window later before the second _update_flap_tracking call so
ServiceChecker._update_flap_tracking and is_flapping behave the same without
real sleep; ensure you reference ServiceChecker, _update_flap_tracking, and
is_flapping when adding the patch/unpatch around the two update calls.

118-141: Unused timeout parameter in the side_effect function.

The timeout keyword argument on line 120 is never used. This was flagged by Ruff (ARG001). Prefix it with an underscore to signal intent.

-        def side_effect(cmd, timeout=10):
+        def side_effect(cmd, timeout=10):  # noqa: ARG001

Or simply:

-        def side_effect(cmd, timeout=10):
+        def side_effect(cmd, **_kwargs):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/tests/test_service_check.py` around lines 118 -
141, In test_check_fabric_manager_with_journal_errors the side_effect function
declares a timeout parameter that is never used (ARG001); update the side_effect
signature to accept _timeout (or prefix timeout with an underscore) so the
linter knows it is intentionally unused, leaving the logic inside side_effect
and the mock_run.side_effect assignment unchanged; this touches the side_effect
used in the test_check_fabric_manager_with_journal_errors test for
ServiceChecker.check_fabric_manager.
demos/fabric-manager-monitor/checks/clock_check.py (2)

36-41: Use frozenset for the class-level _BENIGN_REASONS constant.

Ruff RUF012 flags this as a mutable class attribute. Since this set is never mutated, declare it as a frozenset to express immutability.

-    _BENIGN_REASONS = {
+    _BENIGN_REASONS = frozenset({
         "Not Active",
         "0x0000000000000000",  # No throttle
         "0x0000000000000001",  # GPU Idle — normal when no workload running
-    }
+    })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/checks/clock_check.py` around lines 36 - 41,
Replace the mutable class-level set _BENIGN_REASONS with an immutable frozenset
to satisfy the linter (RUF012); locate the _BENIGN_REASONS symbol in
clock_check.py and change its declaration so it uses frozenset(...) (keeping the
same strings) to express immutability and avoid mutable class attribute
warnings.

155-157: Silent exception swallowing in _query_throttle_reasons hides diagnostic context.

The bare except Exception: return [] discards all error information. If nvidia-smi is intermittently failing, this makes it invisible. Add a log statement so operators can diagnose throttle-reason query failures.

♻️ Proposed fix
-        except Exception:
+        except Exception:
+            logger.warning("Failed to query throttle reasons", exc_info=True)
             return []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/checks/clock_check.py` around lines 155 - 157,
In _query_throttle_reasons, avoid swallowing exceptions silently; inside the
except Exception block in function _query_throttle_reasons, log the caught
exception (including exception message and stacktrace) via the module's logger
(or processLogger) before returning [] so intermittent nvidia-smi failures are
visible to operators—i.e., replace the bare except with logging the
error/context and then return [].
health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py (1)

95-95: exit shadows the Python builtin.

Using exit as a variable name shadows the builtin exit() function. Consider renaming to exit_event or shutdown_event for clarity and to avoid accidental misuse.

-    exit = Event()
+    exit_event = Event()

(Update all references on lines 132, 156 accordingly.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py` at line
95, The variable named exit (Event()) shadows the Python builtin exit(); rename
it (e.g., exit_event or shutdown_event) and update all references to the new
name (including where it’s waited on or set in the CLI logic and handlers
referenced in the same module such as the code blocks that currently use
exit.wait() or exit.set()); ensure imports remain the same and run tests to
confirm no remaining references to the old name.
health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/cuda_validation.py (1)

34-70: Inline CUDA test script uses assert which is stripped under python -O.

If the monitor container is ever invoked with optimization flags (python -O), the assert on line 62 of the embedded script would be silently removed, making the verification no-op. Replace with an explicit if/raise to guarantee the check always runs.

♻️ Proposed fix (inside the _CUDA_TEST_SCRIPT string)
             # Allocate, write, read back, verify
             t = torch.randn(1024, device="cuda")
-            assert t.sum().isfinite(), f"GPU {i}: non-finite sum"
+            if not t.sum().isfinite():
+                raise RuntimeError(f"GPU {i}: non-finite sum")
             del t
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/cuda_validation.py`
around lines 34 - 70, The embedded CUDA test (_CUDA_TEST_SCRIPT) currently uses
an assert for the CUDA tensor check (assert t.sum().isfinite(), f"GPU {i}:
non-finite sum") which is removed under python -O; update the script to replace
that assert with an explicit runtime check such as: compute s = t.sum() and if
not s.isfinite(): raise RuntimeError(f"GPU {i}: non-finite sum") so the
verification always runs regardless of optimization flags and the exception is
caught and reported into results["errors"] as the surrounding code expects.
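The hazard is easy to demonstrate in isolation. This quick check (independent of the PR's code) shows the same failing assert passing silently under the -O flag:

```python
import subprocess
import sys

# A failing assert normally exits non-zero with an AssertionError, but under
# `python -O` asserts are compiled out entirely, so the script "passes".
snippet = "assert 1 == 2, 'unreachable under -O'"
normal = subprocess.run([sys.executable, "-c", snippet], capture_output=True)
optimized = subprocess.run([sys.executable, "-O", "-c", snippet], capture_output=True)
print(normal.returncode, optimized.returncode)  # 1 0
```

An explicit if/raise has no such mode dependence, which is why it is the safer choice for a health verification path.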
health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/pcie_check.py (1)

47-81: Consider log.exception in exception handlers to capture tracebacks.

In the except blocks (lines 73–81), log.error discards the traceback. Using log.exception (or log.error(..., exc_info=True)) would preserve diagnostic context, especially for the generic Exception handler on line 79.

♻️ Proposed fix
         except subprocess.TimeoutExpired:
-            log.error("nvidia-smi PCIe query timed out")
+            log.exception("nvidia-smi PCIe query timed out")
             return []
         except FileNotFoundError:
-            log.error("nvidia-smi not found")
+            log.exception("nvidia-smi not found")
             return []
-        except Exception as e:
-            log.error(f"PCIe check failed: {e}")
+        except Exception:
+            log.exception("PCIe check failed")
             return []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/pcie_check.py`
around lines 47 - 81, In the check method of pcie_check.py (function check in
the PCIe checker), replace the log.error calls inside the exception handlers for
subprocess.TimeoutExpired, FileNotFoundError, and the generic Exception handler
so they record the traceback (use log.exception(...) or log.error(...,
exc_info=True)) while keeping the existing messages ("nvidia-smi PCIe query
timed out", "nvidia-smi not found", "PCIe check failed: ..."); this ensures the
check() method logs full diagnostic context for TimeoutExpired,
FileNotFoundError, and the Exception handler.
demos/fabric-manager-monitor/checks/fabric_check.py (1)

41-49: Silently assuming healthy on DCGM fetch failure may mask real outages.

When _fetch_metrics() throws (e.g., DCGM exporter is down), check() returns healthy=True with an error string. This means a prolonged DCGM exporter outage will never surface an unhealthy signal. Consider logging at warning level here so operators have visibility, and potentially exposing the fetch-failure state through a dedicated metric or check result so the monitoring gap doesn't go unnoticed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/checks/fabric_check.py` around lines 41 - 49,
The current check() swallows _fetch_metrics() exceptions and returns
NVLinkStatus(healthy=True) which hides outages; change this to log the exception
at warning (use the instance logger, e.g., self.logger.warning(...), and include
the exception string) and return NVLinkStatus(healthy=False, error=f"Failed to
fetch DCGM metrics: {e}"); additionally, if a telemetry/metrics client exists on
the class, increment or emit a distinct fetch-failure metric (e.g.,
"nvlink_fetch_failures") so monitoring can alert on DCGM exporter outages
(references: check(), _fetch_metrics(), NVLinkStatus).
demos/fabric-manager-monitor/checks/pcie_check.py (1)

29-91: Extract PCIe checking logic into a shared library to reduce duplication.

The PCIeStatus dataclass and core checking methods (check() and _parse_output()) are duplicated between the demo and production paths. The demo version lacks the production-specific to_check_results() integration method, but the core logic is identical. Consider creating a shared utility module that both implementations can import, allowing the production version to add its integration layer on top.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/checks/pcie_check.py` around lines 29 - 91, The
PCIe checking logic (PCIeStatus dataclass and
PCIeChecker.check()/_parse_output()) is duplicated between demo and production;
extract these into a shared utility module (e.g., pcie/utils) that exports
PCIeStatus and PCIeChecker (or a parsing function) so both
demos/fabric-manager-monitor and production code import the common
implementation; remove the duplicated definitions from
demos/fabric-manager-monitor/checks/pcie_check.py and update the
production-specific implementation to keep its integration method
(to_check_results()) by importing the shared PCIeStatus and either subclassing
or wrapping PCIeChecker to add to_check_results().
health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py (2)

128-135: Exception handler loses traceback — use log.exception instead of log.error.

When a check raises an unexpected exception, the current log.error with an f-string loses the full traceback, making debugging much harder. log.exception automatically includes the traceback.

Proposed fix
                         except Exception as e:
-                            log.error(f"Check '{name}' failed with exception: {e}")
+                            log.exception(f"Check '{name}' failed with exception")
                             metrics.check_errors.labels(name).inc()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py`
around lines 128 - 135, The exception handler inside the loop over
self._checkers is using log.error and loses the traceback; change the except
block in watcher.py where it catches Exception as e (around the loop that calls
check_func() and extends results) to call log.exception with a descriptive
message (e.g., "Check '{name}' failed") and still increment
metrics.check_errors.labels(name).inc(), so the full traceback is captured while
preserving the existing metrics and control flow.

66-66: Unbounded ThreadPoolExecutor may accumulate blocked threads.

ThreadPoolExecutor() without max_workers defaults to min(32, os.cpu_count() + 4) in Python 3.8+, but each callback thread can block for minutes due to the exponential backoff in send_health_event_with_retries. Over multiple check cycles with a failing platform-connector, threads accumulate until the pool is exhausted, stalling all callbacks.

Consider setting an explicit max_workers (e.g., max_workers=len(callbacks) or a small fixed number).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py`
at line 66, The ThreadPoolExecutor created in watcher.py as
self._callback_thread_pool = ThreadPoolExecutor() can grow and exhaust threads
because send_health_event_with_retries blocks for minutes; set an explicit
max_workers to limit concurrency (for example max_workers=len(self._callbacks)
or a small fixed number) when constructing the ThreadPoolExecutor to prevent
accumulation, and ensure the pool is initialized using that value where
self._callback_thread_pool is created; reference the ThreadPoolExecutor
instantiation and the send_health_event_with_retries callback behavior when
making the change.
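The bounded-pool behavior is easy to sketch. In the example below slow_callback is a hypothetical stand-in for a blocking send_health_event_with_retries; the point is that submissions beyond max_workers queue instead of spawning new threads each check cycle:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_callback(event_id: int) -> str:
    # Stand-in for send_health_event_with_retries blocking on backoff.
    time.sleep(0.05)
    return f"sent-{event_id}"


n_callbacks = 4
# With max_workers bounded to the callback count, 8 submissions run on at
# most 4 threads; the rest wait in the executor's internal queue.
with ThreadPoolExecutor(max_workers=n_callbacks) as pool:
    futures = [pool.submit(slow_callback, i) for i in range(8)]
    results = [f.result() for f in futures]
print(results)
```

Results come back in submission order because futures preserves it, so downstream handling is unchanged; only peak thread count is capped.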
health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/clock_check.py (2)

150-186: _query_throttle_reasons returns untyped List[dict].

Per coding guidelines, type hints are required and dataclasses should be used for structured data. This method returns List[dict] with implicit keys "gpu_index" (int) and "reasons" (str). Consider using a TypedDict or a small dataclass to make the contract explicit and catch key mismatches at development time.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/clock_check.py`
around lines 150 - 186, The _query_throttle_reasons function currently returns
List[dict]; define an explicit shape (prefer a TypedDict named ThrottleReason
with fields gpu_index: int and reasons: str, or alternatively a small dataclass)
and update the function signature to return List[ThrottleReason]; then construct
and append instances of that TypedDict/dataclass instead of raw dicts, add the
necessary import from typing (TypedDict, List) or dataclasses, and keep the same
parsing/exception behavior.
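A minimal sketch of the TypedDict approach (the ThrottleReason name, the parse helper, and the CSV row format are assumptions for illustration, not the PR's code):

```python
from typing import List, TypedDict


class ThrottleReason(TypedDict):
    """Explicit shape for one GPU's throttle record."""
    gpu_index: int
    reasons: str


def parse_throttle_csv(lines: List[str]) -> List[ThrottleReason]:
    # Assumes nvidia-smi-style CSV rows such as "0, 0x0000000000000001".
    out: List[ThrottleReason] = []
    for line in lines:
        idx, reasons = (part.strip() for part in line.split(",", 1))
        out.append(ThrottleReason(gpu_index=int(idx), reasons=reasons))
    return out


print(parse_throttle_csv(["0, Not Active", "1, 0x0000000000000001"]))
```

At runtime these are still plain dicts, so no call sites change; the gain is that mypy/pyright flag a misspelled key like "gpu_idx" at development time.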

56-60: Consider frozenset for the class-level constant.

_BENIGN_REASONS is a mutable set defined as a class attribute. Since it's intended as a constant lookup table, frozenset communicates immutability and avoids accidental mutation.

-    _BENIGN_REASONS = {
+    _BENIGN_REASONS = frozenset({
         "Not Active",
         "0x0000000000000000",  # No throttle
         "0x0000000000000001",  # GPU Idle -- normal when no workload running
-    }
+    })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/clock_check.py`
around lines 56 - 60, _BENIGN_REASONS is defined as a mutable set but used as a
constant lookup table; change its definition to use frozenset to make it
immutable and signal intent. Locate the class/module-level symbol
_BENIGN_REASONS in clock_check.py and replace the set literal with a frozenset
literal (keeping the same string members) so any accidental mutations are
prevented and the constant is clearly immutable.
demos/fabric-manager-monitor/monitor.py (2)

91-93: Missing return type annotation on _handle_signal.

As per coding guidelines, type hints are required for all functions.

Proposed fix
-    def _handle_signal(self, signum, frame):
+    def _handle_signal(self, signum: int, frame: FrameType | None) -> None:

(Requires from types import FrameType; Optional[FrameType] works equally on older Pythons.)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/monitor.py` around lines 91 - 93, The method
_handle_signal lacks type annotations; add parameter and return type hints:
annotate signum as int and frame as Optional[FrameType] and the return as None
(e.g. def _handle_signal(self, signum: int, frame: Optional[FrameType]) ->
None:). Also add the necessary imports (from types import FrameType and from
typing import Optional) at the top of the module so the annotations resolve.

114-117: Dead code: restart counter block is a no-op.

This entire block does nothing — it checks fm_status.n_restarts > 0, then executes pass. The imported fabric_manager_restarts_total metric is never updated anywhere in this file. Either wire up the metric or remove this dead code to avoid confusion.

Proposed fix — remove dead block
-                    # Update restart counter (set to current total — Counter only goes up)
-                    if fm_status.n_restarts > 0:
-                        # We use _total suffix via Counter; increment by delta
-                        pass  # Counter tracks via flap detection instead
-
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/monitor.py` around lines 114 - 117, The
restart-counter block in monitor.py is dead: it checks fm_status.n_restarts > 0
then does nothing, and the imported metric fabric_manager_restarts_total is
never updated; remove the no-op if-block (the three lines checking
fm_status.n_restarts and the pass) and either delete the unused
import/registration of fabric_manager_restarts_total or, if you prefer to track
restarts, replace the block with code that sets/increments
fabric_manager_restarts_total (using fm_status.n_restarts or delta) so the
metric is actually updated; reference symbols: fm_status.n_restarts and
fabric_manager_restarts_total.
health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py (1)

146-152: gRPC channel is created and torn down on every retry attempt.

The with grpc.insecure_channel(...) block inside the retry loop means a new UDS connection + HTTP/2 handshake per attempt. Move channel creation outside the loop so retries reuse the same connection, reducing latency and resource churn.

Proposed fix
     delay = INITIAL_DELAY
-    for _ in range(MAX_RETRIES):
-        with grpc.insecure_channel(f"unix://{self._socket_path}") as chan:
-            stub = platformconnector_pb2_grpc.PlatformConnectorStub(chan)
-            try:
-                stub.HealthEventOccurredV1(platformconnector_pb2.HealthEvents(events=health_events, version=1))
-                metrics.events_sent_success.inc()
-                return True
-            except grpc.RpcError as e:
-                log.error(f"Failed to send health event to UDS: {e}")
-                sleep(delay)
-                delay *= 1.5
-                continue
+    with grpc.insecure_channel(f"unix://{self._socket_path}") as chan:
+        stub = platformconnector_pb2_grpc.PlatformConnectorStub(chan)
+        for _ in range(MAX_RETRIES):
+            try:
+                stub.HealthEventOccurredV1(platformconnector_pb2.HealthEvents(events=health_events, version=1))
+                metrics.events_sent_success.inc()
+                return True
+            except grpc.RpcError as e:
+                log.error(f"Failed to send health event to UDS: {e}")
+                sleep(delay)
+                delay *= 1.5
+                continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`
around lines 146 - 152, The retry loop currently creates a new gRPC channel each
iteration (using grpc.insecure_channel(f"unix://{self._socket_path}")), causing
repeated UDS/TLS handshakes; move channel creation outside the for _ in
range(MAX_RETRIES) loop so a single channel is reused for all retries: open the
channel once (using grpc.insecure_channel with self._socket_path), construct
PlatformConnectorStub(chan) once, then keep the try/except and retry logic that
calls stub.HealthEventOccurredV1(platformconnector_pb2.HealthEvents(...)) inside
the loop; on success increment metrics.events_sent_success and return True as
before, and ensure the channel is properly closed (or used in a single with
block) after the retry loop completes/fails.
health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py (5)

100-108: Prefer list unpacking over concatenation (RUF005); document trust assumption for service_name

Ruff RUF005 flags the + concatenation; use spread syntax instead. More importantly, service_name is caller-supplied (including via check_all_gpu_services(..., service_names=[...])) and flows directly into nsenter/systemctl arguments. While shell=False prevents shell injection, malformed or adversarial service names can still produce unexpected systemctl behavior (e.g., names containing ../ or whitespace). Document the implicit trust assumption or add a basic allowlist/regex guard in callers that accept external input.

♻️ Proposed fix (RUF005)
-        full_cmd = ["nsenter", "-t", "1", "-m", "--"] + cmd
+        full_cmd = ["nsenter", "-t", "1", "-m", "--", *cmd]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py`
around lines 100 - 108, The _run_host_cmd function currently concatenates lists
with ["nsenter", "-t", "1", "-m", "--"] + cmd and callers pass service names
into commands (e.g., check_all_gpu_services -> service_name), which can lead to
unexpected systemctl behavior; change the call site to build the final argv
using unpacking (["nsenter", "-t", "1", "-m", "--", *cmd] or equivalent) instead
of +, and either add a simple allowlist/regex validation for service_name in the
callers that accept external input (e.g., check_all_gpu_services) or document
the implicit trust assumption for service_name in the function/method docstrings
so callers know they must sanitize inputs before passing them to _run_host_cmd.
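A minimal allowlist guard could look like the sketch below; the exact pattern and length cap are assumptions about what a reasonable systemd unit-name policy looks like, not part of the PR:

```python
import re

# Conservative allowlist for systemd unit names before they reach
# nsenter/systemctl: alphanumeric start, then a small set of unit-name
# characters. Rejects path traversal, whitespace, and shell metacharacters.
_UNIT_NAME_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_.@:-]*$")


def validate_service_name(name: str) -> str:
    """Return name unchanged, or raise ValueError for suspicious input."""
    if len(name) > 256 or not _UNIT_NAME_RE.fullmatch(name):
        raise ValueError(f"refusing suspicious service name: {name!r}")
    return name


print(validate_service_name("nvidia-fabricmanager"))  # nvidia-fabricmanager
```

Callers that accept external input (such as check_all_gpu_services) could run names through this before building the nsenter argv.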

87-90: Mutable class attribute DEFAULT_GPU_SERVICES should be annotated as ClassVar

Without ClassVar, type checkers treat this as an instance field, and it can be inadvertently mutated at the class level (affecting all instances). Ruff RUF012 flags this correctly.

♻️ Proposed fix
+from typing import ClassVar, Dict, List, Optional
 
 class ServiceChecker:
     """Checks host systemd services via nsenter."""
 
     # Additional GPU-related services to check alongside Fabric Manager
-    DEFAULT_GPU_SERVICES = [
+    DEFAULT_GPU_SERVICES: ClassVar[List[str]] = [
         "nvidia-persistenced",
         "nv-hostengine",
     ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py`
around lines 87 - 90, DEFAULT_GPU_SERVICES is a mutable class attribute and
should be annotated as a ClassVar so type checkers don't treat it as an instance
field; import ClassVar from typing and change the declaration of
DEFAULT_GPU_SERVICES to include the ClassVar annotation (e.g.
DEFAULT_GPU_SERVICES: ClassVar[list[str]] = [...]) so the linter (RUF012) is
satisfied and instances won't be considered to own/mutate this list; update the
declaration in the module where DEFAULT_GPU_SERVICES is defined and ensure the
typing import is added if missing.

210-227: Fragile manual field unpacking in check_fabric_manager — use dataclasses.fields for resilience

If ServiceStatus gains new fields in the future, check_fabric_manager will silently omit them without a compile-time error.

♻️ Proposed fix
+import dataclasses
 
 def check_fabric_manager(self) -> FabricManagerStatus:
     """Check Fabric Manager with journal error analysis."""
     base = self.check_service("nvidia-fabricmanager")
     journal_errors = self._parse_journal_errors("nvidia-fabricmanager")
     flapping = self.is_flapping("nvidia-fabricmanager")
 
-    return FabricManagerStatus(
-        name=base.name,
-        active=base.active,
-        sub_state=base.sub_state,
-        main_pid=base.main_pid,
-        n_restarts=base.n_restarts,
-        start_timestamp=base.start_timestamp,
-        error=base.error,
-        journal_errors=journal_errors,
-        flapping=flapping,
-    )
+    base_fields = {f.name: getattr(base, f.name) for f in dataclasses.fields(base)}
+    return FabricManagerStatus(**base_fields, journal_errors=journal_errors, flapping=flapping)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py`
around lines 210 - 227, check_fabric_manager manually unpacks fields from the
ServiceStatus returned by check_service, which will silently omit new fields;
instead build the FabricManagerStatus from a dynamic copy of the base dataclass
(use dataclasses.fields or dataclasses.asdict on the ServiceStatus instance
returned by check_service inside check_fabric_manager) then merge/override the
additional keys journal_errors (from _parse_journal_errors) and flapping (from
is_flapping) and construct FabricManagerStatus(**merged_dict); this ensures any
new ServiceStatus fields are preserved without manual updates.

254-258: Redundant exception type in except clause; TimeoutExpired is already a subclass of Exception

subprocess.TimeoutExpired is a subclass of SubprocessError, which in turn subclasses Exception, so the explicit listing is redundant and the clause is functionally except Exception. Consider narrowing to only expected exceptions, or at minimum remove the redundant type. Ruff BLE001 is valid here.

♻️ Proposed fix
-        except (subprocess.TimeoutExpired, Exception) as e:
-            log.warning(f"Journal parsing failed for {service_name}: {e}")
+        except subprocess.TimeoutExpired:
+            log.warning("Journal parsing timed out", service=service_name)
+        except OSError as e:
+            log.warning("Journal parsing failed", service=service_name, error=str(e))
         return []

Also consider moving return found (line 254) to an else block of the try to satisfy Ruff TRY300 and make the non-exceptional path explicit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py`
around lines 254 - 258, The except tuple currently uses
(subprocess.TimeoutExpired, Exception) which is redundant because TimeoutExpired
is a subclass of Exception; change the handler to either narrowly catch the
specific expected exceptions (e.g., subprocess.SubprocessError or the concrete
errors your journal parsing can raise) or collapse to a single except Exception
as e and remove TimeoutExpired from the tuple, and move the existing return
found out of the try and into an else block so the non-exception path is
explicit (references: the except clause handling journal parsing errors for
service_name and the variable found in service_check.py).

168-182: Silent except … pass in _get_restart_count swallows diagnostic failures (S110)

If nsenter fails for a non-benign reason (e.g., missing root privilege, kernel capability gap), the error is silently discarded and 0 is returned, making flap detection silently inoperable. The benign case (NRestarts unsupported on older systemd) and the fatal case (infrastructure failure) produce the same result.

♻️ Proposed fix
-        except Exception:
-            pass
+        except ValueError:
+            pass  # malformed NRestarts value; treat as 0
+        except Exception as e:
+            log.warning("Failed to query NRestarts", service=service_name, error=str(e))
         return 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py`
around lines 168 - 182, The helper _get_restart_count currently swallows all
exceptions, masking real failures; update it to distinguish benign "property
unsupported" from command/runtime failures by: after calling
self._run_host_cmd(service_name) check result.returncode and if non-zero log the
failure with service_name and result.stderr (using self.logger.error or process
logger) and re-raise or return an error indicator instead of silently returning
0; keep returning 0 only when the command succeeded but stdout is empty or the
NRestarts property is absent; also change the broad except Exception: pass to
either catch specific expected exceptions or log the exception (with exception
info) and re-raise so infrastructure errors aren’t hidden. Ensure references to
_get_restart_count, _run_host_cmd and service_name are used in your changes.

Comment on lines +1 to +90
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fabric-manager-monitor
  namespace: nvsentinel
  labels:
    app.kubernetes.io/name: fabric-manager-monitor
    app.kubernetes.io/instance: nvsentinel
    app.kubernetes.io/component: metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: fabric-manager-monitor
      app.kubernetes.io/instance: nvsentinel
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fabric-manager-monitor
        app.kubernetes.io/instance: nvsentinel
        app.kubernetes.io/component: metrics
    spec:
      serviceAccountName: fabric-manager-monitor
      hostPID: true  # Required for nsenter to check host systemd services and nvidia-smi
      nodeSelector:
        # Matches any node with NVIDIA GPUs (set by GPU Feature Discovery)
        feature.node.kubernetes.io/pci-10de.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: monitor
          image: fabric-manager-monitor:latest
          envFrom:
            - configMapRef:
                name: fabric-manager-monitor
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          ports:
            - name: metrics
              containerPort: 9101
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /
              port: metrics
            initialDelaySeconds: 10
            periodSeconds: 30
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 100m
              memory: 128Mi
          securityContext:
            privileged: true  # Required for nsenter into host mount namespace
          volumeMounts:
            - name: host-dbus
              mountPath: /var/run/dbus
              readOnly: true
      volumes:
        - name: host-dbus
          hostPath:
            path: /var/run/dbus
            type: Directory

---
apiVersion: v1
kind: Service
metadata:
  name: fabric-manager-monitor
  namespace: nvsentinel
  labels:
    app.kubernetes.io/name: fabric-manager-monitor
    app.kubernetes.io/instance: nvsentinel
    app.kubernetes.io/component: metrics
spec:
  selector:
    app.kubernetes.io/name: fabric-manager-monitor
    app.kubernetes.io/instance: nvsentinel
  ports:
    - name: metrics
      port: 9101
      targetPort: 9101
      protocol: TCP
  clusterIP: None  # Headless — each pod scraped individually

🛠️ Refactor suggestion | 🟠 Major

DaemonSet variant selection logic is undocumented — coding guideline violation.

The coding guidelines require **/daemonset*.yaml files to explain DaemonSet variant selection logic (e.g., kata vs. regular nodes) in Helm chart documentation. This manifest has a single undifferentiated DaemonSet with no mention of kata node handling. For production Helm chart delivery the monitor will also need a corresponding chart under distros/kubernetes/nvsentinel/charts/fabric-manager-monitor/ (per project conventions) that documents variant selection logic.

As per coding guidelines: "Explain DaemonSet variant selection logic in Helm chart documentation" (**/daemonset*.yaml). Based on learnings: "Health monitors should include Helm chart in distros/kubernetes/nvsentinel/charts/" and "Separate DaemonSets should be created for kata vs regular nodes using nodeAffinity based on kata.enabled label".

🧰 Tools
🪛 Checkov (3.2.334)

[medium] 1-71: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-71: Container should not be privileged

(CKV_K8S_16)


[medium] 1-71: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-71: Containers should not share the host process ID namespace

(CKV_K8S_17)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/k8s/daemonset.yaml` around lines 1 - 90, The
DaemonSet named fabric-manager-monitor currently lacks documented variant
selection and does not handle kata vs regular nodes; update the deployment by
adding separate DaemonSet manifests (or templated Helm chart resources) for each
variant and document the selection logic in the chart: keep the existing
fabric-manager-monitor DaemonSet for regular GPU nodes (current
feature.node.kubernetes.io/pci-10de.present selector), add a second DaemonSet
that targets kata nodes via nodeAffinity matching the kata.enabled label (and
inverse affinity for the regular one if using both), ensure both manifests
expose the same ports/env/volumes, and add clear Helm chart documentation
describing when each DaemonSet is deployed and how to enable/disable the kata
variant (include references to the NODE_NAME env and service
fabric-manager-monitor for scraping).

Comment on lines +22 to +30
      serviceAccountName: fabric-manager-monitor
      hostPID: true  # Required for nsenter to check host systemd services and nvidia-smi
      nodeSelector:
        # Matches any node with NVIDIA GPUs (set by GPU Feature Discovery)
        feature.node.kubernetes.io/pci-10de.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule

⚠️ Potential issue | 🟠 Major

Privileged + hostPID security posture — acknowledge Checkov findings and consider capability-scoping.

Checkov flags these as HIGH (CKV_K8S_16: privileged) and MEDIUM (CKV_K8S_17: hostPID, CKV_K8S_20: allowPrivilegeEscalation, CKV_K8S_23: root containers). The inline comments correctly explain the nsenter requirement, but two hardening options are worth considering even in a demo:

  1. Capability-based approach: nsenter for mount/UTS/network namespaces only requires CAP_SYS_ADMIN (and optionally CAP_SYS_PTRACE). Full privileged: true grants all 40+ capabilities and removes seccomp/AppArmor profiles — a much larger blast radius than needed.

  2. allowPrivilegeEscalation: false should be explicitly set even when privileged: true to satisfy policy engines (OPA/Kyverno) and pass Checkov CKV_K8S_20.

🛡️ Capability-scoped alternative to full privileged mode
           securityContext:
-            privileged: true  # Required for nsenter into host mount namespace
+            capabilities:
+              add:
+                - SYS_ADMIN    # nsenter namespace traversal
+                - SYS_PTRACE   # /proc/<pid> access for nsenter
+            allowPrivilegeEscalation: false
+            runAsNonRoot: false  # nsenter + nvidia-smi require root; document explicitly

Note: Test thoroughly — some nsenter invocations may still require full privileged depending on the target namespace types (e.g., --mount into the host mount namespace typically needs SYS_ADMIN + SYS_CHROOT).

Also applies to: 59-60
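For reference, a fuller container-level sketch of the capability-scoped posture described above. This is a sketch only — the image tag and container name are placeholders, and the exact capability set must be validated against the actual nsenter invocations (as the note above says, `--mount` typically needs SYS_ADMIN plus SYS_CHROOT):

```yaml
containers:
  - name: fabric-manager-monitor
    image: fabric-manager-monitor:0.1.0   # placeholder tag
    securityContext:
      capabilities:
        drop: ["ALL"]
        add:
          - SYS_ADMIN    # nsenter namespace traversal
          - SYS_PTRACE   # /proc/<pid> access for nsenter
          - SYS_CHROOT   # often required for nsenter --mount
      allowPrivilegeEscalation: false   # satisfies CKV_K8S_20
      runAsNonRoot: false               # nsenter + nvidia-smi require root
```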

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/k8s/daemonset.yaml` around lines 22 - 30, The
DaemonSet is using a broad privileged posture; update the Pod/Container
securityContext to avoid privileged: true where possible and instead scope
capabilities required by nsenter (e.g., add kernel capabilities like
CAP_SYS_ADMIN and optionally CAP_SYS_PTRACE under
securityContext.capabilities.add) and explicitly set
securityContext.allowPrivilegeEscalation: false; keep hostPID: true if you still
need host PID access but remove or avoid privileged: true (or document why it
must remain) so Checkov flags CKV_K8S_16/17/20/23 are mitigated—look for the Pod
spec fields securityContext, capabilities, privileged, allowPrivilegeEscalation,
and hostPID to apply these changes.

Comment on lines +114 to +144
def to_check_results(self, statuses: List[PCIeStatus], node_name: str) -> List[CheckResult]:
"""Convert PCIeStatus list to CheckResult list for the watcher."""
results = []
for pcie in statuses:
if pcie.degraded:
results.append(
CheckResult(
check_name="PcieLinkDegraded",
is_healthy=False,
is_fatal=True,
error_codes=["PCIE_LINK_DEGRADED"],
message=(
f"PCIe link degraded on {node_name} GPU {pcie.gpu_index}: "
f"Gen{pcie.link_gen_current} x{pcie.link_width_current} "
f"(max Gen{pcie.link_gen_max} x{pcie.link_width_max})"
),
entities_impacted=[{"entityType": "GPU", "entityValue": str(pcie.gpu_index)}],
)
)
else:
results.append(
CheckResult(
check_name="PcieLinkDegraded",
is_healthy=True,
is_fatal=False,
error_codes=[],
message=f"PCIe link healthy on {node_name} GPU {pcie.gpu_index}",
entities_impacted=[{"entityType": "GPU", "entityValue": str(pcie.gpu_index)}],
)
)
return results

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd health-monitors/fabric-manager-monitor && find . -name "pcie_check.py" -o -name "watcher.py" | head -20

Repository: NVIDIA/NVSentinel

Length of output: 153


🏁 Script executed:

# First, let's see the full pcie_check.py file to understand the check() method
cat -n fabric_manager_monitor/checkers/pcie_check.py | head -200

Repository: NVIDIA/NVSentinel

Length of output: 138


🏁 Script executed:

# Search for watcher.py to see how it handles empty results
find . -name "watcher.py" -type f

Repository: NVIDIA/NVSentinel

Length of output: 144


🏁 Script executed:

# Check how the check() method works and when it returns empty list
rg -A 20 "def check\(" fabric_manager_monitor/checkers/pcie_check.py

Repository: NVIDIA/NVSentinel

Length of output: 146


🏁 Script executed:

cat -n health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/pcie_check.py | head -150

Repository: NVIDIA/NVSentinel

Length of output: 6348


🏁 Script executed:

# Look for the check() method in pcie_check.py
rg -A 25 "def check\(" health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/pcie_check.py

Repository: NVIDIA/NVSentinel

Length of output: 945


🏁 Script executed:

# Check the watcher to see how it uses results from checkers
cat health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py | head -200

Repository: NVIDIA/NVSentinel

Length of output: 8247


🏁 Script executed:

rg -A 15 "_run_pcie_checks" health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py

Repository: NVIDIA/NVSentinel

Length of output: 1615


🏁 Script executed:

# Also check if there's any special handling for empty results
rg -B 5 -A 10 "if results" health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py

Repository: NVIDIA/NVSentinel

Length of output: 956


🏁 Script executed:

# Check the CheckResult type to understand the data structure
cat health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/types.py

Repository: NVIDIA/NVSentinel

Length of output: 1860


🏁 Script executed:

# Get the full _run_pcie_checks method to see if it calls to_check_results
rg -A 30 "_run_pcie_checks" health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py

Repository: NVIDIA/NVSentinel

Length of output: 3014


🏁 Script executed:

# Search for all calls to to_check_results in the codebase
rg "to_check_results" health-monitors/fabric-manager-monitor/

Repository: NVIDIA/NVSentinel

Length of output: 1396


🏁 Script executed:

# Check how clock_check.py handles empty statuses
rg -A 30 "def to_check_results" health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/clock_check.py

Repository: NVIDIA/NVSentinel

Length of output: 1644


🏁 Script executed:

# Also check the fabric_check.py to see if there's a pattern for handling empty results
rg -B 5 -A 40 "def to_check_results" health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/fabric_check.py

Repository: NVIDIA/NVSentinel

Length of output: 2086


🏁 Script executed:

# Check service_check.py for comparison
rg -B 5 -A 20 "def check\(" health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py | head -50

Repository: NVIDIA/NVSentinel

Length of output: 43


🏁 Script executed:

# Check if there's an "error" field in PCIeStatus that could be used
rg -A 5 "class PCIeStatus" health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/pcie_check.py

Repository: NVIDIA/NVSentinel

Length of output: 191


🏁 Script executed:

# Check service_check.py to see how it handles errors - looking for a model to follow
cat health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py | head -100

Repository: NVIDIA/NVSentinel

Length of output: 3209


🏁 Script executed:

# Let's check if there's existing error reporting pattern by looking at the full check method context
cat health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/pcie_check.py | sed -n '47,82p'

Repository: NVIDIA/NVSentinel

Length of output: 1271


🏁 Script executed:

# Check if any tests exist that show expected behavior for empty statuses
find health-monitors/fabric-manager-monitor -name "*test*.py" -o -name "test_*" -type f | head -10

Repository: NVIDIA/NVSentinel

Length of output: 43


to_check_results() silently returns an empty list when nvidia-smi fails, masking monitoring gaps.

When check() fails (returns [] due to nvidia-smi errors, timeout, or exceptions), to_check_results() also returns []. This means:

  • No callbacks are fired (watcher's if results: check skips them)
  • Node health defaults to True (watcher's if results else True logic)
  • The monitoring failure is silent—the platform has no signal that PCIe health is unknown

Consider emitting an error-state CheckResult when statuses is empty, so the platform can track monitoring gaps. The ServiceChecker pattern provides a model: always return a status object with an error field set when the check itself fails.
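A minimal sketch of that guard, assuming the CheckResult fields visible in the diff above (the dataclass here is a stand-in for the project's real type, not its actual definition):

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Stand-in for the project's CheckResult; fields mirror those used in the
# diff above (an assumption for illustration, not the real class).
@dataclass
class CheckResult:
    check_name: str
    is_healthy: bool
    is_fatal: bool
    error_codes: List[str] = field(default_factory=list)
    message: str = ""
    entities_impacted: List[Dict[str, str]] = field(default_factory=list)

def to_check_results(statuses: list, node_name: str) -> List[CheckResult]:
    # Empty statuses means the check itself failed (nvidia-smi error or
    # timeout): report that explicitly instead of returning [] and going
    # silent, so the watcher fires callbacks and node health is not
    # defaulted to True.
    if not statuses:
        return [
            CheckResult(
                check_name="PcieLinkDegraded",
                is_healthy=False,
                is_fatal=False,
                error_codes=["PCIE_CHECK_FAILED"],
                message=f"Unable to determine PCIe status for {node_name}: "
                        "check execution failed or timed out",
            )
        ]
    ...  # existing per-GPU healthy/degraded logic unchanged
    return []
```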

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/pcie_check.py`
around lines 114 - 144, to_check_results currently returns an empty list when
statuses is empty (e.g., nvidia-smi failure), which silences monitoring; update
to_check_results to detect when statuses == [] and instead return a single
error-state CheckResult (e.g., check_name "PcieLinkDegraded" or
"PcieCheckFailed") with is_healthy=False, is_fatal=False,
error_codes=["PCIE_CHECK_FAILED"], a clear message like "Unable to determine
PCIe status for {node_name}: check execution failed or timed out", and an
appropriate entities_impacted (e.g., node or no specific GPU), so consumers and
the watcher can surface monitoring gaps; keep existing behavior for non-empty
List[PCIeStatus].

self._fabric_manager_down = False

# Initialize checkers and build the check list based on enabled flags
self._checkers: List[tuple[str, callable]] = []

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the actual file to understand the context
cd health-monitors/fabric-manager-monitor
head -n 100 fabric_manager_monitor/checkers/watcher.py | cat -n

Repository: NVIDIA/NVSentinel

Length of output: 4527


🏁 Script executed:

# Run the suggested command from the review to find other instances
rg -n 'callable\]' --type=py

Repository: NVIDIA/NVSentinel

Length of output: 201


🏁 Script executed:

# Also search for imports and see if Callable is used elsewhere
rg -n 'from typing import|Callable' --type=py fabric_manager_monitor/checkers/watcher.py

Repository: NVIDIA/NVSentinel

Length of output: 143


🏁 Script executed:

# Check Python version requirements in the project
fd -e "pyproject.toml" -o -e "setup.py" -o -e ".python-version" | head -5 | xargs cat

Repository: NVIDIA/NVSentinel

Length of output: 177


Replace callable (lowercase) with a proper Callable type annotation.

callable is a built-in function, not a type. Use typing.Callable with the appropriate signature. This will be flagged by static type checkers.

Proposed fix
-from typing import List
+from typing import Callable, List, Tuple
 ...
-        self._checkers: List[tuple[str, callable]] = []
+        self._checkers: List[Tuple[str, Callable[[], List[CheckResult]]]] = []
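A small self-contained illustration of the annotation suggested above — the checker names and return type here are hypothetical, standing in for the monitor's real check functions:

```python
from typing import Callable, List, Tuple

# Each checker is a (name, zero-arg function) pair, matching the
# List[Tuple[str, Callable[[], ...]]] shape proposed in the fix above.
CheckFn = Callable[[], List[str]]

checkers: List[Tuple[str, CheckFn]] = [
    ("service", lambda: []),                       # healthy: no error codes
    ("pcie", lambda: ["PCIE_LINK_DEGRADED"]),      # degraded: one error code
]

def run_all(checks: List[Tuple[str, CheckFn]]) -> dict:
    # Run every registered checker and collect error codes by name.
    return {name: fn() for name, fn in checks}
```

With `Callable` (rather than the builtin `callable`), mypy/pyright can verify both the arity and the return type of each registered check.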
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py`
at line 72, The annotation for self._checkers uses the builtin callable instead
of the typing Callable; update the type to use typing.Callable (for example
List[tuple[str, Callable[..., Any]]]) and add the corresponding import (from
typing import Callable, Any) at the top of watcher.py so static type checkers
recognize the correct type for the check functions referenced by self._checkers.

- Fix flapping alert: use increase() > 3 instead of rate() > 0.5
  (old threshold required ~300 restarts in 10m to trigger)
- Fix PCIe degradation not affecting overall_healthy in demo monitor
- Fix requests CVE-2024-35195: bump minimum to >=2.32.2
- Fix click --verbose: use is_flag=True instead of type=bool
- Fix grpcio-tools: move from runtime to dev dependencies
- Cap exponential backoff: MAX_RETRIES=5, MAX_DELAY=15s
  (was 10 retries with uncapped delay blocking ~9 minutes)
- Remove unnecessary cluster-wide pods RBAC access
- Add Apache 2.0 license headers to __init__.py files
- Fix type annotations: Dict[str, float] -> Dict[str, List[float]]
- Improve port-forward docs to target specific node's pod

Signed-off-by: Anton Alexander <dmvevents@gmail.com>
- Fix CUDAValidationResult: use field(default_factory=list) instead of
  mutable default `errors: list = None`
- Fix logger: add "critical" to _LEVEL_MAP (was silently falling back
  to INFO)
- Fix Dockerfile: remove non-functional ARG shell substitution and
  rm -rf inside BuildKit cache mount
- Fix cache key sorting: sort entities by explicit fields instead of
  dict __repr__
- Fix Prometheus text parser: handle optional timestamps in exposition
  format (split on space, take second field as value)
- Add GENERAL_ERROR patterns to journal error parsing
- Pin demo image tag to 0.1.0 instead of :latest
- Bump requests to >=2.32.2 in native integration

Signed-off-by: Anton Alexander <dmvevents@gmail.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (8)
demos/fabric-manager-monitor/monitor.py (1)

91-93: Unused frame parameter in signal handler.

Ruff ARG002 flags frame as unused. Since the signal handler signature is mandated by Python's signal module, rename to _frame (or annotate with # noqa: ARG002) to silence the lint warning without changing behaviour.

♻️ Proposed fix
-    def _handle_signal(self, signum, frame):
+    def _handle_signal(self, signum: int, _frame: object) -> None:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/monitor.py` around lines 91 - 93, The signal
handler _handle_signal currently declares an unused parameter named frame which
triggers Ruff ARG002; update the handler signature in the monitor class by
renaming the parameter from frame to _frame (or alternately add a `# noqa:
ARG002` comment) so the function still matches the signal.signal callback
signature but the linter stops flagging it; ensure the body remains unchanged
(logger.info and self._shutdown.set()) and only the parameter name/annotation is
modified.
demos/fabric-manager-monitor/checks/fabric_check.py (2)

88-95: rsplit(" ", 1) will parse the timestamp as the value on lines that include a Prometheus timestamp.

Per the Prometheus exposition format, a sample line may have an optional millisecond-epoch timestamp: metric{labels} value timestamp. Using rsplit(" ", 1) splits at the rightmost space, so value_str would be the timestamp, not the sample value.

DCGM exporter typically omits timestamps, so this is benign in practice. If you want robustness, split on whitespace and take the second token instead:

♻️ Proposed fix
-                name_and_labels, value_str = line.rsplit(" ", 1)
-                name = name_and_labels.split("{")[0]
-                value = float(value_str)
+                parts = line.split()
+                name = parts[0].split("{")[0]
+                value = float(parts[1])
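A standalone sketch of the whitespace-tolerant parser (function name and return shape are illustrative, mirroring the `metrics.setdefault(name, []).append(value)` pattern in the demo):

```python
from typing import Dict, List

def parse_exposition(text: str) -> Dict[str, List[float]]:
    # Parse Prometheus text exposition into {metric_name: [values]}.
    # Splitting on whitespace and taking the SECOND token tolerates the
    # optional trailing millisecond timestamp; rsplit(" ", 1) would
    # mistake that timestamp for the sample value.
    metrics: Dict[str, List[float]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        parts = line.split()
        if len(parts) < 2:
            continue
        name = parts[0].split("{")[0]  # strip the label set, if any
        try:
            value = float(parts[1])
        except ValueError:
            continue
        metrics.setdefault(name, []).append(value)
    return metrics
```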
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/checks/fabric_check.py` around lines 88 - 95,
The parser currently uses line.rsplit(" ", 1) which can pick up an optional
Prometheus timestamp as value_str; instead split the line on whitespace (e.g.,
parts = line.split()), ensure there are at least two tokens, set name_and_labels
= parts[0] and value_str = parts[1], then proceed to extract name =
name_and_labels.split("{")[0], convert value = float(value_str), and append to
metrics as before (metrics.setdefault(name, []).append(value)); also keep the
existing try/except or check token length to avoid IndexError/ValueError.

83-83: Internal Dict[str, list] annotations should be Dict[str, List[float]] to match the declared return/parameter types.

Line 83's variable annotation and line 98's parameter annotation use the unparameterized list, while the public signatures on lines 81 and 72 declare Dict[str, List[float]]. This creates a silent mismatch that mypy/pyright will flag.

♻️ Proposed fix
-        metrics: Dict[str, list] = {}
+        metrics: Dict[str, List[float]] = {}
-    def _sum_metric(self, metrics: Dict[str, list], name: str) -> float:
+    def _sum_metric(self, metrics: Dict[str, List[float]], name: str) -> float:

Also applies to: 98-98

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/checks/fabric_check.py` at line 83, Update the
internal annotations that use the bare list to the more specific List[float]:
change the variable annotation metrics: Dict[str, list] to metrics: Dict[str,
List[float]] and any function parameter annotated as Dict[str, list] (e.g., the
parameter named metrics) to Dict[str, List[float]]; also add "from typing import
List" if it's not already imported so the List type is available. Ensure all
occurrences of the bare list in these Dict[...] annotations are replaced
consistently with List[float] to match the public signatures.
demos/fabric-manager-monitor/k8s/servicemonitor.yaml (1)

1-6: Consider adding a Helm chart for the production health monitor.

The demo ships raw k8s manifests, which is fine for standalone evaluation. However, the companion production health monitor (health-monitors/fabric-manager-monitor/) has no Helm chart in distros/kubernetes/nvsentinel/charts/.

Based on learnings: "Health monitors should include Helm chart in distros/kubernetes/nvsentinel/charts/"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/k8s/servicemonitor.yaml` around lines 1 - 6, The
ServiceMonitor manifest for kind: ServiceMonitor and metadata.name:
fabric-manager-monitor is committed as a raw k8s YAML but the repo expects a
Helm chart for production monitors; create a Helm chart named
fabric-manager-monitor in the distro charts collection (the repo's nvsentinel
charts area) that templates the ServiceMonitor (templates/servicemonitor.yaml)
and exposes values (values.yaml) for name, namespace, labels/selectors and
endpoints, include Chart.yaml and README, and add the chart to whatever charts
index/packaging used by the repo so the production health monitor is deployable
via Helm instead of a raw manifest.
health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py (2)

131-134: Prefix unused signal handler parameters with _ to suppress linting warnings.

signum and frame are required by the signal.signal contract but intentionally unused; prefixing with _ is the conventional signal.

♻️ Proposed fix
-    def process_exit_signal(signum, frame):
+    def process_exit_signal(_signum, _frame):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py` around
lines 131 - 134, The signal handler process_exit_signal currently declares
unused parameters signum and frame; rename them to _signum and _frame in the
process_exit_signal function signature to suppress lint warnings while
preserving the signal.signal contract, leaving the body unchanged (it should
still call exit.set(), prom_server.shutdown(), and t.join()) so behavior is
unaffected.

159-160: Consider adding a Helm chart for the production fabric-manager-monitor component.

The PR wires production-grade gRPC, signal handling, and Prometheus metrics but there is no Helm chart for it under distros/kubernetes/nvsentinel/charts/. This is a deployment gap for Kubernetes-based installations.

Based on learnings: "Health monitors should include Helm chart in distros/kubernetes/nvsentinel/charts/."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py` around
lines 159 - 160, The project lacks a Helm chart for deploying the production
fabric-manager-monitor; add a new Helm chart in the nvsentinel charts directory
for the fabric-manager-monitor component that packages the service invoked by
the CLI entrypoint (the if __name__ == "__main__": cli() binary). Create
Chart.yaml and values.yaml, and templates including Deployment (container image,
GRPC port, resources, env for signal/metrics config), Service, ServiceMonitor
(Prometheus scraping), RBAC (if needed), ConfigMap for config, and probes
(liveness/readiness); ensure templates expose the same gRPC/metrics ports used
by the code and make chart values configurable (image/tag, replicas, resources),
then add the chart to the repo packaging/CI so it’s included in release
artifacts.
health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py (2)

136-137: Use log.exception to preserve the traceback on unexpected errors.

log.error(f"Exception while sending health events: {e}") silently discards the stack trace. Swap to log.exception so the full context is visible in structured logs, and optionally narrow the catch type.

♻️ Proposed fix
-            except Exception as e:
-                log.error(f"Exception while sending health events: {e}")
+            except Exception:
+                log.exception("Exception while sending health events")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`
around lines 136 - 137, The except block currently calls log.error(f"Exception
while sending health events: {e}") which loses the traceback; replace that call
with log.exception("Exception while sending health events") in the except block
(preserving the exception variable or removing it since log.exception captures
the traceback) and consider narrowing the except Exception to a more specific
exception type inside event_processor.py where the send-health-events logic runs
so we don't swallow unrelated errors.

154-158: log.error drops the traceback; sleep runs wastefully after the last retry.

Three nits in the retry handler:

  1. log.error (line 155) → log.exception to include the RPC error's traceback in structured output.
  2. sleep(delay) executes even on the final failed attempt — move the sleep before delay update and guard it with a retry-count check, or restructure with else.
  3. continue (line 158) is redundant at the end of the loop body.
♻️ Proposed fix
         for attempt in range(MAX_RETRIES):
             with grpc.insecure_channel(f"unix://{self._socket_path}") as chan:
                 stub = platformconnector_pb2_grpc.PlatformConnectorStub(chan)
                 try:
                     stub.HealthEventOccurredV1(platformconnector_pb2.HealthEvents(events=health_events, version=1))
                     metrics.events_sent_success.inc()
                     return True
                 except grpc.RpcError as e:
-                    log.error(f"Failed to send health event to UDS: {e}")
-                    sleep(delay)
-                    delay = min(delay * 1.5, MAX_DELAY)
-                    continue
+                    log.exception(f"Failed to send health event to UDS: {e}")
+                    if attempt < MAX_RETRIES - 1:
+                        sleep(delay)
+                        delay = min(delay * 1.5, MAX_DELAY)
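The sleep-guard logic above, isolated as a runnable sketch (the `send` callable and initial delay are placeholders; MAX_RETRIES/MAX_DELAY values are taken from the commit notes in this PR):

```python
import time
from typing import Callable

MAX_RETRIES = 5   # capped per the commit notes above
MAX_DELAY = 15.0  # seconds

def send_with_backoff(send: Callable[[], bool],
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    # Retry with 1.5x exponential backoff, capped at MAX_DELAY.
    # Crucially: no sleep after the final failed attempt.
    delay = 1.0
    for attempt in range(MAX_RETRIES):
        if send():
            return True
        if attempt < MAX_RETRIES - 1:
            sleep(delay)
            delay = min(delay * 1.5, MAX_DELAY)
    return False
```

Injecting `sleep` as a parameter keeps the backoff schedule unit-testable without real delays.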
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`
around lines 154 - 158, The retry handler currently swallows traceback and
always sleeps even after the final attempt; change the except grpc.RpcError as e
block to call log.exception(...) instead of log.error(...) to include the
traceback, move the sleep(delay) to occur before updating delay (delay =
min(delay * 1.5, MAX_DELAY)) and only execute it if there are remaining retries
(use the loop's retry counter or compare current attempt index to max_retries),
and remove the redundant continue at the end of the except block so control
naturally loops or exits based on the retry condition.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@demos/fabric-manager-monitor/k8s/servicemonitor.yaml`:
- Around line 111-121: The alert GPUServiceDown currently sets labels.severity
to "critical" which conflicts with the PR's stated non-fatal classification for
GPU services; update the alert definition (alert name GPUServiceDown, labels
block) to use severity: warning instead of critical so the alert severity aligns
with the non-fatal classification used for GPU services like persistenced/DCGM.

In `@demos/fabric-manager-monitor/monitor.py`:
- Around line 114-117: The metric fabric_manager_restarts_total is never
incremented because the loop handling fm_status.n_restarts contains a no-op; fix
by storing the previous restart count per FM (e.g., a module-level dict like
prev_restarts keyed by FM id/host) and in the monitoring loop (where
fm_status.n_restarts is read) compute delta = max(0, fm_status.n_restarts -
prev_restarts[key]); if delta > 0 call the Counter's inc(delta) to increment
fabric_manager_restarts_total, then update prev_restarts[key] =
fm_status.n_restarts so subsequent cycles use the new baseline; ensure this
logic is applied where the current code comments reference "increment by delta"
and that the symbol fabric_manager_restarts_total is the Counter you call inc on
so the FabricManagerFlapping alert can fire.
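The delta-based counter update described above can be sketched as follows — `prev_restarts`, `record_restarts`, and the Counter stand-in are all hypothetical names, standing in for the demo's real `fabric_manager_restarts_total` (a prometheus_client Counter) and its polling loop:

```python
from typing import Dict

class Counter:
    # Minimal stand-in for prometheus_client.Counter (sketch only).
    def __init__(self) -> None:
        self.value = 0.0
    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

fabric_manager_restarts_total = Counter()
prev_restarts: Dict[str, int] = {}  # baseline restart count per node/FM key

def record_restarts(key: str, n_restarts: int) -> None:
    # First observation becomes the baseline (no increment for historical
    # restarts); max(0, ...) guards against the count resetting on reboot.
    delta = max(0, n_restarts - prev_restarts.get(key, n_restarts))
    if delta > 0:
        fabric_manager_restarts_total.inc(delta)
    prev_restarts[key] = n_restarts
```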

In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py`:
- Line 95: Rename the threading.Event variable currently named exit to avoid
shadowing the built-in: change its declaration (Event() assigned to exit) to
exit_event, update the closure process_exit_signal to call exit_event.set()
instead of exit.set(), and update any callers such as watcher.start(...) to pass
exit_event (e.g., watcher.start(exit_event)); ensure all references to the old
symbol exit in that scope are replaced with exit_event.
- Around line 106-108: Wrap the call to
get_package_version("fabric-manager-monitor") in a try/except that catches
importlib.metadata.PackageNotFoundError, defaulting version to a safe value
(e.g., "dev" or "unknown") when the package is not installed, and ensure
set_default_structured_logger_with_level("fabric-manager-monitor", version,
log_level) is always called; reference get_package_version,
PackageNotFoundError, and set_default_structured_logger_with_level when making
the change so the structured logger is initialized even on bare source
checkouts.
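A minimal sketch of that fallback, assuming `get_package_version` resolves to `importlib.metadata.version` (the helper name in the project may wrap it differently):

```python
from importlib.metadata import PackageNotFoundError
from importlib.metadata import version as get_package_version

def resolve_version(package: str, fallback: str = "dev") -> str:
    # On a bare source checkout the distribution metadata is absent;
    # fall back to a safe marker so logger setup still runs instead of
    # raising PackageNotFoundError at startup.
    try:
        return get_package_version(package)
    except PackageNotFoundError:
        return fallback
```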

---

Duplicate comments:
In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py`:
- Around line 77-94: The cli function signature lacks type annotations; update
the cli(...) declaration to add explicit Python type hints for each parameter
(e.g., platform_connector_socket: str, port: int, poll_interval: float,
node_name: str, boot_grace_period: int, flap_window: int, flap_threshold: int,
enable_fabric_check: bool, enable_pcie_check: bool, enable_clock_check: bool,
enable_nvlink_check: bool, enable_cuda_validation: bool, dcgm_exporter_url:
Optional[str], clock_throttle_ratio: float, processing_strategy: str, verbose:
bool) and add a return annotation (e.g., -> None). Use Optional[...] where
parameters can be None and import typing types if needed; update the cli
symbol's signature only so callers remain compatible.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`:
- Around line 64-67: _build_cache_key generates cache keys by sorting
entities_impacted with key=str, which depends on dict insertion order; change
the sort to use explicit stable fields so semantically identical dicts produce
the same key: sort entities_impacted by a tuple of (e.get('entityType',''),
e.get('entityValue','')) (or the canonical field names used in your entities)
and then build the entity_str using those fields (e.g.,
f"{entityType}:{entityValue}") so ordering and missing-key cases are
deterministic; update the _build_cache_key function to use this explicit-key
sorting and access pattern.

---

Nitpick comments:
In `@demos/fabric-manager-monitor/checks/fabric_check.py`:
- Around line 88-95: The parser currently uses line.rsplit(" ", 1) which can
pick up an optional Prometheus timestamp as value_str; instead split the line on
whitespace (e.g., parts = line.split()), ensure there are at least two tokens,
set name_and_labels = parts[0] and value_str = parts[1], then proceed to extract
name = name_and_labels.split("{")[0], convert value = float(value_str), and
append to metrics as before (metrics.setdefault(name, []).append(value)); also
keep the existing try/except or check token length to avoid
IndexError/ValueError.
- Line 83: Update the internal annotations that use the bare list to the more
specific List[float]: change the variable annotation metrics: Dict[str, list] to
metrics: Dict[str, List[float]] and any function parameter annotated as
Dict[str, list] (e.g., the parameter named metrics) to Dict[str, List[float]];
also add "from typing import List" if it's not already imported so the List type
is available. Ensure all occurrences of the bare list in these Dict[...]
annotations are replaced consistently with List[float] to match the public
signatures.

In `@demos/fabric-manager-monitor/k8s/servicemonitor.yaml`:
- Around line 1-6: The ServiceMonitor manifest for kind: ServiceMonitor and
metadata.name: fabric-manager-monitor is committed as a raw k8s YAML but the
repo expects a Helm chart for production monitors; create a Helm chart named
fabric-manager-monitor in the distro charts collection (the repo's nvsentinel
charts area) that templates the ServiceMonitor (templates/servicemonitor.yaml)
and exposes values (values.yaml) for name, namespace, labels/selectors and
endpoints, include Chart.yaml and README, and add the chart to whatever charts
index/packaging used by the repo so the production health monitor is deployable
via Helm instead of a raw manifest.

In `@demos/fabric-manager-monitor/monitor.py`:
- Around line 91-93: The signal handler _handle_signal currently declares an
unused parameter named frame which triggers Ruff ARG002; update the handler
signature in the monitor class by renaming the parameter from frame to _frame
(or alternately add a `# noqa: ARG002` comment) so the function still matches
the signal.signal callback signature but the linter stops flagging it; ensure
the body remains unchanged (logger.info and self._shutdown.set()) and only the
parameter name/annotation is modified.

In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py`:
- Around line 131-134: The signal handler process_exit_signal currently declares
unused parameters signum and frame; rename them to _signum and _frame in the
process_exit_signal function signature to suppress lint warnings while
preserving the signal.signal contract, leaving the body unchanged (it should
still call exit.set(), prom_server.shutdown(), and t.join()) so behavior is
unaffected.
- Around line 159-160: The project lacks a Helm chart for deploying the
production fabric-manager-monitor; add a new Helm chart in the nvsentinel charts
directory for the fabric-manager-monitor component that packages the service
invoked by the CLI entrypoint (the if __name__ == "__main__": cli() binary).
Create Chart.yaml and values.yaml, and templates including Deployment (container
image, GRPC port, resources, env for signal/metrics config), Service,
ServiceMonitor (Prometheus scraping), RBAC (if needed), ConfigMap for config,
and probes (liveness/readiness); ensure templates expose the same gRPC/metrics
ports used by the code and make chart values configurable (image/tag, replicas,
resources), then add the chart to the repo packaging/CI so it’s included in
release artifacts.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`:
- Around line 136-137: The except block currently calls log.error(f"Exception
while sending health events: {e}") which loses the traceback; replace that call
with log.exception("Exception while sending health events") in the except block
(preserving the exception variable or removing it since log.exception captures
the traceback) and consider narrowing the except Exception to a more specific
exception type inside event_processor.py where the send-health-events logic runs
so we don't swallow unrelated errors.
- Around line 154-158: The retry handler currently swallows traceback and always
sleeps even after the final attempt; change the except grpc.RpcError as e block
to call log.exception(...) instead of log.error(...) to include the traceback,
move the sleep(delay) to occur before updating delay (delay = min(delay * 1.5,
MAX_DELAY)) and only execute it if there are remaining retries (use the loop's
retry counter or compare current attempt index to max_retries), and remove the
redundant continue at the end of the except block so control naturally loops or
exits based on the retry condition.

Comment on lines +111 to +121
- alert: GPUServiceDown
expr: nvidia_service_up == 0
for: 3m
labels:
severity: critical
component: fabric-manager-monitor
annotations:
summary: "{{ $labels.service_name }} down on {{ $labels.node }}"
description: >-
NVIDIA service {{ $labels.service_name }} has been down for more
than 3 minutes on {{ $labels.node }}.

⚠️ Potential issue | 🟡 Minor

severity: critical inconsistent with the PR's non-fatal classification for GPU services.

The PR description explicitly classifies GPU services (persistenced, DCGM) as non-fatal. Using severity: critical for this alert will page on-call for a condition that the monitor itself treats as non-fatal, creating alert-fatigue risk. Consider severity: warning to align with the stated impact.

🔧 Proposed fix
         - alert: GPUServiceDown
           expr: nvidia_service_up == 0
           for: 3m
           labels:
-            severity: critical
+            severity: warning
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/k8s/servicemonitor.yaml` around lines 111 - 121,
The alert GPUServiceDown currently sets labels.severity to "critical" which
conflicts with the PR's stated non-fatal classification for GPU services; update
the alert definition (alert name GPUServiceDown, labels block) to use severity:
warning instead of critical so the alert severity aligns with the non-fatal
classification used for GPU services like persistenced/DCGM.

Comment on lines +114 to +117
# Update restart counter (set to current total — Counter only goes up)
if fm_status.n_restarts > 0:
# We use _total suffix via Counter; increment by delta
pass # Counter tracks via flap detection instead

⚠️ Potential issue | 🟠 Major

fabric_manager_restarts_total is never incremented — FabricManagerFlapping alert can never fire.

The pass at line 117 is a deliberate no-op, so increase(fabric_manager_restarts_total[10m]) > 3 in the PrometheusRule will always evaluate to 0 and the alert is effectively dead code.

The intended fix based on the comment ("increment by delta") would require tracking the previous restart count across cycles:

🐛 Proposed fix
 class FabricManagerMonitor:
     def __init__(self, config: MonitorConfig):
         ...
         self._fabric_manager_down = False
         self._nvlink_bandwidth_zero = False
+        self._prev_fm_restarts: int = 0
-                    if fm_status.n_restarts > 0:
-                        # We use _total suffix via Counter; increment by delta
-                        pass  # Counter tracks via flap detection instead
+                    delta = max(0, fm_status.n_restarts - self._prev_fm_restarts)
+                    if delta > 0:
+                        fabric_manager_restarts_total.labels(node).inc(delta)
+                    self._prev_fm_restarts = fm_status.n_restarts
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/monitor.py` around lines 114 - 117, The metric
fabric_manager_restarts_total is never incremented because the loop handling
fm_status.n_restarts contains a no-op; fix by storing the previous restart count
per FM (e.g., a module-level dict like prev_restarts keyed by FM id/host) and in
the monitoring loop (where fm_status.n_restarts is read) compute delta = max(0,
fm_status.n_restarts - prev_restarts[key]); if delta > 0 call the Counter's
inc(delta) to increment fabric_manager_restarts_total, then update
prev_restarts[key] = fm_status.n_restarts so subsequent cycles use the new
baseline; ensure this logic is applied where the current code comments reference
"increment by delta" and that the symbol fabric_manager_restarts_total is the
Counter you call inc on so the FabricManagerFlapping alert can fire.

processing_strategy,
verbose,
):
exit = Event()

⚠️ Potential issue | 🟡 Minor

exit shadows the Python built-in; rename to exit_event.

The threading.Event variable named exit masks the built-in exit() within the entire function scope, including the process_exit_signal closure.

🐛 Proposed fix
-    exit = Event()
+    exit_event = Event()

Apply the same rename at lines 131 (exit.set() → exit_event.set()) and 156 (watcher.start(exit) → watcher.start(exit_event)).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py` at line
95, Rename the threading.Event variable currently named exit to avoid shadowing
the built-in: change its declaration (Event() assigned to exit) to exit_event,
update the closure process_exit_signal to call exit_event.set() instead of
exit.set(), and update any callers such as watcher.start(...) to pass exit_event
(e.g., watcher.start(exit_event)); ensure all references to the old symbol exit
in that scope are replaced with exit_event.

Comment on lines +106 to +108
version = get_package_version("fabric-manager-monitor")
log_level = "debug" if verbose else os.getenv("LOG_LEVEL", "info")
set_default_structured_logger_with_level("fabric-manager-monitor", version, log_level)

⚠️ Potential issue | 🟡 Minor

get_package_version() can raise PackageNotFoundError before the structured logger is initialized.

If the package isn't installed (e.g., bare source checkout), this throws an uncaught importlib.metadata.PackageNotFoundError before set_default_structured_logger_with_level is ever called, producing an unformatted traceback instead of a clean error.

🛡️ Proposed fix
+from importlib.metadata import PackageNotFoundError
 ...
-    version = get_package_version("fabric-manager-monitor")
+    try:
+        version = get_package_version("fabric-manager-monitor")
+    except PackageNotFoundError:
+        version = "unknown"
📝 Committable suggestion


Suggested change
-    version = get_package_version("fabric-manager-monitor")
+    try:
+        version = get_package_version("fabric-manager-monitor")
+    except PackageNotFoundError:
+        version = "unknown"
     log_level = "debug" if verbose else os.getenv("LOG_LEVEL", "info")
     set_default_structured_logger_with_level("fabric-manager-monitor", version, log_level)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py` around
lines 106 - 108, Wrap the call to get_package_version("fabric-manager-monitor")
in a try/except that catches importlib.metadata.PackageNotFoundError, defaulting
version to a safe value (e.g., "dev" or "unknown") when the package is not
installed, and ensure
set_default_structured_logger_with_level("fabric-manager-monitor", version,
log_level) is always called; reference get_package_version,
PackageNotFoundError, and set_default_structured_logger_with_level when making
the change so the structured logger is initialized even on bare source
checkouts.
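
The guarded lookup needs nothing beyond the stdlib; `safe_package_version` is a hypothetical helper name:

```python
from importlib.metadata import PackageNotFoundError, version

def safe_package_version(name: str, default: str = "unknown") -> str:
    """Return the installed version of *name*, or *default* when the
    distribution is not installed (e.g. a bare source checkout)."""
    try:
        return version(name)
    except PackageNotFoundError:
        return default
```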

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (17)
demos/fabric-manager-monitor/k8s/daemonset.yaml (1)

62-68: /var/run/dbus hostPath mount may be redundant with nsenter --mount.

When nsenter enters the host's mount namespace, the host's D-Bus socket (/var/run/dbus/system_bus_socket) becomes accessible naturally. Explicitly bind-mounting it into the container is redundant if all systemctl calls go through nsenter. If any code path issues D-Bus calls directly (without nsenter), add a clarifying comment explaining why.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/k8s/daemonset.yaml` around lines 62 - 68, The
hostPath volume and volumeMount named "host-dbus" that binds /var/run/dbus is
likely redundant when the container uses nsenter to enter the host mount
namespace (nsenter) because /var/run/dbus/system_bus_socket will already be
accessible; remove the volumeMount entry (mountPath: /var/run/dbus) and the
volumes entry for "host-dbus" to avoid an unnecessary bind-mount, or if there
are any code paths that call D-Bus directly (e.g., systemctl calls not wrapped
in nsenter), keep the "host-dbus" entries but add a clear comment referencing
nsenter, systemctl, and system_bus_socket explaining why the explicit hostPath
mount is required.
health-monitors/fabric-manager-monitor/fabric_manager_monitor/logger.py (1)

58-59: Prefix unused-but-required processor arguments with _ to silence Ruff ARG001.

logger and method_name are mandated by the structlog processor protocol and cannot be removed, but the Ruff linter (ARG001) flags them as unused. The idiomatic Python fix is a _ prefix.

♻️ Proposed refactor
     def inject_module_version(
-        logger: logging.Logger | None,
-        method_name: str,
+        _logger: logging.Logger | None,
+        _method_name: str,
         event_dict: dict[str, Any],
     ) -> dict[str, Any]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/logger.py`
around lines 58 - 59, The structlog processor function currently declares
parameters named logger and method_name which are required by the processor
protocol but unused and trigger Ruff ARG001; fix this by prefixing those
parameters with an underscore (e.g., change logger -> _logger and method_name ->
_method_name) in the processor function signature (and update any internal
references if they exist) so the names remain present for the protocol but are
treated as intentionally unused.
health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py (5)

22-22: import logging as log uses the root logger.

All log.debug/error/warning calls go through the root logger rather than a module-scoped logger, losing module-level filtering and source attribution. The sibling file cuda_validation.py correctly uses logger = logging.getLogger(__name__).

♻️ Proposed fix
-import logging as log
+import logging
+
+log = logging.getLogger(__name__)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`
at line 22, The module currently imports the root logger via "import logging as
log", causing all calls to use the root logger; replace that with "import
logging" and add a module-scoped logger variable (logger =
logging.getLogger(__name__)), then update all uses of
log.debug/log.error/log.warning (etc.) in event_processor.py to use
logger.debug/logger.error/logger.warning so log records carry module-level
source and filtering. Ensure the new logger is configured consistently with
existing sibling modules (e.g., cuda_validation.py) and keep existing log
messages unchanged.

132-132: Non-Pythonic len() truthiness check.

if len(health_events): should be if health_events:.

♻️ Proposed fix
-            if len(health_events):
+            if health_events:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`
at line 132, Replace the non-Pythonic truthiness check using len() by checking
the list directly: change the condition `if len(health_events):` to `if
health_events:` in the event processing code that references the variable
`health_events` (look for the block inside the EventProcessor/event processing
function in event_processor.py where `health_events` is evaluated) so the code
uses idiomatic Python truthiness.

139-140: log.error() silently drops the exception traceback at both catch sites.

Lines 140 and 158 use log.error(f"... {e}"), which records only the stringified exception. Ruff TRY400 confirms this — use log.exception() (or log.error(..., exc_info=True)) so the full traceback is available in logs for diagnosis.

♻️ Proposed fix
-                    log.error(f"Failed to send health event to UDS: {e}")
+                    log.exception("Failed to send health event to UDS")
-                log.error(f"Exception while sending health events: {e}")
+                log.exception("Exception while sending health events")

Also applies to: 157-158

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`
around lines 139 - 140, The except blocks in event_processor.py currently call
log.error(f"Exception while sending health events: {e}") which drops the
traceback; update those catch sites (the two except Exception handlers around
the send/publish logic in EventProcessor/event processing routine) to log the
full traceback by using log.exception("Exception while sending health events")
or log.error("Exception while sending health events", exc_info=True) instead so
the stacktrace is captured for debugging.

149-161: New gRPC channel opened on every retry iteration.

grpc.insecure_channel(...) is constructed (and torn down by the with block) on each pass through the retry loop. Every retry pays the full Unix-domain-socket connection-establishment cost, negating the benefit of reconnect logic built into gRPC channels. Move channel creation outside the loop.

♻️ Proposed refactor
-        delay = INITIAL_DELAY
-        for _ in range(MAX_RETRIES):
-            with grpc.insecure_channel(f"unix://{self._socket_path}") as chan:
-                stub = platformconnector_pb2_grpc.PlatformConnectorStub(chan)
-                try:
-                    stub.HealthEventOccurredV1(platformconnector_pb2.HealthEvents(events=health_events, version=1))
-                    metrics.events_sent_success.inc()
-                    return True
-                except grpc.RpcError as e:
-                    log.error(f"Failed to send health event to UDS: {e}")
-                    sleep(delay)
-                    delay = min(delay * 1.5, MAX_DELAY)
-                    continue
+        delay = INITIAL_DELAY
+        with grpc.insecure_channel(f"unix://{self._socket_path}") as chan:
+            stub = platformconnector_pb2_grpc.PlatformConnectorStub(chan)
+            for _ in range(MAX_RETRIES):
+                try:
+                    stub.HealthEventOccurredV1(platformconnector_pb2.HealthEvents(events=health_events, version=1))
+                    metrics.events_sent_success.inc()
+                    return True
+                except grpc.RpcError as e:
+                    log.error(f"Failed to send health event to UDS: {e}", exc_info=True)
+                    sleep(delay)
+                    delay = min(delay * 1.5, MAX_DELAY)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`
around lines 149 - 161, The retry loop currently creates a new gRPC channel
inside the loop (grpc.insecure_channel) so each retry reconnects; move channel
creation outside the for loop and reuse the same channel and stub for all
retries: create chan = grpc.insecure_channel(f"unix://{self._socket_path}") and
stub = platformconnector_pb2_grpc.PlatformConnectorStub(chan) before the for _
in range(MAX_RETRIES) loop, then call stub.HealthEventOccurredV1(...) inside the
try block, keeping the existing backoff logic (INITIAL_DELAY, MAX_DELAY, sleep)
and metrics.events_sent_success.inc(), and ensure the channel is closed/cleaned
up after the loop completes or on success.

87-89: health_events_publish_time_to_grpc_channel metric includes retry sleep time.

The with ... .time(): context wraps the entire method body, including per-retry sleep(delay) calls (up to ~26 s cumulative). This makes the metric report inflated values on transient failures and conflates gRPC call latency with backoff delays, undermining its usefulness as an SLI.

Consider measuring only the actual gRPC call, or emit a separate metric for retry count/total delay.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`
around lines 87 - 89, The current timing context around
metrics.health_events_publish_time_to_grpc_channel.labels("health_check_completed_to_grpc_channel").time()
wraps the entire retry loop (including sleep(delay)), inflating latency; to fix,
move the .time() context so it only surrounds the actual gRPC call (the call to
the client/producer inside the retry loop) and remove sleeps from that scope,
and add a separate metric (e.g., health_events_retry_delay_seconds_total or
health_events_retries_count) updated inside the retry branch to record
cumulative backoff delay and retry count; reference the existing metric name
health_events_publish_time_to_grpc_channel and the retry/sleep logic in the
method to locate where to reposition the timer and where to increment the new
retry metric.
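
A sketch combining this with the earlier retry-loop findings: time only the call itself, and sleep only when retries remain. `HistogramStub` and `send_with_retries` are illustrative stand-ins, with `ConnectionError` in place of `grpc.RpcError`:

```python
import time

class HistogramStub:
    """Stand-in for a prometheus_client Histogram (only observe())."""
    def __init__(self) -> None:
        self.samples = []
    def observe(self, value: float) -> None:
        self.samples.append(value)

publish_latency = HistogramStub()

def send_with_retries(call, max_retries=4, initial_delay=0.02, max_delay=0.05) -> bool:
    delay = initial_delay
    for attempt in range(max_retries):
        start = time.monotonic()
        try:
            call()
        except ConnectionError:
            # Record only the failed call's duration, never the backoff.
            publish_latency.observe(time.monotonic() - start)
            if attempt < max_retries - 1:  # no sleep after the final attempt
                time.sleep(delay)
                delay = min(delay * 1.5, max_delay)
        else:
            publish_latency.observe(time.monotonic() - start)
            return True
    return False

attempts = []
def flaky_call():
    """Fails twice, then succeeds — simulates a transient RPC error."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient")

ok = send_with_retries(flaky_call)
```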
demos/fabric-manager-monitor/checks/cuda_validation.py (2)

87-88: Move import json to module-level imports.

import json inside the method body violates PEP 8, which requires all imports at the top of the file. As per coding guidelines, PEP 8 must be followed for Python code.

♻️ Proposed fix
 import logging
 import subprocess
 import sys
 import textwrap
+import json
 from dataclasses import dataclass, field
 from typing import List, Optional
-            import json
             data = json.loads(result.stdout.strip())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/checks/cuda_validation.py` around lines 87 - 88,
The local "import json" found inside the function in cuda_validation.py should
be moved to the module-level imports to comply with PEP 8: remove the inline
"import json" and add "import json" alongside the other top-of-file imports so
that the code that calls json.loads(result.stdout.strip()) uses the module-level
json; ensure there are no name conflicts and run tests to verify nothing else
relied on the local import.

46-53: assert in the subprocess script can be silently stripped by -O.

assert t.sum().isfinite() is the sole GPU correctness check in the per-device loop. If Python is ever invoked with -O (or -OO), this assertion is a no-op and the function silently returns passed=True regardless of the tensor value. Replace with an explicit check.

♻️ Proposed fix
-            assert t.sum().isfinite(), f"GPU {i}: non-finite sum"
+            if not t.sum().isfinite():
+                raise RuntimeError(f"GPU {i}: non-finite sum")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/checks/cuda_validation.py` around lines 46 - 53,
The GPU correctness check uses an assert that can be disabled with -O; replace
the assert in the per-device loop (the line checking t.sum().isfinite() for
tensor t on GPU index i) with an explicit conditional that treats non-finite
results as an error: if the sum is not finite, append a descriptive error to
results["errors"] (or raise an Exception) and set results["passed"]=False so the
existing except/handling logic records the failure reliably even when Python is
run with optimizations.
health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/fabric_check.py (1)

87-87: Use Dict[str, List[float]] for the metrics mapping type.

All three methods (_fetch_metrics, _parse_prometheus_text, _sum_metric) use the unparameterized Dict[str, list]. The demo counterpart was updated to Dict[str, List[float]] in a prior review; the production file still lags.

♻️ Proposed fix
-    def _fetch_metrics(self) -> Dict[str, list]:
+    def _fetch_metrics(self) -> Dict[str, List[float]]:
 
-    def _parse_prometheus_text(self, text: str) -> Dict[str, list]:
+    def _parse_prometheus_text(self, text: str) -> Dict[str, List[float]]:
-        metrics: Dict[str, list] = {}
+        metrics: Dict[str, List[float]] = {}
 
-    def _sum_metric(self, metrics: Dict[str, list], name: str) -> float:
+    def _sum_metric(self, metrics: Dict[str, List[float]], name: str) -> float:

Also applies to: 96-96, 117-117

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/fabric_check.py`
at line 87, Update the typing for metrics maps to use List[float] instead of
unparameterized list: change return/arg types in _fetch_metrics,
_parse_prometheus_text, and _sum_metric from Dict[str, list] to Dict[str,
List[float]] and add the List import from typing if missing; ensure all three
function signatures and any local variable annotations that currently use
Dict[str, list] are updated to Dict[str, List[float]] for consistency.
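
For reference, the `Dict[str, List[float]]` shape these signatures describe can be sketched as below; `parse_prometheus_text` is an illustrative reimplementation (labels folded away, comment lines skipped), not the file's actual code:

```python
from typing import Dict, List

def parse_prometheus_text(text: str) -> Dict[str, List[float]]:
    """Collect samples per metric name from Prometheus text exposition."""
    metrics: Dict[str, List[float]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE/comments
            continue
        name_value = line.rsplit(" ", 1)
        if len(name_value) != 2:
            continue
        try:
            value = float(name_value[1])
        except ValueError:
            continue
        name = name_value[0].split("{", 1)[0]  # drop the label set
        metrics.setdefault(name, []).append(value)
    return metrics

sample = """\
# HELP DCGM_FI_PROF_NVLINK_TX_BYTES NVLink TX bytes
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="0"} 1024
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="1"} 2048
"""
parsed = parse_prometheus_text(sample)
```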
demos/fabric-manager-monitor/checks/fabric_check.py (2)

45-49: Narrow the exception type to requests.exceptions.RequestException.

Catching bare Exception silently swallows programming errors like AttributeError or TypeError. The only expected errors here are network/HTTP failures.

♻️ Proposed fix
-        except Exception as e:
+        except requests.exceptions.RequestException as e:
             return NVLinkStatus(
                 healthy=True,  # can't determine -- assume healthy
                 error=f"Failed to fetch DCGM metrics: {e}",
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/checks/fabric_check.py` around lines 45 - 49,
The except block in fabric_check.py that currently reads "except Exception as e"
(returning NVLinkStatus with error about DCGM metrics) is too broad; narrow it
to catch only network/HTTP errors by changing it to catch
requests.exceptions.RequestException (or RequestException) and ensure you import
RequestException from requests at the top of the module; keep the same return
behavior (healthy=True and error message) but only for RequestException so
programming errors (e.g., AttributeError/TypeError) still surface.

102-104: _sum_metric parameter type is less precise than its callers.

_fetch_metrics and _parse_prometheus_text return Dict[str, List[float]], but _sum_metric accepts the looser Dict[str, list]. Align them.

♻️ Proposed fix
-    def _sum_metric(self, metrics: Dict[str, list], name: str) -> float:
+    def _sum_metric(self, metrics: Dict[str, List[float]], name: str) -> float:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/checks/fabric_check.py` around lines 102 - 104,
The _sum_metric function's parameter type is too loose (Dict[str, list])
compared to its callers; change the type annotation of metrics in _sum_metric to
Dict[str, List[float]] to match _fetch_metrics and _parse_prometheus_text, and
add/ensure the List import from typing is present; update any related type hints
or imports so the function signature reads metrics: Dict[str, List[float]] while
keeping the float return.
health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py (5)

92-95: Annotate DEFAULT_GPU_SERVICES as ClassVar.

Ruff (RUF012) flags mutable class-level attributes that lack a ClassVar annotation.

♻️ Proposed fix
+from typing import ClassVar
 
 class ServiceChecker:
-    DEFAULT_GPU_SERVICES = [
+    DEFAULT_GPU_SERVICES: ClassVar[List[str]] = [
         "nvidia-persistenced",
         "nv-hostengine",
     ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py`
around lines 92 - 95, DEFAULT_GPU_SERVICES is a mutable class-level list and
should be annotated as a ClassVar to satisfy RUF012; update the declaration of
DEFAULT_GPU_SERVICES to use typing.ClassVar (e.g., ClassVar[list[str]] or
ClassVar[List[str]]), and add the required import(s) for ClassVar (and List if
using List) at the top of the module so the class attribute is properly
annotated as a non-instance variable.

185-186: Silent except Exception: pass in _get_restart_count swallows unexpected errors.

The intent is to handle older systemd that doesn't expose NRestarts, which only warrants catching subprocess.TimeoutExpired and ValueError (from int(val)). Bare except Exception: pass also masks unrelated failures (e.g., a broken nsenter binary) with no diagnostic signal.

♻️ Proposed fix
-        except Exception:
-            pass
+        except (subprocess.TimeoutExpired, ValueError):
+            pass
+        except Exception:
+            log.debug("Unexpected error fetching NRestarts for %s", service_name, exc_info=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py`
around lines 185 - 186, In _get_restart_count, remove the bare "except
Exception: pass" that swallows errors; instead catch only the expected
exceptions (subprocess.TimeoutExpired and ValueError from int(val)) and handle
them as before, and for any other unexpected exceptions (e.g.,
subprocess.CalledProcessError or runtime errors) log the exception via the
module logger or re-raise it so failures (like a broken nsenter) are not
silently ignored; update the except blocks around the subprocess call and int
conversion in _get_restart_count to implement this narrower handling and
logging.

259-263: Redundant subprocess.TimeoutExpired in the except tuple; also move return found to else.

subprocess.TimeoutExpired is a subclass of Exception, so the tuple is equivalent to except Exception. Consider narrowing to specific types instead, and move return found to an else block (Ruff TRY300) for clearer control flow.

♻️ Proposed fix
-            return found
-
-        except (subprocess.TimeoutExpired, Exception) as e:
+        except (subprocess.TimeoutExpired, OSError) as e:

with the successful return moved into an else block:

+        else:
+            return found

Or, if a catch-all is genuinely intended, drop the redundant tuple:

-        except (subprocess.TimeoutExpired, Exception) as e:
+        except Exception as e:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py`
around lines 259 - 263, The except clause currently catches
(subprocess.TimeoutExpired, Exception) which is redundant because TimeoutExpired
is an Exception; change the handler to catch only the specific error types you
expect (e.g., subprocess.TimeoutExpired or OSError) rather than Exception,
update the except line accordingly, and move the "return found" out of the try
block into an else block so successful parsing returns found in the else branch
while the except branch logs via log.warning(f"Journal parsing failed for
{service_name}: {e}") and returns [].

88-95: Helm chart not included for the new health monitor.

Per project conventions, health monitors should ship a Helm chart under distros/kubernetes/nvsentinel/charts/. The PR does not appear to include one for fabric-manager-monitor, leaving Kubernetes deployment unguided.

Based on learnings: "Health monitors should include Helm chart in distros/kubernetes/nvsentinel/charts/."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py`
around lines 88 - 95, Add a Helm chart for the new fabric-manager-monitor so the
ServiceChecker-based monitor (referencing ServiceChecker and
DEFAULT_GPU_SERVICES) can be deployed via Kubernetes; create a chart named
fabric-manager-monitor in the nvsentinel charts collection that includes
Chart.yaml, values.yaml (exposing image, replicas, and GPU service list driven
by DEFAULT_GPU_SERVICES), templates for
Deployment/Service/ServiceAccount/ConfigMap (mounting any config/env needed by
ServiceChecker), and update the nvsentinel umbrella/values to reference/install
this new chart so the monitor is included in cluster deployments.

107-107: Prefer list unpacking over concatenation (RUF005).

♻️ Proposed fix
-        full_cmd = ["nsenter", "-t", "1", "-m", "--"] + cmd
+        full_cmd = ["nsenter", "-t", "1", "-m", "--", *cmd]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py`
at line 107, The current construction of full_cmd uses list concatenation
("full_cmd = [...] + cmd"); change this to use list unpacking so the literal
nsenter list and the cmd iterable are merged via the star-unpacking operator
into a single list. Update the assignment for full_cmd (the variable built from
the nsenter prefix and cmd) to use unpacking and ensure cmd is an iterable of
strings before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@demos/fabric-manager-monitor/checks/cuda_validation.py`:
- Around line 68-104: Add a to_check_results(self) -> CheckResults (or the
project’s common check results type) method to class CUDAValidator that adapts a
CUDAValidationResult into the watcher pipeline format: call self.check() to get
a CUDAValidationResult, map its passed, error, gpu_count and errors fields into
the same structure other checkers (fabric, PCIe, clock) return, and return that
CheckResults object so watcher.py (called at watcher.py:301) can consume it;
refer to existing to_check_results implementations in the other checker classes
for the exact keys/shape and mirror their behavior in error/failed cases.
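A minimal sketch of what such an adapter could look like. The type names and field mappings here are illustrative stand-ins (the real `CheckResults` shape lives in the other checker classes), not the project's actual API:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CUDAValidationResult:
    passed: bool
    gpu_count: int = 0
    error: Optional[str] = None
    errors: List[str] = field(default_factory=list)


@dataclass
class CheckResult:  # hypothetical stand-in for the project's common check-results type
    check_name: str
    is_healthy: bool
    message: str


class CUDAValidator:
    def check(self) -> CUDAValidationResult:
        # The real implementation runs the CUDA test subprocess; stubbed here.
        return CUDAValidationResult(passed=True, gpu_count=8)

    def to_check_results(self) -> CheckResult:
        """Adapt a CUDAValidationResult into the watcher pipeline format."""
        result = self.check()
        # Fold the single error field and the per-GPU error list into one message,
        # mirroring how the other checkers report failures.
        problems = list(result.errors)
        if result.error:
            problems.append(result.error)
        return CheckResult(
            check_name="cuda_validation",
            is_healthy=result.passed,
            message="; ".join(problems) if problems else f"{result.gpu_count} GPUs validated",
        )
```

The exact keys and healthy/failed semantics should be copied from the existing `to_check_results` implementations rather than from this sketch.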

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/fabric_check.py`:
- Around line 58-64: When _fetch_metrics raises we currently return
NVLinkStatus(healthy=True) but leave bandwidth_zero at its dataclass default
(True), which causes to_check_results to treat the status as degraded; change
the exception branch in the code that calls _fetch_metrics to return
NVLinkStatus with bandwidth_zero explicitly set to False (e.g.,
NVLinkStatus(healthy=True, bandwidth_zero=False, error=...)) so to_check_results
will not raise NvlinkFabricDegraded when we “assume healthy”; update the return
in the try/except around _fetch_metrics and keep the error text in the error
field.

In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/logger.py`:
- Around line 47-49: _parse_log_level currently falls back silently to
logging.INFO and its docstring omits valid keys; change _parse_log_level to
first normalize the input and check membership in _LEVEL_MAP, and if the key is
unknown call the module logger (or logging.warning) to emit a clear warning
including the invalid string and that logging.INFO will be used, then return the
default; also expand the _parse_log_level docstring to explicitly list accepted
keys (e.g. "debug", "info", "warning", "warn", "error", "critical") and note the
function is case-insensitive and strips whitespace so typos can be diagnosed via
the warning; reference symbols: _parse_log_level and _LEVEL_MAP.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`:
- Around line 97-137: The code has a TOCTOU race on self.entity_cache in
health_check_completed: concurrent callbacks can read the same stale/missing
cache entries, both decide to send identical HealthEvents, and then both write
back — causing duplicates; fix by adding a threading.Lock (e.g.,
self._entity_cache_lock) and use it to protect the read-check-update sequence
around the cache: acquire the lock when building cache_key via _build_cache_key
and checking self.entity_cache to decide which results need sending, mark those
keys as "in-flight" (or store pending_cache_updates) while still holding the
lock, then release the lock before calling
send_health_event_with_retries(health_events); after a successful send,
re-acquire the lock and commit the pending_cache_updates into self.entity_cache
(and clear any in-flight markers) so only one thread sends for a given
cache_key; apply this around the loop that builds health_events and when writing
to self.entity_cache to prevent overlapping sends.

---

Duplicate comments:
In `@demos/fabric-manager-monitor/k8s/daemonset.yaml`:
- Around line 1-20: The current DaemonSet (fabric-manager-monitor) lacks the
required kata vs regular node split and is missing its Helm chart; update the
deployment by creating two DaemonSet manifests (or templated Helm resources)
that target kata and regular nodes using nodeAffinity with the kata.enabled
label (e.g., separate DaemonSets named fabric-manager-monitor-kata and
fabric-manager-monitor-regular or Helm templates in daemonset*.yaml), and add a
Helm chart under distros/kubernetes/nvsentinel/charts/fabric-manager-monitor/
including README/docs that explain the DaemonSet variant selection logic and
helm values (kata.enabled flag) per the project coding guideline.
- Around line 46-49: The livenessProbe configured under livenessProbe.httpGet
currently targets path "/" which 404s because
prometheus_client.start_http_server() only serves "/metrics"; change the probe
to use path "/metrics" (i.e., update livenessProbe.httpGet.path to "/metrics")
so the probe checks the actual metrics endpoint; while here, verify the probe
uses the same port name "metrics" and consider adding sensible timing settings
(initialDelaySeconds/failureThreshold) on the livenessProbe if not already
present to avoid premature restarts.
- Around line 22-30: The DaemonSet is using hostPID and likely a privileged
container for nsenter; to satisfy Checkov add an explicit container
securityContext that sets allowPrivilegeEscalation: false (and, if you only need
nsenter, prefer dropping full privileged mode and grant minimal capabilities
such as CAP_SYS_ADMIN and optionally CAP_SYS_PTRACE via
securityContext.capabilities.add). Update the container spec (referencing the
DaemonSet container definitions where serviceAccountName and hostPID are set) to
include securityContext.allowPrivilegeEscalation: false (and optionally set
capabilities to ["SYS_ADMIN","SYS_PTRACE"] and remove privileged: true if
feasible) so policy engines stop flagging CKV_K8S_16/17/20/23.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py`:
- Around line 43-47: The NVSWITCH_ERROR patterns list contains a redundant
"NVSwitch" entry because _parse_journal_errors lowercases both the journal text
and patterns before matching; remove the duplicate ("NVSwitch" or "nvswitch")
from ErrorCategory.NVSWITCH_ERROR's list so each pattern is unique (leave only
one of "nvswitch" or "NVSwitch"), keeping the remaining patterns intact to
preserve matching behavior in _parse_journal_errors.

In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/logger.py`:
- Around line 37-44: The review noted a duplicate comment about the _LEVEL_MAP
dictionary; no code change is needed because _LEVEL_MAP already includes the
"critical": logging.CRITICAL entry — simply mark the duplicate review as
resolved and leave the _LEVEL_MAP constant in logger.py (symbol: _LEVEL_MAP)
unchanged.

---

Nitpick comments:
In `@demos/fabric-manager-monitor/checks/cuda_validation.py`:
- Around line 87-88: The local "import json" found inside the function in
cuda_validation.py should be moved to the module-level imports to comply with
PEP 8: remove the inline "import json" and add "import json" alongside the other
top-of-file imports so that the code that calls
json.loads(result.stdout.strip()) uses the module-level json; ensure there are
no name conflicts and run tests to verify nothing else relied on the local
import.
- Around line 46-53: The GPU correctness check uses an assert that can be
disabled with -O; replace the assert in the per-device loop (the line checking
t.sum().isfinite() for tensor t on GPU index i) with an explicit conditional
that treats non-finite results as an error: if the sum is not finite, append a
descriptive error to results["errors"] (or raise an Exception) and set
results["passed"]=False so the existing except/handling logic records the
failure reliably even when Python is run with optimizations.
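The replacement pattern could look like the following sketch, with `math.isfinite` over plain floats standing in for the `t.sum().isfinite()` tensor check (torch is not assumed here; the function and variable names are illustrative):

```python
import math


def validate_sums(per_gpu_sums):
    """Record non-finite per-GPU results as errors instead of asserting.

    per_gpu_sums stands in for the t.sum() value computed on each GPU.
    An assert here would vanish under `python -O`, so a conditional is
    used to feed the existing error-handling path reliably.
    """
    results = {"passed": True, "errors": []}
    for i, s in enumerate(per_gpu_sums):
        if not math.isfinite(s):
            results["errors"].append(f"GPU {i}: non-finite tensor sum {s!r}")
            results["passed"] = False
    return results
```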

In `@demos/fabric-manager-monitor/checks/fabric_check.py`:
- Around line 45-49: The except block in fabric_check.py that currently reads
"except Exception as e" (returning NVLinkStatus with error about DCGM metrics)
is too broad; narrow it to catch only network/HTTP errors by changing it to
catch requests.exceptions.RequestException (or RequestException) and ensure you
import RequestException from requests at the top of the module; keep the same
return behavior (healthy=True and error message) but only for RequestException
so programming errors (e.g., AttributeError/TypeError) still surface.
- Around line 102-104: The _sum_metric function's parameter type is too loose
(Dict[str, list]) compared to its callers; change the type annotation of metrics
in _sum_metric to Dict[str, List[float]] to match _fetch_metrics and
_parse_prometheus_text, and add/ensure the List import from typing is present;
update any related type hints or imports so the function signature reads
metrics: Dict[str, List[float]] while keeping the float return.

In `@demos/fabric-manager-monitor/k8s/daemonset.yaml`:
- Around line 62-68: The hostPath volume and volumeMount named "host-dbus" that
binds /var/run/dbus is likely redundant when the container uses nsenter to enter
the host mount namespace (nsenter) because /var/run/dbus/system_bus_socket will
already be accessible; remove the volumeMount entry (mountPath: /var/run/dbus)
and the volumes entry for "host-dbus" to avoid an unnecessary bind-mount, or if
there are any code paths that call D-Bus directly (e.g., systemctl calls not
wrapped in nsenter), keep the "host-dbus" entries but add a clear comment
referencing nsenter, systemctl, and system_bus_socket explaining why the
explicit hostPath mount is required.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/fabric_check.py`:
- Line 87: Update the typing for metrics maps to use List[float] instead of
unparameterized list: change return/arg types in _fetch_metrics,
_parse_prometheus_text, and _sum_metric from Dict[str, list] to Dict[str,
List[float]] and add the List import from typing if missing; ensure all three
function signatures and any local variable annotations that currently use
Dict[str, list] are updated to Dict[str, List[float]] for consistency.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py`:
- Around line 92-95: DEFAULT_GPU_SERVICES is a mutable class-level list and
should be annotated as a ClassVar to satisfy RUF012; update the declaration of
DEFAULT_GPU_SERVICES to use typing.ClassVar (e.g., ClassVar[list[str]] or
ClassVar[List[str]]), and add the required import(s) for ClassVar (and List if
using List) at the top of the module so the class attribute is properly
annotated as a non-instance variable.
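The annotation change is small; a sketch (service names here are illustrative, not the actual DEFAULT_GPU_SERVICES contents):

```python
from typing import ClassVar, List


class ServiceChecker:
    # Annotated as ClassVar so type checkers and RUF012 know this mutable
    # list is shared class state, not a per-instance attribute.
    DEFAULT_GPU_SERVICES: ClassVar[List[str]] = [
        "nvidia-persistenced.service",
        "nvidia-dcgm.service",
    ]
```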
- Around line 185-186: In _get_restart_count, remove the bare "except Exception:
pass" that swallows errors; instead catch only the expected exceptions
(subprocess.TimeoutExpired and ValueError from int(val)) and handle them as
before, and for any other unexpected exceptions (e.g.,
subprocess.CalledProcessError or runtime errors) log the exception via the
module logger or re-raise it so failures (like a broken nsenter) are not
silently ignored; update the except blocks around the subprocess call and int
conversion in _get_restart_count to implement this narrower handling and
logging.
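One way the narrower handling could be shaped — a sketch only, with `run` injectable for testing and the nsenter wrapper omitted (the real command construction and helper names differ):

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def get_restart_count(service_name, run=subprocess.run):
    """Parse NRestarts from systemctl output with narrow exception handling.

    TimeoutExpired and parse failures (ValueError/IndexError) are expected
    and map to 0 as before; anything else is logged with a traceback so a
    broken nsenter is not silently ignored.
    """
    cmd = ["systemctl", "show", service_name, "--property=NRestarts"]
    try:
        result = run(cmd, capture_output=True, text=True, timeout=10)
        return int(result.stdout.strip().split("=", 1)[1])
    except (subprocess.TimeoutExpired, ValueError, IndexError):
        return 0
    except Exception:
        logger.exception("Unexpected failure reading restart count for %s", service_name)
        return 0
```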
- Around line 259-263: The except clause currently catches
(subprocess.TimeoutExpired, Exception) which is redundant because TimeoutExpired
is an Exception; change the handler to catch only the specific error types you
expect (e.g., subprocess.TimeoutExpired or OSError) rather than Exception,
update the except line accordingly, and move the "return found" out of the try
block into an else block so successful parsing returns found in the else branch
while the except branch logs via log.warning(f"Journal parsing failed for
{service_name}: {e}") and returns [].
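The try/except/else shape described above can be sketched as follows; `lines` and `parse` are illustrative stand-ins for the journal text and the pattern-matching helper, and the logger call is simplified:

```python
import subprocess


def parse_journal_errors(service_name, lines, parse):
    """Return matched errors; success returns in the else branch.

    Only the specific failures we expect (TimeoutExpired, OSError) are
    caught; the except branch logs and returns [], so unexpected bugs
    still surface instead of being swallowed.
    """
    try:
        found = parse(lines)
    except (subprocess.TimeoutExpired, OSError) as e:
        print(f"Journal parsing failed for {service_name}: {e}")  # log.warning in real code
        return []
    else:
        return found
```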
- Around line 88-95: Add a Helm chart for the new fabric-manager-monitor so the
ServiceChecker-based monitor (referencing ServiceChecker and
DEFAULT_GPU_SERVICES) can be deployed via Kubernetes; create a chart named
fabric-manager-monitor in the nvsentinel charts collection that includes
Chart.yaml, values.yaml (exposing image, replicas, and GPU service list driven
by DEFAULT_GPU_SERVICES), templates for
Deployment/Service/ServiceAccount/ConfigMap (mounting any config/env needed by
ServiceChecker), and update the nvsentinel umbrella/values to reference/install
this new chart so the monitor is included in cluster deployments.
- Line 107: The current construction of full_cmd uses list concatenation
("full_cmd = [...] + cmd"); change this to use list unpacking so the literal
nsenter list and the cmd iterable are merged via the star-unpacking operator
into a single list. Update the assignment for full_cmd (the variable built from
the nsenter prefix and cmd) to use unpacking and ensure cmd is an iterable of
strings before merging.

In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/logger.py`:
- Around line 58-59: The structlog processor function currently declares
parameters named logger and method_name which are required by the processor
protocol but unused and trigger Ruff ARG001; fix this by prefixing those
parameters with an underscore (e.g., change logger -> _logger and method_name ->
_method_name) in the processor function signature (and update any internal
references if they exist) so the names remain present for the protocol but are
treated as intentionally unused.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`:
- Line 22: The module currently imports the root logger via "import logging as
log", causing all calls to use the root logger; replace that with "import
logging" and add a module-scoped logger variable (logger =
logging.getLogger(__name__)), then update all uses of
log.debug/log.error/log.warning (etc.) in event_processor.py to use
logger.debug/logger.error/logger.warning so log records carry module-level
source and filtering. Ensure the new logger is configured consistently with
existing sibling modules (e.g., cuda_validation.py) and keep existing log
messages unchanged.
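The effect of switching to a module-scoped logger can be seen in a small self-contained sketch: records now carry the module's name, so handlers and filters can target this module specifically:

```python
import io
import logging

logger = logging.getLogger(__name__)  # instead of `import logging as log`

# Attach a handler that shows the record's logger name.
buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("%(name)s:%(levelname)s:%(message)s"))
logger.addHandler(handler)

logger.warning("health event send failed")
output = buf.getvalue()
```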
- Line 132: Replace the non-Pythonic truthiness check using len() by checking
the list directly: change the condition `if len(health_events):` to `if
health_events:` in the event processing code that references the variable
`health_events` (look for the block inside the EventProcessor/event processing
function in event_processor.py where `health_events` is evaluated) so the code
uses idiomatic Python truthiness.
- Around line 139-140: The except blocks in event_processor.py currently call
log.error(f"Exception while sending health events: {e}") which drops the
traceback; update those catch sites (the two except Exception handlers around
the send/publish logic in EventProcessor/event processing routine) to log the
full traceback by using log.exception("Exception while sending health events")
or log.error("Exception while sending health events", exc_info=True) instead so
the stacktrace is captured for debugging.
- Around line 149-161: The retry loop currently creates a new gRPC channel
inside the loop (grpc.insecure_channel) so each retry reconnects; move channel
creation outside the for loop and reuse the same channel and stub for all
retries: create chan = grpc.insecure_channel(f"unix://{self._socket_path}") and
stub = platformconnector_pb2_grpc.PlatformConnectorStub(chan) before the for _
in range(MAX_RETRIES) loop, then call stub.HealthEventOccurredV1(...) inside the
try block, keeping the existing backoff logic (INITIAL_DELAY, MAX_DELAY, sleep)
and metrics.events_sent_success.inc(), and ensure the channel is closed/cleaned
up after the loop completes or on success.
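The reshaped loop could look like the sketch below, with a stub object standing in for the gRPC channel/stub pair (the real code would call `grpc.insecure_channel` and `stub.HealthEventOccurredV1`; constants here are shortened for illustration):

```python
import time

MAX_RETRIES = 10
INITIAL_DELAY = 0.01  # the real monitor uses longer delays with 1.5x backoff
MAX_DELAY = 0.05


def send_with_retries(make_channel, sleep=time.sleep):
    """Create the channel once, reuse it across retries, close it at the end."""
    chan = make_channel()  # e.g. grpc.insecure_channel(f"unix://{socket_path}")
    attempts = 0
    try:
        delay = INITIAL_DELAY
        for _ in range(MAX_RETRIES):
            attempts += 1
            try:
                chan.send()  # e.g. stub.HealthEventOccurredV1(event)
                return True, attempts
            except ConnectionError:
                sleep(delay)
                delay = min(delay * 1.5, MAX_DELAY)
        return False, attempts
    finally:
        chan.close()  # channel cleaned up on success or exhaustion
```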
- Around line 87-89: The current timing context around
metrics.health_events_publish_time_to_grpc_channel.labels("health_check_completed_to_grpc_channel").time()
wraps the entire retry loop (including sleep(delay)), inflating latency; to fix,
move the .time() context so it only surrounds the actual gRPC call (the call to
the client/producer inside the retry loop) and remove sleeps from that scope,
and add a separate metric (e.g., health_events_retry_delay_seconds_total or
health_events_retries_count) updated inside the retry branch to record
cumulative backoff delay and retry count; reference the existing metric name
health_events_publish_time_to_grpc_channel and the retry/sleep logic in the
method to locate where to reposition the timer and where to increment the new
retry metric.
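A sketch of the repositioned timer plus separate retry metrics; the metric objects here are tiny stand-ins, not the actual Prometheus client types, and the new metric names are only suggestions from the comment above:

```python
import time


class Timings:
    """Stand-ins for the Prometheus metrics named in the review."""
    def __init__(self):
        self.grpc_seconds = 0.0          # health_events_publish_time_to_grpc_channel
        self.retry_delay_seconds = 0.0   # e.g. health_events_retry_delay_seconds_total
        self.retries = 0                 # e.g. health_events_retries_count


def publish(send, metrics, max_retries=3, delay=0.01, sleep=time.sleep):
    for _ in range(max_retries):
        start = time.perf_counter()
        try:
            send()  # only the actual gRPC call is timed
        except ConnectionError:
            metrics.grpc_seconds += time.perf_counter() - start
            metrics.retries += 1
            metrics.retry_delay_seconds += delay  # backoff tracked separately
            sleep(delay)
            continue
        metrics.grpc_seconds += time.perf_counter() - start
        return True
    return False
```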

Comment on lines +68 to +104
class CUDAValidator:
    """Validates CUDA context creation and memory on each GPU."""

    def check(self) -> CUDAValidationResult:
        """Run CUDA validation script as a subprocess."""
        try:
            result = subprocess.run(
                [sys.executable, "-c", _CUDA_TEST_SCRIPT],
                capture_output=True,
                text=True,
                timeout=120,  # generous timeout for multi-GPU test
            )

            if result.returncode != 0:
                return CUDAValidationResult(
                    passed=False,
                    error=f"CUDA test script failed: {result.stderr.strip()}",
                )

            import json
            data = json.loads(result.stdout.strip())
            return CUDAValidationResult(
                passed=data.get("passed", False),
                gpu_count=data.get("gpu_count", 0),
                errors=data.get("errors", []),
            )

        except subprocess.TimeoutExpired:
            return CUDAValidationResult(
                passed=False,
                error="CUDA validation timed out",
            )
        except Exception as e:
            return CUDAValidationResult(
                passed=False,
                error=str(e),
            )
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

rg -n "to_check_results" --type py -A3

Repository: NVIDIA/NVSentinel



🏁 Script executed:

# Also check the full file length and content beyond line 104
wc -l demos/fabric-manager-monitor/checks/cuda_validation.py



🏁 Script executed:

# List all files in the checks directory to see other checker implementations
ls -la demos/fabric-manager-monitor/checks/



Add missing to_check_results method to CUDAValidator.

The demos/fabric-manager-monitor/checks/cuda_validation.py file (104 lines) is missing the to_check_results method. This method exists in the production code at health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/cuda_validation.py:120 and is required to integrate CUDA validation results with the watcher pipeline (called at watcher.py:301). All other checkers (fabric, PCIe, clock) implement this method following the same pattern.

🧰 Tools
🪛 Ruff (0.15.1)

[error] 74-74: subprocess call: check for execution of untrusted input

(S603)


[warning] 100-100: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@demos/fabric-manager-monitor/checks/cuda_validation.py` around lines 68 -
104, Add a to_check_results(self) -> CheckResults (or the project’s common check
results type) method to class CUDAValidator that adapts a CUDAValidationResult
into the watcher pipeline format: call self.check() to get a
CUDAValidationResult, map its passed, error, gpu_count and errors fields into
the same structure other checkers (fabric, PCIe, clock) return, and return that
CheckResults object so watcher.py (called at watcher.py:301) can consume it;
refer to existing to_check_results implementations in the other checker classes
for the exact keys/shape and mirror their behavior in error/failed cases.

Comment on lines +58 to +64
        try:
            metrics = self._fetch_metrics()
        except Exception as e:
            return NVLinkStatus(
                healthy=True,  # can't determine -- assume healthy
                error=f"Failed to fetch DCGM metrics: {e}",
            )

⚠️ Potential issue | 🟡 Minor

DCGM fetch failure silently triggers NvlinkFabricDegraded when Fabric Manager is also down.

When _fetch_metrics raises, the returned NVLinkStatus has healthy=True (as intended) but bandwidth_zero=True (the dataclass default). In to_check_results (line 130):

fabric_nvlink_degraded = not status.healthy or (status.bandwidth_zero and fabric_manager_down)
#                       = False              or (True              and True              )
#                       = True   ← fires NvlinkFabricDegraded despite "assume healthy" intent

The status.error field is never consulted in to_check_results, so the false-positive path is silent. Explicitly set bandwidth_zero=False in the error branch:

🐛 Proposed fix
-        except Exception as e:
+        except requests.exceptions.RequestException as e:
             return NVLinkStatus(
                 healthy=True,  # can't determine -- assume healthy
+                bandwidth_zero=False,  # unknown; don't trigger false positive
                 error=f"Failed to fetch DCGM metrics: {e}",
             )
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 60-60: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/fabric_check.py`
around lines 58 - 64, When _fetch_metrics raises we currently return
NVLinkStatus(healthy=True) but leave bandwidth_zero at its dataclass default
(True), which causes to_check_results to treat the status as degraded; change
the exception branch in the code that calls _fetch_metrics to return
NVLinkStatus with bandwidth_zero explicitly set to False (e.g.,
NVLinkStatus(healthy=True, bandwidth_zero=False, error=...)) so to_check_results
will not raise NvlinkFabricDegraded when we “assume healthy”; update the return
in the try/except around _fetch_metrics and keep the error text in the error
field.

Comment on lines +47 to +49
def _parse_log_level(level: str) -> int:
    """Convert a string log level to a logging constant."""
    return _LEVEL_MAP.get(level.lower().strip(), logging.INFO)

⚠️ Potential issue | 🟡 Minor

Silent fallback in _parse_log_level can mask misconfiguration; docstring also incomplete.

An unrecognised level string (e.g. a typo like "debugg") silently produces logging.INFO, so a service configured to run quietly at "error" can end up emitting full INFO-level output in production with no diagnostic. (Uppercase values such as "WARN" are already handled by the .lower() call.)

Separately, the level arg docstring on line 76 lists "debug", "info", "warn", "error" but omits "critical" and "warning", both of which are valid keys in _LEVEL_MAP.

🛡️ Proposed fix
 def _parse_log_level(level: str) -> int:
     """Convert a string log level to a logging constant."""
-    return _LEVEL_MAP.get(level.lower().strip(), logging.INFO)
+    normalized = level.lower().strip()
+    if normalized not in _LEVEL_MAP:
+        print(
+            f"[logger] Unknown log level {level!r}; defaulting to INFO.",
+            file=sys.stderr,
+        )
+    return _LEVEL_MAP.get(normalized, logging.INFO)

And update the docstring:

-        level: The log level as a string (e.g., "debug", "info", "warn", "error").
+        level: The log level as a string (e.g., "debug", "info", "warn",
+               "warning", "error", "critical").
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/logger.py`
around lines 47 - 49, _parse_log_level currently falls back silently to
logging.INFO and its docstring omits valid keys; change _parse_log_level to
first normalize the input and check membership in _LEVEL_MAP, and if the key is
unknown call the module logger (or logging.warning) to emit a clear warning
including the invalid string and that logging.INFO will be used, then return the
default; also expand the _parse_log_level docstring to explicitly list accepted
keys (e.g. "debug", "info", "warning", "warn", "error", "critical") and note the
function is case-insensitive and strips whitespace so typos can be diagnosed via
the warning; reference symbols: _parse_log_level and _LEVEL_MAP.

Comment on lines +97 to +137
        for result in results:
            cache_key = self._build_cache_key(result.check_name, result.entities_impacted)
            cached = self.entity_cache.get(cache_key)

            # Only send if state changed (or first observation)
            if cached is None or cached.is_fatal != result.is_fatal or cached.is_healthy != result.is_healthy:
                entities = [
                    platformconnector_pb2.Entity(entityType=e["entityType"], entityValue=e["entityValue"])
                    for e in result.entities_impacted
                ]

                recommended_action = self._get_recommended_action(result)

                health_event = platformconnector_pb2.HealthEvent(
                    version=self._version,
                    agent=self._agent,
                    componentClass=self._component_class,
                    checkName=result.check_name,
                    isFatal=result.is_fatal,
                    isHealthy=result.is_healthy,
                    message=result.message,
                    recommendedAction=recommended_action,
                    errorCode=result.error_codes,
                    entitiesImpacted=entities,
                    metadata=result.metadata or {},
                    generatedTimestamp=timestamp,
                    nodeName=self._node_name,
                    processingStrategy=self._processing_strategy,
                )
                health_events.append(health_event)
                pending_cache_updates[cache_key] = CachedEntityState(
                    is_fatal=result.is_fatal, is_healthy=result.is_healthy
                )

        log.debug(f"fabric manager health events to send: {len(health_events)}")
        if len(health_events):
            try:
                if self.send_health_event_with_retries(health_events):
                    # Only update cache after successful send
                    for key, state in pending_cache_updates.items():
                        self.entity_cache[key] = state

⚠️ Potential issue | 🟠 Major

TOCTOU race on entity_cache when callbacks run concurrently.

health_check_completed reads self.entity_cache, builds events, then calls send_health_event_with_retries (which can block up to ~26 s across retries) before writing the cache. If a ThreadPoolExecutor fires overlapping callback invocations, two threads can both observe a stale/absent cache entry, both build identical events, and both send — producing duplicate HealthEvents on every check cycle until the first write lands.

A threading.Lock guarding the read-check-update sequence eliminates the race:

🔒 Proposed fix
+import threading
 ...
 class PlatformConnectorEventProcessor(CallbackInterface):
     def __init__(self, ...):
         ...
         self.entity_cache: dict[str, CachedEntityState] = {}
+        self._cache_lock = threading.Lock()

     def health_check_completed(self, results: List[CheckResult]) -> None:
         ...
-        for result in results:
-            cache_key = self._build_cache_key(result.check_name, result.entities_impacted)
-            cached = self.entity_cache.get(cache_key)
-            if cached is None or ...:
-                ...
-                pending_cache_updates[cache_key] = ...
+        with self._cache_lock:
+            for result in results:
+                cache_key = self._build_cache_key(result.check_name, result.entities_impacted)
+                cached = self.entity_cache.get(cache_key)
+                if cached is None or ...:
+                    ...
+                    pending_cache_updates[cache_key] = ...

         if health_events:
             try:
                 if self.send_health_event_with_retries(health_events):
-                    for key, state in pending_cache_updates.items():
-                        self.entity_cache[key] = state
+                    with self._cache_lock:
+                        for key, state in pending_cache_updates.items():
+                            self.entity_cache[key] = state
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`
around lines 97 - 137, The code has a TOCTOU race on self.entity_cache in
health_check_completed: concurrent callbacks can read the same stale/missing
cache entries, both decide to send identical HealthEvents, and then both write
back — causing duplicates; fix by adding a threading.Lock (e.g.,
self._entity_cache_lock) and use it to protect the read-check-update sequence
around the cache: acquire the lock when building cache_key via _build_cache_key
and checking self.entity_cache to decide which results need sending, mark those
keys as "in-flight" (or store pending_cache_updates) while still holding the
lock, then release the lock before calling
send_health_event_with_retries(health_events); after a successful send,
re-acquire the lock and commit the pending_cache_updates into self.entity_cache
(and clear any in-flight markers) so only one thread sends for a given
cache_key; apply this around the loop that builds health_events and when writing
to self.entity_cache to prevent overlapping sends.

@XRFXLP
Member

XRFXLP commented Feb 23, 2026

hey @dmvevents

Thanks for putting this together — the Fabric Manager monitoring gap is a real problem and great to see someone tackling it.

For larger features like this, we typically start with a design doc to align on architecture early and reduce back-and-forth during review. Would you mind putting one together?

A couple of things the design doc would help clarify:

  1. PCIe and NVLink checks: The existing gpu-health-monitor already polls DCGM directly via pydcgm and has health watches for DCGM_HEALTH_WATCH_PCIE and DCGM_HEALTH_WATCH_NVLINK. Could these be extended there instead of re-implementing them via DCGM exporter HTTP scraping?
  2. Monitor scope: It might make sense to keep fabric-manager-monitor focused on what DCGM genuinely can't see — FM service health, GPU service liveness, CUDA validation — and let gpu-health-monitor handle the hardware-level checks it already covers. This would also avoid duplicate events hitting platform-connector for the same failure modes.

Happy to discuss further once the doc is up!

Successfully merging this pull request may close these issues:

[Bug]: NVSentinel not detecting fabric health on H100s