feat: add fabric-manager-monitor for GPU infrastructure health checks#891
dmvevents wants to merge 3 commits into NVIDIA:main
Conversation
📝 Walkthrough

Adds a Fabric Manager Monitor demo and a production-grade fabric-manager-monitor health monitor implementing multiple GPU health checks, Prometheus metrics, Kubernetes manifests, container builds, gRPC platform-connector publishing, and unit tests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Watcher as FabricManagerWatcher
    participant ServiceChk as ServiceChecker
    participant PCIeChk as PCIeChecker
    participant ClockChk as ClockChecker
    participant NVLinkChk as NVLinkFabricChecker
    participant CUDAVal as CUDAValidator
    participant Processor as PlatformConnectorEventProcessor
    participant gRPC as gRPC_UDS
    Watcher->>Watcher: start poll loop
    par Parallel checks
        Watcher->>ServiceChk: check_fabric_manager()
        ServiceChk-->>Watcher: FabricManagerStatus
        Watcher->>PCIeChk: check()
        PCIeChk-->>Watcher: List[PCIeStatus]
        Watcher->>ClockChk: check()
        ClockChk-->>Watcher: List[ClockStatus]
        Watcher->>NVLinkChk: check()
        NVLinkChk-->>Watcher: NVLinkStatus
        opt CUDA enabled
            Watcher->>CUDAVal: check()
            CUDAVal-->>Watcher: CUDAValidationResult
        end
    end
    Watcher->>Watcher: aggregate CheckResult[]
    Watcher->>Processor: health_check_completed(CheckResult[])
    Processor->>Processor: build HealthEvent(s), dedupe via cache
    Processor->>gRPC: send with retries
    gRPC-->>Processor: ack / error
    Processor->>Processor: update cache & metrics
```
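The parallel fan-out in this flow can be sketched with a thread pool; this is an illustrative stand-in (the function and checker names below are hypothetical, not the PR's actual classes):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

def run_checks(checkers: Dict[str, Callable[[], object]]) -> List[Tuple[str, object]]:
    """Run all health checks in parallel and aggregate (name, result) pairs,
    mirroring the 'par' block in the sequence diagram."""
    results: List[Tuple[str, object]] = []
    with ThreadPoolExecutor(max_workers=max(len(checkers), 1)) as pool:
        futures = {name: pool.submit(fn) for name, fn in checkers.items()}
        for name, fut in futures.items():
            try:
                results.append((name, fut.result()))
            except Exception as exc:  # a crashed checker becomes an error result
                results.append((name, exc))
    return results
```

A failed checker is folded into the aggregate as its exception rather than aborting the whole poll cycle, which is the behavior a watcher loop needs.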
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 8
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🟡 Minor comments (13)
demos/fabric-manager-monitor/checks/__init__.py-1-1 (1)
1-1: ⚠️ Potential issue | 🟡 Minor
Missing Apache 2.0 license header.
Every other `__init__.py` in this PR includes the license header. This file should follow the same pattern.
🔧 Proposed fix
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """Health check modules for GPU Node Health Validator."""
```

Based on learnings: "All files must include Apache 2.0 license header".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/__init__.py` at line 1, add the missing Apache 2.0 license header at the top of the checks package initializer (the checks.__init__ module); copy the exact license block used in the other __init__.py files in this PR and prepend it as the first lines of the file so the header is consistent across the repository.

demos/fabric-manager-monitor/README.md-38-39 (1)
38-39: ⚠️ Potential issue | 🟡 Minor
`port-forward` to a DaemonSet selects an arbitrary pod.
On a multi-node cluster, `kubectl port-forward ds/fabric-manager-monitor` routes to one pod (not necessarily the node exhibiting the health issue). A more useful Quick Start instruction would forward to the specific pod on the target node:

```shell
# Forward to the pod running on a specific node
NODE=<node-name>
POD=$(kubectl get pod -n nvsentinel -o wide --field-selector spec.nodeName=${NODE} \
  -l app=fabric-manager-monitor -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n nvsentinel pod/${POD} 9101:9101
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/README.md` around lines 38 - 39, the current README uses `kubectl port-forward` against the DaemonSet which picks an arbitrary pod; update the Quick Start to forward to the pod on a specific node instead: instruct the user to set NODE=<node-name>, locate the pod on that node using `kubectl get pod` with `--field-selector spec.nodeName=${NODE}` and the label `app=fabric-manager-monitor` (assign result to POD), then run `kubectl port-forward -n nvsentinel pod/${POD} 9101:9101`; replace the two-line example that port-forwards the DaemonSet and the following curl line with this pod-targeted workflow and keep the curl check the same.

demos/fabric-manager-monitor/k8s/rbac.yaml-24-26 (1)
24-26: ⚠️ Potential issue | 🟡 Minor
Remove cluster-wide `pods` access; the monitor does not use the Kubernetes pod API.
The monitor code uses only nsenter-based systemd checks, HTTP calls to the DCGM exporter, and direct subprocess calls. No Kubernetes client library or pod API calls are present anywhere in the codebase. The `get`, `list` verbs on `pods` violate least privilege.
Either:
- Remove the `pods` resource entirely if not needed, or
- If pod access is genuinely required (e.g., to locate the DCGM exporter dynamically), replace `ClusterRole` with a namespace-scoped `Role` + `RoleBinding` for the `nvsentinel` namespace.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/k8s/rbac.yaml` around lines 24 - 26, the ClusterRole currently grants cluster-wide access to the "pods" resource with verbs ["get","list"] which is unnecessary; remove the pods entry from the resources list in k8s/rbac.yaml if the monitor never uses the Kubernetes pod API, or if pod access is actually required, convert the ClusterRole into a namespace-scoped Role and replace the ClusterRoleBinding with a RoleBinding targeted to the "nvsentinel" namespace so the "pods" verbs are limited to that namespace only; look for the resources: ["pods"] and any ClusterRole/ClusterRoleBinding declarations in rbac.yaml (and their bindings) to apply the change.

demos/fabric-manager-monitor/checks/fabric_check.py-72-79 (1)
72-79: ⚠️ Potential issue | 🟡 Minor
Return type annotation is incorrect; the actual return is `Dict[str, list]`.
`_fetch_metrics` is annotated as `-> Dict[str, float]`, but `_parse_prometheus_text` (line 83) builds and returns a `Dict[str, list]` (metric name → list of float values). The downstream `_sum_metric` correctly treats the values as lists. The annotation should match.
Proposed fix

```diff
-    def _fetch_metrics(self) -> Dict[str, float]:
+    def _fetch_metrics(self) -> Dict[str, list]:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/fabric_check.py` around lines 72 - 79, the return type annotation of _fetch_metrics is wrong: it currently declares Dict[str, float] but calls _parse_prometheus_text which returns a mapping of metric names to lists of floats. Update _fetch_metrics's signature to -> Dict[str, List[float]] (import List from typing if missing) so it matches the actual return type used by _sum_metric and other callers; verify any related type hints (e.g., _parse_prometheus_text, _sum_metric) are consistent.

demos/fabric-manager-monitor/checks/fabric_check.py-81-96 (1)
81-96: ⚠️ Potential issue | 🟡 Minor
Return type annotation on `_parse_prometheus_text` is also `Dict[str, float]` but it returns `Dict[str, list]`.
Same mismatch: the docstring says `{metric_name: [values]}` (a list), but the signature says `Dict[str, float]`.
Proposed fix

```diff
-    def _parse_prometheus_text(self, text: str) -> Dict[str, float]:
+    def _parse_prometheus_text(self, text: str) -> Dict[str, list]:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/fabric_check.py` around lines 81 - 96, the return type annotation of _parse_prometheus_text is incorrect: it declares Dict[str, float] but actually builds and returns a Dict[str, list] (metrics holds lists of values); update the function signature and any related type references to return Dict[str, List[float]] (and import typing.List if needed) so the annotation matches the docstring and the actual returned data structure.

demos/README.md-27-37 (1)
27-37: ⚠️ Potential issue | 🟡 Minor
Missing `**Time:**` field; inconsistent with existing demo entries.
The other two demos both include a `**Time:**` line (e.g., "5-10 minutes"). The new entry omits it, breaking the established README format. Add an estimated time or note that a GPU cluster is required for accurate timing.
📝 Suggested addition

```diff
 **Requirements:** Docker, kubectl, Kubernetes cluster with GPU nodes, Prometheus Operator
+
+**Time:** 10-20 minutes (requires GPU node access)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/README.md` around lines 27 - 37, the Fabric Manager Monitor demo entry is missing the "**Time:**" line which breaks consistency with other demo entries; update the `Fabric Manager Monitor` section in demos/README.md (the block under "### [Fabric Manager Monitor](fabric-manager-monitor/)") to include an estimated time (e.g., "**Time:** 5-10 minutes" or a note like "Time depends on GPU cluster availability") placed with the other metadata lines so the format matches the other demo entries.

health-monitors/fabric-manager-monitor/Dockerfile-18-18 (1)
18-18: ⚠️ Potential issue | 🟡 Minor
`ARG` defaults do not support shell command substitution.
Docker's `ARG` directive treats the default value as a literal string. `$(date -u +%FT%TZ)` will not be evaluated; the default becomes the literal string `$(date -u +%FT%TZ)`. Since the comment says CI passes this value, consider using a simple static fallback instead.
Proposed fix

```diff
-ARG BUILD_DATE=$(date -u +%FT%TZ)
+ARG BUILD_DATE="unknown"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/Dockerfile` at line 18, ARG BUILD_DATE currently uses a shell substitution literal '$(date -u +%FT%TZ)', which Docker will not evaluate; change the default to a static fallback (e.g., an empty string or a fixed timestamp like "1970-01-01T00:00:00Z" or "unknown") and rely on CI to pass the real value via --build-arg BUILD_DATE when building, updating the Dockerfile's ARG BUILD_DATE declaration accordingly.

demos/fabric-manager-monitor/k8s/servicemonitor.yaml-84-95 (1)
84-95: ⚠️ Potential issue | 🟡 Minor
PCIe link degraded alert uses `severity: warning`, but the check is classified as fatal.
The README and check categories table classify PCIe link degradation as fatal (`is_fatal=True`), which triggers `RESTART_BM` remediation in the event processor. A `warning` severity alert is inconsistent with that classification; consider `critical` to match.
Proposed fix

```diff
       labels:
-        severity: warning
+        severity: critical
         component: fabric-manager-monitor
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/k8s/servicemonitor.yaml` around lines 84 - 95, the GPUPCIeLinkDegraded alert currently sets labels.severity to "warning" but the PCIe link degradation check is classified as fatal (is_fatal=True) and should trigger RESTART_BM; update the alert definition for GPUPCIeLinkDegraded in servicemonitor.yaml to change labels.severity from "warning" to "critical" so the alert severity aligns with the check classification and remediation logic.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/fabric_check.py-96-111 (1)
96-111: ⚠️ Potential issue | 🟡 Minor
Prometheus text parser mishandles lines with optional timestamps.
`line.rsplit(" ", 1)` will parse `metric{labels} 123.0 1609459200` as `value_str = "1609459200"` (the timestamp), silently discarding the actual value. While the DCGM exporter typically omits timestamps, this is a latent correctness issue if the exporter configuration changes or a proxy adds them. Note that a naive whitespace split is not safe either: DCGM label values such as `modelName="NVIDIA A100-SXM4-40GB"` can contain spaces, so the value should be taken from the first token after the closing brace of the label set.
Proposed fix

```diff
         try:
-            name_and_labels, value_str = line.rsplit(" ", 1)
-            name = name_and_labels.split("{")[0]
-            value = float(value_str)
+            # Format: metric_name[{labels}] value [timestamp].
+            # Label values (e.g. modelName) may contain spaces, so take
+            # the first token after the label set as the value.
+            if "{" in line:
+                name = line.split("{", 1)[0]
+                rest = line.rsplit("}", 1)[1].split()
+            else:
+                tokens = line.split()
+                name, rest = tokens[0], tokens[1:]
+            value = float(rest[0])
             metrics.setdefault(name, []).append(value)
         except (ValueError, IndexError):
             continue
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/fabric_check.py` around lines 96 - 111, the _parse_prometheus_text function incorrectly uses rsplit(" ", 1) which treats an optional timestamp as the value; change parsing to split the line on whitespace (e.g., parts = line.split()) and treat parts[0] as the metric name (split off labels via .split("{")[0]) and parts[1] as the value (float(parts[1])); keep the existing try/except around float conversion and IndexError to ignore malformed lines and continue populating the metrics dict in the same way as before.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/clock_check.py-69-79 (1)
69-79: ⚠️ Potential issue | 🟡 Minor
Exact-match benign check may not handle compound throttle reason strings.
`nvidia-smi --query-gpu=clocks_throttle_reasons.active` can return compound strings when multiple reasons are active (e.g., pipe-separated reasons). An exact match against `_BENIGN_REASONS` will fail for a string like `"GPU Idle | Applications Clocks Setting"`, causing it to fall through to `elif reason_str:` → `throttled = True`.
This errs on the safe side (false positive, not false negative), but it undermines the stated GPU Idle false-positive mitigation. Consider splitting the reason string and checking whether all active reasons are benign:
Proposed fix

```diff
         # GPU Idle causes low clock ratio but isn't a real throttle.
         # Only flag as throttled for non-benign reasons.
-        if reason_str in self._BENIGN_REASONS:
-            status.throttled = False
-        elif reason_str:
-            status.throttled = True
+        if reason_str:
+            # Split on common delimiters (pipe, comma) and strip whitespace
+            individual_reasons = [r.strip() for r in reason_str.replace("|", ",").split(",") if r.strip()]
+            non_benign = [r for r in individual_reasons if r not in self._BENIGN_REASONS]
+            status.throttled = len(non_benign) > 0
+        else:
+            # No reason data available; keep ratio-based throttled flag
+            pass
```
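A standalone, testable version of that splitting logic (the benign set here is illustrative; the real checker defines its own `_BENIGN_REASONS`):

```python
from typing import Set

# Illustrative benign reasons; the real checker defines its own list.
BENIGN_REASONS: Set[str] = {"GPU Idle", "Applications Clocks Setting"}

def is_throttled(reason_str: str, benign: Set[str] = BENIGN_REASONS) -> bool:
    """Flag a throttle only when at least one active reason is non-benign.

    nvidia-smi can report compound reasons such as
    "GPU Idle | Applications Clocks Setting".
    """
    reasons = [r.strip() for r in reason_str.replace("|", ",").split(",") if r.strip()]
    return any(r not in benign for r in reasons)
```

An empty reason string yields `False` (no active reasons), matching the review's requirement that blank output not trip the throttle flag.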
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/clock_check.py` around lines 69 - 79, the current exact-match against _BENIGN_REASONS fails for compound reason strings (e.g., "GPU Idle | Applications Clocks Setting"); in the loop over clocks where you set status.throttle_reasons = reason_str, split reason_str into individual reasons (split on pipe '|' and other common separators, trim whitespace) and treat the status as not throttled only if every non-empty reason token is in _BENIGN_REASONS; otherwise, if any token is non-empty and not benign, set status.throttled = True; ensure empty/blank reason_str still results in throttled = False and preserve assignment to status.throttle_reasons.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py-103-107 (1)
103-107: ⚠️ Potential issue | 🟡 Minor
`log.exception()` outside an `except` block produces a misleading traceback.
`done_callback` is invoked by the thread pool executor, not inside an `except` block. `log.exception()` at line 106 will append `NoneType: None` as the traceback, which is confusing. Use `log.error()` instead:
Proposed fix

```diff
     def done_callback(class_name: str, future):
         e = future.exception()
         if e is not None:
-            log.exception(f"Callback failed: {e}")
+            log.error("Callback %s failed: %s", class_name, e, exc_info=e)
             metrics.callback_failures.labels(class_name, "health_check_completed").inc()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py` around lines 103 - 107, done_callback currently calls log.exception(...) outside an except block which produces a misleading traceback; change the call in done_callback to use log.error(...) instead and pass the exception information properly (e.g. include the exception message and set exc_info so the traceback is logged) while leaving the metrics.callback_failures.labels(class_name, "health_check_completed").inc() behavior unchanged.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py-146-153 (1)
146-153: ⚠️ Potential issue | 🟡 Minor
`int()` conversion of `MainPID` can raise `ValueError`.
If `MainPID` contains an unexpected non-numeric value (e.g., an empty string from a parsing edge case), `int(props.get("MainPID", "0"))` will raise `ValueError`. This is caught by the outer `except Exception` block, which returns `active=False`, a misleading result when only the PID field is malformed while the service is actually active.
Proposed fix

```diff
         return ServiceStatus(
             name=service_name,
             active=(active_state == "active"),
             sub_state=props.get("SubState", ""),
-            main_pid=int(props.get("MainPID", "0")),
+            main_pid=int(props.get("MainPID", "0") or "0"),
             n_restarts=n_restarts,
             start_timestamp=props.get("ExecMainStartTimestamp", ""),
         )
```

This handles the empty-string case. For full robustness, wrap in a try/except:

```python
try:
    main_pid = int(props.get("MainPID", "0") or "0")
except ValueError:
    main_pid = 0
```
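That guard can be factored into a small helper (the name `parse_main_pid` is hypothetical) so the property lookup degrades to 0 instead of tripping the outer exception handler:

```python
from typing import Mapping

def parse_main_pid(props: Mapping[str, str]) -> int:
    """Parse systemd's MainPID property, tolerating empty or junk values."""
    try:
        return int(props.get("MainPID", "0") or "0")
    except ValueError:
        return 0
```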
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py` around lines 146 - 153, the int() conversion of props.get("MainPID", "0") in the ServiceStatus return can raise ValueError and incorrectly mark a service as inactive; update the code in the function that builds ServiceStatus (where service_name, active_state and props are used) to safely parse MainPID by evaluating props.get("MainPID", "0") or "0" and wrapping int(...) in a try/except ValueError (fallback main_pid = 0), then pass that main_pid into ServiceStatus instead of calling int(...) inline.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py-184-203 (1)
184-203: ⚠️ Potential issue | 🟡 Minor
Flap detection misses restarts if the systemd `NRestarts` counter resets.
If the counter resets to 0 (e.g., after `daemon-reload` or a unit file change), the comparison `current_restarts > last_count` will be false until the counter surpasses the previously stored value, silently missing restart events. Consider also detecting counter decreases as an indication of a reset.
Proposed fix

```diff
         last_count = self._last_restart_count[service_name]
-        if current_restarts > last_count:
+        if current_restarts < last_count:
+            # Counter reset detected (daemon-reload, unit file change, etc.)
+            log.info("Restart counter reset for %s: %d -> %d", service_name, last_count, current_restarts)
+            self._last_restart_count[service_name] = current_restarts
+        elif current_restarts > last_count:
             # New restarts detected -- record timestamp for each
             now = time.monotonic()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py` around lines 184 - 203, _update_flap_tracking currently only records new restarts when current_restarts > last_count and thus misses events if the systemd NRestarts counter resets (current_restarts < last_count). Modify _update_flap_tracking to detect a decrease in current_restarts vs _last_restart_count (counter reset) and in that case treat the current_restarts value as new restart events: record a timestamp for each of the current_restarts (append now current_restarts times to self._restart_history[service_name]) and then update _last_restart_count[service_name] = current_restarts; keep the existing logic for the normal increase case and the pruning logic unchanged (use time.monotonic() and self._flap_window and deque operations as already used).
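One way to make the reset handling concrete is a small tracker that records a timestamp per restart and treats a counter decrease as a reset; this is a sketch under the review's assumptions (the real code uses `self._restart_history`, `self._flap_window`, and `time.monotonic()` rather than this hypothetical class):

```python
import time
from collections import deque
from typing import Deque, Dict, Optional

class RestartTracker:
    """Track restart timestamps per service, tolerating NRestarts resets."""

    def __init__(self, flap_window: float = 600.0) -> None:
        self._flap_window = flap_window
        self._last_count: Dict[str, int] = {}
        self._history: Dict[str, Deque[float]] = {}

    def update(self, service: str, current: int, now: Optional[float] = None) -> int:
        """Record new restarts and return how many fall inside the window."""
        now = time.monotonic() if now is None else now
        hist = self._history.setdefault(service, deque())
        last = self._last_count.get(service, 0)
        # A decrease means the systemd counter was reset (daemon-reload,
        # unit rewrite); treat the new value as that many fresh events.
        new_events = current if current < last else current - last
        for _ in range(new_events):
            hist.append(now)
        self._last_count[service] = current
        # Prune events older than the flap window
        while hist and now - hist[0] > self._flap_window:
            hist.popleft()
        return len(hist)
```

The `now` parameter exists only so the windowing logic is testable without sleeping.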
```
prometheus_client>=0.20.0
requests>=2.31.0
pyyaml>=6.0
```
🧩 Analysis chain
🏁 Scripts executed (repository: NVIDIA/NVSentinel):

```shell
find demos/fabric-manager-monitor -type f -name "Dockerfile" -o -name "README*" | head -20
cat demos/fabric-manager-monitor/Dockerfile 2>/dev/null || echo "No Dockerfile found in root of demos/fabric-manager-monitor"
find demos/fabric-manager-monitor -type f -name "*.md" | xargs grep -l "CUDA\|torch\|pytorch" 2>/dev/null
grep -r "subprocess.*torch\|torch.*test" demos/fabric-manager-monitor --include="*.py" --include="*.md" 2>/dev/null
grep -r "import torch" demos/fabric-manager-monitor --include="*.py" 2>/dev/null
cat demos/fabric-manager-monitor/README.md
cat demos/fabric-manager-monitor/checks/cuda_validation.py
```
Add PyTorch to requirements.txt.
The CUDA validation check (check 5 in README) requires PyTorch to execute. While the code gracefully handles missing PyTorch by returning a failed validation result, PyTorch should be explicitly listed in requirements.txt so the check can function as intended. Currently, if the Dockerfile installs only from this file, the CUDA validation will always report failure with "PyTorch not available" rather than performing actual CUDA context and memory tests.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@demos/fabric-manager-monitor/requirements.txt` around lines 1 - 3, Add
PyTorch to requirements.txt so the README's CUDA validation check (check 5) can
actually run; update the file (the listed lines with prometheus_client,
requests, pyyaml) to include the PyTorch package (e.g., add a "torch>=2.0.0"
entry) and ensure Dockerfile/pip install uses that requirements.txt so the
validation will import torch instead of failing with "PyTorch not available"
during the CUDA validation check.
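The graceful-degradation pattern the review describes looks roughly like this (the function name and result shape are illustrative, not the PR's exact API); with `torch` absent from requirements.txt, every call takes the first return path:

```python
from typing import Any, Dict

def run_cuda_validation() -> Dict[str, Any]:
    """Minimal CUDA sanity check that degrades gracefully when PyTorch
    is not installed instead of crashing the monitor."""
    try:
        import torch  # optional heavy dependency
    except ImportError:
        return {"available": False, "error": "PyTorch not available"}
    if not torch.cuda.is_available():
        return {"available": False, "error": "CUDA not available"}
    # Exercise context creation and a tiny device allocation
    x = torch.ones(8, device="cuda")
    return {"available": True, "checksum": float(x.sum())}
```

Listing `torch` in requirements.txt changes the outcome from the guaranteed "PyTorch not available" failure to an actual CUDA context test on GPU nodes.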
```python
    default="EXECUTE_REMEDIATION",
    help="Event processing strategy: EXECUTE_REMEDIATION or STORE_ONLY",
)
@click.option("--verbose", type=bool, default=False, help="Enable debug logging")
```
🧩 Analysis chain
🌐 Web query:
Click python library type=bool vs is_flag=True documentation
💡 Result:
In Click, type=bool and is_flag=True mean different things:
- `type=bool` (or `type=click.BOOL`): the option expects a value (e.g. `--debug true`, `--debug false`) and Click will parse common truthy/falsey strings (`"1"`/`"0"`, `"true"`/`"false"`, `"yes"`/`"no"`, etc.). Click documents this as the `bool`/`click.BOOL` parameter type. [1]
- `is_flag=True`: the option becomes a boolean flag that takes no value; it's False when absent and True when present (e.g. `--verbose`). Click documents this under "Boolean flags". [2]
Related: defining a pair like --shout/--no-shout makes Click treat it as a boolean flag automatically (implicit is_flag=True). [2]
Sources:
[1] Click “Parameter Types” (bool / click.BOOL) documentation. (click.palletsprojects.com)
[2] Click “Options” documentation (Boolean flags, --flag/--no-flag, is_flag=True). (pocoo-click.readthedocs.io)
Citations:
- 1: https://click.palletsprojects.com/en/stable/parameter-types/?utm_source=openai
- 2: https://pocoo-click.readthedocs.io/en/latest/options/?utm_source=openai
🏁 Scripts executed (repository: NVIDIA/NVSentinel):

```shell
fd cli.py | grep -E "fabric-manager-monitor"
cat -n health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py | head -150
# Check type hints on the cli function parameters
cat -n health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py | sed -n '77,94p'
```
--verbose with type=bool requires an explicit argument value; use is_flag=True instead.
With type=bool, Click expects --verbose true or --verbose false. Running --verbose alone (the natural usage) will fail because Click expects a value. Use is_flag=True to make it a boolean flag.
Proposed fix

```diff
-@click.option("--verbose", type=bool, default=False, help="Enable debug logging")
+@click.option("--verbose", is_flag=True, default=False, help="Enable debug logging")
```

Additionally, the `cli()` function parameters lack type hints, which violates the coding guideline requiring type hints for all functions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py` at line
76, The Click option for verbose currently uses type=bool which requires an
explicit value; change the `@click.option` call for "--verbose" to use
is_flag=True (and keep default=False) so the flag can be passed without a value,
and update any help text if needed. Also add missing type hints to the cli()
function signature (e.g., annotate parameters like verbose: bool and any others
present) to comply with the coding guideline; ensure the function body and any
callers accept the annotated types.
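A minimal reproduction of the corrected option (a standalone sketch, not the PR's full CLI):

```python
import click

@click.command()
@click.option("--verbose", is_flag=True, default=False, help="Enable debug logging")
def cli(verbose: bool) -> None:
    # With is_flag=True, "--verbose" alone sets the value to True and
    # omitting it yields False; type=bool would instead demand an
    # explicit value such as "--verbose true".
    click.echo(f"verbose={verbose}")
```

Invoked as `prog --verbose` the flag is True; invoked with no arguments it is False, which is the natural CLI behavior the review asks for.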
```toml
protobuf = ">=6.31.1,<7.0.0"
googleapis-common-protos = ">=1.56.0"
structlog = "^25.1.0"
requests = "^2.31.0"
```
requests = "^2.31.0" allows versions vulnerable to CVE-2024-35195 and CVE-2024-47081.
CVE-2024-35195 (Medium) affects requests prior to 2.32.2 — if the first request in a Session uses verify=False, all subsequent requests to the same host will continue ignoring cert verification. CVE-2024-47081 is also present in earlier releases, affecting netrc credential handling with a maliciously crafted URL. The minimum version 2.31.0 is below both fix points. The lower bound should be raised to ^2.32.3 (the current stable release).
🔒 Proposed fix

```diff
-requests = "^2.31.0"
+requests = "^2.32.3"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@health-monitors/fabric-manager-monitor/pyproject.toml` at line 17, Update the
requests dependency in pyproject.toml to require a fixed minimum that includes
CVE fixes: replace the current spec "requests = \"^2.31.0\"" with a constraint
that forces at least 2.32.3 (e.g., "^2.32.3") so the project pulls versions that
include fixes for CVE-2024-35195 and CVE-2024-47081; ensure you then run your
dependency install/lock command to regenerate the lockfile (poetry lock / poetry
update) so the new safe version is recorded.
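Poetry's caret constraint `^2.32.3` means `>=2.32.3,<3.0.0`; the effect of raising the floor can be sanity-checked with the `packaging` library (assumed available, as it ships with most Python toolchains):

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Poetry caret semantics: ^2.32.3 expands to >=2.32.3,<3.0.0
safe = SpecifierSet(">=2.32.3,<3.0.0")
# The old floor ^2.31.0 admits releases predating the CVE fixes
old = SpecifierSet(">=2.31.0,<3.0.0")

for vulnerable in ("2.31.0", "2.32.0"):
    assert Version(vulnerable) in old       # old spec allows it
    assert Version(vulnerable) not in safe  # raised floor excludes it
assert Version("2.32.3") in safe
```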
```toml
[tool.poetry.scripts]
fabric_manager_monitor = "fabric_manager_monitor.cli:cli"
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Check if a Helm chart exists for fabric-manager-monitor
echo "=== Searching for fabric-manager-monitor Helm chart ==="
fd "Chart.yaml" distros/kubernetes/nvsentinel/charts/ 2>/dev/null | grep -i fabric || echo "No fabric-manager chart found"
echo ""
echo "=== All Helm charts in distros/kubernetes/nvsentinel/charts/ ==="
fd "Chart.yaml" distros/kubernetes/nvsentinel/charts/ 2>/dev/null
echo ""
echo "=== Health monitors in health-monitors/ ==="
ls -d health-monitors/*/ 2>/dev/null || echo "Directory not found"
echo ""
echo "=== Checking fabric-manager-monitor pyproject.toml ==="
if [ -f "health-monitors/fabric-manager-monitor/pyproject.toml" ]; then
  grep -A 2 "\[tool.poetry.scripts\]" health-monitors/fabric-manager-monitor/pyproject.toml
else
  echo "File not found"
fi
```
Repository: NVIDIA/NVSentinel
Add Helm chart for fabric-manager-monitor native NVSentinel integration.
The health monitor under health-monitors/fabric-manager-monitor/ exposes a console-scripts entry point (fabric_manager_monitor) but lacks a corresponding Helm chart under distros/kubernetes/nvsentinel/charts/. All other health monitors in the repository include one: csp-health-monitor, gpu-health-monitor, kubernetes-object-monitor, and syslog-health-monitor.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@health-monitors/fabric-manager-monitor/pyproject.toml` around lines 28 - 29,
The project defines a console script entry point "fabric_manager_monitor"
(specified as fabric_manager_monitor.cli:cli) but there is no corresponding Helm
chart for native NVSentinel integration; add a Helm chart named to match the
monitor (e.g., fabric-manager-monitor) mirroring the structure of the other
monitors' charts: include Chart.yaml, values.yaml, templates (Deployment,
Service, ServiceAccount, RBAC, ConfigMap as needed) and any required hooks, and
update the charts index/packaging so the new chart is published alongside the
existing monitor charts for NVSentinel.
…NVIDIA#883)

Add a new health monitor that detects Fabric Manager failures, PCIe link degradation, NVLink fabric issues, GPU clock throttling, and CUDA context failures — failure modes invisible to DCGM-based monitoring.

Two implementations included:
1. **demos/fabric-manager-monitor/** — Standalone DaemonSet exposing Prometheus metrics and alerts. Validated on 2x P4d.24xlarge (A100-SXM4-40GB) with Amazon Linux 2023, EKS 1.32.
2. **health-monitors/fabric-manager-monitor/** — Native NVSentinel integration using the same patterns as gpu-health-monitor: click CLI, structlog JSON logging, gRPC HealthEvent protobuf to platform-connector via Unix domain socket, state caching, exponential backoff retries.

Check categories:
- Fabric Manager service status via nsenter + systemctl
- Critical GPU services (persistenced, DCGM) via nsenter + systemctl
- PCIe link downtraining (Gen5→Gen3, x16→x8) via nsenter + nvidia-smi
- NVLink fabric health via DCGM exporter metrics (with FM correlation)
- CUDA context validation via subprocess
- GPU clock throttling with idle-state false-positive filtering

Bug fixes discovered during live testing:
- nsenter requires root (non-root USER has empty CapEff)
- systemd NRestarts property missing on Amazon Linux 2023 (systemd 252)
- GPU Idle (0x0000000000000001) false positive on clock throttle detection

Closes NVIDIA#883
Related: NVIDIA#889, NVIDIA#890

Signed-off-by: Anton Alexander <dmvevents@gmail.com>
Force-pushed from d54c7ca to c4c7789
Actionable comments posted: 15
🧹 Nitpick comments (35)
health-monitors/fabric-manager-monitor/Dockerfile (2)
53-58: No `WORKDIR` set in the runtime stage

Without an explicit `WORKDIR`, the `COPY --from=build` instructions deposit `fabric_manager_monitor*.whl` and `constraints.txt` into `/` (the filesystem root). While functionally correct (pip resolves `./fabric_manager_monitor*.whl` relative to the working directory, which defaults to `/`), it is unconventional and leaves artefacts in the root.

♻️ Proposed fix

```diff
 FROM public.ecr.aws/docker/library/python:3.10-slim-bookworm AS runtime
+WORKDIR /app
+
 # util-linux provides nsenter
 ...
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/Dockerfile` around lines 53 - 58, The runtime stage lacks a WORKDIR so the COPY --from=build deposits files at /; add an explicit WORKDIR (e.g., WORKDIR /app or /opt/fabric_manager_monitor) in the runtime stage before the COPY and RUN lines so that COPY --from=build /app/dist/*.whl ./ and COPY --from=build /app/constraints.txt ./ place artifacts into that directory and the subsequent pip install ./fabric_manager_monitor*.whl --constraint constraints.txt runs from that working directory; ensure the chosen directory is used consistently for any later file references and leaves the filesystem root clean.
24-25: Upgrade Poetry to 2.3.2 and install the required export plugin

Poetry 1.8.2 is over a year outdated; the current stable version is 2.3.2. Upgrading requires addressing two breaking changes in Poetry 2.x:

- `poetry export` (line 39) was moved to a separate plugin and will fail with "command does not exist" unless `poetry-plugin-export` is installed.
- The `[tool.poetry]` format used in all `pyproject.toml` files is backward-compatible with Poetry 2.x, but verify your `pyproject.toml` does not have any 1.x-specific config that Poetry 2.x no longer supports.

♻️ Proposed update for Poetry 2.x

```diff
 RUN --mount=type=cache,target=/root/.cache/pip \
-    pip install poetry==1.8.2
+    pip install poetry==2.3.2 poetry-plugin-export
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/Dockerfile` around lines 24 - 25, update the Dockerfile to install Poetry 2.3.2 instead of 1.8.2 and also install the export plugin so `poetry export` works: change the pip install target from poetry==1.8.2 to poetry==2.3.2 and add installation of the export plugin (poetry-plugin-export) in the same RUN step; then verify the repository's pyproject.toml files are compatible with Poetry 2.x (no deprecated 1.x-only settings) and adjust any 1.x-specific config before relying on `poetry export`.

demos/fabric-manager-monitor/k8s/daemonset.yaml (1)
46-51: Consider adding a `readinessProbe` alongside the liveness probe.

Without a readiness probe, the Service starts routing traffic (and Prometheus starts scraping) before the monitor completes its initial health check pass and populates metrics. A simple HTTP readiness probe on `/metrics` (or a dedicated `/ready` endpoint) would prevent scraping stale or empty metric series during startup.

✨ Proposed addition

```diff
         livenessProbe:
           httpGet:
             path: /
             port: metrics
           initialDelaySeconds: 10
           periodSeconds: 30
+        readinessProbe:
+          httpGet:
+            path: /metrics
+            port: metrics
+          initialDelaySeconds: 15
+          periodSeconds: 30
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/k8s/daemonset.yaml` around lines 46 - 51, add a readinessProbe next to the existing livenessProbe in the DaemonSet spec so the Service/Prometheus don't scrape before the monitor is ready; specifically, under the same container that defines livenessProbe (look for livenessProbe and httpGet path: / port: metrics) add a readinessProbe that uses httpGet to hit /metrics (or /ready if you have a dedicated endpoint) with conservative initialDelaySeconds and periodSeconds values similar to the livenessProbe so the pod only becomes ready after metrics are populated.

demos/fabric-manager-monitor/checks/cuda_validation.py (1)
91-92: Move `import json` to the top-level import block.

Per PEP 8, all imports should be at the top of the file.

♻️ Proposed fix

```diff
+import json
 import logging
 import subprocess
 import sys
 import textwrap
```

```diff
-    import json
     data = json.loads(result.stdout.strip())
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/cuda_validation.py` around lines 91 - 92, move the local import json inside the CUDA validation check to the module-level imports: remove the inline "import json" and add "import json" at the top import block of the file so json.loads(result.stdout.strip()) in the CUDA validation code (where result and data are used) uses the top-level import per PEP8.

demos/fabric-manager-monitor/tests/test_monitor.py (2)
101-101: Rename unused mock arguments with a `_` prefix to satisfy Ruff ARG002.

`MockPCIe`, `MockClk`, and `MockNVL` are needed as parameters (the `@patch` decorators inject them) but are never referenced in the test body. Prefix with `_` to make the intentional non-use explicit.

♻️ Proposed fix (both affected test methods)

```diff
-    def test_check_cycle_all_healthy(self, MockSvc, MockPCIe, MockClk, MockNVL):
+    def test_check_cycle_all_healthy(self, MockSvc, _MockPCIe, _MockClk, _MockNVL):
```

```diff
-    def test_check_cycle_fabric_manager_down(self, MockSvc, MockPCIe, MockClk, MockNVL):
+    def test_check_cycle_fabric_manager_down(self, MockSvc, _MockPCIe, _MockClk, _MockNVL):
```

Also applies to: 126-126
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/tests/test_monitor.py` at line 101, Rename the unused patched parameters in the test functions (e.g., test_check_cycle_all_healthy and the other affected test method) by prefixing them with an underscore so Ruff ARG002 is satisfied; specifically change MockPCIe, MockClk, and MockNVL to _MockPCIe, _MockClk, and _MockNVL (or _pcie, _clk, _nvl) in the test function signatures so the `@patch-injected` mocks remain but are clearly marked as intentionally unused.
81-95: `_make_monitor` is missing type annotations (coding guideline).

♻️ Proposed fix

```diff
+from typing import Any
 ...
-    def _make_monitor(self, **overrides):
+    def _make_monitor(self, **overrides: Any) -> FabricManagerMonitor:
```

As per coding guidelines: "Type hints required for all functions in Python code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/tests/test_monitor.py` around lines 81 - 95, the helper function _make_monitor lacks type annotations; add annotations for the parameter and return type (e.g., def _make_monitor(self, **overrides: Any) -> FabricManagerMonitor) and type the locals where appropriate (declare defaults: Dict[str, Any] or annotate the config variable as MonitorConfig) and ensure necessary typing imports (from typing import Any, Dict) are added so _make_monitor, MonitorConfig usage and the returned FabricManagerMonitor are fully type hinted.

demos/fabric-manager-monitor/checks/service_check.py (3)
142-153: Silent `except Exception: pass` in `_get_restart_count` swallows unexpected errors.

If `result.stdout` is malformed and `int(val)` raises `ValueError`, the failure is invisible. This makes it impossible to distinguish "NRestarts not supported" from a parse error. Log the exception at `DEBUG` level.

♻️ Proposed fix

```diff
         except Exception:
-            pass
+            logger.debug("Failed to parse NRestarts for %s", service_name, exc_info=True)
         return 0
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/service_check.py` around lines 142 - 153, The try/except in _get_restart_count silently swallows all errors; modify the except block in the _get_restart_count method to catch Exception as e and log the exception at DEBUG (e.g., using self._logger.debug or the existing logger) including contextual info (service_name and result.stdout if available), then return 0 as before; ensure you still only parse when result.returncode == 0 and result.stdout is non-empty and do not change the return behavior.
76-84: Prefer list unpacking over concatenation (Ruff RUF005).

♻️ Proposed fix

```diff
-        full_cmd = ["nsenter", "-t", "1", "-m", "--"] + cmd
+        full_cmd = ["nsenter", "-t", "1", "-m", "--", *cmd]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/service_check.py` around lines 76 - 84, Replace the list concatenation in _run_host_cmd with list unpacking to satisfy Ruff RUF005: instead of building full_cmd via ["nsenter", "-t", "1", "-m", "--"] + cmd, construct the argument list using unpacking (e.g., ["nsenter", "-t", "1", "-m", "--", *cmd]) or pass the literal prefix plus *cmd directly into subprocess.run so the command list is created with unpacking; update the symbol _run_host_cmd accordingly to use the unpacked list when calling subprocess.run.
218-222: `except (subprocess.TimeoutExpired, Exception)` — `TimeoutExpired` is redundant.

`subprocess.TimeoutExpired` is a subclass of `Exception`, so the first element of the tuple is always shadowed by the second. Simplify to `except Exception`.

♻️ Proposed fix

```diff
-        except (subprocess.TimeoutExpired, Exception) as e:
+        except Exception as e:
             logger.warning("Journal parsing failed for %s: %s", service_name, e)
             return []
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/service_check.py` around lines 218 - 222, the except clause in the journal parsing block uses a redundant exception tuple; change "except (subprocess.TimeoutExpired, Exception) as e" to a single "except Exception as e" in the function/method containing the journal parse (the block that logs "Journal parsing failed for %s: %s" and returns []), keeping the logger.warning("Journal parsing failed for %s: %s", service_name, e) and the return [] intact.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/logger.py (1)
56-63: Structlog processor arguments `logger` and `method_name` are required by the protocol but unused — prefix with `_`.

Ruff ARG001 is correct here. The structlog processor protocol mandates the three-argument signature, but the arguments aren't used inside `inject_module_version`. Using a `_` prefix suppresses the linter warning and clearly signals intentional non-use.

♻️ Proposed fix

```diff
 def inject_module_version(
-    logger: logging.Logger | None,
-    method_name: str,
+    _logger: logging.Logger | None,
+    _method_name: str,
     event_dict: dict[str, Any],
 ) -> dict[str, Any]:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/logger.py` around lines 56 - 63, the structlog processor function inject_module_version currently declares parameters logger and method_name but doesn't use them, triggering Ruff ARG001; rename those parameters to _logger and _method_name in the inject_module_version signature to mark them as intentionally unused while leaving event_dict unchanged and still setting event_dict["module"] = module and event_dict["version"] = version before returning it.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/types.py (1)
34-40: Consider default factories for required list fields.

`error_codes` and `entities_impacted` must be supplied explicitly (even as `[]`) by every caller. Giving them `field(default_factory=list)` would reduce boilerplate at call sites while keeping the type contract intact.

♻️ Proposed refactor

```diff
+import dataclasses
 from typing import Dict, List, Optional

 @dataclasses.dataclass
 class CheckResult:
     ...
-    error_codes: List[str]
+    error_codes: List[str] = dataclasses.field(default_factory=list)
     message: str
-    entities_impacted: List[Dict[str, str]]
+    entities_impacted: List[Dict[str, str]] = dataclasses.field(default_factory=list)
     metadata: Optional[Dict[str, str]] = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/types.py` around lines 34 - 40, the fields error_codes and entities_impacted are currently required to be passed by every caller; change them to use dataclasses.field(default_factory=list) so callers can omit them and get empty lists by default, and ensure dataclasses.field is imported where the dataclass (the one that declares error_codes and entities_impacted) is defined; keep the type hints as List[str] and List[Dict[str,str]] respectively to preserve the contract.

demos/fabric-manager-monitor/README.md (1)
30-34: Helm chart for production integration is not mentioned.

The demo uses raw `k8s/` manifests. Per established repo conventions, the production counterpart under `health-monitors/fabric-manager-monitor/` should also include a Helm chart under `distros/kubernetes/nvsentinel/charts/`. Consider adding a note (or a TODO) here pointing readers toward the Helm chart for the integrated version.

Based on learnings: "Health monitors should include Helm chart in `distros/kubernetes/nvsentinel/charts/`."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/README.md` around lines 30 - 34, the README's "Deploy (assumes nvsentinel namespace exists)" section currently lists raw manifests (rbac.yaml, configmap.yaml, daemonset.yaml, servicemonitor.yaml); add a brief note or TODO immediately after that block pointing readers to the production Helm chart for the integrated version (distros/kubernetes/nvsentinel/charts/) and, if present, the specific chart for this monitor (e.g., fabric-manager-monitor) so users know where to find the Helm-based deployment alternative for production integration.

demos/fabric-manager-monitor/tests/test_service_check.py (4)
156-159: Unused loop variable `name` — rename to `_name`.

Per Ruff B007, the loop variable `name` on line 158 is unused. Rename to `_name` or use `_` to indicate it's intentionally ignored.

```diff
-        for name, status in results.items():
+        for _name, status in results.items():
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/tests/test_service_check.py` around lines 156 - 159, The loop in the test `test_service_check.py` iterates over `results.items()` but the `name` variable is never used; rename the unused loop variable `name` to `_name` (or `_`) in the `for name, status in results.items()` line so it conforms to Ruff B007 and signals the variable is intentionally ignored while leaving `status` unchanged.
8-12: `sys.path` manipulation is fragile — prefer a proper package setup.

Inserting parent directories into `sys.path` at lines 11-12 is brittle and can cause import shadowing. Since the coding guidelines require Poetry for dependency management, consider adding a `[tool.pytest.ini_options]` `pythonpath` setting in `pyproject.toml` or installing the package in editable mode (`poetry install`) so imports resolve naturally.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/tests/test_service_check.py` around lines 8 - 12, The test currently manipulates sys.path via sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..")) which is brittle; remove that sys.path manipulation from tests and instead ensure imports resolve via proper packaging: either install the project in editable mode with Poetry (run poetry install) or add a pytest pythonpath configuration under [tool.pytest.ini_options] in pyproject.toml so tests can import the package normally (remove the sys.path.insert call and rely on the package name used in imports).
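The `pyproject.toml` route suggested above might look like the following fragment (the section name is standard pytest configuration; the path values are assumptions about the demo's layout):

```toml
[tool.pytest.ini_options]
# Lets tests import the `checks` package from the project root
# without any sys.path manipulation inside test modules.
pythonpath = ["."]
testpaths = ["tests"]
```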
104-116: `time.sleep(1.1)` makes this test slow and potentially flaky.

Real sleeps in unit tests add latency and can cause intermittent failures under CI load. Consider mocking `time.monotonic` to simulate the passage of time deterministically.

♻️ Proposed refactor

```diff
+    @patch("checks.service_check.time")
-    def test_flap_detection_window_expiry(self):
+    def test_flap_detection_window_expiry(self, mock_time):
+        current_time = 1000.0
+        mock_time.monotonic.return_value = current_time
         checker = ServiceChecker(flap_window=1, flap_threshold=3)
         checker._restart_history["test"] = deque()
         checker._last_restart_count["test"] = 0

         # Add restarts
         checker._update_flap_tracking("test", 3)
         assert checker.is_flapping("test")

-        # Wait for window to expire
-        time.sleep(1.1)
+        # Simulate window expiry
+        mock_time.monotonic.return_value = current_time + 1.1
         checker._update_flap_tracking("test", 3)  # same count, triggers prune
         assert not checker.is_flapping("test")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/tests/test_service_check.py` around lines 104 - 116, The test test_flap_detection_window_expiry uses time.sleep and should instead simulate time progression by mocking time.monotonic so it runs fast and deterministically; update the test to patch time.monotonic (via unittest.mock.patch or the test framework's monkeypatch) to return an initial timestamp for the first _update_flap_tracking("test", 3) call and then a timestamp > flap_window later before the second _update_flap_tracking call so ServiceChecker._update_flap_tracking and is_flapping behave the same without real sleep; ensure you reference ServiceChecker, _update_flap_tracking, and is_flapping when adding the patch/unpatch around the two update calls.
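The monotonic-clock mocking pattern suggested above can be sketched standalone; the `FlapTracker` class here is a hypothetical stand-in for the checker's flap tracking, not the repository's code:

```python
import time
from collections import deque
from unittest.mock import patch

class FlapTracker:
    """Minimal flap detector: threshold events inside a sliding window = flapping."""

    def __init__(self, window: float, threshold: int) -> None:
        self.window = window
        self.threshold = threshold
        self.events: deque = deque()

    def record(self, count: int) -> None:
        now = time.monotonic()
        for _ in range(count):
            self.events.append(now)
        # Prune events that fell out of the sliding window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()

    def is_flapping(self) -> bool:
        return len(self.events) >= self.threshold

# Deterministic test: advance the mocked clock instead of sleeping.
with patch("time.monotonic") as mock_mono:
    tracker = FlapTracker(window=1.0, threshold=3)
    mock_mono.return_value = 1000.0
    tracker.record(3)
    assert tracker.is_flapping()
    mock_mono.return_value = 1001.1   # simulate 1.1 s elapsed
    tracker.record(0)                 # no new events, triggers prune
    assert not tracker.is_flapping()
```

No real time passes, so the test is fast and immune to CI load.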
118-141: Unused `timeout` parameter in the side_effect function.

The `timeout` keyword argument on line 120 is never used. This was flagged by Ruff (ARG001). Prefix it with an underscore to signal intent.

```diff
-        def side_effect(cmd, timeout=10):
+        def side_effect(cmd, timeout=10):  # noqa: ARG001
```

Or simply:

```diff
-        def side_effect(cmd, timeout=10):
+        def side_effect(cmd, **_kwargs):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/tests/test_service_check.py` around lines 118 - 141, in test_check_fabric_manager_with_journal_errors the side_effect function declares a timeout parameter that is never used (ARG001); update the side_effect signature to accept _timeout (or prefix timeout with an underscore) so the linter knows it is intentionally unused, leaving the logic inside side_effect and the mock_run.side_effect assignment unchanged; this touches the side_effect used in the test_check_fabric_manager_with_journal_errors test for ServiceChecker.check_fabric_manager.

demos/fabric-manager-monitor/checks/clock_check.py (2)
36-41: Use `frozenset` for the class-level `_BENIGN_REASONS` constant.

Ruff RUF012 flags this as a mutable class attribute. Since this set is never mutated, declare it as a `frozenset` to express immutability.

```diff
-    _BENIGN_REASONS = {
+    _BENIGN_REASONS: frozenset = frozenset({
         "Not Active",
         "0x0000000000000000",  # No throttle
         "0x0000000000000001",  # GPU Idle — normal when no workload running
-    }
+    })
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/clock_check.py` around lines 36 - 41, Replace the mutable class-level set _BENIGN_REASONS with an immutable frozenset to satisfy the linter (RUF012); locate the _BENIGN_REASONS symbol in clock_check.py and change its declaration so it uses frozenset(...) (keeping the same strings) to express immutability and avoid mutable class attribute warnings.
155-157: Silent exception swallowing in `_query_throttle_reasons` hides diagnostic context.

The bare `except Exception: return []` discards all error information. If nvidia-smi is intermittently failing, this makes it invisible. Add a log statement so operators can diagnose throttle-reason query failures.

♻️ Proposed fix

```diff
         except Exception:
+            logger.warning("Failed to query throttle reasons", exc_info=True)
             return []
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/clock_check.py` around lines 155 - 157, in _query_throttle_reasons, avoid swallowing exceptions silently; inside the except Exception block in function _query_throttle_reasons, log the caught exception (including exception message and stacktrace) via the module's logger (or processLogger) before returning [] so intermittent nvidia-smi failures are visible to operators—i.e., replace the bare except with logging the error/context and then return [].

health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py (1)
95-95: `exit` shadows the Python builtin.

Using `exit` as a variable name shadows the builtin `exit()` function. Consider renaming to `exit_event` or `shutdown_event` for clarity and to avoid accidental misuse.

```diff
-    exit = Event()
+    exit_event = Event()
```

(Update all references on lines 132, 156 accordingly.)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py` at line 95, the variable named exit (Event()) shadows the Python builtin exit(); rename it (e.g., exit_event or shutdown_event) and update all references to the new name (including where it's waited on or set in the CLI logic and handlers referenced in the same module such as the code blocks that currently use exit.wait() or exit.set()); ensure imports remain the same and run tests to confirm no remaining references to the old name.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/cuda_validation.py (1)
34-70: Inline CUDA test script uses `assert`, which is stripped under `python -O`.

If the monitor container is ever invoked with optimization flags (`python -O`), the `assert` on line 62 of the embedded script would be silently removed, making the verification a no-op. Replace with an explicit `if`/`raise` to guarantee the check always runs.

♻️ Proposed fix (inside the _CUDA_TEST_SCRIPT string)

```diff
     # Allocate, write, read back, verify
     t = torch.randn(1024, device="cuda")
-    assert t.sum().isfinite(), f"GPU {i}: non-finite sum"
+    if not t.sum().isfinite():
+        raise RuntimeError(f"GPU {i}: non-finite sum")
     del t
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/cuda_validation.py` around lines 34 - 70, the embedded CUDA test (_CUDA_TEST_SCRIPT) currently uses an assert for the CUDA tensor check (assert t.sum().isfinite(), f"GPU {i}: non-finite sum") which is removed under python -O; update the script to replace that assert with an explicit runtime check such as: compute s = t.sum() and if not s.isfinite(): raise RuntimeError(f"GPU {i}: non-finite sum") so the verification always runs regardless of optimization flags and the exception is caught and reported into results["errors"] as the surrounding code expects.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/pcie_check.py (1)
47-81: Consider `log.exception` in exception handlers to capture tracebacks.

In the `except` blocks (lines 73–81), `log.error` discards the traceback. Using `log.exception` (or `log.error(..., exc_info=True)`) would preserve diagnostic context, especially for the generic `Exception` handler on line 79.

♻️ Proposed fix

```diff
         except subprocess.TimeoutExpired:
-            log.error("nvidia-smi PCIe query timed out")
+            log.exception("nvidia-smi PCIe query timed out")
             return []
         except FileNotFoundError:
-            log.error("nvidia-smi not found")
+            log.exception("nvidia-smi not found")
             return []
         except Exception as e:
-            log.error(f"PCIe check failed: {e}")
+            log.exception(f"PCIe check failed: {e}")
             return []
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/pcie_check.py` around lines 47 - 81, in the check method of pcie_check.py (function check in the PCIe checker), replace the log.error calls inside the exception handlers for subprocess.TimeoutExpired, FileNotFoundError, and the generic Exception handler so they record the traceback (use log.exception(...) or log.error(..., exc_info=True)) while keeping the existing messages ("nvidia-smi PCIe query timed out", "nvidia-smi not found", "PCIe check failed: ..."); this ensures the check() method logs full diagnostic context for TimeoutExpired, FileNotFoundError, and the Exception handler.

demos/fabric-manager-monitor/checks/fabric_check.py (1)
41-49: Silently assuming healthy on DCGM fetch failure may mask real outages.

When `_fetch_metrics()` throws (e.g., DCGM exporter is down), `check()` returns `healthy=True` with an error string. This means a prolonged DCGM exporter outage will never surface an unhealthy signal. Consider logging at `warning` level here so operators have visibility, and potentially exposing the fetch-failure state through a dedicated metric or check result so the monitoring gap doesn't go unnoticed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/fabric_check.py` around lines 41 - 49, the current check() swallows _fetch_metrics() exceptions and returns NVLinkStatus(healthy=True) which hides outages; change this to log the exception at warning (use the instance logger, e.g., self.logger.warning(...), and include the exception string) and return NVLinkStatus(healthy=False, error=f"Failed to fetch DCGM metrics: {e}"); additionally, if a telemetry/metrics client exists on the class, increment or emit a distinct fetch-failure metric (e.g., "nvlink_fetch_failures") so monitoring can alert on DCGM exporter outages (references: check(), _fetch_metrics(), NVLinkStatus).

demos/fabric-manager-monitor/checks/pcie_check.py (1)
29-91: Extract PCIe checking logic into a shared library to reduce duplication.

The `PCIeStatus` dataclass and core checking methods (`check()` and `_parse_output()`) are duplicated between the demo and production paths. The demo version lacks the production-specific `to_check_results()` integration method, but the core logic is identical. Consider creating a shared utility module that both implementations can import, allowing the production version to add its integration layer on top.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/pcie_check.py` around lines 29 - 91, the PCIe checking logic (PCIeStatus dataclass and PCIeChecker.check()/_parse_output()) is duplicated between demo and production; extract these into a shared utility module (e.g., pcie/utils) that exports PCIeStatus and PCIeChecker (or a parsing function) so both demos/fabric-manager-monitor and production code import the common implementation; remove the duplicated definitions from demos/fabric-manager-monitor/checks/pcie_check.py and update the production-specific implementation to keep its integration method (to_check_results()) by importing the shared PCIeStatus and either subclassing or wrapping PCIeChecker to add to_check_results().

health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py (2)
128-135: Exception handler loses traceback — use `log.exception` instead of `log.error`.

When a check raises an unexpected exception, the current `log.error` with an f-string loses the full traceback, making debugging much harder. `log.exception` automatically includes the traceback.

Proposed fix

```diff
         except Exception as e:
-            log.error(f"Check '{name}' failed with exception: {e}")
+            log.exception(f"Check '{name}' failed with exception")
             metrics.check_errors.labels(name).inc()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py` around lines 128 - 135, The exception handler inside the loop over self._checkers is using log.error and loses the traceback; change the except block in watcher.py where it catches Exception as e (around the loop that calls check_func() and extends results) to call log.exception with a descriptive message (e.g., "Check '{name}' failed") and still increment metrics.check_errors.labels(name).inc(), so the full traceback is captured while preserving the existing metrics and control flow.
66-66: Unbounded `ThreadPoolExecutor` may accumulate blocked threads.

`ThreadPoolExecutor()` without `max_workers` defaults to `min(32, os.cpu_count() + 4)` in Python 3.8+, but each callback thread can block for minutes due to the exponential backoff in `send_health_event_with_retries`. Over multiple check cycles with a failing platform-connector, threads accumulate until the pool is exhausted, stalling all callbacks.

Consider setting an explicit `max_workers` (e.g., `max_workers=len(callbacks)` or a small fixed number).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py` at line 66, The ThreadPoolExecutor created in watcher.py as self._callback_thread_pool = ThreadPoolExecutor() can grow and exhaust threads because send_health_event_with_retries blocks for minutes; set an explicit max_workers to limit concurrency (for example max_workers=len(self._callbacks) or a small fixed number) when constructing the ThreadPoolExecutor to prevent accumulation, and ensure the pool is initialized using that value where self._callback_thread_pool is created; reference the ThreadPoolExecutor instantiation and the send_health_event_with_retries callback behavior when making the change.health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/clock_check.py (2)
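A bounded pool per the suggestion might look like this minimal sketch; the `Watcher` class, its `_callbacks` attribute, and the `dispatch` method are illustrative stand-ins, not the repository's actual watcher:

```python
from concurrent.futures import ThreadPoolExecutor

class Watcher:
    def __init__(self, callbacks):
        self._callbacks = callbacks
        # Bound the pool so long-blocking retry callbacks cannot pile up
        # without limit; one worker per registered callback is a simple cap.
        self._callback_thread_pool = ThreadPoolExecutor(
            max_workers=max(1, len(callbacks))
        )

    def dispatch(self, results):
        # Each callback may block in retry/backoff; with a bounded pool,
        # excess submissions queue instead of spawning more threads.
        return [self._callback_thread_pool.submit(cb, results) for cb in self._callbacks]
```

Submissions beyond the cap wait in the executor's internal queue, so a stalled platform-connector degrades to backpressure rather than unbounded thread growth.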
150-186: `_query_throttle_reasons` returns untyped `List[dict]`.

Per coding guidelines, type hints are required and dataclasses should be used for structured data. This method returns `List[dict]` with implicit keys `"gpu_index"` (int) and `"reasons"` (str). Consider using a `TypedDict` or a small dataclass to make the contract explicit and catch key mismatches at development time.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/clock_check.py` around lines 150 - 186, The _query_throttle_reasons function currently returns List[dict]; define an explicit shape (prefer a TypedDict named ThrottleReason with fields gpu_index: int and reasons: str, or alternatively a small dataclass) and update the function signature to return List[ThrottleReason]; then construct and append instances of that TypedDict/dataclass instead of raw dicts, add the necessary import from typing (TypedDict, List) or dataclasses, and keep the same parsing/exception behavior.
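The suggested `TypedDict` could be sketched as follows; the name `ThrottleReason` and the `parse_throttle_lines` helper are hypothetical, only the `gpu_index`/`reasons` keys come from the review above:

```python
from typing import List, TypedDict

class ThrottleReason(TypedDict):
    gpu_index: int
    reasons: str

def parse_throttle_lines(lines: List[str]) -> List[ThrottleReason]:
    """Build typed entries instead of raw dicts with implicit keys."""
    entries: List[ThrottleReason] = []
    for line in lines:
        # Assumed input shape: "<index>: <reason text>"
        idx, _, reasons = line.partition(":")
        entries.append(ThrottleReason(gpu_index=int(idx), reasons=reasons.strip()))
    return entries
```

At runtime a `TypedDict` is still a plain dict, so callers need no changes, but type checkers now flag a misspelled key like `"gpu_idx"`.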
56-60: Consider `frozenset` for the class-level constant.

`_BENIGN_REASONS` is a mutable `set` defined as a class attribute. Since it's intended as a constant lookup table, `frozenset` communicates immutability and avoids accidental mutation.

```diff
-    _BENIGN_REASONS = {
+    _BENIGN_REASONS = frozenset({
         "Not Active",
         "0x0000000000000000",  # No throttle
         "0x0000000000000001",  # GPU Idle -- normal when no workload running
-    }
+    })
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/clock_check.py` around lines 56 - 60, _BENIGN_REASONS is defined as a mutable set but used as a constant lookup table; change its definition to use frozenset to make it immutable and signal intent. Locate the class/module-level symbol _BENIGN_REASONS in clock_check.py and replace the set literal with a frozenset literal (keeping the same string members) so any accidental mutations are prevented and the constant is clearly immutable.demos/fabric-manager-monitor/monitor.py (2)
91-93: Missing return type annotation on `_handle_signal`.

As per coding guidelines, type hints are required for all functions.

Proposed fix

```diff
-    def _handle_signal(self, signum, frame):
+    def _handle_signal(self, signum: int, frame) -> None:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/monitor.py` around lines 91 - 93, The method _handle_signal lacks type annotations; add parameter and return type hints: annotate signum as int and frame as Optional[FrameType] and the return as None (e.g. def _handle_signal(self, signum: int, frame: Optional[FrameType]) -> None:). Also add the necessary imports (from types import FrameType and from typing import Optional) at the top of the module so the annotations resolve.
114-117: Dead code: restart counter block is a no-op.

This entire block does nothing — it checks `fm_status.n_restarts > 0`, then executes `pass`. The imported `fabric_manager_restarts_total` metric is never updated anywhere in this file. Either wire up the metric or remove this dead code to avoid confusion.

Proposed fix — remove dead block

```diff
-        # Update restart counter (set to current total — Counter only goes up)
-        if fm_status.n_restarts > 0:
-            # We use _total suffix via Counter; increment by delta
-            pass  # Counter tracks via flap detection instead
-
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/monitor.py` around lines 114 - 117, The restart-counter block in monitor.py is dead: it checks fm_status.n_restarts > 0 then does nothing, and the imported metric fabric_manager_restarts_total is never updated; remove the no-op if-block (the three lines checking fm_status.n_restarts and the pass) and either delete the unused import/registration of fabric_manager_restarts_total or, if you prefer to track restarts, replace the block with code that sets/increments fabric_manager_restarts_total (using fm_status.n_restarts or delta) so the metric is actually updated; reference symbols: fm_status.n_restarts and fabric_manager_restarts_total.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py (1)
146-152: gRPC channel is created and torn down on every retry attempt.

The `with grpc.insecure_channel(...)` block inside the retry loop means a new UDS connection + HTTP/2 handshake per attempt. Move channel creation outside the loop so retries reuse the same connection, reducing latency and resource churn.

Proposed fix

```diff
         delay = INITIAL_DELAY
-        for _ in range(MAX_RETRIES):
-            with grpc.insecure_channel(f"unix://{self._socket_path}") as chan:
-                stub = platformconnector_pb2_grpc.PlatformConnectorStub(chan)
-                try:
-                    stub.HealthEventOccurredV1(platformconnector_pb2.HealthEvents(events=health_events, version=1))
-                    metrics.events_sent_success.inc()
-                    return True
-                except grpc.RpcError as e:
-                    log.error(f"Failed to send health event to UDS: {e}")
-                    sleep(delay)
-                    delay *= 1.5
-                    continue
+        with grpc.insecure_channel(f"unix://{self._socket_path}") as chan:
+            stub = platformconnector_pb2_grpc.PlatformConnectorStub(chan)
+            for _ in range(MAX_RETRIES):
+                try:
+                    stub.HealthEventOccurredV1(platformconnector_pb2.HealthEvents(events=health_events, version=1))
+                    metrics.events_sent_success.inc()
+                    return True
+                except grpc.RpcError as e:
+                    log.error(f"Failed to send health event to UDS: {e}")
+                    sleep(delay)
+                    delay = min(delay * 1.5, MAX_DELAY)
+                    continue
```

🤖 Prompt for AI Agents
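For reference, the capped backoff that a follow-up commit in this PR introduces (MAX_RETRIES=5, MAX_DELAY=15s) can be sketched in isolation as below. The `send_once` callback, the `INITIAL_DELAY` value, and the injectable `sleep` parameter are illustrative assumptions, not the module's real API:

```python
import time
from typing import Callable

MAX_RETRIES = 5        # retry cap quoted from the PR's follow-up commit
INITIAL_DELAY = 1.0    # assumed starting delay in seconds
MAX_DELAY = 15.0       # delay cap quoted from the PR's follow-up commit


def send_with_backoff(
    send_once: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry send_once with 1.5x exponential backoff, capped at MAX_DELAY."""
    delay = INITIAL_DELAY
    for attempt in range(MAX_RETRIES):
        if send_once():
            return True
        if attempt < MAX_RETRIES - 1:  # no sleep after the final attempt
            sleep(delay)
            delay = min(delay * 1.5, MAX_DELAY)
    return False
```

With these constants the worst-case blocking time is bounded to a few seconds, instead of the roughly nine minutes of uncapped growth the original 10-retry loop allowed.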
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py` around lines 146 - 152, The retry loop currently creates a new gRPC channel each iteration (using grpc.insecure_channel(f"unix://{self._socket_path}")), causing repeated UDS/TLS handshakes; move channel creation outside the for _ in range(MAX_RETRIES) loop so a single channel is reused for all retries: open the channel once (using grpc.insecure_channel with self._socket_path), construct PlatformConnectorStub(chan) once, then keep the try/except and retry logic that calls stub.HealthEventOccurredV1(platformconnector_pb2.HealthEvents(...)) inside the loop; on success increment metrics.events_sent_success and return True as before, and ensure the channel is properly closed (or used in a single with block) after the retry loop completes/fails.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py (5)
100-108: Prefer list unpacking over concatenation (RUF005); document trust assumption for `service_name`

Ruff RUF005 flags the `+` concatenation; use spread syntax instead. More importantly, `service_name` is caller-supplied (including via `check_all_gpu_services(..., service_names=[...])`) and flows directly into `nsenter`/`systemctl` arguments. While `shell=False` prevents shell injection, malformed or adversarial service names can still produce unexpected `systemctl` behavior (e.g., names containing `../` or whitespace). Document the implicit trust assumption or add a basic allowlist/regex guard in callers that accept external input.

♻️ Proposed fix (RUF005)

```diff
-        full_cmd = ["nsenter", "-t", "1", "-m", "--"] + cmd
+        full_cmd = ["nsenter", "-t", "1", "-m", "--", *cmd]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py` around lines 100 - 108, The _run_host_cmd function currently concatenates lists with ["nsenter", "-t", "1", "-m", "--"] + cmd and callers pass service names into commands (e.g., check_all_gpu_services -> service_name), which can lead to unexpected systemctl behavior; change the call site to build the final argv using unpacking (["nsenter", "-t", "1", "-m", "--", *cmd] or equivalent) instead of +, and either add a simple allowlist/regex validation for service_name in the callers that accept external input (e.g., check_all_gpu_services) or document the implicit trust assumption for service_name in the function/method docstrings so callers know they must sanitize inputs before passing them to _run_host_cmd.
87-90: Mutable class attribute `DEFAULT_GPU_SERVICES` should be annotated as `ClassVar`

Without `ClassVar`, type checkers treat this as an instance field, and it can be inadvertently mutated at the class level (affecting all instances). Ruff RUF012 flags this correctly.

♻️ Proposed fix

```diff
+from typing import ClassVar, Dict, List, Optional

 class ServiceChecker:
     """Checks host systemd services via nsenter."""

     # Additional GPU-related services to check alongside Fabric Manager
-    DEFAULT_GPU_SERVICES = [
+    DEFAULT_GPU_SERVICES: ClassVar[List[str]] = [
         "nvidia-persistenced",
         "nv-hostengine",
     ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py` around lines 87 - 90, DEFAULT_GPU_SERVICES is a mutable class attribute and should be annotated as a ClassVar so type checkers don't treat it as an instance field; import ClassVar from typing and change the declaration of DEFAULT_GPU_SERVICES to include the ClassVar annotation (e.g. DEFAULT_GPU_SERVICES: ClassVar[list[str]] = [...]) so the linter (RUF012) is satisfied and instances won't be considered to own/mutate this list; update the declaration in the module where DEFAULT_GPU_SERVICES is defined and ensure the typing import is added if missing.
210-227: Fragile manual field unpacking in `check_fabric_manager` — use `dataclasses.fields` for resilience

If `ServiceStatus` gains new fields in the future, `check_fabric_manager` will silently omit them without a compile-time error.

♻️ Proposed fix

```diff
+import dataclasses

     def check_fabric_manager(self) -> FabricManagerStatus:
         """Check Fabric Manager with journal error analysis."""
         base = self.check_service("nvidia-fabricmanager")
         journal_errors = self._parse_journal_errors("nvidia-fabricmanager")
         flapping = self.is_flapping("nvidia-fabricmanager")
-        return FabricManagerStatus(
-            name=base.name,
-            active=base.active,
-            sub_state=base.sub_state,
-            main_pid=base.main_pid,
-            n_restarts=base.n_restarts,
-            start_timestamp=base.start_timestamp,
-            error=base.error,
-            journal_errors=journal_errors,
-            flapping=flapping,
-        )
+        base_fields = {f.name: getattr(base, f.name) for f in dataclasses.fields(base)}
+        return FabricManagerStatus(**base_fields, journal_errors=journal_errors, flapping=flapping)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py` around lines 210 - 227, check_fabric_manager manually unpacks fields from the ServiceStatus returned by check_service, which will silently omit new fields; instead build the FabricManagerStatus from a dynamic copy of the base dataclass (use dataclasses.fields or dataclasses.asdict on the ServiceStatus instance returned by check_service inside check_fabric_manager) then merge/override the additional keys journal_errors (from _parse_journal_errors) and flapping (from is_flapping) and construct FabricManagerStatus(**merged_dict); this ensures any new ServiceStatus fields are preserved without manual updates.
254-258: Redundant exception type in `except` clause; `TimeoutExpired` is already a subclass of `Exception`

`subprocess.TimeoutExpired` is a subclass of `SubprocessError` → `Exception`, so the explicit listing is redundant and the clause is functionally `except Exception`. Consider narrowing to only expected exceptions, or at minimum remove the redundant type. Ruff BLE001 is valid here.

♻️ Proposed fix

```diff
-        except (subprocess.TimeoutExpired, Exception) as e:
-            log.warning(f"Journal parsing failed for {service_name}: {e}")
+        except subprocess.TimeoutExpired:
+            log.warning("Journal parsing timed out", service=service_name)
+        except OSError as e:
+            log.warning("Journal parsing failed", service=service_name, error=str(e))
         return []
```

Also consider moving `return found` (line 254) to an `else` block of the `try` to satisfy Ruff TRY300 and make the non-exceptional path explicit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py` around lines 254 - 258, The except tuple currently uses (subprocess.TimeoutExpired, Exception) which is redundant because TimeoutExpired is a subclass of Exception; change the handler to either narrowly catch the specific expected exceptions (e.g., subprocess.SubprocessError or the concrete errors your journal parsing can raise) or collapse to a single except Exception as e and remove TimeoutExpired from the tuple, and move the existing return found out of the try and into an else block so the non-exception path is explicit (references: the except clause handling journal parsing errors for service_name and the variable found in service_check.py).
168-182: Silent `except … pass` in `_get_restart_count` swallows diagnostic failures (S110)

If `nsenter` fails for a non-benign reason (e.g., missing root privilege, kernel capability gap), the error is silently discarded and `0` is returned, making flap detection silently inoperable. The benign case (NRestarts unsupported on older systemd) and the fatal case (infrastructure failure) produce the same result.

♻️ Proposed fix

```diff
-        except Exception:
-            pass
+        except ValueError:
+            pass  # malformed NRestarts value; treat as 0
+        except Exception as e:
+            log.warning("Failed to query NRestarts", service=service_name, error=str(e))
         return 0
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py` around lines 168 - 182, The helper _get_restart_count currently swallows all exceptions, masking real failures; update it to distinguish benign "property unsupported" from command/runtime failures by: after calling self._run_host_cmd(service_name) check result.returncode and if non-zero log the failure with service_name and result.stderr (using self.logger.error or process logger) and re-raise or return an error indicator instead of silently returning 0; keep returning 0 only when the command succeeded but stdout is empty or the NRestarts property is absent; also change the broad except Exception: pass to either catch specific expected exceptions or log the exception (with exception info) and re-raise so infrastructure errors aren’t hidden. Ensure references to _get_restart_count, _run_host_cmd and service_name are used in your changes.
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fabric-manager-monitor
  namespace: nvsentinel
  labels:
    app.kubernetes.io/name: fabric-manager-monitor
    app.kubernetes.io/instance: nvsentinel
    app.kubernetes.io/component: metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: fabric-manager-monitor
      app.kubernetes.io/instance: nvsentinel
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fabric-manager-monitor
        app.kubernetes.io/instance: nvsentinel
        app.kubernetes.io/component: metrics
    spec:
      serviceAccountName: fabric-manager-monitor
      hostPID: true  # Required for nsenter to check host systemd services and nvidia-smi
      nodeSelector:
        # Matches any node with NVIDIA GPUs (set by GPU Feature Discovery)
        feature.node.kubernetes.io/pci-10de.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: monitor
          image: fabric-manager-monitor:latest
          envFrom:
            - configMapRef:
                name: fabric-manager-monitor
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          ports:
            - name: metrics
              containerPort: 9101
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /
              port: metrics
            initialDelaySeconds: 10
            periodSeconds: 30
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 100m
              memory: 128Mi
          securityContext:
            privileged: true  # Required for nsenter into host mount namespace
          volumeMounts:
            - name: host-dbus
              mountPath: /var/run/dbus
              readOnly: true
      volumes:
        - name: host-dbus
          hostPath:
            path: /var/run/dbus
            type: Directory
---
apiVersion: v1
kind: Service
metadata:
  name: fabric-manager-monitor
  namespace: nvsentinel
  labels:
    app.kubernetes.io/name: fabric-manager-monitor
    app.kubernetes.io/instance: nvsentinel
    app.kubernetes.io/component: metrics
spec:
  selector:
    app.kubernetes.io/name: fabric-manager-monitor
    app.kubernetes.io/instance: nvsentinel
  ports:
    - name: metrics
      port: 9101
      targetPort: 9101
      protocol: TCP
  clusterIP: None  # Headless — each pod scraped individually
```
🛠️ Refactor suggestion | 🟠 Major
DaemonSet variant selection logic is undocumented — coding guideline violation.
The coding guidelines require `**/daemonset*.yaml` files to explain DaemonSet variant selection logic (e.g., kata vs. regular nodes) in Helm chart documentation. This manifest has a single undifferentiated DaemonSet with no mention of kata node handling. For production Helm chart delivery the monitor will also need a corresponding chart under `distros/kubernetes/nvsentinel/charts/fabric-manager-monitor/` (per project conventions) that documents variant selection logic.
As per coding guidelines: "Explain DaemonSet variant selection logic in Helm chart documentation" (`**/daemonset*.yaml`). Based on learnings: "Health monitors should include Helm chart in distros/kubernetes/nvsentinel/charts/" and "Separate DaemonSets should be created for kata vs regular nodes using nodeAffinity based on kata.enabled label".
🧰 Tools
🪛 Checkov (3.2.334)
[medium] 1-71: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 1-71: Container should not be privileged
(CKV_K8S_16)
[medium] 1-71: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 1-71: Containers should not share the host process ID namespace
(CKV_K8S_17)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@demos/fabric-manager-monitor/k8s/daemonset.yaml` around lines 1 - 90, The
DaemonSet named fabric-manager-monitor currently lacks documented variant
selection and does not handle kata vs regular nodes; update the deployment by
adding separate DaemonSet manifests (or templated Helm chart resources) for each
variant and document the selection logic in the chart: keep the existing
fabric-manager-monitor DaemonSet for regular GPU nodes (current
feature.node.kubernetes.io/pci-10de.present selector), add a second DaemonSet
that targets kata nodes via nodeAffinity matching the kata.enabled label (and
inverse affinity for the regular one if using both), ensure both manifests
expose the same ports/env/volumes, and add clear Helm chart documentation
describing when each DaemonSet is deployed and how to enable/disable the kata
variant (include references to the NODE_NAME env and service
fabric-manager-monitor for scraping).
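To sketch what the requested variant split could look like, the fragments below use `nodeAffinity` on a `kata.enabled` node label (the label name comes from the review's "learnings" note; the exact key and Helm chart wiring are assumptions, not existing chart code):

```yaml
# Regular-node DaemonSet variant: GPU nodes that are NOT kata-enabled
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-10de.present
              operator: In
              values: ["true"]
            - key: kata.enabled
              operator: NotIn
              values: ["true"]
---
# Kata-node DaemonSet variant: only GPU nodes labeled kata.enabled=true
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-10de.present
              operator: In
              values: ["true"]
            - key: kata.enabled
              operator: In
              values: ["true"]
```

The inverse `In`/`NotIn` expressions keep the two variants mutually exclusive, so each node receives exactly one DaemonSet pod, and the chart README can document which variant applies.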
```yaml
      serviceAccountName: fabric-manager-monitor
      hostPID: true  # Required for nsenter to check host systemd services and nvidia-smi
      nodeSelector:
        # Matches any node with NVIDIA GPUs (set by GPU Feature Discovery)
        feature.node.kubernetes.io/pci-10de.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
```
Privileged + hostPID security posture — acknowledge Checkov findings and consider capability-scoping.
Checkov flags these as HIGH (CKV_K8S_16: privileged) and MEDIUM (CKV_K8S_17: hostPID, CKV_K8S_20: allowPrivilegeEscalation, CKV_K8S_23: root containers). The inline comments correctly explain the nsenter requirement, but two hardening options are worth considering even in a demo:
- **Capability-based approach:** `nsenter` for mount/UTS/network namespaces only requires `CAP_SYS_ADMIN` (and optionally `CAP_SYS_PTRACE`). Full `privileged: true` grants all 40+ capabilities and removes seccomp/AppArmor profiles — a much larger blast radius than needed.
- `allowPrivilegeEscalation: false` should be explicitly set even when `privileged: true` to satisfy policy engines (OPA/Kyverno) and pass Checkov CKV_K8S_20.

🛡️ Capability-scoped alternative to full privileged mode

```diff
           securityContext:
-            privileged: true  # Required for nsenter into host mount namespace
+            capabilities:
+              add:
+                - SYS_ADMIN   # nsenter namespace traversal
+                - SYS_PTRACE  # /proc/<pid> access for nsenter
+            allowPrivilegeEscalation: false
+            runAsNonRoot: false  # nsenter + nvidia-smi require root; document explicitly
```

Note: Test thoroughly — some `nsenter` invocations may still require full `privileged` depending on the target namespace types (e.g., `--mount` into the host mount namespace typically needs `SYS_ADMIN` + `SYS_CHROOT`).
Also applies to: 59-60
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@demos/fabric-manager-monitor/k8s/daemonset.yaml` around lines 22 - 30, The
DaemonSet is using a broad privileged posture; update the Pod/Container
securityContext to avoid privileged: true where possible and instead scope
capabilities required by nsenter (e.g., add kernel capabilities like
CAP_SYS_ADMIN and optionally CAP_SYS_PTRACE under
securityContext.capabilities.add) and explicitly set
securityContext.allowPrivilegeEscalation: false; keep hostPID: true if you still
need host PID access but remove or avoid privileged: true (or document why it
must remain) so Checkov flags CKV_K8S_16/17/20/23 are mitigated—look for the Pod
spec fields securityContext, capabilities, privileged, allowPrivilegeEscalation,
and hostPID to apply these changes.
```python
    def to_check_results(self, statuses: List[PCIeStatus], node_name: str) -> List[CheckResult]:
        """Convert PCIeStatus list to CheckResult list for the watcher."""
        results = []
        for pcie in statuses:
            if pcie.degraded:
                results.append(
                    CheckResult(
                        check_name="PcieLinkDegraded",
                        is_healthy=False,
                        is_fatal=True,
                        error_codes=["PCIE_LINK_DEGRADED"],
                        message=(
                            f"PCIe link degraded on {node_name} GPU {pcie.gpu_index}: "
                            f"Gen{pcie.link_gen_current} x{pcie.link_width_current} "
                            f"(max Gen{pcie.link_gen_max} x{pcie.link_width_max})"
                        ),
                        entities_impacted=[{"entityType": "GPU", "entityValue": str(pcie.gpu_index)}],
                    )
                )
            else:
                results.append(
                    CheckResult(
                        check_name="PcieLinkDegraded",
                        is_healthy=True,
                        is_fatal=False,
                        error_codes=[],
                        message=f"PCIe link healthy on {node_name} GPU {pcie.gpu_index}",
                        entities_impacted=[{"entityType": "GPU", "entityValue": str(pcie.gpu_index)}],
                    )
                )
        return results
```
🧩 Analysis chain
🏁 Script executed:
```shell
cd health-monitors/fabric-manager-monitor && find . -name "pcie_check.py" -o -name "watcher.py" | head -20
```
Repository: NVIDIA/NVSentinel

🏁 Script executed:
```shell
# First, let's see the full pcie_check.py file to understand the check() method
cat -n fabric_manager_monitor/checkers/pcie_check.py | head -200
```

🏁 Script executed:
```shell
# Search for watcher.py to see how it handles empty results
find . -name "watcher.py" -type f
```

🏁 Script executed:
```shell
# Check how the check() method works and when it returns empty list
rg -A 20 "def check\(" fabric_manager_monitor/checkers/pcie_check.py
```

🏁 Script executed:
```shell
cat -n health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/pcie_check.py | head -150
```

🏁 Script executed:
```shell
# Look for the check() method in pcie_check.py
rg -A 25 "def check\(" health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/pcie_check.py
```

🏁 Script executed:
```shell
# Check the watcher to see how it uses results from checkers
cat health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py | head -200
```

🏁 Script executed:
```shell
rg -A 15 "_run_pcie_checks" health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py
```

🏁 Script executed:
```shell
# Also check if there's any special handling for empty results
rg -B 5 -A 10 "if results" health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py
```

🏁 Script executed:
```shell
# Check the CheckResult type to understand the data structure
cat health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/types.py
```

🏁 Script executed:
```shell
# Get the full _run_pcie_checks method to see if it calls to_check_results
rg -A 30 "_run_pcie_checks" health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py
```

🏁 Script executed:
```shell
# Search for all calls to to_check_results in the codebase
rg "to_check_results" health-monitors/fabric-manager-monitor/
```

🏁 Script executed:
```shell
# Check how clock_check.py handles empty statuses
rg -A 30 "def to_check_results" health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/clock_check.py
```

🏁 Script executed:
```shell
# Also check the fabric_check.py to see if there's a pattern for handling empty results
rg -B 5 -A 40 "def to_check_results" health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/fabric_check.py
```

🏁 Script executed:
```shell
# Check service_check.py for comparison
rg -B 5 -A 20 "def check\(" health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py | head -50
```

🏁 Script executed:
```shell
# Check if there's an "error" field in PCIeStatus that could be used
rg -A 5 "class PCIeStatus" health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/pcie_check.py
```

🏁 Script executed:
```shell
# Check service_check.py to see how it handles errors - looking for a model to follow
cat health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py | head -100
```

🏁 Script executed:
```shell
# Let's check if there's existing error reporting pattern by looking at the full check method context
cat health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/pcie_check.py | sed -n '47,82p'
```

🏁 Script executed:
```shell
# Check if any tests exist that show expected behavior for empty statuses
find health-monitors/fabric-manager-monitor -name "*test*.py" -o -name "test_*" -type f | head -10
```
`to_check_results()` silently returns an empty list when nvidia-smi fails, masking monitoring gaps.

When `check()` fails (returns `[]` due to nvidia-smi errors, timeout, or exceptions), `to_check_results()` also returns `[]`. This means:

- No callbacks are fired (the watcher's `if results:` check skips them)
- Node health defaults to `True` (the watcher's `if results else True` logic)
- The monitoring failure is silent — the platform has no signal that PCIe health is unknown

Consider emitting an error-state CheckResult when `statuses` is empty, so the platform can track monitoring gaps. The ServiceChecker pattern provides a model: always return a status object with an `error` field set when the check itself fails.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/pcie_check.py`
around lines 114 - 144, to_check_results currently returns an empty list when
statuses is empty (e.g., nvidia-smi failure), which silences monitoring; update
to_check_results to detect when statuses == [] and instead return a single
error-state CheckResult (e.g., check_name "PcieLinkDegraded" or
"PcieCheckFailed") with is_healthy=False, is_fatal=False,
error_codes=["PCIE_CHECK_FAILED"], a clear message like "Unable to determine
PCIe status for {node_name}: check execution failed or timed out", and an
appropriate entities_impacted (e.g., node or no specific GPU), so consumers and
the watcher can surface monitoring gaps; keep existing behavior for non-empty
List[PCIeStatus].
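A minimal sketch of the suggested fallback is below. The `CheckResult` dataclass here is a stand-in mirroring the fields quoted in the diff above, and the `PcieCheckFailed`/`PCIE_CHECK_FAILED` names follow the prompt's example rather than any code that exists in the repo:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CheckResult:
    """Stand-in for the repo's CheckResult; fields mirror the snippet above."""

    check_name: str
    is_healthy: bool
    is_fatal: bool
    error_codes: List[str]
    message: str
    entities_impacted: List[Dict[str, str]] = field(default_factory=list)


def to_check_results(statuses: List[object], node_name: str) -> List[CheckResult]:
    """Convert per-GPU statuses, surfacing an explicit error when none exist."""
    if not statuses:
        # nvidia-smi failed or timed out: report the monitoring gap
        # instead of silently returning [].
        return [
            CheckResult(
                check_name="PcieCheckFailed",
                is_healthy=False,
                is_fatal=False,
                error_codes=["PCIE_CHECK_FAILED"],
                message=(
                    f"Unable to determine PCIe status for {node_name}: "
                    "check execution failed or timed out"
                ),
                entities_impacted=[{"entityType": "Node", "entityValue": node_name}],
            )
        ]
    return []  # placeholder for the existing per-GPU conversion logic
```

With this shape the watcher's `if results:` branch always fires, so a monitoring outage is distinguishable from a healthy node.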
health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py
```python
        self._fabric_manager_down = False

        # Initialize checkers and build the check list based on enabled flags
        self._checkers: List[tuple[str, callable]] = []
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's look at the actual file to understand the context
cd health-monitors/fabric-manager-monitor
head -n 100 fabric_manager_monitor/checkers/watcher.py | cat -n
```

🏁 Script executed:
```shell
# Run the suggested command from the review to find other instances
rg -n 'callable\]' --type=py
```

🏁 Script executed:
```shell
# Also search for imports and see if Callable is used elsewhere
rg -n 'from typing import|Callable' --type=py fabric_manager_monitor/checkers/watcher.py
```

🏁 Script executed:
```shell
# Check Python version requirements in the project
fd -e "pyproject.toml" -o -e "setup.py" -o -e ".python-version" | head -5 | xargs cat
```
Replace `callable` (lowercase) with a proper `Callable` type annotation.

`callable` is a built-in function, not a type. Use `typing.Callable` with the appropriate signature. This will be flagged by static type checkers.
Proposed fix
```diff
-from typing import List
+from typing import Callable, List, Tuple
 ...
-        self._checkers: List[tuple[str, callable]] = []
+        self._checkers: List[Tuple[str, Callable[[], List[CheckResult]]]] = []
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/watcher.py`
at line 72, The annotation for self._checkers uses the builtin callable instead
of the typing Callable; update the type to use typing.Callable (for example
List[tuple[str, Callable[..., Any]]]) and add the corresponding import (from
typing import Callable, Any) at the top of watcher.py so static type checkers
recognize the correct type for the check functions referenced by self._checkers.
...monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py
- Fix flapping alert: use increase() > 3 instead of rate() > 0.5 (old threshold required ~300 restarts in 10m to trigger)
- Fix PCIe degradation not affecting overall_healthy in demo monitor
- Fix requests CVE-2024-35195: bump minimum to >=2.32.2
- Fix click --verbose: use is_flag=True instead of type=bool
- Fix grpcio-tools: move from runtime to dev dependencies
- Cap exponential backoff: MAX_RETRIES=5, MAX_DELAY=15s (was 10 retries with uncapped delay blocking ~9 minutes)
- Remove unnecessary cluster-wide pods RBAC access
- Add Apache 2.0 license headers to __init__.py files
- Fix type annotations: Dict[str, float] -> Dict[str, List[float]]
- Improve port-forward docs to target specific node's pod

Signed-off-by: Anton Alexander <dmvevents@gmail.com>
- Fix CUDAValidationResult: use field(default_factory=list) instead of mutable default `errors: list = None`
- Fix logger: add "critical" to _LEVEL_MAP (was silently falling back to INFO)
- Fix Dockerfile: remove non-functional ARG shell substitution and rm -rf inside BuildKit cache mount
- Fix cache key sorting: sort entities by explicit fields instead of dict __repr__
- Fix Prometheus text parser: handle optional timestamps in exposition format (split on space, take second field as value)
- Add GENERAL_ERROR patterns to journal error parsing
- Pin demo image tag to 0.1.0 instead of :latest
- Bump requests to >=2.32.2 in native integration

Signed-off-by: Anton Alexander <dmvevents@gmail.com>
Actionable comments posted: 4
🧹 Nitpick comments (8)
demos/fabric-manager-monitor/monitor.py (1)
91-93: Unused `frame` parameter in signal handler.

Ruff ARG002 flags `frame` as unused. Since the signal handler signature is mandated by Python's `signal` module, rename to `_frame` (or annotate with `# noqa: ARG002`) to silence the lint warning without changing behaviour.

♻️ Proposed fix

```diff
-    def _handle_signal(self, signum, frame):
+    def _handle_signal(self, signum: int, _frame: object) -> None:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/monitor.py` around lines 91 - 93, The signal handler _handle_signal currently declares an unused parameter named frame which triggers Ruff ARG002; update the handler signature in the monitor class by renaming the parameter from frame to _frame (or alternately add a `# noqa: ARG002` comment) so the function still matches the signal.signal callback signature but the linter stops flagging it; ensure the body remains unchanged (logger.info and self._shutdown.set()) and only the parameter name/annotation is modified.

demos/fabric-manager-monitor/checks/fabric_check.py (2)
88-95: `rsplit(" ", 1)` will parse the timestamp as the value on lines that include a Prometheus timestamp. Per the Prometheus exposition format, a sample line may carry an optional millisecond-epoch timestamp: `metric{labels} value timestamp`. Using `rsplit(" ", 1)` splits at the rightmost space, so `value_str` would be the timestamp, not the sample value. DCGM exporter typically omits timestamps, so this is benign in practice. If you want robustness, split on whitespace and take the second token instead:

♻️ Proposed fix

```diff
-            name_and_labels, value_str = line.rsplit(" ", 1)
-            name = name_and_labels.split("{")[0]
-            value = float(value_str)
+            parts = line.split()
+            name = parts[0].split("{")[0]
+            value = float(parts[1])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/fabric_check.py` around lines 88 - 95, The parser currently uses line.rsplit(" ", 1) which can pick up an optional Prometheus timestamp as value_str; instead split the line on whitespace (e.g., parts = line.split()), ensure there are at least two tokens, set name_and_labels = parts[0] and value_str = parts[1], then proceed to extract name = name_and_labels.split("{")[0], convert value = float(value_str), and append to metrics as before (metrics.setdefault(name, []).append(value)); also keep the existing try/except or check token length to avoid IndexError/ValueError.
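For reference, a runnable sketch of the whitespace-splitting approach (the `parse_exposition` helper name is hypothetical, not code from the PR; it assumes label values contain no embedded whitespace, which holds for typical DCGM exporter output):

```python
from typing import Dict, List


def parse_exposition(text: str) -> Dict[str, List[float]]:
    """Parse Prometheus text-exposition samples, tolerating the optional
    trailing millisecond timestamp: metric{labels} value [timestamp].

    Caveat: splitting on whitespace assumes no label value contains a
    space; DCGM exporter output normally satisfies this.
    """
    metrics: Dict[str, List[float]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2:
            continue  # malformed sample line
        name = parts[0].split("{")[0]
        try:
            value = float(parts[1])
        except ValueError:
            continue  # value token is not numeric
        metrics.setdefault(name, []).append(value)
    return metrics
```

With a timestamped line such as `up 1 1700000000000`, `parts[1]` is still the sample value rather than the timestamp.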
83-83: Internal `Dict[str, list]` annotations should be `Dict[str, List[float]]` to match the declared return/parameter types. Line 83's variable annotation and line 98's parameter annotation use the unparameterized `list`, while the public signatures on lines 81 and 72 declare `Dict[str, List[float]]`. This creates a silent mismatch that mypy/pyright will flag.

♻️ Proposed fix

```diff
-        metrics: Dict[str, list] = {}
+        metrics: Dict[str, List[float]] = {}
```

```diff
-    def _sum_metric(self, metrics: Dict[str, list], name: str) -> float:
+    def _sum_metric(self, metrics: Dict[str, List[float]], name: str) -> float:
```

Also applies to: 98-98
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/fabric_check.py` at line 83, update the internal annotations that use the bare `list` to the more specific `List[float]`: change the variable annotation `metrics: Dict[str, list]` to `metrics: Dict[str, List[float]]` and any function parameter annotated as `Dict[str, list]` (e.g., the parameter named `metrics`) to `Dict[str, List[float]]`; also add `from typing import List` if it's not already imported so the `List` type is available. Ensure all occurrences of the bare `list` in these `Dict[...]` annotations are replaced consistently with `List[float]` to match the public signatures.

demos/fabric-manager-monitor/k8s/servicemonitor.yaml (1)
1-6: Consider adding a Helm chart for the production health monitor. The demo ships raw k8s manifests, which is fine for standalone evaluation. However, the companion production health monitor (`health-monitors/fabric-manager-monitor/`) has no Helm chart in `distros/kubernetes/nvsentinel/charts/`.

Based on learnings: "Health monitors should include Helm chart in `distros/kubernetes/nvsentinel/charts/`"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/k8s/servicemonitor.yaml` around lines 1-6, the ServiceMonitor manifest for kind: ServiceMonitor and metadata.name: fabric-manager-monitor is committed as a raw k8s YAML, but the repo expects a Helm chart for production monitors; create a Helm chart named fabric-manager-monitor in the distro charts collection (the repo's nvsentinel charts area) that templates the ServiceMonitor (templates/servicemonitor.yaml) and exposes values (values.yaml) for name, namespace, labels/selectors and endpoints, include Chart.yaml and README, and add the chart to whatever charts index/packaging is used by the repo so the production health monitor is deployable via Helm instead of a raw manifest.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py (2)
131-134: Prefix unused signal handler parameters with `_` to suppress linting warnings. `signum` and `frame` are required by the `signal.signal` contract but intentionally unused; prefixing them with `_` is the conventional way to signal this.

♻️ Proposed fix

```diff
-    def process_exit_signal(signum, frame):
+    def process_exit_signal(_signum, _frame):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py` around lines 131 - 134, The signal handler process_exit_signal currently declares unused parameters signum and frame; rename them to _signum and _frame in the process_exit_signal function signature to suppress lint warnings while preserving the signal.signal contract, leaving the body unchanged (it should still call exit.set(), prom_server.shutdown(), and t.join()) so behavior is unaffected.
159-160: Consider adding a Helm chart for the production `fabric-manager-monitor` component. The PR wires production-grade gRPC, signal handling, and Prometheus metrics, but there is no Helm chart for it under `distros/kubernetes/nvsentinel/charts/`. This is a deployment gap for Kubernetes-based installations.

Based on learnings: "Health monitors should include Helm chart in `distros/kubernetes/nvsentinel/charts/`."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py` around lines 159-160, the project lacks a Helm chart for deploying the production fabric-manager-monitor; add a new Helm chart in the nvsentinel charts directory for the fabric-manager-monitor component that packages the service invoked by the CLI entrypoint (the `if __name__ == "__main__": cli()` binary). Create Chart.yaml and values.yaml, and templates including Deployment (container image, gRPC port, resources, env for signal/metrics config), Service, ServiceMonitor (Prometheus scraping), RBAC (if needed), ConfigMap for config, and probes (liveness/readiness); ensure templates expose the same gRPC/metrics ports used by the code and make chart values configurable (image/tag, replicas, resources), then add the chart to the repo packaging/CI so it's included in release artifacts.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py (2)
136-137: Use `log.exception` to preserve the traceback on unexpected errors. `log.error(f"Exception while sending health events: {e}")` silently discards the stack trace. Swap to `log.exception` so the full context is visible in structured logs, and optionally narrow the catch type.

♻️ Proposed fix

```diff
-        except Exception as e:
-            log.error(f"Exception while sending health events: {e}")
+        except Exception:
+            log.exception("Exception while sending health events")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py` around lines 136 - 137, The except block currently calls log.error(f"Exception while sending health events: {e}") which loses the traceback; replace that call with log.exception("Exception while sending health events") in the except block (preserving the exception variable or removing it since log.exception captures the traceback) and consider narrowing the except Exception to a more specific exception type inside event_processor.py where the send-health-events logic runs so we don't swallow unrelated errors.
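A small self-contained illustration of the difference (the `demo_log_exception` helper is hypothetical, not code from the PR):

```python
import io
import logging


def demo_log_exception() -> str:
    """Compare log.error (stringified message only) with log.exception
    (message plus the active traceback) inside an except block."""
    stream = io.StringIO()
    logger = logging.getLogger("uds-demo")
    logger.handlers = [logging.StreamHandler(stream)]
    logger.propagate = False
    try:
        raise RuntimeError("boom")
    except RuntimeError as e:
        logger.error(f"error only: {e}")    # message, no stack trace
        logger.exception("with traceback")  # message + full traceback
    return stream.getvalue()
```

Running it shows that only the `exception` call records the `Traceback (most recent call last)` block, which is exactly what the review asks to keep.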
154-158: `log.error` drops the traceback; `sleep` runs wastefully after the last retry. Three nits in the retry handler:

- `log.error` (line 155) → `log.exception` to include the RPC error's traceback in structured output.
- `sleep(delay)` executes even on the final failed attempt; move the sleep before the `delay` update and guard it with a retry-count check, or restructure with `else`.
- `continue` (line 158) is redundant at the end of the loop body.

♻️ Proposed fix

```diff
         for attempt in range(MAX_RETRIES):
             with grpc.insecure_channel(f"unix://{self._socket_path}") as chan:
                 stub = platformconnector_pb2_grpc.PlatformConnectorStub(chan)
                 try:
                     stub.HealthEventOccurredV1(platformconnector_pb2.HealthEvents(events=health_events, version=1))
                     metrics.events_sent_success.inc()
                     return True
                 except grpc.RpcError as e:
-                    log.error(f"Failed to send health event to UDS: {e}")
-                    sleep(delay)
-                    delay = min(delay * 1.5, MAX_DELAY)
-                    continue
+                    log.exception(f"Failed to send health event to UDS: {e}")
+                    if attempt < MAX_RETRIES - 1:
+                        sleep(delay)
+                        delay = min(delay * 1.5, MAX_DELAY)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py` around lines 154 - 158, The retry handler currently swallows traceback and always sleeps even after the final attempt; change the except grpc.RpcError as e block to call log.exception(...) instead of log.error(...) to include the traceback, move the sleep(delay) to occur before updating delay (delay = min(delay * 1.5, MAX_DELAY)) and only execute it if there are remaining retries (use the loop's retry counter or compare current attempt index to max_retries), and remove the redundant continue at the end of the except block so control naturally loops or exits based on the retry condition.
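A minimal sketch of the suggested retry shape, independent of gRPC (generic callables stand in for the stub and `time.sleep`; `ConnectionError` substitutes for `grpc.RpcError`, and the constants mirror the PR's stated cap of 5 retries / 15 s):

```python
import time

MAX_RETRIES = 5
INITIAL_DELAY = 1.0
MAX_DELAY = 15.0


def send_with_backoff(send, sleep=time.sleep) -> bool:
    """Retry `send` with capped exponential backoff; the sleep is
    guarded so the final failed attempt does not block pointlessly."""
    delay = INITIAL_DELAY
    for attempt in range(MAX_RETRIES):
        try:
            send()
            return True
        except ConnectionError:
            if attempt < MAX_RETRIES - 1:  # no sleep after the last attempt
                sleep(delay)
                delay = min(delay * 1.5, MAX_DELAY)
    return False
```

Injecting `sleep` makes the backoff schedule unit-testable without real delays.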
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@demos/fabric-manager-monitor/k8s/servicemonitor.yaml`:
- Around line 111-121: The alert GPUServiceDown currently sets labels.severity
to "critical" which conflicts with the PR's stated non-fatal classification for
GPU services; update the alert definition (alert name GPUServiceDown, labels
block) to use severity: warning instead of critical so the alert severity aligns
with the non-fatal classification used for GPU services like persistenced/DCGM.
In `@demos/fabric-manager-monitor/monitor.py`:
- Around line 114-117: The metric fabric_manager_restarts_total is never
incremented because the loop handling fm_status.n_restarts contains a no-op; fix
by storing the previous restart count per FM (e.g., a module-level dict like
prev_restarts keyed by FM id/host) and in the monitoring loop (where
fm_status.n_restarts is read) compute delta = max(0, fm_status.n_restarts -
prev_restarts[key]); if delta > 0 call the Counter's inc(delta) to increment
fabric_manager_restarts_total, then update prev_restarts[key] =
fm_status.n_restarts so subsequent cycles use the new baseline; ensure this
logic is applied where the current code comments reference "increment by delta"
and that the symbol fabric_manager_restarts_total is the Counter you call inc on
so the FabricManagerFlapping alert can fire.
In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py`:
- Line 95: Rename the threading.Event variable currently named exit to avoid
shadowing the built-in: change its declaration (Event() assigned to exit) to
exit_event, update the closure process_exit_signal to call exit_event.set()
instead of exit.set(), and update any callers such as watcher.start(...) to pass
exit_event (e.g., watcher.start(exit_event)); ensure all references to the old
symbol exit in that scope are replaced with exit_event.
- Around line 106-108: Wrap the call to
get_package_version("fabric-manager-monitor") in a try/except that catches
importlib.metadata.PackageNotFoundError, defaulting version to a safe value
(e.g., "dev" or "unknown") when the package is not installed, and ensure
set_default_structured_logger_with_level("fabric-manager-monitor", version,
log_level) is always called; reference get_package_version,
PackageNotFoundError, and set_default_structured_logger_with_level when making
the change so the structured logger is initialized even on bare source
checkouts.
---
Duplicate comments:
In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py`:
- Around line 77-94: The cli function signature lacks type annotations; update
the cli(...) declaration to add explicit Python type hints for each parameter
(e.g., platform_connector_socket: str, port: int, poll_interval: float,
node_name: str, boot_grace_period: int, flap_window: int, flap_threshold: int,
enable_fabric_check: bool, enable_pcie_check: bool, enable_clock_check: bool,
enable_nvlink_check: bool, enable_cuda_validation: bool, dcgm_exporter_url:
Optional[str], clock_throttle_ratio: float, processing_strategy: str, verbose:
bool) and add a return annotation (e.g., -> None). Use Optional[...] where
parameters can be None and import typing types if needed; update the cli
symbol's signature only so callers remain compatible.
In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`:
- Around line 64-67: _build_cache_key generates cache keys by sorting
entities_impacted with key=str, which depends on dict insertion order; change
the sort to use explicit stable fields so semantically identical dicts produce
the same key: sort entities_impacted by a tuple of (e.get('entityType',''),
e.get('entityValue','')) (or the canonical field names used in your entities)
and then build the entity_str using those fields (e.g.,
f"{entityType}:{entityValue}") so ordering and missing-key cases are
deterministic; update the _build_cache_key function to use this explicit-key
sorting and access pattern.
---
Nitpick comments:
In `@demos/fabric-manager-monitor/checks/fabric_check.py`:
- Around line 88-95: The parser currently uses line.rsplit(" ", 1) which can
pick up an optional Prometheus timestamp as value_str; instead split the line on
whitespace (e.g., parts = line.split()), ensure there are at least two tokens,
set name_and_labels = parts[0] and value_str = parts[1], then proceed to extract
name = name_and_labels.split("{")[0], convert value = float(value_str), and
append to metrics as before (metrics.setdefault(name, []).append(value)); also
keep the existing try/except or check token length to avoid
IndexError/ValueError.
- Line 83: Update the internal annotations that use the bare list to the more
specific List[float]: change the variable annotation metrics: Dict[str, list] to
metrics: Dict[str, List[float]] and any function parameter annotated as
Dict[str, list] (e.g., the parameter named metrics) to Dict[str, List[float]];
also add "from typing import List" if it's not already imported so the List type
is available. Ensure all occurrences of the bare list in these Dict[...]
annotations are replaced consistently with List[float] to match the public
signatures.
In `@demos/fabric-manager-monitor/k8s/servicemonitor.yaml`:
- Around line 1-6: The ServiceMonitor manifest for kind: ServiceMonitor and
metadata.name: fabric-manager-monitor is committed as a raw k8s YAML but the
repo expects a Helm chart for production monitors; create a Helm chart named
fabric-manager-monitor in the distro charts collection (the repo's nvsentinel
charts area) that templates the ServiceMonitor (templates/servicemonitor.yaml)
and exposes values (values.yaml) for name, namespace, labels/selectors and
endpoints, include Chart.yaml and README, and add the chart to whatever charts
index/packaging used by the repo so the production health monitor is deployable
via Helm instead of a raw manifest.
In `@demos/fabric-manager-monitor/monitor.py`:
- Around line 91-93: The signal handler _handle_signal currently declares an
unused parameter named frame which triggers Ruff ARG002; update the handler
signature in the monitor class by renaming the parameter from frame to _frame
(or alternately add a `# noqa: ARG002` comment) so the function still matches
the signal.signal callback signature but the linter stops flagging it; ensure
the body remains unchanged (logger.info and self._shutdown.set()) and only the
parameter name/annotation is modified.
In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py`:
- Around line 131-134: The signal handler process_exit_signal currently declares
unused parameters signum and frame; rename them to _signum and _frame in the
process_exit_signal function signature to suppress lint warnings while
preserving the signal.signal contract, leaving the body unchanged (it should
still call exit.set(), prom_server.shutdown(), and t.join()) so behavior is
unaffected.
- Around line 159-160: The project lacks a Helm chart for deploying the
production fabric-manager-monitor; add a new Helm chart in the nvsentinel charts
directory for the fabric-manager-monitor component that packages the service
invoked by the CLI entrypoint (the if __name__ == "__main__": cli() binary).
Create Chart.yaml and values.yaml, and templates including Deployment (container
image, GRPC port, resources, env for signal/metrics config), Service,
ServiceMonitor (Prometheus scraping), RBAC (if needed), ConfigMap for config,
and probes (liveness/readiness); ensure templates expose the same gRPC/metrics
ports used by the code and make chart values configurable (image/tag, replicas,
resources), then add the chart to the repo packaging/CI so it’s included in
release artifacts.
In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`:
- Around line 136-137: The except block currently calls log.error(f"Exception
while sending health events: {e}") which loses the traceback; replace that call
with log.exception("Exception while sending health events") in the except block
(preserving the exception variable or removing it since log.exception captures
the traceback) and consider narrowing the except Exception to a more specific
exception type inside event_processor.py where the send-health-events logic runs
so we don't swallow unrelated errors.
- Around line 154-158: The retry handler currently swallows traceback and always
sleeps even after the final attempt; change the except grpc.RpcError as e block
to call log.exception(...) instead of log.error(...) to include the traceback,
move the sleep(delay) to occur before updating delay (delay = min(delay * 1.5,
MAX_DELAY)) and only execute it if there are remaining retries (use the loop's
retry counter or compare current attempt index to max_retries), and remove the
redundant continue at the end of the except block so control naturally loops or
exits based on the retry condition.
```yaml
- alert: GPUServiceDown
  expr: nvidia_service_up == 0
  for: 3m
  labels:
    severity: critical
    component: fabric-manager-monitor
  annotations:
    summary: "{{ $labels.service_name }} down on {{ $labels.node }}"
    description: >-
      NVIDIA service {{ $labels.service_name }} has been down for more
      than 3 minutes on {{ $labels.node }}.
```
severity: critical inconsistent with the PR's non-fatal classification for GPU services.
The PR description explicitly classifies GPU services (persistenced, DCGM) as non-fatal. Using severity: critical for this alert will page on-call for a condition that the monitor itself treats as non-fatal, creating alert-fatigue risk. Consider severity: warning to align with the stated impact.
🔧 Proposed fix

```diff
   - alert: GPUServiceDown
     expr: nvidia_service_up == 0
     for: 3m
     labels:
-      severity: critical
+      severity: warning
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@demos/fabric-manager-monitor/k8s/servicemonitor.yaml` around lines 111 - 121,
The alert GPUServiceDown currently sets labels.severity to "critical" which
conflicts with the PR's stated non-fatal classification for GPU services; update
the alert definition (alert name GPUServiceDown, labels block) to use severity:
warning instead of critical so the alert severity aligns with the non-fatal
classification used for GPU services like persistenced/DCGM.
```python
        # Update restart counter (set to current total — Counter only goes up)
        if fm_status.n_restarts > 0:
            # We use _total suffix via Counter; increment by delta
            pass  # Counter tracks via flap detection instead
```
fabric_manager_restarts_total is never incremented — FabricManagerFlapping alert can never fire.
The pass at line 117 is a deliberate no-op, so increase(fabric_manager_restarts_total[10m]) > 3 in the PrometheusRule will always evaluate to 0 and the alert is effectively dead code.
The intended fix based on the comment ("increment by delta") would require tracking the previous restart count across cycles:
🐛 Proposed fix

```diff
 class FabricManagerMonitor:
     def __init__(self, config: MonitorConfig):
         ...
         self._fabric_manager_down = False
         self._nvlink_bandwidth_zero = False
+        self._prev_fm_restarts: int = 0
```

```diff
-        if fm_status.n_restarts > 0:
-            # We use _total suffix via Counter; increment by delta
-            pass  # Counter tracks via flap detection instead
+        delta = max(0, fm_status.n_restarts - self._prev_fm_restarts)
+        if delta > 0:
+            fabric_manager_restarts_total.labels(node).inc(delta)
+        self._prev_fm_restarts = fm_status.n_restarts
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@demos/fabric-manager-monitor/monitor.py` around lines 114 - 117, The metric
fabric_manager_restarts_total is never incremented because the loop handling
fm_status.n_restarts contains a no-op; fix by storing the previous restart count
per FM (e.g., a module-level dict like prev_restarts keyed by FM id/host) and in
the monitoring loop (where fm_status.n_restarts is read) compute delta = max(0,
fm_status.n_restarts - prev_restarts[key]); if delta > 0 call the Counter's
inc(delta) to increment fabric_manager_restarts_total, then update
prev_restarts[key] = fm_status.n_restarts so subsequent cycles use the new
baseline; ensure this logic is applied where the current code comments reference
"increment by delta" and that the symbol fabric_manager_restarts_total is the
Counter you call inc on so the FabricManagerFlapping alert can fire.
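The delta-tracking idea can be sketched independently of Prometheus (the `RestartTracker` class is hypothetical; a plain callable stands in for `Counter.inc` so the snippet needs no prometheus_client dependency):

```python
from collections import defaultdict


class RestartTracker:
    """Convert a cumulative restart total into positive deltas so a
    monotonic Prometheus-style Counter can be incremented per poll cycle."""

    def __init__(self, inc):
        self._inc = inc                # e.g. counter.labels(node).inc
        self._prev = defaultdict(int)  # last seen total per FM/node key

    def observe(self, key: str, n_restarts: int) -> int:
        delta = max(0, n_restarts - self._prev[key])
        if delta > 0:
            self._inc(delta)           # Counter only ever increases
        self._prev[key] = n_restarts   # re-baseline, even after a reset
        return delta
```

The `max(0, ...)` guard keeps a host-side counter reset from producing a negative delta, which a Prometheus Counter would reject.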
```python
    processing_strategy,
    verbose,
):
    exit = Event()
```
exit shadows the Python built-in; rename to exit_event.
The threading.Event variable named exit masks the built-in exit() within the entire function scope, including the process_exit_signal closure.
🐛 Proposed fix

```diff
-    exit = Event()
+    exit_event = Event()
```

Apply the same rename at lines 131 (`exit_event.set()`) and 156 (`watcher.start(exit_event)`).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py` at line
95, Rename the threading.Event variable currently named exit to avoid shadowing
the built-in: change its declaration (Event() assigned to exit) to exit_event,
update the closure process_exit_signal to call exit_event.set() instead of
exit.set(), and update any callers such as watcher.start(...) to pass exit_event
(e.g., watcher.start(exit_event)); ensure all references to the old symbol exit
in that scope are replaced with exit_event.
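A self-contained sketch of the renamed pattern (the `install_shutdown_handler` wrapper is hypothetical, not the PR's code; `signal.signal` registration must happen on the main thread):

```python
import signal
import threading


def install_shutdown_handler() -> threading.Event:
    """Register SIGTERM/SIGINT handlers that set a shutdown event.

    Naming the event exit_event avoids shadowing the builtin exit().
    """
    exit_event = threading.Event()

    def process_exit_signal(_signum, _frame):
        exit_event.set()  # wakes any thread blocked in exit_event.wait()

    signal.signal(signal.SIGTERM, process_exit_signal)
    signal.signal(signal.SIGINT, process_exit_signal)
    return exit_event
```

The main loop can then call `exit_event.wait(poll_interval)` instead of sleeping, so shutdown is prompt rather than waiting out a full poll cycle.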
```python
    version = get_package_version("fabric-manager-monitor")
    log_level = "debug" if verbose else os.getenv("LOG_LEVEL", "info")
    set_default_structured_logger_with_level("fabric-manager-monitor", version, log_level)
```
get_package_version() can raise PackageNotFoundError before the structured logger is initialized.
If the package isn't installed (e.g., bare source checkout), this throws an uncaught importlib.metadata.PackageNotFoundError before set_default_structured_logger_with_level is ever called, producing an unformatted traceback instead of a clean error.
🛡️ Proposed fix

```diff
+from importlib.metadata import PackageNotFoundError
 ...
-    version = get_package_version("fabric-manager-monitor")
+    try:
+        version = get_package_version("fabric-manager-monitor")
+    except PackageNotFoundError:
+        version = "unknown"
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```diff
-    version = get_package_version("fabric-manager-monitor")
+    try:
+        version = get_package_version("fabric-manager-monitor")
+    except PackageNotFoundError:
+        version = "unknown"
     log_level = "debug" if verbose else os.getenv("LOG_LEVEL", "info")
     set_default_structured_logger_with_level("fabric-manager-monitor", version, log_level)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/cli.py` around
lines 106 - 108, Wrap the call to get_package_version("fabric-manager-monitor")
in a try/except that catches importlib.metadata.PackageNotFoundError, defaulting
version to a safe value (e.g., "dev" or "unknown") when the package is not
installed, and ensure
set_default_structured_logger_with_level("fabric-manager-monitor", version,
log_level) is always called; reference get_package_version,
PackageNotFoundError, and set_default_structured_logger_with_level when making
the change so the structured logger is initialized even on bare source
checkouts.
Actionable comments posted: 4
🧹 Nitpick comments (17)
demos/fabric-manager-monitor/k8s/daemonset.yaml (1)
62-68: `/var/run/dbus` hostPath mount may be redundant with `nsenter --mount`. When `nsenter` enters the host's mount namespace, the host's D-Bus socket (`/var/run/dbus/system_bus_socket`) becomes accessible naturally. Explicitly bind-mounting it into the container is redundant if all `systemctl` calls go through `nsenter`. If any code path issues D-Bus calls directly (without nsenter), add a clarifying comment explaining why.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/k8s/daemonset.yaml` around lines 62-68, the hostPath volume and volumeMount named "host-dbus" that binds /var/run/dbus is likely redundant when the container uses nsenter to enter the host mount namespace, because /var/run/dbus/system_bus_socket will already be accessible; remove the volumeMount entry (mountPath: /var/run/dbus) and the volumes entry for "host-dbus" to avoid an unnecessary bind-mount, or, if there are any code paths that call D-Bus directly (e.g., systemctl calls not wrapped in nsenter), keep the "host-dbus" entries but add a clear comment referencing nsenter, systemctl, and system_bus_socket explaining why the explicit hostPath mount is required.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/logger.py (1)
58-59: Prefix unused-but-required processor arguments with `_` to silence Ruff ARG001. `logger` and `method_name` are mandated by the structlog processor protocol and cannot be removed, but the Ruff linter (ARG001) flags them as unused. The idiomatic Python fix is a `_` prefix.

♻️ Proposed refactor

```diff
 def inject_module_version(
-    logger: logging.Logger | None,
-    method_name: str,
+    _logger: logging.Logger | None,
+    _method_name: str,
     event_dict: dict[str, Any],
 ) -> dict[str, Any]:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/logger.py` around lines 58-59, the structlog processor function currently declares parameters named logger and method_name which are required by the processor protocol but unused and trigger Ruff ARG001; fix this by prefixing those parameters with an underscore (e.g., change logger -> _logger and method_name -> _method_name) in the processor function signature (and update any internal references if they exist) so the names remain present for the protocol but are treated as intentionally unused.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py (5)
22-22: `import logging as log` uses the root logger. All `log.debug/error/warning` calls go through the root logger rather than a module-scoped logger, losing module-level filtering and source attribution. The sibling file `cuda_validation.py` correctly uses `logger = logging.getLogger(__name__)`.

♻️ Proposed fix

```diff
-import logging as log
+import logging
+
+log = logging.getLogger(__name__)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py` at line 22, The module currently imports the root logger via "import logging as log", causing all calls to use the root logger; replace that with "import logging" and add a module-scoped logger variable (logger = logging.getLogger(__name__)), then update all uses of log.debug/log.error/log.warning (etc.) in event_processor.py to use logger.debug/logger.error/logger.warning so log records carry module-level source and filtering. Ensure the new logger is configured consistently with existing sibling modules (e.g., cuda_validation.py) and keep existing log messages unchanged.
132-132: Non-Pythonic `len()` truthiness check. `if len(health_events):` should be `if health_events:`.

♻️ Proposed fix

```diff
-        if len(health_events):
+        if health_events:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py` at line 132, Replace the non-Pythonic truthiness check using len() by checking the list directly: change the condition `if len(health_events):` to `if health_events:` in the event processing code that references the variable `health_events` (look for the block inside the EventProcessor/event processing function in event_processor.py where `health_events` is evaluated) so the code uses idiomatic Python truthiness.
139-140: `log.error()` silently drops the exception traceback at both catch sites. Lines 140 and 158 use `log.error(f"... {e}")`, which records only the stringified exception. Ruff TRY400 confirms this; use `log.exception()` (or `log.error(..., exc_info=True)`) so the full traceback is available in logs for diagnosis.

♻️ Proposed fix

```diff
-            log.error(f"Failed to send health event to UDS: {e}")
+            log.exception(f"Failed to send health event to UDS: {e}")
```

```diff
-            log.error(f"Exception while sending health events: {e}")
+            log.exception(f"Exception while sending health events: {e}")
```

Also applies to: 157-158
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py` around lines 139 - 140, The except blocks in event_processor.py currently call log.error(f"Exception while sending health events: {e}") which drops the traceback; update those catch sites (the two except Exception handlers around the send/publish logic in EventProcessor/event processing routine) to log the full traceback by using log.exception("Exception while sending health events") or log.error("Exception while sending health events", exc_info=True) instead so the stacktrace is captured for debugging.
149-161: New gRPC channel opened on every retry iteration. `grpc.insecure_channel(...)` is constructed (and torn down by the `with` block) on each pass through the retry loop. Every retry pays the full Unix-domain-socket connection-establishment cost, negating the benefit of reconnect logic built into gRPC channels. Move channel creation outside the loop.

♻️ Proposed refactor

```diff
-        delay = INITIAL_DELAY
-        for _ in range(MAX_RETRIES):
-            with grpc.insecure_channel(f"unix://{self._socket_path}") as chan:
-                stub = platformconnector_pb2_grpc.PlatformConnectorStub(chan)
-                try:
-                    stub.HealthEventOccurredV1(platformconnector_pb2.HealthEvents(events=health_events, version=1))
-                    metrics.events_sent_success.inc()
-                    return True
-                except grpc.RpcError as e:
-                    log.error(f"Failed to send health event to UDS: {e}")
-                    sleep(delay)
-                    delay = min(delay * 1.5, MAX_DELAY)
-                    continue
+        delay = INITIAL_DELAY
+        with grpc.insecure_channel(f"unix://{self._socket_path}") as chan:
+            stub = platformconnector_pb2_grpc.PlatformConnectorStub(chan)
+            for _ in range(MAX_RETRIES):
+                try:
+                    stub.HealthEventOccurredV1(platformconnector_pb2.HealthEvents(events=health_events, version=1))
+                    metrics.events_sent_success.inc()
+                    return True
+                except grpc.RpcError as e:
+                    log.error(f"Failed to send health event to UDS: {e}", exc_info=True)
+                    sleep(delay)
+                    delay = min(delay * 1.5, MAX_DELAY)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py` around lines 149 - 161, The retry loop currently creates a new gRPC channel inside the loop (grpc.insecure_channel) so each retry reconnects; move channel creation outside the for loop and reuse the same channel and stub for all retries: create chan = grpc.insecure_channel(f"unix://{self._socket_path}") and stub = platformconnector_pb2_grpc.PlatformConnectorStub(chan) before the for _ in range(MAX_RETRIES) loop, then call stub.HealthEventOccurredV1(...) inside the try block, keeping the existing backoff logic (INITIAL_DELAY, MAX_DELAY, sleep) and metrics.events_sent_success.inc(), and ensure the channel is closed/cleaned up after the loop completes or on success.
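The channel-outside-loop shape generalizes to any expensive resource. Here is a runnable sketch of just the retry/backoff skeleton, with the gRPC call replaced by an injected `send` callable; all names and constants are illustrative, not from the PR:

```python
import time
from typing import Callable

INITIAL_DELAY = 0.5
MAX_DELAY = 5.0
MAX_RETRIES = 5

def send_with_retries(send: Callable[[], None],
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `send` up to MAX_RETRIES times, backing off 1.5x between attempts.

    The expensive resource (e.g. a gRPC channel) should be created by the
    caller once, outside this loop, and captured in the `send` closure.
    """
    delay = INITIAL_DELAY
    for _ in range(MAX_RETRIES):
        try:
            send()
            return True
        except ConnectionError:
            sleep(delay)
            delay = min(delay * 1.5, MAX_DELAY)
    return False

# Fails twice, then succeeds; sleeps are recorded instead of actually waited.
attempts = []
sleeps = []

def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient")

ok = send_with_retries(flaky, sleep=sleeps.append)
print(ok, len(attempts), sleeps)  # True 3 [0.5, 0.75]
```

Injecting `sleep` keeps the backoff testable without real delays, the same way injecting `send` keeps the sketch free of a live gRPC server.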
87-89: `health_events_publish_time_to_grpc_channel` metric includes retry sleep time.

The `with ... .time():` context wraps the entire method body, including per-retry `sleep(delay)` calls (up to ~26 s cumulative). This makes the metric report inflated values on transient failures and conflates gRPC call latency with backoff delays, undermining its usefulness as an SLI.

Consider measuring only the actual gRPC call, or emit a separate metric for retry count/total delay.
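One way to keep backoff out of the latency metric is to time only the call body and accumulate sleeps in a separate counter. This sketch uses a plain list and float standing in for the Prometheus histogram and counter; the names are illustrative:

```python
import time

call_seconds = []        # stand-in for the publish-latency histogram samples
retry_delay_total = 0.0  # stand-in for a separate cumulative-backoff counter

def attempt(succeed: bool) -> None:
    """Simulated RPC: only this body falls inside the timed region."""
    start = time.perf_counter()
    try:
        if not succeed:
            raise ConnectionError("transient")
    finally:
        call_seconds.append(time.perf_counter() - start)

delay = 0.01
for succeed in (False, False, True):
    try:
        attempt(succeed)
        break
    except ConnectionError:
        time.sleep(delay)          # backoff happens OUTSIDE the timed region
        retry_delay_total += delay
        delay *= 1.5
```

Three call samples are recorded, each reflecting only the (near-instant) RPC body, while the two backoff sleeps land in the separate counter.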
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py` around lines 87 - 89, The current timing context around metrics.health_events_publish_time_to_grpc_channel.labels("health_check_completed_to_grpc_channel").time() wraps the entire retry loop (including sleep(delay)), inflating latency; to fix, move the .time() context so it only surrounds the actual gRPC call (the call to the client/producer inside the retry loop) and remove sleeps from that scope, and add a separate metric (e.g., health_events_retry_delay_seconds_total or health_events_retries_count) updated inside the retry branch to record cumulative backoff delay and retry count; reference the existing metric name health_events_publish_time_to_grpc_channel and the retry/sleep logic in the method to locate where to reposition the timer and where to increment the new retry metric.

demos/fabric-manager-monitor/checks/cuda_validation.py (2)
87-88: Move `import json` to module-level imports.

`import json` inside the method body violates PEP 8, which requires all imports at the top of the file. As per coding guidelines, PEP 8 must be followed for Python code.

♻️ Proposed fix
```diff
 import logging
 import subprocess
 import sys
 import textwrap
+import json
 from dataclasses import dataclass, field
 from typing import List, Optional
```

```diff
-        import json
         data = json.loads(result.stdout.strip())
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/cuda_validation.py` around lines 87 - 88, The local "import json" found inside the function in cuda_validation.py should be moved to the module-level imports to comply with PEP 8: remove the inline "import json" and add "import json" alongside the other top-of-file imports so that the code that calls json.loads(result.stdout.strip()) uses the module-level json; ensure there are no name conflicts and run tests to verify nothing else relied on the local import.
46-53: `assert` in the subprocess script can be silently stripped by `-O`.

`assert t.sum().isfinite()` is the sole GPU correctness check in the per-device loop. If Python is ever invoked with `-O` (or `-OO`), this assertion is a no-op and the function silently returns `passed=True` regardless of the tensor value. Replace with an explicit check.

♻️ Proposed fix
```diff
-            assert t.sum().isfinite(), f"GPU {i}: non-finite sum"
+            if not t.sum().isfinite():
+                raise RuntimeError(f"GPU {i}: non-finite sum")
```

🤖 Prompt for AI Agents
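The `-O` behavior is easy to verify in isolation (the error message below is illustrative):

```python
import subprocess
import sys

# Under -O, `assert` statements are compiled out: this "failing" check exits 0.
stripped = subprocess.run(
    [sys.executable, "-O", "-c", "assert False, 'bad'"],
    capture_output=True,
)

# An explicit raise survives optimization flags.
explicit = subprocess.run(
    [sys.executable, "-O", "-c",
     "if True:\n    raise RuntimeError('GPU 0: non-finite sum')"],
    capture_output=True,
)
```

The first subprocess exits 0 despite the deliberately failing assertion; the second exits non-zero because `raise` is never stripped.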
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/cuda_validation.py` around lines 46 - 53, The GPU correctness check uses an assert that can be disabled with -O; replace the assert in the per-device loop (the line checking t.sum().isfinite() for tensor t on GPU index i) with an explicit conditional that treats non-finite results as an error: if the sum is not finite, append a descriptive error to results["errors"] (or raise an Exception) and set results["passed"]=False so the existing except/handling logic records the failure reliably even when Python is run with optimizations.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/fabric_check.py (1)
87-87: Use `Dict[str, List[float]]` for the metrics mapping type.

All three methods (`_fetch_metrics`, `_parse_prometheus_text`, `_sum_metric`) use the unparameterized `Dict[str, list]`. The demo counterpart was updated to `Dict[str, List[float]]` in a prior review; the production file still lags.

♻️ Proposed fix
```diff
-    def _fetch_metrics(self) -> Dict[str, list]:
+    def _fetch_metrics(self) -> Dict[str, List[float]]:

-    def _parse_prometheus_text(self, text: str) -> Dict[str, list]:
+    def _parse_prometheus_text(self, text: str) -> Dict[str, List[float]]:
-        metrics: Dict[str, list] = {}
+        metrics: Dict[str, List[float]] = {}

-    def _sum_metric(self, metrics: Dict[str, list], name: str) -> float:
+    def _sum_metric(self, metrics: Dict[str, List[float]], name: str) -> float:
```

Also applies to: 96-96, 117-117
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/fabric_check.py` at line 87, Update the typing for metrics maps to use List[float] instead of unparameterized list: change return/arg types in _fetch_metrics, _parse_prometheus_text, and _sum_metric from Dict[str, list] to Dict[str, List[float]] and add the List import from typing if missing; ensure all three function signatures and any local variable annotations that currently use Dict[str, list] are updated to Dict[str, List[float]] for consistency.

demos/fabric-manager-monitor/checks/fabric_check.py (2)
45-49: Narrow the exception type to `requests.exceptions.RequestException`.

Catching bare `Exception` silently swallows programming errors like `AttributeError` or `TypeError`. The only expected errors here are network/HTTP failures.

♻️ Proposed fix
```diff
-        except Exception as e:
+        except requests.exceptions.RequestException as e:
             return NVLinkStatus(
                 healthy=True,  # can't determine -- assume healthy
                 error=f"Failed to fetch DCGM metrics: {e}",
             )
```

🤖 Prompt for AI Agents
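The difference matters in practice: with a narrow catch, transient I/O degrades gracefully while programming errors still propagate. A standalone sketch, using a local stand-in class rather than the real `requests` exception:

```python
class RequestException(Exception):
    """Stand-in for requests.exceptions.RequestException (requests not used here)."""

def fetch_metrics(mode: str):
    # Hypothetical fetcher with two failure modes for illustration.
    if mode == "network":
        raise RequestException("connection refused")
    if mode == "bug":
        return None.decode()  # AttributeError: a programming error, not I/O

def check(mode: str) -> str:
    try:
        fetch_metrics(mode)
    except RequestException as e:
        return f"assume healthy: {e}"  # expected transient failure
    return "ok"

print(check("network"))  # assume healthy: connection refused
```

Calling `check("bug")` raises `AttributeError` instead of being mislabeled "assume healthy", which is exactly the behavior the review asks for.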
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/fabric_check.py` around lines 45 - 49, The except block in fabric_check.py that currently reads "except Exception as e" (returning NVLinkStatus with error about DCGM metrics) is too broad; narrow it to catch only network/HTTP errors by changing it to catch requests.exceptions.RequestException (or RequestException) and ensure you import RequestException from requests at the top of the module; keep the same return behavior (healthy=True and error message) but only for RequestException so programming errors (e.g., AttributeError/TypeError) still surface.
102-104: `_sum_metric` parameter type is less precise than its callers.

`_fetch_metrics` and `_parse_prometheus_text` return `Dict[str, List[float]]`, but `_sum_metric` accepts the looser `Dict[str, list]`. Align them.

♻️ Proposed fix
```diff
-    def _sum_metric(self, metrics: Dict[str, list], name: str) -> float:
+    def _sum_metric(self, metrics: Dict[str, List[float]], name: str) -> float:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@demos/fabric-manager-monitor/checks/fabric_check.py` around lines 102 - 104, The _sum_metric function's parameter type is too loose (Dict[str, list]) compared to its callers; change the type annotation of metrics in _sum_metric to Dict[str, List[float]] to match _fetch_metrics and _parse_prometheus_text, and add/ensure the List import from typing is present; update any related type hints or imports so the function signature reads metrics: Dict[str, List[float]] while keeping the float return.

health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py (5)
92-95: Annotate `DEFAULT_GPU_SERVICES` as `ClassVar`.

Ruff (RUF012) flags mutable class-level attributes that lack a `ClassVar` annotation.

♻️ Proposed fix
```diff
+from typing import ClassVar

 class ServiceChecker:
-    DEFAULT_GPU_SERVICES = [
+    DEFAULT_GPU_SERVICES: ClassVar[List[str]] = [
         "nvidia-persistenced",
         "nv-hostengine",
     ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py` around lines 92 - 95, DEFAULT_GPU_SERVICES is a mutable class-level list and should be annotated as a ClassVar to satisfy RUF012; update the declaration of DEFAULT_GPU_SERVICES to use typing.ClassVar (e.g., ClassVar[list[str]] or ClassVar[List[str]]), and add the required import(s) for ClassVar (and List if using List) at the top of the module so the class attribute is properly annotated as a non-instance variable.
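A minimal sketch of the annotated pattern — the `__init__` copy is an added illustration of why a shared mutable default wants guarding, not code from the PR:

```python
from typing import ClassVar, List, Optional

class ServiceChecker:
    # ClassVar marks this as shared class state rather than a per-instance
    # field, which is what Ruff RUF012 expects for mutable class-level defaults.
    DEFAULT_GPU_SERVICES: ClassVar[List[str]] = [
        "nvidia-persistenced",
        "nv-hostengine",
    ]

    def __init__(self, services: Optional[List[str]] = None) -> None:
        # Copy so per-instance mutation never leaks into the shared default.
        self.services = list(services or self.DEFAULT_GPU_SERVICES)

a = ServiceChecker()
a.services.append("dcgm-exporter")  # hypothetical extra service
b = ServiceChecker()
print(b.services)  # ['nvidia-persistenced', 'nv-hostengine']
```

Because each instance copies the list, mutating `a.services` leaves both the class default and later instances untouched.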
185-186: Silent `except Exception: pass` in `_get_restart_count` swallows unexpected errors.

The intent is to handle older systemd that doesn't expose `NRestarts`, which only warrants catching `subprocess.TimeoutExpired` and `ValueError` (from `int(val)`). Bare `except Exception: pass` also masks unrelated failures (e.g., a broken nsenter binary) with no diagnostic signal.

♻️ Proposed fix
```diff
-        except Exception:
-            pass
+        except (subprocess.TimeoutExpired, ValueError):
+            pass
+        except Exception:
+            log.debug("Unexpected error fetching NRestarts for %s", service_name, exc_info=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py` around lines 185 - 186, In _get_restart_count, remove the bare "except Exception: pass" that swallows errors; instead catch only the expected exceptions (subprocess.TimeoutExpired and ValueError from int(val)) and handle them as before, and for any other unexpected exceptions (e.g., subprocess.CalledProcessError or runtime errors) log the exception via the module logger or re-raise it so failures (like a broken nsenter) are not silently ignored; update the except blocks around the subprocess call and int conversion in _get_restart_count to implement this narrower handling and logging.
259-263: Redundant `subprocess.TimeoutExpired` in the `except` tuple; also move `return found` to `else`.

`subprocess.TimeoutExpired` is a subclass of `Exception`, so the tuple is equivalent to `except Exception`. Consider narrowing to specific types instead, and move `return found` to an `else` block (Ruff TRY300) for clearer control flow.

♻️ Proposed fix
```diff
-            return found
-
-        except (subprocess.TimeoutExpired, Exception) as e:
+        except Exception as e:
             log.warning(f"Journal parsing failed for {service_name}: {e}")
             return []
+        else:
+            return found
```

Or, if you only expect subprocess / OS failures:

```diff
-        except (subprocess.TimeoutExpired, Exception) as e:
+        except (subprocess.TimeoutExpired, OSError) as e:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py` around lines 259 - 263, The except clause currently catches (subprocess.TimeoutExpired, Exception) which is redundant because TimeoutExpired is an Exception; change the handler to catch only the specific error types you expect (e.g., subprocess.TimeoutExpired or OSError) rather than Exception, update the except line accordingly, and move the "return found" out of the try block into an else block so successful parsing returns found in the else branch while the except branch logs via log.warning(f"Journal parsing failed for {service_name}: {e}") and returns [].
88-95: Helm chart not included for the new health monitor.

Per project conventions, health monitors should ship a Helm chart under distros/kubernetes/nvsentinel/charts/. The PR does not appear to include one for fabric-manager-monitor, leaving Kubernetes deployment unguided.

Based on learnings: "Health monitors should include Helm chart in distros/kubernetes/nvsentinel/charts/."
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py` around lines 88 - 95, Add a Helm chart for the new fabric-manager-monitor so the ServiceChecker-based monitor (referencing ServiceChecker and DEFAULT_GPU_SERVICES) can be deployed via Kubernetes; create a chart named fabric-manager-monitor in the nvsentinel charts collection that includes Chart.yaml, values.yaml (exposing image, replicas, and GPU service list driven by DEFAULT_GPU_SERVICES), templates for Deployment/Service/ServiceAccount/ConfigMap (mounting any config/env needed by ServiceChecker), and update the nvsentinel umbrella/values to reference/install this new chart so the monitor is included in cluster deployments.
107-107: Prefer list unpacking over concatenation (RUF005).

♻️ Proposed fix
```diff
-        full_cmd = ["nsenter", "-t", "1", "-m", "--"] + cmd
+        full_cmd = ["nsenter", "-t", "1", "-m", "--", *cmd]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py` at line 107, The current construction of full_cmd uses list concatenation ("full_cmd = [...] + cmd"); change this to use list unpacking so the literal nsenter list and the cmd iterable are merged via the star-unpacking operator into a single list. Update the assignment for full_cmd (the variable built from the nsenter prefix and cmd) to use unpacking and ensure cmd is an iterable of strings before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@demos/fabric-manager-monitor/checks/cuda_validation.py`:
- Around line 68-104: Add a to_check_results(self) -> CheckResults (or the
project’s common check results type) method to class CUDAValidator that adapts a
CUDAValidationResult into the watcher pipeline format: call self.check() to get
a CUDAValidationResult, map its passed, error, gpu_count and errors fields into
the same structure other checkers (fabric, PCIe, clock) return, and return that
CheckResults object so watcher.py (called at watcher.py:301) can consume it;
refer to existing to_check_results implementations in the other checker classes
for the exact keys/shape and mirror their behavior in error/failed cases.
In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/fabric_check.py`:
- Around line 58-64: When _fetch_metrics raises we currently return
NVLinkStatus(healthy=True) but leave bandwidth_zero at its dataclass default
(True), which causes to_check_results to treat the status as degraded; change
the exception branch in the code that calls _fetch_metrics to return
NVLinkStatus with bandwidth_zero explicitly set to False (e.g.,
NVLinkStatus(healthy=True, bandwidth_zero=False, error=...)) so to_check_results
will not raise NvlinkFabricDegraded when we “assume healthy”; update the return
in the try/except around _fetch_metrics and keep the error text in the error
field.
In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/logger.py`:
- Around line 47-49: _parse_log_level currently falls back silently to
logging.INFO and its docstring omits valid keys; change _parse_log_level to
first normalize the input and check membership in _LEVEL_MAP, and if the key is
unknown call the module logger (or logging.warning) to emit a clear warning
including the invalid string and that logging.INFO will be used, then return the
default; also expand the _parse_log_level docstring to explicitly list accepted
keys (e.g. "debug", "info", "warning", "warn", "error", "critical") and note the
function is case-insensitive and strips whitespace so typos can be diagnosed via
the warning; reference symbols: _parse_log_level and _LEVEL_MAP.
In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`:
- Around line 97-137: The code has a TOCTOU race on self.entity_cache in
health_check_completed: concurrent callbacks can read the same stale/missing
cache entries, both decide to send identical HealthEvents, and then both write
back — causing duplicates; fix by adding a threading.Lock (e.g.,
self._entity_cache_lock) and use it to protect the read-check-update sequence
around the cache: acquire the lock when building cache_key via _build_cache_key
and checking self.entity_cache to decide which results need sending, mark those
keys as "in-flight" (or store pending_cache_updates) while still holding the
lock, then release the lock before calling
send_health_event_with_retries(health_events); after a successful send,
re-acquire the lock and commit the pending_cache_updates into self.entity_cache
(and clear any in-flight markers) so only one thread sends for a given
cache_key; apply this around the loop that builds health_events and when writing
to self.entity_cache to prevent overlapping sends.
---
Duplicate comments:
In `@demos/fabric-manager-monitor/k8s/daemonset.yaml`:
- Around line 1-20: The current DaemonSet (fabric-manager-monitor) lacks the
required kata vs regular node split and is missing its Helm chart; update the
deployment by creating two DaemonSet manifests (or templated Helm resources)
that target kata and regular nodes using nodeAffinity with the kata.enabled
label (e.g., separate DaemonSets named fabric-manager-monitor-kata and
fabric-manager-monitor-regular or Helm templates in daemonset*.yaml), and add a
Helm chart under distros/kubernetes/nvsentinel/charts/fabric-manager-monitor/
including README/docs that explain the DaemonSet variant selection logic and
helm values (kata.enabled flag) per the project coding guideline.
- Around line 46-49: The livenessProbe configured under livenessProbe.httpGet
currently targets path "/" which 404s because
prometheus_client.start_http_server() only serves "/metrics"; change the probe
to use path "/metrics" (i.e., update livenessProbe.httpGet.path to "/metrics")
so the probe checks the actual metrics endpoint; while here, verify the probe
uses the same port name "metrics" and consider adding sensible timing settings
(initialDelaySeconds/failureThreshold) on the livenessProbe if not already
present to avoid premature restarts.
- Around line 22-30: The DaemonSet is using hostPID and likely a privileged
container for nsenter; to satisfy Checkov add an explicit container
securityContext that sets allowPrivilegeEscalation: false (and, if you only need
nsenter, prefer dropping full privileged mode and grant minimal capabilities
such as CAP_SYS_ADMIN and optionally CAP_SYS_PTRACE via
securityContext.capabilities.add). Update the container spec (referencing the
DaemonSet container definitions where serviceAccountName and hostPID are set) to
include securityContext.allowPrivilegeEscalation: false (and optionally set
capabilities to ["SYS_ADMIN","SYS_PTRACE"] and remove privileged: true if
feasible) so policy engines stop flagging CKV_K8S_16/17/20/23.
In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py`:
- Around line 43-47: The NVSWITCH_ERROR patterns list contains a redundant
"NVSwitch" entry because _parse_journal_errors lowercases both the journal text
and patterns before matching; remove the duplicate ("NVSwitch" or "nvswitch")
from ErrorCategory.NVSWITCH_ERROR's list so each pattern is unique (leave only
one of "nvswitch" or "NVSwitch"), keeping the remaining patterns intact to
preserve matching behavior in _parse_journal_errors.
In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/logger.py`:
- Around line 37-44: The review noted a duplicate comment about the _LEVEL_MAP
dictionary; no code change is needed because _LEVEL_MAP already includes the
"critical": logging.CRITICAL entry — simply mark the duplicate review as
resolved and leave the _LEVEL_MAP constant in logger.py (symbol: _LEVEL_MAP)
unchanged.
---
Nitpick comments:
In `@demos/fabric-manager-monitor/checks/cuda_validation.py`:
- Around line 87-88: The local "import json" found inside the function in
cuda_validation.py should be moved to the module-level imports to comply with
PEP 8: remove the inline "import json" and add "import json" alongside the other
top-of-file imports so that the code that calls
json.loads(result.stdout.strip()) uses the module-level json; ensure there are
no name conflicts and run tests to verify nothing else relied on the local
import.
- Around line 46-53: The GPU correctness check uses an assert that can be
disabled with -O; replace the assert in the per-device loop (the line checking
t.sum().isfinite() for tensor t on GPU index i) with an explicit conditional
that treats non-finite results as an error: if the sum is not finite, append a
descriptive error to results["errors"] (or raise an Exception) and set
results["passed"]=False so the existing except/handling logic records the
failure reliably even when Python is run with optimizations.
In `@demos/fabric-manager-monitor/checks/fabric_check.py`:
- Around line 45-49: The except block in fabric_check.py that currently reads
"except Exception as e" (returning NVLinkStatus with error about DCGM metrics)
is too broad; narrow it to catch only network/HTTP errors by changing it to
catch requests.exceptions.RequestException (or RequestException) and ensure you
import RequestException from requests at the top of the module; keep the same
return behavior (healthy=True and error message) but only for RequestException
so programming errors (e.g., AttributeError/TypeError) still surface.
- Around line 102-104: The _sum_metric function's parameter type is too loose
(Dict[str, list]) compared to its callers; change the type annotation of metrics
in _sum_metric to Dict[str, List[float]] to match _fetch_metrics and
_parse_prometheus_text, and add/ensure the List import from typing is present;
update any related type hints or imports so the function signature reads
metrics: Dict[str, List[float]] while keeping the float return.
In `@demos/fabric-manager-monitor/k8s/daemonset.yaml`:
- Around line 62-68: The hostPath volume and volumeMount named "host-dbus" that
binds /var/run/dbus is likely redundant when the container uses nsenter to enter
the host mount namespace (nsenter) because /var/run/dbus/system_bus_socket will
already be accessible; remove the volumeMount entry (mountPath: /var/run/dbus)
and the volumes entry for "host-dbus" to avoid an unnecessary bind-mount, or if
there are any code paths that call D-Bus directly (e.g., systemctl calls not
wrapped in nsenter), keep the "host-dbus" entries but add a clear comment
referencing nsenter, systemctl, and system_bus_socket explaining why the
explicit hostPath mount is required.
In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/fabric_check.py`:
- Line 87: Update the typing for metrics maps to use List[float] instead of
unparameterized list: change return/arg types in _fetch_metrics,
_parse_prometheus_text, and _sum_metric from Dict[str, list] to Dict[str,
List[float]] and add the List import from typing if missing; ensure all three
function signatures and any local variable annotations that currently use
Dict[str, list] are updated to Dict[str, List[float]] for consistency.
In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/service_check.py`:
- Around line 92-95: DEFAULT_GPU_SERVICES is a mutable class-level list and
should be annotated as a ClassVar to satisfy RUF012; update the declaration of
DEFAULT_GPU_SERVICES to use typing.ClassVar (e.g., ClassVar[list[str]] or
ClassVar[List[str]]), and add the required import(s) for ClassVar (and List if
using List) at the top of the module so the class attribute is properly
annotated as a non-instance variable.
- Around line 185-186: In _get_restart_count, remove the bare "except Exception:
pass" that swallows errors; instead catch only the expected exceptions
(subprocess.TimeoutExpired and ValueError from int(val)) and handle them as
before, and for any other unexpected exceptions (e.g.,
subprocess.CalledProcessError or runtime errors) log the exception via the
module logger or re-raise it so failures (like a broken nsenter) are not
silently ignored; update the except blocks around the subprocess call and int
conversion in _get_restart_count to implement this narrower handling and
logging.
- Around line 259-263: The except clause currently catches
(subprocess.TimeoutExpired, Exception) which is redundant because TimeoutExpired
is an Exception; change the handler to catch only the specific error types you
expect (e.g., subprocess.TimeoutExpired or OSError) rather than Exception,
update the except line accordingly, and move the "return found" out of the try
block into an else block so successful parsing returns found in the else branch
while the except branch logs via log.warning(f"Journal parsing failed for
{service_name}: {e}") and returns [].
- Around line 88-95: Add a Helm chart for the new fabric-manager-monitor so the
ServiceChecker-based monitor (referencing ServiceChecker and
DEFAULT_GPU_SERVICES) can be deployed via Kubernetes; create a chart named
fabric-manager-monitor in the nvsentinel charts collection that includes
Chart.yaml, values.yaml (exposing image, replicas, and GPU service list driven
by DEFAULT_GPU_SERVICES), templates for
Deployment/Service/ServiceAccount/ConfigMap (mounting any config/env needed by
ServiceChecker), and update the nvsentinel umbrella/values to reference/install
this new chart so the monitor is included in cluster deployments.
- Line 107: The current construction of full_cmd uses list concatenation
("full_cmd = [...] + cmd"); change this to use list unpacking so the literal
nsenter list and the cmd iterable are merged via the star-unpacking operator
into a single list. Update the assignment for full_cmd (the variable built from
the nsenter prefix and cmd) to use unpacking and ensure cmd is an iterable of
strings before merging.
In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/logger.py`:
- Around line 58-59: The structlog processor function currently declares
parameters named logger and method_name which are required by the processor
protocol but unused and trigger Ruff ARG001; fix this by prefixing those
parameters with an underscore (e.g., change logger -> _logger and method_name ->
_method_name) in the processor function signature (and update any internal
references if they exist) so the names remain present for the protocol but are
treated as intentionally unused.
In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`:
- Line 22: The module currently imports the root logger via "import logging as
log", causing all calls to use the root logger; replace that with "import
logging" and add a module-scoped logger variable (logger =
logging.getLogger(__name__)), then update all uses of
log.debug/log.error/log.warning (etc.) in event_processor.py to use
logger.debug/logger.error/logger.warning so log records carry module-level
source and filtering. Ensure the new logger is configured consistently with
existing sibling modules (e.g., cuda_validation.py) and keep existing log
messages unchanged.
- Line 132: Replace the non-Pythonic truthiness check using len() by checking
the list directly: change the condition `if len(health_events):` to `if
health_events:` in the event processing code that references the variable
`health_events` (look for the block inside the EventProcessor/event processing
function in event_processor.py where `health_events` is evaluated) so the code
uses idiomatic Python truthiness.
- Around line 139-140: The except blocks in event_processor.py currently call
log.error(f"Exception while sending health events: {e}") which drops the
traceback; update those catch sites (the two except Exception handlers around
the send/publish logic in EventProcessor/event processing routine) to log the
full traceback by using log.exception("Exception while sending health events")
or log.error("Exception while sending health events", exc_info=True) instead so
the stacktrace is captured for debugging.
- Around line 149-161: The retry loop currently creates a new gRPC channel
inside the loop (grpc.insecure_channel) so each retry reconnects; move channel
creation outside the for loop and reuse the same channel and stub for all
retries: create chan = grpc.insecure_channel(f"unix://{self._socket_path}") and
stub = platformconnector_pb2_grpc.PlatformConnectorStub(chan) before the for _
in range(MAX_RETRIES) loop, then call stub.HealthEventOccurredV1(...) inside the
try block, keeping the existing backoff logic (INITIAL_DELAY, MAX_DELAY, sleep)
and metrics.events_sent_success.inc(), and ensure the channel is closed/cleaned
up after the loop completes or on success.
- Around line 87-89: The current timing context around
metrics.health_events_publish_time_to_grpc_channel.labels("health_check_completed_to_grpc_channel").time()
wraps the entire retry loop (including sleep(delay)), inflating latency; to fix,
move the .time() context so it only surrounds the actual gRPC call (the call to
the client/producer inside the retry loop) and remove sleeps from that scope,
and add a separate metric (e.g., health_events_retry_delay_seconds_total or
health_events_retries_count) updated inside the retry branch to record
cumulative backoff delay and retry count; reference the existing metric name
health_events_publish_time_to_grpc_channel and the retry/sleep logic in the
method to locate where to reposition the timer and where to increment the new
retry metric.
```python
class CUDAValidator:
    """Validates CUDA context creation and memory on each GPU."""

    def check(self) -> CUDAValidationResult:
        """Run CUDA validation script as a subprocess."""
        try:
            result = subprocess.run(
                [sys.executable, "-c", _CUDA_TEST_SCRIPT],
                capture_output=True,
                text=True,
                timeout=120,  # generous timeout for multi-GPU test
            )

            if result.returncode != 0:
                return CUDAValidationResult(
                    passed=False,
                    error=f"CUDA test script failed: {result.stderr.strip()}",
                )

            import json
            data = json.loads(result.stdout.strip())
            return CUDAValidationResult(
                passed=data.get("passed", False),
                gpu_count=data.get("gpu_count", 0),
                errors=data.get("errors", []),
            )

        except subprocess.TimeoutExpired:
            return CUDAValidationResult(
                passed=False,
                error="CUDA validation timed out",
            )
        except Exception as e:
            return CUDAValidationResult(
                passed=False,
                error=str(e),
            )
```
🧩 Analysis chain
🏁 Script executed:
rg -n "to_check_results" --type py -A3

Repository: NVIDIA/NVSentinel
Length of output: 4248
🏁 Script executed:
# Also check the full file length and content beyond line 104
wc -l demos/fabric-manager-monitor/checks/cuda_validation.py

Repository: NVIDIA/NVSentinel
Length of output: 119
🏁 Script executed:
# List all files in the checks directory to see other checker implementations
ls -la demos/fabric-manager-monitor/checks/

Repository: NVIDIA/NVSentinel
Length of output: 559
Add missing to_check_results method to CUDAValidator.
The demos/fabric-manager-monitor/checks/cuda_validation.py file (104 lines) is missing the to_check_results method. This method exists in the production code at health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/cuda_validation.py:120 and is required to integrate CUDA validation results with the watcher pipeline (called at watcher.py:301). All other checkers (fabric, PCIe, clock) implement this method following the same pattern.
🧰 Tools
🪛 Ruff (0.15.1)
[error] 74-74: subprocess call: check for execution of untrusted input
(S603)
[warning] 100-100: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@demos/fabric-manager-monitor/checks/cuda_validation.py` around lines 68 -
104, Add a to_check_results(self) -> CheckResults (or the project’s common check
results type) method to class CUDAValidator that adapts a CUDAValidationResult
into the watcher pipeline format: call self.check() to get a
CUDAValidationResult, map its passed, error, gpu_count and errors fields into
the same structure other checkers (fabric, PCIe, clock) return, and return that
CheckResults object so watcher.py (called at watcher.py:301) can consume it;
refer to existing to_check_results implementations in the other checker classes
for the exact keys/shape and mirror their behavior in error/failed cases.
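The adapter the prompt above describes can be sketched roughly as follows. Note that `CUDAValidationResult` and `CheckResult` here are hypothetical stand-in dataclasses (the real types live in the repo); the field names mirror what the event processor reads (`check_name`, `is_fatal`, `is_healthy`, `message`, `entities_impacted`), and the check name string is an assumption:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-ins for the project's real types.
@dataclass
class CUDAValidationResult:
    passed: bool
    gpu_count: int = 0
    error: str = ""
    errors: List[str] = field(default_factory=list)

@dataclass
class CheckResult:
    check_name: str
    is_healthy: bool
    is_fatal: bool
    message: str
    entities_impacted: List[dict] = field(default_factory=list)

def to_check_results(result: CUDAValidationResult) -> List[CheckResult]:
    """Adapt a CUDAValidationResult into the watcher's CheckResult list."""
    details = result.error or "; ".join(result.errors)
    return [
        CheckResult(
            check_name="CudaContextValidation",  # assumed name
            is_healthy=result.passed,
            is_fatal=not result.passed,
            message="CUDA validation passed" if result.passed
            else f"CUDA validation failed: {details}",
            entities_impacted=[
                {"entityType": "GPU", "entityValue": str(i)}
                for i in range(result.gpu_count)
            ],
        )
    ]
```

The real implementation should mirror the keys/shape the other checkers return, per the prompt above.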
```python
try:
    metrics = self._fetch_metrics()
except Exception as e:
    return NVLinkStatus(
        healthy=True,  # can't determine -- assume healthy
        error=f"Failed to fetch DCGM metrics: {e}",
    )
```
DCGM fetch failure silently triggers NvlinkFabricDegraded when Fabric Manager is also down.
When `_fetch_metrics` raises, the returned `NVLinkStatus` has `healthy=True` (as intended) but `bandwidth_zero=True` (the dataclass default). In `to_check_results` (line 130):

```python
fabric_nvlink_degraded = not status.healthy or (status.bandwidth_zero and fabric_manager_down)
#                      = False             or (True                 and True)
#                      = True  ← fires NvlinkFabricDegraded despite "assume healthy" intent
```

The `status.error` field is never consulted in `to_check_results`, so the false-positive path is silent. Explicitly set `bandwidth_zero=False` in the error branch:
🐛 Proposed fix

```diff
-        except Exception as e:
+        except requests.exceptions.RequestException as e:
             return NVLinkStatus(
                 healthy=True,  # can't determine -- assume healthy
+                bandwidth_zero=False,  # unknown; don't trigger false positive
                 error=f"Failed to fetch DCGM metrics: {e}",
             )
```

🧰 Tools
🪛 Ruff (0.15.1)
[warning] 60-60: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/checkers/fabric_check.py`
around lines 58 - 64, When _fetch_metrics raises we currently return
NVLinkStatus(healthy=True) but leave bandwidth_zero at its dataclass default
(True), which causes to_check_results to treat the status as degraded; change
the exception branch in the code that calls _fetch_metrics to return
NVLinkStatus with bandwidth_zero explicitly set to False (e.g.,
NVLinkStatus(healthy=True, bandwidth_zero=False, error=...)) so to_check_results
will not raise NvlinkFabricDegraded when we “assume healthy”; update the return
in the try/except around _fetch_metrics and keep the error text in the error
field.
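The failure mode above can be reproduced in isolation. The snippet below is a minimal reconstruction of the shapes described in the review (the real `NVLinkStatus` lives in `fabric_check.py`; the field defaults here are assumptions based on the review text):

```python
from dataclasses import dataclass

# Hypothetical reconstruction of the NVLinkStatus shape from the review;
# bandwidth_zero defaulting to True is the assumed root cause.
@dataclass
class NVLinkStatus:
    healthy: bool = True
    bandwidth_zero: bool = True  # dataclass default flagged by the review
    error: str = ""

def fabric_nvlink_degraded(status: NVLinkStatus, fabric_manager_down: bool) -> bool:
    # Expression from to_check_results (line 130, per the review)
    return not status.healthy or (status.bandwidth_zero and fabric_manager_down)

# Error branch as currently written: bandwidth_zero keeps its True default.
buggy = NVLinkStatus(healthy=True, error="Failed to fetch DCGM metrics: timeout")
assert fabric_nvlink_degraded(buggy, fabric_manager_down=True)  # false positive fires

# Error branch with the proposed fix: bandwidth_zero set explicitly.
fixed = NVLinkStatus(healthy=True, bandwidth_zero=False,
                     error="Failed to fetch DCGM metrics: timeout")
assert not fabric_nvlink_degraded(fixed, fabric_manager_down=True)
```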
```python
def _parse_log_level(level: str) -> int:
    """Convert a string log level to a logging constant."""
    return _LEVEL_MAP.get(level.lower().strip(), logging.INFO)
```
Silent fallback in `_parse_log_level` can mask misconfiguration; docstring also incomplete.
An unrecognised level string (e.g. a typo like `"debugg"`) silently produces `logging.INFO`, enabling full verbose output in production with no diagnostic. (Note that an env-var value like `"WARN"` is handled, since the input is lowercased.)
Separately, the `level` arg docstring on line 76 lists `"debug"`, `"info"`, `"warn"`, `"error"` but omits `"critical"` and `"warning"`, both of which are valid keys in `_LEVEL_MAP`.
🛡️ Proposed fix

```diff
 def _parse_log_level(level: str) -> int:
     """Convert a string log level to a logging constant."""
-    return _LEVEL_MAP.get(level.lower().strip(), logging.INFO)
+    normalized = level.lower().strip()
+    if normalized not in _LEVEL_MAP:
+        print(
+            f"[logger] Unknown log level {level!r}; defaulting to INFO.",
+            file=sys.stderr,
+        )
+    return _LEVEL_MAP.get(normalized, logging.INFO)
```

And update the docstring:

```diff
-        level: The log level as a string (e.g., "debug", "info", "warn", "error").
+        level: The log level as a string (e.g., "debug", "info", "warn",
+            "warning", "error", "critical").
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@health-monitors/fabric-manager-monitor/fabric_manager_monitor/logger.py`
around lines 47 - 49, _parse_log_level currently falls back silently to
logging.INFO and its docstring omits valid keys; change _parse_log_level to
first normalize the input and check membership in _LEVEL_MAP, and if the key is
unknown call the module logger (or logging.warning) to emit a clear warning
including the invalid string and that logging.INFO will be used, then return the
default; also expand the _parse_log_level docstring to explicitly list accepted
keys (e.g. "debug", "info", "warning", "warn", "error", "critical") and note the
function is case-insensitive and strips whitespace so typos can be diagnosed via
the warning; reference symbols: _parse_log_level and _LEVEL_MAP.
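A self-contained sketch of the strict parser described above. The `_LEVEL_MAP` keys here are assumed from the review text (the real map lives in `logger.py`):

```python
import logging
import sys

# Assumed contents of _LEVEL_MAP, per the review ("debug", "info",
# "warn", "warning", "error", "critical").
_LEVEL_MAP = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warn": logging.WARNING,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}

def parse_log_level(level: str) -> int:
    """Case-insensitive, whitespace-tolerant; warns on unknown input."""
    normalized = level.lower().strip()
    if normalized not in _LEVEL_MAP:
        # Emit a diagnostic instead of silently enabling INFO verbosity.
        print(f"[logger] Unknown log level {level!r}; defaulting to INFO.",
              file=sys.stderr)
        return logging.INFO
    return _LEVEL_MAP[normalized]
```

With this shape, `parse_log_level(" WARN ")` resolves to `logging.WARNING`, while a typo like `"debugg"` falls back to `INFO` with a visible stderr warning.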
```python
for result in results:
    cache_key = self._build_cache_key(result.check_name, result.entities_impacted)
    cached = self.entity_cache.get(cache_key)

    # Only send if state changed (or first observation)
    if cached is None or cached.is_fatal != result.is_fatal or cached.is_healthy != result.is_healthy:
        entities = [
            platformconnector_pb2.Entity(entityType=e["entityType"], entityValue=e["entityValue"])
            for e in result.entities_impacted
        ]

        recommended_action = self._get_recommended_action(result)

        health_event = platformconnector_pb2.HealthEvent(
            version=self._version,
            agent=self._agent,
            componentClass=self._component_class,
            checkName=result.check_name,
            isFatal=result.is_fatal,
            isHealthy=result.is_healthy,
            message=result.message,
            recommendedAction=recommended_action,
            errorCode=result.error_codes,
            entitiesImpacted=entities,
            metadata=result.metadata or {},
            generatedTimestamp=timestamp,
            nodeName=self._node_name,
            processingStrategy=self._processing_strategy,
        )
        health_events.append(health_event)
        pending_cache_updates[cache_key] = CachedEntityState(
            is_fatal=result.is_fatal, is_healthy=result.is_healthy
        )

log.debug(f"fabric manager health events to send: {len(health_events)}")
if len(health_events):
    try:
        if self.send_health_event_with_retries(health_events):
            # Only update cache after successful send
            for key, state in pending_cache_updates.items():
                self.entity_cache[key] = state
```
TOCTOU race on entity_cache when callbacks run concurrently.
`health_check_completed` reads `self.entity_cache`, builds events, then calls `send_health_event_with_retries` (which can block up to ~26 s across retries) before writing the cache. If a `ThreadPoolExecutor` fires overlapping callback invocations, two threads can both observe a stale/absent cache entry, both build identical events, and both send — producing duplicate `HealthEvent`s on every check cycle until the first write lands.
A `threading.Lock` guarding the read-check-update sequence eliminates the race:
🔒 Proposed fix

```diff
+import threading
 ...
 class PlatformConnectorEventProcessor(CallbackInterface):
     def __init__(self, ...):
         ...
         self.entity_cache: dict[str, CachedEntityState] = {}
+        self._cache_lock = threading.Lock()

     def health_check_completed(self, results: List[CheckResult]) -> None:
         ...
-        for result in results:
-            cache_key = self._build_cache_key(result.check_name, result.entities_impacted)
-            cached = self.entity_cache.get(cache_key)
-            if cached is None or ...:
-                ...
-                pending_cache_updates[cache_key] = ...
+        with self._cache_lock:
+            for result in results:
+                cache_key = self._build_cache_key(result.check_name, result.entities_impacted)
+                cached = self.entity_cache.get(cache_key)
+                if cached is None or ...:
+                    ...
+                    pending_cache_updates[cache_key] = ...

         if health_events:
             try:
                 if self.send_health_event_with_retries(health_events):
-                    for key, state in pending_cache_updates.items():
-                        self.entity_cache[key] = state
+                    with self._cache_lock:
+                        for key, state in pending_cache_updates.items():
+                            self.entity_cache[key] = state
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@health-monitors/fabric-manager-monitor/fabric_manager_monitor/platform_connector/event_processor.py`
around lines 97 - 137, The code has a TOCTOU race on self.entity_cache in
health_check_completed: concurrent callbacks can read the same stale/missing
cache entries, both decide to send identical HealthEvents, and then both write
back — causing duplicates; fix by adding a threading.Lock (e.g.,
self._entity_cache_lock) and use it to protect the read-check-update sequence
around the cache: acquire the lock when building cache_key via _build_cache_key
and checking self.entity_cache to decide which results need sending, mark those
keys as "in-flight" (or store pending_cache_updates) while still holding the
lock, then release the lock before calling
send_health_event_with_retries(health_events); after a successful send,
re-acquire the lock and commit the pending_cache_updates into self.entity_cache
(and clear any in-flight markers) so only one thread sends for a given
cache_key; apply this around the loop that builds health_events and when writing
to self.entity_cache to prevent overlapping sends.
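A runnable illustration of the locked read-check-commit pattern the prompt describes. All names here (`DedupSender`, the tuple-valued cache, the stubbed `_send`) are simplified stand-ins, not the processor's real API; this variant writes the cache under the lock before sending and rolls back on failure, which is one way to realize the "in-flight marker" idea:

```python
import threading
from typing import Dict, List, Tuple

class DedupSender:
    """Simplified sketch: dedupe sends by (is_fatal, is_healthy) state per key."""

    def __init__(self) -> None:
        self.entity_cache: Dict[str, Tuple[bool, bool]] = {}
        self._cache_lock = threading.Lock()
        self.sent: List[str] = []

    def _send(self, events: List[str]) -> bool:
        # Stand-in for send_health_event_with_retries.
        self.sent.extend(events)
        return True

    def process(self, results: List[Tuple[str, bool, bool]]) -> None:
        pending: Dict[str, Tuple[bool, bool]] = {}
        events: List[str] = []
        with self._cache_lock:  # read-check-update atomically
            for key, is_fatal, is_healthy in results:
                cached = self.entity_cache.get(key)
                if cached != (is_fatal, is_healthy):
                    events.append(key)
                    pending[key] = (is_fatal, is_healthy)
                    # Write immediately so overlapping callbacks skip this key.
                    self.entity_cache[key] = (is_fatal, is_healthy)
        if events and not self._send(events):
            with self._cache_lock:  # roll back on failed send
                for key in pending:
                    self.entity_cache.pop(key, None)
```

Because the decision and the cache write happen under one lock, a concurrent `process` call can never observe the stale pre-send state and duplicate the event; the trade-off versus commit-after-send is that a failed send needs an explicit rollback.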
hey @dmvevents, thanks for putting this together — the Fabric Manager monitoring gap is a real problem and it's great to see someone tackling it. For larger features like this, we typically start with a design doc to align on architecture early and reduce back-and-forth during review. Would you mind putting one together? A couple of things the design doc would help clarify:
Summary
Adds a new health monitor that detects Fabric Manager failures, PCIe link degradation, NVLink fabric issues, GPU clock throttling, and CUDA context failures — failure modes invisible to DCGM-based monitoring.
Closes #883 (NVSentinel not detecting fabric health on H100s)
Related: #889, #890
Motivation
A customer's NVIDIA Fabric Manager failed to start after maintenance and stayed broken for 20 days undetected. NVSentinel's existing monitors (DCGM-based, syslog-based) missed it because individual GPUs appear healthy to DCGM even when Fabric Manager is down. This PR fills the gap with service-level and infrastructure health checks.
What's Included
1. Native Integration (`health-monitors/fabric-manager-monitor/`)

Follows the same architecture as `gpu-health-monitor`.

2. Standalone Demo (`demos/fabric-manager-monitor/`)

Lightweight version exposing Prometheus metrics and alerts without requiring the full NVSentinel pipeline. Useful for evaluating the checks independently.
Check Categories

Service-level checks run via `nsenter` + `systemctl show`; GPU-level checks run via `nsenter` + `nvidia-smi`.

False-Positive Mitigations

Benign clock-throttle reasons (e.g. `0x0000000000000001`) are filtered out.

Validated On
Bug Fixes Discovered During Live Testing

- `nsenter` requires root — non-root `USER` in Dockerfile has empty `CapEff` even with `privileged: true`
- `NRestarts` property not recognized by systemd 252 (Amazon Linux 2023) — now queried separately with fallback
- Benign throttle reason (`0x0000000000000001`) false positive on clock throttle detection — added benign reasons filter

Test Plan

- `pytest tests/`
- `curl localhost:9101/metrics`
- `systemctl stop nvidia-fabricmanager` → `fabric_manager_up` drops to 0
New Features
Monitoring / Alerts
Packaging
Documentation
Tests