12 changes: 12 additions & 0 deletions demos/README.md
@@ -24,6 +24,18 @@ Interactive demonstrations of NVSentinel's core capabilities that run locally on

**Best for:** Understanding how NVSentinel's node-drainer can delegate pod eviction to external controllers for custom drain workflows coordinated with HPC schedulers.

### [Fabric Manager Monitor](fabric-manager-monitor/)

**What it shows:** Standalone DaemonSet that detects Fabric Manager failures, PCIe link degradation, NVLink fabric issues, GPU clock throttling, and CUDA context failures — all invisible to DCGM-based monitoring.

**Requirements:** Docker, kubectl, Kubernetes cluster with GPU nodes, Prometheus Operator

**Best for:** Catching GPU infrastructure failures that NVSentinel's existing health monitors miss. Validated on P4d.24xlarge (A100-SXM4) with Amazon Linux 2023.

**Related issue:** [#883](https://github.com/NVIDIA/NVSentinel/issues/883)

**Note:** For native NVSentinel integration (gRPC HealthEvents to platform-connector), see [`health-monitors/fabric-manager-monitor/`](../health-monitors/fabric-manager-monitor/).


## Coming Soon

21 changes: 21 additions & 0 deletions demos/fabric-manager-monitor/Dockerfile
@@ -0,0 +1,21 @@
FROM python:3.11-slim

# nsenter is in util-linux (already in slim), but ensure it's available
RUN apt-get update && apt-get install -y --no-install-recommends \
util-linux \
&& rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY config.py metrics.py monitor.py ./
COPY checks/ ./checks/

# nsenter requires root to enter host namespaces.
# The DaemonSet securityContext controls the actual privilege level.

EXPOSE 9101

ENTRYPOINT ["python", "monitor.py"]
84 changes: 84 additions & 0 deletions demos/fabric-manager-monitor/README.md
@@ -0,0 +1,84 @@
# Fabric Manager & GPU Node Health Validator

A standalone DaemonSet companion to NVSentinel that catches GPU infrastructure failures invisible to telemetry-based monitoring.

**Related issue:** [#883 - NVSentinel not detecting fabric health on H100s](https://github.com/NVIDIA/NVSentinel/issues/883)

## Problem

NVIDIA Fabric Manager can fail and remain broken for weeks without detection. NVSentinel's existing monitors (DCGM-based, syslog-based) miss it because individual GPUs appear healthy to DCGM even when Fabric Manager is down. This tool fills the gap with service-level health checks.

**Requirements:** Kubernetes cluster with GPU nodes, Prometheus Operator

## What It Monitors

| # | Check | What It Catches | Method |
|---|-------|-----------------|--------|
| 1 | **Fabric Manager Service** | FM not running, flapping, error state | `nsenter` + `systemctl` |
| 2 | **Critical GPU Services** | `nvidia-persistenced` or DCGM not running | `nsenter` + `systemctl` |
| 3 | **PCIe Link Health** | Link downtraining (Gen5->Gen3, x16->x8) | `nsenter` + `nvidia-smi` |
| 4 | **NVLink Fabric** | Bandwidth zero with FM down, CRC errors | DCGM metrics HTTP |
| 5 | **CUDA Validation** | Context failures, memory errors | PyTorch subprocess |
| 6 | **Clock & Throttle** | Silent throttling without XID | `nsenter` + `nvidia-smi` |
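
Checks 1 and 2 rely on entering the host's mount namespace from the DaemonSet pod so that `systemctl` talks to the host's systemd rather than the container's. A minimal sketch of that pattern (the function name and timeout are illustrative, not taken from the demo's code):

```python
import subprocess

def host_service_active(service: str, timeout: int = 10) -> bool:
    """Illustrative sketch: query a host systemd unit from inside a
    container by entering PID 1's mount namespace via nsenter.
    Requires hostPID and sufficient privileges in the pod spec."""
    try:
        result = subprocess.run(
            ["nsenter", "-t", "1", "-m", "--",
             "systemctl", "is-active", service],
            capture_output=True, text=True, timeout=timeout,
        )
        # systemctl is-active prints "active" and exits 0 when healthy
        return result.returncode == 0 and result.stdout.strip() == "active"
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # nsenter missing or the query hung: treat as not running
        return False
```

Without `hostPID: true` and adequate privileges, `nsenter -t 1` fails and the check conservatively reports the service as down.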

## Quick Start

```bash
# Build
docker build -t fabric-manager-monitor:latest .

# Deploy (assumes nvsentinel namespace exists)
kubectl apply -f k8s/rbac.yaml
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/daemonset.yaml
kubectl apply -f k8s/servicemonitor.yaml

# Verify
kubectl get ds -n nvsentinel fabric-manager-monitor

# Port-forward to a specific node's pod
NODE=<node-name>
POD=$(kubectl get pod -n nvsentinel --field-selector spec.nodeName=${NODE} \
  -l app=fabric-manager-monitor -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n nvsentinel pod/${POD} 9101:9101
curl -s localhost:9101/metrics | grep fabric_manager_up
```

## Metrics

Exposed on port 9101. Key metrics:

| Metric | Description |
|--------|-------------|
| `fabric_manager_up` | Fabric Manager running (1/0) |
| `gpu_node_health_up` | Overall node health (1/0) |
| `nvidia_service_up` | Per-service status |
| `pcie_link_degraded` | PCIe link degraded per GPU |
| `nvlink_fabric_healthy` | NVLink health |
| `gpu_clock_throttled` | Clock throttled per GPU |
| `gpu_clock_ratio` | Current/max clock ratio |
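
For quick ad-hoc inspection (for example, after the `curl` in Quick Start), the plain-text exposition format is easy to pick apart by hand. A rough sketch, not a substitute for a real Prometheus client library:

```python
def parse_metric(text: str, name: str) -> dict:
    """Minimal parser for Prometheus text-format gauge lines such as
    'fabric_manager_up 1' or 'pcie_link_degraded{gpu="0"} 0'.
    Maps label string -> value; a sketch for ad-hoc scripting only."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(name) or line.startswith("#"):
            continue
        metric, _, value = line.rpartition(" ")
        if metric == name or metric.startswith(name + "{"):
            labels = metric[len(name):].strip("{}")
            values[labels] = float(value)
    return values

sample = """# HELP fabric_manager_up Fabric Manager running
# TYPE fabric_manager_up gauge
fabric_manager_up 1
pcie_link_degraded{gpu="0"} 0
pcie_link_degraded{gpu="1"} 1
"""
print(parse_metric(sample, "fabric_manager_up"))  # → {'': 1.0}
print(parse_metric(sample, "pcie_link_degraded"))
```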

## Alert Rules

The `servicemonitor.yaml` manifest also defines a PrometheusRule with 7 alerts:
- `FabricManagerDown` (critical, 5m)
- `FabricManagerFlapping` (warning, 5m)
- `NVLinkFabricDegraded` (critical, 5m) — correlated: requires FM down AND NVLink degraded
- `GPUPCIeLinkDegraded` (warning, 5m)
- `GPUClockThrottled` (warning, 10m)
- `GPUServiceDown` (critical, 3m)
- `CUDAValidationFailed` (critical, 5m)
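
The correlation in `NVLinkFabricDegraded` guards against a common false positive: NVLink bandwidth counters also read zero on an idle fabric. A sketch of the intended logic, using the metric names from the table above (the helper itself is illustrative; the real alert would AND the two series in PromQL):

```python
def nvlink_alert_should_fire(fabric_manager_up: float,
                             nvlink_fabric_healthy: float) -> bool:
    """Fire only when Fabric Manager is down AND NVLink looks degraded;
    either signal alone is ambiguous (idle fabric, or FM mid-restart)."""
    return fabric_manager_up == 0 and nvlink_fabric_healthy == 0

# FM down and fabric degraded: alert
assert nvlink_alert_should_fire(0, 0)
# Idle fabric but FM healthy: no alert
assert not nvlink_alert_should_fire(1, 0)
```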

## Validated On

- 2x P4d.24xlarge (8x A100-SXM4-40GB each) — Amazon Linux 2023, EKS 1.32
- All 6 check categories produce correct metrics
- GPU Idle downclocking correctly filtered as benign

## Configuration

All settings via ConfigMap environment variables. See `k8s/configmap.yaml`.
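
A sketch of how such env-var driven configuration typically looks; the variable names and defaults below are illustrative assumptions (only `THROTTLE_RATIO`'s 0.85 default is grounded, matching `ClockChecker`), and the authoritative list lives in `k8s/configmap.yaml` and `config.py`:

```python
import os

# Hypothetical names/defaults for illustration; consult k8s/configmap.yaml
# for the real settings.
CHECK_INTERVAL_SECONDS = int(os.environ.get("CHECK_INTERVAL_SECONDS", "60"))
THROTTLE_RATIO = float(os.environ.get("THROTTLE_RATIO", "0.85"))  # ClockChecker default
METRICS_PORT = int(os.environ.get("METRICS_PORT", "9101"))
```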

## Relationship to NVSentinel

This is a **standalone companion tool** that exposes Prometheus metrics and alerts. It does not integrate with NVSentinel's gRPC event pipeline or remediation workflow. See the native `health-monitors/fabric-manager-monitor/` for an integrated version that emits HealthEvents to platform-connector.
15 changes: 15 additions & 0 deletions demos/fabric-manager-monitor/checks/__init__.py
@@ -0,0 +1,15 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Health check modules for GPU Node Health Validator."""
157 changes: 157 additions & 0 deletions demos/fabric-manager-monitor/checks/clock_check.py
@@ -0,0 +1,157 @@
"""Check 6: Clock and throttle detection.

Detects silent GPU throttling by comparing current clocks against maximum
and querying active throttle reasons. Catches performance degradation
that doesn't generate XID errors.
"""

import logging
import subprocess
from dataclasses import dataclass
from typing import List, Optional

logger = logging.getLogger(__name__)


@dataclass
class ClockStatus:
"""Clock and throttle status for a single GPU."""
gpu_index: int
graphics_clock_current: int # MHz
graphics_clock_max: int # MHz
mem_clock_current: int # MHz
mem_clock_max: int # MHz
clock_ratio: float # current/max (graphics)
throttled: bool
throttle_reasons: str = ""
error: Optional[str] = None


class ClockChecker:
"""Detects GPU clock throttling."""

def __init__(self, throttle_ratio: float = 0.85):
self._throttle_ratio = throttle_ratio

# Throttle reasons that are benign (not actual degradation)
_BENIGN_REASONS = {
"Not Active",
"0x0000000000000000", # No throttle
"0x0000000000000001", # GPU Idle — normal when no workload running
}

def check(self) -> List[ClockStatus]:
"""Query clocks and throttle reasons for all GPUs."""
clocks = self._query_clocks()
reasons = self._query_throttle_reasons()

# Merge throttle reasons into clock results
reason_map = {r["gpu_index"]: r["reasons"] for r in reasons}
for status in clocks:
reason_str = reason_map.get(status.gpu_index, "")
status.throttle_reasons = reason_str

# GPU Idle causes low clock ratio but isn't a real throttle.
# Only flag as throttled for non-benign reasons.
if reason_str in self._BENIGN_REASONS:
status.throttled = False
elif reason_str:
status.throttled = True

return clocks

def _query_clocks(self) -> List[ClockStatus]:
"""Get current vs max clocks from nvidia-smi."""
try:
result = subprocess.run(
[
"nsenter", "-t", "1", "-m", "--",
"nvidia-smi",
"--query-gpu=index,clocks.current.graphics,clocks.max.graphics,"
"clocks.current.memory,clocks.max.memory",
"--format=csv,noheader,nounits",
],
capture_output=True,
text=True,
timeout=15,
)

if result.returncode != 0:
logger.error("nvidia-smi clock query failed: %s", result.stderr.strip())
return []

return self._parse_clocks(result.stdout)

except subprocess.TimeoutExpired:
logger.error("nvidia-smi clock query timed out")
return []
except FileNotFoundError:
logger.error("nvidia-smi not found")
return []
except Exception as e:
logger.error("Clock check failed: %s", e)
return []

def _parse_clocks(self, output: str) -> List[ClockStatus]:
results = []
for line in output.strip().splitlines():
parts = [p.strip() for p in line.split(",")]
if len(parts) != 5:
continue
try:
idx = int(parts[0])
gfx_cur = int(parts[1])
gfx_max = int(parts[2])
mem_cur = int(parts[3])
mem_max = int(parts[4])

ratio = gfx_cur / gfx_max if gfx_max > 0 else 0.0
throttled = ratio < self._throttle_ratio

results.append(ClockStatus(
gpu_index=idx,
graphics_clock_current=gfx_cur,
graphics_clock_max=gfx_max,
mem_clock_current=mem_cur,
mem_clock_max=mem_max,
clock_ratio=round(ratio, 3),
throttled=throttled,
))
except (ValueError, IndexError, ZeroDivisionError) as e:
logger.warning("Failed to parse clock line '%s': %s", line, e)

return results

def _query_throttle_reasons(self) -> List[dict]:
"""Get active throttle reasons from nvidia-smi."""
try:
result = subprocess.run(
[
"nsenter", "-t", "1", "-m", "--",
"nvidia-smi",
"--query-gpu=index,clocks_throttle_reasons.active",
"--format=csv,noheader",
],
capture_output=True,
text=True,
timeout=15,
)

if result.returncode != 0:
return []

reasons = []
for line in result.stdout.strip().splitlines():
parts = [p.strip() for p in line.split(",", 1)]
if len(parts) == 2:
try:
reasons.append({
"gpu_index": int(parts[0]),
"reasons": parts[1],
})
except ValueError:
pass
return reasons

except Exception:
return []