12 changes: 12 additions & 0 deletions demos/README.md
@@ -24,6 +24,18 @@ Interactive demonstrations of NVSentinel's core capabilities that run locally on

**Best for:** Understanding how NVSentinel's node-drainer can delegate pod eviction to external controllers for custom drain workflows coordinated with HPC schedulers.

### [Fabric Manager Monitor](fabric-manager-monitor/)

**What it shows:** Standalone DaemonSet that detects Fabric Manager failures, PCIe link degradation, NVLink fabric issues, GPU clock throttling, and CUDA context failures — all invisible to DCGM-based monitoring.

**Requirements:** Docker, kubectl, Kubernetes cluster with GPU nodes, Prometheus Operator

**Best for:** Catching GPU infrastructure failures that NVSentinel's existing health monitors miss. Validated on P4d.24xlarge (A100-SXM4) with Amazon Linux 2023.

**Related issue:** [#883](https://github.com/NVIDIA/NVSentinel/issues/883)

**Note:** For native NVSentinel integration (gRPC HealthEvents to platform-connector), see [`health-monitors/fabric-manager-monitor/`](../health-monitors/fabric-manager-monitor/).


## Coming Soon

21 changes: 21 additions & 0 deletions demos/fabric-manager-monitor/Dockerfile
@@ -0,0 +1,21 @@
FROM python:3.11-slim

# nsenter is in util-linux (already in slim), but ensure it's available
RUN apt-get update && apt-get install -y --no-install-recommends \
util-linux \
&& rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY config.py metrics.py monitor.py ./
COPY checks/ ./checks/

# nsenter requires root to enter host namespaces.
# The DaemonSet securityContext controls the actual privilege level.

EXPOSE 9101

ENTRYPOINT ["python", "monitor.py"]
84 changes: 84 additions & 0 deletions demos/fabric-manager-monitor/README.md
@@ -0,0 +1,84 @@
# Fabric Manager & GPU Node Health Validator

A standalone DaemonSet companion to NVSentinel that catches GPU infrastructure failures invisible to telemetry-based monitoring.

**Related issue:** [#883 - NVSentinel not detecting fabric health on H100s](https://github.com/NVIDIA/NVSentinel/issues/883)

## Problem

NVIDIA Fabric Manager can fail and remain broken for weeks without detection. NVSentinel's existing monitors (DCGM-based, syslog-based) miss it because individual GPUs appear healthy to DCGM even when Fabric Manager is down. This tool fills the gap with service-level health checks.

**Requirements:** Kubernetes cluster with GPU nodes, Prometheus Operator

## What It Monitors

| # | Check | What It Catches | Method |
|---|-------|-----------------|--------|
| 1 | **Fabric Manager Service** | FM not running, flapping, error state | `nsenter` + `systemctl` |
| 2 | **Critical GPU Services** | `nvidia-persistenced` or DCGM not running | `nsenter` + `systemctl` |
| 3 | **PCIe Link Health** | Link downtraining (Gen5->Gen3, x16->x8) | `nsenter` + `nvidia-smi` |
| 4 | **NVLink Fabric** | Bandwidth zero with FM down, CRC errors | DCGM metrics HTTP |
| 5 | **CUDA Validation** | Context failures, memory errors | PyTorch subprocess |
| 6 | **Clock & Throttle** | Silent throttling without XID | `nsenter` + `nvidia-smi` |
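
Checks 1 and 2 rely on entering the host's mount namespace from the DaemonSet pod so that `systemctl` talks to the host's systemd rather than the container's. A minimal sketch of that pattern (the function name and timeout are illustrative, not taken from the demo's code):

```python
import subprocess

def host_service_active(service: str, timeout: int = 10) -> bool:
    """Illustrative sketch: query a host systemd unit from inside a
    container by entering PID 1's mount namespace via nsenter.
    Requires hostPID and sufficient privileges in the pod spec."""
    try:
        result = subprocess.run(
            ["nsenter", "-t", "1", "-m", "--",
             "systemctl", "is-active", service],
            capture_output=True, text=True, timeout=timeout,
        )
        # systemctl is-active prints "active" and exits 0 when healthy
        return result.returncode == 0 and result.stdout.strip() == "active"
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # nsenter missing or the query hung: treat as not running
        return False
```

Without `hostPID: true` and adequate privileges, `nsenter -t 1` fails and the check conservatively reports the service as down.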

## Quick Start

```bash
# Build
docker build -t fabric-manager-monitor:latest .

# Deploy (assumes nvsentinel namespace exists)
kubectl apply -f k8s/rbac.yaml
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/daemonset.yaml
kubectl apply -f k8s/servicemonitor.yaml

# Verify
kubectl get ds -n nvsentinel fabric-manager-monitor

# Port-forward to a specific node's pod
NODE=<node-name>
POD=$(kubectl get pod -n nvsentinel --field-selector spec.nodeName=${NODE} \
  -l app=fabric-manager-monitor -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n nvsentinel pod/${POD} 9101:9101
curl -s localhost:9101/metrics | grep fabric_manager_up
```

## Metrics

Exposed on port 9101. Key metrics:

| Metric | Description |
|--------|-------------|
| `fabric_manager_up` | Fabric Manager running (1/0) |
| `gpu_node_health_up` | Overall node health (1/0) |
| `nvidia_service_up` | Per-service status |
| `pcie_link_degraded` | PCIe link degraded per GPU |
| `nvlink_fabric_healthy` | NVLink health |
| `gpu_clock_throttled` | Clock throttled per GPU |
| `gpu_clock_ratio` | Current/max clock ratio |
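
For quick ad-hoc inspection (for example, after the `curl` in Quick Start), the plain-text exposition format is easy to pick apart by hand. A rough sketch, not a substitute for a real Prometheus client library:

```python
def parse_metric(text: str, name: str) -> dict:
    """Minimal parser for Prometheus text-format gauge lines such as
    'fabric_manager_up 1' or 'pcie_link_degraded{gpu="0"} 0'.
    Maps label string -> value; a sketch for ad-hoc scripting only."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(name) or line.startswith("#"):
            continue
        metric, _, value = line.rpartition(" ")
        if metric == name or metric.startswith(name + "{"):
            labels = metric[len(name):].strip("{}")
            values[labels] = float(value)
    return values

sample = """# HELP fabric_manager_up Fabric Manager running
# TYPE fabric_manager_up gauge
fabric_manager_up 1
pcie_link_degraded{gpu="0"} 0
pcie_link_degraded{gpu="1"} 1
"""
print(parse_metric(sample, "fabric_manager_up"))  # → {'': 1.0}
print(parse_metric(sample, "pcie_link_degraded"))
```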

## Alert Rules

The `servicemonitor.yaml` manifest also defines a PrometheusRule with 7 alerts:
- `FabricManagerDown` (critical, 5m)
- `FabricManagerFlapping` (warning, 5m)
- `NVLinkFabricDegraded` (critical, 5m) — correlated: requires FM down AND NVLink degraded
- `GPUPCIeLinkDegraded` (warning, 5m)
- `GPUClockThrottled` (warning, 10m)
- `GPUServiceDown` (critical, 3m)
- `CUDAValidationFailed` (critical, 5m)
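
The correlation in `NVLinkFabricDegraded` guards against a common false positive: NVLink bandwidth counters also read zero on an idle fabric. A sketch of the intended logic, using the metric names from the table above (the helper itself is illustrative; the real alert would AND the two series in PromQL):

```python
def nvlink_alert_should_fire(fabric_manager_up: float,
                             nvlink_fabric_healthy: float) -> bool:
    """Fire only when Fabric Manager is down AND NVLink looks degraded;
    either signal alone is ambiguous (idle fabric, or FM mid-restart)."""
    return fabric_manager_up == 0 and nvlink_fabric_healthy == 0

# FM down and fabric degraded: alert
assert nvlink_alert_should_fire(0, 0)
# Idle fabric but FM healthy: no alert
assert not nvlink_alert_should_fire(1, 0)
```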

## Validated On

- 2x P4d.24xlarge (8x A100-SXM4-40GB each) — Amazon Linux 2023, EKS 1.32
- All 6 check categories produce correct metrics
- GPU Idle downclocking correctly filtered as benign

## Configuration

All settings via ConfigMap environment variables. See `k8s/configmap.yaml`.
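
A sketch of how such env-var driven configuration typically looks; the variable names and defaults below are illustrative assumptions (only `THROTTLE_RATIO`'s 0.85 default is grounded, matching `ClockChecker`), and the authoritative list lives in `k8s/configmap.yaml` and `config.py`:

```python
import os

# Hypothetical names/defaults for illustration; consult k8s/configmap.yaml
# for the real settings.
CHECK_INTERVAL_SECONDS = int(os.environ.get("CHECK_INTERVAL_SECONDS", "60"))
THROTTLE_RATIO = float(os.environ.get("THROTTLE_RATIO", "0.85"))  # ClockChecker default
METRICS_PORT = int(os.environ.get("METRICS_PORT", "9101"))
```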

## Relationship to NVSentinel

This is a **standalone companion tool** that exposes Prometheus metrics and alerts. It does not integrate with NVSentinel's gRPC event pipeline or remediation workflow. See the native `health-monitors/fabric-manager-monitor/` for an integrated version that emits HealthEvents to platform-connector.
15 changes: 15 additions & 0 deletions demos/fabric-manager-monitor/checks/__init__.py
@@ -0,0 +1,15 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Health check modules for GPU Node Health Validator."""
157 changes: 157 additions & 0 deletions demos/fabric-manager-monitor/checks/clock_check.py
@@ -0,0 +1,157 @@
"""Check 6: Clock and throttle detection.

Detects silent GPU throttling by comparing current clocks against maximum
and querying active throttle reasons. Catches performance degradation
that doesn't generate XID errors.
"""

import logging
import subprocess
from dataclasses import dataclass
from typing import List, Optional

logger = logging.getLogger(__name__)


@dataclass
class ClockStatus:
"""Clock and throttle status for a single GPU."""
gpu_index: int
graphics_clock_current: int # MHz
graphics_clock_max: int # MHz
mem_clock_current: int # MHz
mem_clock_max: int # MHz
clock_ratio: float # current/max (graphics)
throttled: bool
throttle_reasons: str = ""
error: Optional[str] = None


class ClockChecker:
"""Detects GPU clock throttling."""

def __init__(self, throttle_ratio: float = 0.85):
self._throttle_ratio = throttle_ratio

# Throttle reasons that are benign (not actual degradation)
_BENIGN_REASONS = {
"Not Active",
"0x0000000000000000", # No throttle
"0x0000000000000001", # GPU Idle — normal when no workload running
}

def check(self) -> List[ClockStatus]:
"""Query clocks and throttle reasons for all GPUs."""
clocks = self._query_clocks()
reasons = self._query_throttle_reasons()

# Merge throttle reasons into clock results
reason_map = {r["gpu_index"]: r["reasons"] for r in reasons}
for status in clocks:
reason_str = reason_map.get(status.gpu_index, "")
status.throttle_reasons = reason_str

# GPU Idle causes low clock ratio but isn't a real throttle.
# Only flag as throttled for non-benign reasons.
if reason_str in self._BENIGN_REASONS:
status.throttled = False
elif reason_str:
status.throttled = True

return clocks

def _query_clocks(self) -> List[ClockStatus]:
"""Get current vs max clocks from nvidia-smi."""
try:
result = subprocess.run(
[
"nsenter", "-t", "1", "-m", "--",
"nvidia-smi",
"--query-gpu=index,clocks.current.graphics,clocks.max.graphics,"
"clocks.current.memory,clocks.max.memory",
"--format=csv,noheader,nounits",
],
capture_output=True,
text=True,
timeout=15,
)

if result.returncode != 0:
logger.error("nvidia-smi clock query failed: %s", result.stderr.strip())
return []

return self._parse_clocks(result.stdout)

except subprocess.TimeoutExpired:
logger.error("nvidia-smi clock query timed out")
return []
except FileNotFoundError:
logger.error("nvidia-smi not found")
return []
except Exception as e:
logger.error("Clock check failed: %s", e)
return []

def _parse_clocks(self, output: str) -> List[ClockStatus]:
results = []
for line in output.strip().splitlines():
parts = [p.strip() for p in line.split(",")]
if len(parts) != 5:
continue
try:
idx = int(parts[0])
gfx_cur = int(parts[1])
gfx_max = int(parts[2])
mem_cur = int(parts[3])
mem_max = int(parts[4])

ratio = gfx_cur / gfx_max if gfx_max > 0 else 0.0
throttled = ratio < self._throttle_ratio

results.append(ClockStatus(
gpu_index=idx,
graphics_clock_current=gfx_cur,
graphics_clock_max=gfx_max,
mem_clock_current=mem_cur,
mem_clock_max=mem_max,
clock_ratio=round(ratio, 3),
throttled=throttled,
))
except (ValueError, IndexError, ZeroDivisionError) as e:
logger.warning("Failed to parse clock line '%s': %s", line, e)

return results

def _query_throttle_reasons(self) -> List[dict]:
"""Get active throttle reasons from nvidia-smi."""
try:
result = subprocess.run(
[
"nsenter", "-t", "1", "-m", "--",
"nvidia-smi",
"--query-gpu=index,clocks_throttle_reasons.active",
"--format=csv,noheader",
],
capture_output=True,
text=True,
timeout=15,
)

if result.returncode != 0:
return []

reasons = []
for line in result.stdout.strip().splitlines():
parts = [p.strip() for p in line.split(",", 1)]
if len(parts) == 2:
try:
reasons.append({
"gpu_index": int(parts[0]),
"reasons": parts[1],
})
except ValueError:
pass
return reasons

except Exception:
return []