Skip to content

Refactor health monitoring logic and snapshot collection#333

Open
krataratha wants to merge 1 commit into
rohitg00:mainfrom
krataratha:patch-2
Open

Refactor health monitoring logic and snapshot collection#333
krataratha wants to merge 1 commit into
rohitg00:mainfrom
krataratha:patch-2

Conversation

@krataratha
Copy link
Copy Markdown

@krataratha krataratha commented May 13, 2026

Summary by CodeRabbit

  • New Features
    • Health monitoring now runs every 5 seconds for faster responsiveness.
    • Enhanced alerting system with state awareness detects and reports critical system conditions.
    • New health alert and monitor status events available for integration.

Review Change Stack

Copilot AI review requested due to automatic review settings May 13, 2026 12:48
@vercel
Copy link
Copy Markdown

vercel Bot commented May 13, 2026

@krataratha is attempting to deploy a commit to the rohitg00's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

📝 Walkthrough

Walkthrough

The health monitor now collects event loop lag via performance.now() deltas and constructs a lean HealthSnapshot. It evaluates health status, reads prior state from KV, emits health_alert on critical transitions, and persists snapshots and status. The monitoring loop runs every 5 seconds with proper cleanup.

Changes

Health Monitoring Collection and Alerting

Layer / File(s) Summary
Health snapshot collection with event loop lag
src/health/monitor.ts
Snapshot now measures eventLoopLag using performance.now() deltas and populates CPU, memory RSS/heap, uptime, connection state, and timestamp fields.
Health evaluation and KV-backed alerting
src/health/monitor.ts
Health evaluation reads prior status from KV, emits health_alert on critical state transitions, and persists snapshot and updated status back to KV.
Monitor loop and lifecycle management
src/health/monitor.ts
Monitor runs collectHealth every 5 seconds; stop() clears the interval and emits monitor_stopped event.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit bounds through metrics bright,
Collecting lags from midnight to light,
When health goes critical, alerts take flight,
Five seconds ticking in steady sight,
Then cleanup hops when monitoring takes flight! 🐰

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Refactor health monitoring logic and snapshot collection' accurately reflects the main changes: refactoring the health monitoring flow including health snapshot collection, evaluation, and alerting mechanisms.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR attempts to refactor the health monitoring implementation to collect a health snapshot, evaluate it, persist it to KV, and emit alerts when transitioning into a critical state.

Changes:

  • Reworked snapshot collection to include event loop lag measurement and additional persisted state.
  • Added state-aware alert emission logic intended to trigger only on critical transitions.
  • Updated interval lifecycle management (start/stop) for the monitor.
Comments suppressed due to low confidence (4)

src/health/monitor.ts:23

  • evaluateHealth returns { status, alerts, notes }, but this code treats the result as if it had isCritical and level properties. Update this logic to use evaluated.status === "critical" (and persist/compare that string) rather than status.isCritical/status.level.
    const status = evaluateHealth(snapshot);
    
    // Feature: Persistence & State-Aware Alerting
    const lastStatus = await kv.get(KV.LAST_HEALTH_STATUS);
    if (status.isCritical && lastStatus !== "critical") {
      sdk.emit?.("health_alert", { snapshot, status });
    }

    await kv.set(KV.LATEST_HEALTH, snapshot);
    await kv.set(KV.LAST_HEALTH_STATUS, status.level);

src/health/monitor.ts:23

  • StateKV.get/set require (scope, key, ...), but these calls pass a single argument (and use KV.LATEST_HEALTH/KV.LAST_HEALTH_STATUS, which don’t exist in src/state/schema.ts). Use the existing KV.health scope with explicit keys (e.g., "latest", "last_status"), or extend the KV schema and update all call sites accordingly.
    const lastStatus = await kv.get(KV.LAST_HEALTH_STATUS);
    if (status.isCritical && lastStatus !== "critical") {
      sdk.emit?.("health_alert", { snapshot, status });
    }

    await kv.set(KV.LATEST_HEALTH, snapshot);
    await kv.set(KV.LAST_HEALTH_STATUS, status.level);

src/health/monitor.ts:40

  • After the return/closing brace here, the file continues with more await statements and another snapshot/interval definition. This indicates the refactor left behind a second copy of the old implementation (and likely mismatched braces), which will either be unreachable or cause parse/type errors. Remove the duplicate block and ensure there is exactly one coherent implementation/interval lifecycle.
  const interval = setInterval(collectHealth, 5000);
  return { 
    stop: () => {
      clearInterval(interval);
      sdk.emit?.("monitor_stopped", { at: Date.now() });
    } 
  };
}
    const eventLoopLagMs = performance.now() - startMark;

    let workers: HealthSnapshot["workers"] = [];
    try {
      const result = await sdk.trigger<

src/health/monitor.ts:35

  • This module no longer exports registerHealthMonitor, but it’s imported/used from src/index.ts (import { registerHealthMonitor } ... and registerHealthMonitor(sdk, kv)). Either reintroduce that export (recommended) or update the entrypoint and any other call sites to the new API so the build doesn’t break.
  const interval = setInterval(collectHealth, 5000);
  return { 
    stop: () => {
      clearInterval(interval);
      sdk.emit?.("monitor_stopped", { at: Date.now() });
    } 
  };
}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/health/monitor.ts
Comment on lines +1 to 3
const endMark = performance.now();
const eventLoopLag = endMark - startMark;

Comment thread src/health/monitor.ts
Comment on lines +4 to +12
const snapshot: HealthSnapshot = {
cpuUsage: cpuPercent,
memoryRss: mem.rss / 1024 / 1024,
memoryHeapUsed: mem.heapUsed / 1024 / 1024,
eventLoopLag,
uptime,
connectionState,
timestamp: now,
};
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
src/health/monitor.ts (1)

32-32: 💤 Low value

Consider using ISO string format for timestamp consistency.

The coding guideline recommends new Date().toISOString() for timestamps. While this is a single call (not multiple), using ISO string format ensures consistency with other timestamps in the codebase (like line 11 which uses now that's presumably an ISO string).

♻️ Optional refactor for consistency
-    sdk.emit?.("monitor_stopped", { at: Date.now() });
+    sdk.emit?.("monitor_stopped", { at: new Date().toISOString() });

As per coding guidelines: "Capture timestamps once with new Date().toISOString() and reuse instead of calling Date multiple times"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/health/monitor.ts` at line 32, The timestamp passed to
sdk.emit("monitor_stopped", ...) should use a single ISO string timestamp for
consistency; capture const at = new Date().toISOString() once and pass that
variable (e.g., { at }) to sdk.emit so the emitted timestamp matches other ISO
timestamps in the codebase and avoids multiple Date.now()/Date() calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/health/monitor.ts`:
- Around line 17-20: The current alerting emits a health_alert on first run if
the system is already critical because lastStatus from
kv.get(KV.LAST_HEALTH_STATUS) can be null/undefined; update the condition that
triggers sdk.emit("health_alert", { snapshot, status }) to skip the first-run
case by ensuring lastStatus is not null/undefined (e.g., check lastStatus !=
null) before comparing to "critical" so only transitions from a known
non-critical state emit an alert; make this change near the code using kv.get,
KV.LAST_HEALTH_STATUS, lastStatus, status.isCritical, and
sdk.emit("health_alert").
- Around line 28-34: Call collectHealth immediately on startup and ensure both
the immediate call and the scheduled calls handle rejections: invoke
collectHealth() once before setting the interval and attach .catch(...) to
log/emit errors; then create the interval with const interval = setInterval(()
=> collectHealth().catch(err => { /* log via processLogger or
sdk.emit("monitor_error", {err}) */ }), 5000); and call interval.unref() after
creating the timer so it won’t keep the process alive; keep the existing stop
implementation that clearInterval(interval) and sdk.emit("monitor_stopped", {
at: Date.now() }).
- Around line 1-2: In collectHealth(), startMark is never defined before it's
used to compute eventLoopLag; add capturing of the start timestamp (e.g., const
startMark = performance.now()) at the beginning of the collectHealth() execution
path so that the existing lines const endMark = performance.now(); and const
eventLoopLag = endMark - startMark; use a valid startMark; ensure the symbol
names startMark, endMark, and eventLoopLag remain consistent with the current
function.

---

Nitpick comments:
In `@src/health/monitor.ts`:
- Line 32: The timestamp passed to sdk.emit("monitor_stopped", ...) should use a
single ISO string timestamp for consistency; capture const at = new
Date().toISOString() once and pass that variable (e.g., { at }) to sdk.emit so
the emitted timestamp matches other ISO timestamps in the codebase and avoids
multiple Date.now()/Date() calls.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b8bcfa96-e578-4330-b0d1-5a07a71f11aa

📥 Commits

Reviewing files that changed from the base of the PR and between cba234d and 21acd2e.

📒 Files selected for processing (1)
  • src/health/monitor.ts

Comment thread src/health/monitor.ts
Comment on lines +1 to +2
const endMark = performance.now();
const eventLoopLag = endMark - startMark;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify that startMark is defined before the event loop lag calculation

rg -n -B 20 'const endMark = performance\.now\(\)' src/health/monitor.ts | rg 'startMark'

Repository: rohitg00/agentmemory

Length of output: 46


🏁 Script executed:

# Check if file exists and get basic info
fd -t f "monitor\.ts" src/

# Read the file to see actual content
head -100 src/health/monitor.ts

Repository: rohitg00/agentmemory

Length of output: 3011


🏁 Script executed:

# Search for startMark definition/usage in the entire file
rg -n 'startMark' src/health/monitor.ts

# Also get the full file to understand structure
wc -l src/health/monitor.ts
cat -n src/health/monitor.ts

Repository: rohitg00/agentmemory

Length of output: 4100


🏁 Script executed:

# Search for function declarations that might contain startMark
rg -n 'function|const.*=.*function|const.*=.*\(\)' src/health/monitor.ts | head -20

# Also search for registerHealthMonitor or collectHealth function
rg -n 'registerHealthMonitor|collectHealth' src/health/monitor.ts

# Search for where performance.now is called to capture startMark
rg -n 'performance\.now\(\)' src/health/monitor.ts

Repository: rohitg00/agentmemory

Length of output: 887


🏁 Script executed:

# Get the beginning of the file to find function declaration
head -50 src/health/monitor.ts | tail -40

# Also try to find the function signature that contains collectHealth
ast-grep --pattern 'const collectHealth = $_'

# Or search in broader context
rg -n 'const startMark' src/health/monitor.ts

Repository: rohitg00/agentmemory

Length of output: 1209


🏁 Script executed:

# Try to get the very beginning of the file
head -1 src/health/monitor.ts

# Get more context around line 1
sed -n '1,30p' src/health/monitor.ts

# Search for the actual function definition that contains this code
rg -B 50 'const endMark = performance\.now' src/health/monitor.ts | head -60

Repository: rohitg00/agentmemory

Length of output: 965


🏁 Script executed:

# Get the entire file to understand full context
cat src/health/monitor.ts

# Also search for where startMark might be initialized
rg 'startMark\s*=' src/health/monitor.ts

Repository: rohitg00/agentmemory

Length of output: 3195


🏁 Script executed:

# Let me check the raw file with more context - search for function definitions
rg -n 'export.*function|function.*registerHealthMonitor|async function' src/health/monitor.ts

# Check if startMark might be a parameter or closure variable
rg -B 100 'const endMark = performance.now' src/health/monitor.ts | head -120

Repository: rohitg00/agentmemory

Length of output: 146


🏁 Script executed:

# Search for all variable declarations including startMark context
rg -A 5 -B 5 'startMark' src/health/monitor.ts

# Look for function parameters that might include startMark
rg '\(.*startMark.*\)' src/health/monitor.ts

# Check if maybe the file content is incomplete - look for git diff info
git diff src/health/monitor.ts 2>/dev/null | head -100

Repository: rohitg00/agentmemory

Length of output: 587


🏁 Script executed:

# Try to find the complete function - search for "registerHealthMonitor" or similar
rg -n 'registerHealthMonitor|export.*function.*Health' src/health/monitor.ts

# Search for where the collectHealth function is defined
rg -B 30 'const endMark = performance.now' src/health/monitor.ts | grep -E '(function|const|let|var)' | head -20

# Try ast-grep to find function definitions
ast-grep --pattern 'function $_($_) { $$$ }'
</parameter>

Repository: rohitg00/agentmemory

Length of output: 50501


Define startMark before using it in the event loop lag calculation.

The variable startMark is referenced on line 2 but is never defined anywhere in the function. This will cause a runtime error. Ensure startMark is captured at the start of collectHealth() with const startMark = performance.now(); before the endMark calculation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/health/monitor.ts` around lines 1 - 2, In collectHealth(), startMark is
never defined before it's used to compute eventLoopLag; add capturing of the
start timestamp (e.g., const startMark = performance.now()) at the beginning of
the collectHealth() execution path so that the existing lines const endMark =
performance.now(); and const eventLoopLag = endMark - startMark; use a valid
startMark; ensure the symbol names startMark, endMark, and eventLoopLag remain
consistent with the current function.

Comment thread src/health/monitor.ts
Comment on lines +17 to +20
const lastStatus = await kv.get(KV.LAST_HEALTH_STATUS);
if (status.isCritical && lastStatus !== "critical") {
sdk.emit?.("health_alert", { snapshot, status });
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle first-run case in alerting logic.

On the first monitor run, lastStatus will be null or undefined. The condition lastStatus !== "critical" will be true, causing an alert to be emitted immediately if the system starts in a critical state. Consider whether you want to alert on initial critical status or only on transitions from non-critical to critical states.

🛡️ Proposed fix to skip alerting on first run
 const lastStatus = await kv.get(KV.LAST_HEALTH_STATUS);
-if (status.isCritical && lastStatus !== "critical") {
+if (status.isCritical && lastStatus && lastStatus !== "critical") {
   sdk.emit?.("health_alert", { snapshot, status });
 }

Or explicitly handle first run:

 const lastStatus = await kv.get(KV.LAST_HEALTH_STATUS);
-if (status.isCritical && lastStatus !== "critical") {
+if (status.isCritical && lastStatus !== null && lastStatus !== "critical") {
   sdk.emit?.("health_alert", { snapshot, status });
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/health/monitor.ts` around lines 17 - 20, The current alerting emits a
health_alert on first run if the system is already critical because lastStatus
from kv.get(KV.LAST_HEALTH_STATUS) can be null/undefined; update the condition
that triggers sdk.emit("health_alert", { snapshot, status }) to skip the
first-run case by ensuring lastStatus is not null/undefined (e.g., check
lastStatus != null) before comparing to "critical" so only transitions from a
known non-critical state emit an alert; make this change near the code using
kv.get, KV.LAST_HEALTH_STATUS, lastStatus, status.isCritical, and
sdk.emit("health_alert").

Comment thread src/health/monitor.ts
Comment on lines +28 to +34
const interval = setInterval(collectHealth, 5000);
return {
stop: () => {
clearInterval(interval);
sdk.emit?.("monitor_stopped", { at: Date.now() });
}
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚖️ Poor tradeoff

Add error handling, initial call, and interval.unref().

Three concerns with the monitoring loop:

  1. Missing error handling: collectHealth is async but has no .catch() handler. Unhandled promise rejections could crash the process or be silently swallowed.
  2. Missing interval.unref(): The old implementation (line 99) called interval.unref() to prevent the interval from keeping the process alive. Without it, the process won't exit cleanly when shutdown is initiated.
  3. No initial collection: The old code (line 95) called collectHealth() immediately on startup. The new code has a 5-second delay before the first health check, meaning no health data is available initially.
🔧 Proposed fix
+collectHealth().catch(() => {});
 const interval = setInterval(collectHealth, 5000);
+interval.unref();
 return { 
   stop: () => {
     clearInterval(interval);
     sdk.emit?.("monitor_stopped", { at: Date.now() });
   } 
 };

For better error handling, consider logging errors instead of silently catching:

-collectHealth().catch(() => {});
+collectHealth().catch((err) => {
+  sdk.emit?.("monitor_error", { error: String(err) });
+});
 const interval = setInterval(() => {
-  collectHealth();
+  collectHealth().catch((err) => {
+    sdk.emit?.("monitor_error", { error: String(err) });
+  });
 }, 5000);
🧰 Tools
🪛 Biome (2.4.15)

[error] 29-34: Illegal return statement outside of a function

(parse)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/health/monitor.ts` around lines 28 - 34, Call collectHealth immediately
on startup and ensure both the immediate call and the scheduled calls handle
rejections: invoke collectHealth() once before setting the interval and attach
.catch(...) to log/emit errors; then create the interval with const interval =
setInterval(() => collectHealth().catch(err => { /* log via processLogger or
sdk.emit("monitor_error", {err}) */ }), 5000); and call interval.unref() after
creating the timer so it won’t keep the process alive; keep the existing stop
implementation that clearInterval(interval) and sdk.emit("monitor_stopped", {
at: Date.now() }).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants