Refactor health monitoring logic and snapshot collection#333
Conversation
|
@krataratha is attempting to deploy a commit to the rohitg00's projects Team on Vercel. A member of the Team first needs to authorize it. |
📝 WalkthroughWalkthroughThe health monitor now collects event loop lag via ChangesHealth Monitoring Collection and Alerting
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR attempts to refactor the health monitoring implementation to collect a health snapshot, evaluate it, persist it to KV, and emit alerts when transitioning into a critical state.
Changes:
- Reworked snapshot collection to include event loop lag measurement and additional persisted state.
- Added state-aware alert emission logic intended to trigger only on critical transitions.
- Updated interval lifecycle management (start/stop) for the monitor.
Comments suppressed due to low confidence (4)
src/health/monitor.ts:23
evaluateHealthreturns{ status, alerts, notes }, but this code treats the result as if it hadisCriticalandlevelproperties. Update this logic to useevaluated.status === "critical"(and persist/compare that string) rather thanstatus.isCritical/status.level.
const status = evaluateHealth(snapshot);
// Feature: Persistence & State-Aware Alerting
const lastStatus = await kv.get(KV.LAST_HEALTH_STATUS);
if (status.isCritical && lastStatus !== "critical") {
sdk.emit?.("health_alert", { snapshot, status });
}
await kv.set(KV.LATEST_HEALTH, snapshot);
await kv.set(KV.LAST_HEALTH_STATUS, status.level);
src/health/monitor.ts:23
StateKV.get/setrequire(scope, key, ...), but these calls pass a single argument (and useKV.LATEST_HEALTH/KV.LAST_HEALTH_STATUS, which don’t exist insrc/state/schema.ts). Use the existingKV.healthscope with explicit keys (e.g.,"latest","last_status"), or extend the KV schema and update all call sites accordingly.
const lastStatus = await kv.get(KV.LAST_HEALTH_STATUS);
if (status.isCritical && lastStatus !== "critical") {
sdk.emit?.("health_alert", { snapshot, status });
}
await kv.set(KV.LATEST_HEALTH, snapshot);
await kv.set(KV.LAST_HEALTH_STATUS, status.level);
src/health/monitor.ts:40
- After the
return/closing brace here, the file continues with moreawaitstatements and anothersnapshot/intervaldefinition. This indicates the refactor left behind a second copy of the old implementation (and likely mismatched braces), which will either be unreachable or cause parse/type errors. Remove the duplicate block and ensure there is exactly one coherent implementation/interval lifecycle.
const interval = setInterval(collectHealth, 5000);
return {
stop: () => {
clearInterval(interval);
sdk.emit?.("monitor_stopped", { at: Date.now() });
}
};
}
const eventLoopLagMs = performance.now() - startMark;
let workers: HealthSnapshot["workers"] = [];
try {
const result = await sdk.trigger<
src/health/monitor.ts:35
- This module no longer exports
registerHealthMonitor, but it’s imported/used fromsrc/index.ts(import { registerHealthMonitor } ...andregisterHealthMonitor(sdk, kv)). Either reintroduce that export (recommended) or update the entrypoint and any other call sites to the new API so the build doesn’t break.
const interval = setInterval(collectHealth, 5000);
return {
stop: () => {
clearInterval(interval);
sdk.emit?.("monitor_stopped", { at: Date.now() });
}
};
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const endMark = performance.now(); | ||
| const eventLoopLag = endMark - startMark; | ||
|
|
| const snapshot: HealthSnapshot = { | ||
| cpuUsage: cpuPercent, | ||
| memoryRss: mem.rss / 1024 / 1024, | ||
| memoryHeapUsed: mem.heapUsed / 1024 / 1024, | ||
| eventLoopLag, | ||
| uptime, | ||
| connectionState, | ||
| timestamp: now, | ||
| }; |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/health/monitor.ts (1)
32-32: 💤 Low valueConsider using ISO string format for timestamp consistency.
The coding guideline recommends
new Date().toISOString()for timestamps. While this is a single call (not multiple), using ISO string format ensures consistency with other timestamps in the codebase (like line 11 which usesnowthat's presumably an ISO string).♻️ Optional refactor for consistency
- sdk.emit?.("monitor_stopped", { at: Date.now() }); + sdk.emit?.("monitor_stopped", { at: new Date().toISOString() });As per coding guidelines: "Capture timestamps once with
new Date().toISOString()and reuse instead of calling Date multiple times"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/health/monitor.ts` at line 32, The timestamp passed to sdk.emit("monitor_stopped", ...) should use a single ISO string timestamp for consistency; capture const at = new Date().toISOString() once and pass that variable (e.g., { at }) to sdk.emit so the emitted timestamp matches other ISO timestamps in the codebase and avoids multiple Date.now()/Date() calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/health/monitor.ts`:
- Around line 17-20: The current alerting emits a health_alert on first run if
the system is already critical because lastStatus from
kv.get(KV.LAST_HEALTH_STATUS) can be null/undefined; update the condition that
triggers sdk.emit("health_alert", { snapshot, status }) to skip the first-run
case by ensuring lastStatus is not null/undefined (e.g., check lastStatus !=
null) before comparing to "critical" so only transitions from a known
non-critical state emit an alert; make this change near the code using kv.get,
KV.LAST_HEALTH_STATUS, lastStatus, status.isCritical, and
sdk.emit("health_alert").
- Around line 28-34: Call collectHealth immediately on startup and ensure both
the immediate call and the scheduled calls handle rejections: invoke
collectHealth() once before setting the interval and attach .catch(...) to
log/emit errors; then create the interval with const interval = setInterval(()
=> collectHealth().catch(err => { /* log via processLogger or
sdk.emit("monitor_error", {err}) */ }), 5000); and call interval.unref() after
creating the timer so it won’t keep the process alive; keep the existing stop
implementation that clearInterval(interval) and sdk.emit("monitor_stopped", {
at: Date.now() }).
- Around line 1-2: In collectHealth(), startMark is never defined before it's
used to compute eventLoopLag; add capturing of the start timestamp (e.g., const
startMark = performance.now()) at the beginning of the collectHealth() execution
path so that the existing lines const endMark = performance.now(); and const
eventLoopLag = endMark - startMark; use a valid startMark; ensure the symbol
names startMark, endMark, and eventLoopLag remain consistent with the current
function.
---
Nitpick comments:
In `@src/health/monitor.ts`:
- Line 32: The timestamp passed to sdk.emit("monitor_stopped", ...) should use a
single ISO string timestamp for consistency; capture const at = new
Date().toISOString() once and pass that variable (e.g., { at }) to sdk.emit so
the emitted timestamp matches other ISO timestamps in the codebase and avoids
multiple Date.now()/Date() calls.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
| const endMark = performance.now(); | ||
| const eventLoopLag = endMark - startMark; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify that startMark is defined before the event loop lag calculation
rg -n -B 20 'const endMark = performance\.now\(\)' src/health/monitor.ts | rg 'startMark'Repository: rohitg00/agentmemory
Length of output: 46
🏁 Script executed:
# Check if file exists and get basic info
fd -t f "monitor\.ts" src/
# Read the file to see actual content
head -100 src/health/monitor.tsRepository: rohitg00/agentmemory
Length of output: 3011
🏁 Script executed:
# Search for startMark definition/usage in the entire file
rg -n 'startMark' src/health/monitor.ts
# Also get the full file to understand structure
wc -l src/health/monitor.ts
cat -n src/health/monitor.tsRepository: rohitg00/agentmemory
Length of output: 4100
🏁 Script executed:
# Search for function declarations that might contain startMark
rg -n 'function|const.*=.*function|const.*=.*\(\)' src/health/monitor.ts | head -20
# Also search for registerHealthMonitor or collectHealth function
rg -n 'registerHealthMonitor|collectHealth' src/health/monitor.ts
# Search for where performance.now is called to capture startMark
rg -n 'performance\.now\(\)' src/health/monitor.tsRepository: rohitg00/agentmemory
Length of output: 887
🏁 Script executed:
# Get the beginning of the file to find function declaration
head -50 src/health/monitor.ts | tail -40
# Also try to find the function signature that contains collectHealth
ast-grep --pattern 'const collectHealth = $_'
# Or search in broader context
rg -n 'const startMark' src/health/monitor.tsRepository: rohitg00/agentmemory
Length of output: 1209
🏁 Script executed:
# Try to get the very beginning of the file
head -1 src/health/monitor.ts
# Get more context around line 1
sed -n '1,30p' src/health/monitor.ts
# Search for the actual function definition that contains this code
rg -B 50 'const endMark = performance\.now' src/health/monitor.ts | head -60Repository: rohitg00/agentmemory
Length of output: 965
🏁 Script executed:
# Get the entire file to understand full context
cat src/health/monitor.ts
# Also search for where startMark might be initialized
rg 'startMark\s*=' src/health/monitor.tsRepository: rohitg00/agentmemory
Length of output: 3195
🏁 Script executed:
# Let me check the raw file with more context - search for function definitions
rg -n 'export.*function|function.*registerHealthMonitor|async function' src/health/monitor.ts
# Check if startMark might be a parameter or closure variable
rg -B 100 'const endMark = performance.now' src/health/monitor.ts | head -120Repository: rohitg00/agentmemory
Length of output: 146
🏁 Script executed:
# Search for all variable declarations including startMark context
rg -A 5 -B 5 'startMark' src/health/monitor.ts
# Look for function parameters that might include startMark
rg '\(.*startMark.*\)' src/health/monitor.ts
# Check if maybe the file content is incomplete - look for git diff info
git diff src/health/monitor.ts 2>/dev/null | head -100Repository: rohitg00/agentmemory
Length of output: 587
🏁 Script executed:
# Try to find the complete function - search for "registerHealthMonitor" or similar
rg -n 'registerHealthMonitor|export.*function.*Health' src/health/monitor.ts
# Search for where the collectHealth function is defined
rg -B 30 'const endMark = performance.now' src/health/monitor.ts | grep -E '(function|const|let|var)' | head -20
# Try ast-grep to find function definitions
ast-grep --pattern 'function $_($_) { $$$ }'
</parameter>Repository: rohitg00/agentmemory
Length of output: 50501
Define startMark before using it in the event loop lag calculation.
The variable startMark is referenced on line 2 but is never defined anywhere in the function. This will cause a runtime error. Ensure startMark is captured at the start of collectHealth() with const startMark = performance.now(); before the endMark calculation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/health/monitor.ts` around lines 1 - 2, In collectHealth(), startMark is
never defined before it's used to compute eventLoopLag; add capturing of the
start timestamp (e.g., const startMark = performance.now()) at the beginning of
the collectHealth() execution path so that the existing lines const endMark =
performance.now(); and const eventLoopLag = endMark - startMark; use a valid
startMark; ensure the symbol names startMark, endMark, and eventLoopLag remain
consistent with the current function.
| const lastStatus = await kv.get(KV.LAST_HEALTH_STATUS); | ||
| if (status.isCritical && lastStatus !== "critical") { | ||
| sdk.emit?.("health_alert", { snapshot, status }); | ||
| } |
There was a problem hiding this comment.
Handle first-run case in alerting logic.
On the first monitor run, lastStatus will be null or undefined. The condition lastStatus !== "critical" will be true, causing an alert to be emitted immediately if the system starts in a critical state. Consider whether you want to alert on initial critical status or only on transitions from non-critical to critical states.
🛡️ Proposed fix to skip alerting on first run
const lastStatus = await kv.get(KV.LAST_HEALTH_STATUS);
-if (status.isCritical && lastStatus !== "critical") {
+if (status.isCritical && lastStatus && lastStatus !== "critical") {
sdk.emit?.("health_alert", { snapshot, status });
}Or explicitly handle first run:
const lastStatus = await kv.get(KV.LAST_HEALTH_STATUS);
-if (status.isCritical && lastStatus !== "critical") {
+if (status.isCritical && lastStatus !== null && lastStatus !== "critical") {
sdk.emit?.("health_alert", { snapshot, status });
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/health/monitor.ts` around lines 17 - 20, The current alerting emits a
health_alert on first run if the system is already critical because lastStatus
from kv.get(KV.LAST_HEALTH_STATUS) can be null/undefined; update the condition
that triggers sdk.emit("health_alert", { snapshot, status }) to skip the
first-run case by ensuring lastStatus is not null/undefined (e.g., check
lastStatus != null) before comparing to "critical" so only transitions from a
known non-critical state emit an alert; make this change near the code using
kv.get, KV.LAST_HEALTH_STATUS, lastStatus, status.isCritical, and
sdk.emit("health_alert").
| const interval = setInterval(collectHealth, 5000); | ||
| return { | ||
| stop: () => { | ||
| clearInterval(interval); | ||
| sdk.emit?.("monitor_stopped", { at: Date.now() }); | ||
| } | ||
| }; |
There was a problem hiding this comment.
Add error handling, initial call, and interval.unref().
Three concerns with the monitoring loop:
- Missing error handling:
collectHealthis async but has no.catch()handler. Unhandled promise rejections could crash the process or be silently swallowed. - Missing
interval.unref(): The old implementation (line 99) calledinterval.unref()to prevent the interval from keeping the process alive. Without it, the process won't exit cleanly when shutdown is initiated. - No initial collection: The old code (line 95) called
collectHealth()immediately on startup. The new code has a 5-second delay before the first health check, meaning no health data is available initially.
🔧 Proposed fix
+collectHealth().catch(() => {});
const interval = setInterval(collectHealth, 5000);
+interval.unref();
return {
stop: () => {
clearInterval(interval);
sdk.emit?.("monitor_stopped", { at: Date.now() });
}
};For better error handling, consider logging errors instead of silently catching:
-collectHealth().catch(() => {});
+collectHealth().catch((err) => {
+ sdk.emit?.("monitor_error", { error: String(err) });
+});
const interval = setInterval(() => {
- collectHealth();
+ collectHealth().catch((err) => {
+ sdk.emit?.("monitor_error", { error: String(err) });
+ });
}, 5000);🧰 Tools
🪛 Biome (2.4.15)
[error] 29-34: Illegal return statement outside of a function
(parse)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/health/monitor.ts` around lines 28 - 34, Call collectHealth immediately
on startup and ensure both the immediate call and the scheduled calls handle
rejections: invoke collectHealth() once before setting the interval and attach
.catch(...) to log/emit errors; then create the interval with const interval =
setInterval(() => collectHealth().catch(err => { /* log via processLogger or
sdk.emit("monitor_error", {err}) */ }), 5000); and call interval.unref() after
creating the timer so it won’t keep the process alive; keep the existing stop
implementation that clearInterval(interval) and sdk.emit("monitor_stopped", {
at: Date.now() }).
Summary by CodeRabbit