feat(proxy): enhance PrismaClient with signal handling and exit logging #28199

Open
harish-berri wants to merge 3 commits into litellm_internal_staging from litellm_prisma-logging

Conversation

Collaborator

@harish-berri harish-berri commented May 18, 2026

The Prisma client runs as a subprocess managed by litellm. Instead of Prisma being an in-process ORM, the litellm process talks to the database through a Prisma client process. Issues with maintaining that process surfaced in LIT-3146 and LIT-3183. The reconnect bugs appear to be resolved by PR #26225; they stem mainly from the native Prisma disconnect flow being blocking.

The second class of issues involves the Prisma watchdog, which runs in a background loop performing health checks on the database. If the database is unresponsive, a reconnect is attempted, in which the Prisma client subprocess is terminated and re-instantiated.
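The watchdog behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in litellm/proxy/utils.py: `run_watchdog`, `health_check`, `reconnect`, and the `max_checks` bound are hypothetical names introduced here so the loop terminates in a demo.

```python
import asyncio
from typing import Awaitable, Callable


async def run_watchdog(
    health_check: Callable[[], Awaitable[None]],
    reconnect: Callable[[], Awaitable[None]],
    *,
    interval_seconds: float,
    timeout_seconds: float,
    max_checks: int,
) -> int:
    """Probe the database `max_checks` times; return how many reconnects ran.

    All names here are assumptions for illustration only.
    """
    reconnects = 0
    for _ in range(max_checks):
        try:
            # A probe that hangs is treated the same as one that raises.
            await asyncio.wait_for(health_check(), timeout=timeout_seconds)
        except Exception:
            # Unresponsive or failing database: tear down and recreate the client.
            await reconnect()
            reconnects += 1
        await asyncio.sleep(interval_seconds)
    return reconnects
```

A real watchdog would loop until cancelled rather than for a fixed number of checks; the bound only keeps the sketch easy to exercise.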

The changes in this PR aim to determine why Prisma client subprocesses go down, for example core dumps, segmentation faults, or OOM errors.

This PR adds logging around the reason for prisma client subprocess exits.

  • Added signal handling utilities to format signal names and engine wait statuses.
  • Implemented logging for Prisma engine exit reasons, capturing detailed exit statuses and signals.
  • Updated methods to pass wait status information to the event loop for improved diagnostics.
  • Enhanced tests to validate new signal handling and exit status formatting functionalities.
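To illustrate what decoding a wait status involves, here is a minimal sketch built on the standard `os` and `signal` modules. The function names and the exact output format are assumptions for illustration; they are not the helpers added in this PR.

```python
import os
import signal


def format_signal_name(signum: int) -> str:
    """Return a name such as 'SIGKILL', or 'unknown' if Python has no name for it."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return "unknown"


def format_wait_status(status: int) -> str:
    """Decode a raw POSIX wait status (as returned by os.waitpid) for logging.

    Hypothetical output format, chosen for this sketch only.
    """
    if os.WIFEXITED(status):
        # Normal termination: the child called exit() or returned from main.
        return f"exit_code={os.WEXITSTATUS(status)}"
    if os.WIFSIGNALED(status):
        # Killed by a signal, e.g. SIGSEGV on a crash or SIGKILL from the OOM killer.
        signum = os.WTERMSIG(status)
        return (
            f"signal={format_signal_name(signum)}({signum}) "
            f"core_dumped={os.WCOREDUMP(status)}"
        )
    return f"raw_wait_status={status}"
```

These macros are only available on Unix, which matches the subprocess model described above. A SIGKILL exit is often, though not always, a sign of the kernel OOM killer.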

Relevant issues

Linear ticket

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have Added testing in the tests/test_litellm/ directory, Adding at least 1 test is a hard requirement - see details
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).

CI (LiteLLM team)

CI status guideline:

  • 50-55 passing tests: main is stable with minor issues.
  • 45-49 passing tests: acceptable but needs attention
  • <= 40 passing tests: unstable; be careful with your merges and assess the risk.
  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Screenshots / Proof of Fix

Type

🆕 New Feature
🐛 Bug Fix
🧹 Refactoring
📖 Documentation
🚄 Infrastructure
✅ Test

Changes

- Added signal handling utilities to format signal names and engine wait statuses.
- Implemented logging for Prisma engine exit reasons, capturing detailed exit statuses and signals.
- Updated methods to pass wait status information to the event loop for improved diagnostics.
- Enhanced tests to validate new signal handling and exit status formatting functionalities.
@greptile-apps
Contributor

greptile-apps Bot commented May 18, 2026

Greptile Summary

This PR enriches the PrismaClient watchdog's exit logging by adding helpers that decode a raw POSIX waitpid status into a human-readable string (exit code, signal name/number, core-dump flag) and thread it through every engine-death detection path.

  • _format_signal_name, _format_engine_wait_status, and _format_prisma_engine_exit_reason are new pure utilities that translate a raw wait-status integer into a structured log string; the pidfd and os.kill polling paths pass wait_status=None (logged as exit_status=unavailable) because no wait status is available in those code paths.
  • _log_prisma_engine_exit_reason centralises error-level logging across all three death-detection paths, replacing three separate ad-hoc verbose_proxy_logger.error calls.
  • New unit tests cover the exit-code and signal decoding helpers and verify that _waitpid_thread_func forwards the raw status to the event loop callback.
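As a rough sketch of how a blocking `waitpid` thread can hand the raw status to an asyncio event loop, consider the following. It assumes the watched process is a direct child of the current process; `waitpid_thread_func`, `watch_child`, and `demo` are hypothetical names here and do not reproduce the PR's implementation.

```python
import asyncio
import os
import subprocess
import sys
import threading
from typing import Callable, Optional


def waitpid_thread_func(
    pid: int,
    loop: asyncio.AbstractEventLoop,
    on_exit: Callable[[int, Optional[int]], None],
) -> None:
    """Block in a worker thread until `pid` exits, then notify the event loop."""
    wait_status: Optional[int]
    try:
        _, wait_status = os.waitpid(pid, 0)
    except ChildProcessError:
        # The child was already reaped elsewhere, so no status is available.
        wait_status = None
    # call_soon_threadsafe is the supported way to schedule a callback from another thread.
    loop.call_soon_threadsafe(on_exit, pid, wait_status)


async def watch_child(pid: int) -> Optional[int]:
    """Await the exit of `pid` and return its raw wait status (None if unavailable)."""
    loop = asyncio.get_running_loop()
    done: "asyncio.Future[Optional[int]]" = loop.create_future()

    def on_exit(_pid: int, wait_status: Optional[int]) -> None:
        if not done.done():
            done.set_result(wait_status)

    threading.Thread(
        target=waitpid_thread_func, args=(pid, loop, on_exit), daemon=True
    ).start()
    return await done


def demo(exit_code: int = 7) -> Optional[int]:
    """Spawn a short-lived child that exits with `exit_code` and watch it."""
    child = subprocess.Popen(
        [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]
    )
    return asyncio.run(watch_child(child.pid))
```

Passing `None` when no status can be read mirrors the `exit_status=unavailable` case mentioned in the summary for paths that cannot observe a wait status.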

Confidence Score: 4/5

Safe to merge — changes are limited to diagnostic logging helpers and their tests; no request-path logic, authentication, or database schema is touched.

The new helpers are self-contained and well-tested. The only rough edges are a redundant WARNING log message emitted alongside the new ERROR log in the already-dead-at-watch-start branch, and two helpers that hold self unnecessarily when they could be static. Neither affects correctness or behavior.

No files require special attention; litellm/proxy/utils.py has minor style inconsistencies noted in review comments.

Important Files Changed

| Filename | Overview |
|---|---|
| litellm/proxy/utils.py | Adds signal-handling helpers to enrich Prisma engine exit logs; minor duplicate WARNING log when engine is already dead at watch start, and two new helpers could be @staticmethod for consistency. |
| tests/litellm/proxy/test_prisma_engine_watchdog.py | Adds focused unit tests for _format_engine_wait_status (exit-code and signal cases) and verifies wait_status is forwarded through _waitpid_thread_func to the event loop; all tests use mocks only. |

Reviews (1): Last reviewed commit: "feat(proxy): enhance PrismaClient with s..."

Comment thread litellm/proxy/utils.py Outdated
Comment on lines 4311 to 4320
if probe_pid == pid:
    self._log_prisma_engine_exit_reason(
        pid=pid,
        detection_method="waitpid watch start",
        wait_status=wait_status,
    )
    verbose_proxy_logger.warning(
        "prisma-query-engine PID %s already dead at watch start.",
        "prisma-query-engine PID %s already dead at watch start; triggering reconnect.",
        pid,
    )
Contributor

P2 The WARNING log immediately after _log_prisma_engine_exit_reason is redundant. _log_prisma_engine_exit_reason already emits an ERROR-level message that includes "triggering reconnect" and the exit reason; the second call adds no new information and will appear in logs twice for the same event.

Suggested change
- if probe_pid == pid:
-     self._log_prisma_engine_exit_reason(
-         pid=pid,
-         detection_method="waitpid watch start",
-         wait_status=wait_status,
-     )
-     verbose_proxy_logger.warning(
-         "prisma-query-engine PID %s already dead at watch start.",
-         "prisma-query-engine PID %s already dead at watch start; triggering reconnect.",
-         pid,
-     )
+ if probe_pid == pid:
+     self._log_prisma_engine_exit_reason(
+         pid=pid,
+         detection_method="waitpid watch start",
+         wait_status=wait_status,
+     )

Comment thread litellm/proxy/utils.py
Comment on lines +4236 to +4255
def _format_prisma_engine_exit_reason(
    self,
    *,
    detection_method: str,
    wait_status: Optional[int],
) -> str:
    if wait_status is None:
        return f"detection_method={detection_method} exit_status=unavailable"
    return (
        f"detection_method={detection_method} "
        f"{self._format_engine_wait_status(wait_status)}"
    )

def _log_prisma_engine_exit_reason(
    self,
    *,
    pid: int,
    detection_method: str,
    wait_status: Optional[int],
) -> None:
Contributor

P2 _format_prisma_engine_exit_reason and _log_prisma_engine_exit_reason reference no instance state (self is only used to dispatch to _format_engine_wait_status, which is already a @staticmethod). Decorating them as @staticmethod keeps the API consistent with _format_engine_wait_status and signals to readers that they are pure utilities with no side effects on the object.

Suggested change
- def _format_prisma_engine_exit_reason(
-     self,
-     *,
-     detection_method: str,
-     wait_status: Optional[int],
- ) -> str:
-     if wait_status is None:
-         return f"detection_method={detection_method} exit_status=unavailable"
-     return (
-         f"detection_method={detection_method} "
-         f"{self._format_engine_wait_status(wait_status)}"
-     )
- def _log_prisma_engine_exit_reason(
-     self,
-     *,
-     pid: int,
-     detection_method: str,
-     wait_status: Optional[int],
- ) -> None:
+ @staticmethod
+ def _format_prisma_engine_exit_reason(
+     *,
+     detection_method: str,
+     wait_status: Optional[int],
+ ) -> str:
+     if wait_status is None:
+         return f"detection_method={detection_method} exit_status=unavailable"
+     return (
+         f"detection_method={detection_method} "
+         f"{PrismaClient._format_engine_wait_status(wait_status)}"
+     )
+ @staticmethod
+ def _log_prisma_engine_exit_reason(
+     *,
+     pid: int,
+     detection_method: str,
+     wait_status: Optional[int],
+ ) -> None:

@codecov

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 43.75000% with 27 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| litellm/proxy/utils.py | 43.75% | 27 Missing ⚠️ |

…Client

- Changed `_format_prisma_engine_exit_reason` and `_log_prisma_engine_exit_reason` methods to static methods to improve class design and usability.
- Updated method calls to reflect the new static context, ensuring consistent logging of Prisma engine exit reasons and statuses.
- Removed unnecessary instance references to streamline the codebase.
…down and tests

- Introduced a reconnect cooldown for the health watchdog to prevent rapid reconnection attempts during database errors.
- Updated health watchdog parameters to allow for configurable intervals and timeouts via environment variables.
- Enhanced logging to provide clearer information on the watchdog's operation.
- Added tests to verify the reconnect behavior during cooldown periods and ensure proper functionality of the health watchdog.
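
A reconnect cooldown of the kind described in this commit could look roughly like the sketch below. The class, the helper, and the environment variable name in the usage comment are hypothetical; the actual variable names and defaults used by the health watchdog are not shown in this conversation.

```python
import os
import time
from typing import Callable, Optional


def float_from_env(name: str, default: float) -> float:
    """Read a float setting from the environment, falling back to `default`."""
    raw = os.getenv(name)
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        return default


class ReconnectCooldown:
    """Allow at most one reconnect attempt per `cooldown_seconds` window."""

    def __init__(
        self,
        cooldown_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so tests do not need to sleep
        self._last_attempt: Optional[float] = None

    def try_acquire(self) -> bool:
        """Return True if a reconnect may start now, recording the attempt time."""
        now = self._clock()
        if (
            self._last_attempt is not None
            and now - self._last_attempt < self.cooldown_seconds
        ):
            return False
        self._last_attempt = now
        return True


# Hypothetical usage; the environment variable name is an assumption:
# cooldown = ReconnectCooldown(float_from_env("EXAMPLE_RECONNECT_COOLDOWN_SECONDS", 30.0))
# if cooldown.try_acquire():
#     ...trigger the reconnect...
```

Using a monotonic clock avoids spurious cooldown resets when the wall clock is adjusted, and a rejected attempt deliberately does not extend the window.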