feat(proxy): enhance PrismaClient with signal handling and exit logging#28199
feat(proxy): enhance PrismaClient with signal handling and exit logging#28199harish-berri wants to merge 3 commits into
Conversation
- Added signal handling utilities to format signal names and engine wait statuses. - Implemented logging for Prisma engine exit reasons, capturing detailed exit statuses and signals. - Updated methods to pass wait status information to the event loop for improved diagnostics. - Enhanced tests to validate new signal handling and exit status formatting functionalities.
Greptile SummaryThis PR enriches the
Confidence Score: 4/5Safe to merge — changes are limited to diagnostic logging helpers and their tests; no request-path logic, authentication, or database schema is touched. The new helpers are self-contained and well-tested. The only rough edges are a redundant WARNING log message emitted alongside the new ERROR log in the already-dead-at-watch-start branch, and two helpers that hold self unnecessarily when they could be static. Neither affects correctness or behavior. No files require special attention; litellm/proxy/utils.py has minor style inconsistencies noted in review comments.
|
| Filename | Overview |
|---|---|
| litellm/proxy/utils.py | Adds signal-handling helpers to enrich Prisma engine exit logs; minor duplicate WARNING log when engine is already dead at watch start, and two new helpers could be @staticmethod for consistency. |
| tests/litellm/proxy/test_prisma_engine_watchdog.py | Adds focused unit tests for _format_engine_wait_status (exit-code and signal cases) and verifies wait_status is forwarded through _waitpid_thread_func to the event loop; all tests use mocks only. |
Reviews (1): Last reviewed commit: "feat(proxy): enhance PrismaClient with s..." | Re-trigger Greptile
| if probe_pid == pid: | ||
| self._log_prisma_engine_exit_reason( | ||
| pid=pid, | ||
| detection_method="waitpid watch start", | ||
| wait_status=wait_status, | ||
| ) | ||
| verbose_proxy_logger.warning( | ||
| "prisma-query-engine PID %s already dead at watch start.", | ||
| "prisma-query-engine PID %s already dead at watch start; triggering reconnect.", | ||
| pid, | ||
| ) |
There was a problem hiding this comment.
The WARNING log immediately after
_log_prisma_engine_exit_reason is redundant. _log_prisma_engine_exit_reason already emits an ERROR-level message that includes "triggering reconnect" and the exit reason; the second call adds no new information and will appear in logs twice for the same event.
| if probe_pid == pid: | |
| self._log_prisma_engine_exit_reason( | |
| pid=pid, | |
| detection_method="waitpid watch start", | |
| wait_status=wait_status, | |
| ) | |
| verbose_proxy_logger.warning( | |
| "prisma-query-engine PID %s already dead at watch start.", | |
| "prisma-query-engine PID %s already dead at watch start; triggering reconnect.", | |
| pid, | |
| ) | |
| if probe_pid == pid: | |
| self._log_prisma_engine_exit_reason( | |
| pid=pid, | |
| detection_method="waitpid watch start", | |
| wait_status=wait_status, | |
| ) |
| def _format_prisma_engine_exit_reason( | ||
| self, | ||
| *, | ||
| detection_method: str, | ||
| wait_status: Optional[int], | ||
| ) -> str: | ||
| if wait_status is None: | ||
| return f"detection_method={detection_method} exit_status=unavailable" | ||
| return ( | ||
| f"detection_method={detection_method} " | ||
| f"{self._format_engine_wait_status(wait_status)}" | ||
| ) | ||
|
|
||
| def _log_prisma_engine_exit_reason( | ||
| self, | ||
| *, | ||
| pid: int, | ||
| detection_method: str, | ||
| wait_status: Optional[int], | ||
| ) -> None: |
There was a problem hiding this comment.
_format_prisma_engine_exit_reason and _log_prisma_engine_exit_reason reference no instance state (self is only used to dispatch to _format_engine_wait_status, which is already a @staticmethod). Decorating them as @staticmethod keeps the API consistent with _format_engine_wait_status and signals to readers that they are pure utilities with no side effects on the object.
| def _format_prisma_engine_exit_reason( | |
| self, | |
| *, | |
| detection_method: str, | |
| wait_status: Optional[int], | |
| ) -> str: | |
| if wait_status is None: | |
| return f"detection_method={detection_method} exit_status=unavailable" | |
| return ( | |
| f"detection_method={detection_method} " | |
| f"{self._format_engine_wait_status(wait_status)}" | |
| ) | |
| def _log_prisma_engine_exit_reason( | |
| self, | |
| *, | |
| pid: int, | |
| detection_method: str, | |
| wait_status: Optional[int], | |
| ) -> None: | |
| @staticmethod | |
| def _format_prisma_engine_exit_reason( | |
| *, | |
| detection_method: str, | |
| wait_status: Optional[int], | |
| ) -> str: | |
| if wait_status is None: | |
| return f"detection_method={detection_method} exit_status=unavailable" | |
| return ( | |
| f"detection_method={detection_method} " | |
| f"{PrismaClient._format_engine_wait_status(wait_status)}" | |
| ) | |
| @staticmethod | |
| def _log_prisma_engine_exit_reason( | |
| *, | |
| pid: int, | |
| detection_method: str, | |
| wait_status: Optional[int], | |
| ) -> None: |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…Client - Changed `_format_prisma_engine_exit_reason` and `_log_prisma_engine_exit_reason` methods to static methods to improve class design and usability. - Updated method calls to reflect the new static context, ensuring consistent logging of Prisma engine exit reasons and statuses. - Removed unnecessary instance references to streamline the codebase.
…down and tests - Introduced a reconnect cooldown for the health watchdog to prevent rapid reconnection attempts during database errors. - Updated health watchdog parameters to allow for configurable intervals and timeouts via environment variables. - Enhanced logging to provide clearer information on the watchdog's operation. - Added tests to verify the reconnect behavior during cooldown periods and ensure proper functionality of the health watchdog.
The Prisma client runs as a subprocess which is managed by litellm. This means that instead of prisma being a in-process ORM, the litellm process talks to database using a prisma client process. The issues around prisma client process maintenance surface in LIT-3146 and LIT-3183. These reconnect bugs seem to be resolved by this PR #26225. This is mainly related to the native prisma disconnect flow being blocking.
The second class of issues occurs with the prisma watchdog which runs in the background. This runs in a loop in the background checking / doing healthchecks on the database. If the database is unresponsive then a reconnect attempt is made wherein the prisma client subprocess is terminated and reinstantiated.
The changes in the PR aim to check the reason for prisma client subprocesses going down. The reason being core dump / seg faults / oom errors etc.
This PR adds logging around the reason for prisma client subprocess exits.
Relevant issues
Linear ticket
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
tests/test_litellm/directory, Adding at least 1 test is a hard requirement - see detailsmake test-unit@greptileaiand received a Confidence Score of at least 4/5 before requesting a maintainer reviewDelays in PR merge?
If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).
CI (LiteLLM team)
Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:
Screenshots / Proof of Fix
Type
🆕 New Feature
🐛 Bug Fix
🧹 Refactoring
📖 Documentation
🚄 Infrastructure
✅ Test
Changes