fix: rate limiting token bucket timing and dict memory leak #4057

Open

strawgate wants to merge 2 commits into main from codex/rate-limiting-fix

Conversation

@strawgate (Collaborator) commented Apr 26, 2026

Summary

Fixes a memory leak in RateLimitingMiddleware and SlidingWindowRateLimitingMiddleware where per-client rate limiter entries were stored in an unbounded defaultdict, causing unlimited memory growth as new client IDs were seen.

Solution: Replaced the defaultdict with an OrderedDict-based LRU cache with a configurable max_clients limit (default 10,000). When the limit is reached, the least recently used client is evicted to make room for a new client.

This approach:

  • Bounds memory at a configurable maximum
  • Uses only the stdlib (collections.OrderedDict)
  • Avoids TTL thread/cleanup overhead
  • Preserves per-client rate limiting semantics
  • Applies the same fix to SlidingWindowRateLimitingMiddleware, which had the same defaultdict issue

Fixes #4053.
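
For illustration, a minimal sketch of the approach described above (the wrapper class and get() signature are hypothetical; _client_limiters, _max_clients, move_to_end, and popitem(last=False) follow the diff hunks quoted later in this thread):

    from collections import OrderedDict
    from typing import Callable

    class LRUClientLimiters:
        """Bounded per-client limiter store; illustrative sketch only."""

        def __init__(self, max_clients: int = 10_000):
            self._max_clients = max_clients
            self._client_limiters: OrderedDict[str, object] = OrderedDict()

        def get(self, client_id: str, factory: Callable[[], object]) -> object:
            if client_id in self._client_limiters:
                # Mark this client as most recently used.
                self._client_limiters.move_to_end(client_id)
            else:
                self._client_limiters[client_id] = factory()
                if len(self._client_limiters) > self._max_clients:
                    # Evict the least recently used client to bound memory.
                    self._client_limiters.popitem(last=False)
            return self._client_limiters[client_id]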

@strawgate strawgate added the DON'T MERGE PR is not ready for merging. Used by authors to prevent premature merging. label Apr 26, 2026
@strawgate (Collaborator, Author) commented:

Surely we can find a non-hand-rolled TTL dict.

@marvin-context-protocol marvin-context-protocol Bot added bug Something isn't working. Reports of errors, unexpected behavior, or broken functionality. server Related to FastMCP server implementation or server-side functionality. labels Apr 26, 2026
@chatgpt-codex-connector (Bot) left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e5c8f0f2fa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 54 to 57:

    if self.tokens >= tokens:
        self.tokens -= tokens
        self.last_refill = now
        return True

P1: Advance token-bucket clock on denied consumes

Keeping last_refill inside the success branch causes elapsed time to be counted repeatedly across denied requests: after one denial, the next retry still computes elapsed from the old timestamp and adds the same interval again. In practice, a client that retries quickly after depleting tokens can refill much faster than refill_rate (bypassing the configured rate limit). The refill timestamp needs to be updated whenever refill math is applied, not only when tokens are consumed.

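A minimal sketch of the suggested fix (capacity and refill_rate are assumed attribute names; tokens and last_refill follow the quoted hunk): advance the refill clock whenever the refill math runs, not only on successful consumes.

    import time

    class TokenBucket:
        def __init__(self, capacity: float, refill_rate: float):
            self.capacity = capacity
            self.refill_rate = refill_rate  # tokens per second
            self.tokens = capacity
            self.last_refill = time.monotonic()

        def consume(self, tokens: float = 1.0) -> bool:
            now = time.monotonic()
            # Apply refill math and advance the clock unconditionally, so a
            # denied request cannot re-count the same elapsed interval later.
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
            self.last_refill = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False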

@strawgate (Collaborator, Author) replied:

You're right: my fix was incorrect. Moving last_refill inside the success branch freezes the refill clock on denials, so the same elapsed interval is re-counted on every retry, letting a client refill faster than the configured rate.

I've reverted this change. The token bucket now updates last_refill on every call (the original behavior), which matches standard token-bucket semantics.

The only fix in this PR is now the dict TTL eviction for per-client rate limiters (issue #4053).

@marvin-context-protocol (Contributor) commented:

CI failed due to a pre-existing flaky test (test_timeout_tool_call_overrides_client_timeout_even_if_lower) in tests/client/client/test_timeout.py, unrelated to this PR's rate-limiting changes. Safe to re-run.

Root Cause: The test creates a Client(timeout=0.1) then calls client.call_tool("sleep", {"seconds": 0.5}, timeout=2) — expecting the per-call timeout=2 to override the client-level timeout. The tool call itself completes, but the MCP library's _validate_tool_result then internally calls list_tools(), which sends a new request using the client-level 0.1s timeout rather than the per-call override. On Python 3.10 (ubuntu-latest only), this post-call validation step times out. The failure is timing-sensitive and doesn't reproduce on Python 3.13 or Python 3.10 Windows.

This PR only modifies src/fastmcp/server/middleware/rate_limiting.py and has no connection to client timeout logic.

Log excerpt:
FAILED tests/client/client/test_timeout.py::TestTimeout::test_timeout_tool_call_overrides_client_timeout_even_if_lower
  - mcp.shared.exceptions.McpError: Timed out while waiting for response to ClientRequest. Waited 0.1 seconds.

Call stack: call_tool (tools.py:305) → call_tool_mcp (tools.py:161) → _await_with_session_monitoring (client.py:762)
  → mcp/client/session.py:395 call_tool → :403 _validate_tool_result → :529 list_tools → send_request [timeout]
Related files:
  • tests/client/client/test_timeout.py:37–44 — the failing test (pre-exists this PR, present in main)
  • src/fastmcp/client/mixins/tools.py:161 — call_tool_mcp, which passes to _await_with_session_monitoring
  • src/fastmcp/client/client.py:762 — _await_with_session_monitoring
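
For reference, a minimal repro of the failing pattern described above (the server path and sleep tool are hypothetical; the timeout arguments follow the test):

    import asyncio
    from fastmcp import Client

    async def main():
        # Client-level timeout is 0.1 s; the per-call timeout=2 should win.
        async with Client("sleep_server.py", timeout=0.1) as client:
            # The call itself finishes, but the library's post-call
            # _validate_tool_result issues list_tools() under the
            # client-level 0.1 s timeout, which can expire on slow runners.
            await client.call_tool("sleep", {"seconds": 0.5}, timeout=2)

    asyncio.run(main())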

@strawgate force-pushed the codex/rate-limiting-fix branch from e5c8f0f to a6dde66 on April 26, 2026 at 06:16
@chatgpt-codex-connector (Bot) left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a6dde667c1

    stale_ids = [
        cid
        for cid, last_access in self._limiter_access_times.items()
        if now - last_access > self._limiter_ttl
    ]

P1: Preserve token debt when evicting stale client limiters

Evicting buckets purely on a fixed 300s inactivity window can under-enforce low-rate limits: if capacity / refill_rate is greater than 5 minutes, a client can be removed before its bucket would naturally refill, then recreated with full tokens on the next access (_get_limiter), effectively erasing accrued debt. In a multi-client server, traffic from other clients will trigger this cleanup path, so a client configured for very slow refill (e.g. 0.001 rps) can regain burst capacity far earlier than configured.


@strawgate (Collaborator, Author) replied:

Valid concern. The 5-min TTL can indeed erase token debt for very low rate limits (e.g., 1 RPS with high capacity). I've increased the TTL to 1 hour (3600 s), which is a better trade-off: the memory leak is solved for realistic use cases (clients returning within an hour), and token-debt leakage only affects sustained low-rate clients that are inactive for more than an hour.

For servers needing stricter guarantees with very low rate limits, a follow-up could replace TTL-based eviction with LRU eviction or track last_refill_time for more precise eviction. This fix addresses the primary issue (unbounded memory growth) with a pragmatic balance.
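
As a quick sanity check on that trade-off (numbers are illustrative, using the slow-refill example from the review comment):

    capacity = 10         # burst size (illustrative)
    refill_rate = 0.001   # tokens per second, the review's slow-refill example
    ttl = 3600            # the new TTL in seconds

    natural_refill_time = capacity / refill_rate  # 10_000 s, roughly 2.8 hours
    print(natural_refill_time > ttl)              # True: eviction can still reset debt early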

@strawgate (Collaborator, Author) added:

Good catch. Rather than trying to tune the TTL (which creates second-order effects as you describe), I've replaced the TTL-based approach entirely with an LRU cache using stdlib OrderedDict. This:

  1. Bounds memory at max_clients (default 10,000) without any TTL thread
  2. Uses move_to_end for O(1) LRU access
  3. Evicts the least recently used client when capacity is reached
  4. Simpler code than the TTL approach — no timestamps, no cleanup scheduling

This is a better fit for the problem: per-client state is bounded by a fixed capacity, and LRU naturally handles the "most active clients keep their state" semantics.

@strawgate force-pushed the codex/rate-limiting-fix branch from a6dde66 to 30869a2 on April 26, 2026 at 15:13
@chatgpt-codex-connector (Bot) left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 30869a2e25

Comment on lines +157 to +158:

    if len(self._client_limiters) > self._max_clients:
        self._client_limiters.popitem(last=False)

P1: Preserve token-bucket state when evicting LRU clients

Evicting the oldest limiter here discards that client's remaining-token debt, so once active client cardinality exceeds max_clients, a previously throttled client is recreated with a full bucket on its next request. This allows configured per-client limits to be bypassed under high-cardinality traffic (or by cycling client IDs), because rate-limit state is reset by eviction rather than refill.


Comment on lines +239 to +240:

    if len(self._client_limiters) > self._max_clients:
        self._client_limiters.popitem(last=False)

P1: Retain sliding-window history across cache pressure

This LRU eviction drops a client's timestamp deque entirely; if the middleware sees more than max_clients distinct clients, an evicted client returns with an empty window and can immediately send another full max_requests burst. That under-enforces the advertised window limit whenever client cardinality crosses the cache size.

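A hypothetical snippet illustrating both review points (all names and numbers are invented; max_clients is 2 for brevity):

    from collections import OrderedDict

    MAX_CLIENTS = 2
    limiters: OrderedDict[str, dict] = OrderedDict()

    def get_limiter(client_id: str) -> dict:
        if client_id in limiters:
            limiters.move_to_end(client_id)
        else:
            limiters[client_id] = {"tokens": 5.0}  # fresh, full bucket
            if len(limiters) > MAX_CLIENTS:
                limiters.popitem(last=False)       # drops the evicted client's state
        return limiters[client_id]

    get_limiter("a")["tokens"] = 0.0    # "a" is fully throttled
    get_limiter("b"); get_limiter("c")  # two new clients evict "a"
    print(get_limiter("a")["tokens"])   # 5.0 -- "a" returns with a full bucket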

@strawgate force-pushed the codex/rate-limiting-fix branch from 30869a2 to d280a50 on May 13, 2026 at 03:40