@zastrowm zastrowm commented Nov 11, 2025

Description

Problem

Per #995, if an MCP tool_call receives a 5XX error from the server, the call hangs and never completes.

Root Cause

The root cause is that Anthropic's MCP client, on receiving a 5XX, bubbles up an exception that cancels all TaskGroup tasks. The session, client, and asyncio loop are then torn down while the tool_call is still pending, so it never resolves and the caller hangs.

Error flow

When a tool_call makes a new request:
  1. A new request is sent to the server
  2. The server returns a 500
  3. Our underlying MCP client raises an exception on the 5XX response
  4. As a result, the entire task group is taken down, which tears down the entire asyncio background thread

The bug is that our MCP tool implementation doesn't account for (4), so the pending call hangs. There's a longer-term follow-up question of whether a 5XX should take down the entire connection, but this PR scopes the fix to the hang.
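To illustrate the failure mode, here is a minimal stdlib-only sketch, not the actual mcp_client.py code. The names `call_tool_without_guard`, `session`, and `background` are hypothetical, and a plain `RuntimeError` stands in for the exception the MCP client raises on a 5XX. The caller waits on a result future, the background loop dies, and nothing ever resolves that future. The timeout exists only so the example terminates; without it the wait would block forever.

```python
import asyncio
import concurrent.futures
import threading


def call_tool_without_guard(timeout: float) -> str:
    """Wait on a result future that a dead background loop never resolves."""
    result: concurrent.futures.Future = concurrent.futures.Future()

    async def session() -> None:
        # Stand-in for the MCP session: the server answered 500 and the
        # client raised, which ends the coroutine that owns the loop.
        raise RuntimeError("500 Internal Server Error")

    def background() -> None:
        try:
            asyncio.run(session())
        except RuntimeError:
            pass  # The loop is gone; nobody is left to resolve `result`.

    thread = threading.Thread(target=background, daemon=True)
    thread.start()
    thread.join()

    try:
        # Without a timeout this call would never return.
        return result.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return "hung"
```

Calling `call_tool_without_guard(0.1)` returns `"hung"`, which mirrors what the caller of the tool observed before this change.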

Fix

The fix is twofold:

  • Detect when this situation occurs and resolve a close_future future
  • Update all background_invokes to bail out eagerly once close_future is triggered
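The two steps above can be sketched as a race between the pending work and `close_future`. This is a simplified, asyncio-only illustration under stated assumptions, not the code merged in this PR: `invoke_with_bail`, `ConnectionClosedError`, and `demo` are hypothetical names, and the real implementation may coordinate across threads differently.

```python
import asyncio


class ConnectionClosedError(RuntimeError):
    """Hypothetical error raised when the session closes mid-call."""


async def invoke_with_bail(work, close_future: asyncio.Future):
    """Await `work`, but bail out as soon as `close_future` resolves."""
    task = asyncio.ensure_future(work)
    done, _ = await asyncio.wait(
        {task, close_future}, return_when=asyncio.FIRST_COMPLETED
    )
    if task in done:
        return task.result()
    # The session closed first: stop waiting and surface an error.
    task.cancel()
    raise ConnectionClosedError(str(close_future.result()))


async def demo() -> str:
    loop = asyncio.get_running_loop()
    close_future = loop.create_future()

    async def never_finishes() -> None:
        # Stand-in for a tool_call whose response will never arrive.
        await asyncio.Event().wait()

    # Simulate the session detecting the fatal 5XX shortly after the call starts.
    loop.call_later(0.05, close_future.set_result, "server returned 500")
    try:
        await invoke_with_bail(never_finishes(), close_future)
    except ConnectionClosedError as exc:
        return f"error: {exc}"
    return "ok"
```

Running `asyncio.run(demo())` returns `"error: server returned 500"` instead of hanging: the call resolves with an error as soon as the close signal fires.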

Notes

Testing

Added an integ test to verify that:

  • the tool_call resolves with an error
  • the MCP session bails out with an error

This test failed before the changes and passes now.

Related Issues

#995

Documentation PR

N/A

Type of Change

Bug fix

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

codecov bot commented Nov 11, 2025

Codecov Report

❌ Patch coverage is 56.00000% with 11 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/strands/tools/mcp/mcp_client.py 56.00% 8 Missing and 3 partials ⚠️


@zastrowm zastrowm marked this pull request as ready for review November 11, 2025 18:56
@zastrowm zastrowm merged commit 57e2081 into strands-agents:main Nov 12, 2025
13 of 14 checks passed