feat: implement MCP tool error isolation and configurable retries #7887
Shlok148Dev wants to merge 1 commit into
…abort entire agent run
a40f72c to 04b3625
@microsoft-github-policy-service agree
@Shlok148Dev Nice work on the error isolation. We hit the same problem with MCP tool failures in HeartFlow and ended up adding a decision-gated retry layer on top of the transport-level retry you've implemented. The key insight: not all errors should be retried the same way. HeartFlow's decision router classifies MCP errors into:
This prevents the "retried a logical error 3 times and wasted 14s" problem. The configurable retry policy you added (max_retries/retry_delay/raise_on_error) is the right foundation; adding an error classifier on top would make it smarter. HeartFlow is open source: github.com/yun520-1/mark-heartflow-skill
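A decision-gated retry of the kind described in this comment might look roughly like the sketch below. The two categories, the `classify` rule, and `call_with_gated_retry` are hypothetical names chosen for illustration; they are not HeartFlow's actual categories and are not part of this PR.

```python
from enum import Enum


class ErrorKind(Enum):
    TRANSIENT = "transient"  # worth retrying (timeouts, dropped connections)
    LOGICAL = "logical"      # retrying will not help (bad arguments, tool-level failure)


def classify(exc: BaseException) -> ErrorKind:
    """Assumed rule: only built-in timeout/connection errors count as transient."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorKind.TRANSIENT
    return ErrorKind.LOGICAL


def call_with_gated_retry(fn, max_retries: int = 3):
    """Call fn(); retry only while the failure is classified as transient."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return fn()
        except Exception as exc:
            # Logical errors fail fast; transient ones retry until the budget is spent.
            if classify(exc) is ErrorKind.LOGICAL or attempts > max_retries:
                raise
```

With this gate, a `ValueError` from bad arguments surfaces after a single attempt, while a `TimeoutError` is retried up to `max_retries` times.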
Thanks for the feedback @yun520-1! It's great to hear this aligns with what you've built in HeartFlow. To prevent the exact "wasted retry time" scenario you mentioned, this PR actually implements that separation:
I'll check out the HeartFlow repository as well! I think adding more granular error classifiers or exponential backoff options on top of this foundation is a great direction for future iterations.
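As a rough illustration of the exponential backoff option mentioned as a future direction, the delay schedule could be derived from the existing `retry_delay` and `max_retries` settings. The `factor` and `max_delay` parameters and the `backoff_delays` helper are assumptions made for this sketch, not options this PR adds.

```python
def backoff_delays(
    retry_delay: float,
    max_retries: int,
    factor: float = 2.0,      # hypothetical growth factor per retry
    max_delay: float = 30.0,  # hypothetical upper bound on any single wait
) -> list[float]:
    """Return the wait before each retry: retry_delay, retry_delay * factor, ... capped."""
    return [min(retry_delay * factor**i, max_delay) for i in range(max_retries)]
```

For example, `backoff_delays(0.5, 4)` yields `[0.5, 1.0, 2.0, 4.0]`, whereas a fixed delay would wait 0.5 seconds before every retry.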
Summary
This PR addresses issue #7851 by introducing error isolation and optional retry policies for MCP tool adapters.
Currently, if one tool fails (e.g., transport timeout/infrastructure failure or execution error) during a multi-tool execution session, the exception bubbles up and aborts the entire agent run. This change intercepts transient transport/connection failures and MCP-reported execution errors, returning a structured `ToolResult(is_error=True)` instead of throwing a fatal exception.
Key Changes
`McpToolAdapter` (`autogen-ext`):
- `max_retries`, `retry_delay`, and `raise_on_error` parameters.
- … `[TimeoutError]`) to the textual output of the `ToolResult` to retain context for debugging.
- `raise_on_error: bool = False` opt-in escape hatch for backward compatibility.
`StaticWorkbench` & `StaticStreamWorkbench` (`autogen-core`):
- … `ToolResult` directly. This specifically avoids double-wrapping the adapter's returned `ToolResult(is_error=True)` inside a second, redundant `ToolResult` wrapper, which would corrupt the error result schema.
- … `StaticWorkbench` integration handling.
- `raise_on_error=True` continues to propagate exceptions.
Verification Results
All formatting, lint checks, type checks, and tests were executed and are passing:
- `ruff format --check` and `ruff check` passed.
- `mypy` type checks passed (`Success: no issues found in 20 source files` across both `autogen-core` and `autogen-ext`).
- `pytest packages/autogen-ext/tests/tools/test_mcp_tools.py` successfully completed all tests.
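To make the behaviour described under Key Changes concrete, here is a minimal, self-contained sketch of the idea. It is not the PR's code: the `ToolResult` dataclass is a stand-in rather than the real `autogen-core` type, `call_tool_isolated` and `to_workbench_result` are hypothetical helpers, the `[ExceptionType] message` text format is assumed, and for simplicity every exception is retried.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ToolResult:
    """Stand-in for illustration only; not the real autogen-core ToolResult."""
    name: str
    text: str
    is_error: bool = False


async def call_tool_isolated(name, call, *, max_retries=0, retry_delay=0.0, raise_on_error=False):
    """Await call(); on failure retry, then return an error result instead of raising."""
    last_exc = None
    for attempt in range(max_retries + 1):
        try:
            return ToolResult(name=name, text=await call())
        except Exception as exc:
            last_exc = exc
            if attempt < max_retries:
                await asyncio.sleep(retry_delay)  # fixed delay between attempts
    if raise_on_error:
        raise last_exc  # opt-in: propagate the failure to the caller
    # Keep the exception type in the text so the failure stays debuggable.
    return ToolResult(name=name, text=f"[{type(last_exc).__name__}] {last_exc}", is_error=True)


def to_workbench_result(value, name):
    """Pass an existing ToolResult through unchanged; wrap anything else."""
    if isinstance(value, ToolResult):
        return value  # avoids wrapping an error result a second time
    return ToolResult(name=name, text=str(value))
```

In this sketch a failing tool produces a result with `is_error=True` that the caller can report back to the model, so the remaining tools in the session still run; passing `raise_on_error=True` restores the fail-fast behaviour.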