feat: implement MCP tool error isolation and configurable retries #7887

Open
Shlok148Dev wants to merge 1 commit into microsoft:main from Shlok148Dev:feature/mcp-tool-error-isolation

@Shlok148Dev

Summary

This PR addresses issue #7851 by introducing error isolation and optional retry policies for MCP tool adapters.

Currently, if one tool fails during a multi-tool execution session (e.g., a transport timeout, an infrastructure failure, or an execution error), the exception bubbles up and aborts the entire agent run. This change intercepts transient transport/connection failures and MCP-reported execution errors and returns a structured ToolResult(is_error=True) instead of raising a fatal exception.
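
As a rough illustration of the isolation behavior described above, the sketch below converts a failing call into an error result so that a later call in the same session still runs. It does not use the autogen API: SketchToolResult, call_isolated, and the two demo tools are hypothetical names invented for this example, and only the raise_on_error flag and the [ExceptionName] prefix mirror what this PR describes.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SketchToolResult:
    """Hypothetical stand-in for a tool result; not the autogen-core class."""

    name: str
    text: str
    is_error: bool = False


async def call_isolated(name, func, *args, raise_on_error=False):
    """Run one tool call and convert a failure into an error result."""
    try:
        return SketchToolResult(name=name, text=str(await func(*args)))
    except Exception as exc:
        if raise_on_error:
            # Opt-in escape hatch: propagate the exception unchanged.
            raise
        # Prefix the exception class name so the failure type stays visible.
        return SketchToolResult(
            name=name, text=f"[{type(exc).__name__}] {exc}", is_error=True
        )


async def flaky_tool(_query):
    """Demo tool that always fails with a transport-style error."""
    raise TimeoutError("transport timed out")


async def healthy_tool(query):
    """Demo tool that always succeeds."""
    return f"result for {query}"


async def run_session():
    # Both tools run; the first failure does not abort the second call.
    return [
        await call_isolated("flaky", flaky_tool, "a"),
        await call_isolated("healthy", healthy_tool, "b"),
    ]


results = asyncio.run(run_session())
```

Here results holds one error result followed by one normal result, so the caller can decide how to react to the failure instead of losing the whole run.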

Key Changes

  • McpToolAdapter (autogen-ext):
    • Added handling that catches transient transport/connection failures and MCP-reported execution errors.
    • Added max_retries, retry_delay, and raise_on_error parameters.
    • Implemented a retry policy with a configurable delay for transient infrastructure/transport failures; logical tool errors are not retried.
    • Prepended the exception class name (e.g., [TimeoutError]) to the text output of the ToolResult to retain context for debugging.
    • Made raise_on_error: bool = False an opt-in escape hatch: setting it to True keeps the previous exception-propagating behavior for backward compatibility.
  • StaticWorkbench & StaticStreamWorkbench (autogen-core):
    • Updated tool call handlers to pass through any returned ToolResult directly. This specifically avoids double-wrapping the adapter's returned ToolResult(is_error=True) inside a second, redundant ToolResult wrapper, which would corrupt the error result schema.
  • Integration & Unit Tests:
    • Added test cases verifying retry success, retry exhaustion, and StaticWorkbench integration handling.
    • Verified raise_on_error=True continues to propagate exceptions.
    • Added explicit GC handling for flaky workbench cleanup tests.
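
The retry policy and the pass-through rule listed above can be sketched roughly as follows. This is not the code in the PR: SketchToolResult, call_with_retries, to_workbench_result, and the TRANSIENT tuple are hypothetical, and the sketch assumes that max_retries counts retries in addition to the first attempt and that only TimeoutError and ConnectionError are treated as transient.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SketchToolResult:
    """Hypothetical stand-in for a tool result; not the autogen-core class."""

    text: str
    is_error: bool = False


# Assumption: which exception types count as transient transport failures.
TRANSIENT = (TimeoutError, ConnectionError)


async def call_with_retries(func, *, max_retries=2, retry_delay=0.0):
    """Return (result, attempts). Only transient failures are retried."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return SketchToolResult(text=str(await func())), attempts
        except TRANSIENT as exc:
            if attempts > max_retries:
                # Retries exhausted: report the failure as an error result.
                text = f"[{type(exc).__name__}] {exc}"
                return SketchToolResult(text=text, is_error=True), attempts
            await asyncio.sleep(retry_delay)
        except Exception as exc:
            # Logical tool errors are reported immediately, never retried.
            text = f"[{type(exc).__name__}] {exc}"
            return SketchToolResult(text=text, is_error=True), attempts


def to_workbench_result(value):
    """Pass an existing result through instead of wrapping it a second time."""
    if isinstance(value, SketchToolResult):
        return value
    return SketchToolResult(text=str(value))


def make_flaky(failures):
    """Build a demo tool that fails `failures` times, then succeeds."""
    state = {"left": failures}

    async def tool():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("connection reset")
        return "ok"

    return tool


async def bad_input():
    """Demo tool that fails with a logical (non-transient) error."""
    raise ValueError("unknown argument")


recovered, recovered_attempts = asyncio.run(call_with_retries(make_flaky(2)))
exhausted, exhausted_attempts = asyncio.run(call_with_retries(make_flaky(5)))
logical, logical_attempts = asyncio.run(call_with_retries(bad_input))
```

In this sketch the tool that fails twice recovers on the third attempt, the tool that keeps failing yields an error result after the retries run out, and the ValueError is reported after a single attempt. to_workbench_result returns the same object it was given when that object is already a result, which is the double-wrapping guard in miniature.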

Verification Results

All formatting, lint checks, type checks, and tests were executed and are passing:

  • ruff format --check and ruff check passed.
  • mypy type checks passed (Success: no issues found in 20 source files across both autogen-core and autogen-ext).
  • pytest packages/autogen-ext/tests/tools/test_mcp_tools.py passed all tests.

@Shlok148Dev force-pushed the feature/mcp-tool-error-isolation branch from a40f72c to 04b3625 on June 25, 2026 18:55
@Shlok148Dev

@microsoft-github-policy-service agree

@yun520-1

@Shlok148Dev Nice work on the error isolation. We hit the same problem with MCP tool failures in HeartFlow and ended up adding a decision-gated retry layer on top of the transport-level retry you've implemented.

The key insight: not all errors should be retried the same way. HeartFlow's decision router classifies MCP errors into:

  • Transient (timeout, connection reset) → retry with exponential backoff (3 attempts, 2s/4s/8s)
  • Semantic (tool returns is_error=True with valid data) → don't retry, route to verification layer
  • Fatal (auth failure, invalid schema) → immediate abort, no retry

This prevents the "retried a logical error 3 times and wasted 14s" problem. The configurable retry policy you added (max_retries/retry_delay/raise_on_error) is the right foundation; adding an error classifier on top would make it smarter.
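
A minimal sketch of such a classifier, assuming the three classes and the 2s/4s/8s schedule described above. It is neither HeartFlow's implementation nor part of this PR: ErrorKind, ToolReportedError, AuthFailure, SchemaError, classify, backoff_delays, and retry_plan are all hypothetical names, and treating unknown exceptions as fatal is an assumption made for the example.

```python
from enum import Enum


class ErrorKind(Enum):
    TRANSIENT = "transient"
    SEMANTIC = "semantic"
    FATAL = "fatal"


class ToolReportedError(Exception):
    """Hypothetical: the tool answered, but flagged its own result as an error."""


class AuthFailure(Exception):
    """Hypothetical: credentials were rejected."""


class SchemaError(Exception):
    """Hypothetical: the request or response did not match the tool schema."""


def classify(exc):
    """Map an exception to one of the three classes."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorKind.TRANSIENT
    if isinstance(exc, ToolReportedError):
        return ErrorKind.SEMANTIC
    # Assumption: auth, schema, and any unknown failure abort immediately.
    return ErrorKind.FATAL


def backoff_delays(attempts=3, base=2.0):
    """Exponential backoff schedule in seconds, doubling each attempt."""
    return [base * 2**i for i in range(attempts)]


def retry_plan(exc):
    """Return the error kind and the delays to wait before each retry."""
    kind = classify(exc)
    delays = backoff_delays() if kind is ErrorKind.TRANSIENT else []
    return kind, delays


kind, delays = retry_plan(ConnectionResetError("connection reset"))
semantic_kind, semantic_delays = retry_plan(ToolReportedError("bad query"))
fatal_kind, fatal_delays = retry_plan(AuthFailure("token expired"))
```

Only the transient case gets a non-empty schedule. The semantic and fatal cases return an empty list, so a caller looping over the delays never waits on an error that a retry cannot fix.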

HeartFlow is open source: github.com/yun520-1/mark-heartflow-skill
