
Starlark scripted tools for vMCP#51

Open
jerm-dro wants to merge 7 commits into main from jerm/2026-03-06-jerm-starlark

Conversation

@jerm-dro

@jerm-dro jerm-dro commented Mar 7, 2026

Summary

Proposes replacing vMCP's declarative composite tools system (DAG + Go templates) with a Starlark scripting engine for multi-step tool workflows.

The current composite tools system hits hard limits: no iteration over results, no dynamic branching, and awkward Go template data flow. Starlark provides iteration, conditional branching, dynamic tool dispatch, and data transformation while maintaining sandboxed execution with no arbitrary I/O.
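To make the contrast concrete, here is a sketch (in Python, which Starlark's syntax closely mirrors) of the kind of workflow the DAG system cannot express. The `call_tool()` stub and the tool names (`repos_list`, `repo_archive`) are hypothetical stand-ins, not part of the RFC's actual API surface:

```python
# Stub standing in for vMCP's proposed call_tool() builtin, with canned data.
FAKE_RESULTS = {
    "repos_list": {"repos": [{"name": "a", "stale": True},
                             {"name": "b", "stale": False}]},
}

def call_tool(name, args=None):
    # Dynamic dispatch: the tool name can be computed at runtime.
    return FAKE_RESULTS.get(name, {"ok": True, "tool": name})

def archive_stale_repos():
    result = call_tool("repos_list")
    archived = []
    # Iteration over a previous tool result -- not expressible in a static DAG.
    for repo in result["repos"]:
        # Conditional branching on per-item data.
        if repo["stale"]:
            call_tool("repo_archive", {"name": repo["name"]})
            archived.append(repo["name"])
    return {"archived": archived}
```

The loop and the `if` each depend on data that only exists at run time, which is exactly what a declarative DAG with Go templates cannot model.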

Key design decisions:

  • Two-builtin error handling: call_tool() halts on error (common case), try_call_tool() returns error info (opt-in handling) — works around Starlark's lack of try/except
  • Parallel system: Runs alongside existing composite tools during migration, then replaces them
  • Small builtin API: call_tool, try_call_tool, retry, elicit, parallel, log
  • Code reuse via load(): Shared helper libraries in .star files
  • V1 and V2 session compatible: Engine sits behind the handler factory
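The two-builtin error-handling split can be modeled in plain Python as follows. This is an illustrative sketch of the semantics only: the `_invoke()` helper, the `flaky_tool` name, and the result shape are all invented for the example, and "halts on error" is modeled here as a raised exception:

```python
class ToolError(Exception):
    """Stands in for the engine halting the script on a failed tool call."""
    pass

def _invoke(name, args):
    # Hypothetical transport layer; "flaky_tool" always fails in this sketch.
    if name == "flaky_tool":
        return {"error": "upstream timed out", "result": None}
    return {"error": None, "result": {"ok": True}}

def call_tool(name, args=None):
    # Common case: any error halts the workflow immediately.
    out = _invoke(name, args)
    if out["error"]:
        raise ToolError(out["error"])
    return out["result"]

def try_call_tool(name, args=None):
    # Opt-in handling: caller inspects the error field explicitly.
    return _invoke(name, args)

def workflow():
    r = try_call_tool("flaky_tool")
    if r["error"]:
        return {"status": "fallback", "reason": r["error"]}
    return {"status": "ok"}
```

Because Starlark has no try/except, the pair gives scripts both fail-fast defaults and explicit error handling without exception syntax.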

Why Starlark

  • BSD 3-Clause license (compatible with Apache 2.0)
  • Purpose-built for embedded use in Go (Bazel, Buck2, Tilt, Cirrus CI)
  • Sandboxed by design: no I/O, no network, no OS access
  • Python-like syntax lowers learning curve
  • Alternatives evaluated: Risor (security risk from Go stdlib access), Tengo (unfamiliar syntax), Goja/JS (large attack surface), Wasm (overkill)

🤖 Generated with Claude Code

jerm-dro and others added 2 commits March 6, 2026 17:38
Proposes replacing vMCP's declarative composite tools system (DAG + Go
templates) with a Starlark scripting engine for multi-step tool
workflows. Starlark provides iteration, conditional branching, dynamic
dispatch, and data transformation while maintaining sandboxed execution
with no arbitrary I/O.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jerm-dro and others added 5 commits March 6, 2026 18:58
- Remove V2 session incompatibility bullet (temporary concern)
- Remove V1/V2 session goal (not relevant to RFC scope)
- Add parallel execution to Goals section
- Remove migration tooling phase (unnecessary)
- Remove fuzz tests from testing strategy (unnecessary)
- Simplify migration path (3 phases instead of 4)
- Add docs-website to documentation requirements
- Resolve naming question: scripted and composite are interchangeable
- Remove json.encode/decode open question

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Starlark's sandbox makes it feasible for agents to dynamically
compose and submit scripts to vMCP at runtime — something a
declarative YAML DSL could never support safely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Six practical examples covering structured data manipulation,
returning structured data, JSON-as-string parsing from legacy
servers, fan-out with parallel, error handling patterns, and
elicitation. Includes a callout explaining when to use dict
indexing (tool results) vs attribute access (builtin return values).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kantord
Member

kantord commented Mar 9, 2026

This RFC is incredible! Seems like I'm the first human reviewer and considering that, honestly I'm impressed by how thorough it already is!

I do have some thoughts but honestly, none of them are blockers, I just wanted to give more than a "Looks good to me". I did have to scrape the barrel to think of some things that might be actionable:

  • Perhaps for error handling, it would make sense to distinguish between runtime errors vs. input validation errors vs. output validation errors?
  • Streaming/large data: this may be mostly theoretical based on how MCP is used today, but if workflows start dispatching large fan-outs (e.g., calling a github tool across thousands of repos), the lack of streaming primitives in Starlark could become a limitation, especially when combined with strict memory limitations
  • Human-in-the-loop using elicit() is already covered, but I'm wondering if we also need to think about MCP apps.
  • About the http built-in: I would actually try to rely on MCP tool calls for things like this. This way we could load all of the complexities of access control and whatnot into a single system (MCP).
  • Reusing Starlark code across different scripts could open a can of worms, such as one change generating a cascading failure. Perhaps this is another thing we can push onto MCP, just use MCP tool calls to reuse one vMCP script in another. At least in that case we could have things like schema-based contracts (you get an error if you add a breaking change in an output schema that would generate a domino-effect.) In any case, the feature of load() might be premature without knowing exactly how vMCP would be used in real life.
  • Automated testing for these scripts might be challenging/nearly impossible. A simple example-based test runner would be trivial to add, but mocking MCP calls would be a problem. Perhaps there is a good MCP testing system that we could integrate into Toolhive that would solve these problems?

@jerm-dro
Author

Thanks for the thoughtful review @kantord 😄

  • Streaming/large data: this may be mostly theoretical based on how MCP is used today, but if workflows start dispatching large fan-outs (e.g., calling a github tool across thousands of repos), the lack of streaming primitives in Starlark could become a limitation, especially when combined with strict memory limitations

This is an important thing to call out. I think it can be solved by building an implementation of Iterable: something within vMCP recognizes "hey, this is a huge array / dict / paginated response" and turns it into an Iterable.
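One possible shape for that Iterable idea, sketched in Python: a generator that pulls one page at a time so the script never materializes the full result. The page layout (`items` / `next_cursor`) and `fetch_page` are illustrative assumptions, not anything vMCP defines today:

```python
# Canned two-page response standing in for a paginated upstream MCP tool.
PAGES = {
    None: {"items": [1, 2], "next_cursor": "c1"},
    "c1": {"items": [3, 4], "next_cursor": None},
}

def fetch_page(cursor):
    # Stub for one upstream tool call returning a single page.
    return PAGES[cursor]

def paged_iter(fetch):
    # Generator: yields items lazily, page by page, bounding memory use.
    cursor = None
    while True:
        page = fetch(cursor)
        for item in page["items"]:
            yield item
        cursor = page["next_cursor"]
        if cursor is None:
            return
```

A script looping over such an iterator would fan out across thousands of repos while only ever holding one page in memory.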

  • Human-in-the-loop using elicit() is already covered, but I'm wondering if we also need to think about MCP apps.

Can you say more about what you're imagining? I'm not that familiar with MCP apps.

  • About the http built-in: I would actually try to rely on MCP tool calls for things like this. This way we could load all of the complexities of access control and whatnot into a single system (MCP).

Yes, that makes perfect sense. If users want an http tool, they could add a fetch tool.

  • Reusing Starlark code across different scripts could open a can of worms, such as one change generating a cascading failure. Perhaps this is another thing we can push onto MCP, just use MCP tool calls to reuse one vMCP script in another. At least in that case we could have things like schema-based contracts (you get an error if you add a breaking change in an output schema that would generate a domino-effect.) In any case, the feature of load() might be premature without knowing exactly how vMCP would be used in real life.

Yea, I could see it getting hairy too. If we get to the point people (or agents) are writing so much code that we think reuse is important, then that's a good problem to have. I like the ideas you have though. Let's wait until this problem needs more attention.

  • Automated testing for these scripts might be challenging/nearly impossible. A simple example-based test runner would be trivial to add, but mocking MCP calls would be a problem. Perhaps there is a good MCP testing system that we could integrate into Toolhive that would solve these problems?

Yes, it could be frustrating, especially with the schemas of the underlying MCP servers potentially changing.

Some thoughts here:

  • we could build a thv repl for manual iteration
  • we could add a custom type checker that validates the inputs to tool calls and the use of their responses

I think this is another thing where we have to wait for the problem to arise to know what solution is justified.
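If a simple example-based runner ever does get built, the mocking half might look something like this sketch: canned responses keyed by tool name, plus a call log for assertions. Everything here (`make_mock`, the `users_list` tool, the `summarize` script) is hypothetical:

```python
def make_mock(canned):
    """Return a fake call_tool backed by canned responses, plus a call log."""
    calls = []
    def call_tool(name, args=None):
        calls.append((name, args))  # record for later assertions
        return canned[name]
    return call_tool, calls

def summarize(call_tool):
    # Example script under test; takes call_tool as an injected dependency.
    users = call_tool("users_list")["users"]
    return {"count": len(users)}

mock_call_tool, calls = make_mock({"users_list": {"users": ["ann", "bob"]}})
```

The awkward part the thread already identifies remains: canned responses drift as upstream server schemas change, so this only catches regressions in the script itself.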
