
Starlark scripted tools for vMCP#51

Open
jerm-dro wants to merge 7 commits into main from jerm/2026-03-06-jerm-starlark

Conversation

@jerm-dro

@jerm-dro jerm-dro commented Mar 7, 2026

Summary

Proposes replacing vMCP's declarative composite tools system (DAG + Go templates) with a Starlark scripting engine for multi-step tool workflows.

The current composite tools system hits hard limits: no iteration over results, no dynamic branching, and awkward Go template data flow. Starlark provides iteration, conditional branching, dynamic tool dispatch, and data transformation while maintaining sandboxed execution with no arbitrary I/O.
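To make the contrast concrete, here is a sketch (in Python, which Starlark's syntax closely mirrors) of the kind of workflow the DAG system cannot express. The `call_tool()` stub and the tool names (`repos_list`, `repo_archive`) are hypothetical stand-ins, not part of the RFC's actual API surface:

```python
# Stub standing in for vMCP's proposed call_tool() builtin, with canned data.
FAKE_RESULTS = {
    "repos_list": {"repos": [{"name": "a", "stale": True},
                             {"name": "b", "stale": False}]},
}

def call_tool(name, args=None):
    # Dynamic dispatch: the tool name can be computed at runtime.
    return FAKE_RESULTS.get(name, {"ok": True, "tool": name})

def archive_stale_repos():
    result = call_tool("repos_list")
    archived = []
    # Iteration over a previous tool result -- not expressible in a static DAG.
    for repo in result["repos"]:
        # Conditional branching on per-item data.
        if repo["stale"]:
            call_tool("repo_archive", {"name": repo["name"]})
            archived.append(repo["name"])
    return {"archived": archived}
```

The loop and the `if` each depend on data that only exists at run time, which is exactly what a declarative DAG with Go templates cannot model.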

Key design decisions:

  • Two-builtin error handling: call_tool() halts on error (common case), try_call_tool() returns error info (opt-in handling) — works around Starlark's lack of try/except
  • Parallel system: Runs alongside existing composite tools during migration, then replaces them
  • Small builtin API: call_tool, try_call_tool, retry, elicit, parallel, log
  • Code reuse via load(): Shared helper libraries in .star files
  • V1 and V2 session compatible: Engine sits behind the handler factory
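The two-builtin error-handling split can be modeled in plain Python as follows. This is an illustrative sketch of the semantics only: the `_invoke()` helper, the `flaky_tool` name, and the result shape are all invented for the example, and "halts on error" is modeled here as a raised exception:

```python
class ToolError(Exception):
    """Stands in for the engine halting the script on a failed tool call."""
    pass

def _invoke(name, args):
    # Hypothetical transport layer; "flaky_tool" always fails in this sketch.
    if name == "flaky_tool":
        return {"error": "upstream timed out", "result": None}
    return {"error": None, "result": {"ok": True}}

def call_tool(name, args=None):
    # Common case: any error halts the workflow immediately.
    out = _invoke(name, args)
    if out["error"]:
        raise ToolError(out["error"])
    return out["result"]

def try_call_tool(name, args=None):
    # Opt-in handling: caller inspects the error field explicitly.
    return _invoke(name, args)

def workflow():
    r = try_call_tool("flaky_tool")
    if r["error"]:
        return {"status": "fallback", "reason": r["error"]}
    return {"status": "ok"}
```

Because Starlark has no try/except, the pair gives scripts both fail-fast defaults and explicit error handling without exception syntax.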

Why Starlark

  • BSD 3-Clause license (compatible with Apache 2.0)
  • Purpose-built for embedded use in Go (Bazel, Buck2, Tilt, Cirrus CI)
  • Sandboxed by design: no I/O, no network, no OS access
  • Python-like syntax lowers learning curve
  • Alternatives evaluated: Risor (security risk from Go stdlib access), Tengo (unfamiliar syntax), Goja/JS (large attack surface), Wasm (overkill)

🤖 Generated with Claude Code

jerm-dro and others added 2 commits March 6, 2026 17:38
Proposes replacing vMCP's declarative composite tools system (DAG + Go
templates) with a Starlark scripting engine for multi-step tool
workflows. Starlark provides iteration, conditional branching, dynamic
dispatch, and data transformation while maintaining sandboxed execution
with no arbitrary I/O.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jerm-dro and others added 5 commits March 6, 2026 18:58
- Remove V2 session incompatibility bullet (temporary concern)
- Remove V1/V2 session goal (not relevant to RFC scope)
- Add parallel execution to Goals section
- Remove migration tooling phase (unnecessary)
- Remove fuzz tests from testing strategy (unnecessary)
- Simplify migration path (3 phases instead of 4)
- Add docs-website to documentation requirements
- Resolve naming question: scripted and composite are interchangeable
- Remove json.encode/decode open question

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Starlark's sandbox makes it feasible for agents to dynamically
compose and submit scripts to vMCP at runtime — something a
declarative YAML DSL could never support safely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Six practical examples covering structured data manipulation,
returning structured data, JSON-as-string parsing from legacy
servers, fan-out with parallel, error handling patterns, and
elicitation. Includes a callout explaining when to use dict
indexing (tool results) vs attribute access (builtin return values).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kantord
Member

kantord commented Mar 9, 2026

This RFC is incredible! Seems like I'm the first human reviewer and considering that, honestly I'm impressed by how thorough it already is!

I do have some thoughts but honestly, none of them are blockers, I just wanted to give more than a "Looks good to me". I did have to scrape the barrel to think of some things that might be actionable:

  • Perhaps for error handling, it would make sense to distinguish between runtime errors vs. input validation errors vs. output validation errors?
  • Streaming/large data: this may be mostly theoretical based on how MCP is used today, but if workflows start dispatching large fan-outs (e.g., calling a github tool across thousands of repos), the lack of streaming primitives in Starlark could become a limitation, especially when combined with strict memory limitations
  • Human-in-the-loop using elicit() is already covered, but I'm wondering if we also need to think about MCP apps.
  • About the http built-in: I would actually try to rely on MCP tool calls for things like this. This way we could load all of the complexities of access control and whatnot into a single system (MCP).
  • Reusing Starlark code across different scripts could open a can of worms, such as one change generating a cascading failure. Perhaps this is another thing we can push onto MCP, just use MCP tool calls to reuse one vMCP script in another. At least in that case we could have things like schema-based contracts (you get an error if you add a breaking change in an output schema that would generate a domino-effect.) In any case, the feature of load() might be premature without knowing exactly how vMCP would be used in real life.
  • Automated testing for these scripts might be challenging/nearly impossible. A simple example-based test runner would be trivial to add, but mocking MCP calls would be a problem. Perhaps there is a good MCP testing system that we could integrate into Toolhive that would solve these problems?

@jerm-dro
Author

Thanks for the thoughtful review @kantord 😄

  • Streaming/large data: this may be mostly theoretical based on how MCP is used today, but if workflows start dispatching large fan-outs (e.g., calling a github tool across thousands of repos), the lack of streaming primitives in Starlark could become a limitation, especially when combined with strict memory limitations

This is an important thing to call out. I think it can be solved by building an implementation of Iterable: something within vMCP recognizes "hey, this is a huge array / dict / paginated response" and turns it into an Iterable.
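One possible shape for that Iterable idea, sketched in Python: a generator that pulls one page at a time so the script never materializes the full result. The page layout (`items` / `next_cursor`) and `fetch_page` are illustrative assumptions, not anything vMCP defines today:

```python
# Canned two-page response standing in for a paginated upstream MCP tool.
PAGES = {
    None: {"items": [1, 2], "next_cursor": "c1"},
    "c1": {"items": [3, 4], "next_cursor": None},
}

def fetch_page(cursor):
    # Stub for one upstream tool call returning a single page.
    return PAGES[cursor]

def paged_iter(fetch):
    # Generator: yields items lazily, page by page, bounding memory use.
    cursor = None
    while True:
        page = fetch(cursor)
        for item in page["items"]:
            yield item
        cursor = page["next_cursor"]
        if cursor is None:
            return
```

A script looping over such an iterator would fan out across thousands of repos while only ever holding one page in memory.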

  • Human-in-the-loop using elicit() is already covered, but I'm wondering if we also need to think about MCP apps.

Can you say more about what you're imagining? I'm not that familiar with MCP apps.

  • About the http built-in: I would actually try to rely on MCP tool calls for things like this. This way we could load all of the complexities of access control and whatnot into a single system (MCP).

Yes, that makes perfect sense. If users want an http tool, they could add a fetch tool.

  • Reusing Starlark code across different scripts could open a can of worms, such as one change generating a cascading failure. Perhaps this is another thing we can push onto MCP, just use MCP tool calls to reuse one vMCP script in another. At least in that case we could have things like schema-based contracts (you get an error if you add a breaking change in an output schema that would generate a domino-effect.) In any case, the feature of load() might be premature without knowing exactly how vMCP would be used in real life.

Yea, I could see it getting hairy too. If we get to the point people (or agents) are writing so much code that we think reuse is important, then that's a good problem to have. I like the ideas you have though. Let's wait until this problem needs more attention.

  • Automated testing for these scripts might be challenging/nearly impossible. A simple example-based test runner would be trivial to add, but mocking MCP calls would be a problem. Perhaps there is a good MCP testing system that we could integrate into Toolhive that would solve these problems?

Yes, it could be frustrating, especially with the schemas of the underlying MCP servers potentially changing.

Some thoughts here:

  • we could build a thv repl for manual iteration
  • we could add a custom type checker that validates the inputs to tool calls and the use of their responses

I think this is another thing where we have to wait for the problem to arise to know what solution is justified.
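If a simple example-based runner ever does get built, the mocking half might look something like this sketch: canned responses keyed by tool name, plus a call log for assertions. Everything here (`make_mock`, the `users_list` tool, the `summarize` script) is hypothetical:

```python
def make_mock(canned):
    """Return a fake call_tool backed by canned responses, plus a call log."""
    calls = []
    def call_tool(name, args=None):
        calls.append((name, args))  # record for later assertions
        return canned[name]
    return call_tool, calls

def summarize(call_tool):
    # Example script under test; takes call_tool as an injected dependency.
    users = call_tool("users_list")["users"]
    return {"count": len(users)}

mock_call_tool, calls = make_mock({"users_list": {"users": ["ann", "bob"]}})
```

The awkward part the thread already identifies remains: canned responses drift as upstream server schemas change, so this only catches regressions in the script itself.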
