FastAPI gateway that exposes an OpenAI-style Responses API (`/v1/responses`) in front of a vLLM OpenAI-compatible server (`/v1/chat/completions`), with:

- SSE streaming event shape + ordering
- `previous_response_id` statefulness (ResponseStore)
- gateway-executed built-in tool: `code_interpreter`
- gateway-hosted MCP tools (`tools[].type="mcp"` with a configured `server_label`)

Current MCP boundary:

- `tools[].type="mcp"` is gateway-hosted MCP resolved via `VR_MCP_CONFIG_PATH`.
- Request-declared MCP targets (`server_url`, `connector_id`) are not supported yet.
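The gateway streams Server-Sent Events; as a rough illustration of the wire shape, here is a minimal SSE parser over a hand-written sample stream (the event names are assumed to follow OpenAI Responses conventions and are illustrative, not the gateway's exact sequence):

```python
def parse_sse(raw: str) -> list[tuple[str, str]]:
    """Parse an SSE stream into (event, data) pairs.

    Events are separated by blank lines; each event carries an
    `event:` line (the type) and a `data:` line (JSON payload).
    """
    events = []
    for block in raw.strip().split("\n\n"):
        event, data = None, None
        for line in block.splitlines():
            if line.startswith("event: "):
                event = line[len("event: "):]
            elif line.startswith("data: "):
                data = line[len("data: "):]
        if event is not None:
            events.append((event, data))
    return events

# Hand-written sample stream (event names assumed, for illustration):
raw = (
    "event: response.created\ndata: {}\n\n"
    "event: response.output_text.delta\ndata: {\"delta\": \"4\"}\n\n"
    "event: response.completed\ndata: {}\n\n"
)
print([e for e, _ in parse_sse(raw)])
# → ['response.created', 'response.output_text.delta', 'response.completed']
```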
📚 Full User Documentation (Guides, API Reference, Examples)
Design docs (maintainer-facing): design_docs/index.md.
The vllm-responses CLI is provided by the Python package in responses/.
Prerequisites: Python 3.12+ and uv.
Download a prebuilt wheel (vllm_responses-*.whl) from GitHub Releases (preferred) or a CI run artifact, then install it:
    uv venv --python=3.12
    source .venv/bin/activate
    uv pip install vllm
    uv pip install path/to/vllm_responses-*.whl

On Linux x86_64 wheels, the Code Interpreter server binary is bundled, so Bun is not required. Currently, wheels are only built for Linux x86_64.
Installing `vllm-responses` provides:

- `vllm-responses` for the standalone supervisor mode
- `vllm` as a CLI shim that supports `vllm serve --responses` and delegates all non-Responses paths to the upstream `vllm` Python package
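As a sketch of what the shim does (the dispatch logic below is hypothetical and only illustrates the routing rule described above, not the package's actual entry point):

```python
def dispatch(argv: list[str]) -> str:
    """Decide who handles a `vllm` invocation (illustrative only).

    `vllm serve --responses ...` is handled by the Responses gateway;
    every other invocation is delegated to the upstream vllm CLI.
    """
    if len(argv) >= 1 and argv[0] == "serve" and "--responses" in argv[1:]:
        return "vllm-responses"
    return "upstream-vllm"

print(dispatch(["serve", "--responses", "--port", "8457"]))  # → vllm-responses
print(dispatch(["serve", "--port", "8457"]))                 # → upstream-vllm
print(dispatch(["--help"]))                                  # → upstream-vllm
```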
    git clone https://github.com/EmbeddedLLM/vllm-responses
    cd vllm-responses
    uv venv --python=3.12
    source .venv/bin/activate
    uv pip install vllm
    uv pip install -e ./responses

    # Development: enable Code Interpreter via Bun fallback
    # - Required for source checkouts when running with `code_interpreter` enabled (default)
    cd responses/python/vllm_responses/tools/code_interpreter
    bun install
    export VR_CODE_INTERPRETER_DEV_BUN_FALLBACK=1
    cd -

    vllm-responses --help

Verify installation:

    vllm-responses --help
    vllm --help

Install any combination of extras via:

    uv pip install -e './responses[<extra1>,<extra2>]'

Available extras:

- `docs`: MkDocs toolchain (contributors).
- `lint`: Ruff + Markdown formatting.
- `test`: Pytest + coverage + load testing tools.
- `tracing`: OpenTelemetry tracing support (only needed if you enable `VR_TRACING_ENABLED=true`).
- `build`: Package build/publish tools.
- `all`: Everything above.
If you want to produce a local wheel from this checkout, build from the
responses/ package directory.
This step is only needed if you want the wheel to include a freshly compiled Code Interpreter binary.
    bash scripts/ci/prebuild_code_interpreter_linux_x86_64.sh responses

The script writes the bundled executable under:

    responses/python/vllm_responses/tools/code_interpreter/bin/linux/x86_64/code-interpreter-server

Then build the wheel:

    uv pip install -e './responses[build]'
    cd responses
    python -m build --wheel --sdist

Build artifacts are written to:

    responses/dist/
On Linux x86_64, wheels built after the prebuild step bundle the native Code Interpreter binary. On other platforms, use the source-install Bun fallback or disable Code Interpreter.
Prereqs:
- If `code_interpreter` is enabled (default), the first start may download the Pyodide runtime (~400MB) into a cache directory (see `VR_PYODIDE_CACHE_DIR`). This requires `tar` to be installed.
- For non-Linux platforms (or source installs without the bundled binary), you can disable the tool via `--code-interpreter disabled`. For development you can also enable the Bun-based fallback via `VR_CODE_INTERPRETER_DEV_BUN_FALLBACK=1`.
External upstream (you start vLLM yourself; /v1 is optional):
    vllm-responses serve --upstream http://127.0.0.1:8457

The Responses endpoint is:

    POST http://127.0.0.1:5969/v1/responses
Remote access note:
- If you bind the gateway with `--gateway-host 0.0.0.0`, use the machine's IP/hostname to connect (not `0.0.0.0`).
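For a quick smoke test without the OpenAI SDK, a request can be built with the standard library alone; a sketch (the payload fields follow the Responses API, and the URL assumes the default gateway address shown above):

```python
import json
import urllib.request

payload = {
    "model": "MiniMaxAI/MiniMax-M2.1",
    "input": [{"role": "user", "content": "Say hello."}],
    "stream": False,
}
req = urllib.request.Request(
    "http://127.0.0.1:5969/v1/responses",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json", "Authorization": "Bearer dummy"},
    method="POST",
)
# Requires a running gateway; uncomment to actually send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["id"])
```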
Prereq:
- install upstream `vllm` first, then install `vllm-responses` into the same environment
Example:
    CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3.5-0.8B \
      --responses \
      --reasoning-parser qwen3 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --host 0.0.0.0 \
      --port 8457

CLI help:

- `vllm serve --help` shows the upstream vLLM help
- `vllm serve --responses --help` shows the Responses-owned integrated flags
`previous_response_id` hydration reads the previous response state from the DB. For multi-worker deployments, you can optionally enable a Redis-backed hot cache to reduce DB reads/latency.
Env vars (default off):
- `VR_RESPONSE_STORE_CACHE=1`
- `VR_RESPONSE_STORE_CACHE_TTL_SECONDS=3600`
Redis connection:
- `VR_REDIS_HOST`, `VR_REDIS_PORT`
    from openai import OpenAI

    client = OpenAI(base_url="http://127.0.0.1:5969/v1", api_key="dummy")

    with client.responses.stream(
        model="MiniMaxAI/MiniMax-M2.1",
        input=[{"role": "user", "content": "You MUST call the code_interpreter tool. Execute: 2+2. Reply with ONLY the number."}],
        tools=[{"type": "code_interpreter"}],
        tool_choice="auto",
        include=["code_interpreter_call.outputs"],
    ) as stream:
        for evt in stream:
            if getattr(evt, "type", "").endswith(".delta"):
                continue
            print(getattr(evt, "type", evt))
        r1 = stream.get_final_response().id

    with client.responses.stream(
        model="MiniMaxAI/MiniMax-M2.1",
        previous_response_id=r1,
        input=[{"role": "user", "content": "What number did you just compute? Reply with ONLY the number."}],
        tool_choice="none",
    ) as stream:
        for evt in stream:
            if getattr(evt, "type", "").endswith(".delta"):
                continue
            print(getattr(evt, "type", evt))