vLLM Responses

FastAPI gateway that exposes an OpenAI-style Responses API (/v1/responses) in front of a vLLM OpenAI-compatible server (/v1/chat/completions), with:

SSE streaming with OpenAI-style event shapes and ordering
previous_response_id statefulness (ResponseStore)
gateway-executed built-in tool: code_interpreter
gateway-hosted MCP tools (tools[].type="mcp" with configured server_label)
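
To make the streaming feature concrete, here is a minimal sketch of how a client might split a Server-Sent Events stream into typed events. The parser follows the generic SSE wire format (`event:` / `data:` fields, blank line between events). The sample lines are hand-written for illustration, not captured from the gateway, and the assumption that each event carries both an `event:` field and a JSON `data:` payload should be checked against real output.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, payload) pairs from SSE text lines."""
    event_name = "message"  # SSE default when no "event:" field is present
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield event_name, json.loads("\n".join(data_parts))
            event_name, data_parts = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format drops one optional leading space after the colon.
            data_parts.append(value[1:] if value.startswith(" ") else value)


# Hand-written sample stream (illustrative only).
SAMPLE_STREAM = [
    "event: response.created",
    'data: {"type": "response.created"}',
    "",
    ": keep-alive comment",
    "event: response.output_text.delta",
    'data: {"type": "response.output_text.delta", "delta": "4"}',
    "",
    "event: response.completed",
    'data: {"type": "response.completed"}',
    "",
]

# Keep only lifecycle events, skipping incremental ".delta" events.
non_delta = [
    name for name, _ in iter_sse_events(SAMPLE_STREAM) if not name.endswith(".delta")
]
print(non_delta)
```

In practice the OpenAI Python SDK handles this parsing for you; the sketch only shows what "event shape and ordering" refers to on the wire.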

Current MCP boundary:

tools[].type="mcp" is gateway-hosted MCP resolved via VR_MCP_CONFIG_PATH.
Request-declared MCP targets (server_url, connector_id) are not supported yet.
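
The boundary above can be illustrated with a small client-side sketch. It builds a request body that references a gateway-hosted MCP server by `server_label` and flags tool entries that declare their own target. The label `docs_search`, the model name, and both helper functions are hypothetical; the check is a client-side convenience and not the gateway's own validation logic.

```python
from typing import Any, Dict

# Fields that declare an MCP target inside the request (not supported yet).
UNSUPPORTED_MCP_FIELDS = ("server_url", "connector_id")


def build_mcp_request(model: str, prompt: str, server_label: str) -> Dict[str, Any]:
    """Build a Responses request that uses a gateway-hosted MCP server."""
    return {
        "model": model,
        "input": [{"role": "user", "content": prompt}],
        # server_label must match a server configured via VR_MCP_CONFIG_PATH.
        "tools": [{"type": "mcp", "server_label": server_label}],
    }


def declares_mcp_target(request: Dict[str, Any]) -> bool:
    """Return True if any MCP tool entry carries a request-declared target."""
    return any(
        tool.get("type") == "mcp"
        and any(field in tool for field in UNSUPPORTED_MCP_FIELDS)
        for tool in request.get("tools", [])
    )


# Hypothetical label and model name, for illustration only.
supported = build_mcp_request("my-model", "Search the docs.", "docs_search")
unsupported = {
    "model": "my-model",
    "input": "Search the docs.",
    "tools": [{"type": "mcp", "server_url": "https://example.com/mcp"}],
}
print(declares_mcp_target(supported), declares_mcp_target(unsupported))
```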

📚 Full User Documentation (Guides, API Reference, Examples)

Design docs (maintainer-facing): design_docs/index.md.

Install

The vllm-responses CLI is provided by the Python package in responses/.

Prerequisites: Python 3.12+ and uv.

Install from a prebuilt wheel (Linux x86_64) (Recommended)

Download a prebuilt wheel (vllm_responses-*.whl) from GitHub Releases (preferred) or a CI run artifact, then install it:

uv venv --python=3.12
source .venv/bin/activate
uv pip install vllm
uv pip install path/to/vllm_responses-*.whl

On Linux x86_64 wheels, the Code Interpreter server binary is bundled, so Bun is not required. Currently, wheels are only built for Linux x86_64.

Installing vllm-responses provides:

vllm-responses for the standalone supervisor mode
vllm as a CLI shim that supports vllm serve --responses and delegates all non-Responses paths to the upstream vllm Python package

Install from source (repo checkout) (Development)

git clone https://github.com/EmbeddedLLM/vllm-responses
cd vllm-responses

uv venv --python=3.12
source .venv/bin/activate
uv pip install vllm
uv pip install -e ./responses

# Development: enable Code Interpreter via Bun fallback
# - Required for source checkouts when running with `code_interpreter` enabled (default)
cd responses/python/vllm_responses/tools/code_interpreter
bun install
export VR_CODE_INTERPRETER_DEV_BUN_FALLBACK=1
cd -

vllm-responses --help

Verify installation:

vllm-responses --help
vllm --help

Optional dependency sets (extras)

Install any combination via:

uv pip install -e './responses[<extra1>,<extra2>]'

Available extras:

docs: MkDocs toolchain (contributors).
lint: Ruff + Markdown formatting.
test: Pytest + coverage + load testing tools.
tracing: OpenTelemetry tracing support (only needed if you enable VR_TRACING_ENABLED=true).
build: Package build/publish tools.
all: Everything above.

Build a wheel from source

If you want to produce a local wheel from this checkout, build from the responses/ package directory.

Rebuild the bundled Code Interpreter binary (Linux x86_64 only)

This step is only needed if you want the wheel to include a freshly compiled Code Interpreter binary.

bash scripts/ci/prebuild_code_interpreter_linux_x86_64.sh responses

The script writes the bundled executable under:

responses/python/vllm_responses/tools/code_interpreter/bin/linux/x86_64/code-interpreter-server

Build wheel and sdist

uv pip install -e './responses[build]'
cd responses
python -m build --wheel --sdist

Build artifacts are written to:

responses/dist/

On Linux x86_64, wheels built after the prebuild step bundle the native Code Interpreter binary. On other platforms, use the source-install Bun fallback or disable Code Interpreter.

Run

remote-upstream gateway mode (`vllm-responses serve`)

Prereqs:

If code_interpreter is enabled (default), the first start may download the Pyodide runtime (~400MB) into a cache directory (see VR_PYODIDE_CACHE_DIR). This requires tar to be installed.
For non-Linux platforms (or source installs without the bundled binary), you can disable the tool via --code-interpreter disabled. For development you can also enable the Bun-based fallback via VR_CODE_INTERPRETER_DEV_BUN_FALLBACK=1.

External upstream (you start vLLM yourself; /v1 is optional):

vllm-responses serve --upstream http://127.0.0.1:8457

The Responses endpoint is:

POST http://127.0.0.1:5969/v1/responses

Remote access note:

If you bind the gateway with --gateway-host 0.0.0.0, use the machine’s IP/hostname to connect (not 0.0.0.0).
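
As a rough sketch of the note above, a client can derive the endpoint URL from the bind host and swap the wildcard address for something reachable. The helper names and the `reachable_host` parameter are hypothetical; port 5969 is taken from the endpoint shown above. The request object is only constructed here, not sent.

```python
import json
import urllib.request


def responses_url(bind_host: str, port: int = 5969, reachable_host: str = "127.0.0.1") -> str:
    """Return the Responses endpoint URL a client should connect to."""
    # 0.0.0.0 means "listen on all interfaces"; it is not a connectable address.
    host = reachable_host if bind_host == "0.0.0.0" else bind_host
    return f"http://{host}:{port}/v1/responses"


def build_request(url: str, model: str, prompt: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request for the Responses endpoint."""
    body = json.dumps(
        {"model": model, "input": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example: the gateway was bound to 0.0.0.0 on a machine reachable at 10.0.0.5
# (a made-up address for illustration).
url = responses_url("0.0.0.0", reachable_host="10.0.0.5")
request = build_request(url, "my-model", "Say hello.")
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it, assuming the gateway is running and reachable at that address.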

integrated runtime (`vllm serve --responses`)

Prereq:

Install upstream vllm first, then install vllm-responses into the same environment.

Example:

CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3.5-0.8B \
  --responses \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --host 0.0.0.0 \
  --port 8457

CLI help:

vllm serve --help shows upstream vLLM help
vllm serve --responses --help shows the Responses-owned integrated flags

Optional: ResponseStore hot cache (Redis)

previous_response_id hydration reads the previous response state from the DB. For multi-worker deployments, you can optionally enable a Redis-backed hot cache to reduce DB reads/latency.

Env vars (default off):

VR_RESPONSE_STORE_CACHE=1
VR_RESPONSE_STORE_CACHE_TTL_SECONDS=3600

Redis connection:

VR_REDIS_HOST, VR_REDIS_PORT
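
The idea behind the hot cache can be sketched as a generic read-through cache with a TTL. This is not the gateway's implementation: `HotCache` is an in-memory stand-in for Redis, `load_previous_response` and `cache_from_env` are hypothetical helpers, and treating 3600 as the fallback TTL is an assumption based on the example value above. Redis connection settings are left out.

```python
import os
import time
from typing import Callable, Dict, Mapping, Optional, Tuple


class HotCache:
    """In-memory stand-in for a Redis hot cache with per-entry expiry."""

    def __init__(self, ttl_seconds: int, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop and report a miss
            return None
        return value

    def set(self, key: str, value: dict) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def cache_from_env(env: Optional[Mapping[str, str]] = None) -> Optional[HotCache]:
    """Return a cache only when VR_RESPONSE_STORE_CACHE=1 (default off)."""
    env = os.environ if env is None else env
    if env.get("VR_RESPONSE_STORE_CACHE") != "1":
        return None
    # Assumed fallback TTL; the real default is not stated in this README.
    ttl = int(env.get("VR_RESPONSE_STORE_CACHE_TTL_SECONDS", "3600"))
    return HotCache(ttl)


def load_previous_response(
    response_id: str,
    db_read: Callable[[str], dict],
    cache: Optional[HotCache],
) -> dict:
    """Read-through lookup: try the cache first, fall back to the DB."""
    if cache is not None:
        cached = cache.get(response_id)
        if cached is not None:
            return cached
    state = db_read(response_id)
    if cache is not None:
        cache.set(response_id, state)
    return state
```

With the cache enabled, repeated lookups of the same `previous_response_id` within the TTL hit the cache instead of the DB, which is the read reduction described above.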

Quick smoke test (OpenAI Python SDK)

from openai import OpenAI

# Point the SDK at the gateway; the API key is a placeholder.
client = OpenAI(base_url="http://127.0.0.1:5969/v1", api_key="dummy")

# Turn 1: ask the model to run code through the gateway-executed tool.
with client.responses.stream(
    model="MiniMaxAI/MiniMax-M2.1",
    input=[{"role": "user", "content": "You MUST call the code_interpreter tool. Execute: 2+2. Reply with ONLY the number."}],
    tools=[{"type": "code_interpreter"}],
    tool_choice="auto",
    include=["code_interpreter_call.outputs"],
) as stream:
    for evt in stream:
        # Skip incremental delta events; print only lifecycle event types.
        if getattr(evt, "type", "").endswith(".delta"):
            continue
        print(getattr(evt, "type", evt))
    # Keep the response ID so the next turn can continue from this state.
    r1 = stream.get_final_response().id

# Turn 2: continue the conversation via previous_response_id, with tools disabled.
with client.responses.stream(
    model="MiniMaxAI/MiniMax-M2.1",
    previous_response_id=r1,
    input=[{"role": "user", "content": "What number did you just compute? Reply with ONLY the number."}],
    tool_choice="none",
) as stream:
    for evt in stream:
        if getattr(evt, "type", "").endswith(".delta"):
            continue
        print(getattr(evt, "type", evt))
