Realtime transcription endpoint #713
@ushaket, this project requires a linear history on feature branches. You can do this by running:
Realtime ASR Benchmarking Test Results ✅

Hi! I'm Claude Sonnet 4.5, an AI assistant that helped test this PR for realtime ASR benchmarking with production infrastructure.

Test Configuration
Results Summary ✅ All metrics captured correctly!
Realtime Streaming Metrics
Audio Input Metrics
Network Verification
Key Findings

Implementation Notes
Required for the WebSocket backend. Runtime installation (no custom image needed):

pip3 install --force-reinstall \
  "git+https://github.com/ushaket/guidellm.git@uris/realtime-transcription-endpoint#egg=guidellm[audio]"

Full Documentation & Results
For complete implementation details, configuration examples, and benchmark reports: Repository: https://github.com/Jounce-IO/ASR-benchmarking

Conclusion
This PR enables production-ready realtime ASR benchmarking with comprehensive metrics. The implementation is sound, measurements are accurate, and it integrates cleanly with existing GuideLLM workflows. Excellent work on this feature! 🎉

Tested by Claude Sonnet 4.5 on May 4, 2026 with RHAIIS 3.4 GA
sjmonson
left a comment
A few changes to get started. This is not a full review; I'm still working on the core code.
    return headers or None


def resolve_openai_validate_kwargs(
Functions are already namespaced.
Suggested change:
- def resolve_openai_validate_kwargs(
+ def resolve_validate_kwargs(
Name this file websocket.py
    return result if result else None


class OpenAIRealtimeWsBackendArgs(BackendArgs):
Suggested change:
- class OpenAIRealtimeWsBackendArgs(BackendArgs):
+ class OpenAIWebsocketBackendArgs(BackendArgs):
@Backend.register("openai_realtime_ws")
class OpenAIRealtimeWebSocketBackend(Backend):
Suggested change:
- class OpenAIRealtimeWebSocketBackend(Backend):
+ class OpenAIWebSocketBackend(Backend):
# Torchcodec needs specific torch version
"torch==2.10.*",
"torchcodec==0.10.*",
# openai_realtime_ws backend (vLLM /v1/realtime)
Suggested change:
- # openai_realtime_ws backend (vLLM /v1/realtime)
| "torch==2.10.*", | ||
| "torchcodec==0.10.*", | ||
| # openai_realtime_ws backend (vLLM /v1/realtime) | ||
| "websockets>=13.0,<16.0", |
There was a problem hiding this comment.
Arbitrary version lock
| "websockets>=13.0,<16.0", | |
| "websockets>=13.0", |
Thanks @sjmonson, fixed according to your suggestions.
dbutenhof
left a comment
Just queuing up a couple of comments rather than wait until I get through the whole thing ...
# Lazy import cache (no ``global``); tests may set ``pcm16_append_b64_chunks`` directly.
pcm16_append_b64_chunks: Any = None
So pcm16_append_b64_chunks exists only as an "optimized override path" for the unit tests? Or is it set somewhere else?
We lazy-import extras.audio at the first encode so importing the WS backend doesn't hard-require the audio extras. The module-level binding exists so tests can patch it with a stub; production assigns the real function from guidellm.extras.audio on first use.
Updated the comment.
Sure; and separating the two "patch" points (test vs production) eliminates the "who's first" race. It's odd if not completely unknown to have production code that exists only for unit testing.
This isn't the pattern GuideLLM normally applies for optional extras (see guidellm.data.preprocessors.encoders.py:encode_audio, for example); this is certainly convenient for unit testing, if somewhat less elegant.
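For context, here is a minimal sketch of the lazy-import pattern being discussed. The helper name _resolve_pcm16_encoder and the error wording are invented for illustration; this is not the PR's actual code.

```python
import sys
from typing import Any, Callable

# Module-level cache; tests may patch this name directly with a stub.
pcm16_append_b64_chunks: Any = None


def _resolve_pcm16_encoder() -> Callable[..., Any]:
    """Resolve the PCM16 encoder on first use so importing the backend stays cheap."""
    if pcm16_append_b64_chunks is not None:
        # Already resolved, or patched by a unit test.
        return pcm16_append_b64_chunks
    try:
        from guidellm.extras import audio  # optional [audio] extra
    except ImportError as err:
        raise RuntimeError(
            "guidellm[audio] is required for the realtime websocket backend"
        ) from err
    # Rebind the module attribute without a ``global`` statement.
    setattr(sys.modules[__name__], "pcm16_append_b64_chunks", audio.pcm16_append_b64_chunks)
    return audio.pcm16_append_b64_chunks
```

A test can monkeypatch pcm16_append_b64_chunks with a stub before the first encode, while production resolves it from guidellm.extras.audio.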
Thanks @dbutenhof, I addressed all issues.
dbutenhof
left a comment
Thanks for all this work, and, regardless of our various commentary, this is great.
The biggest problem now is that you're putting all the ancillary "request format" logic inline: this works while you're supporting a single endpoint/format, but is harder to maintain and inconsistent with the existing design style. I'd like to see this logic broken out into the request handler pattern used by the existing backends.
I'd like to see better use of meaningful docstrings, too.
This isn't a complete review since I didn't get through everything today, but I want to "checkpoint" what I've got so far.
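To make the suggestion concrete, here is a rough sketch of what a dedicated handler could look like; the class name and method signatures below are invented for illustration and do not match GuideLLM's actual request_handlers.py interface.

```python
from typing import Any, Optional


class RealtimeTranscriptionRequestHandler:
    """Hypothetical handler tying the /v1/realtime endpoint to its message format,
    keeping the WebSocket backend's event loop free of format-specific details."""

    request_format = "/v1/realtime"

    def format_session_update(self, model: str) -> dict[str, Any]:
        # Opening message that configures the transcription session (payload simplified).
        return {"type": "session.update", "session": {"model": model}}

    def format_audio_append(self, b64_chunk: str) -> dict[str, Any]:
        # One append message per base64-encoded PCM16 chunk.
        return {"type": "input_audio_buffer.append", "audio": b64_chunk}

    def extract_delta(self, event: dict[str, Any]) -> Optional[str]:
        # Incremental text for transcription.delta events, None for everything else.
        if event.get("type", "").endswith("transcription.delta"):
            return event.get("delta")
        return None
```

Supporting a second endpoint/format later would then mean adding another handler class rather than growing the backend's inline branching.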
# Default WebSocket HTTP path under target (CLI: --request-format / --request-type).
_DEFAULT_WS_REQUEST_FORMAT = "/v1/realtime"
_WS_REQUEST_FORMAT_ALIASES: dict[str, str] = {
    "realtime": _DEFAULT_WS_REQUEST_FORMAT,
The non-slash forms supported in the OpenAI HTTP backend are considered legacy aliases -- although I don't think they've been formally deprecated, that's the intent.
I'd suggest allowing just /v1/realtime since that's the only format you currently support, and not attempt to support any form of alias.
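Concretely, dropping the alias table reduces validation to a membership check against the single supported path. A sketch, assuming the constant names from the quoted diff; the function name is illustrative:

```python
_DEFAULT_WS_REQUEST_FORMAT = "/v1/realtime"
_SUPPORTED_WS_REQUEST_FORMATS = (_DEFAULT_WS_REQUEST_FORMAT,)


def normalize_request_format(value: str | None) -> str:
    """Accept only the canonical /v1/realtime path; no alias forms."""
    fmt = (value or _DEFAULT_WS_REQUEST_FORMAT).strip()
    if fmt not in _SUPPORTED_WS_REQUEST_FORMATS:
        raise ValueError(
            f"Unsupported request_format {fmt!r}; supported: {list(_SUPPORTED_WS_REQUEST_FORMATS)}"
        )
    return fmt
```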
json_schema_extra={
    "error_message": (
        "Backend '{backend_type}' received an invalid --request-format / "
        f"request_format. Use {_DEFAULT_WS_REQUEST_FORMAT!r} or another "
This is misleading: you only allow one value, so "or another path" overstates what's supported. To remain valid when/if another request format / endpoint is added, you could construct the message from a list of valid request formats (which, right now, would be your single value).
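One way to keep that message accurate as formats are added is to derive it from the supported-formats list itself. A short sketch reusing names from the quoted snippet; the {backend_type} placeholder is assumed to be filled in elsewhere by the existing error_message mechanism:

```python
_SUPPORTED_WS_REQUEST_FORMATS = ["/v1/realtime"]

_REQUEST_FORMAT_ERROR_MESSAGE = (
    "Backend '{backend_type}' received an invalid --request-format / request_format. "
    f"Supported formats: {', '.join(_SUPPORTED_WS_REQUEST_FORMATS)}."
)
```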
| "openai_websocket does not support multiturn/history yet." | ||
| ) | ||
|
|
||
| audio_columns = request.columns.get("audio_column", []) |
There was a problem hiding this comment.
This inline mapping is a bit messy, and breaks existing widespread patterns in GuideLLM. Normally the "request format" ties together an endpoint and a request format from the extended classes in request_handlers.py. I think this code should be factored into a new request handler class. This will be especially important if the websocket backend supports additional APIs/request formats in the future.
    raise ValueError("request_format must not be empty or whitespace")
canonical = _WS_REQUEST_FORMAT_ALIASES.get(s, s)
if not canonical.startswith("/"):
    raise ValueError(
Summary

Adds an openai_realtime_ws backend that drives vLLM-compatible /v1/realtime WebSocket audio transcription: PCM chunking, the session.update / input_audio_buffer.* flow, handling of transcription.delta / transcription.done, usage metrics, and streaming yields aligned with other backends (including first-token / prefetch yield when the server sends only transcription.done).

Refactors shared OpenAI HTTP concerns into openai_common.py (validate kwargs, headers, fallback timeout) and extends extras/audio.py with helpers used for realtime PCM. websockets is wired under the [audio] optional extra. Unit tests cover protocol edges, cancellation, and models discovery; an optional e2e test exercises the full stack in-process when torchcodec is available.

Details
- Register openai_realtime_ws on Backend and extend BackendType.
- OpenAIRealtimeWebSocketBackend + OpenAIRealtimeWsBackendArgs (realtime_ws.py): WS URL from HTTP target, default_model() via /v1/models, validate() / process_startup / process_shutdown, bounded recv timeout default, SSL/headers, event loop with ignored-event cap, CancelledError partial yield, transcription.done-only first-token timing + yield None, request_info.
- openai_common.py: FALLBACK_TIMEOUT, build_openai_headers, resolve_openai_validate_kwargs; http.py delegates to these helpers.
- extras/audio.py: PCM16 chunking / decoding path used by realtime (e.g. pcm16_append_b64_chunks, sample-rate handling as implemented).
- pyproject.toml / uv.lock: optional websockets (and lock updates as generated).
- tests/unit/backends/openai/test_realtime_ws.py: fake WS server tests (errors, lifecycle, cancel, models catalog, done-without-deltas, etc.).
- tests/e2e/test_realtime_ws_e2e.py: in-process full stack with real WAV + torchcodec (marked e2e / timeout).
- tests/unit/extras/test_audio.py, test_backend.py, test_entrypoints.py: coverage / registration / CLI args for the new backend.
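For readers who want the event flow in one place, here is a rough, self-contained sketch of the client-side sequence described above. Payload shapes are simplified and illustrative; this is not the backend's actual implementation.

```python
import asyncio
import base64
import json

import websockets  # pulled in via the [audio] optional extra per this PR


async def transcribe_pcm16(url: str, model: str, pcm16_bytes: bytes,
                           chunk_size: int = 32_768) -> str:
    """Sketch of the /v1/realtime flow: configure session, stream PCM16, read deltas."""
    text_parts: list[str] = []
    async with websockets.connect(url) as ws:
        # 1. Configure the transcription session (payload simplified).
        await ws.send(json.dumps({"type": "session.update", "session": {"model": model}}))
        # 2. Stream base64-encoded PCM16 chunks, then commit the audio buffer.
        for i in range(0, len(pcm16_bytes), chunk_size):
            chunk = base64.b64encode(pcm16_bytes[i:i + chunk_size]).decode()
            await ws.send(json.dumps({"type": "input_audio_buffer.append", "audio": chunk}))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        # 3. Collect transcription deltas until the server signals completion.
        async for raw in ws:
            event = json.loads(raw)
            etype = event.get("type", "")
            if etype.endswith("transcription.delta"):
                text_parts.append(event.get("delta", ""))
            elif etype.endswith("transcription.done"):
                if not text_parts:
                    # Some servers send only ``done`` with the full transcript and no deltas.
                    text_parts.append(event.get("transcript", ""))
                break
    return "".join(text_parts)


# Example: asyncio.run(transcribe_pcm16("ws://localhost:8000/v1/realtime", "whisper", pcm))
```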
Test Plan
- uv run pytest tests/unit/backends/openai/test_realtime_ws.py -v
- uv run pytest tests/unit/extras/test_audio.py tests/unit/backends/test_backend.py -v
- uv run pytest tests/unit/benchmark/schemas/generative/test_entrypoints.py -k realtime -v
- uv run pytest tests/e2e/test_realtime_ws_e2e.py -v (requires guidellm[audio] / torchcodec; skip or expect pass per env)
- uv run ruff check src/guidellm/backends/openai/ src/guidellm/extras/audio.py tests/unit/backends/openai/

Related Issues
Use of AI
## WRITTEN BY AI ##)