Problem Statement
vLLM supports realtime transcription using WebSockets:
https://developers.openai.com/api/docs/guides/speech-to-text#streaming-the-transcription-of-an-ongoing-audio-recording
https://developers.openai.com/api/docs/guides/realtime-transcription
https://developers.openai.com/api/docs/guides/realtime?use-case=transcription#connect-with-websockets
https://docs.vllm.ai/en/latest/serving/openai_compatible_server/#realtime-api
https://docs.vllm.ai/en/latest/models/supported_models/#realtime-transcription
The current audio benchmarking support is insufficient for these new realtime audio models. As voice agents gain popularity, this kind of realtime communication is becoming the standard, and guidellm risks being left behind.
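For reference, a realtime transcription session streams audio over a WebSocket and receives transcript deltas back as events. Below is a minimal client sketch in Python, assuming an OpenAI-Realtime-style endpoint at /v1/realtime; the exact path, event names, and session fields mirror the OpenAI guides linked above and may differ in vLLM.

# Minimal sketch of a realtime transcription client over WebSockets.
# Endpoint path, event names, and session fields are assumptions based on
# the OpenAI Realtime transcription guide; vLLM's API may differ.
import asyncio
import base64
import json

import websockets


async def stream_transcription(pcm_chunks, url="ws://localhost:8000/v1/realtime"):
    async with websockets.connect(url) as ws:
        # Configure a transcription session (model name is a placeholder;
        # a vLLM deployment would likely expect the served model name).
        await ws.send(json.dumps({
            "type": "transcription_session.update",
            "session": {"input_audio_transcription": {"model": "whisper-1"}},
        }))
        # Stream raw audio chunks as they are "recorded".
        for chunk in pcm_chunks:
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode(),
            }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        # Print transcript deltas as the server emits them.
        async for message in ws:
            event = json.loads(message)
            if event.get("type", "").endswith("transcription.delta"):
                print(event.get("delta", ""), end="", flush=True)
            elif event.get("type", "").endswith("transcription.completed"):
                break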
Proposed Solution
Can we implement realtime (WebSocket) endpoint support for audio models in guidellm?
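The main new work is collecting metrics per WebSocket session rather than per HTTP request. Purely as an illustration (none of these names are existing guidellm APIs), the per-request stats a realtime backend would need to track look roughly like:

# Hypothetical sketch of per-session metrics for a realtime benchmark;
# field and method names are illustrative, not existing guidellm code.
import time
from dataclasses import dataclass, field


@dataclass
class RealtimeRequestStats:
    start: float = field(default_factory=time.perf_counter)
    first_delta: float | None = None  # analogous to time-to-first-token
    deltas: int = 0
    end: float | None = None

    def on_delta(self):
        now = time.perf_counter()
        if self.first_delta is None:
            self.first_delta = now
        self.deltas += 1

    def on_complete(self):
        self.end = time.perf_counter()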
Alternatives Considered
Audio models come in two flavors: realtime and synchronous. This means we can't simply benchmark realtime models through the existing synchronous mode; supporting realtime models requires supporting the realtime endpoint.
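For contrast, the existing synchronous flow is a single HTTP POST carrying the whole audio file, answered with one full transcript, e.g. against vLLM's OpenAI-compatible /v1/audio/transcriptions route (sketch below; file name and model are placeholders):

# Synchronous transcription for comparison: one request, one response.
import requests

with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": f},
        data={"model": "mistralai/Voxtral-Mini-4B-Realtime-2602"},
    )
print(resp.json()["text"])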
Usage Examples
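Start the vLLM OpenAI-compatible server with a realtime-capable model, here Voxtral Mini Realtime: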
python3 -m vllm.entrypoints.openai.api_server \
--model mistralai/Voxtral-Mini-4B-Realtime-2602 \
--tokenizer-mode mistral \
--config-format mistral \
--load-format mistral \
--trust-remote-code \
--compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
--tensor-parallel-size 1 \
--max-model-len 45000 \
--max-num-batched-tokens 8192 \
--max-num-seqs 16 \
--gpu-memory-utilization 0.90 \
--host 0.0.0.0 --port 8000
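Then run the proposed benchmark; --request-type audio_transcriptions_realtime is the new request type this issue asks for: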
guidellm benchmark \
--target http://localhost:8000/v1 \
--request-type audio_transcriptions_realtime \
--data /workspace/custom-audio-dataset/hf_dataset \
--profile synchronous \
--max-requests 10 \
--output-dir /workspace/repo/runs/2026-04-23T13-41-41 \
--outputs json,html,csv
Additional Context
No response