Qwen TTS generation fails 100% with Stream(gpu, 1) not in current thread on Apple Silicon (v0.5.0) #699

@xinzhuzi

Environment

  • Voicebox v0.5.0 (macOS .app bundle, PyInstaller-packaged)
  • macOS Darwin 25.4.0, arm64 (Apple Silicon, 128 GB RAM)
  • Python 3.12.10 (bundled)
  • mlx==0.31.2, mlx_audio==0.4.1, qwen_tts==0.1.1
  • Model: mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16 (downloaded, 4.2 GB)
  • /health reports: backend_type: mlx, backend_variant: cpu, gpu_type: MPS (Apple Silicon), model_loaded: true, gpu_available: true

Symptom

Every generation fails immediately with:

There is no Stream(gpu, 1) in current thread.

The UI stays on "Loading model..." even though the SQLite row in voicebox.db is already status=failed. No Python traceback is written to the server log; only the message is captured, in the generations.error column.

Reproduction (100% repeatable)

After a clean app start (Ready logged, model loaded), call the API directly:

curl -X POST http://127.0.0.1:17493/generate \
  -H 'Content-Type: application/json' \
  -d '{"profile_id":"<any-existing-profile>","text":"hello","language":"en","engine":"qwen","model_size":"1.7B"}'

Polling /generate/{id}/status flips to failed within ~1 second, every time. Confirmed across multiple profiles, languages (zh/en), and short/long inputs.
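
For anyone who wants to script the reproduction, here is a minimal sketch using only the Python standard library. The endpoint paths and port are taken from the curl call above; the response field names ("id", "status") and the "completed" status value are assumptions and may not match what the server actually returns. The function names are hypothetical.

```python
import json
import time
import urllib.request

BASE_URL = "http://127.0.0.1:17493"  # port taken from the report


def post_json(url, payload):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def get_json(url):
    """GET a URL and return the decoded JSON response."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_terminal_status(fetch_status, timeout_s=30.0, interval_s=0.5):
    """Poll until the status is 'failed' or 'completed' ('completed' is assumed)."""
    deadline = time.monotonic() + timeout_s
    while True:
        body = fetch_status()
        if body.get("status") in ("failed", "completed"):
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError("generation did not reach a terminal status")
        time.sleep(interval_s)


def reproduce(profile_id):
    """Submit one generation and return its final status document."""
    payload = {
        "profile_id": profile_id,
        "text": "hello",
        "language": "en",
        "engine": "qwen",
        "model_size": "1.7B",
    }
    job = post_json(f"{BASE_URL}/generate", payload)
    status_url = f"{BASE_URL}/generate/{job['id']}/status"  # 'id' key is assumed
    return wait_for_terminal_status(lambda: get_json(status_url))
```

Calling reproduce("<any-existing-profile>") against a running instance should return the failed status document if the bug is present.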

Likely root cause

mlx_audio/stt/generate.py:224 (and almost certainly similar code on the TTS path):

generation_stream = mx.new_stream(mx.default_device())

The stream is created on the main/import thread, but inference runs in a uvicorn worker thread (run_in_executor / asyncio.to_thread). MLX streams are thread-local, so the worker thread sees no Stream(gpu, 1) and raises.

Fix direction: either recreate the stream inside the worker thread, or use mlx.core.new_thread_local_stream (added in MLX 0.31).
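
To illustrate the suspected failure mode without depending on MLX, here is a small stand-in sketch. It uses threading.local as a toy registry of per-thread streams; this is an analogy only, not how MLX implements streams, and every name in it (new_stream, run_on, infer_broken, infer_fixed) is hypothetical. It shows a stream created on the main thread being invisible to a worker thread, and the first fix direction (creating the stream inside the worker).

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Toy per-thread registry. A stand-in for illustration, NOT MLX's implementation.
_registry = threading.local()


def new_stream(name):
    """Register a stream for the calling thread only and return its name."""
    if not hasattr(_registry, "streams"):
        _registry.streams = set()
    _registry.streams.add(name)
    return name


def run_on(stream):
    """Raise, as in the report, if the calling thread never created `stream`."""
    if stream not in getattr(_registry, "streams", set()):
        raise RuntimeError(f"There is no {stream} in current thread.")
    return f"ran on {stream}"


# Suspected broken pattern: the stream is created once, on the importing thread.
GLOBAL_STREAM = new_stream("Stream(gpu, 1)")


def infer_broken():
    # Uses a stream that only the importing thread registered.
    return run_on(GLOBAL_STREAM)


def infer_fixed():
    # Fix direction: create the stream in the thread that actually runs inference.
    stream = new_stream("Stream(gpu, 1)")
    return run_on(stream)


def run_in_worker(fn):
    """Run `fn` in a worker thread, returning its result or the error text."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result()
        except RuntimeError as exc:
            return f"error: {exc}"
```

In this sketch run_in_worker(infer_broken) returns the error text, while run_in_worker(infer_fixed) succeeds. Whether the real fix in mlx_audio is this simple depends on how the stream is used elsewhere in the module.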

Suggested follow-ups

  1. Move mx.new_stream into the worker that actually runs inference, or switch to new_thread_local_stream.
  2. Write the exception traceback to the server log on generation failure. Currently the message is captured only in SQLite, which makes the issue look like "model still loading" in the UI.
  3. The backend_variant: cpu while backend_type: mlx and gpu_type: MPS reported by /health looks inconsistent and is worth a sanity check.
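
For follow-up 2, one possible shape is sketched below. It assumes the backend wraps generation in a try/except that persists the error message; the logger name, function name, and save_failure callback are all hypothetical and not taken from the Voicebox code. The only library behavior relied on is that logging.Logger.exception logs at ERROR level and appends the active traceback.

```python
import logging

logger = logging.getLogger("voicebox.generation")  # hypothetical logger name


def run_generation(generation_id, generate, save_failure):
    """Run `generate`; on failure, log the traceback and persist the message."""
    try:
        return generate()
    except Exception as exc:
        # logger.exception records the message plus the full traceback,
        # so the failure is visible in the server log and not only in SQLite.
        logger.exception("generation %s failed", generation_id)
        # Hypothetical callback that writes status=failed and the error text.
        save_failure(generation_id, str(exc))
        return None
```

With this shape the database row still carries the short message for the UI, while the server log gets the stack trace needed to locate the failing call.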

Workaround

None found from the user side: the backend cannot be switched away from MLX in the bundled build, and the error is raised before any inference happens.
