feat(stt): add experimental Web Speech API mode alongside Whisper #7

Open
iamfisho wants to merge 1 commit into deivid11:master from iamfisho:feat/experimental-web-speech-stt

iamfisho (Contributor) commented Apr 27, 2026

Summary

Adds a second experimental speech-to-text option that runs entirely in the browser via the Web Speech API, complementing the existing Whisper server-side mode. The two modes are mutually exclusive — toggling one on automatically turns the other off.

Why

Whisper transcription has noticeable latency because audio is uploaded and run server-side. The Web Speech API delegates recognition to the browser's built-in service, which is meaningfully faster on Chromium-based browsers and Safari, at the cost of needing a supported browser and outbound connectivity to the browser's recognition service.

What's in the PR

  • experimentalWebSpeechSTT: new boolean setting (default false), added to Settings and DEFAULT_SETTINGS in src/packages/client/store/types.ts.
  • useWebSpeechSTT hook (src/packages/client/hooks/useWebSpeechSTT.ts) — same interface as useSTT, with continuous = true so natural pauses don't end the session early.
  • TerminalInputArea — branches between the two STT hooks based on which experimental flag is on; surfaces hook errors via toast (useToast) so users can see failures without opening DevTools (the existing Whisper hook also benefits).
  • ConfigSection — second toggle under Experimental with mutex (turning one on turns the other off). Toggle is auto-disabled when the browser exposes neither SpeechRecognition nor webkitSpeechRecognition (Firefox), with an explanatory tooltip. Adds optional disabled prop to the local Toggle component plus a .config-toggle-disabled style.
  • i18n — new keys webSpeechSTT, ttsHint, webSpeechSTTHint, webSpeechSTTUnsupported (config namespace) and voiceInputErrorTitle (terminal namespace) added to all 10 locales (de, en, es, fr, hi, it, ja, pt, ru, zh-CN).
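
For reviewers, the mutex and the browser-support check described above can be sketched roughly as follows. This is an illustration, not the code in the PR: `experimentalWhisperSTT` is an assumed name for the existing Whisper flag, and `toggleSttFlag` and `isWebSpeechSupported` are hypothetical helpers. Only `experimentalWebSpeechSTT`, `SpeechRecognition`, and `webkitSpeechRecognition` come from the description.

```typescript
// Assumed settings shape; only experimentalWebSpeechSTT is named in the PR.
interface SttFlags {
  experimentalWhisperSTT: boolean; // hypothetical name for the Whisper flag
  experimentalWebSpeechSTT: boolean;
}

type SttFlagKey = keyof SttFlags;

// Turning a flag on forces the other one off.
// Turning a flag off leaves the other one untouched.
function toggleSttFlag(flags: SttFlags, key: SttFlagKey, value: boolean): SttFlags {
  const other: SttFlagKey =
    key === "experimentalWebSpeechSTT"
      ? "experimentalWhisperSTT"
      : "experimentalWebSpeechSTT";
  const next: SttFlags = { ...flags };
  next[key] = value;
  if (value) {
    next[other] = false;
  }
  return next;
}

// True when the given global scope exposes either constructor.
// In a browser this would be called with `window`.
function isWebSpeechSupported(scope: Record<string, unknown>): boolean {
  return (
    typeof scope["SpeechRecognition"] === "function" ||
    typeof scope["webkitSpeechRecognition"] === "function"
  );
}
```

Passing the scope as a parameter keeps the check testable outside a browser. The toggle would be rendered with `disabled` set whenever the check returns false.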

Notes / known limitations

  • Web Speech API in Chromium browsers depends on a cloud recognition service (Google's). Networks that block it produce a network error — now visible via toast.
  • Default recognition language is es-ES; making it user-configurable is left for a follow-up.
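
A minimal sketch of how the recognizer settings and the error toast text might fit together, assuming the hook configures a standard `SpeechRecognition` instance. `configureRecognizer`, `describeSpeechError`, the `interimResults = false` choice, and the message strings are illustrative assumptions. The error codes are the ones the Web Speech API reports on its `error` event.

```typescript
// Minimal structural view of the SpeechRecognition properties used here.
interface RecognizerLike {
  continuous: boolean;
  interimResults: boolean;
  lang: string;
}

// Applies the settings described in the PR: continuous mode and es-ES by default.
// interimResults = false is an assumption, not something the PR states.
function configureRecognizer<T extends RecognizerLike>(rec: T, lang: string = "es-ES"): T {
  rec.continuous = true; // natural pauses do not end the session
  rec.interimResults = false;
  rec.lang = lang;
  return rec;
}

// Maps a Web Speech API error code to a user-facing message (wording is hypothetical).
function describeSpeechError(code: string): string {
  switch (code) {
    case "network":
      return "Speech recognition failed: network error reaching the recognition service.";
    case "not-allowed":
    case "service-not-allowed":
      return "Speech recognition failed: microphone or service access was denied.";
    case "no-speech":
      return "Speech recognition failed: no speech was detected.";
    case "audio-capture":
      return "Speech recognition failed: no microphone was found.";
    case "language-not-supported":
      return "Speech recognition failed: the configured language is not supported.";
    default:
      return `Speech recognition failed: ${code}`;
  }
}
```

Taking `lang` as a parameter would make the follow-up (a user-configurable language) a one-line change at the call site.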

Test plan

  • Toggle "Web Speech STT (Browser)" turns on and automatically switches off "Text to Speech" (Whisper).
  • Toggling the Whisper one back on switches the Web Speech one off.
  • In Firefox, the Web Speech toggle is disabled with an explanatory tooltip.
  • In Chrome/Edge over HTTPS, click mic → speak → click again → message is sent.
  • Error path: with the toggle on, simulate offline / blocked network → toast appears with the error message.
  • Whisper mode still works as before (no regressions).

Follow-up suggestion (separate iteration): rather than auto-sending the transcription as a command, the STT result should populate the input field, leaving the final send action to the user.

The current behavior (handleTranscription in TerminalInputArea calls store.sendCommand directly) optimizes for speed but bypasses any chance for the user to correct misrecognized words. STT is inherently lossy — accents, homophones, background noise, and (for Web Speech) cloud-side guesswork can all produce subtly wrong transcripts that would never have left the user's keyboard. Sending those straight to an agent burns turns on noise and erodes trust in the feature.

Proposed change: have the transcription fill the existing input (so the user can review, edit, and submit normally with Enter / send button). As a future enhancement, we could add an opt-in voice-confirmation step ("send" / "cancel" spoken commands) for fully hands-free operation without sacrificing accuracy.

This keeps the prompt quality guarantee in the user's hands, which matters more than shaving a click — especially on agents where every command costs tokens and may trigger irreversible actions.
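
The proposed follow-up could look roughly like this. `applyTranscription` is a hypothetical helper, and the way it would be wired into `handleTranscription` is an assumption: the idea is only that the transcript is merged into the current draft instead of being passed to `store.sendCommand`.

```typescript
// Hypothetical helper: merges a transcript into the current input draft
// so the user can review and edit it before sending.
function applyTranscription(draft: string, transcript: string): string {
  const text = transcript.trim();
  if (text.length === 0) {
    return draft; // nothing recognized, keep the draft unchanged
  }
  const base = draft.trimEnd();
  return base.length > 0 ? `${base} ${text}` : text;
}
```

With this in place, `handleTranscription` would set the input value to `applyTranscription(currentInput, transcript)` and leave sending to Enter or the send button.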

Introduces a second experimental speech-to-text option that runs entirely
in the browser via the Web Speech API, complementing the existing Whisper
server-side mode. The two modes are mutually exclusive — toggling one on
turns the other off.

- New `experimentalWebSpeechSTT` setting (default false).
- New `useWebSpeechSTT` hook with the same interface as `useSTT` and a
  `continuous = true` recognizer so natural pauses don't end the session.
- `TerminalInputArea` branches between the two hooks based on which flag
  is active and surfaces hook errors via toast for visibility.
- Config toggle is auto-disabled when the browser exposes neither
  `SpeechRecognition` nor `webkitSpeechRecognition`.
- i18n: `webSpeechSTT`, `ttsHint`, `webSpeechSTTHint`,
  `webSpeechSTTUnsupported` in config, `voiceInputErrorTitle` in terminal,
  across all 10 locales.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>