feat(stt): add experimental Web Speech API mode alongside Whisper #7
Open
iamfisho wants to merge 1 commit into
Conversation
Introduces a second experimental speech-to-text option that runs entirely in the browser via the Web Speech API, complementing the existing Whisper server-side mode. The two modes are mutually exclusive: toggling one on turns the other off.

- New `experimentalWebSpeechSTT` setting (default false).
- New `useWebSpeechSTT` hook with the same interface as `useSTT` and a `continuous = true` recognizer so natural pauses don't end the session.
- `TerminalInputArea` branches between the two hooks based on which flag is active and surfaces hook errors via toast for visibility.
- Config toggle is auto-disabled when the browser exposes neither `SpeechRecognition` nor `webkitSpeechRecognition`.
- i18n: `webSpeechSTT`, `ttsHint`, `webSpeechSTTHint`, `webSpeechSTTUnsupported` in config, `voiceInputErrorTitle` in terminal, across all 10 locales.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds a second experimental speech-to-text option that runs entirely in the browser via the Web Speech API, complementing the existing Whisper server-side mode. The two modes are mutually exclusive — toggling one on automatically turns the other off.
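The mutual exclusion between the two modes can be expressed as a small pure update rule. This is a sketch only: `experimentalWebSpeechSTT` comes from this PR, but `experimentalWhisperSTT` is a hypothetical name for the existing Whisper flag, which the PR text does not show.

```typescript
// Sketch of the mutex rule between the two experimental STT flags.
// `experimentalWhisperSTT` is an assumed name, not an identifier from this PR.
interface ExperimentalSTTFlags {
  experimentalWebSpeechSTT: boolean;
  experimentalWhisperSTT: boolean;
}

// Turning one mode on forces the other off; turning a mode off
// leaves the other flag untouched.
function toggleSTTMode(
  flags: ExperimentalSTTFlags,
  mode: keyof ExperimentalSTTFlags,
  enabled: boolean
): ExperimentalSTTFlags {
  const next: ExperimentalSTTFlags = { ...flags, [mode]: enabled };
  if (enabled) {
    const other: keyof ExperimentalSTTFlags =
      mode === "experimentalWebSpeechSTT"
        ? "experimentalWhisperSTT"
        : "experimentalWebSpeechSTT";
    next[other] = false;
  }
  return next;
}
```

Keeping the rule a pure function of the settings object makes the "toggling one on turns the other off" invariant trivial to unit-test, independent of the store implementation.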
Why
Whisper transcription has noticeable latency because audio is uploaded and run server-side. The Web Speech API delegates recognition to the browser's built-in service, which is meaningfully faster on Chromium-based browsers and Safari, at the cost of needing a supported browser and outbound connectivity to the browser's recognition service.
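The browser exposes this recognition service as `SpeechRecognition` (or the prefixed `webkitSpeechRecognition`); Firefox currently exposes neither. A minimal sketch of the detection plus continuous-recognizer setup, parameterized on the global scope so it can be exercised outside a browser (`getRecognizerCtor` and `createRecognizer` are illustrative names, not identifiers from this PR):

```typescript
// Minimal subset of the SpeechRecognition interface used here.
interface MinimalRecognizer {
  continuous: boolean;
  start(): void;
  stop(): void;
}
type RecognizerCtor = new () => MinimalRecognizer;

// Chromium and Safari expose the webkit-prefixed name; Firefox exposes neither,
// which is the case where the config toggle stays disabled.
function getRecognizerCtor(scope: Record<string, unknown>): RecognizerCtor | undefined {
  return (scope["SpeechRecognition"] ?? scope["webkitSpeechRecognition"]) as
    | RecognizerCtor
    | undefined;
}

function createRecognizer(scope: Record<string, unknown>): MinimalRecognizer | undefined {
  const Ctor = getRecognizerCtor(scope);
  if (Ctor === undefined) return undefined; // unsupported browser
  const rec = new Ctor();
  rec.continuous = true; // keep listening across natural pauses
  return rec;
}
```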
What's in the PR
- `experimentalWebSpeechSTT`: new boolean setting (default `false`) added to `Settings` and `DEFAULT_SETTINGS` in `src/packages/client/store/types.ts`.
- `useWebSpeechSTT` hook (`src/packages/client/hooks/useWebSpeechSTT.ts`): same interface as `useSTT`, with `continuous = true` so natural pauses don't end the session early.
- `TerminalInputArea`: branches between the two STT hooks based on which experimental flag is on; surfaces hook errors via toast (`useToast`) so users can see failures without opening DevTools (the existing Whisper hook also benefits).
- `ConfigSection`: second toggle under Experimental with mutex behavior (turning one on turns the other off). The toggle is auto-disabled when the browser exposes neither `SpeechRecognition` nor `webkitSpeechRecognition` (Firefox), with an explanatory tooltip. Adds an optional `disabled` prop to the local `Toggle` component plus a `.config-toggle-disabled` style.
- i18n: `webSpeechSTT`, `ttsHint`, `webSpeechSTTHint`, `webSpeechSTTUnsupported` (config namespace) and `voiceInputErrorTitle` (terminal namespace) added to all 10 locales (de, en, es, fr, hi, it, ja, pt, ru, zh-CN).

Notes / known limitations

- Web Speech recognition failures (e.g. the `network` error) are now visible via toast.
- The recognition language is currently hardcoded to `es-ES`; making it user-configurable is left for a follow-up.

Test plan
Follow-up suggestion (separate iteration): rather than auto-sending the transcription as a command, the STT result should populate the input field, leaving the final send action to the user.
The current behavior (handleTranscription in TerminalInputArea calls store.sendCommand directly) optimizes for speed but bypasses any chance for the user to correct misrecognized words. STT is inherently lossy — accents, homophones, background noise, and (for Web Speech) cloud-side guesswork can all produce subtly wrong transcripts that would never have left the user's keyboard. Sending those straight to an agent burns turns on noise and erodes trust in the feature.
Proposed change: have the transcription fill the existing input (so the user can review, edit, and submit normally with Enter / send button). As a future enhancement, we could add an opt-in voice-confirmation step ("send" / "cancel" spoken commands) for fully hands-free operation without sacrificing accuracy.
This keeps the prompt-quality guarantee in the user's hands, which matters more than saving a click, especially with agents, where every command costs tokens and may trigger irreversible actions.
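The proposed flow can be sketched as a small handler that defaults to populating the input rather than sending. `setInput` and `sendCommand` stand in for the store actions used by `TerminalInputArea`; the real action names may differ.

```typescript
// Sketch of the review-then-send flow: the transcript fills the input draft
// by default, and auto-send is an explicit opt-in.
interface TranscriptionSink {
  setInput(text: string): void;    // proposed: populate the input for review
  sendCommand(text: string): void; // current behavior: auto-send to the agent
}

function handleTranscription(
  sink: TranscriptionSink,
  transcript: string,
  autoSend = false // default to the safer review-then-send flow
): void {
  const text = transcript.trim();
  if (text === "") return; // drop empty or whitespace-only recognition results
  if (autoSend) {
    sink.sendCommand(text);
  } else {
    sink.setInput(text);
  }
}
```

Making auto-send a parameter (rather than removing it) would let the later voice-confirmation enhancement reuse the same handler.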