This is a real-time voice-based chat application that allows users to have spoken conversations with an AI built from first principles. The application uses client-side Voice Activity Detection (VAD) to capture user speech, Cloudflare Workers for backend processing, and Cloudflare AI for Speech-to-Text (STT), Large Language Model (LLM) inference, and Text-to-Speech (TTS).
- Real-time Voice Interaction: Speak directly to the AI and hear its responses.
- Client-Side VAD: Efficiently detects speech, sending audio only when the user is talking.
- Streaming AI Responses: AI responses are streamed back for a more interactive feel.
- Cloudflare Powered: Leverages Cloudflare Workers for scalable backend logic and Cloudflare AI for cutting-edge AI models.
- Chat History: Maintains a conversation history within a session.
- Simple UI: Clean interface displaying the conversation and providing controls.
The application facilitates a voice conversation between a user and an AI through a series of steps orchestrated between the frontend (browser) and the backend (Cloudflare Workers).
Frontend
- User Interaction & Permissions:
- The user clicks the "Start Conversation" button.
- The browser requests microphone access.
- Voice Activity Detection (VAD):
- Once permission is granted, the client-side VAD system is initialized.
- The VAD listens to the microphone input. When the user speaks, it captures audio.
- It processes the speech into audio chunks (ArrayBuffers).
- WebSocket Connection:
- A WebSocket connection is established with the backend Cloudflare Worker.
- Sending Audio:
- The captured audio chunks are sent directly to the backend via the WebSocket.
- Any currently playing AI audio is stopped before sending new user audio.
- Receiving & Displaying Messages:
- The frontend listens for messages from the WebSocket:
- `text` type: This is the user's speech transcribed by the backend. It's displayed in the chat UI as a user message.
- `audio` type: This is the AI's response. The `text` content is displayed as an AI message, and the accompanying `audio` data is queued and played back to the user.
- The UI updates status messages (e.g., "Listening...", "AI Speaking...", "Processing...").
- A visualizer provides feedback when the user is speaking.
- Controls:
- Start/Stop Conversation: Manages the VAD, WebSocket activity, and UI state.
- Clear Chat: Clears the displayed messages and sends a `clear` command to the backend to reset the conversation history for the session.
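The capture-and-send path above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: `createVAD`, `setStatus`, and `stopPlayback` are hypothetical helpers standing in for whichever VAD library and UI code the project uses.

```ts
// Hypothetical helpers (the real project has its own equivalents):
declare function createVAD(opts: {
  onSpeechStart: () => void;
  onSpeechEnd: (chunk: ArrayBuffer) => void;
}): Promise<{ start(): void; stop(): void }>;
declare function setStatus(text: string): void;
declare function stopPlayback(): void;

const ws = new WebSocket(`wss://${location.host}/websocket`);
ws.binaryType = 'arraybuffer';

async function startConversation(): Promise<void> {
  // Prompt for microphone access (VAD libraries usually do this internally).
  await navigator.mediaDevices.getUserMedia({ audio: true });

  const vad = await createVAD({
    onSpeechStart: () => setStatus('Listening...'),
    onSpeechEnd: (chunk: ArrayBuffer) => {
      stopPlayback();   // cut off any AI audio that is still playing
      ws.send(chunk);   // the raw audio chunk goes straight to the Worker
      setStatus('Processing...');
    },
  });
  vad.start();
}
```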
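On the receiving side, here is a sketch of the message handling and sequential playback described above, reusing `ws` from the previous snippet. The message shapes match the `text`/`audio` types listed earlier; the base64 audio encoding and the UI helpers are assumptions.

```ts
// Hypothetical UI helpers; audio is assumed to arrive base64-encoded.
declare function addMessage(who: 'user' | 'ai', text: string): void;
declare function setStatus(text: string): void;

const audioCtx = new AudioContext();
const queue: ArrayBuffer[] = [];
let playing = false;

ws.onmessage = (event: MessageEvent) => {
  const msg = JSON.parse(event.data as string);
  if (msg.type === 'text') {
    addMessage('user', msg.text);               // what the backend heard
  } else if (msg.type === 'audio') {
    addMessage('ai', msg.text);                 // the AI sentence
    queue.push(base64ToArrayBuffer(msg.audio)); // queue audio for playback
    void playNext();
  }
};

async function playNext(): Promise<void> {
  if (playing || queue.length === 0) return;
  playing = true;
  setStatus('AI Speaking...');
  const buffer = await audioCtx.decodeAudioData(queue.shift()!);
  const source = audioCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(audioCtx.destination);
  source.onended = () => { playing = false; void playNext(); };
  source.start();
}

function base64ToArrayBuffer(b64: string): ArrayBuffer {
  const binary = atob(b64);
  const buf = new ArrayBuffer(binary.length);
  const view = new Uint8Array(buf);
  for (let i = 0; i < binary.length; i++) view[i] = binary.charCodeAt(i);
  return buf;
}
```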
Backend
The backend is built using a Cloudflare Worker that utilizes a Durable Object to manage the state for each WebSocket connection (i.e., each user session). Simplified code sketches of these steps appear after the list below.
- WebSocket Handshake:
- When the frontend attempts to connect to `/websocket`, the main Worker `fetch` handler upgrades the HTTP request to a WebSocket connection.
- It gets or creates a unique Durable Object instance (using `idFromName(crypto.randomUUID())`) to handle this specific WebSocket connection.
- Receiving User Audio & Commands:
- The Durable Object's WebSocket event listener receives messages from the client.
- If the message is a stringified JSON command (e.g., `{ "type": "cmd", "data": "clear" }`), it processes the command (e.g., clears `this.msgHistory`).
- If the message is an audio buffer (the user's speech), it is handed to the STT step below.
- Speech-to-Text (STT):
- The audio buffer (an `ArrayBuffer`) is converted to a `Uint8Array`.
- This array is sent to the Cloudflare AI STT model (`@cf/openai/whisper-tiny-en`).
- The model transcribes the audio to text.
- The transcribed text is sent back to the client via WebSocket (`{ type: 'text', text: user_transcription }`) so the user can see what the AI heard.
- The user's transcribed text is added to the `msgHistory` array for context (`{ role: 'user', content: text }`).
- Large Language Model (LLM) Inference:
- The `msgHistory` (containing the conversation so far) is sent to the Cloudflare AI LLM (`@cf/meta/llama-4-scout-17b-16e-instruct`).
- A system prompt ("You in a voice conversation with the user") guides the LLM's behavior.
- The LLM generates a response as a text stream. `smoothStream()` is used for potentially smoother output.
- Text Buffering & Text-to-Speech (TTS):
- The `bufferText` utility processes the LLM's text stream, breaking it into sentences (or manageable chunks).
- For each sentence:
- The sentence is added to `msgHistory` (`{ role: 'assistant', content: sentence }`).
- The sentence is sent to the Cloudflare AI TTS model (`@cf/myshell-ai/melotts`) using a `PQueue` to manage concurrency (one TTS request at a time for this session, to ensure order).
- The TTS model converts the text sentence into audio data.
- Sending AI Response to Client:
- The generated audio data (along with the corresponding text sentence) is sent back to the client via WebSocket (`{ type: 'audio', text: sentence, audio: audio_data }`).
- WebSocket Closure:
- If the WebSocket connection closes, the Durable Object handles the closure.
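First, the handshake: the Worker's `fetch` handler upgrades `/websocket` requests and hands each connection to its own Durable Object. This is a minimal sketch, not the repository's exact code; the binding names (`CHAT`, `ASSETS`) are assumptions.

```ts
// Illustrative Worker entry point; binding names are assumptions for this sketch.
interface Env {
  AI: Ai;
  CHAT: DurableObjectNamespace;
  ASSETS: Fetcher;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    if (url.pathname === '/websocket') {
      if (request.headers.get('Upgrade') !== 'websocket') {
        return new Response('Expected a WebSocket upgrade', { status: 426 });
      }
      // A random name per connection gives every WebSocket its own
      // Durable Object instance (and therefore its own msgHistory).
      const id = env.CHAT.idFromName(crypto.randomUUID());
      return env.CHAT.get(id).fetch(request);
    }
    return env.ASSETS.fetch(request); // serve the static frontend
  },
};
```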
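Inside the Durable Object, incoming WebSocket messages are either JSON commands or raw audio. Below is a simplified skeleton of that branch, assuming the Whisper binding accepts the audio bytes as a plain number array and returns `{ text }`; class and field names are illustrative.

```ts
// Minimal Durable Object skeleton for the command/STT branch.
type Msg = { role: 'user' | 'assistant'; content: string };

export class ChatSession {
  msgHistory: Msg[] = [];
  constructor(private state: DurableObjectState, private env: Env) {}

  async fetch(request: Request): Promise<Response> {
    const pair = new WebSocketPair();
    const [client, server] = Object.values(pair);
    server.accept();

    server.addEventListener('message', async (event) => {
      if (typeof event.data === 'string') {
        const msg = JSON.parse(event.data);
        if (msg.type === 'cmd' && msg.data === 'clear') this.msgHistory = [];
        return;
      }

      // Binary frame: the user's speech.
      const audio = new Uint8Array(event.data as ArrayBuffer);
      const { text } = await this.env.AI.run('@cf/openai/whisper-tiny-en', {
        audio: [...audio], // assumed input shape: a plain array of bytes
      });

      // Echo the transcript so the UI can show what the AI heard,
      // then keep it for LLM context.
      server.send(JSON.stringify({ type: 'text', text }));
      this.msgHistory.push({ role: 'user', content: text });

      // LLM + TTS steps continue here (see the next sketches).
    });

    return new Response(null, { status: 101, webSocket: client });
  }
}
```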
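For the LLM step, the mention of `smoothStream()` suggests the Vercel AI SDK; the sketch below assumes the `ai` package together with `workers-ai-provider`, with `Env` and `Msg` as in the previous sketches. If the project wires Workers AI differently, the shape would change accordingly.

```ts
import { smoothStream, streamText } from 'ai';
import { createWorkersAI } from 'workers-ai-provider';

type Msg = { role: 'user' | 'assistant'; content: string };

// Stream an assistant reply for the conversation so far.
function streamReply(env: Env, msgHistory: Msg[]) {
  const workersai = createWorkersAI({ binding: env.AI });
  const result = streamText({
    model: workersai('@cf/meta/llama-4-scout-17b-16e-instruct'),
    system: 'You in a voice conversation with the user',
    messages: msgHistory,
    // smoothStream() re-chunks the token stream for steadier output.
    experimental_transform: smoothStream(),
  });
  return result.textStream; // async-iterable stream of response text
}
```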
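Finally, the buffering/TTS step: sentences from the text stream are queued one at a time so the audio reaches the client in order. `bufferText` is the repo's utility (its signature is assumed here), and the exact MeloTTS input/output shape is also an assumption.

```ts
import PQueue from 'p-queue';

// One TTS request at a time per session keeps the audio sentences in order.
const ttsQueue = new PQueue({ concurrency: 1 });

async function speakStream(
  ws: WebSocket,
  env: Env,
  msgHistory: Msg[],
  textStream: AsyncIterable<string>,
) {
  // bufferText: assumed to be an async iterator yielding one sentence/chunk at a time.
  for await (const sentence of bufferText(textStream)) {
    msgHistory.push({ role: 'assistant', content: sentence });

    void ttsQueue.add(async () => {
      // Assumed melotts shape: text prompt in, base64 audio out.
      const { audio } = await env.AI.run('@cf/myshell-ai/melotts', { prompt: sentence });
      ws.send(JSON.stringify({ type: 'audio', text: sentence, audio }));
    });
  }
  await ttsQueue.onIdle(); // wait for the last sentence to be spoken
}
```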
User Speech → VAD (Client) → Audio Chunk → WebSocket → Durable Object (Backend) → STT Model → User Text Transcript (to Client & LLM) → LLM → AI Text Response Stream → Sentence Buffer → TTS Model → AI Audio Chunk → WebSocket → Client (Play Audio & Display Text)
To run the project locally:

```sh
git clone https://github.com/megaconfidence/talk2ai
cd talk2ai
npm install
npm run dev
```
- LLM Context Window: The `msgHistory` grows with the conversation. Long conversations might exceed the LLM's context window or token limits.
- Error Handling: While some error handling is present, more robust mechanisms could be added (e.g., WebSocket reconnection logic).
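One possible mitigation for the context-window issue, not implemented here, is to cap `msgHistory` to the most recent turns before each LLM call, for example:

```ts
type Msg = { role: 'user' | 'assistant'; content: string };

const MAX_MESSAGES = 20; // illustrative limit; tune to the model's context window

// Keep only the most recent turns so long sessions stay within the context window.
function trimHistory(msgHistory: Msg[]): Msg[] {
  return msgHistory.length > MAX_MESSAGES ? msgHistory.slice(-MAX_MESSAGES) : msgHistory;
}
```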
Issues and PRs are welcome.