@kachenjr
Contributor

Fix Nova Sonic audio routing, tool use, and barge-in handling

This commit fixes three critical bugs in the Nova Sonic realtime plugin that
prevent it from working correctly in production scenarios.

## Issues Fixed

### 1. Audio Routing After Tool Calls
**Problem**: Audio frames not playing after tool execution
**Root Cause**: Audio routed to wrong generation after tool calls complete
**Solution**: Create new generation per ASSISTANT SPECULATIVE text event
**Impact**: Audio now plays correctly for each assistant response

Nova Sonic sends ASSISTANT SPECULATIVE text events to signal new assistant
turns, including after tool calls. Each turn needs its own generation to
ensure audio frames route to the correct audio channel.
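
A minimal sketch of the routing rule, assuming a simplified event shape. `AudioRouter`, `ResponseGeneration`, and the `on_text_event`/`on_audio_output` hooks are hypothetical stand-ins, not the plugin's actual classes. Each ASSISTANT SPECULATIVE text event opens a fresh generation, so audio that arrives after a tool call lands in the new turn rather than in the one that issued the call.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ResponseGeneration:
    """Hypothetical stand-in for one assistant turn's output channel."""

    generation_id: int
    audio_frames: list[bytes] = field(default_factory=list)
    closed: bool = False


class AudioRouter:
    """Hypothetical router: one generation per ASSISTANT SPECULATIVE event."""

    def __init__(self) -> None:
        self.generations: list[ResponseGeneration] = []
        self.current: ResponseGeneration | None = None

    def on_text_event(self, role: str, generation_stage: str) -> None:
        # A speculative assistant text event marks the start of a new assistant
        # turn, including the turn that follows a tool call.
        if role == "ASSISTANT" and generation_stage == "SPECULATIVE":
            if self.current is not None:
                self.current.closed = True
            self.current = ResponseGeneration(generation_id=len(self.generations))
            self.generations.append(self.current)

    def on_audio_output(self, audio_bytes: bytes) -> None:
        # Audio always goes to the generation of the turn being spoken now.
        if self.current is None or self.current.closed:
            return  # no open turn (e.g. cleared by barge-in): drop the frame
        self.current.audio_frames.append(audio_bytes)


router = AudioRouter()
router.on_text_event("ASSISTANT", "SPECULATIVE")  # first assistant turn
router.on_audio_output(b"\x00\x01")
router.on_text_event("ASSISTANT", "SPECULATIVE")  # turn after a tool call
router.on_audio_output(b"\x02\x03")
```

Keying audio to "the current generation" rather than to a generation captured earlier is what keeps post-tool audio from being written into a turn that has already finished.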

### 2. Tool Use Across Multiple Turns
**Problem**: Tool calls fail or behave incorrectly across multiple turns
**Root Cause**: Generation not closed after tool call, preventing framework
from delivering tool results via update_chat_ctx()
**Solution**: Close generation immediately after emitting tool call
**Impact**: Tool use now works reliably across multiple conversation turns

The LiveKit framework expects the generation to close so it can call
update_chat_ctx() with tool results. A new generation is created when
Nova Sonic sends the next ASSISTANT SPECULATIVE event with the response.
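
A sketch of the close-on-tool-call rule under the same caveat: `Generation`, `ToolCall`, and `ToolCallHandler` are hypothetical, and `deliver_tool_result` only mimics the role `update_chat_ctx()` plays in the framework. The handler records the call, closes the generation immediately, and clears it so the next ASSISTANT SPECULATIVE event can open a new one.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToolCall:
    call_id: str
    name: str
    arguments: str  # JSON-encoded arguments as emitted by the model


@dataclass
class Generation:
    """Hypothetical per-turn generation."""

    tool_calls: list[ToolCall] = field(default_factory=list)
    closed: bool = False


class ToolCallHandler:
    def __init__(self) -> None:
        self.current: Generation | None = Generation()
        self.tool_results: list[tuple[str, str]] = []

    def on_tool_use(self, call: ToolCall) -> None:
        gen = self.current
        if gen is None:  # already cleared, e.g. by a barge-in
            return
        gen.tool_calls.append(call)
        # Close right away: the framework expects the generation to close
        # before it hands tool results back.
        gen.closed = True
        self.current = None  # the next ASSISTANT SPECULATIVE event opens a new one

    def deliver_tool_result(self, call_id: str, output: str) -> None:
        # Hypothetical stand-in for the framework calling update_chat_ctx().
        if self.current is not None and not self.current.closed:
            raise RuntimeError("tool result delivered while generation is still open")
        self.tool_results.append((call_id, output))


handler = ToolCallHandler()
first_turn = handler.current
handler.on_tool_use(ToolCall("call-1", "get_weather", '{"city": "Paris"}'))
handler.deliver_tool_result("call-1", '{"temp_c": 18}')
```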

### 3. Crashes on User Interruption (Barge-In)
**Problem**: Session crashes when user interrupts assistant mid-response
**Root Cause**: Race conditions in future initialization, and attribute access
on `None` after barge-in sets `_current_generation` to `None`
**Solution**:
- Initialize futures as `None` and create them lazily in `initialize_streams()`
- Add defensive `None` checks throughout event handlers

**Impact**: Interruptions handled gracefully without crashes

Creating futures in `__init__` causes race conditions during session restart.
Lazy initialization ensures the event loop exists before future creation.
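
A minimal asyncio sketch of lazy future creation plus the defensive checks. `StreamSession`, `_audio_input_ready`, `_stream_ready`, and `mark_audio_ready` are hypothetical names; only the `initialize_streams()` hook name comes from this PR. Futures are created from the running loop inside `initialize_streams()`, and handlers tolerate both "not created yet" and "already resolved".

```python
from __future__ import annotations

import asyncio


class StreamSession:
    """Hypothetical session illustrating lazy future creation."""

    def __init__(self) -> None:
        # __init__ may run before (or outside) the loop that will await these,
        # so only reserve the slots here.
        self._audio_input_ready: asyncio.Future[None] | None = None
        self._stream_ready: asyncio.Future[None] | None = None

    async def initialize_streams(self) -> None:
        # Runs inside the active loop (and again on every restart), so each
        # future is bound to the loop that will actually await it.
        loop = asyncio.get_running_loop()
        self._audio_input_ready = loop.create_future()
        self._stream_ready = loop.create_future()
        self._stream_ready.set_result(None)

    def mark_audio_ready(self) -> None:
        # Defensive checks: event handlers can fire before initialization or
        # after the future has already been resolved.
        fut = self._audio_input_ready
        if fut is not None and not fut.done():
            fut.set_result(None)


async def main() -> tuple[bool, bool]:
    session = StreamSession()
    session.mark_audio_ready()  # before init: no future yet, so a no-op
    await session.initialize_streams()
    session.mark_audio_ready()
    session.mark_audio_ready()  # already resolved: skipped, no InvalidStateError
    assert session._audio_input_ready is not None
    assert session._stream_ready is not None
    return session._audio_input_ready.done(), session._stream_ready.done()


result = asyncio.run(main())
```

Recreating the futures on each `initialize_streams()` call also means a restarted session never awaits a future left over from the previous stream.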

## Additional Improvements

### Simplified Architecture
- **Message tracking**: Single content_id_map dict instead of 4 separate dicts
  (messages, user_messages, speculative_messages, tool_messages)
- **Restart tracking**: Per-turn _restart_attempts instead of session-level
  tracking for better barge-in metrics
- **Timestamps**: Float (time.time()) instead of ISO-8601 strings for easier
  duration calculations

The single dict approach is simpler, easier to debug, and sufficient for
tracking content IDs and their types. Both approaches have identical memory
characteristics (no leaks) since dicts live inside _ResponseGeneration
instances that are created and destroyed per turn.
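
A sketch of the consolidated bookkeeping, assuming each content ID maps to a small record tagged with its kind. `ContentKind`, `ContentEntry`, and `TurnState` are hypothetical names, and the four kinds simply mirror the four dicts listed above. Because the map and `restart_attempts` live on the per-turn object, both are discarded together when the turn ends.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from enum import Enum


class ContentKind(Enum):
    # One kind per dict the old design kept separately.
    ASSISTANT = "messages"
    USER = "user_messages"
    SPECULATIVE = "speculative_messages"
    TOOL = "tool_messages"


@dataclass
class ContentEntry:
    kind: ContentKind
    text: str = ""
    started_at: float = field(default_factory=time.time)  # float seconds


@dataclass
class TurnState:
    """Hypothetical per-turn state, analogous to one _ResponseGeneration."""

    content_id_map: dict[str, ContentEntry] = field(default_factory=dict)
    restart_attempts: int = 0  # counted per turn, not per session
    created_at: float = field(default_factory=time.time)

    def add(self, content_id: str, kind: ContentKind, text: str = "") -> None:
        self.content_id_map[content_id] = ContentEntry(kind, text)

    def ids_of(self, kind: ContentKind) -> list[str]:
        # Replaces a lookup into one of the four separate dicts.
        return [cid for cid, entry in self.content_id_map.items() if entry.kind is kind]

    def elapsed(self) -> float:
        # With float timestamps a duration is a plain subtraction, with no
        # ISO-8601 parsing. (time.monotonic() would be immune to wall-clock
        # jumps; time.time() is used here to mirror the PR.)
        return time.time() - self.created_at


turn = TurnState()
turn.add("c1", ContentKind.USER, "what's the weather?")
turn.add("c2", ContentKind.SPECULATIVE)
turn.add("c3", ContentKind.TOOL)
```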

### Adopted from Origin
- Added ModelStreamErrorException to recoverable errors (from origin/main
  commit c674705, Oct 16, 2025)
- Removed child-safety line from DEFAULT_SYSTEM_PROMPT to match origin
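
A sketch of how a recoverable-error check and the per-turn restart counter could fit together. Matching by exception class name, the local `ModelStreamErrorException` stand-in, `MAX_RESTART_ATTEMPTS`, and `next_restart_attempt` are all assumptions for illustration; the plugin's other recoverable errors are not listed here.

```python
from __future__ import annotations

# Hypothetical limit; the plugin's real value is not stated in this PR.
MAX_RESTART_ATTEMPTS = 3

# Only the entry added by this PR is shown; the rest of the set is omitted.
RECOVERABLE_ERROR_NAMES = frozenset({"ModelStreamErrorException"})


class ModelStreamErrorException(Exception):
    """Local stand-in for the service exception of the same name."""


def is_recoverable(exc: BaseException) -> bool:
    # Match by class name so the check does not depend on where the SDK
    # defines the exception type.
    return type(exc).__name__ in RECOVERABLE_ERROR_NAMES


def next_restart_attempt(exc: BaseException, restart_attempts: int) -> int:
    """Return the turn's updated attempt count, or re-raise if not restartable."""
    if not is_recoverable(exc) or restart_attempts >= MAX_RESTART_ATTEMPTS:
        raise exc
    return restart_attempts + 1
```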

## Testing

All features verified working:
- Audio playback ✓
- Tool use (multiple turns) ✓
- Barge-in/interruptions ✓
- Multi-turn conversations ✓

Tested against origin/main and confirmed tool use does not work without
these fixes.

## Breaking Changes

None. Public API unchanged. Metrics format unchanged. Only internal
implementation differs.

## AI Assistance

Portions of this code were developed with assistance from AI tools for
debugging, testing, and implementation of the fixes described above.

---

Co-authored-by: Amazon Q Developer

Add safe_mode=True to jokeapi call to filter out inappropriate content.
@CLAassistant

CLAassistant commented Oct 22, 2025

CLA assistant check
All committers have signed the CLA.

JarrettAWS and others added 3 commits October 22, 2025 16:33
- Add None check before accessing content_id_map in tool handler
- Add type cast for audio_bytes to satisfy mypy Buffer type requirement
Add type cast for tool task result to fix indexing error on line 1261
bcherry requested a review from a team October 24, 2025 17:58
theomonnom merged commit e6b2309 into livekit:main Oct 27, 2025
9 checks passed
@theomonnom
Member

Thanks!

akshaym1shra pushed a commit to akshaym1shra/agents that referenced this pull request Nov 3, 2025
akshaym1shra pushed a commit to akshaym1shra/agents that referenced this pull request Nov 10, 2025
5 participants