
gemini_live: send_client_content(turn_complete=True) on first audio user-stop races with realtime_input → 1007 invalid argument after text→voice transition #4403

@ysinh

Description

Summary

In a Gemini Live pipeline that mixes text-input turns with voice-input turns,
the first user-audio turn after a text-only sequence reliably triggers a
"1007 Request contains an invalid argument" rejection from the Gemini server.
The session recovers via a session_resumption_handle reconnect (~300 ms), but
the in-flight user utterance is lost and turn boundaries are merged across
the reconnect. In our repro, the user-visible stall before the bot responds
is ~12 s.

  • Pipecat: 1.1.0
  • Python: 3.12.5
  • Model: gemini-2.5-flash-native-audio-latest (also reproduces on *-preview-12-2025)
  • Modalities: AUDIO
  • Config: inference_on_context_initialization=False
  • VAD: server-side Gemini VAD enabled

Repro

  1. Connect a Gemini Live session with inference_on_context_initialization=False.
  2. Drive several text-input turns via either _send_user_text or
    LLMMessagesAppendFrame (no audio frames). Each turn completes normally.
  3. Switch to voice mode. Mic publishes audio. Audio frames flow to Gemini via
    realtime_input.audio.
  4. User finishes speaking. Pipeline emits UserStoppedSpeakingFrame (via
    either SpeechTimeoutUserTurnStopStrategy or
    TranscriptionUserTurnStartStrategy-driven turn cycle).
  5. ~200ms later: Connection error (failure 1/3): 1007 None. Request contains an invalid argument.
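
For reference, the service setup in the repro looks roughly like this. This is
a sketch only: the module path, model name, and
inference_on_context_initialization come from this report, while the api_key
handling and exact constructor placement are assumptions that may differ
across pipecat versions.

```python
import os

# Module path taken from the source link in this report.
from pipecat.services.google.gemini_live.llm import GeminiLiveLLMService

# Sketch: constructor-parameter placement is an assumption; only the model
# name and inference_on_context_initialization are taken from this report.
llm = GeminiLiveLLMService(
    api_key=os.environ["GOOGLE_API_KEY"],  # assumed env-var handling
    model="gemini-2.5-flash-native-audio-latest",
    inference_on_context_initialization=False,  # leaves the flag primed
)
```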

Root cause

GeminiLiveLLMService._handle_user_stopped_speaking unconditionally sends
client_content(turn_complete=True) on the first user-stopped event when
_needs_initial_turn_complete_message=True:

https://github.com/pipecat-ai/pipecat/blob/v1.1.0/src/pipecat/services/google/gemini_live/llm.py#L692-L703

async def _handle_user_stopped_speaking(self, frame):
    self._user_is_speaking = False
    self._user_audio_buffer = bytearray()
    await self.start_ttfb_metrics()
    if self._vad_disabled and self._session and self._ready_for_realtime_input:
        try:
            await self._session.send_realtime_input(activity_end=ActivityEnd())
        except Exception as e:
            await self._handle_send_error(e)
    if self._needs_initial_turn_complete_message:
        self._needs_initial_turn_complete_message = False
        # NOTE: without this, the model ignores the context it's been
        # seeded with before the user started speaking
        await self._session.send_client_content(turn_complete=True)

The flag is set to True at session-ready when
inference_on_context_initialization=False (Gemini 2.5 only):

https://github.com/pipecat-ai/pipecat/blob/v1.1.0/src/pipecat/services/google/gemini_live/llm.py#L1378

if not trigger_inference and not self._is_gemini_3:
    self._needs_initial_turn_complete_message = True

The flag is only cleared in _handle_user_stopped_speaking. Text-input
turns driven via _send_user_text → realtime_input(text=...), or via
LLMMessagesAppendFrame → _create_single_response, do NOT clear it
because they don't go through the audio user-speaking path.

So when the first real audio turn lands after a text-mode prelude:

  1. Audio frames are streamed to Gemini via realtime_input.audio while the
    user speaks.
  2. User stops speaking → _handle_user_stopped_speaking fires.
  3. client_content(turn_complete=True) is sent on the same session while
    realtime_input is still being processed for the just-finished audio
    turn.
  4. Gemini server rejects the mixed-stream state with 1007.
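
The sequence above can be modeled as a small state machine. The names mirror
the report, but this is a toy sketch of the flag lifecycle, not pipecat's
implementation; the Gemini session is reduced to a list of sent messages.

```python
class SessionModel:
    """Toy model of the _needs_initial_turn_complete_message lifecycle."""

    def __init__(self, inference_on_context_initialization=False):
        # Primed at session-ready on Gemini 2.5 when
        # inference_on_context_initialization=False.
        self.needs_initial_turn_complete = not inference_on_context_initialization
        self.realtime_audio_in_flight = False
        self.sent = []  # messages written to the session, in order

    def send_user_text(self, text):
        # Text turns go out as realtime_input(text=...) and do NOT
        # clear the flag -- the core of the bug.
        self.sent.append(("realtime_input.text", text))

    def stream_audio_frame(self):
        self.sent.append(("realtime_input.audio", None))
        self.realtime_audio_in_flight = True

    def user_stopped_speaking(self):
        # Mirrors _handle_user_stopped_speaking: unconditional send on
        # the first user-stop while the flag is still primed.
        if self.needs_initial_turn_complete:
            self.needs_initial_turn_complete = False
            self.sent.append(("client_content.turn_complete", True))


m = SessionModel()
m.send_user_text("hello")             # text-mode prelude
assert m.needs_initial_turn_complete  # flag still primed after text turns
m.stream_audio_frame()                # first voice turn
m.user_stopped_speaking()
# The mixed-stream state the server rejects with 1007: turn_complete
# sent while realtime_input.audio for the same turn is still in flight.
assert m.realtime_audio_in_flight
assert m.sent[-1] == ("client_content.turn_complete", True)
```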

Log evidence

20:39:29.785 [Transcription:user] "can you tell me what is good"
20:39:29.787 LLMUserAggregator: broadcasting interruption (TranscriptionUserTurnStartStrategy)
20:39:29.789 GeminiLive _handle_interruption (no-op when bot_responding=False)
[client_content(turn_complete=True) sent here per code path]
20:39:30.514 WARNING Connection error (failure 1/3): 1007 None. Request contains an invalid argument.
20:39:30.515 Attempting reconnection (1/3)
20:39:30.791 Reconnected via session_resumption_handle
[no further 1007s for the rest of the session]

After reconnect, the new session does not have the flag primed (or the
flag gets cleared by the reconnect path in _create_initial_response for
the reconnect case), so subsequent voice turns work normally. The
user-visible impact is restricted to the first text→voice transition.

Expected behaviour

Either:

  • client_content(turn_complete=True) should not be sent while
    realtime_input.audio frames are still being processed for the same
    turn, OR
  • The flag should be cleared by ANY user-input path (text or audio), so
    that a text-mode prelude doesn't leave it primed to fire on the first
    audio user-stop, OR
  • The implementation should guard the send with a check that the audio
    path is genuinely the first input (not a transition from text input).
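
The first option can be sketched as a guard on the user-stop handler: skip
the initial client_content(turn_complete=True) whenever realtime audio has
already been streamed for the current turn. Attribute and method names here
are illustrative stand-ins, not pipecat's actual internals.

```python
class GuardedStopHandler:
    """Sketch of the guard: only send the initial turn_complete when no
    realtime_input.audio has been streamed this turn."""

    def __init__(self):
        self.needs_initial_turn_complete = True
        self.audio_streamed_this_turn = False
        self.sent = []

    def stream_audio_frame(self):
        self.sent.append("realtime_input.audio")
        self.audio_streamed_this_turn = True

    def user_stopped_speaking(self):
        if self.needs_initial_turn_complete:
            self.needs_initial_turn_complete = False
            if not self.audio_streamed_this_turn:
                # Safe: no realtime stream to race against.
                self.sent.append("client_content.turn_complete")
            # else: audio already reached the server, so the model has
            # live input and the seeding nudge is unnecessary and racy.
        self.audio_streamed_this_turn = False


h = GuardedStopHandler()
h.stream_audio_frame()   # first turn arrives as audio
h.user_stopped_speaking()
assert "client_content.turn_complete" not in h.sent  # racy send suppressed
```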

Mitigation observed

The session_resumption_handle reconnect succeeds in ~300 ms after the 1007,
and subsequent voice turns work normally for the rest of the session. As a
local workaround we explicitly clear _needs_initial_turn_complete_message
after our own text-input paths (after _send_user_text, after handling
LLMMessagesAppendFrame, and as a safety net at turn-complete), which
prevents the racy send on the first audio user-stop. The proper fix belongs
upstream, though: any pipecat-internal text-input path that doesn't hit our
overrides will still hit the race.
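
The workaround pattern looks roughly like the following. A stub stands in
for pipecat's GeminiLiveLLMService so the sketch is self-contained; in the
real pipeline the subclass would override the actual service methods, whose
exact signatures may differ from this stub.

```python
import asyncio


class StubGeminiLiveLLMService:
    """Stand-in for pipecat's GeminiLiveLLMService (sketch only)."""

    def __init__(self):
        self._needs_initial_turn_complete_message = True

    async def _send_user_text(self, text):
        # Real implementation sends realtime_input(text=...).
        pass


class PatchedGeminiLiveLLMService(StubGeminiLiveLLMService):
    async def _send_user_text(self, text):
        await super()._send_user_text(text)
        # Any completed text input means the seeded context has already
        # been consumed, so the initial turn_complete nudge must never
        # fire later on an audio user-stop.
        self._needs_initial_turn_complete_message = False


svc = PatchedGeminiLiveLLMService()
asyncio.run(svc._send_user_text("hello"))
assert svc._needs_initial_turn_complete_message is False
```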
