
gemini_live: send_client_content(turn_complete=True) on first audio user-stop races with realtime_input → 1007 invalid argument after text→voice transition #4403

@ysinh

Description

Summary

In a Gemini Live pipeline that mixes text-input turns with voice-input turns,
the first user-audio turn after a text-only sequence reliably triggers a
"1007 Request contains an invalid argument" rejection from the Gemini server.
The session recovers via a session_resumption_handle reconnect (~300 ms), but
the in-flight user utterance is lost and turn boundaries are merged across
the reconnect. In our repro, the user-visible stall before the bot responds
is ~12 s.

  • Pipecat: 1.1.0
  • Python: 3.12.5
  • Model: gemini-2.5-flash-native-audio-latest (also reproduces on *-preview-12-2025)
  • Modalities: AUDIO
  • Config: inference_on_context_initialization=False
  • VAD: server-side Gemini VAD enabled

Repro

  1. Connect a Gemini Live session with inference_on_context_initialization=False.
  2. Drive several text-input turns via either _send_user_text or
    LLMMessagesAppendFrame (no audio frames). Each turn completes normally.
  3. Switch to voice mode. Mic publishes audio. Audio frames flow to Gemini via
    realtime_input.audio.
  4. User finishes speaking. Pipeline emits UserStoppedSpeakingFrame (via
    either SpeechTimeoutUserTurnStopStrategy or
    TranscriptionUserTurnStartStrategy-driven turn cycle).
  5. ~200ms later: Connection error (failure 1/3): 1007 None. Request contains an invalid argument.
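
For reference, the service setup in the repro looks roughly like this. This is
a sketch only: the module path, model name, and
inference_on_context_initialization come from this report, while the api_key
handling and exact constructor placement are assumptions that may differ
across pipecat versions.

```python
import os

# Module path taken from the source link in this report.
from pipecat.services.google.gemini_live.llm import GeminiLiveLLMService

# Sketch: constructor-parameter placement is an assumption; only the model
# name and inference_on_context_initialization are taken from this report.
llm = GeminiLiveLLMService(
    api_key=os.environ["GOOGLE_API_KEY"],  # assumed env-var handling
    model="gemini-2.5-flash-native-audio-latest",
    inference_on_context_initialization=False,  # leaves the flag primed
)
```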

Root cause

GeminiLiveLLMService._handle_user_stopped_speaking unconditionally sends
client_content(turn_complete=True) on the first user-stopped event when
_needs_initial_turn_complete_message=True:

https://github.com/pipecat-ai/pipecat/blob/v1.1.0/src/pipecat/services/google/gemini_live/llm.py#L692-L703

async def _handle_user_stopped_speaking(self, frame):
    self._user_is_speaking = False
    self._user_audio_buffer = bytearray()
    await self.start_ttfb_metrics()
    if self._vad_disabled and self._session and self._ready_for_realtime_input:
        try:
            await self._session.send_realtime_input(activity_end=ActivityEnd())
        except Exception as e:
            await self._handle_send_error(e)
    if self._needs_initial_turn_complete_message:
        self._needs_initial_turn_complete_message = False
        # NOTE: without this, the model ignores the context it's been
        # seeded with before the user started speaking
        await self._session.send_client_content(turn_complete=True)

The flag is set to True at session-ready when
inference_on_context_initialization=False (Gemini 2.5 only):

https://github.com/pipecat-ai/pipecat/blob/v1.1.0/src/pipecat/services/google/gemini_live/llm.py#L1378

if not trigger_inference and not self._is_gemini_3:
    self._needs_initial_turn_complete_message = True

The flag is only cleared in _handle_user_stopped_speaking. Text-input
turns driven via _send_user_text → realtime_input(text=...), or via
LLMMessagesAppendFrame → _create_single_response, do NOT clear it
because they don't go through the audio user-speaking path.

So when the first real audio turn lands after a text-mode prelude:

  1. Audio frames are streamed to Gemini via realtime_input.audio while the
    user speaks.
  2. User stops speaking → _handle_user_stopped_speaking fires.
  3. client_content(turn_complete=True) is sent on the same session while
    realtime_input is still being processed for the just-finished audio
    turn.
  4. Gemini server rejects the mixed-stream state with 1007.
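
The sequence above can be modeled as a small state machine. The names mirror
the report, but this is a toy sketch of the flag lifecycle, not pipecat's
implementation; the Gemini session is reduced to a list of sent messages.

```python
class SessionModel:
    """Toy model of the _needs_initial_turn_complete_message lifecycle."""

    def __init__(self, inference_on_context_initialization=False):
        # Primed at session-ready on Gemini 2.5 when
        # inference_on_context_initialization=False.
        self.needs_initial_turn_complete = not inference_on_context_initialization
        self.realtime_audio_in_flight = False
        self.sent = []  # messages written to the session, in order

    def send_user_text(self, text):
        # Text turns go out as realtime_input(text=...) and do NOT
        # clear the flag -- the core of the bug.
        self.sent.append(("realtime_input.text", text))

    def stream_audio_frame(self):
        self.sent.append(("realtime_input.audio", None))
        self.realtime_audio_in_flight = True

    def user_stopped_speaking(self):
        # Mirrors _handle_user_stopped_speaking: unconditional send on
        # the first user-stop while the flag is still primed.
        if self.needs_initial_turn_complete:
            self.needs_initial_turn_complete = False
            self.sent.append(("client_content.turn_complete", True))


m = SessionModel()
m.send_user_text("hello")             # text-mode prelude
assert m.needs_initial_turn_complete  # flag still primed after text turns
m.stream_audio_frame()                # first voice turn
m.user_stopped_speaking()
# The mixed-stream state the server rejects with 1007: turn_complete
# sent while realtime_input.audio for the same turn is still in flight.
assert m.realtime_audio_in_flight
assert m.sent[-1] == ("client_content.turn_complete", True)
```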

Log evidence

20:39:29.785 [Transcription:user] "can you tell me what is good"
20:39:29.787 LLMUserAggregator: broadcasting interruption (TranscriptionUserTurnStartStrategy)
20:39:29.789 GeminiLive _handle_interruption (no-op when bot_responding=False)
[client_content(turn_complete=True) sent here per code path]
20:39:30.514 WARNING Connection error (failure 1/3): 1007 None. Request contains an invalid argument.
20:39:30.515 Attempting reconnection (1/3)
20:39:30.791 Reconnected via session_resumption_handle
[no further 1007s for the rest of the session]

After reconnect, the new session does not have the flag primed (or the
flag gets cleared by the reconnect path in _create_initial_response for
the reconnect case), so subsequent voice turns work normally. The
user-visible impact is restricted to the first text→voice transition.

Expected behaviour

Either:

  • client_content(turn_complete=True) should not be sent while
    realtime_input.audio frames are still being processed for the same
    turn, OR
  • The flag should be cleared by ANY user-input path (text or audio), so
    that a text-mode prelude doesn't leave it primed to fire on the first
    audio user-stop, OR
  • The implementation should guard the send with a check that the audio
    path is genuinely the first input (not a transition from text input).
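
The first option can be sketched as a guard on the user-stop handler: skip
the initial client_content(turn_complete=True) whenever realtime audio has
already been streamed for the current turn. Attribute and method names here
are illustrative stand-ins, not pipecat's actual internals.

```python
class GuardedStopHandler:
    """Sketch of the guard: only send the initial turn_complete when no
    realtime_input.audio has been streamed this turn."""

    def __init__(self):
        self.needs_initial_turn_complete = True
        self.audio_streamed_this_turn = False
        self.sent = []

    def stream_audio_frame(self):
        self.sent.append("realtime_input.audio")
        self.audio_streamed_this_turn = True

    def user_stopped_speaking(self):
        if self.needs_initial_turn_complete:
            self.needs_initial_turn_complete = False
            if not self.audio_streamed_this_turn:
                # Safe: no realtime stream to race against.
                self.sent.append("client_content.turn_complete")
            # else: audio already reached the server, so the model has
            # live input and the seeding nudge is unnecessary and racy.
        self.audio_streamed_this_turn = False


h = GuardedStopHandler()
h.stream_audio_frame()   # first turn arrives as audio
h.user_stopped_speaking()
assert "client_content.turn_complete" not in h.sent  # racy send suppressed
```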

Mitigation observed

The session_resumption_handle reconnect succeeds in ~300 ms after the 1007,
and subsequent voice turns work normally for the rest of the session. As a
local workaround we explicitly clear _needs_initial_turn_complete_message
after our own text-input paths (after _send_user_text, after handling
LLMMessagesAppendFrame, and as a safety net at turn-complete), which
prevents the racy send on the first audio user-stop. The proper fix belongs
upstream, though: any pipecat-internal text-input path that doesn't hit our
overrides will still hit the race.
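
The workaround pattern looks roughly like the following. A stub stands in
for pipecat's GeminiLiveLLMService so the sketch is self-contained; in the
real pipeline the subclass would override the actual service methods, whose
exact signatures may differ from this stub.

```python
import asyncio


class StubGeminiLiveLLMService:
    """Stand-in for pipecat's GeminiLiveLLMService (sketch only)."""

    def __init__(self):
        self._needs_initial_turn_complete_message = True

    async def _send_user_text(self, text):
        # Real implementation sends realtime_input(text=...).
        pass


class PatchedGeminiLiveLLMService(StubGeminiLiveLLMService):
    async def _send_user_text(self, text):
        await super()._send_user_text(text)
        # Any completed text input means the seeded context has already
        # been consumed, so the initial turn_complete nudge must never
        # fire later on an audio user-stop.
        self._needs_initial_turn_complete_message = False


svc = PatchedGeminiLiveLLMService()
asyncio.run(svc._send_user_text("hello"))
assert svc._needs_initial_turn_complete_message is False
```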
