## Summary

In a Gemini Live pipeline that mixes text-input turns with voice-input turns,
the first user-audio turn following a text-only sequence reliably triggers a
`1007 Request contains an invalid argument` rejection from the Gemini server.
The session recovers via `session_resumption_handle` reconnect (~300ms), but
the in-flight user utterance is lost and turn boundaries get merged across
the reconnect. In our repro, the user-visible stall before the bot responds
is ~12s.
- Pipecat: 1.1.0
- Python: 3.12.5
- Model: `gemini-2.5-flash-native-audio-latest` (also reproduces on `*-preview-12-2025`)
- Modalities: `AUDIO`
- Config: `inference_on_context_initialization=False`
- VAD: server-side Gemini VAD enabled
## Repro

- Connect a Gemini Live session with `inference_on_context_initialization=False`.
- Drive several text-input turns via either `_send_user_text` or
  `LLMMessagesAppendFrame` (no audio frames). Each turn completes normally.
- Switch to voice mode. Mic publishes audio. Audio frames flow to Gemini via
  `realtime_input.audio`.
- User finishes speaking. Pipeline emits `UserStoppedSpeakingFrame` (via
  either `SpeechTimeoutUserTurnStopStrategy` or a
  `TranscriptionUserTurnStartStrategy`-driven turn cycle).
- ~200ms later:
  `Connection error (failure 1/3): 1007 None. Request contains an invalid argument.`
## Root cause

`GeminiLiveLLMService._handle_user_stopped_speaking` unconditionally sends
`client_content(turn_complete=True)` on the first user-stopped event when
`_needs_initial_turn_complete_message=True`:
https://github.com/pipecat-ai/pipecat/blob/v1.1.0/src/pipecat/services/google/gemini_live/llm.py#L692-L703
```python
async def _handle_user_stopped_speaking(self, frame):
    self._user_is_speaking = False
    self._user_audio_buffer = bytearray()
    await self.start_ttfb_metrics()
    if self._vad_disabled and self._session and self._ready_for_realtime_input:
        try:
            await self._session.send_realtime_input(activity_end=ActivityEnd())
        except Exception as e:
            await self._handle_send_error(e)
    if self._needs_initial_turn_complete_message:
        self._needs_initial_turn_complete_message = False
        # NOTE: without this, the model ignores the context it's been
        # seeded with before the user started speaking
        await self._session.send_client_content(turn_complete=True)
```
The flag is set to `True` at session-ready when
`inference_on_context_initialization=False` (Gemini 2.5 only):
https://github.com/pipecat-ai/pipecat/blob/v1.1.0/src/pipecat/services/google/gemini_live/llm.py#L1378

```python
if not trigger_inference and not self._is_gemini_3:
    self._needs_initial_turn_complete_message = True
```
The flag is only cleared in `_handle_user_stopped_speaking`. Text-input
turns driven via `_send_user_text` → `realtime_input(text=...)`, or via
`LLMMessagesAppendFrame` → `_create_single_response`, do NOT clear it
because they don't go through the audio user-speaking path.
So when the first real audio turn lands after a text-mode prelude:

- Audio frames are streamed to Gemini via `realtime_input.audio` while the
  user speaks.
- User stops speaking → `_handle_user_stopped_speaking` fires.
  `client_content(turn_complete=True)` is sent on the same session while
  `realtime_input` is still being processed for the just-finished audio
  turn.
- Gemini server rejects the mixed-stream state with 1007.
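The flag lifecycle above can be sketched with a self-contained stand-in. This is a simulation of the state machine described in this report, not the real pipecat class; the method names mirror the code paths involved, and the `sent` list stands in for messages written to the websocket:

```python
# Simulation of the _needs_initial_turn_complete_message lifecycle
# (stand-in only; not the real GeminiLiveLLMService).
class FlagSim:
    def __init__(self, inference_on_context_initialization: bool):
        # Mirrors session-ready: flag primed only when inference is NOT
        # triggered at context initialization (Gemini 2.5 path).
        self.needs_initial_turn_complete = not inference_on_context_initialization
        self.sent = []  # record of messages "sent" to the server

    def send_user_text(self, text: str) -> None:
        # Text path: realtime_input(text=...) -- does NOT clear the flag.
        self.sent.append(("realtime_input.text", text))

    def stream_user_audio(self) -> None:
        self.sent.append(("realtime_input.audio", "<pcm frames>"))

    def handle_user_stopped_speaking(self) -> None:
        # Audio path: the first user-stop sends
        # client_content(turn_complete=True) even though
        # realtime_input.audio is still being processed server-side.
        if self.needs_initial_turn_complete:
            self.needs_initial_turn_complete = False
            self.sent.append(("client_content", "turn_complete=True"))

sim = FlagSim(inference_on_context_initialization=False)
sim.send_user_text("hello")          # text turns leave the flag primed
sim.send_user_text("another turn")
sim.stream_user_audio()              # first voice turn
sim.handle_user_stopped_speaking()   # racy client_content -> 1007
assert sim.sent[-1] == ("client_content", "turn_complete=True")
```

The point of the sketch: any number of text turns can complete without touching the flag, so the racy send is deferred to whichever audio user-stop happens first.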
## Log evidence

```
20:39:29.785 [Transcription:user] "can you tell me what is good"
20:39:29.787 LLMUserAggregator: broadcasting interruption (TranscriptionUserTurnStartStrategy)
20:39:29.789 GeminiLive _handle_interruption (no-op when bot_responding=False)
[client_content(turn_complete=True) sent here per code path]
20:39:30.514 WARNING Connection error (failure 1/3): 1007 None. Request contains an invalid argument.
20:39:30.515 Attempting reconnection (1/3)
20:39:30.791 Reconnected via session_resumption_handle
[no further 1007s for the rest of the session]
```
After reconnect, the new session does not have the flag primed (or the
flag gets cleared by the reconnect path in `_create_initial_response` for
the reconnect case), so subsequent voice turns work normally. The
user-visible impact is restricted to the first text→voice transition.
## Expected behaviour

Either:

- `client_content(turn_complete=True)` should not be sent while
  `realtime_input.audio` frames are still being processed for the same
  turn, OR
- The flag should be cleared by ANY user-input path (text or audio), so
  that a text-mode prelude doesn't leave it primed to fire on the first
  audio user-stop, OR
- The implementation should guard the send with a check that the audio
  path is genuinely the first input (not a transition from text input).
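The third option can be sketched against the same kind of stand-in. Everything below is a simulation: `saw_text_input` is an invented tracking field, not an existing pipecat attribute, and the class is not the real service:

```python
# Sketch of the "genuinely first input" guard (simulation only).
class GuardedFlagSim:
    def __init__(self):
        self.needs_initial_turn_complete = True
        self.saw_text_input = False  # invented field for the guard
        self.sent = []

    def send_user_text(self, text: str) -> None:
        # Any text turn means the session already has turn state.
        self.saw_text_input = True
        self.sent.append(("realtime_input.text", text))

    def handle_user_stopped_speaking(self) -> None:
        if self.needs_initial_turn_complete:
            self.needs_initial_turn_complete = False
            # Guard: after a text-mode prelude the extra turn_complete
            # would collide with the in-flight realtime_input stream
            # and trip the 1007, so skip it.
            if not self.saw_text_input:
                self.sent.append(("client_content", "turn_complete=True"))

sim = GuardedFlagSim()
sim.send_user_text("prelude")
sim.handle_user_stopped_speaking()
assert ("client_content", "turn_complete=True") not in sim.sent
```

With the guard in place, the initial `turn_complete` is still sent for sessions whose very first input is audio (preserving the seeded-context behavior the upstream NOTE comment describes), but never after a text prelude.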
## Mitigation observed

`session_resumption_handle` reconnect succeeds in ~300ms after the 1007.
Subsequent voice turns work normally for the rest of the session. As a
local workaround we explicitly clear `_needs_initial_turn_complete_message`
after our own text-input paths (after `_send_user_text`, after handling
`LLMMessagesAppendFrame`, and as a safety net at turn-complete). That
prevents the racy send on the first audio user-stop. The proper fix
belongs upstream, though, since any pipecat-internal text-input path that
doesn't hit our overrides will still hit the race.
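The shape of that workaround is a subclass override. The base class below is a minimal stub standing in for `GeminiLiveLLMService` so the sketch runs standalone; in a real pipeline you would subclass the pipecat class itself:

```python
import asyncio


class GeminiLiveLLMService:
    """Stub standing in for the real pipecat service (sketch only)."""

    def __init__(self):
        self._needs_initial_turn_complete_message = True

    async def _send_user_text(self, text: str) -> None:
        pass  # real class sends realtime_input(text=...)


class PatchedGeminiLive(GeminiLiveLLMService):
    async def _send_user_text(self, text: str) -> None:
        await super()._send_user_text(text)
        # A text turn has now reached the server, so the "first turn
        # needs an explicit turn_complete" assumption no longer holds.
        # Clearing the flag prevents the racy client_content send on
        # the first audio user-stop after a text-mode prelude.
        self._needs_initial_turn_complete_message = False


svc = PatchedGeminiLive()
asyncio.run(svc._send_user_text("hello"))
assert svc._needs_initial_turn_complete_message is False
```

The same one-line clear goes in each text-input path the application controls (the `LLMMessagesAppendFrame` handler, a turn-complete safety net); only the `_send_user_text` override is shown here.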
## Related issues
- Different root causes; fixed in their respective releases. The
  text→voice transition flag race appears to be a separate code path
  that survived into 1.1.
- `TranscriptionUserTurnStartStrategy` causing 5-15s latency. Same
  strategy involvement; different downstream symptom (delay, not
  disconnect).
- `TranscriptionUserTurnStartStrategy` broadcast. Adjacent: same
  upstream trigger, but the failure mode here is a
  `client_content`/`realtime_input` collision rather than coroutine
  cancellation.