Inworld TTSTextFrame word timestamps strip punctuation and poison assistant context across turns #4261

@timofey-TK

Description

pipecat version

0.0.108

Python version

3.13

Operating System

Ubuntu 24.04

Issue description

Summary

We are seeing a reproducible issue with InworldTTSService + post-TTS assistant aggregation in pipecat-ai==0.0.108.

When Inworld returns word timestamps, Pipecat emits word-level TTSTextFrames without punctuation. Because the assistant context is built downstream from TTS, those punctuation-less tokens become the canonical assistant message stored in LLMContext.

That flattened assistant text is then reused in later LLM prompts, and the model starts imitating the punctuation-less style on subsequent turns.
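The mechanism can be shown with a minimal sketch. The payload shape below is illustrative, not Inworld's actual schema: word timestamps carry bare words, so joining them reconstructs the assistant text without any punctuation.

```python
# Minimal sketch of the failure mode: word-level timestamp tokens
# (bare words, punctuation already stripped) are joined back into
# the assistant message downstream of TTS.
original = "Hey, welcome back! It's good to have you again."

# Illustrative word-timestamp payload, not Inworld's real schema.
word_timestamps = [
    ("Hey", 0.00), ("welcome", 0.35), ("back", 0.70),
    ("It's", 1.10), ("good", 1.35), ("to", 1.55),
    ("have", 1.70), ("you", 1.90), ("again", 2.10),
]

# Downstream aggregation effectively does this:
aggregated = " ".join(word for word, _ in word_timestamps)
print(aggregated)  # Hey welcome back It's good to have you again
```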

This also shows up in frontend transcript streams: interim bot transcript buffers are built word-by-word without punctuation.

Environment

  • pipecat-ai==0.0.108
  • pipecat-ai-flows>=0.0.22
  • InworldTTSService
  • WebRTC call flow

Pipeline shape:

transport.input() -> stt -> context_aggregator.user() -> llm -> tts -> transport.output() -> context_aggregator.assistant()

Reproduction steps

  1. Use InworldTTSService with assistant aggregation after TTS.
  2. Let the LLM produce a punctuated response with multiple clauses/sentences.
  3. Let Inworld return word timestamps.
  4. Observe that assistant history stored in context is punctuation-less.
  5. Trigger the next LLM turn.
  6. Observe that the next prompt already contains punctuation-less assistant messages and that the model starts imitating that style.
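Steps 4 through 6 can be sketched in plain Python, with a list standing in for Pipecat's `LLMContext` message history:

```python
# Sketch of how the flattened text poisons the next turn.
# A plain list stands in for the LLMContext message history.
flattened = "Hey welcome back It's good to have you again"

messages = [
    {"role": "assistant", "content": flattened},  # stored after TTS (step 4)
    {"role": "user", "content": "Yes, I'm ready."},
]

# Steps 5-6: the next LLM request is built from this history verbatim,
# so the model only ever sees the punctuation-less assistant style.
next_prompt = list(messages)
print(next_prompt[0]["content"])
```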

Expected behavior

  • Assistant memory stored in LLMContext should preserve the original assistant text punctuation.
  • Future LLM prompts should not be degraded by punctuation-less TTS alignment text.
  • Frontend transcript consumers should not receive a final transcript that is effectively a flattened run-on sentence with punctuation separated into its own final event.
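One possible mitigation, sketched below as a hypothetical helper (not an existing Pipecat API): when the word sequence of the TTS-aligned text matches the original LLM response, store the punctuated original in context instead of the flattened reconstruction.

```python
import re


def prefer_punctuated(aligned: str, original: str) -> str:
    """Hypothetical helper: keep the original punctuated text when the
    TTS-aligned text is just a punctuation-stripped copy of it."""
    def words(s: str) -> list[str]:
        return re.findall(r"[\w']+", s.lower())

    return original if words(aligned) == words(original) else aligned


aligned = "Hey welcome back It's good to have you again"
original = "Hey, welcome back! It's good to have you again."
print(prefer_punctuated(aligned, original))
# Hey, welcome back! It's good to have you again.
```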

Actual behavior

When Inworld timestamps are present:

  1. The spoken assistant text is reconstructed from word timestamps.
  2. Those timestamps contain bare words without punctuation.
  3. LLMAssistantAggregator stores that punctuation-less text in assistant context.
  4. The next LLM request includes assistant history like:
Hey welcome back It’s good to have you again We’ll just pick up where we left off and continue with the screening interview Are you ready to get started with the next set of questions

instead of the original punctuated text.

  5. The LLM then starts replying in the same flattened style.

Logs

From our local logs, `OpenAILLMService` receives assistant history like this:


{'role': 'assistant', 'content': 'Hey welcome back It’s good to have you again We’ll just pick up where we left off and continue with the screening interview Are you ready to get started with the next set of questions'}
{'role': 'assistant', 'content': 'Hey are you still there Just wanted to check in real quick'}
{'role': 'assistant', 'content': 'Hey just checking in one more time are you ready to continue If I don’t hear back I’ll have to go ahead and end the call on my side'}
{'role': 'assistant', 'content': 'Hey uh this is virtual assistant actually thanks for jumping back in Are you ready to continue with the screening questions'}
{'role': 'assistant', 'content': 'Great thanks So just so I understand your current situation are you working right now or are you between roles'}


Those turns were originally generated as natural punctuated speech, but the history fed back into the LLM is flattened.
