perf: halve avatar fps and skip TTS resample on event loop #30
Merged
Two further per-call CPU wins found by walking through the `AnamAvatarPublisher` send path against the running venv.

## 1. Avatar VP8 encode rate: 30fps → 15fps

`AnamAvatarPublisher` wraps an `AVSynchronizer` whose `video_track` inherits from `aiortc`'s `VideoStreamTrack`. The track's `_framerate` attribute drives `next_timestamp()`, which is what wakes up `recv()` and triggers VP8 encode of the next frame.

At 30fps × 7 concurrent sessions on one pod that's ~210 encodes/sec in the executor thread pool. py-spy dumps during the previous load test showed 5 worker threads simultaneously inside `encode (aiortc/codecs/vpx.py:240)`; half of those go away at 15fps.

Patching `_framerate` directly on the instance is brittle (private attribute, no public setter exists in 0.5.8), but it's a one-line override and the upstream contract for "subclass or wrap" requires copying significantly more state to do the same thing. 15fps is imperceptible for a talking-head avatar.

## 2. Inworld TTS native 24kHz output

`anam.AnamAvatarPublisher._send_audio` calls `pcm.resample(target_sample_rate=24000, target_channels=1)` for every TTS chunk before pushing it to Anam's audio input stream. The `resample` call is **synchronous** (no thread offload) and runs on the asyncio event loop. With the default Inworld TTS output of 16kHz, every chunk triggers numpy linear interpolation inline.

`getstream`'s `PcmData.resample` does an early return when source sample rate and channels already match the target. Setting `inworld.TTS(..., sample_rate=24000)` makes the TTS itself emit at 24kHz, so the resample call is now a no-op and the event loop is spared the work.

This was the exact shape we'd been looking for: synchronous numpy work on the loop, scaling with concurrent TTS streams, invisible to profilers as a Python-level hot spot because numpy releases the GIL during the call but still blocks the loop coroutine.
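To make the early-return argument from section 2 concrete, here is a minimal, stdlib-only sketch. Everything in it is an assumption for illustration: `Pcm` is a hypothetical stand-in (it is not `getstream`'s `PcmData`), the interpolator is plain Python rather than numpy, and `send_audio` only mimics the shape of the call described above. It shows the one property that matters here: a chunk already at the target rate is returned untouched, while a 16kHz chunk pays for interpolation inline in the coroutine.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Pcm:
    """Hypothetical stand-in for a mono PCM chunk (not getstream's PcmData)."""

    samples: list
    sample_rate: int
    channels: int = 1

    def resample(self, target_sample_rate: int, target_channels: int = 1) -> "Pcm":
        # Early return: format already matches, so no work is done at all.
        if self.sample_rate == target_sample_rate and self.channels == target_channels:
            return self
        if self.channels != 1 or target_channels != 1:
            raise NotImplementedError("this sketch only handles mono audio")

        # Plain linear interpolation, executed synchronously by the caller.
        n_in = len(self.samples)
        n_out = n_in * target_sample_rate // self.sample_rate
        out = []
        for i in range(n_out):
            pos = i * self.sample_rate / target_sample_rate
            lo = int(pos)
            hi = min(lo + 1, n_in - 1)
            frac = pos - lo
            out.append(self.samples[lo] * (1.0 - frac) + self.samples[hi] * frac)
        return Pcm(out, target_sample_rate, 1)


async def send_audio(chunk: Pcm, sink: list) -> None:
    # The call below is synchronous: while it runs, no other task scheduled
    # on this event loop can make progress.
    sink.append(chunk.resample(target_sample_rate=24000, target_channels=1))


async def main() -> None:
    sink = []
    native = Pcm([0.0] * 480, 24000)   # already 24kHz: resample is a no-op
    default = Pcm([0.0] * 320, 16000)  # 16kHz: interpolated inline
    await send_audio(native, sink)
    await send_audio(default, sink)
    print("no-op path returned the same object:", sink[0] is native)
    print("16kHz chunk grew to", len(sink[1].samples), "samples")


if __name__ == "__main__":
    asyncio.run(main())
```

Under these assumptions, emitting at the target rate upstream removes the per-chunk work entirely, without moving it to another thread.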
## What this PR does not address

Per-pod single-event-loop saturation when one pod ends up with 6+ sessions due to LB skew. Likely next step is more replicas, either with smaller per-pod requests or on a bigger node.