
Parallelize chunked Parakeet batch transcription #507

Open
hamzaq2000 wants to merge 2 commits into FluidInference:main from hamzaq2000:main

Conversation


@hamzaq2000 hamzaq2000 commented Apr 9, 2026

Why is this change needed?

This PR speeds up Parakeet batch transcription for long audio by ~2.2-2.8x by parallelizing the existing stateless chunked path. It doesn't change the streaming/live transcription path.

It adds a configurable parallelChunkConcurrency setting to ASRConfig, lets AsrManager create worker clones from already-loaded AsrModels, and updates ChunkProcessor to send independent chunks across that worker pool before merging the results with the existing merge logic.

The important part is that the decoding behavior for each chunk stays the same. The patch is really about scheduling chunk work in parallel so the runtime can keep more hardware busy and improve throughput on longer files.
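The scheduling pattern described above (a bounded pool of workers pulling independent chunks, with results merged back in their original order) can be sketched with Swift structured concurrency. This is a simplified illustration, not the PR's actual code: `transcribeChunks`, its parameters, and the `transcribe` closure are hypothetical stand-ins for the real `ChunkProcessor`/`AsrManager` plumbing.

```swift
// Hypothetical sketch of bounded-concurrency chunk fan-out.
// At most `concurrency` chunks are in flight at once; each completion
// frees a slot for the next chunk. Results are stored by chunk index,
// so the merged output order matches the serial path.
func transcribeChunks(
    _ chunks: [[Float]],
    concurrency: Int,
    transcribe: @escaping @Sendable ([Float]) async -> String
) async -> [String] {
    var results = [String?](repeating: nil, count: chunks.count)
    await withTaskGroup(of: (Int, String).self) { group in
        var next = 0
        // Seed the pool with up to `concurrency` tasks.
        while next < min(concurrency, chunks.count) {
            let i = next
            group.addTask { (i, await transcribe(chunks[i])) }
            next += 1
        }
        // As each chunk finishes, record it and start the next one.
        for await (index, text) in group {
            results[index] = text
            if next < chunks.count {
                let i = next
                group.addTask { (i, await transcribe(chunks[i])) }
                next += 1
            }
        }
    }
    return results.compactMap { $0 }
}
```

Because results are keyed by chunk index rather than completion order, the downstream merge logic sees the same sequence it would under serial processing, which is consistent with the identical-transcript comparison reported below.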

Validation

Benchmarked on Apple M3, using 16 KHz 16-bit mono wav file downloaded from this video (~1 hour duration), with 5 runs each for current upstream vs. PR branch.

| Model | Upstream Avg Time | PR Branch Avg Time | Speedup | Upstream Avg Peak Mem | PR Branch Avg Peak Mem | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| Parakeet v2 | 31.84 s | 11.25 s | 2.83x | 515.9 MiB | 537.4 MiB | +21.4 MiB |
| Parakeet v3 | 31.37 s | 12.75 s | 2.46x | 496.0 MiB | 527.0 MiB | +31.0 MiB |
| Parakeet tdt-ctc-110m | 19.89 s | 9.08 s | 2.19x | 489.6 MiB | 509.2 MiB | +19.7 MiB |

I compared the resulting transcripts and word timings before and after this change for v2, v3, and tdt-ctc-110m, and found no differences. So based on this one test file at least, the optimization appears safe.

Peak memory footprint was measured with macOS /usr/bin/time -lp. While it does increase, the measured increase is modest relative to the speedup, so I think it's reasonable to keep parallelChunkConcurrency set to 4 by default rather than make it opt-in.

Choosing the parallelChunkConcurrency default

A default value of 4 for the chunk parallelism was chosen because higher values yielded little to no extra speedup, while lower values still left speed on the table, at least on the two devices I tested: an iPhone SE 3 and an M3 MacBook Air.

AI Disclosure

OpenAI Codex was used to write the code for this patch.


Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.


@Alex-Wengg
Member

@hamzaq2000 could you compare the test-clean performance between your changes and main on your Mac? See benchmarks.md for more info.

@hamzaq2000
Author

Yep, ran it, results (Apple M3):

| Model | main Overall RTFx | PR Overall RTFx | main Total Time | PR Total Time | Speedup |
| --- | --- | --- | --- | --- | --- |
| Parakeet v3 | 89.81x | 95.20x | 216.59 s | 204.34 s | 1.06x |
| Parakeet v2 | 84.46x | 87.64x | 230.32 s | 221.97 s | 1.04x |
| Parakeet tdt-ctc-110m | 135.33x | 141.73x | 143.74 s | 137.25 s | 1.05x |

Not much of a speedup on shorter files, which make up the test-clean dataset:

Average duration: 7.42s
Median duration: 5.79s
Min duration: 1.29s
Max duration: 34.96s

But on longer files, quite a speedup.
