fix: clamp GlobalMemoryManager to zero to prevent OOM from negative memory accounting #74898
…emory accounting

The GlobalMemoryManager's free() method allowed currentMemoryBytes to go negative when callers freed more memory than was tracked. This caused requestMemory() to believe memory was always available (since negative < max), disabling all backpressure. The BufferManager would then allocate unbounded buffers until the JVM ran out of heap, crashing with OOM.

Root cause: BufferDequeue.take() could free a negative value when queue.maxMemoryUsage (the queue's local allocation counter) drifted from what GlobalMemoryManager actually granted, e.g. when requestMemory() returned 0 during memory pressure but the queue's internal counter was not decremented accordingly.

Changes:
- GlobalMemoryManager.free(): CAS-clamp currentMemoryBytes to 0 when it would go negative, with a WARN log for observability.
- BufferDequeue.take(): guard the free call so we never pass a negative unusedBytes value to the memory manager.
- Added two regression tests covering single and repeated over-free scenarios.
…negative memory accounting

When GlobalMemoryManager.free() releases more bytes than were allocated, currentMemoryBytes goes negative. This disables the backpressure gate in requestMemory() (currentMemoryBytes >= maxMemoryBytes never triggers when negative), causing unbounded buffering that leads to OOM and stuck syncs.

Changes:
- GlobalMemoryManager.free(): Use CAS loop to clamp currentMemoryBytes to 0 instead of allowing negative values. Log a warning when clamping.
- BufferDequeue.take(): Guard memoryManager.free() behind unusedBytes > 0 check to prevent over-freeing at the source.
- GlobalMemoryManagerTest: Add regression tests for over-free scenarios.

Related to https://github.com/airbytehq/oncall/issues/11670
Related to #74898

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Summary
Fixes #74897
GlobalMemoryManager.free() allows currentMemoryBytes to go negative, which disables the backpressure gate in requestMemory() and leads to unbounded buffering → OOM crashes or infinite zero-byte flush loops.

This has been reported multiple times (#42109, #31905, Discussion #36827) across different connectors (BigQuery, Intercom, Mixpanel, GitHub) and was never root-caused. The issue is in the platform's async buffer framework, not in any specific connector.
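To make the failure mode concrete, here is a minimal sketch of how a negative counter defeats a gate of the form currentMemoryBytes >= maxMemoryBytes. UnclampedManager, its constructor parameters, and tracked() are hypothetical names invented for this illustration. The class is a simplified stand-in that always grants a fixed block, not the actual GlobalMemoryManager code.

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Hypothetical, simplified stand-in for illustration only; not Airbyte's implementation.
class UnclampedManager(private val maxMemoryBytes: Long, private val blockSizeBytes: Long) {
    private val currentMemoryBytes = AtomicLong(0)

    // Grants one block unless the gate says the manager is full.
    fun requestMemory(): Long {
        if (currentMemoryBytes.get() >= maxMemoryBytes) return 0L // backpressure gate
        currentMemoryBytes.addAndGet(blockSizeBytes)
        return blockSizeBytes
    }

    // No lower bound: an over-free pushes the counter below zero.
    fun free(bytes: Long) {
        currentMemoryBytes.addAndGet(-bytes)
    }

    fun tracked(): Long = currentMemoryBytes.get()
}

fun main() {
    val manager = UnclampedManager(maxMemoryBytes = 100, blockSizeBytes = 10)
    manager.free(1_000) // over-free: the counter is now -1000

    var granted = 0L
    repeat(50) { granted += manager.requestMemory() } // the gate never fires

    println("granted=$granted tracked=${manager.tracked()}")
}
```

In this sketch the manager hands out 500 bytes against a 100-byte cap, and the counter is still at -500 afterwards, so the gate would stay open for many more requests.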
Changes
GlobalMemoryManager.kt
- free() now CAS-clamps currentMemoryBytes to 0 when it would go negative

BufferDequeue.kt
- take() now guards memoryManager.free(unusedBytes) behind if (unusedBytes > 0), covering the case where queue.maxMemoryUsage drifts above what GlobalMemoryManager actually granted (because requestMemory() returned 0 during memory pressure)

GlobalMemoryManagerTest.kt
- freeMoreThanAllocatedClampsToZero: verifies that a single over-free clamps to 0 and that subsequent allocations work correctly
- repeatedOverFreeDoesNotAccumulateNegativeDebt: verifies that multiple sequential over-frees don't accumulate negative debt (the production scenario)

Root Cause Analysis
BufferDequeue.take() frees queue.maxMemoryUsage - batchSizeBytes when a queue is emptied. But queue.maxMemoryUsage can be higher than what GlobalMemoryManager actually allocated: requestMemory() returns 0 when full, but the queue's addMaxMemory() retry loop in BufferEnqueue adjusts the queue's local counter regardless. The delta drives currentMemoryBytes negative.

Once negative, requestMemory()'s gate (currentMemoryBytes >= maxMemoryBytes) never fires, so all subsequent allocation requests succeed unconditionally. The system buffers without limit until the JVM OOMs.

Test Plan
- GlobalMemoryManagerTest.test() passes unchanged
- pull_request_commits stream on large repo (previously OOM'd at -1.5GB after ~75min, now expected to maintain backpressure)
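The over-free scenarios that the regression tests target can be sketched roughly as follows. ClampedCounter, allocate(), and tracked() are hypothetical names for this illustration, and free() returns a Boolean here in place of the WARN log the PR describes. This assumes a plain compare-and-set loop over an AtomicLong and may differ from the real change.

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Hypothetical sketch of a clamped counter; names are illustrative, not Airbyte's API.
class ClampedCounter {
    private val currentMemoryBytes = AtomicLong(0)

    fun allocate(bytes: Long) {
        currentMemoryBytes.addAndGet(bytes)
    }

    // Subtracts bytes but never lets the counter drop below zero.
    // Returns true when clamping happened (the real change logs a warning instead).
    fun free(bytes: Long): Boolean {
        while (true) {
            val current = currentMemoryBytes.get()
            val next = maxOf(0L, current - bytes)
            // Retry if another thread changed the counter between get() and compareAndSet().
            if (currentMemoryBytes.compareAndSet(current, next)) {
                return current - bytes < 0
            }
        }
    }

    fun tracked(): Long = currentMemoryBytes.get()
}

fun main() {
    val counter = ClampedCounter()

    // Single over-free: 10 bytes tracked, 25 freed, counter clamps to 0.
    counter.allocate(10)
    check(counter.free(25))
    check(counter.tracked() == 0L)

    // Repeated over-frees must not build up negative debt.
    repeat(3) { counter.free(100) }
    check(counter.tracked() == 0L)

    // A later allocation is counted from zero, so a gate on the counter still works.
    counter.allocate(10)
    check(counter.tracked() == 10L)
    println("ok")
}
```

The explicit loop mirrors the "CAS-clamp" wording in the change list. AtomicLong.updateAndGet with a clamping lambda would be a shorter alternative, but it returns only the new value, so detecting that a clamp occurred (for a warning) would need extra bookkeeping.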