Skip to content

fix: clamp GlobalMemoryManager to zero to prevent OOM from negative memory accounting#74898

Open
Avery Tritch (atritch) wants to merge 2 commits intoairbytehq:masterfrom
atritch:fix/global-memory-manager-negative-accounting
Open

fix: clamp GlobalMemoryManager to zero to prevent OOM from negative memory accounting#74898
Avery Tritch (atritch) wants to merge 2 commits intoairbytehq:masterfrom
atritch:fix/global-memory-manager-negative-accounting

Conversation

@atritch
Copy link
Contributor

Summary

Fixes #74897

GlobalMemoryManager.free() allows currentMemoryBytes to go negative, which disables the backpressure gate in requestMemory() and leads to unbounded buffering → OOM crashes or infinite zero-byte flush loops.

This has been reported multiple times (#42109, #31905, Discussion #36827) across different connectors (BigQuery, Intercom, Mixpanel, GitHub) and was never root-caused. The issue is in the platform's async buffer framework, not in any specific connector.

Changes

GlobalMemoryManager.kt

  • free() now CAS-clamps currentMemoryBytes to 0 when it would go negative
  • Changed the log from INFO to WARN for observability
  • Thread-safe via CAS loop (no lock required)

BufferDequeue.kt

  • take() now guards memoryManager.free(unusedBytes) behind if (unusedBytes > 0)
  • Prevents the most common entry point for negative values: when queue.maxMemoryUsage drifts above what GlobalMemoryManager actually granted (because requestMemory() returned 0 during memory pressure)

GlobalMemoryManagerTest.kt

  • freeMoreThanAllocatedClampsToZero: verifies single over-free clamps to 0 and subsequent allocations work correctly
  • repeatedOverFreeDoesNotAccumulateNegativeDebt: verifies multiple sequential over-frees don't accumulate negative debt (the production scenario)

Root Cause Analysis

BufferDequeue.take() frees queue.maxMemoryUsage - batchSizeBytes when a queue is emptied. But queue.maxMemoryUsage can be higher than what GlobalMemoryManager actually allocated — requestMemory() returns 0 when full, but the queue's addMaxMemory() retry loop in BufferEnqueue adjusts the queue's local counter regardless. The delta drives currentMemoryBytes negative.

Once negative, requestMemory()'s gate (currentMemoryBytes >= maxMemoryBytes) never fires, so all subsequent allocation requests succeed unconditionally. The system buffers without limit until the JVM OOMs.

Test Plan

  • Existing GlobalMemoryManagerTest.test() passes unchanged
  • New test: over-free clamps to zero instead of going negative
  • New test: repeated over-frees don't accumulate negative debt
  • Manual validation: self-hosted OSS v2.0.1, source-github → destination-redshift, pull_request_commits stream on large repo (previously OOM'd at -1.5GB after ~75min, now expected to maintain backpressure)

…emory accounting

The GlobalMemoryManager's free() method allowed currentMemoryBytes to go
negative when callers freed more memory than was tracked. This caused
requestMemory() to believe memory was always available (since negative <
max), disabling all backpressure. The BufferManager would then allocate
unbounded buffers until the JVM ran out of heap, crashing with OOM.

Root cause: BufferDequeue.take() could free a negative value when
queue.maxMemoryUsage (the queue's local allocation counter) drifted
from what GlobalMemoryManager actually granted — e.g. when
requestMemory() returned 0 during memory pressure but the queue's
internal counter was not decremented accordingly.

Changes:
- GlobalMemoryManager.free(): CAS-clamp currentMemoryBytes to 0 when it
  would go negative, with a WARN log for observability.
- BufferDequeue.take(): guard the free call so we never pass a negative
  unusedBytes value to the memory manager.
- Added two regression tests covering single and repeated over-free
  scenarios.
@octavia-bot
Copy link
Contributor

octavia-bot bot commented Mar 16, 2026

Note

📝 PR Converted to Draft

More info...

Thank you for creating this PR. As a policy to protect our engineers' time, Airbyte requires all PRs to be created first in draft status. Your PR has been automatically converted to draft status in respect for this policy.

As soon as your PR is ready for formal review, you can proceed to convert the PR to "ready for review" status by clicking the "Ready for review" button at the bottom of the PR page.

To skip draft status in future PRs, please include [ready] in your PR title or add the skip-draft-status label when creating your PR.

@github-actions
Copy link
Contributor

👋 Welcome to Airbyte!

Thank you for your contribution from atritch/airbyte! We're excited to have you in the Airbyte community.

If you have any questions, feel free to ask in the PR comments or join our Slack community.

💡 Show Tips and Tricks

PR Slash Commands

As needed or by request, Airbyte Maintainers can execute the following slash commands on your PR:

  • /format-fix - Fixes most formatting issues.
  • /bump-version - Bumps connector versions.
  • /run-connector-tests - Runs connector tests.
  • /run-cat-tests - Runs CAT tests.
  • /run-regression-tests - Runs regression tests for the modified connector(s).
  • /build-connector-images - Builds and publishes a pre-release docker image for the modified connector(s).
  • /publish-connectors-prerelease - Publishes pre-release connector builds (tagged as {version}-preview.{git-sha}) for all modified connectors in the PR.
  • /ai-review - AI-powered PR review for connector safety and quality gates.
  • /force-merge reason="<A_GOOD_REASON>" - Force merges the PR using admin privileges, bypassing CI checks. Requires a reason.

Tips for Working with CI

  1. Pre-Release Checks. Please pay attention to these, as they contain standard checks on the metadata.yaml file, docs requirements, etc. If you need help resolving a pre-release check, please ask a maintainer.
    • Note: If you are creating a new connector, please be sure to replace the default logo.svg file with a suitable icon.
  2. Connector CI Tests. Some failures here may be expected if your tests require credentials. Please review these results to ensure (1) unit tests are passing, if applicable, and (2) integration tests pass to the degree possible and expected.
  3. (Optional.) BYO Connector Credentials for tests in your fork. You can optionally set up your fork with BYO credentials for your connector. This can significantly speed up your review, ensuring your changes are fully tested before the maintainers begin their review.
📚 Show Repo Guidance

Helpful Resources

📝 Edit this welcome message.

devin-ai-integration bot added a commit that referenced this pull request Mar 16, 2026
…negative memory accounting

When GlobalMemoryManager.free() releases more bytes than were allocated,
currentMemoryBytes goes negative. This disables the backpressure gate in
requestMemory() (currentMemoryBytes >= maxMemoryBytes never triggers when
negative), causing unbounded buffering that leads to OOM and stuck syncs.

Changes:
- GlobalMemoryManager.free(): Use CAS loop to clamp currentMemoryBytes
  to 0 instead of allowing negative values. Log a warning when clamping.
- BufferDequeue.take(): Guard memoryManager.free() behind unusedBytes > 0
  check to prevent over-freeing at the source.
- GlobalMemoryManagerTest: Add regression tests for over-free scenarios.

Related to https://github.com/airbytehq/oncall/issues/11670
Related to #74898

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@atritch Avery Tritch (atritch) marked this pull request as ready for review March 17, 2026 19:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Development

Successfully merging this pull request may close these issues.

GlobalMemoryManager negative accounting disables backpressure, causes OOM and stuck syncs

1 participant