Describe the bug
Dgraph Alpha nodes running v25.0.0 experience unbounded heap growth leading to repeated OOMKill events in Kubernetes. The Go garbage collector's target heap size (next_gc) grows beyond the container's 20Gi memory limit because:
- No GOMEMLIMIT is set — Go has no awareness of the container's memory ceiling.
- GOGC=100 (default) — GC triggers at 2× live heap, so with ~10GB live data the GC goal reaches ~20GB, exceeding the 20Gi cgroup limit.
- Posting cache size bugs — v25.0.0 has two known bugs where the posting list cache underestimates entry sizes (fix(cache): Estimate size of posting lists #9515) and does not enforce the max cost limit (fix(cache): make updating the max cost of posting cache work again #9526), causing the cache to consume far more memory than the configured --cache size-mb=4096.
- GOMAXPROCS=128 — Dgraph v25.0.0 hardcodes GOMAXPROCS from the node CPU count (128) rather than the container's CPU limit (6), increasing scheduling overhead and memory fragmentation.
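The interaction between GOGC and the container limit can be seen with a first-order sketch of the GC pacer's heap goal (live heap × (1 + GOGC/100)). This is an approximation for illustration; the real pacer also accounts for stacks, globals, and GOMEMLIMIT when one is set.

```go
package main

import "fmt"

// gcGoalGB approximates the Go GC pacer's heap goal: the collector
// aims to finish the next cycle when the heap reaches
// live * (1 + GOGC/100). First-order sketch only.
func gcGoalGB(liveGB float64, gogc int) float64 {
	return liveGB * (1 + float64(gogc)/100)
}

func main() {
	// With ~10GB live heap and the default GOGC=100, the goal is ~20GB,
	// which already equals the 20Gi container limit before any
	// non-heap RSS (Badger mmap, goroutine stacks) is counted.
	fmt.Printf("GC goal at GOGC=100: %.2f GB\n", gcGoalGB(10, 100))
	fmt.Printf("GC goal at GOGC=50:  %.2f GB\n", gcGoalGB(10, 50))
}
```

Lowering GOGC shrinks the goal proportionally, but only GOMEMLIMIT gives the runtime an absolute ceiling.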
The alphas OOMKill in rotation, and the cycle keeps repeating: alpha-2 was killed on 2026-03-06 at 19:10 UTC, alpha-0 on 2026-03-02, and alpha-1's GC goal currently sits at 21.00GB (above the 20Gi limit), making it the likely next victim.
To Reproduce
- Deploy Dgraph v25.0.0 alpha StatefulSet with --cache size-mb=4096, memory limit of 20Gi, and no GOMEMLIMIT/GOGC env vars.
- Allow normal production query and mutation workload to run over days.
- Observe go_memstats_heap_alloc_bytes growing steadily due to the posting cache bug exceeding its configured budget.
- With GOGC=100, the GC goal (go_memstats_next_gc_bytes) reaches 2× live heap (~18-22GB), exceeding the 20Gi container limit.
- Kubernetes OOMKills the alpha (exit code 137, reason: OOMKilled). The pattern rotates across alphas as load shifts after each restart.
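The failing condition in step 4 can be checked from inside the process itself. A minimal diagnostic sketch, assuming the 20Gi limit is passed in rather than read from the cgroup:

```go
package main

import (
	"fmt"
	"runtime"
)

// heapHeadroom reports the GC pacer's next trigger point (NextGC,
// exported to Prometheus as go_memstats_next_gc_bytes) and whether it
// already exceeds the given container memory limit.
func heapHeadroom(limitBytes uint64) (nextGC uint64, overLimit bool) {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.NextGC, ms.NextGC >= limitBytes
}

func main() {
	const limit = 20 << 30 // 20Gi container limit
	next, over := heapHeadroom(limit)
	fmt.Printf("next GC goal: %d bytes, exceeds 20Gi limit: %v\n", next, over)
}
```

When overLimit is true, the kernel's OOM killer will fire before the GC ever reaches its trigger point, which is exactly the observed exit-137 pattern.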
Expected behavior
The Go garbage collector should trigger frequently enough to keep heap usage well within the 20Gi container memory limit. The posting list cache should respect the configured --cache size-mb=4096 (4GB) budget and not grow without bound.
Screenshots
Prometheus metrics captured on 2026-03-06 ~20:55 UTC:
| Alpha | Heap Live | GC Goal (next_gc) | RSS | Container Limit | Status |
|---|---|---|---|---|---|
| alpha-1 | 12.75 GB | 21.00 GB | 16.73 GB | 20Gi | GC goal exceeds limit |
| alpha-0 | 9.20 GB | 18.32 GB | 17.02 GB | 20Gi | High OOM risk |
| alpha-2 | 4.17 GB | 6.86 GB | 9.54 GB | 20Gi | Recovering (OOMKilled 1h47m prior) |
Environment
• OS: Linux (GKE nodes: Container-Optimized OS, c4d-standard-8 — 8 vCPU, 31GB RAM)
• Orchestration: Kubernetes (GKE cluster)
• Language: Go (toolchain v1.24, bundled with Dgraph v25.0.0)
• Dgraph Version: v25.0.0
• Go runtime config: GOGC=100 (default), GOMEMLIMIT=not set, GOMAXPROCS=128 (hardcoded by Dgraph from node CPUs)
• Container resources: requests cpu=4 / memory=16Gi, limits cpu=6 / memory=20Gi
• Dgraph flags: --cache size-mb=4096, --raft snapshot-after-entries=100000, --limit mutations=strict
Additional context
• Posting cache hit ratios: posting list 70.6%, block cache 93.5% — the cache is actively used, but its unbounded growth defeats the configured 4GB budget.