Skip to content

fix(skstore): purge DeletedDir tombstones at end of update transaction#1245

Open
zaccharyfavere wants to merge 1 commit into
SkipLabs:mainfrom
zaccharyfavere:fix-deleteddir-leak
Open

fix(skstore): purge DeletedDir tombstones at end of update transaction#1245
zaccharyfavere wants to merge 1 commit into
SkipLabs:mainfrom
zaccharyfavere:fix-deleteddir-leak

Conversation

@zaccharyfavere
Copy link
Copy Markdown

DeletedDir entries were created in Dirs.state by removeDir() to mark directories as deleted, but were never cleaned up. They accumulated indefinitely, causing a persistent memory leak proportional to the number of subdirectories created and destroyed over time.

The fix adds a purgeDeletedDirs() method on the Dirs class and calls it at the end of updateWithStatus(), right after the cascade finishes and fromContext is reset. At that point the tombstones have served their purpose and can safely be removed.

This is the natural cleanup point: it runs after each merge of changes into the context state, matching the lifecycle of the tombstones themselves.

cc @skiplabsdaniel

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 27, 2026

CLA assistant check
All committers have signed the CLA.

@mbouaziz
Copy link
Copy Markdown
Contributor

Hi, thanks for your contribution!
Sorry, CI failures might be because I switched jobs to use gen 2 circle ci instances but they may be counting memory differently

mbouaziz added a commit that referenced this pull request May 27, 2026
gen2 CircleCI machines enforce virtual memory limits at the cgroup
level. The Skip runtime's default SKIP_CAPACITY of 16 GB (palloc.c)
fails to mmap on any class with <16 GB RAM, producing
"ERROR (MAP FAILED): Cannot allocate memory" early in the job.

Observed on skipruntime/large.gen2 in PR #1245 and on an earlier
skdb-wasm/large.gen2 run that died at 61 s. gen1 tolerated the same
16 GB virtual mmap because only RSS was enforced and the mmap is lazy.

Set SKIP_CAPACITY explicitly per gen2 job, leaving ~25% RAM headroom
for OS and tooling:

  check-ts        medium.gen2 (4 GB)   -> 3G
  skdb            large.gen2  (8 GB)   -> 6G
  skdb-wasm       large.gen2  (8 GB)   -> 6G
  skipruntime     large.gen2  (8 GB)*  -> 6G
  compiler        xlarge.gen2 (16 GB)  -> 12G

* postgres + kafka sidecars get their own RAM allocation

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mbouaziz added a commit that referenced this pull request May 27, 2026
## Summary

Gen2 CircleCI machines enforce virtual memory limits at the cgroup
level. The Skip runtime's default `SKIP_CAPACITY` of 16 GB
(`skiplang/prelude/runtime/palloc.c:449-451`) fails to `mmap` on any
class with <16 GB RAM, producing `ERROR (MAP FAILED): Cannot allocate
memory` early in the job.

Observed on `skipruntime` / `large.gen2` in PR #1245 (job 16682) and on
an earlier `skdb-wasm` / `large.gen2` run that died at 61 s (workflow
`f78dd673`, 2026-05-27 09:37). Gen1 tolerated the same 16 GB virtual
mmap because only RSS was enforced and the mmap is lazy — gen2 evidently
does not.

Set `SKIP_CAPACITY` explicitly per gen2 job, leaving ~25% RAM headroom
for OS + tooling:

| Job | Class | RAM | `SKIP_CAPACITY` |
|---|---|---|---|
| check-ts | medium.gen2 | 4 GB | `3G` |
| skdb | large.gen2 | 8 GB | `6G` |
| skdb-wasm | large.gen2 | 8 GB | `6G` |
| skipruntime | large.gen2 (+ pg/kafka sidecars) | 8 GB primary | `6G` |
| compiler | xlarge.gen2 | 16 GB | `12G` |

Why also setting the ones that haven't (yet) failed: same `mmap` happens
on every Skip-toolchain invocation, so any gen2 job using
`skiplabs/skip*` images is one coin flip away from the same failure.
`skipruntime` on `large.gen2` succeeded 3× in a row on the Phase 1 PR
before failing on #1245 — the variance is real.

`check-examples` is on `large.gen2` but uses `cimg/base` and only runs
the Skip toolchain inside docker-compose containers (separate cgroups),
so it's not affected.

## Test plan

- [ ] Land this, then trigger a PR that exercises each gen2 workflow
(touch `skiplang/prelude/src/foo` for compiler + skdb + skdb-wasm +
skipruntime, and a ts workspace file for check-ts).
- [ ] Confirm `skipruntime` no longer dies with `MAP FAILED` at `skargo
build`.
- [ ] Confirm `compiler` still passes — 12 G cap is well above measured
peak (~10 GB).
- [ ] Watch `skdb` and `skdb-wasm` over the next few PRs for any new
failures specifically at allocation time; if they appear, drop the cap
further.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
@zaccharyfavere
Copy link
Copy Markdown
Author

Hi, alright thanks for your return !

DeletedDir entries were created in Dirs.state by removeDir() to mark
directories as deleted, but were never cleaned up. They accumulated
indefinitely, causing a persistent memory leak proportional to the
number of subdirectories created and destroyed over time.

The fix adds a purgeDeletedDirs() method on the Dirs class and calls
it at the end of updateWithStatus(), right after the cascade finishes
and fromContext is reset. At that point the tombstones have served
their purpose and can safely be removed.

This is the natural cleanup point: it runs after each merge of
changes into the context state, matching the lifecycle of the
tombstones themselves.
@mbouaziz
Copy link
Copy Markdown
Contributor

Sorry, still fixing the CI...
Very bad timing for a broken CI, we haven't had any external contribution for months 😆

@zaccharyfavere
Copy link
Copy Markdown
Author

No worries, I just re pushed because I realised there was a format error still in my code. That should be fixed now since it made one of the CI pass

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants