fix: clear device cache when a queue item is cancelled #9223

Open
plz12345 wants to merge 2 commits into invoke-ai:main from plz12345:fix/clear-device-cache-on-cancel

Conversation

@plz12345

Summary

When an image generation job is cancelled mid-denoising, the PyTorch CUDA/MPS allocator retains its memory pool and never returns it to the OS. This causes RAM/VRAM usage to accumulate across cancellations and never drop — even as more jobs run — until the app is quit and restarted.

Root cause: TorchDevice.empty_cache() is called at the end of successful invocations (e.g. at line 957 of denoise_latents.py), but a CanceledException raised during the denoising step callback causes execution to jump directly to the except CanceledException: pass handler in run_node(), bypassing that cleanup entirely. PyTorch's allocator holds the freed tensor pool (intermediate latents, activations, noise tensors) indefinitely without an explicit empty_cache() call.

Two fixes:

  1. run_node() except CanceledException handler — add gc.collect() + TorchDevice.empty_cache() so GPU/MPS memory from a cancelled invocation is returned to the OS immediately, not deferred until app restart.

  2. _process() pre-job cleanup — add TorchDevice.empty_cache() alongside the existing gc.collect() call so any residual allocator memory from the previous job (whether it completed normally, errored, or was cancelled) is cleared before the next job begins.
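
As a rough illustration of the cleanup both fixes rely on, here is a minimal sketch. It does not reproduce InvokeAI's TorchDevice.empty_cache(); `release_device_memory` is a hypothetical stand-in that calls PyTorch's public `torch.cuda.empty_cache()` or `torch.mps.empty_cache()` directly, and it does nothing beyond garbage collection on machines without torch.

```python
import gc
import importlib.util


def release_device_memory() -> str:
    """Hypothetical helper: collect garbage, then ask the torch allocator
    to drop its unused cached blocks. Returns the backend that was handled."""
    # Collect first so tensors kept alive only by reference cycles are
    # destroyed before the allocator is asked to release cached blocks.
    gc.collect()

    if importlib.util.find_spec("torch") is None:
        return "none"  # torch is not installed; nothing else to do

    import torch

    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return "cuda"

    # Guard the MPS lookups so older torch builds do not raise AttributeError.
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available() and hasattr(torch, "mps"):
        torch.mps.empty_cache()
        return "mps"

    return "cpu"


if __name__ == "__main__":
    print("released cache for backend:", release_device_memory())
```

The order matters in this sketch: empty_cache() can only release blocks that no live tensor still occupies, so the garbage collection pass comes first.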

Related Issues / Discussions

Closes #6759

QA Instructions

  1. Start InvokeAI and queue one or more image generation jobs.
  2. Cancel a job mid-generation (during denoising).
  3. Observe RAM/VRAM in Activity Monitor (macOS), nvidia-smi, or equivalent — memory should drop back toward the pre-generation baseline within a few seconds of cancellation.
  4. Before this fix: memory stays elevated permanently and accumulates with each cancellation, only recovering on app restart. After this fix: it drops promptly after each cancel.
  5. Verify that a normal (non-cancelled) generation still completes correctly and produces expected output.

Tested on: macOS (Apple Silicon / MPS unified memory), cancelling single jobs and mid-queue jobs. Memory pressure confirmed to return to baseline after each cancellation.

Merge Plan

No database changes. Single file, two small additions. Safe to merge at any time.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • ❗Changes to a redux slice have a corresponding migration
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

PyTorch's CUDA/MPS allocator holds freed tensors in a pool and never
returns them to the OS unless empty_cache() is called explicitly.

Before this change, TorchDevice.empty_cache() was only called inside
successful invocations (e.g. at the end of denoise_latents). A
CanceledException raised during denoising skips that cleanup path,
leaving working memory (intermediate latents, activations, noise
tensors) stuck in the allocator pool for the lifetime of the process.

Two fixes:
1. Call gc.collect() + TorchDevice.empty_cache() in the
   CanceledException handler in run_node(), so GPU/MPS memory is
   returned to the OS immediately when a node is cancelled.
2. Add TorchDevice.empty_cache() alongside the existing gc.collect()
   in _process() so any residual memory from the previous job
   (completed or cancelled) is cleared before starting the next one.

github-actions bot added the python (PRs that change python files) and services (PRs that change app services) labels May 22, 2026
@lstein
Collaborator

lstein commented May 25, 2026

The referenced bug report is from 2024. Is this still a problem? If so, could you provide a recipe for reproducing the memory leak? Thanks.

@plz12345
Author

Yes it is, at least on Mac.

  1. Run a generation job.
  2. Cancel it halfway through.
  3. The RAM it was using is not freed.
  4. Repeat this, and the Python process grows until RAM is exhausted.

I've been running with this patch since I submitted it, and I reliably see the desired RAM flush in Activity Monitor.

lstein self-assigned this May 30, 2026
lstein added the 6.13.5 Library Updates label May 30, 2026
lstein moved this to 6.13.5 LIBRARY UPDATES in Invoke - Community Roadmap May 30, 2026
Collaborator

@lstein lstein left a comment

Sorry for the long wait! I've spent some time working with your PR. Unfortunately the memory leak doesn't appear to happen on my Linux development system. It may be a Mac-specific issue and I can ask our Mac developer to give it a spin.

Before I do that, though, I'd like to draw your attention to a couple of issues I spotted looking at your proposed patch. I see two potential issues:

  1. The calls to gc.collect() and TorchDevice.empty_cache() at lines 460-462 occur within the denoiser's exception block, while the local execution frames are still active and still reference the in-flight latents and activation buffers. The garbage collection calls shouldn't be able to clean up these data structures; they will only be released when the call stack unwinds. It might be more effective to set a flag in the exception block and then make the garbage collection calls in a finally block outside the exception handler.
  2. Line 462 calls TorchDevice.empty_cache() before the execution of each and every queue item, regardless of whether the previous one completed successfully. This defeats the purpose of the torch cache and may bring a performance penalty. I think that if you move the GC operations out of the exception block as described above, you won't need to make this call. However, if it is necessary to avoid the memory leak on your system, could you do a little benchmarking to see if it has a noticeable impact on generation speed?
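
The first point can be illustrated with a small, torch-free sketch. Every name in it (Buffer, Canceled, denoise, run_node) is a hypothetical stand-in rather than InvokeAI code, and the behavior shown assumes CPython's reference counting. While the handler runs, the in-flight exception's traceback still references the frame that raised it, so that frame's locals stay alive; once the handler exits, the exception is cleared and a finally block can observe the release.

```python
import gc
import weakref


class Buffer:
    """Stand-in for a large latent or activation tensor."""


class Canceled(Exception):
    """Stand-in for a cancellation exception."""


def denoise(refs: list) -> None:
    latents = Buffer()                   # local "working memory"
    refs.append(weakref.ref(latents))    # lets the caller observe its lifetime
    raise Canceled()                     # cancelled mid-step


def run_node() -> dict:
    refs: list = []
    observed: dict = {}
    was_canceled = False
    try:
        denoise(refs)
    except Canceled:
        was_canceled = True              # only record the fact here
        gc.collect()
        # The active exception's traceback still holds denoise()'s frame,
        # so the buffer cannot be collected yet.
        observed["in_handler"] = refs[0]() is not None
    finally:
        if was_canceled:
            # The handler has finished, so the exception and its traceback
            # are gone and the buffer is no longer referenced.
            gc.collect()
            observed["in_finally"] = refs[0]() is not None
    return observed


if __name__ == "__main__":
    print(run_node())
```

In this sketch the buffer is reported as still alive inside the handler and as released in the finally block, which is the ordering the flag-plus-finally suggestion depends on. A device cache flush would go next to the second gc.collect().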

Also a minor nit: the comment on line 457 says that Python never cedes memory back to the OS (which is true), but it is contradicted by line 461, which says the "memory ... is returned to the OS". A more accurate description is that the memory is returned to the Python pool for reallocation.

@keturn
Contributor

keturn commented Jun 1, 2026

The described behavior is also a symptom of something else going on. Memory not returned to the OS isn't unusual for some allocator implementations. But even if it doesn't go back to the OS, it should be re-used by the allocator for the next generation.

If instead you see it continue to accumulate with every subsequent cancellation, that could be a different kind of memory leak.
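
One way to tell the two cases apart is to sample allocator statistics after each cancellation. The sketch below is only an assumption about how that could be measured: `device_memory_snapshot` and `grows_every_time` are hypothetical helpers, while the torch calls they wrap (`torch.cuda.memory_allocated`, `torch.cuda.memory_reserved`, `torch.mps.current_allocated_memory`, `torch.mps.driver_allocated_memory`) are PyTorch's public memory statistics functions.

```python
import importlib.util


def device_memory_snapshot() -> dict:
    """Hypothetical helper: bytes held by live tensors ("allocated") versus
    bytes the allocator has claimed in total ("reserved").
    Returns an empty dict when torch or a GPU backend is unavailable."""
    if importlib.util.find_spec("torch") is None:
        return {}

    import torch

    if torch.cuda.is_available():
        return {
            "allocated": torch.cuda.memory_allocated(),
            "reserved": torch.cuda.memory_reserved(),
        }

    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available() and hasattr(torch, "mps"):
        return {
            "allocated": torch.mps.current_allocated_memory(),
            "reserved": torch.mps.driver_allocated_memory(),
        }

    return {}


def grows_every_time(samples: list) -> bool:
    """True if each sample is strictly larger than the one before it."""
    return len(samples) >= 3 and all(b > a for a, b in zip(samples, samples[1:]))


if __name__ == "__main__":
    # Call device_memory_snapshot() after each cancelled job and collect
    # the "allocated" values, then check the trend.
    print(device_memory_snapshot())
    print(grows_every_time([100, 100, 100]))  # flat: memory is being reused
```

If "reserved" stays high but flat while "allocated" falls back after each cancel, the allocator is caching and reusing memory. If "allocated" itself climbs with every cancellation, something is still holding references to the tensors, and emptying the cache would not help.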

My two cents: If it's not obviously a bug in the app and there's a chance it's Python or PyTorch, it could be really dissatisfying to sink time into trying to suss out the intricacies of the current behavior only to discover those implementation details have been fixed or changed in the last five versions of torch… Could be a good thing to table until after the runtime updates go in. (Soon™)

@plz12345
Author

plz12345 commented Jun 1, 2026

So my gripe was around not freeing up memory when a job was cancelled. I have since become aware of the invokeai.yaml settings to force cache eviction more quickly. My workflow was flipping between Invoke and ltx-2-mlx for video gen, so Invoke being bad at freeing RAM was a pain point. I think Invoke's hot cache solution is fine where needed, but on Mac with MLX, you don't even need it with sub-30GB models because the Metal loading is that fast.

Since then, I ended up just vibe coding an app that actually uses native MLX models for image gen, via mflux. I know there is likely not even a whiff of Invoke supporting MLX until a miracle happens in PyTorch/Apple land, and Apple releasing MLX as an architecture tells me that's never happening unless Apple and Nvidia get in bed like Microsoft and Nvidia are.

If you want me to revise this as noted, I will, but I'm likely moving on since native MLX is nearly twice as fast as the MPS/PyTorch hand-off that has to happen in Invoke without MLX.

Also, the Flux.2 Klein 9b Q8 MLX model actually works, unlike the Q8 GGUF model, which is broken on Mac (or is in Invoke, I didn't get far).

Labels

6.13.5 Library Updates
python: PRs that change python files
services: PRs that change app services

Projects

Status: 6.13.5 LIBRARY UPDATES

Development

Successfully merging this pull request may close these issues.

[bug]: Clearing queues in mid generation does not free VRAM, even when configured to do so

4 participants