
[WIP][Experimental] Activation Offloading #1218


Status: Open · wants to merge 3 commits into main
Conversation

@lessw2020 (Contributor) commented May 23, 2025

WIP - This PR adds an option to use activation offloading via a torch.autograd context manager called manage_activations.
This can reduce activation memory by roughly 3x; however, at the moment it comes with a substantial performance penalty. It is experimental, to see whether B200s and GB200s are better equipped to handle activation offloading quickly. If so, we can revisit better overlapping/prefetching, etc.
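The manage_activations context manager itself ships in the replacement graph.py (see Usage below); as a rough illustration of the underlying idea only, and not the PR's actual implementation, PyTorch's existing torch.autograd.graph.saved_tensors_hooks API can offload tensors saved for backward to CPU during the forward pass and copy them back when backward needs them, much like the built-in torch.autograd.graph.save_on_cpu. The class and variable names below are hypothetical.

    import torch

    class OffloadActivationsToCpu(torch.autograd.graph.saved_tensors_hooks):
        # Sketch only: move tensors saved for backward onto CPU during forward,
        # then copy them back to their original device when backward needs them.
        def __init__(self, pin_memory: bool = True):
            def pack(tensor: torch.Tensor):
                cpu_copy = torch.empty(
                    tensor.size(),
                    dtype=tensor.dtype,
                    layout=tensor.layout,
                    pin_memory=pin_memory and torch.cuda.is_available(),
                )
                cpu_copy.copy_(tensor)
                return tensor.device, cpu_copy

            def unpack(packed):
                device, cpu_copy = packed
                return cpu_copy.to(device, non_blocking=pin_memory)

            super().__init__(pack, unpack)

    # Example: the linear layer's saved activations live on CPU until backward.
    model = torch.nn.Linear(1024, 1024, device="cuda")
    x = torch.randn(64, 1024, device="cuda", requires_grad=True)
    with OffloadActivationsToCpu():
        loss = model(x).sum()
    loss.backward()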

Usage:
You should back up your current graph.py under:
https://github.com/pytorch/pytorch/blob/main/torch/autograd/graph.py
and replace it with the graph.py at the root of this PR.

From there, you can toggle the activation manager in train.py by adding this context manager, or nullcontext to turn it off without re-indenting (see the toggle sketch after the snippet below).

    with torch.autograd.graph.manage_activations():
        # with contextlib.nullcontext():
        with self.train_context(optional_context_parallel_ctx):
            assert len(model_parts) == 1
            pred = model_parts[0](inputs)
            loss = self.loss_fn(pred, labels)
            # free pred before backward to avoid peak memory
            del pred
            loss.backward()
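A minimal sketch of the flag-based toggle mentioned above; enable_activation_offload is a hypothetical variable, not a torchtitan config option added by this PR, and manage_activations only exists once the patched graph.py is in place.

    import contextlib
    import torch

    enable_activation_offload = True  # hypothetical flag, set however you like

    activation_ctx = (
        torch.autograd.graph.manage_activations()
        if enable_activation_offload
        else contextlib.nullcontext()
    )

    with activation_ctx:
        # ... forward, loss, and backward as in the snippet above ...
        pass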

You'll see the activations being offloaded and then restored:
[Screenshot: trace showing activations being offloaded and then restored]

@tianyu-l (Contributor) commented:

Oh I think earlier we had #467. I wonder how this PR is related to / different from that PR.

@lessw2020 (Contributor, Author) commented:

> Oh I think earlier we had #467. I wonder how this PR is related to / different from that PR.

Andrew and I were both working in this space last year as part of the Spirits project. Andrew's PR has better overlapping, at the potential expense of needing more model integration (as I recall), whereas this PR is a very straightforward baseline: it just offloads activations and works with any model.

We largely stopped work on offloading because, streaming or not, PCIe bandwidth is quite limited; you can get fancy, but fundamentally a small pipe is a small pipe.
However, Blackwell opens up much higher bandwidth, so we may be able to better leverage offloading as an option to cut memory overhead.
I dusted this code off to test with B200 and am now waiting for GB200 results. Then we can see whether we want to invest in streaming, and if so, I can pull in work from Andrew's PR to integrate the two.
