
[WIP][Experimental] Activation Offloading #1218


Status: Open · wants to merge 3 commits into main
Conversation

@lessw2020 (Contributor) commented May 23, 2025

WIP - This PR adds an option to use activation offloading via a torch.autograd context manager called manage_activations.
This can reduce activation memory by roughly 3x; however, at the moment it comes with a substantial performance penalty. It is experimental, to see whether B200s and GB200s are better equipped to handle activation offloading quickly. If so, we can revisit better overlapping/prefetching, etc.
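The manage_activations context manager itself ships in the replacement graph.py (see Usage below); as a rough illustration of the underlying idea only, and not the PR's actual implementation, PyTorch's existing torch.autograd.graph.saved_tensors_hooks API can offload tensors saved for backward to CPU during the forward pass and copy them back when backward needs them, much like the built-in torch.autograd.graph.save_on_cpu. The class and variable names below are hypothetical.

    import torch

    class OffloadActivationsToCpu(torch.autograd.graph.saved_tensors_hooks):
        # Sketch only: move tensors saved for backward onto CPU during forward,
        # then copy them back to their original device when backward needs them.
        def __init__(self, pin_memory: bool = True):
            def pack(tensor: torch.Tensor):
                cpu_copy = torch.empty(
                    tensor.size(),
                    dtype=tensor.dtype,
                    layout=tensor.layout,
                    pin_memory=pin_memory and torch.cuda.is_available(),
                )
                cpu_copy.copy_(tensor)
                return tensor.device, cpu_copy

            def unpack(packed):
                device, cpu_copy = packed
                return cpu_copy.to(device, non_blocking=pin_memory)

            super().__init__(pack, unpack)

    # Example: the linear layer's saved activations live on CPU until backward.
    model = torch.nn.Linear(1024, 1024, device="cuda")
    x = torch.randn(64, 1024, device="cuda", requires_grad=True)
    with OffloadActivationsToCpu():
        loss = model(x).sum()
    loss.backward()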

Usage:
You should back up your current graph.py under:
https://github.com/pytorch/pytorch/blob/main/torch/autograd/graph.py
and replace it with the graph.py at the root of this PR.

From there, you can toggle the activation manager in train.py by adding this context manager, or nullcontext to turn it off without re-indenting (see the toggle sketch after the snippet below).

    with torch.autograd.graph.manage_activations():
        # with contextlib.nullcontext():
        with self.train_context(optional_context_parallel_ctx):
            assert len(model_parts) == 1
            pred = model_parts[0](inputs)
            loss = self.loss_fn(pred, labels)
            # free pred before backward to avoid peak memory
            del pred
            loss.backward()
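A minimal sketch of the flag-based toggle mentioned above; enable_activation_offload is a hypothetical variable, not a torchtitan config option added by this PR, and manage_activations only exists once the patched graph.py is in place.

    import contextlib
    import torch

    enable_activation_offload = True  # hypothetical flag, set however you like

    activation_ctx = (
        torch.autograd.graph.manage_activations()
        if enable_activation_offload
        else contextlib.nullcontext()
    )

    with activation_ctx:
        # ... forward, loss, and backward as in the snippet above ...
        pass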

You'll see the activations being offloaded and then restored:
[Screenshot: trace showing activations being offloaded and then restored]

@tianyu-l (Contributor) commented:

Oh I think earlier we had #467. I wonder how this PR is related to / different from that PR.

@lessw2020 (Contributor, Author) commented:

> Oh I think earlier we had #467. I wonder how this PR is related to / different from that PR.

Andrew and I were both working in this space last year as part of the Spirits project. Andrew's PR has better overlapping, at the potential expense of needing more model integration (as I recall), whereas this PR is a very straightforward baseline: it just offloads activations and works with any model.

We largely stopped work on offloading because, streaming or not, PCIe bandwidth is quite limited; you can get fancy, but fundamentally a small pipe is a small pipe.
However, Blackwell opens up much higher bandwidth, so we may be able to better leverage offloading as an option to cut memory overhead.
I dusted this code off to test with B200 and am now waiting for GB200 results. Then we can see whether we want to invest in streaming, and if so, I can pull in work from Andrew's PR to integrate the two.
