Fuse cumulative sum into FP8xINT4 Grouped Gemm #3812

Open

wants to merge 4 commits into main
Conversation

jwfromm (Contributor) commented Mar 13, 2025

Summary: Rather than running a separate cumsum operator in torch (which can be expensive), it's quite trivial to fuse the sum into the kernel setup we already have to do for the GEMM. This diff makes that change, and we see no measurable performance impact.

Reviewed By: jiawenliu64

Differential Revision: D71081537
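For context, here is a minimal sketch of the host-side pattern this change removes. All tensor names and shapes below are illustrative assumptions, not the actual FBGEMM operator API; the point is only that the standalone torch.cumsum launch used to build per-group row offsets can instead be folded into the setup work the grouped GEMM kernel already does over the groups.

```python
import torch

# Hypothetical stacked grouped-GEMM inputs: G groups, each with its own M.
# These names are illustrative only; they are not the FBGEMM operator API.
m_sizes = torch.tensor([128, 64, 256], dtype=torch.int64)  # rows per group
x = torch.randn(int(m_sizes.sum()), 512)                   # stacked activations

# Unfused approach: launch a separate cumsum just to get per-group row offsets.
m_offsets = torch.cumsum(m_sizes, dim=0)   # [128, 192, 448]
starts = m_offsets - m_sizes               # [0, 128, 192]

for g in range(m_sizes.numel()):
    s, e = int(starts[g]), int(m_offsets[g])
    x_g = x[s:e]   # rows belonging to group g
    # ... x_g (with group g's INT4 weights) is what the grouped GEMM consumes

# This diff computes the same prefix sum inside the GEMM's existing setup
# kernel, so the extra torch.cumsum launch above is no longer needed.
```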

facebook-github-bot (Contributor)
This pull request was exported from Phabricator. Differential Revision: D71081537

netlify bot commented Mar 13, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: 11da092
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67d48e6e21268200081e3cf6
😎 Deploy Preview: https://deploy-preview-3812--pytorch-fbgemm-docs.netlify.app

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Mar 13, 2025
Summary:
X-link: facebookresearch/FBGEMM#898


Rather than running a separate cumsum operator in torch (which can be expensive), it's quite trivial to fuse the sum into the kernel setup we already have to do for the GEMM. This diff makes that change, and we see no measurable performance impact.

Reviewed By: jiawenliu64

Differential Revision: D71081537

Josh Fromm added 3 commits March 14, 2025 13:03
Summary:
X-link: facebookresearch/FBGEMM#847

Pull Request resolved: pytorch#3766

One of the interesting changes in the preshuffled F8I4 kernel is that group scales are downcast to FP8, which risks dynamic-range issues and can hurt accuracy. We mitigate this by adding FP32 columnwise scaling to the output. Fortunately, we can do this through EVT, so the performance impact is negligible.

Differential Revision: D70587477
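As a rough sketch of the math being added (this is not the CUTLASS EVT epilogue itself, and the tensor names and shapes are assumptions): an FP32 per-column scale is multiplied into the GEMM output, restoring dynamic range that FP8 group scales alone cannot represent.

```python
import torch

M, N = 64, 256                                         # illustrative output shape
y = torch.randn(M, N)                                  # GEMM output (accumulated in high precision)
col_scale = torch.rand(N, dtype=torch.float32) + 0.5   # FP32 scale, one value per output column

# Columnwise scaling: output column n is multiplied by col_scale[n].
# In the kernel this multiply is fused into the epilogue via EVT, so it is
# effectively free compared to running a separate elementwise pass.
y_scaled = y * col_scale.view(1, N)
```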
Summary:
X-link: facebookresearch/FBGEMM#855

Pull Request resolved: pytorch#3775

This diff introduces a set of quantization helper functions to fbgemm_gpu/experimental/gen_ai to make it easier to apply the new Int4 packing and preshuffling to weights.

Differential Revision: D70643388

Reviewed By: summerdengfb
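For context on what "Int4 packing" means here, below is a generic sketch in torch. It is not the actual fbgemm_gpu helper, and the real helpers additionally preshuffle the packed weights into the layout the kernel expects; the only idea shown is that two 4-bit values share one byte.

```python
import torch

def pack_int4_pairs(w_q: torch.Tensor) -> torch.Tensor:
    """Pack adjacent int4 values (stored as int8 in [-8, 7]) into single bytes.

    Generic illustration only; the FBGEMM helpers also handle the
    kernel-specific preshuffled layout.
    """
    assert w_q.shape[-1] % 2 == 0
    nibbles = w_q.to(torch.uint8) & 0xF          # two's-complement low nibble
    lo, hi = nibbles[..., 0::2], nibbles[..., 1::2]
    return (lo | (hi << 4)).contiguous().view(torch.int8)

# Example: symmetric int4 quantization of a weight matrix, then packing.
w = torch.randn(8, 16)
scale = w.abs().amax(dim=-1, keepdim=True) / 7.0
w_q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
w_packed = pack_int4_pairs(w_q)                  # half the bytes of w_q
```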
Summary:
Pull Request resolved: pytorch#3800

Working toward support for stacked mixed-dtype grouped GEMM with preshuffling.

Differential Revision: D70870933

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Mar 14, 2025
Summary:
X-link: facebookresearch/FBGEMM#898

Pull Request resolved: pytorch#3812

Rather than running a separate cumsum operator in torch (which can be expensive), it's quite trivial to fuse the sum into the kernel setup we already have to do for the GEMM. This diff makes that change, and we see no measurable performance impact.

Reviewed By: jiawenliu64

Differential Revision: D71081537