Fuse cumulative sum into FP8xINT4 Grouped Gemm #3812

Open

wants to merge 4 commits into main
Conversation

jwfromm (Contributor) commented Mar 13, 2025

Summary: Rather than running a separate cumsum operator in torch (which can be expensive), it's quite trivial to fuse the sum into the kernel setup we already have to do for the GEMM. This diff makes that change, and we see no measurable performance impact.

Reviewed By: jiawenliu64

Differential Revision: D71081537
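For context, here is a minimal sketch of the host-side pattern this change removes. All tensor names and shapes below are illustrative assumptions, not the actual FBGEMM operator API; the point is only that the standalone torch.cumsum launch used to build per-group row offsets can instead be folded into the setup work the grouped GEMM kernel already does over the groups.

```python
import torch

# Hypothetical stacked grouped-GEMM inputs: G groups, each with its own M.
# These names are illustrative only; they are not the FBGEMM operator API.
m_sizes = torch.tensor([128, 64, 256], dtype=torch.int64)  # rows per group
x = torch.randn(int(m_sizes.sum()), 512)                   # stacked activations

# Unfused approach: launch a separate cumsum just to get per-group row offsets.
m_offsets = torch.cumsum(m_sizes, dim=0)   # [128, 192, 448]
starts = m_offsets - m_sizes               # [0, 128, 192]

for g in range(m_sizes.numel()):
    s, e = int(starts[g]), int(m_offsets[g])
    x_g = x[s:e]   # rows belonging to group g
    # ... x_g (with group g's INT4 weights) is what the grouped GEMM consumes

# This diff computes the same prefix sum inside the GEMM's existing setup
# kernel, so the extra torch.cumsum launch above is no longer needed.
```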

facebook-github-bot (Contributor)
This pull request was exported from Phabricator. Differential Revision: D71081537

netlify bot commented Mar 13, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: 11da092
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67d48e6e21268200081e3cf6
😎 Deploy Preview: https://deploy-preview-3812--pytorch-fbgemm-docs.netlify.app

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Mar 13, 2025
Summary:
X-link: facebookresearch/FBGEMM#898


Rather than running a separate cumsum operator in torch (which can be expensive), it's quite trivial to fuse the sum into the kernel setup we already have to do for the GEMM. This diff makes that change, and we see no measurable performance impact.

Reviewed By: jiawenliu64

Differential Revision: D71081537

Josh Fromm added 3 commits March 14, 2025 13:03
Summary:
X-link: facebookresearch/FBGEMM#847

Pull Request resolved: pytorch#3766

One of the interesting changes in the preshuffled F8I4 kernel is that group scales are downcast to FP8, which risks dynamic-range issues and can hurt accuracy. We mitigate this by adding FP32 columnwise scaling to the output. Fortunately, we can do this through EVT, so the performance impact is negligible.

Differential Revision: D70587477
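As a rough sketch of the math being added (this is not the CUTLASS EVT epilogue itself, and the tensor names and shapes are assumptions): an FP32 per-column scale is multiplied into the GEMM output, restoring dynamic range that FP8 group scales alone cannot represent.

```python
import torch

M, N = 64, 256                                         # illustrative output shape
y = torch.randn(M, N)                                  # GEMM output (accumulated in high precision)
col_scale = torch.rand(N, dtype=torch.float32) + 0.5   # FP32 scale, one value per output column

# Columnwise scaling: output column n is multiplied by col_scale[n].
# In the kernel this multiply is fused into the epilogue via EVT, so it is
# effectively free compared to running a separate elementwise pass.
y_scaled = y * col_scale.view(1, N)
```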
Summary:
X-link: facebookresearch/FBGEMM#855

Pull Request resolved: pytorch#3775

This diff introduces a set of quantization helper functions to fbgemm_gpu/experimental/gen_ai to make it easier to apply the new Int4 packing and preshuffling to weights.

Differential Revision: D70643388

Reviewed By: summerdengfb
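For context on what "Int4 packing" means here, below is a generic sketch in torch. It is not the actual fbgemm_gpu helper, and the real helpers additionally preshuffle the packed weights into the layout the kernel expects; the only idea shown is that two 4-bit values share one byte.

```python
import torch

def pack_int4_pairs(w_q: torch.Tensor) -> torch.Tensor:
    """Pack adjacent int4 values (stored as int8 in [-8, 7]) into single bytes.

    Generic illustration only; the FBGEMM helpers also handle the
    kernel-specific preshuffled layout.
    """
    assert w_q.shape[-1] % 2 == 0
    nibbles = w_q.to(torch.uint8) & 0xF          # two's-complement low nibble
    lo, hi = nibbles[..., 0::2], nibbles[..., 1::2]
    return (lo | (hi << 4)).contiguous().view(torch.int8)

# Example: symmetric int4 quantization of a weight matrix, then packing.
w = torch.randn(8, 16)
scale = w.abs().amax(dim=-1, keepdim=True) / 7.0
w_q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
w_packed = pack_int4_pairs(w_q)                  # half the bytes of w_q
```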
Summary:
Pull Request resolved: pytorch#3800

Working toward support for stacked mixed-dtype grouped GEMM with preshuffling.

Differential Revision: D70870933

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Mar 14, 2025
Summary:
X-link: facebookresearch/FBGEMM#898

Pull Request resolved: pytorch#3812

Rather than running a separate cumsum operator in torch (which can be expensive), it's quite trivial to fuse the sum into the kernel setup we already have to do for the GEMM. This diff makes that change, and we see no measurable performance impact.

Reviewed By: jiawenliu64

Differential Revision: D71081537