Fuse cumulative sum into FP8xINT4 Grouped Gemm #3812
base: main
Conversation
This pull request was exported from Phabricator. Differential Revision: D71081537
Summary:
X-link: facebookresearch/FBGEMM#898
Pull Request resolved: pytorch#3812

Rather than running a separate cumsum operator in torch (which can be expensive), it is quite trivial to fuse the sum into the kernel setup we already have to do for the GEMM. This diff makes that change, and we see no measurable performance impact.

Reviewed By: jiawenliu64
Differential Revision: D71081537
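A minimal sketch of the idea (with assumed names, not the actual FBGEMM kernel-setup code): the host-side loop that prepares per-group GEMM arguments already visits every group, so the cumulative row offset can be accumulated in that same pass instead of calling a separate cumsum operator beforehand.

```cpp
// Sketch only; GroupArgs and build_group_args are hypothetical stand-ins for
// the per-group argument setup the grouped GEMM already performs.
#include <cstdint>
#include <vector>

struct GroupArgs {
  int64_t row_offset; // start row of this group in the stacked activation tensor
  int64_t m;          // number of rows in this group
};

std::vector<GroupArgs> build_group_args(const std::vector<int64_t>& m_sizes) {
  std::vector<GroupArgs> args;
  args.reserve(m_sizes.size());
  int64_t running_sum = 0; // cumulative sum, fused into the setup loop
  for (int64_t m : m_sizes) {
    args.push_back({running_sum, m}); // argument setup we have to do anyway
    running_sum += m;                 // replaces the separate torch cumsum
  }
  return args;
}
```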
Summary:
X-link: facebookresearch/FBGEMM#847
Pull Request resolved: pytorch#3766

One of the interesting new changes in the preshuffled F8I4 kernel is that group scales are downcast to FP8. This risks running into dynamic range issues and impacting accuracy. We can mitigate that risk by adding FP32 columnwise scaling to the output. Fortunately, we can do this using EVT, so the performance impact is negligible. A reference sketch of the math follows.

Differential Revision: D70587477
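As a rough illustration of the math only (a plain reference loop, not the EVT epilogue itself): the FP32 columnwise scale is applied per output column on top of the result produced with FP8 group scales.

```cpp
// Reference-only sketch of applying an FP32 columnwise (per-column) scale to a
// row-major M x N GEMM output; in the kernel this is fused into the epilogue via EVT.
#include <cstdint>
#include <vector>

void apply_columnwise_scale(
    std::vector<float>& out,             // M x N output, row-major
    const std::vector<float>& col_scale, // N per-column FP32 scales
    int64_t M,
    int64_t N) {
  for (int64_t m = 0; m < M; ++m) {
    for (int64_t n = 0; n < N; ++n) {
      out[m * N + n] *= col_scale[n]; // recovers dynamic range lost to FP8 group scales
    }
  }
}
```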
Summary:
X-link: facebookresearch/FBGEMM#855
Pull Request resolved: pytorch#3775

This diff introduces a set of quantization helper functions to fbgemm_gpu/experimental/gen_ai to make it easier to apply the new Int4 packing and preshuffling to weights.

Reviewed By: summerdengfb
Differential Revision: D70643388
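For context, a generic sketch of Int4 packing (two signed 4-bit values per byte). This is only an illustration of the general technique; the actual helper functions and the preshuffled weight layout live in fbgemm_gpu/experimental/gen_ai and may differ.

```cpp
// Generic illustration of int4 packing: two 4-bit values per uint8_t.
// This does not reproduce FBGEMM's preshuffled layout.
#include <cstdint>
#include <vector>

std::vector<uint8_t> pack_int4(const std::vector<int8_t>& vals) {
  // vals are expected to lie in [-8, 7]; size assumed even for simplicity.
  std::vector<uint8_t> packed(vals.size() / 2);
  for (size_t i = 0; i < packed.size(); ++i) {
    uint8_t lo = static_cast<uint8_t>(vals[2 * i]) & 0x0F;
    uint8_t hi = static_cast<uint8_t>(vals[2 * i + 1]) & 0x0F;
    packed[i] = static_cast<uint8_t>(lo | (hi << 4));
  }
  return packed;
}
```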
Summary:
Pull Request resolved: pytorch#3800

Working on adding support for stacked mixed-dtype grouped GEMM with preshuffling.

Differential Revision: D70870933