update to tune for small ms and quantized gemv #3712
base: main
This pull request was exported from Phabricator. Differential Revision: D69819701
9921d57 to ea4d984
Summary: X-link: facebookresearch/FBGEMM#794 as title Differential Revision: D69819701
ea4d984 to 2f6b29f
Summary: X-link: facebookresearch/FBGEMM#794 as title Reviewed By: ipiszy Differential Revision: D69819701
2f6b29f to 202044f
Summary:
X-link: facebookresearch/FBGEMM#758

add small m (m = 2, 3, 4) support for fast gemv
- bf16_fast_gemv [+]
- bf16fp8bf16_fast_gemv [+]
- fp8fp8bf16_fast_gemv [+]

**(v20 perf analysis from quantize_bench)**

| B | M | N | K | Kernel Name | Elapsed Time (ms) | TFLOPS | Bandwidth (GB/s) |
|---|---|---|---|-------------|-------------------|--------|------------------|
| 1 | 1 | 8192 | 1024 | bf16_baseline | 0.017 | 0.973 | 973.581 |
| 1 | 1 | 8192 | 1024 | fp8fp8_oss_fast_gemv | 0.013 | 1.251 | 626.711 |
| 1 | 1 | 8192 | 1024 | cuda_lite | 0.014 | 1.205 | 603.859 |
| 1 | 1 | 8192 | 1024 | marlin_bf16i4 | 0.014 | 1.189 | 298.625 |
| 1 | 1 | 8192 | 1024 | machete_bf16i4 | 0.014 | 1.189 | 298.669 |
| 1 | 2 | 8192 | 1024 | bf16_baseline | 0.017 | 1.963 | 983.820 |
| 1 | 2 | 8192 | 1024 | fp8fp8_oss_fast_gemv | 0.014 | 2.414 | 605.920 |
| 1 | 2 | 8192 | 1024 | cuda_lite | 0.014 | 2.379 | 597.311 |
| 1 | 2 | 8192 | 1024 | marlin_bf16i4 | 0.014 | 2.322 | 292.742 |
| 1 | 2 | 8192 | 1024 | machete_bf16i4 | 0.014 | 2.345 | 295.741 |
| 1 | 3 | 8192 | 1024 | bf16_baseline | 0.017 | 3.006 | 1005.276 |
| 1 | 3 | 8192 | 1024 | fp8fp8_oss_fast_gemv | 0.014 | 3.513 | 589.214 |
| 1 | 3 | 8192 | 1024 | cuda_lite | 0.015 | 3.381 | 566.948 |
| 1 | 3 | 8192 | 1024 | marlin_bf16i4 | 0.014 | 3.474 | 293.277 |
| 1 | 3 | 8192 | 1024 | machete_bf16i4 | 0.014 | 3.513 | 296.593 |
| 1 | 4 | 8192 | 1024 | bf16_baseline | 0.017 | 3.920 | 984.419 |
| 1 | 4 | 8192 | 1024 | fp8fp8_oss_fast_gemv | 0.015 | 4.466 | 562.896 |
| 1 | 4 | 8192 | 1024 | cuda_lite | 0.016 | 4.100 | 516.728 |
| 1 | 4 | 8192 | 1024 | marlin_bf16i4 | 0.014 | 4.629 | 294.426 |
| 1 | 4 | 8192 | 1024 | machete_bf16i4 | 0.014 | 4.792 | 304.764 |
| 1 | 1 | 8192 | 3584 | bf16_baseline | 0.044 | 1.327 | 1327.169 |
| 1 | 1 | 8192 | 3584 | fp8fp8_oss_fast_gemv | 0.026 | 2.283 | 1142.422 |
| 1 | 1 | 8192 | 3584 | cuda_lite | 0.026 | 2.298 | 1149.878 |
| 1 | 1 | 8192 | 3584 | marlin_bf16i4 | 0.020 | 2.877 | 720.408 |
| 1 | 1 | 8192 | 3584 | machete_bf16i4 | 0.024 | 2.468 | 617.894 |
| 1 | 2 | 8192 | 3584 | bf16_baseline | 0.044 | 2.675 | 1338.550 |
| 1 | 2 | 8192 | 3584 | fp8fp8_oss_fast_gemv | 0.026 | 4.512 | 1129.580 |
| 1 | 2 | 8192 | 3584 | cuda_lite | 0.026 | 4.515 | 1130.280 |
| 1 | 2 | 8192 | 3584 | marlin_bf16i4 | 0.020 | 5.743 | 720.190 |
| 1 | 2 | 8192 | 3584 | machete_bf16i4 | 0.024 | 4.911 | 615.829 |
| 1 | 3 | 8192 | 3584 | bf16_baseline | 0.044 | 4.014 | 1339.480 |
| 1 | 3 | 8192 | 3584 | fp8fp8_oss_fast_gemv | 0.028 | 6.391 | 1067.367 |
| 1 | 3 | 8192 | 3584 | cuda_lite | 0.027 | 6.471 | 1080.655 |
| 1 | 3 | 8192 | 3584 | marlin_bf16i4 | 0.020 | 8.606 | 720.622 |
| 1 | 3 | 8192 | 3584 | machete_bf16i4 | 0.024 | 7.366 | 616.763 |
| 1 | 4 | 8192 | 3584 | bf16_baseline | 0.044 | 5.350 | 1339.637 |
| 1 | 4 | 8192 | 3584 | fp8fp8_oss_fast_gemv | 0.028 | 8.275 | 1037.158 |
| 1 | 4 | 8192 | 3584 | cuda_lite | 0.029 | 8.063 | 1010.621 |
| 1 | 4 | 8192 | 3584 | marlin_bf16i4 | 0.020 | 11.460 | 720.846 |
| 1 | 4 | 8192 | 3584 | machete_bf16i4 | 0.024 | 9.911 | 623.402 |
| 1 | 1 | 1280 | 8192 | bf16_baseline | 0.024 | 0.872 | 872.425 |
| 1 | 1 | 1280 | 8192 | fp8fp8_oss_fast_gemv | 0.015 | 1.403 | 702.176 |
| 1 | 1 | 1280 | 8192 | cuda_lite | 0.015 | 1.421 | 711.264 |
| 1 | 1 | 1280 | 8192 | marlin_bf16i4 | 0.027 | 0.779 | 195.420 |
| 1 | 1 | 1280 | 8192 | machete_bf16i4 | 0.025 | 0.837 | 209.928 |
| 1 | 2 | 1280 | 8192 | bf16_baseline | 0.024 | 1.737 | 870.022 |
| 1 | 2 | 1280 | 8192 | fp8fp8_oss_fast_gemv | 0.015 | 2.760 | 691.374 |
| 1 | 2 | 1280 | 8192 | cuda_lite | 0.015 | 2.836 | 710.432 |
| 1 | 2 | 1280 | 8192 | marlin_bf16i4 | 0.027 | 1.558 | 196.179 |
| 1 | 2 | 1280 | 8192 | machete_bf16i4 | 0.026 | 1.624 | 204.431 |
| 1 | 3 | 1280 | 8192 | bf16_baseline | 0.024 | 2.594 | 866.953 |
| 1 | 3 | 1280 | 8192 | fp8fp8_oss_fast_gemv | 0.015 | 4.094 | 684.375 |
| 1 | 3 | 1280 | 8192 | cuda_lite | 0.015 | 4.167 | 696.571 |
| 1 | 3 | 1280 | 8192 | marlin_bf16i4 | 0.027 | 2.327 | 196.054 |
| 1 | 3 | 1280 | 8192 | machete_bf16i4 | 0.026 | 2.458 | 207.069 |
| 1 | 4 | 1280 | 8192 | bf16_baseline | 0.024 | 3.458 | 867.559 |
| 1 | 4 | 1280 | 8192 | fp8fp8_oss_fast_gemv | 0.015 | 5.414 | 679.479 |
| 1 | 4 | 1280 | 8192 | cuda_lite | 0.016 | 5.408 | 678.758 |
| 1 | 4 | 1280 | 8192 | marlin_bf16i4 | 0.027 | 3.069 | 194.570 |
| 1 | 4 | 1280 | 8192 | machete_bf16i4 | 0.025 | 3.321 | 210.571 |
| 1 | 1 | 7168 | 8192 | bf16_baseline | 0.073 | 1.612 | 1612.302 |
| 1 | 1 | 7168 | 8192 | fp8fp8_oss_fast_gemv | 0.043 | 2.752 | 1376.396 |
| 1 | 1 | 7168 | 8192 | cuda_lite | 0.044 | 2.685 | 1342.856 |
| 1 | 1 | 7168 | 8192 | marlin_bf16i4 | 0.033 | 3.580 | 896.051 |
| 1 | 1 | 7168 | 8192 | machete_bf16i4 | 0.039 | 3.019 | 755.510 |
| 1 | 2 | 7168 | 8192 | bf16_baseline | 0.073 | 3.227 | 1614.307 |
| 1 | 2 | 7168 | 8192 | fp8fp8_oss_fast_gemv | 0.043 | 5.430 | 1358.541 |
| 1 | 2 | 7168 | 8192 | cuda_lite | 0.044 | 5.324 | 1332.114 |
| 1 | 2 | 7168 | 8192 | marlin_bf16i4 | 0.033 | 7.214 | 903.651 |
| 1 | 2 | 7168 | 8192 | machete_bf16i4 | 0.039 | 6.091 | 763.029 |
| 1 | 3 | 7168 | 8192 | bf16_baseline | 0.072 | 4.863 | 1622.296 |
| 1 | 3 | 7168 | 8192 | fp8fp8_oss_fast_gemv | 0.044 | 7.949 | 1326.423 |
| 1 | 3 | 7168 | 8192 | cuda_lite | 0.044 | 7.954 | 1327.215 |
| 1 | 3 | 7168 | 8192 | marlin_bf16i4 | 0.033 | 10.816 | 904.127 |
| 1 | 3 | 7168 | 8192 | machete_bf16i4 | 0.038 | 9.172 | 766.770 |
| 1 | 4 | 7168 | 8192 | bf16_baseline | 0.073 | 6.452 | 1614.684 |
| 1 | 4 | 7168 | 8192 | fp8fp8_oss_fast_gemv | 0.046 | 10.219 | 1279.299 |
| 1 | 4 | 7168 | 8192 | cuda_lite | 0.047 | 9.904 | 1239.944 |
| 1 | 4 | 7168 | 8192 | marlin_bf16i4 | 0.033 | 14.345 | 900.287 |
| 1 | 4 | 7168 | 8192 | machete_bf16i4 | 0.039 | 12.128 | 761.147 |

Reviewed By: ipiszy
Differential Revision: D69492556
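For readers who want to sanity-check the TFLOPS and Bandwidth columns above, here is a minimal sketch of how such figures can be derived from M, N, K and the elapsed time. It is an assumption about the accounting, not something this PR states: FLOPs are taken as 2·M·N·K, and bytes moved as the activation, weight, and bf16 output tensors each touched once. The helper names (`gemm_flops`, `gemm_bytes`, `tflops`, `bandwidth_gbs`) are hypothetical and are not part of quantize_bench. Spot checks on a few rows suggest the reported Bandwidth/TFLOPS ratios are consistent with this accounting, within rounding of the printed values.

```python
def gemm_flops(m, n, k):
    """Assumed FLOP count for an (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * n * k


def gemm_bytes(m, n, k, a_bytes, b_bytes, out_bytes=2):
    """Assumed bytes moved: activations, weights, and output each touched once.

    a_bytes / b_bytes are bytes per element (2 for bf16, 1 for fp8, 0.5 for int4).
    out_bytes defaults to 2 on the assumption that the output is bf16.
    """
    return m * k * a_bytes + n * k * b_bytes + m * n * out_bytes


def tflops(m, n, k, elapsed_ms):
    """Throughput in TFLOPS for a kernel that took elapsed_ms milliseconds."""
    return gemm_flops(m, n, k) / (elapsed_ms * 1e-3) / 1e12


def bandwidth_gbs(m, n, k, elapsed_ms, a_bytes, b_bytes, out_bytes=2):
    """Effective memory bandwidth in GB/s under the byte accounting above."""
    return gemm_bytes(m, n, k, a_bytes, b_bytes, out_bytes) / (elapsed_ms * 1e-3) / 1e9


if __name__ == "__main__":
    # First table row: bf16_baseline, M=1, N=8192, K=1024, 0.017 ms (rounded in the table).
    m, n, k, ms = 1, 8192, 1024, 0.017
    print(f"TFLOPS    ~ {tflops(m, n, k, ms):.3f}")
    print(f"Bandwidth ~ {bandwidth_gbs(m, n, k, ms, a_bytes=2, b_bytes=2):.1f} GB/s")
```

Because the table prints elapsed time to only three decimals, values recomputed from that column will differ slightly from the reported TFLOPS and Bandwidth figures. Under this accounting the workloads are memory-bound at small M, which would explain why TFLOPS scales roughly linearly with M while elapsed time stays nearly flat.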
202044f to 86ab01a
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/794
as title
Differential Revision: D69819701