-
AMMO URL:
-
This is a much-needed feature. The benefits of supporting this dtype can be seen here.
-
Bump for this 👍
-
@HaiShaw as for tensor scaling, are you supporting log2 scaling in the Quark tool and exporting it as a file? I am asking because FP8 has better precision near 0 (around 1e-4 precision for fp8_e4m3fnuz), so we usually scale values into [-32, 32] before quantizing to FP8 for better precision. Simply scaling fp16 with a tensor-wise scalar using this equation won't give the best numeric accuracy in PTQ:
Hence we need to develop a routine:
This is simply because, as you know, FP8 has a non-uniform distribution of representable values. I hope this question draws your attention.
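A toy sketch (mine, not Quark or vLLM code) of the two ideas in this comment: rounding the per-tensor scale to a power of two (log2 scaling) and mapping the tensor into a narrower range such as [-32, 32] instead of the full e4m3 range before the cast. The 448.0 constant is the OCP e4m3 maximum; the function names are hypothetical.

```python
import torch

# Hypothetical helper: per-tensor scale that maps absmax onto `target_max`,
# optionally rounded up to a power of two ("log2" scaling as asked above).
def fp8_scale(x: torch.Tensor, target_max: float, log2: bool = False) -> torch.Tensor:
    scale = x.abs().amax().to(torch.float32) / target_max
    if log2:
        # Round the scale *up* so the tensor's max still fits after division.
        scale = torch.exp2(torch.ceil(torch.log2(scale)))
    return scale

def to_fp8_e4m3(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (x.to(torch.float32) / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)

x = torch.randn(4096, dtype=torch.float16)
for target in (448.0, 32.0):                      # full e4m3 range vs. [-32, 32]
    s = fp8_scale(x, target, log2=True)
    err = (to_fp8_e4m3(x, s).to(torch.float32) * s - x.to(torch.float32)).abs().mean()
    print(f"target_max={target:5.1f}  scale={s.item():.6f}  mean_abs_err={err.item():.6f}")
```

This only illustrates the trade-off the comment raises; whether a narrower target range actually helps depends on the tensor's distribution.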
-
This RFC is to facilitate the community in enabling the new FP8 data type in vLLM, for benefits to both memory bandwidth and computation throughput (on FP8-capable hardware: AMD MI300, nVIDIA H100, etc.).
fp16/half precision is used throughout as the higher-precision example, but the same specs apply to bfloat16, fp32, etc.
- Support loading FP8 quantized models from AMMO or a similar quantizer; the quantized model includes:
- Support OCP e4m3 as the FP8 data type during inference
- Per-Tensor Scaling is required
- FP8 Tensor Core computation (e4m3 GEMM) feasibility:
- Support both AMD and nVIDIA hardware:
- Computation kernel with FP8 input * 1/S (inverse scaling factor) for each FP8 input (see the sketch below)

Reference
RFC: FP8 Quantization Schema in vLLM #3218
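For concreteness, here is a minimal sketch (not vLLM or AMMO code) of the per-tensor scheme above: each FP8 tensor carries a scale S derived from its absmax, and the GEMM consumes the e4m3 tensors together with 1/S, the inverse scaling factor. `FP8_E4M3_MAX` and the function names are assumptions for illustration; a real kernel would apply the rescale inside the fused FP8 GEMM rather than in separate ops.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude of OCP e4m3

def quantize_per_tensor(x: torch.Tensor):
    # Per-tensor scale S maps the tensor's absmax onto the e4m3 range.
    scale = x.abs().amax().to(torch.float32) / FP8_E4M3_MAX          # S
    x_fp8 = (x.to(torch.float32) / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def fp8_gemm_reference(a_fp8, inv_scale_a, b_fp8, inv_scale_b):
    # Emulates an e4m3 GEMM: the FP8 inputs feed the matmul (done in fp32 here
    # for emulation) and the per-tensor inverse scaling factors rescale the result.
    acc = a_fp8.to(torch.float32) @ b_fp8.to(torch.float32)
    return (acc / (inv_scale_a * inv_scale_b)).to(torch.float16)

x = torch.randn(8, 64, dtype=torch.float16)    # fp16 activations
w = torch.randn(64, 32, dtype=torch.float16)   # fp16 weights
x_q, sx = quantize_per_tensor(x)
w_q, sw = quantize_per_tensor(w)
y = fp8_gemm_reference(x_q, 1.0 / sx, w_q, 1.0 / sw)  # kernel gets FP8 tensors + 1/S
```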