[Feature]: Quark quantization format upstream to VLLM #10294

kewang-xlnx · 2024-11-13T12:06:03Z

Quark is a comprehensive cross-platform toolkit designed to simplify and enhance the quantization of deep learning models. Supporting both PyTorch and ONNX models, Quark empowers developers to optimize their models for deployment on a wide range of hardware backends, achieving significant performance gains without compromising accuracy.
Here is the introduction to Quark.
Currently, the format of quantized models exported by Quark differs from the formats that VLLM supports, so we need to contribute code to VLLM to add support for the Quark format.

Quark Format

configuration file config.json of Quark format
key names and data types of Quark safetensors

model.layers.1.self_attn.k_proj.input_scale, 	torch.float16
model.layers.1.self_attn.k_proj.weight, 	torch.float8_e4m3fn
model.layers.1.self_attn.k_proj.weight_scale, 	torch.float16
model.layers.1.self_attn.o_proj.input_scale, 	torch.float16
model.layers.1.self_attn.o_proj.weight, 	torch.float8_e4m3fn
model.layers.1.self_attn.o_proj.weight_scale, 	torch.float16
model.layers.1.self_attn.q_proj.input_scale, 	torch.float16
model.layers.1.self_attn.q_proj.weight, 	torch.float8_e4m3fn
model.layers.1.self_attn.q_proj.weight_scale, 	torch.float16
model.layers.1.self_attn.v_proj.input_scale, 	torch.float16
model.layers.1.self_attn.v_proj.weight, 	torch.float8_e4m3fn
model.layers.1.self_attn.v_proj.weight_scale, 	torch.float16

KV scale format (if the KV cache is used)

model.layers.1.self_attn.k_proj.output_scale, 	torch.float16
model.layers.1.self_attn.v_proj.output_scale, 	torch.float16
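
As a rough illustration of the naming scheme listed above, the following sketch groups tensor keys by module and checks that each module carries the expected parameters and data types. The helper names (group_by_module, check_fp8_module) are hypothetical and not part of Quark or VLLM, and the data types are compared as plain strings so the snippet runs without PyTorch.

```python
from collections import defaultdict

# Key names and dtypes copied from the listing above (k_proj of layer 1 only).
QUARK_TENSORS = {
    "model.layers.1.self_attn.k_proj.input_scale": "torch.float16",
    "model.layers.1.self_attn.k_proj.weight": "torch.float8_e4m3fn",
    "model.layers.1.self_attn.k_proj.weight_scale": "torch.float16",
    "model.layers.1.self_attn.k_proj.output_scale": "torch.float16",
}


def group_by_module(tensors):
    """Map each module prefix to {parameter name: dtype string}."""
    grouped = defaultdict(dict)
    for key, dtype in tensors.items():
        # Split "a.b.c.weight_scale" into ("a.b.c", "weight_scale").
        prefix, _, param = key.rpartition(".")
        grouped[prefix][param] = dtype
    return dict(grouped)


def check_fp8_module(params):
    """Return a list of mismatches against the listing; empty means it matches."""
    problems = []
    if params.get("weight") != "torch.float8_e4m3fn":
        problems.append("weight is missing or not torch.float8_e4m3fn")
    for name in ("input_scale", "weight_scale"):
        if params.get(name) != "torch.float16":
            problems.append(f"{name} is missing or not torch.float16")
    return problems


grouped = group_by_module(QUARK_TENSORS)
for prefix, params in grouped.items():
    print(prefix, check_fp8_module(params) or "ok")
```

The output_scale entry is treated as optional here because the listing only shows it for k_proj and v_proj when the KV cache is used.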

Design

Add the Quark format to the ROCm/vllm repo by creating a directory for it in vllm/model_executor/layers/quantization that contains the following files.

quark.py: implements and manages quantization configurations and processing for the Quark quantization format for LLMs.
quark_moe.py: implements and manages quantization configurations and processing for the Quark quantization format for LLMs with an MoE structure.
schemes/quark_scheme.py: an abstract base class for the quantization schemes in Quark, defining the structure for weight creation, the forward pass, and post-loading weight processing.
schemes/quark_fp8.py: provides the implementation of the W8A8Fp8 quantization scheme within the Quark framework.
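
The three responsibilities described for schemes/quark_scheme.py (weight creation, forward process, post-loading weight processing) could be laid out roughly as below. This is only a sketch: the class and method names are assumptions chosen for illustration, not the actual interface of the proposed files, and the toy subclass uses plain Python floats and a dict in place of FP8 tensors and a real layer.

```python
from abc import ABC, abstractmethod


class QuarkSchemeSketch(ABC):
    """Hypothetical stand-in for the abstract base class in quark_scheme.py."""

    @abstractmethod
    def create_weights(self, layer: dict, output_size: int, input_size: int) -> None:
        """Register placeholder parameters on the layer before loading."""

    @abstractmethod
    def process_weights_after_loading(self, layer: dict) -> None:
        """Normalize the loaded parameters once the checkpoint is read."""

    @abstractmethod
    def apply_weights(self, layer: dict, x: list) -> list:
        """Run the forward pass for one input vector."""


class ToyPerTensorScheme(QuarkSchemeSketch):
    """Toy per-tensor scheme: y = (W_q @ x) * weight_scale, on Python lists."""

    def create_weights(self, layer, output_size, input_size):
        layer["weight"] = [[0.0] * input_size for _ in range(output_size)]
        layer["weight_scale"] = [1.0]

    def process_weights_after_loading(self, layer):
        # Assumption: the scale is loaded as a one-element list; unwrap it.
        scale = layer["weight_scale"]
        layer["weight_scale"] = float(scale[0] if isinstance(scale, list) else scale)

    def apply_weights(self, layer, x):
        scale = layer["weight_scale"]
        return [sum(w * xi for w, xi in zip(row, x)) * scale for row in layer["weight"]]


layer = {}
scheme = ToyPerTensorScheme()
scheme.create_weights(layer, output_size=2, input_size=2)
# Pretend these values came from a checkpoint.
layer["weight"] = [[1.0, 2.0], [3.0, 4.0]]
layer["weight_scale"] = [0.5]
scheme.process_weights_after_loading(layer)
y = scheme.apply_weights(layer, [1.0, 1.0])
print(y)
```

Keeping the three steps as separate methods lets each scheme decide how to lay out its parameters while the layer code calls them in a fixed order.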

In the first stage, we will integrate FP8 quantization in the Quark format into VLLM. Other Quark formats, such as INT4/INT8 with per_tensor/per_channel/per_group granularity, will be integrated into VLLM later as needed.
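
To make the per_tensor/per_channel/per_group distinction concrete, the sketch below computes how many scales a weight of shape (out_features, in_features) would need under each granularity. The function name and the returned shapes follow a common convention and are assumptions; the layout Quark actually exports may differ.

```python
def scale_shape(granularity, out_features, in_features, group_size=None):
    """Return the assumed shape of the weight scale tensor for a granularity."""
    if granularity == "per_tensor":
        # One scale shared by the whole weight matrix.
        return (1,)
    if granularity == "per_channel":
        # One scale per output channel (row of the weight matrix).
        return (out_features, 1)
    if granularity == "per_group":
        # One scale per contiguous group of input features in each row.
        if not group_size or in_features % group_size != 0:
            raise ValueError("group_size must evenly divide in_features")
        return (out_features, in_features // group_size)
    raise ValueError(f"unknown granularity: {granularity}")


for g in ("per_tensor", "per_channel", "per_group"):
    print(g, scale_shape(g, 4096, 4096, group_size=128))
```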


kewang-xlnx added the feature request label Nov 13, 2024
