Differences in Dynamic Quantization Speedup for Varying SFT Tasks on Qwen2-72B-Instruct Models #40

@IPostYellow

I have applied dynamic quantization to two models based on Qwen2-72B-Instruct, each fine-tuned on a different SFT task. The acceleration from quantization differs significantly between the two tasks, even though both models share the same base model.
Could you shed some light on why different SFT tasks might influence the quantization speedup for the first token, given that both tasks have similar input lengths?
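
For reference, here is a minimal sketch of what I mean by per-tensor dynamic FP8 (E4M3) quantization; the `dynamic_fp8_quantize` helper and the scale choice are illustrative assumptions on my part, not vLLM's exact kernel:

```python
import torch  # requires PyTorch >= 2.1 for float8 dtypes

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3fn

def dynamic_fp8_quantize(x: torch.Tensor):
    # "Dynamic" here: the scale is derived at runtime from the tensor's
    # observed max magnitude, rather than calibrated offline.
    scale = x.abs().max().float().clamp(min=1e-12) / FP8_E4M3_MAX
    x_fp8 = (x.float() / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

x = torch.randn(4, 8192, dtype=torch.float16)
x_fp8, scale = dynamic_fp8_quantize(x)
x_hat = x_fp8.float() * scale  # dequantize to inspect the rounding error
print((x.float() - x_hat).abs().max().item())
```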
Below are the deployment details for my two tasks.

| Task | Model         | GPUs   | vLLM  | QPS   | Avg. prompt tokens | TTFT (ms) | TPOT (ms) | Latency (ms) |
|------|---------------|--------|-------|-------|--------------------|-----------|-----------|--------------|
| A    | unquantized   | 8×L40S | 0.4.2 | 0.15  | 7962.74            | 3744.11   | 67.43     | 22125.21     |
| A    | FP8 quantized | 4×L40S | 0.4.2 | 0.15  | 7965.9             | 3358.27   | 50.56     | 17823.79     |
| B    | unquantized   | 8×L40S | 0.4.2 | 0.145 | 8216.11            | 3790.8    | 118.64    | 57087.26     |
| B    | FP8 quantized | 4×L40S | 0.4.2 | 0.15  | 8042.46            | 3674.77   | 113.39    | 50649.97     |
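
Assuming the usual streaming decomposition, latency ≈ TTFT + TPOT × (output tokens − 1), the averages above also pin down the implied mean output length per task; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check, assuming latency ≈ TTFT + TPOT * (output_tokens - 1).
def implied_output_tokens(latency_ms: float, ttft_ms: float, tpot_ms: float) -> float:
    return (latency_ms - ttft_ms) / tpot_ms + 1

print(implied_output_tokens(17823.79, 3358.27, 50.56))   # Task A, quantized: ~287 tokens
print(implied_output_tokens(50649.97, 3674.77, 113.39))  # Task B, quantized: ~415 tokens
```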

As the table shows, the benefit of FP8 quantization on Task B is modest: it halves the GPU count from 8 to 4 L40S, but latency improves only marginally. Task A, by contrast, not only saves GPUs but also reduces inference time substantially.
I also compared the proportion of parameters equal to zero in the two FP8-quantized models (see the sketch below), and the difference is not significant. What other factors might explain this discrepancy?
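
This is roughly how I computed the zero-parameter proportion; a minimal sketch, assuming the quantized weights are stored as `float8_e4m3fn` tensors in a safetensors file (the file name `model.safetensors` is a placeholder):

```python
import torch
from safetensors.torch import load_file

def zero_fraction(path: str) -> float:
    """Fraction of exactly-zero elements across all FP8 weight tensors in a shard."""
    state = load_file(path)
    zeros, total = 0, 0
    for name, w in state.items():
        if w.dtype == torch.float8_e4m3fn:
            w = w.to(torch.float32)  # compare in fp32; float8 CPU kernels are limited
            zeros += (w == 0).sum().item()
            total += w.numel()
    return zeros / max(total, 1)

print(zero_fraction("model.safetensors"))  # placeholder path to one checkpoint shard
```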
