Replies: 1 comment
I don't think you can use Ollama models with vLLM directly, but you can download similar models from Hugging Face and load them with vLLM. The Q4 models from Ollama are GGUF, mostly in Q4_K_M quantization, and you can find that quantization for most larger models on HF. Alternatively, you can use AWQ or GPTQ quants with vLLM; they are roughly the same size as Q4. Note that GGUF support in vLLM is experimental and most likely slower than the other quant types. If the GGUF files sit in subfolders of the HF repo, or on different branches, you have to download them and give vLLM the path to the local .gguf file. If the model is split into multiple parts, you have to download the parts and merge them locally, as vLLM only supports single-file GGUF. I wrote instructions on how to merge the files in #8570 (comment). There are also the official docs: https://docs.vllm.ai/en/latest/features/quantization/gguf.html
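For reference, here is a minimal sketch of that workflow, downloading a single-file Q4_K_M GGUF from HF and pointing vLLM at the local file. The repo name, file name, and tokenizer below are illustrative, swap in whatever model you actually want to benchmark:

```python
# Hedged sketch: fetch a Q4_K_M GGUF quant from Hugging Face and load it with vLLM.
# Repo/file/tokenizer names are examples only, not a recommendation.
from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams

# Download the single-file GGUF quant (adjust repo_id/filename to your model).
gguf_path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
)

# Point vLLM at the local .gguf file. GGUF files don't ship an HF tokenizer,
# so pass the tokenizer of the original (unquantized) model explicitly.
llm = LLM(
    model=gguf_path,
    tokenizer="meta-llama/Meta-Llama-3.1-8B-Instruct",
)

outputs = llm.generate(
    ["Explain the difference between AWQ and GGUF quantization."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

Note that `hf_hub_download` returns the local path to the cached file, which is exactly what vLLM needs as the `model` argument for GGUF.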
Hi everyone,
I'd like to test vLLM performance (tokens/s, TTFT), but I only have results for Ollama and their Q4 models.
To compare my results, I'd like to reuse the same models from the Ollama repo (Llama 3.1, Gemma 2, Mistral), but I don't know how to dump them and/or make them compatible with vLLM.
Do you know a way of importing Ollama models into vLLM (reassigning blobs)?
If not, would an AWQ quantization give me the same result (model) as the Ollama Q4 models?
Any other solution?
Regards