Replies: 1 comment
I don't think you can use Ollama models with vLLM directly, but you can download similar models from Hugging Face and load them with vLLM. The Q4 models from Ollama are GGUF, mostly in Q4_K_M quantization, and you can find that quantization for most larger models on HF. Alternatively, you can use AWQ or GPTQ quants with vLLM; they are roughly the same size as Q4. Note that GGUF support in vLLM is experimental and most likely slower than the other quant types. If the GGUF files sit in subfolders of the HF repo, or on different branches, you have to download them and give vLLM the path to the local .gguf file. If the model is split into multiple parts, you have to download the parts and merge them locally, as vLLM only supports single-file GGUF. I wrote instructions on how to merge the files in #8570 (comment). There are also the official docs: https://docs.vllm.ai/en/latest/features/quantization/gguf.html
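For reference, here is a minimal sketch of that workflow, downloading a single-file Q4_K_M GGUF from HF and pointing vLLM at the local file. The repo name, file name, and tokenizer below are illustrative, swap in whatever model you actually want to benchmark:

```python
# Hedged sketch: fetch a Q4_K_M GGUF quant from Hugging Face and load it with vLLM.
# Repo/file/tokenizer names are examples only, not a recommendation.
from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams

# Download the single-file GGUF quant (adjust repo_id/filename to your model).
gguf_path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
)

# Point vLLM at the local .gguf file. GGUF files don't ship an HF tokenizer,
# so pass the tokenizer of the original (unquantized) model explicitly.
llm = LLM(
    model=gguf_path,
    tokenizer="meta-llama/Meta-Llama-3.1-8B-Instruct",
)

outputs = llm.generate(
    ["Explain the difference between AWQ and GGUF quantization."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

Note that `hf_hub_download` returns the local path to the cached file, which is exactly what vLLM needs as the `model` argument for GGUF.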
Hi everyone,
I'd like to test vLLM performance (tokens/s, TTFT), but I only have results for Ollama and their Q4 models.
To compare my results, I'd like to reuse the same models from the Ollama repo (Llama 3.1, Gemma 2, Mistral), but I don't know how to dump them and/or make them compatible with vLLM.
Do you know a way of importing Ollama models into vLLM (reassigning blobs)?
If not, would an AWQ quantization give me the same result (model) as the Ollama Q4 models?
Any other solution?
Regards