Replies: 1 comment
-
Either run it on 4 GPUs or buy another 2.
Given that you have generous GPU specs, another option would be to run 3 replicas, with each replica taking 2 GPUs.
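As a hedged sketch of the 3-replica layout suggested above (the model name, port numbers, and GPU indices are illustrative placeholders, not anything from the thread; assumes a standard vLLM installation on a single 6-GPU host):

```shell
# Sketch only: three independent vLLM replicas, each pinned to 2 GPUs
# via CUDA_VISIBLE_DEVICES and using tensor parallelism within the pair.
# MODEL is a hypothetical example; substitute your own.
MODEL=meta-llama/Llama-3.1-70B-Instruct

CUDA_VISIBLE_DEVICES=0,1 vllm serve "$MODEL" --tensor-parallel-size 2 --port 8000 &
CUDA_VISIBLE_DEVICES=2,3 vllm serve "$MODEL" --tensor-parallel-size 2 --port 8001 &
CUDA_VISIBLE_DEVICES=4,5 vllm serve "$MODEL" --tensor-parallel-size 2 --port 8002 &
```

A load balancer (e.g. nginx round-robin) in front of the three ports would then spread requests across the replicas.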
-
Dear vLLM experts, I am trying to deploy vLLM in distributed mode. At our research institute we have 4 nodes, each with 1x A100, and they work well as a distributed Ray cluster. We recently got another node with 2x L40S; Ray shows all 6 GPUs, but one node has 2 GPUs. How do I start vLLM so it uses all the GPUs?
currently we use:
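The poster's command is cut off above. As a hedged sketch only (not the poster's actual command; the model name is a placeholder), one way to span all 6 GPUs of an existing Ray cluster in vLLM is pipeline parallelism across the nodes, since tensor parallelism generally wants equal GPU counts per stage:

```shell
# Sketch: run on the Ray head node, assuming `ray status` already shows
# all 6 GPUs. One pipeline stage per GPU (tp=1, pp=6); MODEL is hypothetical.
MODEL=meta-llama/Llama-3.1-70B-Instruct

vllm serve "$MODEL" \
  --distributed-executor-backend ray \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 6
```

Note that mixing A100 and L40S GPUs in one pipeline is untested here; the slowest stage will bound throughput, so the per-replica layout suggested in the reply above may be simpler.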