
[RFC]: Support multi-node serving in CI #2173

@pkking


Motivation.

Currently, all CI workflows run on a single Ascend server node, which limits the maximum available NPU count to 8 cards.

As a result, it is impossible to test multi-node scenarios, which may be closer to real-world use cases.

This RFC aims to provide a solution that enables community developers to write test cases for multi-node vLLM serving deployments.

Proposed Change.

Since the community CI runs on a Kubernetes cluster, there are many out-of-the-box multi-node serving solutions, for example lws (LeaderWorkerSet), and the vLLM project also provides an example for reference. The straightforward idea is to build directly on lws. Here is a general plan:

  1. Add a new workflow that contains two jobs:
    1. job1: create a new lws instance that exposes a vLLM service with multiple pods on separate nodes
    2. job2: wait until the lws service is ready, then run the tests; this job must clean up the resources when the tests finish
  2. Add guides that help developers set up more multi-node test cases
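
The two jobs above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: the function names (`build_lws_manifest`, `http_probe`, `wait_until_ready`, `run_job2`), the container spec, and the timeout values are hypothetical. The manifest uses the `leaderworkerset.x-k8s.io/v1` LeaderWorkerSet fields `replicas` and `leaderWorkerTemplate.size`; NPU resource requests and node-placement rules are left out because they depend on the cluster.

```python
import time
import urllib.request
from typing import Callable


def build_lws_manifest(name: str, image: str, pods_per_group: int) -> dict:
    """Job 1: describe one lws group (leader plus workers) as a manifest dict."""

    def pod_template() -> dict:
        # Hypothetical container spec; a real one would add the vLLM
        # command, NPU resource requests, and scheduling constraints.
        return {"spec": {"containers": [{"name": "vllm", "image": image}]}}

    return {
        "apiVersion": "leaderworkerset.x-k8s.io/v1",
        "kind": "LeaderWorkerSet",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,  # one serving group
            "leaderWorkerTemplate": {
                "size": pods_per_group,  # pods per group, leader included
                "leaderTemplate": pod_template(),
                "workerTemplate": pod_template(),
            },
        },
    }


def http_probe(url: str) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # connection refused, DNS failure, HTTP error status
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 1800.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it succeeds or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


def run_job2(
    probe: Callable[[], bool],
    run_tests: Callable[[], int],
    cleanup: Callable[[], None],
    **wait_kwargs,
) -> int:
    """Job 2: wait for the service, run the tests, and always clean up."""
    try:
        if not wait_until_ready(probe, **wait_kwargs):
            return 1  # service never became ready
        return run_tests()
    finally:
        cleanup()  # runs on success, failure, and exceptions alike
```

In a real workflow, `probe` could be something like `lambda: http_probe("http://<leader-service>:8000/health")`, assuming the vLLM server's health endpoint is reachable from the test runner, and `cleanup` would delete the lws instance. Putting the cleanup in a `finally` block means NPUs are released even when the tests fail or raise.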

Since multi-node serving may be time- and NPU-consuming, it is best that this workflow is not triggered by PRs.

Feedback Period.

Maybe one week

CC List.

@Yikun @wangxiyuan @Potabk @MengqingCao

Any Other Things.

No response
