[WIP][AQUA] Add Supporting Fine-Tuned Models in Multi-Model Deployment #1186
Description
The current implementation of Multi-Model Deployment in AQUA supports only base models. Fine-tuned models, however, are a critical part of many customer workflows, allowing customers to adapt base models to domain-specific use cases.
This PR introduces support for deploying fine-tuned LLMs as part of a multi-model deployment group on the VLLM container.
Implementation
In the first iteration, we will treat each selected model, whether it is a base model or a fine-tuned variant, as an independent entity. Even if multiple fine-tuned models share the same base model, each one will be deployed in its own isolated VLLM instance.
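A deployment group under this scheme could be described by a structure like the following minimal sketch. The field names, model names, and adapter paths here are hypothetical illustrations, not the actual AQUA schema:

```python
# Hypothetical sketch of a multi-model deployment group in which every
# entry, base or fine-tuned, maps to its own isolated VLLM instance.
model_group = [
    {
        "name": "base-llama",           # base model deployed as-is
        "base_model": "base-llama",
        "ft_weights_path": None,        # no adapter: plain base model
    },
    {
        "name": "llama-ft-finance",     # fine-tuned variant
        "base_model": "base-llama",     # shares the base model above...
        "ft_weights_path": "oci://bucket/adapters/finance/",  # hypothetical path
    },
    {
        "name": "llama-ft-legal",
        "base_model": "base-llama",     # ...and so does this one
        "ft_weights_path": "oci://bucket/adapters/legal/",
    },
]

# ...yet each entry still gets its own VLLM instance in this iteration,
# even when the base model repeats across entries.
instances = {entry["name"]: f"instance-{i}" for i, entry in enumerate(model_group)}
```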
On the SMC side, we will leverage VLLM's capability to merge LoRA adapter weights dynamically at runtime. This means each VLLM instance will independently load the base model and its corresponding fine-tuned weights.
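Concretely, each instance would be started roughly like this, using VLLM's standard LoRA serving flags. The base model name, adapter name, and adapter path below are placeholders, not values from this PR:

```shell
# Hypothetical launch of one isolated VLLM instance: the base model is
# loaded once and the LoRA adapter weights are applied at runtime.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules llama-ft-finance=/opt/adapters/finance
```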
To avoid routing conflicts caused by multiple instances serving the same base model name, we will route the base model name to exactly one instance, but we will not advertise that base model as an endpoint to users (this matches the current behavior of Single Model Deployment).
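The routing rule can be sketched as follows. The endpoint URLs, model names, and function names are illustrative only; this is not the actual gateway code:

```python
# Sketch: advertised model names map to instances; the shared base model
# name resolves to exactly one instance but is never advertised to users.
instances = {
    "llama-ft-finance": "http://instance-0:8000",
    "llama-ft-legal": "http://instance-1:8000",
}
# Base model routed to a single instance only, to avoid name conflicts.
base_model_fallback = ("base-llama", "http://instance-0:8000")

def advertised_models():
    """Only the fine-tuned endpoints are exposed to users."""
    return sorted(instances)

def route(model_name):
    """Resolve a request's model name to a single VLLM instance."""
    if model_name == base_model_fallback[0]:
        return base_model_fallback[1]  # routed internally, not advertised
    return instances[model_name]
```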
This configuration structure prepares us for future enhancements such as stacked fine-tuned deployments, where multiple fine-tuned variants are hosted under a single base model within one VLLM instance. However, that enhancement will initially apply only to single-model deployments.
In a second iteration, we will explore expanding this capability to multi-model deployments, enabling grouped deployment of fine-tuned variants with shared GPU allocation. That enhancement will require additional work across the ADS SDK, AQUA UI, and validation logic.