Document model server compatibility and config options #537

Open · wants to merge 1 commit into base: main
13 changes: 12 additions & 1 deletion config/charts/inferencepool/README.md
@@ -2,7 +2,6 @@

A chart to deploy an InferencePool and a corresponding EndpointPicker (epp) deployment.


## Install

To install an InferencePool named `vllm-llama3-8b-instruct` that selects from endpoints with label `app: vllm-llama3-8b-instruct` listening on port `8000`, you can run the following command:
@@ -23,6 +22,17 @@ $ helm install vllm-llama3-8b-instruct \

Note that the provider name is needed to deploy provider-specific resources. If no provider is specified, then only the InferencePool object and the EPP are deployed.

### Install for Triton TensorRT-LLM

Use `--set inferencePool.modelServerType=triton-tensorrt-llm` to install for Triton TensorRT-LLM, for example:

```txt
$ helm install triton-llama3-8b-instruct \
--set inferencePool.modelServers.matchLabels.app=triton-llama3-8b-instruct \
--set inferencePool.modelServerType=triton-tensorrt-llm \
oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
```
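
The same override can be expressed as a values file instead of `--set` flags. A minimal sketch, assuming the illustrative release and label names from the example above and the chart's `values.yaml` layout shown later in this diff:

```yaml
# values-triton.yaml - illustrative override; apply with `helm install -f values-triton.yaml <release> <chart>`
inferencePool:
  modelServerType: triton-tensorrt-llm
  modelServers:
    matchLabels:
      app: triton-llama3-8b-instruct
```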

## Uninstall

Run the following command to uninstall the chart:
@@ -38,6 +48,7 @@ The following table lists the configurable parameters of the chart.
| **Parameter Name** | **Description** |
|---------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
| `inferencePool.targetPortNumber` | Target port number for the model server backends; used by the inference extension to scrape metrics. Defaults to `8000`. |
| `inferencePool.modelServerType` | Type of the model servers in the pool. Valid options are `vllm` (default) and `triton-tensorrt-llm`. |
| `inferencePool.modelServers.matchLabels` | Label selector to match the model server backends managed by the inference pool. |
| `inferenceExtension.replicas` | Number of replicas for the endpoint picker extension service. Defaults to `1`. |
| `inferenceExtension.image.name` | Name of the container image used for the endpoint picker. |
7 changes: 6 additions & 1 deletion config/charts/inferencepool/templates/epp-deployment.yaml
@@ -35,6 +35,12 @@ spec:
- "9003"
- -metricsPort
- "9090"
{{- if eq (.Values.inferencePool.modelServerType | default "vllm") "triton-tensorrt-llm" }}
- -totalQueuedRequestsMetric
- "nv_trt_llm_request_metrics{request_type=waiting}"
- -kvCacheUsagePercentageMetric
- "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}"
{{- end }}
env:
- name: USE_STREAMING
value: "true"
@@ -57,4 +63,3 @@ spec:
service: inference-extension
initialDelaySeconds: 5
periodSeconds: 10

1 change: 1 addition & 0 deletions config/charts/inferencepool/values.yaml
@@ -9,6 +9,7 @@ inferenceExtension:

inferencePool:
targetPortNumber: 8000
modelServerType: vllm # vllm, triton-tensorrt-llm
# modelServers: # REQUIRED
# matchLabels:
# app: vllm-llama3-8b-instruct
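
Since the EPP deployment template falls back to `vllm` via `| default "vllm"`, a values override only needs to set the REQUIRED `modelServers` selector when running vLLM. A minimal sketch, reusing the illustrative `app` label from the commented block above:

```yaml
# Minimal override for the default vLLM case (sketch); modelServers is the only REQUIRED block
inferencePool:
  modelServers:
    matchLabels:
      app: vllm-llama3-8b-instruct
```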
4 changes: 3 additions & 1 deletion mkdocs.yml
@@ -54,7 +54,9 @@ nav:
- API Overview: concepts/api-overview.md
- Conformance: concepts/conformance.md
- Roles and Personas: concepts/roles-and-personas.md
- Implementations: implementations.md
- Implementations:
- Gateways: implementations/gateways.md
- Model Servers: implementations/model-servers.md
- FAQ: faq.md
- Guides:
- User Guides:
@@ -1,4 +1,4 @@
# Implementations
# Gateway Implementations

This project has several implementations that are planned or in progress:

36 changes: 36 additions & 0 deletions site-src/implementations/model-servers.md
@@ -0,0 +1,36 @@


# Supported Model Servers

Any model server that conforms to the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol) is supported by the inference extension.

## Compatible Model Server Versions

| Model Server | Version | Commit | Notes |
| -------------------- | ---------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- |
| vLLM V0 | v0.6.4 and above | [commit 0ad216f](https://github.com/vllm-project/vllm/commit/0ad216f5750742115c686723bf38698372d483fd) | |
| vLLM V1 | v0.8.0 and above | [commit bc32bc7](https://github.com/vllm-project/vllm/commit/bc32bc73aad076849ac88565cff745b01b17d89c) | |
| Triton (TensorRT-LLM) | [25.03](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-03.html#rel-25-03) and above | [commit 15cb989](https://github.com/triton-inference-server/tensorrtllm_backend/commit/15cb989b00523d8e92dce5165b9b9846c047a70d) | The LoRA affinity feature is not available because the required LoRA metrics haven't been implemented in Triton yet. |

## vLLM

vLLM is configured as the default in the [endpoint picker extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp). No further configuration is required.

## Triton with TensorRT-LLM Backend

Triton-specific metric names need to be specified when starting the EPP.

### Option 1: Use Helm

Use `--set inferencePool.modelServerType=triton-tensorrt-llm` to install the [`inferencepool` chart](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/charts/inferencepool). See the [`inferencepool` chart doc](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/charts/inferencepool/README.md) for more details.

### Option 2: Edit the EPP deployment YAML

Add the following to the `args` of the [EPP deployment](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/manifests/inferencepool-resources.yaml#L32):

```yaml
- -totalQueuedRequestsMetric
- "nv_trt_llm_request_metrics{request_type=waiting}"
- -kvCacheUsagePercentageMetric
- "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}"
```
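
For orientation, these flags sit next to the EPP's existing args in the deployment's container spec. A placement sketch, assuming the flag values from the chart's EPP deployment template earlier in this diff (the container name is illustrative):

```yaml
# Placement sketch: Triton metric flags appended to the EPP container args (container name illustrative)
containers:
  - name: epp
    args:
      - -metricsPort
      - "9090"
      - -totalQueuedRequestsMetric
      - "nv_trt_llm_request_metrics{request_type=waiting}"
      - -kvCacheUsagePercentageMetric
      - "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}"
```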