[WIP] GKE Inference Gateway example #2448
base: main
Conversation
/gemini review

Hi @SinaChavoshi - Is this still WIP?
Code Review
This pull request introduces a new example for deploying the GKE Inference Gateway on a standard cluster. The configuration is comprehensive and utilizes modern practices. My review has identified a few areas for improvement, primarily concerning correctness and maintainability. Key findings include a resource allocation issue that would prevent some pods from scheduling, a fragile method for applying Kubernetes CRDs that depends on local tooling, and a missing provider definition that would cause Terraform to fail. I've provided specific suggestions to address these points and enhance the robustness and clarity of the example.
resource "null_resource" "apply_crds" {
  provisioner "local-exec" {
    command = "kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.0/manifests.yaml"
  }

  depends_on = [module.gke]
}
Using local-exec to apply Kubernetes manifests is fragile as it depends on kubectl being installed and configured on the machine running Terraform. The README instructs users to configure kubectl after terraform apply, which creates a chicken-and-egg problem. A more robust, Terraform-native approach is to use the http data source to fetch the manifest and the kubernetes_manifest resource to apply it. This ensures the process is self-contained within Terraform's execution.
data "http" "inference_gateway_crds" {
url = "https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.0/manifests.yaml"
}
resource "kubernetes_manifest" "apply_crds" {
for_each = { for i, v in split("\n---\n", data.http.inference_gateway_crds.body) : i => v if trimspace(v) != "" }
manifest = yamldecode(each.value)
depends_on = [module.gke]
}
Hi @SinaChavoshi - I would recommend using the kubectl module, which handles installation, auth, etc.:
module "kubectl" {
source = "terraform-google-modules/gcloud/google//modules/kubectl-wrapper"
project_id = var.project_id
cluster_name = var.cluster_name
cluster_location = var.cluster_location
kubectl_create_command = "kubectl create deploy nginx --image=nginx"
kubectl_destroy_command = "kubectl delete deploy nginx"
}
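For this example, the placeholder nginx commands could be swapped for the inference extension manifest itself. A sketch follows, under the assumption that the GKE module exposes name and location outputs:

module "apply_inference_crds" {
  source = "terraform-google-modules/gcloud/google//modules/kubectl-wrapper"

  project_id       = var.project_id
  cluster_name     = module.gke.name
  cluster_location = module.gke.location

  # Apply/remove the Gateway API inference extension CRDs from the upstream release.
  kubectl_create_command  = "kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.0/manifests.yaml"
  kubectl_destroy_command = "kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.0/manifests.yaml"
}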
| name = "healthCheckPolicy.create" | ||
| value = "false" | ||
| } | ||
| depends_on = [kubernetes_deployment.vllm, null_resource.apply_crds] |
  required_providers {
    google = {
      source = "hashicorp/google"
    }
    google-beta = {
      source = "hashicorp/google-beta"
    }
    kubernetes = {
      source = "hashicorp/kubernetes"
    }
The helm provider is used in main.tf but is not declared in the required_providers block, so Terraform falls back to an implicit hashicorp/helm source. Please add the helm provider to this block so the source is pinned explicitly alongside the other providers.
required_providers {
  google = {
    source = "hashicorp/google"
  }
  google-beta = {
    source = "hashicorp/google-beta"
  }
  kubernetes = {
    source = "hashicorp/kubernetes"
  }
  helm = {
    source = "hashicorp/helm"
  }
}
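Since the helm provider also needs credentials for the new cluster, its configuration could authenticate against the cluster directly. A minimal sketch, assuming the helm provider 2.x kubernetes block syntax and that the GKE module exposes endpoint and ca_certificate outputs:

data "google_client_config" "default" {}

provider "helm" {
  kubernetes {
    host                   = "https://${module.gke.endpoint}"
    token                  = data.google_client_config.default.access_token
    cluster_ca_certificate = base64decode(module.gke.ca_certificate)
  }
}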
| name = "vllm-llama3.1-8b-instruct" | ||
| } | ||
| spec { | ||
| replicas = 3 |
The vllm deployment is configured with 3 replicas, but the gpupool node pool is configured with only one node (node_count = 1). The a3-highgpu-2g machine type has 2 GPUs, and each pod requests 1 GPU. This means only 2 of the 3 replicas can be scheduled, leaving one pod in a Pending state. To fix this, you should either increase the node_count to 2 in the node pool definition, or reduce the number of replicas to 2 to match the available GPUs on a single node.
replicas = 2
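The other fix mentioned above, scaling the GPU pool to two nodes, would look roughly like the following. This is an illustrative sketch using a standalone google_container_node_pool resource; the PR defines its pool through a module, so the resource name and cluster reference here are assumptions.

resource "google_container_node_pool" "gpupool" {
  name       = "gpupool"
  cluster    = module.gke.cluster_id
  node_count = 2 # 2 nodes x 2 H100s each = 4 GPUs, enough for 3 replicas requesting 1 GPU

  node_config {
    machine_type = "a3-highgpu-2g"

    guest_accelerator {
      type  = "nvidia-h100-80gb"
      count = 2
    }
  }
}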
| "model": "food-review", | ||
| "prompt": "What is a good recipe for a chicken curry?", | ||
| "max_tokens": 100, | ||
| "temperature": "0.7" |
The temperature parameter in the JSON payload is specified as a string ("0.7"). While some servers might be lenient, the OpenAI API specification (which vLLM aims to be compatible with) expects this to be a number. It's better to provide it as a numeric value for correctness and broader compatibility.
| "temperature": "0.7" | |
| "temperature": 0.7 |
  guest_accelerator = {
    type  = "nvidia-h100-80gb"
    count = 2
  }
  enable_service_links = false

  container {
    name  = "vllm"
    image = "vllm/vllm-openai:latest"
| name = "HUGGING_FACE_HUB_TOKEN" | ||
| value_from { | ||
| secret_key_ref { | ||
| name = kubernetes_secret.hf_secret.metadata[0].name |
Accessing metadata via metadata[0] is unnecessary and less readable. The Terraform Kubernetes provider schema defines metadata as a single block, not a list. You can access its attributes directly. Please use kubernetes_secret.hf_secret.metadata.name instead. This applies to other resource references in this file as well (e.g., lines 299, 341, 345, 355, 368, 380).
name = kubernetes_secret.hf_secret.metadata.name
  }

  container {
    name  = "lora-adapter-syncer"
    image = "us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main"
}

output "addons_config" {
  description = "The configuration for addons supported by GKE Autopilot."
This PR adds the gke_inference_gateway_standard_cluster example, which follows the latest GA (General Availability) guide for the GKE Inference Gateway. The configuration has been modernized to reflect current best practices and resource definitions.