Conversation

@SinaChavoshi

This PR adds the gke_inference_gateway_standard_cluster example, which follows the latest GA (General Availability) guide for the GKE Inference Gateway. The configuration has been modernized to reflect current best practices and resource definitions.

@SinaChavoshi requested review from a team, apeabody and ericyz as code owners on September 16, 2025 19:12
@apeabody
Collaborator

/gemini review

@apeabody
Collaborator

Hi @SinaChavoshi - Is this still WIP?


@gemini-code-assist bot left a comment

Code Review

This pull request introduces a new example for deploying the GKE Inference Gateway on a standard cluster. The configuration is comprehensive and uses modern practices. My review identified a few areas for improvement, primarily concerning correctness and maintainability. Key findings include a resource allocation issue that would prevent some pods from scheduling, a fragile method for applying Kubernetes CRDs that depends on local tooling, and a missing provider definition that would cause Terraform to fail. I've provided specific suggestions to address these points and enhance the robustness and clarity of the example.

Comment on lines +310 to +315
resource "null_resource" "apply_crds" {
provisioner "local-exec" {
command = "kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.0/manifests.yaml"
}
depends_on = [module.gke]
}

critical

Using local-exec to apply Kubernetes manifests is fragile as it depends on kubectl being installed and configured on the machine running Terraform. The README instructs users to configure kubectl after terraform apply, which creates a chicken-and-egg problem. A more robust, Terraform-native approach is to use the http data source to fetch the manifest and the kubernetes_manifest resource to apply it. This ensures the process is self-contained within Terraform's execution.

data "http" "inference_gateway_crds" {
  url = "https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.0/manifests.yaml"
}

resource "kubernetes_manifest" "apply_crds" {
  for_each   = { for i, v in split("\n---\n", data.http.inference_gateway_crds.body) : i => v if trimspace(v) != "" }
  manifest   = yamldecode(each.value)
  depends_on = [module.gke]
}

Collaborator

Hi @SinaChavoshi - I would recommend using the kubectl module, which handles installation, auth, etc.:

module "kubectl" {
  source = "terraform-google-modules/gcloud/google//modules/kubectl-wrapper"

  project_id              = var.project_id
  cluster_name            = var.cluster_name
  cluster_location        = var.cluster_location
  kubectl_create_command  = "kubectl create deploy nginx --image=nginx"
  kubectl_destroy_command = "kubectl delete deploy nginx"
}
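
Adapted to this example's CRD step, that could look something like the sketch below (the manifest URL is the one from the local-exec command above; the module.gke output names assume the terraform-google-modules/kubernetes-engine module):

module "apply_inference_crds" {
  source = "terraform-google-modules/gcloud/google//modules/kubectl-wrapper"

  project_id       = var.project_id
  cluster_name     = module.gke.name
  cluster_location = module.gke.location

  # Referencing module.gke outputs also gives Terraform an implicit
  # dependency on cluster creation before kubectl runs.
  kubectl_create_command  = "kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.0/manifests.yaml"
  kubectl_destroy_command = "kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.0/manifests.yaml"
}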

name = "healthCheckPolicy.create"
value = "false"
}
depends_on = [kubernetes_deployment.vllm, null_resource.apply_crds]

critical

Since null_resource.apply_crds is being replaced by kubernetes_manifest.apply_crds, this depends_on needs to be updated to reflect the new resource.

  depends_on = [kubernetes_deployment.vllm, kubernetes_manifest.apply_crds]

Comment on lines +19 to +28
required_providers {
  google = {
    source = "hashicorp/google"
  }
  google-beta = {
    source = "hashicorp/google-beta"
  }
  kubernetes = {
    source = "hashicorp/kubernetes"
  }

critical

The helm provider is used in main.tf but it is not declared in the required_providers block. This will cause terraform init to fail. Please add the helm provider to this block.

  required_providers {
    google = {
      source = "hashicorp/google"
    }
    google-beta = {
      source = "hashicorp/google-beta"
    }
    kubernetes = {
      source = "hashicorp/kubernetes"
    }
    helm = {
      source = "hashicorp/helm"
    }
  }

name = "vllm-llama3.1-8b-instruct"
}
spec {
replicas = 3

high

The vllm deployment is configured with 3 replicas, but the gpupool node pool is configured with only one node (node_count = 1). The a3-highgpu-2g machine type has 2 GPUs, and each pod requests 1 GPU. This means only 2 of the 3 replicas can be scheduled, leaving one pod in a Pending state. To fix this, you should either increase the node_count to 2 in the node pool definition, or reduce the number of replicas to 2 to match the available GPUs on a single node.

    replicas = 2
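
Alternatively, the fix on the node pool side (a sketch; the entry shape assumes the terraform-google-modules/kubernetes-engine module's node_pools list-of-maps format, with the pool's other attributes unchanged):

node_pools = [
  {
    name         = "gpupool"
    machine_type = "a3-highgpu-2g"
    # Two nodes x 2 GPUs each can schedule all 3 replicas at 1 GPU per pod.
    node_count   = 2
  },
]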

"model": "food-review",
"prompt": "What is a good recipe for a chicken curry?",
"max_tokens": 100,
"temperature": "0.7"

medium

The temperature parameter in the JSON payload is specified as a string ("0.7"). While some servers might be lenient, the OpenAI API specification (which vLLM aims to be compatible with) expects this to be a number. It's better to provide it as a numeric value for correctness and broader compatibility.

Suggested change
- "temperature": "0.7"
+ "temperature": 0.7

Comment on lines +107 to +110
guest_accelerator = {
  type  = "nvidia-h100-80gb"
  count = 2
}

medium

For accelerator-optimized machine types like a3-highgpu-2g, the accelerator type and count are inferred from the machine type. This guest_accelerator block is redundant and can be removed for clarity and to rely on the machine type's implicit configuration.
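
With the block removed, only the machine type is needed (a sketch of the remaining line; surrounding attributes unchanged):

# The 2x nvidia-h100-80gb accelerators are implied by the machine type, per the comment above.
machine_type = "a3-highgpu-2g"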

enable_service_links = false
container {
  name  = "vllm"
  image = "vllm/vllm-openai:latest"

medium

Using the :latest tag for container images is not recommended as it can lead to unpredictable behavior and makes deployments difficult to reproduce. It's a best practice to pin images to a specific, immutable version tag (e.g., v0.5.1).
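
For example (the version below is illustrative only; pin to whichever release you have validated):

image = "vllm/vllm-openai:v0.5.1"  # example pinned tag, not a specific recommendation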

name = "HUGGING_FACE_HUB_TOKEN"
value_from {
secret_key_ref {
name = kubernetes_secret.hf_secret.metadata[0].name

medium

Accessing metadata via metadata[0] is unnecessary and less readable. The Terraform Kubernetes provider schema defines metadata as a single block, not a list. You can access its attributes directly. Please use kubernetes_secret.hf_secret.metadata.name instead. This applies to other resource references in this file as well (e.g., lines 299, 341, 345, 355, 368, 380).

                name = kubernetes_secret.hf_secret.metadata.name

}
container {
  name  = "lora-adapter-syncer"
  image = "us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main"

medium

Using a floating tag like :main is not recommended for the same reasons as using :latest. It can cause unexpected changes and break reproducibility. Please use a specific release tag or commit SHA if available.
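
For example (the tag below is hypothetical; substitute an actual release tag or commit SHA published for the lora-syncer image):

image = "us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:v1.0.0"  # hypothetical pinned tag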

}

output "addons_config" {
description = "The configuration for addons supported by GKE Autopilot."

medium

The description for this output mentions GKE Autopilot, but this example provisions a GKE Standard cluster. The description should be updated to be accurate.

  description = "The configuration for addons supported by GKE."
