[Bug] Some fixes and details required to deploy successfully  #237

@mnichols

Description
Temporal Worker Controller — Adoption Pain Points

Written from the perspective of an engineer who has shipped it to production.
This is not a setup guide — it is a record of what the documentation omits,
what required source-code research, and where the happy path has gaps that
will cost you time.


1. CRDs are not in the published Helm chart

What the docs say: Install the controller with Helm.

What actually happens: The OCI chart at
oci://docker.io/temporalio/temporal-worker-controller installs the controller
pod but ships with an empty crds/ directory. There is also no separately
published temporal-worker-controller-crds chart — that reference returns a
404. The CRDs exist in the GitHub repository under
helm/temporal-worker-controller/crds/ but were added to main after the
v1.3.1 tag, so they are not reachable via any published artifact.

Resolution: Apply CRDs directly from raw GitHub at the release tag:

CRDS_BASE="https://raw.githubusercontent.com/temporalio/temporal-worker-controller/v1.3.1/helm/temporal-worker-controller/crds"
kubectl apply -f "${CRDS_BASE}/temporal.io_temporalconnections.yaml"
kubectl apply -f "${CRDS_BASE}/temporal.io_temporalworkerdeployments.yaml"

Impact: Anyone following the docs blindly installs a controller that cannot
function because the CRDs it manages do not exist. The controller starts
without error; the failure only surfaces when you attempt to apply a
TemporalWorkerDeployment and get "no matches for kind".


2. TemporalConnection uses hostPort, not address

What the docs show: Minimal examples that omit the connection spec entirely,
or show a field named address.

What actually happens: The CRD schema uses strict decoding. Supplying
spec.address causes an immediate validation error:

strict decoding error: unknown field "spec.address"

Resolution: The correct field is spec.hostPort:

spec:
  hostPort: us-east-1.aws.api.temporal.io:7233

Impact: The field name mismatch is not surfaced until you apply the resource.
There is no admission webhook warning, no suggested correction — just a hard
rejection. The correct name requires reading the CRD schema directly.
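For reference, a complete manifest might look like the sketch below. The
apiVersion and metadata values are assumptions (the temporal.io group is
confirmed by the CRD file names, but the version should be verified against
the CRD you applied):

```yaml
# Hedged sketch — confirm apiVersion against the installed CRD.
apiVersion: temporal.io/v1alpha1
kind: TemporalConnection
metadata:
  name: temporal-connection
  namespace: workers        # same namespace as the TemporalWorkerDeployment
spec:
  hostPort: us-east-1.aws.api.temporal.io:7233   # note: hostPort, not address
```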


3. TemporalConnection is for the controller, not the worker pod

What is confusing: The TemporalConnection sits in the same namespace as
your worker pods and references the same Temporal service. It looks like it
should be the worker's connection config, but it is not.

What it actually is: The controller's own connection to the Temporal API —
used to register build-ids, query worker deployment state, and manage traffic
ramp. The worker pod connects to Temporal independently using its own
credentials (mounted as a volume or env var by your application).

Impact: This distinction matters for secret management, and the minimum API key
privileges should be explicitly stated (against Cloud): do I need Global Admin?
Developer? etc. The controller needs an API key with worker deployment
management permissions (describe/register worker deployments). A standard
worker service account key — with only worker connection permissions — will
authenticate successfully for the Spring app but may fail (?) with Request
unauthorized when the controller calls the Worker Deployment API. You need a
separate, higher-privilege key for the TemporalConnection.
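One way to keep the two credentials cleanly separated is two distinct Secrets,
so the elevated key never reaches the worker pods. Secret names and key fields
here are hypothetical:

```yaml
# Illustrative only — names and data keys are made up for this sketch.
apiVersion: v1
kind: Secret
metadata:
  name: temporal-controller-api-key   # elevated: worker deployment management
stringData:
  apiKey: <controller-key>
---
apiVersion: v1
kind: Secret
metadata:
  name: temporal-worker-api-key       # standard: worker connection only
stringData:
  apiKey: <worker-key>
```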


4. spec.sunset is required despite being schema-optional

What the docs say: Nothing explicit. The field appears optional in schema
references.

What actually happens: Applying a TemporalWorkerDeployment without
sunset results in:

spec.sunset: Required value

Resolution: Always include it:

sunset:
  scaledownDelay: 30s
  deleteDelay: 120s

Impact: Silent until apply time. The schema does not communicate this
requirement, so linters and IDE validation will not catch it.


5. rampPercentage must be strictly increasing and capped at 99, not 100

What you would naturally try: A final step of rampPercentage: 100 to
complete the rollout.

What actually happens: The controller rejects values outside 1–99 and
requires each step to be strictly greater than the previous.

Resolution: The final ramp to 100% is implicit — the controller completes
the rollout after the last step's pauseDuration elapses. A valid two-step
demo config:

steps:
  - rampPercentage: 50
    pauseDuration: 30s
  - rampPercentage: 90
    pauseDuration: 30s

Impact: Both the 100% ceiling and the strictly-increasing constraint are
enforced at reconcile time with a validation error, not at apply time. Your
resource is accepted and then the controller immediately stops reconciling it,
leaving workers unscheduled with no pods created and no obvious indication of
why.


6. Build-id derivation is not documented — requires source code research

The question: What is the relationship between the image tag and the
Temporal build-id? Does using :latest work?

What the docs say: Nothing concrete.

What the source code shows (internal/k8s/deployments.go,
ComputeBuildID()): The build-id is computed as:

<image-tag>-<hash-of-pod-template-spec>

The image tag is extracted from the image reference. If the tag is :latest,
the pod template spec hash does not change between deploys (the spec is
identical), so the computed build-id is identical, and Temporal does not see a
new version.

Resolution: Always use explicit version tags (:v1, :v2, etc.). The
version bump trigger is the image tag change — that is the entire deployment
primitive.
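The derivation can be sketched in shell. This is an illustration of the idea
only — the variable names and the 8-character truncation are made up, not the
controller's actual hashing:

```shell
# Sketch of build-id = <image-tag>-<hash-of-pod-template-spec>.
spec_v2='containers: [{name: worker, image: registry.example/worker:v2}]'
spec_latest='containers: [{name: worker, image: registry.example/worker:latest}]'

build_id() {  # $1 = image tag, $2 = pod template spec
  printf '%s-%s' "$1" "$(printf '%s' "$2" | sha256sum | cut -c1-8)"
}

build_id v2 "$spec_v2"; echo          # new tag -> new spec -> new build-id
build_id latest "$spec_latest"; echo  # identical on every deploy: no new version
```

With :latest, both inputs to the function are byte-identical between deploys,
so the output never changes.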

Impact: Using :latest silently produces the same build-id on every
deploy. No error is raised. The controller reconciles successfully. Workers
start. Nothing in Temporal changes. This would be very difficult to diagnose
without reading the source.


7. workerOptions is required for Temporal Cloud but absent from minimal examples

What the docs show: A minimal TemporalWorkerDeployment with no
workerOptions.

What actually happens: Without workerOptions, the controller does not
know which TemporalConnection to use or which Temporal namespace to target.
Workers may start but the controller cannot register build-ids or manage
traffic, and the TemporalConnectionHealthy condition stays false.

Resolution:

workerOptions:
  connectionRef:
    name: temporal-connection
  temporalNamespace: fde-oms-processing.sdvdw

Impact: The minimal example works in a self-contained local setup where the
controller can infer defaults. Against Temporal Cloud it silently fails.


8. Kustomize does not update configmap/secret name references inside CRDs

What Kustomize does for standard resources: When configmaps or secrets are
created with configMapGenerator / secretGenerator, Kustomize appends a
content hash to their names and automatically updates all references in
Deployments, StatefulSets, etc.

What Kustomize does for CRDs: Nothing. The TemporalWorkerDeployment
pod template references temporal-oms-config by name. Kustomize generates
temporal-oms-config-fdtkdc8kkc. The CRD keeps the bare name. Pods fail with
CreateContainerConfigError: configmap "temporal-oms-config" not found.

Resolution: Set disableNameSuffixHash: true in generatorOptions so
configmap/secret names remain stable and CRDs can reference them directly.
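Concretely, the kustomization.yaml stanza looks like this (the configmap
contents are illustrative; only the generatorOptions block is the fix):

```yaml
# kustomization.yaml — keep generated names stable so the CRD's pod template
# can reference them by bare name.
generatorOptions:
  disableNameSuffixHash: true

configMapGenerator:
  - name: temporal-oms-config
    literals:
      - TEMPORAL_NAMESPACE=example   # hypothetical contents
```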

Impact: Completely silent until pod scheduling. The kustomize build
succeeds, the apply succeeds, the TemporalWorkerDeployment is accepted, the
controller creates pods — and then they fail to start. This requires knowing to
look at pod events, not controller status.


9. The controller installs two replicas by default with leader election

Observation: After a standard helm install, two controller pods appear in
temporal-worker-controller-system. This is correct — the chart deploys two
replicas with leader election for HA. Both pods show 2/2 Running (manager +
sidecar containers).

Why it matters: Controller logs are split across both pods. When debugging,
kubectl logs deployment/temporal-worker-controller-manager picks one pod
arbitrarily. If the active leader is on the other pod, you may see no relevant
log output. Use -l app=temporal-worker-controller and check both.


Summary: What requires source code research vs. documentation

| Finding | Source |
| --- | --- |
| CRDs not in published chart | Trial and error + GitHub inspection |
| hostPort vs address field name | CRD schema (not docs) |
| sunset is required | Trial and error |
| rampPercentage 1–99, strictly increasing | Trial and error |
| Build-id derivation from image tag | Controller Go source (ComputeBuildID) |
| :latest produces no new version | Controller Go source |
| workerOptions required for cloud | Trial and error |
| Kustomize CRD name hash mismatch | Kustomize internals knowledge |
| TemporalConnection is controller-only, needs elevated permissions | Conceptual gap in docs |
