Conversation

@jianzhangbjz
Member

@jianzhangbjz jianzhangbjz commented Dec 4, 2025

Description

Make deployments HA-ready with configurable replica count

Problem

The current deployment configuration creates a deadlock during node maintenance operations:

  • Both operator-controller-controller-manager and catalogd-controller-manager have 1 replica (hardcoded)
  • Both have PodDisruptionBudgets with minAvailable: 1
  • When the single pod runs on a node being drained, eviction violates the PDB
  • This prevents node drain operations from completing

Impact: Node maintenance operations (upgrades, scaling, etc.) fail with:
Cannot evict pod as it would violate the pod's disruption budget.

Root cause: With only 1 replica and minAvailable: 1, there's no way to safely evict the pod while maintaining availability.

Solution

This PR makes the replica count configurable and sets a sensible default of 2 replicas:

  1. Added replicas configuration to Helm values:

    • options.operatorController.deployment.replicas: 2
    • options.catalogd.deployment.replicas: 2
  2. Updated Helm templates to use the configurable value instead of hardcoded 1

  3. Regenerated all manifests with the new defaults

Benefits

  • Enables safe node maintenance: With 2 replicas, one pod can be evicted while maintaining minAvailable: 1
  • Zero-downtime updates: RollingUpdate with 2 replicas ensures continuous service availability
  • Improved fault tolerance: System continues operating if one pod fails
  • Production-ready defaults: Aligns with HA best practices for control plane components
  • Flexible configuration: Replica count can be customized via Helm values

Compatibility

Existing deployments: When this change is applied, deployments will scale from 1 to 2 replicas:

  • RollingUpdate strategy ensures zero downtime during scale-up
  • maxSurge: 1 allows temporary 3rd pod during transition
  • maxUnavailable: 0 prevents service interruption
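As an illustration only (not the repo's actual Helm template code), here is a minimal Go sketch, using the upstream Kubernetes API types, of what the rollout settings described above amount to:

package example

import (
    appsv1 "k8s.io/api/apps/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
    "k8s.io/utils/ptr"
)

// rolloutSettings mirrors the defaults described above: 2 replicas and a RollingUpdate
// strategy with maxSurge: 1 and maxUnavailable: 0, so one extra pod may appear briefly
// during a rollout while at least 2 pods stay available.
func rolloutSettings() (int32, appsv1.DeploymentStrategy) {
    replicas := int32(2)
    return replicas, appsv1.DeploymentStrategy{
        Type: appsv1.RollingUpdateDeploymentStrategyType,
        RollingUpdate: &appsv1.RollingUpdateDeployment{
            MaxSurge:       ptr.To(intstr.FromInt(1)),
            MaxUnavailable: ptr.To(intstr.FromInt(0)),
        },
    }
}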

Custom deployments: Users can override the default by setting:

options:
  operatorController:
    deployment:
      replicas: <custom-value>
  catalogd:
    deployment:
      replicas: <custom-value>

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

Assisted-By: Claude Code

@jianzhangbjz jianzhangbjz requested a review from a team as a code owner December 4, 2025 03:09
@netlify

netlify bot commented Dec 4, 2025

Deploy Preview for olmv1 ready!

🔨 Latest commit: a79efe1
🔍 Latest deploy log: https://app.netlify.com/projects/olmv1/deploys/693a26487bfbca000764c40a
😎 Deploy Preview: https://deploy-preview-2371--olmv1.netlify.app

@jianzhangbjz
Member Author

/retest

@jianzhangbjz
Member Author

Encountered this error again.

=== RUN   TestClusterExtensionInstallRegistry/no_registry_configuration_necessary
    cluster_extension_install_test.go:57: When a cluster extension is installed from a catalog
    cluster_extension_install_test.go:58: When the extension bundle format is registry+v1
    helpers.go:263: Ensuring ClusterCatalog has Status.Condition of Progressing with a status == True and reason == Succeeded
    helpers.go:273: Checking that catalog has the expected metadata label
    helpers.go:278: Ensuring ClusterCatalog has Status.Condition of Type = Serving with status == True
    cluster_extension_install_test.go:79: It resolves the specified package with correct bundle path
    cluster_extension_install_test.go:80: By creating the ClusterExtension resource
    cluster_extension_install_test.go:83: By eventually reporting a successful resolution and bundle path
    cluster_extension_install_test.go:88: By eventually reporting progressing as True
    cluster_extension_install_test.go:89: 
        	Error Trace:	/home/runner/work/operator-controller/operator-controller/test/e2e/cluster_extension_install_test.go:94
        	            				/opt/hostedtoolcache/go/1.24.6/x64/src/runtime/asm_amd64.s:1700
        	Error:      	Not equal: 
        	            	expected: "Succeeded"
        	            	actual  : "Retrying"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-Succeeded
        	            	+Retrying
    cluster_extension_install_test.go:89: 
        	Error Trace:	/home/runner/work/operator-controller/operator-controller/test/e2e/cluster_extension_install_test.go:89
        	Error:      	Condition never satisfied
        	Test:       	TestClusterExtensionInstallRegistry/no_registry_configuration_necessary
Failed to list cluster extensions: no matches for kind "ClusterExtensionRevision" in version "olm.operatorframework.io/v1"
    helpers.go:321: By deleting ClusterCatalog "test-catalog-jnwbqjlw"
    helpers.go:330: By deleting ClusterExtension "clusterextension-fqz4fls8"
    helpers.go:294: By waiting for CustomResourceDefinitions of "clusterextension-fqz4fls8" to be deleted
    helpers.go:302: By waiting for ClusterRoleBindings of "clusterextension-fqz4fls8" to be deleted
    helpers.go:310: By waiting for ClusterRoles of "clusterextension-fqz4fls8" to be deleted
    helpers.go:340: By deleting ServiceAccount "clusterextension-fqz4fls8"
    helpers.go:349: By deleting Namespace "clusterextension-fqz4fls8"

@tmshort
Contributor

tmshort commented Dec 4, 2025

Changing the number of replicas directly impacts the expectations of our e2es. Perhaps default it to 1 or update the tests with some additional logic.

Allowing us to configure the number of replicas and other values will give us better downstream control.

@jianzhangbjz jianzhangbjz force-pushed the pdb2 branch 3 times, most recently from fa1b4f7 to 89b96d0 Compare December 4, 2025 07:23
@jianzhangbjz jianzhangbjz changed the title 🐛 makes the replica count **configurable** and sets a sensible default of **2 replicas** 🐛 makes the replica count configurable and update upgrade-e2e test cases Dec 4, 2025
@jianzhangbjz jianzhangbjz force-pushed the pdb2 branch 5 times, most recently from b03ff44 to 5403227 Compare December 4, 2025 09:13
@codecov

codecov bot commented Dec 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.83%. Comparing base (c85bc55) to head (36c34a3).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2371      +/-   ##
==========================================
+ Coverage   70.61%   74.83%   +4.21%     
==========================================
  Files          95       95              
  Lines        7375     7323      -52     
==========================================
+ Hits         5208     5480     +272     
+ Misses       1728     1409     -319     
+ Partials      439      434       -5     
Flag Coverage Δ
e2e 44.99% <ø> (+0.31%) ⬆️
experimental-e2e 49.57% <ø> (+35.22%) ⬆️
unit 58.82% <ø> (+0.37%) ⬆️


@jianzhangbjz jianzhangbjz force-pushed the pdb2 branch 4 times, most recently from da978dd to 1ba8c7e Compare December 4, 2025 10:56
@jianzhangbjz
Member Author

It failed again at the image-registry step; more details: https://github.com/operator-framework/operator-controller/actions/runs/19926553979/job/57127694168?pr=2371

./testdata/build-test-registry.sh operator-controller-e2e docker-registry localhost/e2e-test-registry:devel
namespace/operator-controller-e2e created
certificate.cert-manager.io/operator-controller-e2e-registry created
deployment.apps/docker-registry created
service/docker-registry created
error: timed out waiting for the condition on deployments/docker-registry
make: *** [Makefile:267: image-registry] Error 1

@jianzhangbjz jianzhangbjz force-pushed the pdb2 branch 3 times, most recently from b7d4b4d to 90b470a Compare December 4, 2025 11:59
@jianzhangbjz
Member Author

Hi @tmshort, I have updated all the related e2e test cases. Could you help review them? Thanks!

Contributor

@pedjak pedjak left a comment

Our upstream e2e tests do not need an HA setup to run; please keep them as lean as they were before this PR.

type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
Contributor

From an operational point of view this is a no-op change; please revert it.

Contributor

Updating the comment makes sense, especially since the number of pods is based on deployment configuration. But I'd suggest something else.

Member Author

The comment is necessary so that others can understand the reason.

type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
Contributor

same as above.

enabled: true
deployment:
image: quay.io/operator-framework/operator-controller:devel
replicas: 2
Contributor

@pedjak pedjak Dec 4, 2025

I think the default should remain 1, and set podDisruptionBudget.enabled to false as default as well. >1 is required for HA setups. We should not implicitly enforce after an upgrade that OLM suddenly consumes more cluster resources (i.e. pods). Such things should not be hidden from users.

Contributor

I generally agree with @pedjak here; we should be defaulting to 1 upstream, but allow the configuration to be updated downstream. However, because we are allowing changes here, we will need to ensure that the e2e tests can still be run upstream and downstream.

Contributor

Thinking about it more, maybe we want to use 1 for the default and put 2 in experimental? That way, we can test both values.

Member Author

@jianzhangbjz jianzhangbjz Dec 5, 2025

The community user will also encounter a similar upgrade error (OCPBUGS-62517) if we disable the PDB by default.

Member Author

In my opinion, we should maintain consistency between the upstream and downstream projects. This is also one of the core principles of open source: fostering alignment, collaboration, and transparency across the entire ecosystem. When upstream and downstream diverge without clear justification, it creates unnecessary complexity, increases maintenance overhead, and makes it harder for users and contributors to understand or adopt the project.

By keeping the two sides consistent, we ensure that new features, bug fixes, and improvements flow smoothly in both directions. It also strengthens long-term sustainability, because the community can rely on a unified architecture and shared expectations. Ultimately, this consistency reinforces the value of open source itself—working together toward a common, coherent, and maintainable solution.

Contributor

The community user will also encounter a similar upgrade error

As usual, it depends. IMHO, this is a downstream bug, because the downstream has certain expectations about component availability during upgrade. If the OLM deployment has just a single pod, it is perfectly normal for the deployment to become briefly unavailable during an upgrade, and many upstream users might be OK with that, especially because OLM is a controller that performs reconciliations asynchronously (i.e. eventually). All ClusterExtensions that get created/modified during an OLM upgrade are not lost and will be reconciled, hence no service interruption happens at all.

Having said that, it is also perfectly fine to add knobs upstream so that the number of pods and the PDBs can be configured/added for users that have stronger requirements, but we should not enforce them, especially if that means that running OLM suddenly requires twice as many resources.

Contributor

I would still want this to be 1 for the standard setup (and tests).

We can set this to 2 in experimental.yaml to also test the multi-replica e2e without forcing the default user to have 2 replicas.

enabled: true
deployment:
image: quay.io/operator-framework/catalogd:devel
replicas: 2
Contributor

same as above.

type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
Contributor

same comment as earlier.

artifactName = "operator-controller-e2e"
pollDuration = time.Minute
artifactName = "operator-controller-e2e"
// pollDuration is set to 3 minutes to account for leader election time in multi-replica deployments.
Contributor

Our e2e tests do not need to run in an HA setup; having just a single replica should be fine. If additional HA tests are needed, they should be provided downstream.

Contributor

However, the upstream e2e's do run downstream too, and we need to ensure they can work there as well.

Contributor

@pedjak pedjak Dec 5, 2025

Sure, but if I am not mistaken, OCPBUGS-62517 reports a downstream test failing, not an upstream one?

Contributor

@tmshort tmshort Dec 5, 2025

Yes, but it's all the same downstream (OCP) configuration. We can't change the OCP configuration just to run our upstream e2e's. Our upstream e2e's are expected to be able to run mostly unmodified on OCP.

The upstream e2e will encounter 2 replicas when run in OCP (because that's the OCP configuration), but will encounter 1 replica upstream (because that's the upstream configuration).

And because we'd also allow the user to configure multiple replicas upstream, our upstream e2e should accommodate that.

Contributor

our upstream e2e should accommodate that.

Agreed that our e2e tests should be agnostic to the underlying setup, but our upstream e2e CI test jobs should run against an OLM deployment with a single replica (as they have until now).

Contributor

@tmshort tmshort Dec 8, 2025

Based on the latest update, the standard test-e2e will use a single replica. Running test-experimental-e2e will use two replicas. Installing either standard or experimental (outside of the e2e) will use a single replica.

Contributor

@pedjak pedjak left a comment

There are still unaddressed comments, like this one from the previous PR review round: https://github.com/operator-framework/operator-controller/pull/2371/files#r2598622299

Also, please update the PR title to Make deployments HA-ready with configurable replica count, as used in the PR description.

require.Equal(ct, metav1.ConditionTrue, cond.Status)
require.Equal(ct, ocv1.ReasonSucceeded, cond.Reason)
}, 2*time.Minute, pollInterval)
}, 5*time.Minute, pollInterval)
Contributor

We could use the extendedPollDuration variable here.

Member

Leader election already happens, even in the non-HA configuration, right? If so, it seems like adding another replica should not affect the timing. What am I missing?

// pollDuration is set to 3 minutes to account for leader election time in multi-replica deployments.
// In the worst case (previous leader crashed), leader election can take up to 163 seconds
// (LeaseDuration: 137s + RetryPeriod: 26s). Adding buffer for reconciliation time.
pollDuration = 3 * time.Minute
Member

We originally had pollDuration = 1 * time.Minute. I think we should stick to that, and any tests that require a pod restart (and thus a leader election) can use the extended interval. If we do that, what breaks (and why)?

Comment on lines 32 to 33
// In the worst case (previous leader crashed), leader election can take up to 163 seconds
// (LeaseDuration: 137s + RetryPeriod: 26s). Adding buffer for reconciliation time.
Member

We are using LeaderReleaseOnCancel: true for both catalogd and operator-controller, which means that as long as we are not crashing, the lease should be released immediately, and the election should take no longer than the RetryPeriod (26s), right?
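For context, a minimal sketch (not necessarily this repo's exact wiring) of the controller-runtime manager options being discussed; the lease ID and RenewDeadline below are assumptions, while the LeaseDuration/RetryPeriod values are the ones cited above:

package example

import (
    "time"

    "sigs.k8s.io/controller-runtime/pkg/manager"
)

// With LeaderElectionReleaseOnCancel, a gracefully stopped leader gives up its lease on
// exit, so a standby replica should acquire it within roughly one RetryPeriod instead of
// waiting out the full LeaseDuration.
func leaderElectionOptions() manager.Options {
    leaseDuration := 137 * time.Second // value cited above
    renewDeadline := 107 * time.Second // assumed value, for illustration only
    retryPeriod := 26 * time.Second    // value cited above
    return manager.Options{
        LeaderElection:                true,
        LeaderElectionID:              "example-leader-election-lease", // hypothetical ID
        LeaderElectionReleaseOnCancel: true,
        LeaseDuration:                 &leaseDuration,
        RenewDeadline:                 &renewDeadline,
        RetryPeriod:                   &retryPeriod,
    }
}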

// after creating int the Namespace and ServiceAccount.
t.Log("By eventually installing the package successfully")
// Use 5 minutes for recovery tests to account for exponential backoff after repeated failures
// plus leader election time (up to 163s in worst case)
Member

Why do we need to account for leader election time here? This test should not cause a restart of our controllers. The original 1m duration seems like it has been sufficient for exponential backoff concerns?

Member

@joelanford joelanford Dec 9, 2025

In addition, the underlying assumption of this check requires that the same controller process returns the first error and then backs off until the issue is resolved. If there is a restart of the pod during this test, that would result in a false positive, because we are no longer sure that the error self-resolves without requiring a pod restart.

NOTE: OLMv0 has quite a few cases where we say "have you tried restarting the catalog-operator and olm-operator pods?" This test and other tests related to automatic recovery and self-healing are here to make sure we do not require pod restarts to get things working again.

require.Contains(ct, cond.Message, "Installed bundle")
require.NotEmpty(ct, clusterExtension.Status.Install)
}, pollDuration, pollInterval)
}, extendedPollDuration, pollInterval)
Member

Same as above:

Why do we need to account for leader election time here? This test should not cause a restart of our controllers. The original 1m duration seems like it has been sufficient for exponential backoff concerns?

@jianzhangbjz jianzhangbjz force-pushed the pdb2 branch 2 times, most recently from 889c353 to f2cadda Compare December 10, 2025 03:56
@jianzhangbjz jianzhangbjz changed the title 🐛 makes the replica count configurable and update upgrade-e2e test cases 🐛 Make deployments HA-ready with configurable replica count and update upgrade-e2e test cases Dec 10, 2025
@jianzhangbjz
Member Author

jianzhangbjz commented Dec 10, 2025

I often encounter this kind of failure, which was not introduced by this PR. It shows that reducing pollDuration from 3 minutes back to 1 minute caused a timeout. The test is expecting Reason: "Succeeded" but it's still "Retrying" after 1 minute.
The previous timeout adjustment was made to address various timeout failures. Anyway, to preserve pollDuration = time.Minute, I added catalogPollDuration = 3 * time.Minute.

=== RUN   TestClusterExtensionInstallReResolvesWhenNewCatalog
    cluster_extension_install_test.go:473: When a cluster extension is installed from a catalog
    cluster_extension_install_test.go:474: It resolves again when a new catalog is available
    cluster_extension_install_test.go:515: It resolves the specified package with correct bundle path
    cluster_extension_install_test.go:516: By creating the ClusterExtension resource
    cluster_extension_install_test.go:519: By reporting a successful resolution and bundle path
    cluster_extension_install_test.go:520: 
        	Error Trace:	/home/runner/work/operator-controller/operator-controller/test/e2e/cluster_extension_install_test.go:525
        	            				/opt/hostedtoolcache/go/1.24.6/x64/src/runtime/asm_amd64.s:1700
        	Error:      	Not equal: 
        	            	expected: "Succeeded"
        	            	actual  : "Retrying"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-Succeeded
        	            	+Retrying
    cluster_extension_install_test.go:520: 
        	Error Trace:	/home/runner/work/operator-controller/operator-controller/test/e2e/cluster_extension_install_test.go:520
        	Error:      	Condition never satisfied
        	Test:       	TestClusterExtensionInstallReResolvesWhenNewCatalog
    helpers.go:330: By deleting ClusterCatalog "test-catalog-v8j2d2kn"
    helpers.go:339: By deleting ClusterExtension "clusterextension-bqjfp4lc"
    helpers.go:303: By waiting for CustomResourceDefinitions of "clusterextension-bqjfp4lc" to be deleted
    helpers.go:311: By waiting for ClusterRoleBindings of "clusterextension-bqjfp4lc" to be deleted
    helpers.go:319: By waiting for ClusterRoles of "clusterextension-bqjfp4lc" to be deleted
    helpers.go:349: By deleting ServiceAccount "clusterextension-bqjfp4lc"
    helpers.go:358: By deleting Namespace "clusterextension-bqjfp4lc"
--- FAIL: TestClusterExtensionInstallReResolvesWhenNewCatalog (68.41s)

@jianzhangbjz
Member Author

jianzhangbjz commented Dec 10, 2025

We’re seeing this same issue again…

=== RUN   TestWebhookSupport
    feature_gates.go:56: Feature-gate "WebhookProviderCertManager" not found in arguments, defaulting to true
    webhook_support_test.go:31: Test support for bundles with webhooks
    webhook_support_test.go:40: By creating install namespace, and necessary rbac resources
    webhook_support_test.go:85: By creating the webhook-operator ClusterCatalog
    webhook_support_test.go:105: By waiting for the catalog to serve its metadata
    webhook_support_test.go:114: By installing the webhook-operator ClusterExtension
    webhook_support_test.go:140: By waiting for webhook-operator extension to be installed successfully
panic: test timed out after 10m0s
	running tests:
		TestWebhookSupport (3s)

goroutine 2508 [running]:
testing.(*M).startAlarm.func1()
	/opt/hostedtoolcache/go/1.24.6/x64/src/testing/testing.go:2484 +0x394
created by time.goFunc
	/opt/hostedtoolcache/go/1.24.6/x64/src/time/sleep.go:215 +0x2d

@jianzhangbjz
Member Author

Again...

=== RUN   TestClusterExtensionForceInstallNonSuccessorVersion
    cluster_extension_install_test.go:314: When a cluster extension is installed from a catalog
    cluster_extension_install_test.go:315: When resolving upgrade edges
    helpers.go:271: Ensuring ClusterCatalog has Status.Condition of Progressing with a status == True and reason == Succeeded
    helpers.go:281: Checking that catalog has the expected metadata label
    helpers.go:286: Ensuring ClusterCatalog has Status.Condition of Type = Serving with status == True
    cluster_extension_install_test.go:321: By creating an ClusterExtension at a specified version
    cluster_extension_install_test.go:336: By eventually reporting a successful resolution
    cluster_extension_install_test.go:337: 
        	Error Trace:	/home/runner/work/operator-controller/operator-controller/test/e2e/cluster_extension_install_test.go:342
        	            				/opt/hostedtoolcache/go/1.24.6/x64/src/runtime/asm_amd64.s:1700
        	Error:      	Not equal: 
        	            	expected: "Succeeded"
        	            	actual  : "Retrying"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-Succeeded
        	            	+Retrying
    cluster_extension_install_test.go:337: 
        	Error Trace:	/home/runner/work/operator-controller/operator-controller/test/e2e/cluster_extension_install_test.go:337
        	Error:      	Condition never satisfied
        	Test:       	TestClusterExtensionForceInstallNonSuccessorVersion
    helpers.go:333: By deleting ClusterCatalog "test-catalog-lt5hffsm"
    helpers.go:342: By deleting ClusterExtension "clusterextension-clhcbr2s"
    helpers.go:306: By waiting for CustomResourceDefinitions of "clusterextension-clhcbr2s" to be deleted
    helpers.go:314: By waiting for ClusterRoleBindings of "clusterextension-clhcbr2s" to be deleted
    helpers.go:322: By waiting for ClusterRoles of "clusterextension-clhcbr2s" to be deleted
    helpers.go:352: By deleting ServiceAccount "clusterextension-clhcbr2s"
    helpers.go:361: By deleting Namespace "clusterextension-clhcbr2s"
--- FAIL: TestClusterExtensionForceInstallNonSuccessorVersion (69.35s)

spec:
minReadySeconds: 5
replicas: 1
replicas: {{ .Values.options.catalogd.deployment.replicas }}
Member

Do we already have anti-affinity configured to make sure these replicas do not end up on the same node? If not, we need that as well (but only when replicas > 1).

Contributor

@tmshort tmshort Dec 10, 2025

However, I will point out that this may cause an issue with our single-node kind experimental-e2e tests, where we have two replicas (such that we are validating that two replicas do not cause issues with the e2e tests).

Member Author

Good point! I added podAntiAffinity and used a preferred rule. In addition, I created openshift/release#72395 to add an SNO upgrade test for the downstream OLMv1 and OLMv0; please take a look, thanks!

  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: control-plane
                operator: In
                values:
                  - operator-controller-controller-manager
                  - catalogd-controller-manager
          topologyKey: kubernetes.io/hostname

Contributor

@pedjak pedjak left a comment

Given that this PR is about fixing upgrade tests, I would suggest we drop the changes to the test/e2e files; they should not be affected, and if needed we can always address them in a follow-up PR.

Comment on lines 267 to 287
for {
    select {
    case <-ctx.Done():
        return nil, fmt.Errorf("timeout waiting for leader election: %w", ctx.Err())
    case <-ticker.C:
        var managerPods corev1.PodList
        if err := c.List(ctx, &managerPods, client.MatchingLabelsSelector{Selector: deploymentLabelSelector}); err != nil {
            continue
        }

        if len(managerPods.Items) == 0 {
            continue
        }

        // If there's only one pod, it must be the leader
        if len(managerPods.Items) == 1 {
            return &managerPods.Items[0], nil
        }

        // For multi-replica deployments, check the Lease resource to find the leader
        var lease coordinationv1.Lease
Contributor

@pedjak pedjak Dec 10, 2025

Given that we use the testify library in this project, we could use require.Eventually() for polling instead of implementing a custom solution, which would improve readability.
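A rough sketch of what that could look like, reusing names (findLeaderPod, deploymentLabelSelector) from the snippet quoted above as assumptions; the actual helper in this PR may differ:

package e2e

import (
    "context"
    "testing"
    "time"

    "github.com/stretchr/testify/require"
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/labels"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// findLeaderPod polls with testify's require.Eventually instead of a hand-rolled
// ticker loop; the test fails with the given message if the condition never holds.
func findLeaderPod(t *testing.T, ctx context.Context, c client.Client, deploymentLabelSelector labels.Selector) *corev1.Pod {
    var leader *corev1.Pod
    require.Eventually(t, func() bool {
        var pods corev1.PodList
        if err := c.List(ctx, &pods, client.MatchingLabelsSelector{Selector: deploymentLabelSelector}); err != nil || len(pods.Items) == 0 {
            return false
        }
        // Single-replica case: the only pod must be the leader.
        if len(pods.Items) == 1 {
            leader = &pods.Items[0]
            return true
        }
        // Multi-replica case: resolve the leader via the Lease instead (see the later suggestion).
        return false
    }, time.Minute, time.Second, "timed out waiting for leader election")
    return leader
}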

}

// If there's only one pod, it must be the leader
if len(managerPods.Items) == 1 {
Contributor

We could be wrong here if we get the response from the API server right before the second pod is created.

Instead of listing pods first, I would suggest fetching the lease first, locating the pod via the lease holder, and then checking whether that pod runs the appropriate OLM version.

@tmshort
Contributor

tmshort commented Dec 10, 2025

Given that this PR is about fixing upgrade tests, I would suggest we drop the changes to the test/e2e files; they should not be affected, and if needed we can always address them in a follow-up PR.

However, we will still need changes to test/e2e to validate that we can handle multiple replicas.

@joelanford
Member

I think we need to investigate further to understand:

  1. Whether using 2 replicas actually requires different poll durations, and if so what specifically that reason is
  2. Whether the e2e failures you are seeing are existing flakes on main, and if so what we should separately do to resolve those.

My instinct is that using multiple replicas should not affect timing, and if it does, it is a bug in the code, not in the test.

@joelanford
Member

joelanford commented Dec 10, 2025

However, we will still need changes to test/e2e to validate that we can handle multiple replicas.

If there are specific assumptions in the e2e tests that there is one replica and tests fail when there are two replicas, yes we should resolve that and remove those assumptions. But I would reiterate that I don't think anything about timing or functional test expectations themselves should be changing in this PR.

// to facilitate the mitigation put in place for https://github.com/operator-framework/operator-controller/issues/1626
func waitForDeployment(t *testing.T, ctx context.Context, controlPlaneLabel string) *corev1.Pod {
// i.e. no old pods remain.
func waitForDeployment(t *testing.T, ctx context.Context, controlPlaneLabel string) {
Contributor

This PR: #2379
Handles the changes for waitForDeployment, which are independent of these changes, and can be merged quicker because it only deals with multiple replicas in the

@jianzhangbjz
Member Author

My instinct is that using multiple replicas should not affect timing, and if it does, it is a bug in the code, not in the test.

Those timeout failures have happened not only in this PR but in others as well. If you rerun the job a few times, it will pass. However, there is no /retest mechanism upstream like the downstream has. I tried increasing the timeout to fix it and make this PR's tests pass.

@jianzhangbjz
Member Author

Hi @joelanford, could you point out how to solve this general failure (#2371 (comment)) if we don't increase the timeout? Just /retest? Thanks!

@jianzhangbjz
Member Author

Encountered this kind of failure again. I'd prefer to set -timeout=20m for the current go test -count=1 -v ./test/e2e/... invocation. However, given your sensitivity to the timeout duration, I'll leave this decision to you.

=== RUN   TestClusterExtensionRecoversFromExistingDeploymentWhenFailureFixed
    cluster_extension_install_test.go:686: When a cluster extension is installed from a catalog
    cluster_extension_install_test.go:687: When the extension bundle format is registry+v1
    helpers.go:271: Ensuring ClusterCatalog has Status.Condition of Progressing with a status == True and reason == Succeeded
    helpers.go:281: Checking that catalog has the expected metadata label
    helpers.go:286: Ensuring ClusterCatalog has Status.Condition of Type = Serving with status == True
    cluster_extension_install_test.go:710: By creating a new Deployment that can not be adopted
    cluster_extension_install_test.go:753: It resolves the specified package with correct bundle path
    cluster_extension_install_test.go:754: By creating the ClusterExtension resource
    cluster_extension_install_test.go:757: By eventually reporting Progressing == True with Reason Retrying
    cluster_extension_install_test.go:766: By eventually failing to install the package successfully due to no adoption support
    cluster_extension_install_test.go:779: By deleting the new Deployment
    cluster_extension_install_test.go:786: By eventually installing the package successfully
    cluster_extension_install_test.go:798: By eventually reporting Progressing == True with Reason Success
    helpers.go:333: By deleting ClusterCatalog "test-catalog-jmcpp5dt"
    helpers.go:342: By deleting ClusterExtension "clusterextension-2fx4dcfk"
    helpers.go:306: By waiting for CustomResourceDefinitions of "clusterextension-2fx4dcfk" to be deleted
    helpers.go:314: By waiting for ClusterRoleBindings of "clusterextension-2fx4dcfk" to be deleted
    helpers.go:322: By waiting for ClusterRoles of "clusterextension-2fx4dcfk" to be deleted
    helpers.go:352: By deleting ServiceAccount "clusterextension-2fx4dcfk"
    helpers.go:361: By deleting Namespace "clusterextension-2fx4dcfk"
panic: test timed out after 10m0s
	running tests:
		TestClusterExtensionRecoversFromExistingDeploymentWhenFailureFixed (20s)

goroutine 2094 [running]:
testing.(*M).startAlarm.func1()
	/opt/hostedtoolcache/go/1.24.6/x64/src/testing/testing.go:2484 +0x394


var leaderPod *corev1.Pod

// Use require.Eventually for polling instead of custom ticker implementation
Contributor

The comment states that we use it on the next line, but we actually don't. Also, I'm not sure why we're mentioning a custom ticker implementation, given that it is not part of the existing code base.


// Now fetch pods and find the one matching the lease holder
var managerPods corev1.PodList
if err := c.List(ctx, &managerPods, client.MatchingLabelsSelector{Selector: deploymentLabelSelector}); err != nil {
Contributor

@pedjak pedjak Dec 11, 2025

There is no need to list pods. After finding the holder identity, we can directly fetch the particular pod by name.
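A minimal sketch of that approach; the lease name/namespace arguments and the holder-identity format (controller-runtime typically records the holder as "<pod-name>_<random-suffix>") are assumptions here:

package e2e

import (
    "context"
    "fmt"
    "strings"

    coordinationv1 "k8s.io/api/coordination/v1"
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// leaderPodFromLease reads the leader-election Lease and fetches the holder pod by name,
// avoiding the race of listing pods while a second replica is still being created.
func leaderPodFromLease(ctx context.Context, c client.Client, namespace, leaseName string) (*corev1.Pod, error) {
    var lease coordinationv1.Lease
    if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: leaseName}, &lease); err != nil {
        return nil, fmt.Errorf("getting lease %q: %w", leaseName, err)
    }
    if lease.Spec.HolderIdentity == nil || *lease.Spec.HolderIdentity == "" {
        return nil, fmt.Errorf("lease %q has no holder yet", leaseName)
    }
    // Assumed holder format: "<pod-name>_<suffix>"; keep only the pod name.
    podName := strings.SplitN(*lease.Spec.HolderIdentity, "_", 2)[0]

    var pod corev1.Pod
    if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: podName}, &pod); err != nil {
        return nil, fmt.Errorf("getting leader pod %q: %w", podName, err)
    }
    return &pod, nil
}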

@tmshort
Contributor

tmshort commented Dec 11, 2025

Hi @joelanford, could you point out how to solve this general failure (#2371 (comment)) if we don't increase the timeout? Just /retest? Thanks!

You need to use the standard GitHub mechanism to rerun tests: click on the failed test, and then there is a "Re-run Jobs" button in the upper right.

But... we shouldn't really fail that often.

@tmshort
Contributor

tmshort commented Dec 11, 2025

This should be rebased given the basic changes to support replicas in the tests.
