test: stabilize flaky firewallpolicyrule acquisition test #5409

gemmahou · 2025-10-10T21:45:10Z

BRIEF Change description

Intend to stabilize flaky firewallpolicyrule acquisition test
Issue: #5410

WHY do we need this change?

Special notes for your reviewer:

I was not able to reproduce the test failure locally, but I did see the flakiness and failed several times on GH presubmit job.

I've triggered the job 3 times and they all passed: https://screenshot.googleplex.com/8HST4j8bqR3gcLZ.png

Here's the root cause and why it only happens on GH job that Gemini provided:

The extra GET request is a symptom of a subtle race condition between the Kubernetes API server's state and the controller's local cache.

Here's the sequence of events that causes it:

 1. Test Confirms Parent is Ready: The test script applies the parent ComputeFirewallPolicy and then waits for it to be
ready. This "wait" function checks the resource's status by polling the API server directly. Once it sees Ready=True, the
test proceeds.

 3. Child is Created Immediately: The test immediately applies the child ComputeFirewallPolicyRule.

 4. Controller Reads from Cache: The ComputeFirewallPolicyRule controller wakes up to reconcile the new child
resource. To get the parent's required ID, it reads the ComputeFirewallPolicy resource from its local, in-memory cache,
not from the API server.

 5. The Race: There is a very small, non-zero delay for the parent's updated status to propagate from the API server to
the child controller's local cache.
     * On your fast local machine: The cache usually syncs before the child controller needs it. The controller finds the
parent's ID in the cache on the first try.
     * On a slower GitHub runner: The child controller often reads its cache before the parent's status update has arrived.
It sees a stale version of the parent resource without the necessary ID.

When the controller reads the stale cache, its logic correctly determines the dependency isn't ready and requeues the
request. The reconciliation succeeds moments later, but this requeue-and-retry cycle alters the timing of operations. This
new timing causes a different pattern of HTTP requests (the extra GET) compared to the "perfect" run that was used to
generate the golden log file, resulting in the test failure.

While I'm not familiar with the concept of "controller's local cache", any further clarifications would be appreciated.

This test started failing recently without any code changes to this resource. The only changes would be we have made
100+ presubmit jobs running in parallel, so it probably makes sense that:

On GitHub the environment is different. The underlying Linux kernel, its scheduler, and the load from other processes are
all variables. This can cause the Go runtime to schedule the goroutines in a slightly different order. All it takes is for the 
child FirewallPolicyRule controller's goroutine to be scheduled a few milliseconds before the parent's cache update is 
processed, and the bug is triggered.

Does this PR add something which needs to be 'release noted'?

Reviewer reviewed release note.

Additional documentation e.g., references, usage docs, etc.:

Intended Milestone

Please indicate the intended milestone.

Reviewer tagged PR with the actual milestone.

Tests you have done

Run make ready-pr to ensure this PR is ready for review.
Perform necessary E2E testing for changed resources.

google-oss-prow · 2025-10-10T21:45:26Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yuwenma for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gemmahou · 2025-10-13T17:55:59Z

/assign @acpana
Would you mind taking a look at this PR? Thanks!

acpana · 2025-10-13T23:56:49Z

tests/e2e/testdata/scenarios/acquisition/firewallpolicyrule/script.yaml

-TEST: APPLY
+# pausing the test execution, ensuring the controller's cache has time to sync with the API server before the child
+# resource is created.
+TEST: SLEEP
+apiVersion: compute.cnrm.cloud.google.com/v1beta1
+kind: ComputeFirewallPolicyRule
+metadata:
+  name: firewallpolicyrule-${uniqueId}
+
+---


Ah I didn't know we have SLEEP test case.

I think this works for now BUT Walter did ask me to use a wait.Poll for a similar-ish race on one of my tests! e.g.:

https://github.com/GoogleCloudPlatform/k8s-config-connector/pull/5288/files#diff-880e10d431db58a0b2411d1fb8dc426dca5aa8fda01ddfecebabb0761d30d0feR298-R303

I'm going to leave it up to him on whether to use it in this case!

acpana · 2025-10-13T23:57:39Z

/lgtm

cc @cheftako for final approval!

hack: stabilize flaky firewallpolicyrule acquisition test

e2a475e

gemmahou force-pushed the fix-scenarios-firewallpolicyrule branch from da7584b to e2a475e Compare October 10, 2025 22:09

gemmahou linked an issue Oct 10, 2025 that may be closed by this pull request

Test TestE2EScript/scenarios/acquisition/firewallpolicyrule is flaky #5410

Open

google-oss-prow bot assigned acpana Oct 13, 2025

acpana reviewed Oct 13, 2025

View reviewed changes

google-oss-prow bot added the lgtm label Oct 13, 2025

gemmahou assigned cheftako Oct 16, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test: stabilize flaky firewallpolicyrule acquisition test #5409

test: stabilize flaky firewallpolicyrule acquisition test #5409

Uh oh!

gemmahou commented Oct 10, 2025 •

edited

Loading

Uh oh!

google-oss-prow bot commented Oct 10, 2025

Uh oh!

gemmahou commented Oct 13, 2025

Uh oh!

acpana Oct 13, 2025

Uh oh!

acpana commented Oct 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

test: stabilize flaky firewallpolicyrule acquisition test #5409

Are you sure you want to change the base?

test: stabilize flaky firewallpolicyrule acquisition test #5409

Uh oh!

Conversation

gemmahou commented Oct 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

BRIEF Change description

WHY do we need this change?

Special notes for your reviewer:

Does this PR add something which needs to be 'release noted'?

Additional documentation e.g., references, usage docs, etc.:

Intended Milestone

Tests you have done

Uh oh!

google-oss-prow bot commented Oct 10, 2025

Uh oh!

gemmahou commented Oct 13, 2025

Uh oh!

acpana Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

acpana commented Oct 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

gemmahou commented Oct 10, 2025 •

edited

Loading