Skip to content

Conversation

gemmahou
Copy link
Collaborator

@gemmahou gemmahou commented Oct 10, 2025

BRIEF Change description

Intend to stabilize flaky firewallpolicyrule acquisition test
Issue: #5410

WHY do we need this change?

Special notes for your reviewer:

I was not able to reproduce the test failure locally, but I did see the flakiness and failed several times on GH presubmit job.

I've triggered the job 3 times and they all passed: https://screenshot.googleplex.com/8HST4j8bqR3gcLZ.png

Here's the root cause and why it only happens on GH job that Gemini provided:

The extra GET request is a symptom of a subtle race condition between the Kubernetes API server's state and the controller's local cache.

Here's the sequence of events that causes it:

 1. Test Confirms Parent is Ready: The test script applies the parent ComputeFirewallPolicy and then waits for it to be
ready. This "wait" function checks the resource's status by polling the API server directly. Once it sees Ready=True, the
test proceeds.

 3. Child is Created Immediately: The test immediately applies the child ComputeFirewallPolicyRule.

 4. Controller Reads from Cache: The ComputeFirewallPolicyRule controller wakes up to reconcile the new child
resource. To get the parent's required ID, it reads the ComputeFirewallPolicy resource from its local, in-memory cache,
not from the API server.

 5. The Race: There is a very small, non-zero delay for the parent's updated status to propagate from the API server to
the child controller's local cache.
     * On your fast local machine: The cache usually syncs before the child controller needs it. The controller finds the
parent's ID in the cache on the first try.
     * On a slower GitHub runner: The child controller often reads its cache before the parent's status update has arrived.
It sees a stale version of the parent resource without the necessary ID.

When the controller reads the stale cache, its logic correctly determines the dependency isn't ready and requeues the
request. The reconciliation succeeds moments later, but this requeue-and-retry cycle alters the timing of operations. This
new timing causes a different pattern of HTTP requests (the extra GET) compared to the "perfect" run that was used to
generate the golden log file, resulting in the test failure.

While I'm not familiar with the concept of "controller's local cache", any further clarifications would be appreciated.

This test started failing recently without any code changes to this resource. The only changes would be we have made
100+ presubmit jobs running in parallel, so it probably makes sense that:

On GitHub the environment is different. The underlying Linux kernel, its scheduler, and the load from other processes are
all variables. This can cause the Go runtime to schedule the goroutines in a slightly different order. All it takes is for the 
child FirewallPolicyRule controller's goroutine to be scheduled a few milliseconds before the parent's cache update is 
processed, and the bug is triggered.

Does this PR add something which needs to be 'release noted'?


  • Reviewer reviewed release note.

Additional documentation e.g., references, usage docs, etc.:


Intended Milestone

Please indicate the intended milestone.

  • Reviewer tagged PR with the actual milestone.

Tests you have done

  • Run make ready-pr to ensure this PR is ready for review.
  • Perform necessary E2E testing for changed resources.

Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yuwenma for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemmahou gemmahou force-pushed the fix-scenarios-firewallpolicyrule branch from da7584b to e2a475e Compare October 10, 2025 22:09
@gemmahou
Copy link
Collaborator Author

/assign @acpana
Would you mind taking a look at this PR? Thanks!

Comment on lines -39 to +47
TEST: APPLY
# pausing the test execution, ensuring the controller's cache has time to sync with the API server before the child
# resource is created.
TEST: SLEEP
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeFirewallPolicyRule
metadata:
name: firewallpolicyrule-${uniqueId}

---
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I didn't know we have SLEEP test case.

I think this works for now BUT Walter did ask me to use a wait.Poll for a similar-ish race on one of my tests! e.g.:

https://github.com/GoogleCloudPlatform/k8s-config-connector/pull/5288/files#diff-880e10d431db58a0b2411d1fb8dc426dca5aa8fda01ddfecebabb0761d30d0feR298-R303

I'm going to leave it up to him on whether to use it in this case!

@acpana
Copy link
Collaborator

acpana commented Oct 13, 2025

/lgtm

cc @cheftako for final approval!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Test TestE2EScript/scenarios/acquisition/firewallpolicyrule is flaky

3 participants