fix: try to get the nodes 3 times before failure #132

matttrach · 2025-10-22T05:53:25Z

Problem:
There is a part of the module that blindly reaches out to AWS for information on the nodes that Rancher is deploying.
It does this to adjust the nodes' security groups so that the nodes can talk back to Rancher without making the Rancher node public. The blind part is the problem: we need to hit the sweet spot after the nodes are deployed in AWS but before cluster generation times out waiting for the nodes to talk back. I have set the Rancher timeout to 120 minutes. Waiting 120 minutes to deploy a cluster is obviously not ideal, so we want the fastest transition we can get without hitting that limit. The variables involved include Rancher node latency, AWS latency, AWS API throttling, AWS instance deployment latency, AWS instance boot timing, and cloud-init problems.

Solution:
I have recently added two solutions:

After noticing that the AWS API takes longer to deploy instances when you request several in a row, I now wait 1 min per node by default. After noticing that 1 min is often not enough for a single node, while the 2+ min wait for multiple nodes works just fine, I added a 1 min buffer, so the wait calculation is 1 min + 1 min/node.
I added the resources for waiting and collecting data three times. The declarative nature of Terraform leaves no way to short-circuit the logic, but in dicey situations I figure three attempts are better than one, even if that means an additional 3 min/node wait (the deploy time is closing in on 1 hour, so 3-9 min isn't a large increase).
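
For illustration only, the wait calculation and the three unconditional attempts described above can be sketched in Python. This is not the module's Terraform code. The names `wait_minutes`, `collect_nodes`, and `lookup` are hypothetical, and two details are assumptions rather than things this PR states: that every attempt waits the same calculated duration, and that the results of the three attempts are merged.

```python
import time
from typing import Callable, Iterable, Set

BUFFER_MINUTES = 1    # flat buffer from the description
MINUTES_PER_NODE = 1  # per-node wait from the description
ATTEMPTS = 3          # number of wait-and-collect passes


def wait_minutes(node_count: int) -> int:
    """Return the wait for one pass: 1 min buffer + 1 min per node."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return BUFFER_MINUTES + MINUTES_PER_NODE * node_count


def collect_nodes(
    node_count: int,
    lookup: Callable[[], Iterable[str]],  # hypothetical AWS query returning instance IDs
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Run every wait-and-collect pass, then fail if nodes are still missing.

    All passes always run, mirroring the lack of short-circuiting in the
    declarative setup. Merging the results is an assumption of this sketch.
    """
    found: Set[str] = set()
    for _ in range(ATTEMPTS):
        sleep(wait_minutes(node_count) * 60)  # assumed: same wait before each pass
        found.update(lookup())
    if len(found) < node_count:
        raise RuntimeError(
            f"found {len(found)} of {node_count} nodes after {ATTEMPTS} attempts"
        )
    return found
```

With these constants, `wait_minutes(1)` gives 2 and `wait_minutes(3)` gives 4, matching the 1 min + 1 min/node rule. Passing a stub for `sleep` lets the loop be exercised without actually waiting.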

Signed-off-by: matttrach <[email protected]>

fix: try to get the nodes 3 times before failure

b0af086

Signed-off-by: matttrach <[email protected]>

matttrach requested a review from a team as a code owner October 22, 2025 05:53

matttrach self-assigned this Oct 22, 2025

matttrach merged commit 28a1977 into rancher:main Oct 22, 2025
7 checks passed

github-actions bot mentioned this pull request Oct 22, 2025

chore(main): release 3.1.1 #129

Merged
