@matttrach
Collaborator

Problem:
Part of the module blindly reaches out to AWS for information on the nodes that Rancher is deploying.
It does this to adjust the nodes' security groups so that the nodes can talk back to Rancher without making the Rancher node public. The blind part is the problem: we need to hit the sweet spot after the nodes are deployed in AWS but before cluster generation times out waiting for the nodes to talk back. I have set the Rancher timeout to 120 minutes. Waiting 120 minutes to deploy a cluster is obviously not ideal, so we want the fastest transition we can get without hitting that limit. The variables involved include Rancher node latency, AWS latency, AWS API throttling, AWS instance deployment latency, AWS instance boot timing, and cloud-init problems.

Solution:
I have recently added three solutions:

  1. After noticing that the AWS API takes longer to deploy instances when you request several in a row, I now wait 1 min per node by default.
  2. After noticing that 1 min is often not enough for a single node, while the 2+ min wait for multiple nodes works fine, I added a 1 min buffer, so the wait calculation is 1 min + 1 min/node.
  3. I added the resources for waiting and collecting data three times. The declarative nature of Terraform leaves no way to short-circuit the logic, but in dicey situations I figure three attempts are better than one, even if it means an additional 3 min/node wait (the deploy time is closing in on 1 hour, so 3-9 min isn't a large increase).
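
For reviewers who want to sanity-check the timing, the wait calculation from item 2 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the module's Terraform code: the function names, the parameter names, and the assumption that every attempt waits the full buffer plus per-node time are mine.

```python
def wait_seconds(node_count: int, per_node: int = 60, buffer: int = 60) -> int:
    """Return the wait for one attempt: a fixed buffer plus a per-node delay.

    Defaults mirror the description above (1 min buffer + 1 min per node).
    The names are hypothetical and do not come from the module.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return buffer + per_node * node_count


def total_wait_seconds(node_count: int, attempts: int = 3) -> int:
    """Return the combined wait if every attempt runs to completion.

    Assumption: each attempt waits the same amount, and there is no
    short circuit, so all attempts always run.
    """
    return attempts * wait_seconds(node_count)


if __name__ == "__main__":
    for nodes in (1, 3, 5):
        per_attempt = wait_seconds(nodes)
        total = total_wait_seconds(nodes)
        print(f"{nodes} node(s): {per_attempt}s per attempt, {total}s in total")
```

With these defaults a single node waits 120 seconds per attempt, which matches the 1 min + 1 min/node rule. How the three attempts add up in the real module depends on how its wait resources are chained, so treat the total as a rough upper bound under the stated assumption.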

@matttrach matttrach requested a review from a team as a code owner October 22, 2025 05:53
@matttrach matttrach self-assigned this Oct 22, 2025
@matttrach matttrach merged commit 28a1977 into rancher:main Oct 22, 2025
7 checks passed