@HaoyL666
Member

What this PR does / why we need it:
In case of out of host capacity errors, all fault domains will be tried before backoff.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #133

@oracle-contributor-agreement

Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.

When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.

If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.

@oracle-contributor-agreement bot added the OCA Required label ("At least one contributor does not have an approved Oracle Contributor Agreement.") on Nov 19, 2025
  name: "${CLUSTER_NAME}"
spec:
  compartmentId: "${OCI_COMPARTMENT_ID}"
  region: ${OCI_REGION}
Member

Let's not have this here. We already have a template for an alternate region.

Member

It is best to have only the relevant changes per PR if possible.

@EthanJeffery

It looks like the oracle-contributor-agreement bot is incorrectly linking my account instead of HaoyL666's. Please fix this, or, Ethan, set your git credentials locally so that the author of your commits is your account instead of your MacBook (see the Stack Overflow thread on this).

@HaoyL666 force-pushed the ethan_fault-domain-retry branch from aece866 to 9d3eb62 on November 21, 2025 03:38
@oracle-contributor-agreement bot added the OCA Verified label ("All contributors have signed the Oracle Contributor Agreement.") and removed the OCA Required label ("At least one contributor does not have an approved Oracle Contributor Agreement.") on Nov 21, 2025
.gitignore Outdated

# git worktrees
worktrees/
worktrees/
Member

We should probably come up with some code style tool to make sure we don't add inadvertent changes. I'm not against this change, but if we keep our PRs small and on point, they are easier to review. I'm pretty sure this is a change the code editor made automatically.

Member Author

Good catch, I’ve reverted it so this PR only includes the intended changes. I will also see if I can set up a code style tool!

@HaoyL666
Member Author

unit tests

ok      github.com/oracle/cluster-api-provider-oci/api/v1beta1  (cached)        coverage: 23.3% of statements
ok      github.com/oracle/cluster-api-provider-oci/api/v1beta2  (cached)        coverage: 18.0% of statements
ok      github.com/oracle/cluster-api-provider-oci/cloud/config (cached)        coverage: 86.1% of statements
        github.com/oracle/cluster-api-provider-oci/cloud/metrics                coverage: 0.0% of statements
ok      github.com/oracle/cluster-api-provider-oci/cloud/ociutil        0.544s  coverage: 29.9% of statements
ok      github.com/oracle/cluster-api-provider-oci/cloud/ociutil/ptr    (cached)        coverage: 100.0% of statements
ok      github.com/oracle/cluster-api-provider-oci/cloud/scope  (cached)        coverage: 74.8% of statements
        github.com/oracle/cluster-api-provider-oci/cloud/scope/mocks            coverage: 0.0% of statements
        github.com/oracle/cluster-api-provider-oci/cloud/services/base          coverage: 0.0% of statements
        github.com/oracle/cluster-api-provider-oci/cloud/services/base/mock_base                coverage: 0.0% of statements
?       github.com/oracle/cluster-api-provider-oci/cloud/services/compute       [no test files]
        github.com/oracle/cluster-api-provider-oci/cloud/services/compute/mock_compute          coverage: 0.0% of statements
?       github.com/oracle/cluster-api-provider-oci/cloud/services/computemanagement     [no test files]
        github.com/oracle/cluster-api-provider-oci/cloud/services/computemanagement/mock_computemanagement              coverage: 0.0% of statements
?       github.com/oracle/cluster-api-provider-oci/cloud/services/containerengine       [no test files]
        github.com/oracle/cluster-api-provider-oci/cloud/services/containerengine/mock_containerengine          coverage: 0.0% of statements
?       github.com/oracle/cluster-api-provider-oci/cloud/services/identity      [no test files]
        github.com/oracle/cluster-api-provider-oci/cloud/services/identity/mock_identity                coverage: 0.0% of statements
?       github.com/oracle/cluster-api-provider-oci/cloud/services/loadbalancer  [no test files]
        github.com/oracle/cluster-api-provider-oci/cloud/services/loadbalancer/mock_lb          coverage: 0.0% of statements
?       github.com/oracle/cluster-api-provider-oci/cloud/services/networkloadbalancer   [no test files]
        github.com/oracle/cluster-api-provider-oci/cloud/services/networkloadbalancer/mock_nlb          coverage: 0.0% of statements
?       github.com/oracle/cluster-api-provider-oci/cloud/services/vcn   [no test files]
        github.com/oracle/cluster-api-provider-oci/cloud/services/vcn/mock_vcn          coverage: 0.0% of statements
?       github.com/oracle/cluster-api-provider-oci/cloud/services/workrequests  [no test files]
        github.com/oracle/cluster-api-provider-oci/cloud/services/workrequests/mock_workrequests                coverage: 0.0% of statements
ok      github.com/oracle/cluster-api-provider-oci/cloud/util   (cached)        coverage: 60.9% of statements
ok      github.com/oracle/cluster-api-provider-oci/controllers  (cached)        coverage: 59.2% of statements
ok      github.com/oracle/cluster-api-provider-oci/exp/api/v1beta1      (cached)        coverage: 15.3% of statements
ok      github.com/oracle/cluster-api-provider-oci/exp/api/v1beta2      (cached)        coverage: 5.8% of statements
ok      github.com/oracle/cluster-api-provider-oci/exp/controllers      (cached)        coverage: 56.3% of statements
        github.com/oracle/cluster-api-provider-oci/feature              coverage: 0.0% of statements
?       github.com/oracle/cluster-api-provider-oci/version      [no test files]

@HaoyL666
Member Author

HaoyL666 commented Dec 1, 2025

E2E test:
Screenshot 2025-12-01 at 9 35 20 AM

@HaoyL666 marked this pull request as ready for review on December 1, 2025 23:31
@HaoyL666 changed the title from "[WIP] Add Fault Domain retry logic for out of host capacity error" to "feat: Add Fault Domain retry logic for out of host capacity error" on Dec 2, 2025
@HaoyL666
Member Author

HaoyL666 commented Dec 8, 2025

Test case 1:
Provisioned a cluster in phx-AD-2 (maps to AD-3) with no FD specified; the worker was automatically deployed in FD3 (only FD3 has capacity).
Screenshot 2025-12-08 at 10 02 33
Screenshot 2025-12-08 at 10 00 48

@HaoyL666
Member Author

HaoyL666 commented Dec 8, 2025

Test case 2:
Provisioned a cluster in Sydney AD-1 with the FD specified as FD-1, then scaled up the worker nodes. This triggered an out-of-capacity error, and the worker was deployed in FD3, which had capacity.
Screenshot 2025-12-08 at 10 08 32
Screenshot 2025-12-08 at 10 44 16

@HaoyL666
Member Author

HaoyL666 commented Dec 9, 2025

new E2E:
Screenshot 2025-12-08 at 12 50 39

@HaoyL666 requested a review from joekr on December 9, 2025 15:29
	}
	resp, err := m.ComputeClient.LaunchInstance(ctx, core.LaunchInstanceRequest{
		LaunchInstanceDetails: details,
		OpcRetryToken:         opcRetryToken,
Member

Retrying with the same token across different fault domains may cause OCI to reject requests as duplicates. Can you confirm how this works?

Member

Maybe append the FD to the token?
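One way to read the suggestion above is to derive the token from both the machine UID and the fault domain. The following is only a sketch: `retryTokenForFaultDomain` is a hypothetical helper, not part of this PR or the OCI SDK, and whether OCI actually rejects a reused token across fault domains is still the open question from the previous comment. Hashing the pair keeps the token deterministic per fault domain and at a fixed length regardless of input.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// retryTokenForFaultDomain is a hypothetical helper (an assumption for this
// sketch). It derives a distinct, deterministic token for each
// (machine UID, fault domain) pair, so a retry in another fault domain is
// not sent with the token of an earlier attempt.
func retryTokenForFaultDomain(machineUID, faultDomain string) string {
	sum := sha256.Sum256([]byte(machineUID + "/" + faultDomain))
	// Hex-encoding a SHA-256 digest always yields 64 characters.
	return hex.EncodeToString(sum[:])
}

func main() {
	uid := "11111111-2222-3333-4444-555555555555" // placeholder UID
	for _, fd := range []string{"FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"} {
		fmt.Printf("%s -> %s\n", fd, retryTokenForFaultDomain(uid, fd))
	}
}
```

Because the token is a pure function of its inputs, a retry of the same machine in the same fault domain would still reuse its token, which is the idempotency behavior a retry token is meant to provide.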

return instanceCompartmentIDMatcher(request, "test")
})).Return(core.LaunchInstanceResponse{}, nil)
},
},
Member

Do we also have tests for

  • Multiple fault domains with capacity issues
  • Non-capacity error during retry (e.g., what if FD2 returns quota error?)

	if err == nil {
		return false
	}
	return strings.Contains(strings.ToLower(err.Error()), strings.ToLower(OutOfHostCapacityErr))
Member

Check whether the OCI SDK has error codes or typed errors for capacity issues. If not, this is fine, but I would rather not use string matching.

Member Author

I didn't find it in the docs. I reached out to oci_compute and am waiting for their response.

return m.launchInstanceWithFaultDomainRetry(ctx, launchDetails, faultDomains)
}

func (m *MachineScope) launchInstanceWithFaultDomainRetry(ctx context.Context, baseDetails core.LaunchInstanceDetails, faultDomains []string) (*core.Instance, error) {
Member

Let's add GoDoc-style comments to the functions.

return nil, lastErr
}

func (m *MachineScope) buildFaultDomainLaunchList(availabilityDomain, initialFaultDomain string, retry bool) []string {
Member

Maybe rename this to something clearer, like buildFaultDomainRetryList, since it is used for the retry logic.

return nil, err
}
lastErr = err
m.Logger.Info("Fault domain has run out of host capacity, retrying in a different domain", "faultDomain", fd)
Member

Can we add some better errors around this? I like what is here, but maybe we could include the total number of attempts or something:

m.Logger.Info("Attempting instance launch", "faultDomain", fd, "attemptNumber", attemptCount+1, "totalAttempts", len(faultDomains))

or something like that.

}

func (m *MachineScope) buildFaultDomainLaunchList(availabilityDomain, initialFaultDomain string, retry bool) []string {
attempts := []string{initialFaultDomain}
Member

Is attempts here just a list of FDs? Should we call it something more in line with that?

}

func (m *MachineScope) resolveAvailabilityAndFaultDomain() (string, string, bool, error) {
failureDomainKey := m.Machine.Spec.FailureDomain
Member

Let's stick with calling them FaultDomains here, since that is what the function and all the new code call them. Maybe even add a comment here noting that CAPI calls them Failure Domains, but OCI calls them FaultDomains.

Member Author

I think we could keep using Failure Domain, as it contains the AD/FD info. In single-AD regions the failure domain is the fault domain; in multi-AD regions it is the AD. So we just get what we need from the failure domain. Not sure if this makes sense.

Member

That is fine. Then let's rename resolveAvailabilityAndFaultDomain to stick with FailureDomain. I just want to be as clear as we can be.
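To make the terminology concrete, here is a rough sketch of what a renamed resolver could look like. Everything in it is an assumption made for illustration: the `failureDomain` struct, the attribute keys, and the `resolveFailureDomain` name are hypothetical, and the real method reads from `MachineScope` and returns an additional bool. It follows the description above: a CAPI failure domain always yields an OCI availability domain, and yields a fault domain only when the entry carries one (the single-AD case).

```go
package main

import (
	"errors"
	"fmt"
)

// failureDomain is a hypothetical, simplified view of a CAPI failure domain
// entry. CAPI calls these Failure Domains; OCI splits the concept into
// availability domains (AD) and fault domains (FD).
type failureDomain struct {
	Attributes map[string]string
}

const (
	attrAvailabilityDomain = "AvailabilityDomain" // assumed attribute key
	attrFaultDomain        = "FaultDomain"        // assumed attribute key
)

// resolveFailureDomain extracts the OCI availability domain and fault domain
// from the CAPI failure domain stored under key. The fault domain is empty
// when the entry does not carry one (assumed to be the multi-AD case).
func resolveFailureDomain(domains map[string]failureDomain, key string) (ad, fd string, err error) {
	entry, ok := domains[key]
	if !ok {
		return "", "", fmt.Errorf("failure domain %q not found", key)
	}
	ad = entry.Attributes[attrAvailabilityDomain]
	if ad == "" {
		return "", "", errors.New("failure domain has no availability domain")
	}
	return ad, entry.Attributes[attrFaultDomain], nil
}

func main() {
	// Placeholder data: entry "1" models a single-AD region, "2" a multi-AD region.
	domains := map[string]failureDomain{
		"1": {Attributes: map[string]string{attrAvailabilityDomain: "AD-1", attrFaultDomain: "FAULT-DOMAIN-1"}},
		"2": {Attributes: map[string]string{attrAvailabilityDomain: "AD-2"}},
	}
	for _, key := range []string{"1", "2", "3"} {
		ad, fd, err := resolveFailureDomain(domains, key)
		fmt.Printf("key=%s ad=%q fd=%q err=%v\n", key, ad, fd, err)
	}
}
```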

	if len(faultDomains) == 0 {
		faultDomains = []string{ociutil.DerefString(baseDetails.FaultDomain)}
	}
	opcRetryToken := ociutil.GetOPCRetryToken(string(m.OCIMachine.UID))
Member

Might need a nil check here. What happens if m.OCIMachine is nil? Maybe use ptr.ToString, I don't know.

Labels

OCA Verified All contributors have signed the Oracle Contributor Agreement.

Development

Successfully merging this pull request may close these issues.

In case of out of host capacity errors, all fault domains should be tried before backoff

3 participants