-
Notifications
You must be signed in to change notification settings - Fork 40
feat: Add Fault Domain retry logic for out of host capacity error #473
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA). To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application. When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated. If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public. |
templates/cluster-template.yaml
Outdated
| name: "${CLUSTER_NAME}" | ||
| spec: | ||
| compartmentId: "${OCI_COMPARTMENT_ID}" | ||
| region: ${OCI_REGION} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets not have this here. We have a template for alternate region already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is best to have only the relevant changes per PR if possible.
|
It looks like the oracle-contributor-agreement bot is incorrectly linking my account instead of HaoyL666. Please fix or Ethan set your git credentials locally so the author of your commits is your account instead of your macbook stack overflow thread on this |
aece866 to
9d3eb62
Compare
.gitignore
Outdated
|
|
||
| # git worktrees | ||
| worktrees/ | ||
| worktrees/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably come up with some code style tool to make sure we don't add inadvertent changes. I'm not against this change, but if we keep our PRs small and on point they are easier to review. I'm pretty sure this is a code editor change being doing automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I’ve reverted it so this PR only includes the intended changes. I will also see if i can set up a code style tool!
unit tests |
| } | ||
| resp, err := m.ComputeClient.LaunchInstance(ctx, core.LaunchInstanceRequest{ | ||
| LaunchInstanceDetails: details, | ||
| OpcRetryToken: opcRetryToken, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retrying with the same token across different fault domains may cause OCI to reject requests as duplicates. Can you confirm how this works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe append the FD to the token?
| return instanceCompartmentIDMatcher(request, "test") | ||
| })).Return(core.LaunchInstanceResponse{}, nil) | ||
| }, | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also have tests for
- Multiple fault domains with capacity issues
- Non-capacity error during retry (e.g., what if FD2 returns quota error?)
| if err == nil { | ||
| return false | ||
| } | ||
| return strings.Contains(strings.ToLower(err.Error()), strings.ToLower(OutOfHostCapacityErr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check if OCI SDK has error codes or typed errors for capacity issues. If not this is fine, but I would rather not use string matching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't find it in doc. I reached out to oci_compute, waiting for their response.
| return m.launchInstanceWithFaultDomainRetry(ctx, launchDetails, faultDomains) | ||
| } | ||
|
|
||
| func (m *MachineScope) launchInstanceWithFaultDomainRetry(ctx context.Context, baseDetails core.LaunchInstanceDetails, faultDomains []string) (*core.Instance, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets add GoDoc style comments to the functions.
| return nil, lastErr | ||
| } | ||
|
|
||
| func (m *MachineScope) buildFaultDomainLaunchList(availabilityDomain, initialFaultDomain string, retry bool) []string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename to something more clear like buildFaultDomainRetryList or something like that. Since this is being used for retry logic.
| return nil, err | ||
| } | ||
| lastErr = err | ||
| m.Logger.Info("Fault domain has run out of host capacity, retrying in a different domain", "faultDomain", fd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add some better errors around this. I like what is here, but maybe we can have a total attempts or something
m.Logger.Info("Attempting instance launch", "faultDomain", fd, "attemptNumber", attemptCount+1, "totalAttempts", len(faultDomains))
or something like that.
| } | ||
|
|
||
| func (m *MachineScope) buildFaultDomainLaunchList(availabilityDomain, initialFaultDomain string, retry bool) []string { | ||
| attempts := []string{initialFaultDomain} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is attempts here just a list of FDs? should we call it something more inline with that?
| } | ||
|
|
||
| func (m *MachineScope) resolveAvailabilityAndFaultDomain() (string, string, bool, error) { | ||
| failureDomainKey := m.Machine.Spec.FailureDomain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets stick with calling them FaultDomains here since that is what the function and all the new code calls it. Maybe even make a comment here that CAPI calls them Failure Domains, but OCI calls them FaultDomains
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could keep using Failure Domain as it contains AD/FD info. In case of single AD regions, the failure domain will be fault domain, in case of multi Ad regions, it will be AD. So we just get what we need from Failure domain. Not sure if this make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is fine then lets rename resolveAvailabilityAndFaultDomain to stick with FailureDomain. I just want to be as clear as we can be.
| if len(faultDomains) == 0 { | ||
| faultDomains = []string{ociutil.DerefString(baseDetails.FaultDomain)} | ||
| } | ||
| opcRetryToken := ociutil.GetOPCRetryToken(string(m.OCIMachine.UID)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might need a check if. What happens if m.OCIMAchine is nil? Maybe use ptr.ToString IDK






What this PR does / why we need it:
In case of out of host capacity errors, all fault domains will be tried before backoff.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)format, will close the issue(s) when PR gets merged):Fixes #133