feat:operator:kcc: MultiClusterLease integration #5420
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed: ed9b767 → ff8b3e1
Force-pushed: 1277beb → 8fe8d41
Signed-off-by: Alex Pana <[email protected]>
Signed-off-by: Alex Pana <[email protected]>
Force-pushed: 8fe8d41 → ee7d7b7
Force-pushed: 64b7a95 → 11aff87
Signed-off-by: Alex Pana <[email protected]>
Force-pushed: 11aff87 → cfe58c7
  
> # todo
> - github.com/gke-labs/multicluster-leader-election
We will need to specify the license in https://github.com/gke-labs/multicluster-leader-election and follow up here.
I approved so we should be able to rebase
> // err is nil
> // todo acpana add more defense in depth here
> logging.ExitInfo("might've lost leader election")
Walter had suggested adding some defense in depth here: the kccmanager process would check its own leadership status and bail out if it is no longer the leader.
However, AFAICT we get that guarantee for free: if mgr.Start returns, even with a nil err, we are no longer the leader.
Is that actually what happens with the default lease objects? If/when we lose the lease then Start simply returns?
I think we should be able to listen for a cancel on the context used for leader election and exit when it is cancelled.
> github.com/cenkalti/backoff v2.2.1+incompatible
> github.com/fatih/color v1.18.0
> github.com/ghodss/yaml v1.0.0
> github.com/gke-labs/multicluster-leader-election v0.0.0-20250923220528-0bf41dc7fecc
All go.mod changes are coming out of there. Unfortunately, I think bumping the controller-runtime lib really upset pkg/cli/preview, as it introduces an interface change. I added an unimplemented func for now for the preview CLI.
That sounds OK. It is a good reason not to update library versions in libraries, and also a good reason to keep dependencies minimal in libraries. I wonder if the MCL library should be using dynamic.Interface or a similar dependency-light library. (Does one exist?)
Signed-off-by: Alex Pana <[email protected]>
> ManagerOptions: manager.Options{
>     Cache: cache.Options{
>         DefaultNamespaces: map[string]cache.Config{
>             scopedNamespace: {},
nit: I see this is a copy of existing code, but I'm curious: shouldn't this be `scopedNamespace: scopedNamespace` so it takes the flag value?
Actually, I think this is a great catch! I tried to do away with that level of indirection (the newManager func), but I will revert that refactor to keep things less confusing and make sure I don't miss things like this!
Let's not conflate.
> // Add --multi-cluster-election=true flag
> args, _, _ := unstructured.NestedStringSlice(container, "args")
> args = append(args, "--multi-cluster-election=true")
QQ: just for self-education purposes, how will this flag be used to enable the feature? Asking because I don't see us passing it anywhere, and it seems like we check the existence of `cc.Spec.Experiments.LeaderElection` to determine whether to enable the feature (lines 134-148 in kccmanager.go).
> It's recommended to use `googleServiceAccount` when running ConfigConnector in Google Kubernetes Engine (GKE) clusters with Workload Identity enabled.
> This field cannot be specified together with `googleServiceAccount`.
> type: string
> experiments:
Is this going to be per-namespace or per-cluster?
Seems like we should start at the per-cluster level. I think per-namespace (or sharding) is more complicated to build, maintain, configure, and understand. Things like the syncer also get more complicated. Let's ship the easier version and then work on it.
> Namespace string `json:"namespace"`
> // The unique name of the global lock, which must be shared by all KCC replicas
> // and workloads across all clusters that are part of the same election.
> GlobalLockName string `json:"globalLockName"`
We could allow limited templating in this name, e.g. gs://my-cluster-locks/kcc/prod/{namespace}
> func (r *Reconciler) applyMultiClusterLeaderElection(ctx context.Context, obj *manifest.Objects) error {
>     log := log.FromContext(ctx)
>     for _, item := range obj.Items {
>         if item.Kind != "StatefulSet" || !strings.HasPrefix(item.GetName(), "cnrm-controller-manager") {
Nit: we might want to extract this if statement into a shared function, e.g. IsControllerManagerStatefulSet. It's a little less fragile / more maintainable, and it documents more clearly what we are doing.
> log.Info("enabling multi-cluster leader election for StatefulSet", "name", item.GetName())
> if err := item.MutateContainers(func(container map[string]interface{}) error {
>     name, found, _ := unstructured.NestedString(container, "name")
>     if !found || name != "manager" {
Nit: if !found, then name will be empty, so you don't need to check found
> "sigs.k8s.io/controller-runtime/pkg/manager"
>
> // Register direct controllers
> _ "github.com/GoogleCloudPlatform/k8s-config-connector/pkg/controller/direct"
I think you need this (?). If we don't, that's interesting, and it should be a separate PR, I think.
> // This serves as the entry point for the in-cluster main and the Borg service main. Any changes made should be done
> // with care.
> func New(ctx context.Context, restConfig *rest.Config, cfg Config) (manager.Manager, error) {
>     krmtotf.SetUserAgentForTerraformProvider()
Also looks unrelated to the MCL integration (?)
> return nil, fmt.Errorf("error adding schemes: %w", err)
> }
>
> // Create a temporary client to read the ConfigConnector object for leader election config.
Hmmm, this is clever but I just don't know. We should at least split it out into its own function to avoid accidentally reusing the client.
I think because we aren't going to watch this object, we should rely on the operator to manage the restarts / rollout.
Maybe invent something like a JDBC-style URL syntax for the leader election config to make it a one-liner, e.g.
gs://<bucket>/<path>?timeout=15s&namespace=foo
I'm also not sure who should create the lease objects; should we simply accept a pointer to a pre-configured lease object?