Collaborator

@acpana acpana commented Oct 13, 2025

@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@acpana acpana force-pushed the acpana/integrate-mcl branch 2 times, most recently from ed9b767 to ff8b3e1 Compare October 20, 2025 19:28
@acpana acpana force-pushed the acpana/integrate-mcl branch 2 times, most recently from 1277beb to 8fe8d41 Compare October 21, 2025 22:37
@acpana acpana force-pushed the acpana/integrate-mcl branch from 8fe8d41 to ee7d7b7 Compare October 21, 2025 22:52
@acpana acpana changed the title [WIP]: feat:operator: MCL integration feat:operator: MCL integration Oct 21, 2025
@acpana acpana force-pushed the acpana/integrate-mcl branch 3 times, most recently from 64b7a95 to 11aff87 Compare October 21, 2025 23:50
Signed-off-by: Alex Pana <[email protected]>
@acpana acpana force-pushed the acpana/integrate-mcl branch from 11aff87 to cfe58c7 Compare October 22, 2025 00:00
Comment on lines +45 to +46
# todo
- github.com/gke-labs/multicluster-leader-election
Collaborator Author

will need to specify the license in https://github.com/gke-labs/multicluster-leader-election and follow up here

Collaborator

I approved so we should be able to rebase

Comment on lines +161 to +163
// err is nil
// todo acpana add more defense in depth here
logging.ExitInfo("might've lost leader election")
Collaborator Author

Walter had suggested adding some defense in depth here, where we, as the kccmanager process, check our leadership status and bail out if we are no longer the leader.

However, AFAICT, we get that guarantee for free: if mgr.Start returns and err is nil, we are no longer the leader.

Collaborator

Is that actually what happens with the default lease objects? If/when we lose the lease then Start simply returns?

Collaborator

I think we should be able to listen for a cancel on the context used for leader election and exit when it is called.

github.com/cenkalti/backoff v2.2.1+incompatible
github.com/fatih/color v1.18.0
github.com/ghodss/yaml v1.0.0
github.com/gke-labs/multicluster-leader-election v0.0.0-20250923220528-0bf41dc7fecc
Collaborator Author

@acpana Oct 22, 2025

All go mod changes are coming out of there. Unfortunately, I think bumping the controller-runtime lib really upset pkg/cli/preview, as it introduces an interface change. I added an unimplemented func for now for the preview cli.

Collaborator

That sounds OK. It is a good reason not to update library versions in libraries. Also a good reason to keep dependencies minimal in libraries. I wonder if the MCL library should be using dynamic.Interface or a similar dependency-light library. (Does one exist?)

Signed-off-by: Alex Pana <[email protected]>
@acpana acpana marked this pull request as ready for review October 22, 2025 20:02
@acpana acpana changed the title feat:operator: MCL integration feat:operator:kcc: MCL integration Oct 22, 2025
ManagerOptions: manager.Options{
	Cache: cache.Options{
		DefaultNamespaces: map[string]cache.Config{
			scopedNamespace: {},
Collaborator

@gemmahou Oct 22, 2025

nit: I see this is a copy of existing code, but I'm curious: shouldn't this be scopedNamespace: scopedNamespace so it takes the flag value?

Collaborator Author

Actually, no, I think this is a great catch! I tried to do away with that level of indirection (the newManager func), but I will revert that refactor to keep things less confusing and make sure I don't miss things like this!

Collaborator

Let's not conflate
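For reference on the map literal in question: in a Go composite literal for a map, the key is an expression, so `scopedNamespace: {}` is keyed by the value held in the variable `scopedNamespace`, not by the identifier's name. A small sketch, with `cacheConfig` as a hypothetical stand-in for controller-runtime's `cache.Config`:

```go
package main

import "fmt"

// cacheConfig is a hypothetical stand-in for cache.Config.
type cacheConfig struct{}

// defaultNamespaces builds a map keyed by the value of scopedNamespace.
func defaultNamespaces(scopedNamespace string) map[string]cacheConfig {
	return map[string]cacheConfig{
		scopedNamespace: {},
	}
}

func main() {
	m := defaultNamespaces("team-a")
	_, byValue := m["team-a"]
	_, byName := m["scopedNamespace"]
	fmt.Println(byValue, byName) // prints "true false"
}
```

This only illustrates the language semantics; whether the variable in the refactored code actually carried the flag value is the point raised in the thread.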


// Add --multi-cluster-election=true flag
args, _, _ := unstructured.NestedStringSlice(container, "args")
args = append(args, "--multi-cluster-election=true")
Collaborator

QQ, just for my own education: how will this flag be used to enable the feature? I ask because I don't see us pass it anywhere, and it seems we check for the existence of "cc.Spec.Experiments.LeaderElection" to determine whether to enable the feature (lines 134-148 in kccmanager.go).
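A hedged sketch of what end-to-end wiring for such a flag could look like, using only the standard library. `ensureArg` and `parseElectionFlag` are hypothetical helpers; nothing here claims the manager binary in this PR defines or reads the flag, which is the question being asked.

```go
package main

import (
	"flag"
	"fmt"
)

const electionFlag = "--multi-cluster-election=true"

// ensureArg appends arg only if it is not already present, so applying the
// mutation more than once does not duplicate the flag.
func ensureArg(args []string, arg string) []string {
	for _, a := range args {
		if a == arg {
			return args
		}
	}
	return append(args, arg)
}

// parseElectionFlag shows one way a binary could consume the flag.
// The flag definition here is an assumption for illustration.
func parseElectionFlag(args []string) (bool, error) {
	fs := flag.NewFlagSet("manager", flag.ContinueOnError)
	enabled := fs.Bool("multi-cluster-election", false, "enable multi-cluster leader election")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *enabled, nil
}

func main() {
	args := ensureArg(nil, electionFlag)
	args = ensureArg(args, electionFlag) // second call is a no-op
	enabled, err := parseElectionFlag(args)
	fmt.Println(len(args), enabled, err) // prints "1 true <nil>"
}
```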

It's recommended to use `googleServiceAccount` when running ConfigConnector in Google Kubernetes Engine (GKE) clusters with Workload Identity enabled.
This field cannot be specified together with `googleServiceAccount`.
type: string
experiments:
Collaborator

Is this going to be per-namespace or per-cluster?

Collaborator

Seems like we should start at the per-cluster level. I think namespace (or sharding) is more complicated to build, maintain, configure and understand. Things like the syncer also get more complicated. Let's ship the easier version and then work on it.

Namespace string `json:"namespace"`
// The unique name of the global lock, which must be shared by all KCC replicas
// and workloads across all clusters that are part of the same election.
GlobalLockName string `json:"globalLockName"`
Collaborator

We could allow limited templating in this name, e.g. gs://my-cluster-locks/kcc/prod/{namespace}
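A minimal sketch of the limited templating idea, assuming `{namespace}` is the only supported placeholder. `expandLockName` is a hypothetical helper, not part of the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// expandLockName substitutes {namespace} and rejects any other placeholder,
// so a typo does not silently end up in the lock name.
func expandLockName(tmpl, namespace string) (string, error) {
	out := strings.ReplaceAll(tmpl, "{namespace}", namespace)
	if strings.ContainsAny(out, "{}") {
		return "", fmt.Errorf("unsupported placeholder in lock name %q", tmpl)
	}
	return out, nil
}

func main() {
	name, err := expandLockName("gs://my-cluster-locks/kcc/prod/{namespace}", "team-a")
	fmt.Println(name, err) // prints "gs://my-cluster-locks/kcc/prod/team-a <nil>"
}
```

Keeping the placeholder set closed keeps the name predictable, which matters because all replicas in the same election must agree on it.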

func (r *Reconciler) applyMultiClusterLeaderElection(ctx context.Context, obj *manifest.Objects) error {
	log := log.FromContext(ctx)
	for _, item := range obj.Items {
		if item.Kind != "StatefulSet" || !strings.HasPrefix(item.GetName(), "cnrm-controller-manager") {
Collaborator

@justinsb Oct 23, 2025

Nit: we might want to extract this if statement into a shared function, it's both a little less fragile / more maintainable and it documents more clearly what we are doing.

e.g. IsControllerManagerStatefulSet
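A sketch of the suggested helper. To stay self-contained it takes the kind and name as plain strings; a real version would presumably accept the manifest object, and the exact signature is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// controllerManagerPrefix is the name prefix used in the diff above.
const controllerManagerPrefix = "cnrm-controller-manager"

// isControllerManagerStatefulSet reports whether the object is one of the
// controller manager StatefulSets (hypothetical signature).
func isControllerManagerStatefulSet(kind, name string) bool {
	return kind == "StatefulSet" && strings.HasPrefix(name, controllerManagerPrefix)
}

func main() {
	fmt.Println(isControllerManagerStatefulSet("StatefulSet", "cnrm-controller-manager-abc")) // prints "true"
	fmt.Println(isControllerManagerStatefulSet("Deployment", "cnrm-controller-manager"))      // prints "false"
}
```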

log.Info("enabling multi-cluster leader election for StatefulSet", "name", item.GetName())
if err := item.MutateContainers(func(container map[string]interface{}) error {
	name, found, _ := unstructured.NestedString(container, "name")
	if !found || name != "manager" {
Collaborator

Nit: if !found, then name will be empty, so you don't need to check found
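To illustrate the nit: when the lookup fails the returned string is the zero value, so comparing the name alone is enough. `nestedString` below is a simplified, one-level stand-in for `unstructured.NestedString` (it drops the error return), written only for this sketch.

```go
package main

import "fmt"

// nestedString is a simplified stand-in for unstructured.NestedString:
// it returns "" and false when the key is missing or not a string.
func nestedString(obj map[string]interface{}, key string) (string, bool) {
	s, ok := obj[key].(string)
	return s, ok
}

// isManagerContainer ignores found, because a missing name yields "",
// which can never equal "manager".
func isManagerContainer(container map[string]interface{}) bool {
	name, _ := nestedString(container, "name")
	return name == "manager"
}

func main() {
	fmt.Println(isManagerContainer(map[string]interface{}{"name": "manager"})) // prints "true"
	fmt.Println(isManagerContainer(map[string]interface{}{}))                  // prints "false"
}
```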

"sigs.k8s.io/controller-runtime/pkg/manager"

// Register direct controllers
_ "github.com/GoogleCloudPlatform/k8s-config-connector/pkg/controller/direct"
Collaborator

I think you need this (?). If we don't, that's interesting and it should be a separate PR I think

// This serves as the entry point for the in-cluster main and the Borg service main. Any changes made should be done
// with care.
func New(ctx context.Context, restConfig *rest.Config, cfg Config) (manager.Manager, error) {
	krmtotf.SetUserAgentForTerraformProvider()
Collaborator

Also looks unrelated to the MCL integration (?)

return nil, fmt.Errorf("error adding schemes: %w", err)
}

// Create a temporary client to read the ConfigConnector object for leader election config.
Collaborator

Hmmm, this is clever but I just don't know. We should at least split it out into its own function to avoid accidentally reusing the client.

I think because we aren't going to watch this object, we should rely on the operator to manage the restarts / rollout.

Maybe invent something like a JDBC URL for leader election syntax to make it a one liner, e.g.

gs://<bucket>/<path>?timeout=15s&namespace=foo

I'm also not sure who should create the lease objects, should we simply accept a pointer to a pre-configured lease object?

@cheftako cheftako changed the title feat:operator:kcc: MCL integration feat:operator:kcc: MultiClusterLease integration Oct 24, 2025