Skip to content

refactor(e2e): add DAG-based concurrent task execution for cluster setup#8149

Open
r2k1 wants to merge 22 commits intomainfrom
r2k1/dag
Open

refactor(e2e): add DAG-based concurrent task execution for cluster setup#8149
r2k1 wants to merge 22 commits intomainfrom
r2k1/dag

Conversation

@r2k1
Copy link
Contributor

@r2k1 r2k1 commented Mar 21, 2026

What this PR does / why we need it:

Adds a new e2e/dag package that provides a lightweight, type-safe DAG execution library using closures and generics, then refactors prepareCluster to use it for concurrent cluster setup.

Artur Khantimirov and others added 16 commits March 20, 2026 13:54
Implements Execute() which discovers the DAG, validates for cycles,
and runs tasks concurrently respecting dependency order. Supports
CancelDependents (skip downstream on failure) and CancelAll (cancel
context) error strategies, plus MaxConcurrency semaphore limiting.
Adds integration tests mirroring the complete spec example
(CreateRG → CreateVNet → CreateSubnet → CreateCluster → RunTests →
Teardown), plus tests for transitive dependency access, mid-pipeline
failure with CancelDependents, shared tasks across independent
subgraphs, and empty graph. Removes unused dead code that triggered
go vet copylock warning.
- Replace 9-param notifyDependents with runState struct methods
- Eliminate redundant `failed` map — derive from results directly
- Collapse 3 near-identical early-exit blocks into abort() helper
- Launch dependents outside the mutex (collect-then-launch)
- Remove spurious mu.Lock after wg.Wait (all goroutines done)
- Sort DAGError.Error() output for deterministic error messages
- Add ErrorStrategy.String() for consistency with TaskStatus
- Remove dead test code (unused shared var, dead twoDepTask type)
- Extract buildSpecDAG() helper to reduce copy-paste in tests
- Reduce pre-canceled context test timeout from 2s to 100ms
Add e2e/dag package — a lightweight, type-safe DAG executor using
Go generics. Two verbs: Go (returns value) and Run (side-effect),
with numbered variants (Go1-Go3, Run1-Run3) for typed dependency
injection.

Replace the sequential prepareCluster implementation with a DAG
version that runs independent tasks (bastion, subnet, kube, identity,
firewall, garbage collection, etc.) concurrently after cluster
creation completes.

Also fix pre-existing fmt.Sprintf %%w usage in config/config.go.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the clusterSetup struct and its 12 one-liner methods with
inline closures. Each closure is 1-2 lines binding the cluster local
to the real function call. Everything reads top-to-bottom in one place.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move cluster creation back into the DAG so all tasks use typed
dependency injection. Where function signatures match exactly
(getOrCreateBastion, getClusterKubeletIdentity, collectGarbageVMSS),
pass them directly without closures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract named helpers for conditional tasks (configureFirewall,
configureNetworkIsolation, setupACR, ensureDebugDaemonsets) and
update function signatures (getClusterSubnetID, getClusterKubeClient,
extractClusterParameters) to accept *ManagedCluster directly so they
can be passed to the DAG without closures.

prepareCluster now reads as a pure declarative DAG with no inline
anonymous functions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- addFirewallRules/addNetworkIsolatedSettings: derive location from
  cluster instead of taking it as a param. Use conditional DAG
  registration instead of runtime checks.
- addPrivateAzureContainerRegistry: derive resourceGroupName from
  cluster.Location internally.
- ensureMaintenanceConfiguration: replaces getOrCreate wrapper,
  returns error only (value was never used).
- getClusterSubnetID/getClusterKubeClient: take *ManagedCluster
  instead of extracted strings.
- GoN/RunN: accept optional extra ...Dep barrier deps so typed
  dep tasks can also wait on untyped barriers (e.g. ACR effects).

Remaining helpers (newClusterTask, addACRTask, addDebugDaemonsets)
exist because they genuinely need parameter binding that can't be
absorbed into the underlying functions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull cluster creation out of the DAG. Use generic bind/bindRun
helpers to pass functions directly for tasks that only need cluster.

Remaining factory helpers (addACR, ensureDebugDaemonsets,
extractClusterParams) bind cluster to functions that also receive
DAG-provided values (kube, identity) — these can't be eliminated
without putting cluster back in the DAG.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@r2k1 r2k1 requested review from sulixu and surajssd as code owners March 21, 2026 20:32
@r2k1 r2k1 changed the base branch from master to main March 21, 2026 20:33
Copilot AI review requested due to automatic review settings March 21, 2026 20:35

This comment was marked as outdated.

Copilot AI review requested due to automatic review settings March 21, 2026 20:43
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

e2e/cluster.go Outdated
if errors.As(err, &azErr) && azErr.StatusCode == 404 {
return createNewMaintenanceConfiguration(ctx, cluster)
_, err = createNewMaintenanceConfiguration(ctx, cluster)
return err
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the 404 path, ensureMaintenanceConfiguration returns the raw error from createNewMaintenanceConfiguration without adding cluster context, while the non-404 path wraps errors with the cluster name. Wrap the 404/create error similarly so failures are attributable to the cluster being prepared.

Suggested change
return err
if err != nil {
return fmt.Errorf("failed to create maintenance configuration 'default' for cluster %q: %w", *cluster.Name, err)
}
return nil

Copilot uses AI. Check for mistakes.
Comment on lines +102 to +107
func (g *Group) recordError(err error) {
g.mu.Lock()
g.errs = append(g.errs, err)
g.mu.Unlock()
g.cancel()
}
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Group.recordError records every non-nil error, including context.Canceled/context.DeadlineExceeded from tasks that are only exiting because another task failed and canceled the group. In failure cases this can flood DAGError with cancellation noise and obscure the root cause. Consider suppressing context cancellation errors once the group has already observed a “real” failure (or only recording the first error).

Copilot uses AI. Check for mistakes.
Comment on lines +114 to +138
// launch runs fn in a new goroutine after all deps complete.
// If any dep failed or ctx is cancelled, onSkip is called instead of fn.
func (g *Group) launch(deps []Dep, fn func(), onSkip func()) {
g.wg.Add(1)
go func() {
defer g.wg.Done()

for _, d := range deps {
d.wait()
}

for _, d := range deps {
if d.failed() {
onSkip()
return
}
}

if g.ctx.Err() != nil {
onSkip()
return
}

fn()
}()
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The executor blocks on Dep.wait() with no cycle detection. If a cycle is accidentally introduced (e.g., via wiring mistakes), Wait() will hang indefinitely. Consider adding a lightweight cycle check (even just in debug/tests), or documenting this sharp edge prominently so future uses don’t create non-obvious deadlocks.

Copilot uses AI. Check for mistakes.
)

func getClusterKubeClient(ctx context.Context, resourceGroupName, clusterName string) (*Kubeclient, error) {
func getClusterKubeClient(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (*Kubeclient, error) {
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getClusterKubeClient now dereferences cluster.Location and cluster.Name without validating the input. Since the signature changed from (resourceGroupName, clusterName string) to (*ManagedCluster), it would be safer to add nil checks for cluster, cluster.Location, and cluster.Name and return a descriptive error instead of panicking.

Suggested change
func getClusterKubeClient(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (*Kubeclient, error) {
func getClusterKubeClient(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (*Kubeclient, error) {
if cluster == nil {
return nil, fmt.Errorf("managed cluster is nil")
}
if cluster.Location == nil || *cluster.Location == "" {
return nil, fmt.Errorf("managed cluster location is nil or empty")
}
if cluster.Name == nil || *cluster.Name == "" {
return nil, fmt.Errorf("managed cluster name is nil or empty")
}

Copilot uses AI. Check for mistakes.
}

func getClusterSubnetID(ctx context.Context, mcResourceGroupName string) (string, error) {
func getClusterSubnetID(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (string, error) {
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getClusterSubnetID now relies on cluster.Properties.NodeResourceGroup and dereferences it directly. With the new *ManagedCluster signature, add guards for cluster == nil, cluster.Properties == nil, and cluster.Properties.NodeResourceGroup == nil to avoid panics and return a clearer error when cluster metadata is incomplete.

Suggested change
func getClusterSubnetID(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (string, error) {
func getClusterSubnetID(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (string, error) {
if cluster == nil {
return "", fmt.Errorf("managed cluster is nil")
}
if cluster.Properties == nil {
return "", fmt.Errorf("managed cluster properties are nil")
}
if cluster.Properties.NodeResourceGroup == nil || *cluster.Properties.NodeResourceGroup == "" {
return "", fmt.Errorf("managed cluster NodeResourceGroup is nil or empty")
}

Copilot uses AI. Check for mistakes.
Comment on lines +91 to +101
if !isNetworkIsolated {
dag.Run(g, func(ctx context.Context) error { return addFirewallRules(ctx, cluster) })
}
if isNetworkIsolated {
if err := addNetworkIsolatedSettings(ctx, cluster, *cluster.Location); err != nil {
return nil, fmt.Errorf("add network isolated settings: %w", err)
}
}
if !isNetworkIsolated { // network isolated cluster blocks all egress via NSG
if err := addFirewallRules(ctx, cluster, *cluster.Location); err != nil {
return nil, fmt.Errorf("add firewall rules: %w", err)
}
dag.Run(g, func(ctx context.Context) error { return addNetworkIsolatedSettings(ctx, cluster) })
}
needACR := isNetworkIsolated || attachPrivateAcr
acrNonAnon := dag.Run2(g, kube, identity, addACR(cluster, needACR, true))
acrAnon := dag.Run2(g, kube, identity, addACR(cluster, needACR, false))
dag.Run1(g, kube, ensureDebugDaemonsets(cluster, isNetworkIsolated), acrNonAnon, acrAnon)
extract := dag.Go1(g, kube, extractClusterParams(cluster))
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prepareCluster no longer enforces the previous ordering where network changes (firewall rules / network-isolated subnet+NSG updates) completed before EnsureDebugDaemonsets runs. Because the daemonset creation triggers pod scheduling/image pulls, running it concurrently with route table/NSG updates can introduce e2e flakiness. Consider capturing the firewall/NSG task(s) as Effect values and adding them as explicit dependencies to the Run1(... ensureDebugDaemonsets ...) task (and possibly any other k8s API tasks that assume steady node egress).

Copilot uses AI. Check for mistakes.
- Fix package doc to reference correct API names (Go/Run not Spawn/Do)
- Fix section comments in tests to match current API names
- Wait() now surfaces ctx.Err() when parent context is cancelled
- TestParentContextCancelled asserts error is context.Canceled
- ensureMaintenanceConfiguration 404 path wraps error with cluster context
- Firewall/NSG tasks are now dependencies of ensureDebugDaemonsets
Add comprehensive test coverage including:
- Go1-Go3/Run1-Run3 success and skip-on-dep-failure paths
- Diamond topology, transitive skip, cancellation noise
- Parent context cancellation and deadline exceeded
- DAGError formatting (single and multiple errors)
- Effect as dependency, empty group, Result.Get/MustGet
- Cycle behavior documentation (typed API compile-time safety,
  untyped API deadlock, self-dependency deadlock)

Clean up test comments to keep tests concise.
Copilot AI review requested due to automatic review settings March 22, 2026 03:08
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Comment on lines +454 to 456
func getClusterSubnetID(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (string, error) {
mcResourceGroupName := *cluster.Properties.NodeResourceGroup
pager := config.Azure.VNet.NewListPager(mcResourceGroupName, nil)
Copy link

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new *ManagedCluster signature, this function dereferences cluster.Properties.NodeResourceGroup without checking cluster/Properties/NodeResourceGroup for nil. This can panic; add nil checks and return an error when required fields are missing.

Copilot uses AI. Check for mistakes.
e2e/dag/dag.go Outdated
Comment on lines +177 to +179
// - Inside Then/ThenDo callbacks (the scheduler guarantees deps succeeded)
// - Inside Spawn/Do callbacks when the Result is listed as a dep
// - After [Group.Wait] returned nil
Copy link

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on Result.MustGet references APIs that don't exist in this package ("Then/ThenDo", "Spawn/Do"). This is misleading for users of dag. Update the comment to reference the actual APIs here (Go/Run/Go1–Go3/Run1–Run3) and when MustGet is safe.

Suggested change
// - Inside Then/ThenDo callbacks (the scheduler guarantees deps succeeded)
// - Inside Spawn/Do callbacks when the Result is listed as a dep
// - After [Group.Wait] returned nil
// - Inside task functions started via [Go] or [Run] when this Result is
// listed as a dependency (the scheduler only starts the task after all
// deps have completed successfully).
// - Inside task functions started via [Go1], [Go2], [Go3], [Run1], [Run2],
// or [Run3]; these functions are never invoked if their dependencies fail.
// - After [Group.Wait] has returned nil (all tasks succeeded).

Copilot uses AI. Check for mistakes.
// enforces correct wiring. When untyped dependencies are used (Go/Run
// with variadic deps), values are accessed via [Result.MustGet] inside
// the closure.
//
Copy link

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Package docs state tasks form a directed acyclic graph, but the untyped Go/Run API can create cycles (or never-completing deps) that will deadlock Wait() (also covered by tests). Please document this limitation prominently in the package comment and/or add runtime cycle detection for the untyped API.

Suggested change
//
//
// Note: The executor conceptually treats tasks as forming a DAG, but the
// untyped Go/Run variants do not perform any static or runtime cycle
// detection. It is the caller's responsibility to ensure that dependency
// lists are acyclic and that all dependencies eventually complete. If a
// cycle is introduced, or a dependency never finishes, the affected tasks
// will never start or complete and [Group.Wait] may block indefinitely.
//

Copilot uses AI. Check for mistakes.
Comment on lines 138 to 146
@@ -152,7 +146,7 @@ func getClusterKubeletIdentity(cluster *armcontainerservice.ManagedCluster) (*ar
return kubeletIdentity, nil
Copy link

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getClusterKubeletIdentity takes a ctx parameter but does not use it, which will fail compilation in Go (unused parameter). Either remove the ctx parameter from the signature or use it (e.g., for logging/telemetry) consistently across callers.

Copilot uses AI. Check for mistakes.
- Add panic recovery in launch() so panics in task functions are
  captured as errors instead of crashing the process.
- Unify onSkip/onPanic into a single onFail(error) callback.
- Update MustGet docs to reference actual API names.
- Document cycle-deadlock limitation in package comment.
- Add tests for panic in Go, Run, and downstream skip after panic.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants