refactor(e2e): add DAG-based concurrent task execution for cluster setup by r2k1 · Pull Request #8149 · Azure/AgentBaker

r2k1 · 2026-03-21T20:32:50Z

What this PR does / why we need it:

Adds a new e2e/dag package that provides a lightweight, type-safe DAG execution library using closures and generics, then refactors prepareCluster to use it for concurrent cluster setup.

Implements Execute() which discovers the DAG, validates for cycles, and runs tasks concurrently respecting dependency order. Supports CancelDependents (skip downstream on failure) and CancelAll (cancel context) error strategies, plus MaxConcurrency semaphore limiting.

Adds integration tests mirroring the complete spec example (CreateRG → CreateVNet → CreateSubnet → CreateCluster → RunTests → Teardown), plus tests for transitive dependency access, mid-pipeline failure with CancelDependents, shared tasks across independent subgraphs, and empty graph. Removes unused dead code that triggered go vet copylock warning.

- Replace 9-param notifyDependents with runState struct methods - Eliminate redundant `failed` map — derive from results directly - Collapse 3 near-identical early-exit blocks into abort() helper - Launch dependents outside the mutex (collect-then-launch) - Remove spurious mu.Lock after wg.Wait (all goroutines done) - Sort DAGError.Error() output for deterministic error messages - Add ErrorStrategy.String() for consistency with TaskStatus - Remove dead test code (unused shared var, dead twoDepTask type) - Extract buildSpecDAG() helper to reduce copy-paste in tests - Reduce pre-canceled context test timeout from 2s to 100ms

Add e2e/dag package — a lightweight, type-safe DAG executor using Go generics. Two verbs: Go (returns value) and Run (side-effect), with numbered variants (Go1-Go3, Run1-Run3) for typed dependency injection. Replace the sequential prepareCluster implementation with a DAG version that runs independent tasks (bastion, subnet, kube, identity, firewall, garbage collection, etc.) concurrently after cluster creation completes. Also fix pre-existing fmt.Sprintf %%w usage in config/config.go. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace the clusterSetup struct and its 12 one-liner methods with inline closures. Each closure is 1-2 lines binding the cluster local to the real function call. Everything reads top-to-bottom in one place. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Move cluster creation back into the DAG so all tasks use typed dependency injection. Where function signatures match exactly (getOrCreateBastion, getClusterKubeletIdentity, collectGarbageVMSS), pass them directly without closures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Extract named helpers for conditional tasks (configureFirewall, configureNetworkIsolation, setupACR, ensureDebugDaemonsets) and update function signatures (getClusterSubnetID, getClusterKubeClient, extractClusterParameters) to accept *ManagedCluster directly so they can be passed to the DAG without closures. prepareCluster now reads as a pure declarative DAG with no inline anonymous functions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- addFirewallRules/addNetworkIsolatedSettings: derive location from cluster instead of taking it as a param. Use conditional DAG registration instead of runtime checks. - addPrivateAzureContainerRegistry: derive resourceGroupName from cluster.Location internally. - ensureMaintenanceConfiguration: replaces getOrCreate wrapper, returns error only (value was never used). - getClusterSubnetID/getClusterKubeClient: take *ManagedCluster instead of extracted strings. - GoN/RunN: accept optional extra ...Dep barrier deps so typed dep tasks can also wait on untyped barriers (e.g. ACR effects). Remaining helpers (newClusterTask, addACRTask, addDebugDaemonsets) exist because they genuinely need parameter binding that can't be absorbed into the underlying functions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Pull cluster creation out of the DAG. Use generic bind/bindRun helpers to pass functions directly for tasks that only need cluster. Remaining factory helpers (addACR, ensureDebugDaemonsets, extractClusterParams) bind cluster to functions that also receive DAG-provided values (kube, identity) — these can't be eliminated without putting cluster back in the DAG. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…kages

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Copilot · 2026-03-21T20:48:08Z

e2e/cluster.go

 	if errors.As(err, &azErr) && azErr.StatusCode == 404 {
-		return createNewMaintenanceConfiguration(ctx, cluster)
+		_, err = createNewMaintenanceConfiguration(ctx, cluster)
+		return err


In the 404 path, ensureMaintenanceConfiguration returns the raw error from createNewMaintenanceConfiguration without adding cluster context, while the non-404 path wraps errors with the cluster name. Wrap the 404/create error similarly so failures are attributable to the cluster being prepared.

Suggested change

return err

if err != nil {

return fmt.Errorf("failed to create maintenance configuration 'default' for cluster %q: %w", *cluster.Name, err)

}

return nil

Copilot · 2026-03-21T20:48:08Z

e2e/dag/dag.go

+func (g *Group) recordError(err error) {
+	g.mu.Lock()
+	g.errs = append(g.errs, err)
+	g.mu.Unlock()
+	g.cancel()
+}


Group.recordError records every non-nil error, including context.Canceled/context.DeadlineExceeded from tasks that are only exiting because another task failed and canceled the group. In failure cases this can flood DAGError with cancellation noise and obscure the root cause. Consider suppressing context cancellation errors once the group has already observed a “real” failure (or only recording the first error).

Copilot · 2026-03-21T20:48:08Z

e2e/dag/dag.go

+// launch runs fn in a new goroutine after all deps complete.
+// If any dep failed or ctx is cancelled, onSkip is called instead of fn.
+func (g *Group) launch(deps []Dep, fn func(), onSkip func()) {
+	g.wg.Add(1)
+	go func() {
+		defer g.wg.Done()
+
+		for _, d := range deps {
+			d.wait()
+		}
+
+		for _, d := range deps {
+			if d.failed() {
+				onSkip()
+				return
+			}
+		}
+
+		if g.ctx.Err() != nil {
+			onSkip()
+			return
+		}
+
+		fn()
+	}()


The executor blocks on Dep.wait() with no cycle detection. If a cycle is accidentally introduced (e.g., via wiring mistakes), Wait() will hang indefinitely. Consider adding a lightweight cycle check (even just in debug/tests), or documenting this sharp edge prominently so future uses don’t create non-obvious deadlocks.

Copilot · 2026-03-21T20:48:08Z

e2e/kube.go

 )

-func getClusterKubeClient(ctx context.Context, resourceGroupName, clusterName string) (*Kubeclient, error) {
+func getClusterKubeClient(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (*Kubeclient, error) {


getClusterKubeClient now dereferences cluster.Location and cluster.Name without validating the input. Since the signature changed from (resourceGroupName, clusterName string) to (*ManagedCluster), it would be safer to add nil checks for cluster, cluster.Location, and cluster.Name and return a descriptive error instead of panicking.

Suggested change

func getClusterKubeClient(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (*Kubeclient, error) {

func getClusterKubeClient(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (*Kubeclient, error) {

if cluster == nil {

return nil, fmt.Errorf("managed cluster is nil")

}

if cluster.Location == nil || *cluster.Location == "" {

return nil, fmt.Errorf("managed cluster location is nil or empty")

}

if cluster.Name == nil || *cluster.Name == "" {

return nil, fmt.Errorf("managed cluster name is nil or empty")

}

Copilot · 2026-03-21T20:48:09Z

e2e/kube.go

 }

-func getClusterSubnetID(ctx context.Context, mcResourceGroupName string) (string, error) {
+func getClusterSubnetID(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (string, error) {


getClusterSubnetID now relies on cluster.Properties.NodeResourceGroup and dereferences it directly. With the new *ManagedCluster signature, add guards for cluster == nil, cluster.Properties == nil, and cluster.Properties.NodeResourceGroup == nil to avoid panics and return a clearer error when cluster metadata is incomplete.

Suggested change

func getClusterSubnetID(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (string, error) {

func getClusterSubnetID(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (string, error) {

if cluster == nil {

return "", fmt.Errorf("managed cluster is nil")

}

if cluster.Properties == nil {

return "", fmt.Errorf("managed cluster properties are nil")

}

if cluster.Properties.NodeResourceGroup == nil || *cluster.Properties.NodeResourceGroup == "" {

return "", fmt.Errorf("managed cluster NodeResourceGroup is nil or empty")

}

Copilot · 2026-03-21T20:48:09Z

e2e/cluster.go

+	if !isNetworkIsolated {
+		dag.Run(g, func(ctx context.Context) error { return addFirewallRules(ctx, cluster) })
 	}
 	if isNetworkIsolated {
-		if err := addNetworkIsolatedSettings(ctx, cluster, *cluster.Location); err != nil {
-			return nil, fmt.Errorf("add network isolated settings: %w", err)
-		}
-	}
-	if !isNetworkIsolated { // network isolated cluster blocks all egress via NSG
-		if err := addFirewallRules(ctx, cluster, *cluster.Location); err != nil {
-			return nil, fmt.Errorf("add firewall rules: %w", err)
-		}
+		dag.Run(g, func(ctx context.Context) error { return addNetworkIsolatedSettings(ctx, cluster) })
 	}
+	needACR := isNetworkIsolated || attachPrivateAcr
+	acrNonAnon := dag.Run2(g, kube, identity, addACR(cluster, needACR, true))
+	acrAnon := dag.Run2(g, kube, identity, addACR(cluster, needACR, false))
+	dag.Run1(g, kube, ensureDebugDaemonsets(cluster, isNetworkIsolated), acrNonAnon, acrAnon)
+	extract := dag.Go1(g, kube, extractClusterParams(cluster))


prepareCluster no longer enforces the previous ordering where network changes (firewall rules / network-isolated subnet+NSG updates) completed before EnsureDebugDaemonsets runs. Because the daemonset creation triggers pod scheduling/image pulls, running it concurrently with route table/NSG updates can introduce e2e flakiness. Consider capturing the firewall/NSG task(s) as Effect values and adding them as explicit dependencies to the Run1(... ensureDebugDaemonsets ...) task (and possibly any other k8s API tasks that assume steady node egress).

- Fix package doc to reference correct API names (Go/Run not Spawn/Do) - Fix section comments in tests to match current API names - Wait() now surfaces ctx.Err() when parent context is cancelled - TestParentContextCancelled asserts error is context.Canceled - ensureMaintenanceConfiguration 404 path wraps error with cluster context - Firewall/NSG tasks are now dependencies of ensureDebugDaemonsets

Add comprehensive test coverage including: - Go1-Go3/Run1-Run3 success and skip-on-dep-failure paths - Diamond topology, transitive skip, cancellation noise - Parent context cancellation and deadline exceeded - DAGError formatting (single and multiple errors) - Effect as dependency, empty group, Result.Get/MustGet - Cycle behavior documentation (typed API compile-time safety, untyped API deadlock, self-dependency deadlock) Clean up test comments to keep tests concise.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Copilot · 2026-03-22T03:12:43Z

e2e/kube.go

+func getClusterSubnetID(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (string, error) {
+	mcResourceGroupName := *cluster.Properties.NodeResourceGroup
 	pager := config.Azure.VNet.NewListPager(mcResourceGroupName, nil)


With the new *ManagedCluster signature, this function dereferences cluster.Properties.NodeResourceGroup without checking cluster/Properties/NodeResourceGroup for nil. This can panic; add nil checks and return an error when required fields are missing.

Copilot · 2026-03-22T03:12:44Z

e2e/dag/dag.go

+//   - Inside Then/ThenDo callbacks (the scheduler guarantees deps succeeded)
+//   - Inside Spawn/Do callbacks when the Result is listed as a dep
+//   - After [Group.Wait] returned nil


The comment on Result.MustGet references APIs that don't exist in this package ("Then/ThenDo", "Spawn/Do"). This is misleading for users of dag. Update the comment to reference the actual APIs here (Go/Run/Go1–Go3/Run1–Run3) and when MustGet is safe.

Suggested change

// - Inside Then/ThenDo callbacks (the scheduler guarantees deps succeeded)

// - Inside Spawn/Do callbacks when the Result is listed as a dep

// - After [Group.Wait] returned nil

// - Inside task functions started via [Go] or [Run] when this Result is

// listed as a dependency (the scheduler only starts the task after all

// deps have completed successfully).

// - Inside task functions started via [Go1], [Go2], [Go3], [Run1], [Run2],

// or [Run3]; these functions are never invoked if their dependencies fail.

// - After [Group.Wait] has returned nil (all tasks succeeded).

Copilot · 2026-03-22T03:12:44Z

e2e/dag/dag.go

+// enforces correct wiring. When untyped dependencies are used (Go/Run
+// with variadic deps), values are accessed via [Result.MustGet] inside
+// the closure.
+//


Package docs state tasks form a directed acyclic graph, but the untyped Go/Run API can create cycles (or never-completing deps) that will deadlock Wait() (also covered by tests). Please document this limitation prominently in the package comment and/or add runtime cycle detection for the untyped API.

Suggested change

//

//

// Note: The executor conceptually treats tasks as forming a DAG, but the

// untyped Go/Run variants do not perform any static or runtime cycle

// detection. It is the caller's responsibility to ensure that dependency

// lists are acyclic and that all dependencies eventually complete. If a

// cycle is introduced, or a dependency never finishes, the affected tasks

// will never start or complete and [Group.Wait] may block indefinitely.

//

Copilot · 2026-03-22T03:12:44Z

e2e/cluster.go

@@ -152,7 +146,7 @@ func getClusterKubeletIdentity(cluster *armcontainerservice.ManagedCluster) (*ar
 	return kubeletIdentity, nil


getClusterKubeletIdentity takes a ctx parameter but does not use it, which will fail compilation in Go (unused parameter). Either remove the ctx parameter from the signature or use it (e.g., for logging/telemetry) consistently across callers.

- Add panic recovery in launch() so panics in task functions are captured as errors instead of crashing the process. - Unify onSkip/onPanic into a single onFail(error) callback. - Update MustGet docs to reference actual API names. - Document cycle-deadlock limitation in package comment. - Add tests for panic in Go, Run, and downstream skip after panic.

Artur Khantimirov and others added 16 commits March 20, 2026 13:54

Add initial design spec for type-safe DAG execution library

71a4102

Rename taskflow to tasks, use named Deps structs throughout

9efee3e

tasks: add core types — Task interface, Config, DAGError, TaskStatus

1e5f8bf

tasks: add graph discovery via reflection

39b21c3

tasks: add cycle detection via topological sort

b493f46

polish: improve naming, comments, and error wrapping in dag/tasks pac…

d75c164

…kages

refactor(e2e): remove unused tasks package in favor of dag package

5fea3e4

r2k1 requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, cameronmeissner, djsly, ganeshkumarashok, junjiezhang1997, lilypan26, phealy, timmy-wright, yagmurbaydogan, yewmsft and zachary-bailey as code owners March 21, 2026 20:32

r2k1 requested review from sulixu and surajssd as code owners March 21, 2026 20:32

r2k1 changed the base branch from master to main March 21, 2026 20:33

r2k1 requested review from calvin197, mxj220 and pdamianov-dev as code owners March 21, 2026 20:33

chore: remove design spec docs, revert config.go changes

4ad379a

Copilot AI review requested due to automatic review settings March 21, 2026 20:35

r2k1 temporarily deployed to test March 21, 2026 20:35 — with GitHub Actions Inactive

Copilot started reviewing on behalf of r2k1 March 21, 2026 20:35 View session

This comment was marked as outdated.

Sign in to view

refactor(e2e): remove bind/bindRun helpers, use inline closures

388a96a

r2k1 temporarily deployed to test March 21, 2026 20:41 — with GitHub Actions Inactive

docs(e2e): add note to keep prepareCluster minimal

8718b71

Copilot AI review requested due to automatic review settings March 21, 2026 20:43

r2k1 temporarily deployed to test March 21, 2026 20:43 — with GitHub Actions Inactive

Copilot started reviewing on behalf of r2k1 March 21, 2026 20:44 View session

Copilot AI reviewed Mar 21, 2026

View reviewed changes

r2k1 temporarily deployed to test March 21, 2026 21:03 — with GitHub Actions Inactive

Copilot AI review requested due to automatic review settings March 22, 2026 03:08

r2k1 temporarily deployed to test March 22, 2026 03:08 — with GitHub Actions Inactive

Copilot started reviewing on behalf of r2k1 March 22, 2026 03:09 View session

Copilot AI reviewed Mar 22, 2026

View reviewed changes

r2k1 temporarily deployed to test March 22, 2026 04:59 — with GitHub Actions Inactive

-func getClusterKubeClient(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (*Kubeclient, error) {
+func getClusterKubeClient(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (*Kubeclient, error) {
+	if cluster == nil {
+		return nil, fmt.Errorf("managed cluster is nil")
+	}
+	if cluster.Location == nil || *cluster.Location == "" {
+		return nil, fmt.Errorf("managed cluster location is nil or empty")
+	}
+	if cluster.Name == nil || *cluster.Name == "" {
+		return nil, fmt.Errorf("managed cluster name is nil or empty")
+	}

-func getClusterSubnetID(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (string, error) {
+func getClusterSubnetID(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (string, error) {
+	if cluster == nil {
+		return "", fmt.Errorf("managed cluster is nil")
+	}
+	if cluster.Properties == nil {
+		return "", fmt.Errorf("managed cluster properties are nil")
+	}
+	if cluster.Properties.NodeResourceGroup == nil || *cluster.Properties.NodeResourceGroup == "" {
+		return "", fmt.Errorf("managed cluster NodeResourceGroup is nil or empty")
+	}

-//   - Inside Then/ThenDo callbacks (the scheduler guarantees deps succeeded)
-//   - Inside Spawn/Do callbacks when the Result is listed as a dep
-//   - After [Group.Wait] returned nil
+//   - Inside task functions started via [Go] or [Run] when this Result is
+//     listed as a dependency (the scheduler only starts the task after all
+//     deps have completed successfully).
+//   - Inside task functions started via [Go1], [Go2], [Go3], [Run1], [Run2],
+//     or [Run3]; these functions are never invoked if their dependencies fail.
+//   - After [Group.Wait] has returned nil (all tasks succeeded).

-//
+//
+// Note: The executor conceptually treats tasks as forming a DAG, but the
+// untyped Go/Run variants do not perform any static or runtime cycle
+// detection. It is the caller's responsibility to ensure that dependency
+// lists are acyclic and that all dependencies eventually complete. If a
+// cycle is introduced, or a dependency never finishes, the affected tasks
+// will never start or complete and [Group.Wait] may block indefinitely.
+//

		@@ -152,7 +146,7 @@ func getClusterKubeletIdentity(cluster armcontainerservice.ManagedCluster) (ar
		return kubeletIdentity, nil

Conversation

r2k1 commented Mar 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as outdated.

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

r2k1 commented Mar 21, 2026 •

edited

Loading