Datapath tests for Long running clusters. #4142
base: master
Conversation
- Implemented scheduled pipeline running every 1 hour with persistent infrastructure
- Split test execution into 2 jobs: Create (with 20 min wait) and Delete
- Added 8 test scenarios across 2 AKS clusters, 4 VNets, and different subnets
- Implemented two-phase deletion strategy to prevent PNI ReservationInUse errors
- Added context timeouts on kubectl commands with force-delete fallbacks
- Resource naming uses the RG name as BUILD_ID for uniqueness across parallel setups
- Added SkipAutoDeleteTill tags to prevent automatic resource cleanup
- Conditional setup stages controlled by the runSetupStages parameter
- Auto-generate the RG name from the location, or allow custom names for parallel setups
- Added comprehensive README with setup instructions and troubleshooting
- Node selection by agentpool labels with usage tracking to prevent conflicts
- Kubernetes naming compliance (RFC 1123) for all resources

Follow-up commits: Fix ginkgo flag. Add datapath tests. Delete old test file. Add test cases for private endpoint. Ginkgo runs specs only on specified files. Update pipeline params. Add ginkgo tags. Add datapath tests. Add ginkgo build tags. Remove wait time. Set namespace. Update pod image. Add more NSG rules to block subnets s1 and s2. Test change. Change delegated subnet address range. Use delegated interface for network connectivity tests. Datapath test between clusters. Test. Test private endpoints. Fix private endpoint tests. Set storage account names in output var. Set storage account name. Fix PN names. Update PE. Update PE test. Update SAS token generation. Add node labels for sw2 scenario; clean up pods on any test failure. Enable NSG tests. Update storage. Add rules to NSG. Disable private endpoint negative test. Disable public network access on storage account with private endpoint. Wait for default NSG to be created. Disable negative test on private endpoint. Private endpoint depends on AKS cluster VNets; change pipeline job dependencies. Add node labels for each workload type and NIC capacity. Make SKU constant. Update README; set schedule for long-running cluster on test branch.
Pull request overview
This PR adds a comprehensive long-running test pipeline for SwiftV2 pod networking on Azure AKS. The pipeline creates persistent infrastructure (2 AKS clusters, 4 VNets, storage accounts with private endpoints, NSGs) and runs scheduled tests every 3 hours to validate pod-to-pod connectivity, network security group isolation, and private endpoint access across multi-tenant scenarios.
Key Changes:
- Adds scheduled pipeline with conditional infrastructure setup (runSetupStages parameter)
- Implements 8 pod test scenarios across 2 clusters and 4 VNets with different NIC capacities
- Includes 9 connectivity tests and 5 private endpoint tests with tenant isolation validation
Reviewed changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 8 comments.
Summary per file:
| File | Description |
|---|---|
| .pipelines/swiftv2-long-running/pipeline.yaml | Main pipeline with 3-hour scheduled trigger and runSetupStages parameter |
| .pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml | Two-stage template: setup (conditional) and datapath tests with 4 jobs |
| .pipelines/swiftv2-long-running/scripts/*.sh | Infrastructure setup scripts for AKS, VNets, storage, NSGs, and private endpoints |
| test/integration/swiftv2/longRunningCluster/datapath*.go | Test implementation split into create, connectivity, private endpoint, and delete tests |
| test/integration/swiftv2/helpers/az_helpers.go | Azure CLI and kubectl helper functions for resource management |
| test/integration/manifests/swiftv2/long-running-cluster/*.yaml | Kubernetes resource templates for PodNetwork, PNI, and Pods |
| go.mod, go.sum | Updates to support Ginkgo v2 testing framework |
| hack/aks/Makefile | Updates for SwiftV2 cluster creation with multi-tenancy tags |
| .pipelines/swiftv2-long-running/README.md | Comprehensive documentation of pipeline architecture and test scenarios |
```bash
cmd_delegator_curl="'curl -X PUT http://localhost:8080/DelegatedSubnet/$modified_custsubnet'"
cmd_containerapp_exec="az containerapp exec -n subnetdelegator-westus-u3h4j -g subnetdelegator-westus --subscription 9b8218f9-902a-4d20-a65c-e98acec5362f --command $cmd_delegator_curl"
```

Copilot AI (Nov 24, 2025)
Hardcoded credentials and subscription IDs in the script. The script contains a hardcoded subscription ID 9b8218f9-902a-4d20-a65c-e98acec5362f and references a specific container app subnetdelegator-westus-u3h4j in resource group subnetdelegator-westus. These hardcoded values make the script non-portable and could expose sensitive information. Consider parameterizing these values or using environment variables.
```bash
responseFile="response.txt"
modified_vnet="${vnet_id//\//%2F}"
cmd_stamp_curl="'curl -v -X PUT http://localhost:8080/VirtualNetwork/$modified_vnet/stampcreatorservicename'"
cmd_containerapp_exec="az containerapp exec -n subnetdelegator-westus-u3h4j -g subnetdelegator-westus --subscription 9b8218f9-902a-4d20-a65c-e98acec5362f --command $cmd_stamp_curl"
```

Copilot AI (Nov 24, 2025)
Same hardcoded credentials issue. The script contains hardcoded subscription ID 9b8218f9-902a-4d20-a65c-e98acec5362f and references to subnetdelegator-westus-u3h4j container app. Consider parameterizing these values.
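One way to do that, shown purely as a sketch: read the container app name, resource group, and subscription from environment variables supplied by the pipeline. The variable names `DELEGATOR_APP`, `DELEGATOR_RG`, and `DELEGATOR_SUBSCRIPTION` are illustrative, not part of this PR.

```bash
# Illustrative parameterization; these variables would be set as pipeline variables.
DELEGATOR_APP="${DELEGATOR_APP:?DELEGATOR_APP must be set}"
DELEGATOR_RG="${DELEGATOR_RG:?DELEGATOR_RG must be set}"
DELEGATOR_SUBSCRIPTION="${DELEGATOR_SUBSCRIPTION:?DELEGATOR_SUBSCRIPTION must be set}"

cmd_stamp_curl="'curl -v -X PUT http://localhost:8080/VirtualNetwork/$modified_vnet/stampcreatorservicename'"
cmd_containerapp_exec="az containerapp exec -n $DELEGATOR_APP -g $DELEGATOR_RG --subscription $DELEGATOR_SUBSCRIPTION --command $cmd_stamp_curl"
```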
Outdated review comments (now resolved) on:
- test/integration/swiftv2/longRunningCluster/datapath_create_test.go
- test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go
- test/integration/swiftv2/longRunningCluster/datapath_delete_test.go
- test/integration/manifests/swiftv2/long-running-cluster/pod.yaml
Co-authored-by: Copilot <[email protected]> Signed-off-by: sivakami-projects <[email protected]>
…st.go Co-authored-by: Copilot <[email protected]> Signed-off-by: sivakami-projects <[email protected]>
…st.go Co-authored-by: Copilot <[email protected]> Signed-off-by: sivakami-projects <[email protected]>
…ity_test.go Co-authored-by: Copilot <[email protected]> Signed-off-by: sivakami-projects <[email protected]>
…st.go Co-authored-by: Copilot <[email protected]> Signed-off-by: sivakami-projects <[email protected]>
/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).

/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).

/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).

/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).
jpayne3506 left a comment:
We need to do a pass on other supporting files not included in this PR.
| echo "Waiting 2 minutes for pods to fully start and HTTP servers to be ready..." | ||
| sleep 120 | ||
| echo "Wait period complete, proceeding with connectivity tests" |
Historically when we have relied on sleep it has resulted in CI/CD failures.
Removing this.
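A bounded readiness check along these lines could replace the fixed sleep. This is only a sketch: the label selector `app=swiftv2-test` and the `TEST_NAMESPACE` variable are illustrative placeholders, not names from this PR.

```bash
# Wait for the test pods to report Ready instead of sleeping a fixed 2 minutes.
# The label selector and namespace variable are illustrative placeholders.
if ! kubectl wait --for=condition=Ready pod \
    -l app=swiftv2-test -n "$TEST_NAMESPACE" --timeout=300s; then
  echo "Pods did not become Ready within 5 minutes" >&2
  exit 1
fi
echo "Pods are Ready, proceeding with connectivity tests"
```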
```yaml
# Job 3: Networking & Storage
# ------------------------------------------------------------
- job: NetworkingAndStorage
  timeoutInMinutes: 0
```
So, you never want this job to timeout? Max is 6 hours no matter what you set.
Run locally against existing infrastructure:

```bash
export RG="sv2-long-run-centraluseuap"  # Match your resource group
```
This RG naming is way beyond the max cap I typically see. Is there not a managed cluster that gets paired with this?
I don't think this is more than the max limit.
It's because the managed cluster that is created goes beyond the 80-character limit. IIRC the breaking point is close to ~20 characters due to how the name is duplicated when creating the managed RG. If it works, awesome!
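As a rough sanity check on this concern: AKS derives the managed (node) resource group name as `MC_<resourceGroup>_<clusterName>_<location>`, so a quick length check like the sketch below shows how close a given combination gets to the cap. The cluster name used here is an illustrative placeholder.

```bash
# Rough length check for the AKS-generated node resource group name,
# which defaults to MC_<resourceGroup>_<clusterName>_<location>.
RG="sv2-long-run-centraluseuap"        # example RG from the README snippet above
CLUSTER="${CLUSTER:-aks-cluster-1}"    # illustrative cluster name
LOCATION="centraluseuap"

NODE_RG="MC_${RG}_${CLUSTER}_${LOCATION}"
echo "Node resource group: $NODE_RG (${#NODE_RG} characters)"
```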
| echo "Provisioning finished with state: $state" | ||
| break | ||
| fi | ||
| sleep 6 |
Can we look at leveraging another option besides sleep?
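One option, sketched below under the assumption that the script is polling an ARM provisioningState with az (the `RESOURCE_ID` variable and the JMESPath query are illustrative placeholders), is to bound the polling loop and pause briefly between attempts rather than sleeping in an open-ended loop:

```bash
# Bounded polling instead of an unbounded sleep loop.
# RESOURCE_ID and the provisioningState query are illustrative placeholders.
max_attempts=60
for attempt in $(seq 1 "$max_attempts"); do
  state=$(az resource show --ids "$RESOURCE_ID" --query "properties.provisioningState" -o tsv)
  if [ "$state" = "Succeeded" ] || [ "$state" = "Failed" ]; then
    echo "Provisioning finished with state: $state"
    break
  fi
  echo "Attempt $attempt/$max_attempts: state=$state, retrying in 10s"
  sleep 10
done
```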
- Remove service connection from pipeline parameters
- Update tests to use ginkgo v1
- Replace ginkgo CLI with go test
- Remove fixed sleep timers
- Add MTPNC and pod status verification in test code
- Remove skip delete tags
- Clean up long-running pipeline template
/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).
- Set pipeline vars for delegator app
- Replace fixed and infinite sleeps with bounded retry loops
- Optimize kubeconfig management by fetching once and reusing across jobs
- Add retry for private endpoint IP to be available
- Remove unnecessary validation
- Cleanup
- Change kubeconfig paths
- Set kubeconfig
/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).
- **Example**: PR validation run with Build ID 12345 → `sv2-long-run-12345`

**Important Notes**:
- Always follow the naming pattern for scheduled runs on master: `sv2-long-run-<region>`
Is there any instance where you can see multiple runs leveraging this test? If so, we also need to add a unique identifier (i.e., the build ID) to the RG + cluster naming.
The purpose of this pipeline is to schedule periodic test runs on a single cluster. But yes, people could manually trigger multiple runs, in which case the build ID will be used in the RG name.
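A sketch of how that naming could branch between scheduled and manual runs, assuming the standard Azure Pipelines variables `BUILD_REASON` and `BUILD_BUILDID` are available (the `LOCATION` variable and the exact branch condition are illustrative):

```bash
# Scheduled runs share a fixed RG per region; other runs append the build ID.
if [ "$BUILD_REASON" = "Schedule" ]; then
  RG="sv2-long-run-${LOCATION}"
else
  RG="sv2-long-run-${BUILD_BUILDID}"
fi
echo "Using resource group: $RG"
```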
Every 3 hours, the pipeline:
1. Skips setup stages (infrastructure already exists)
2. **Job 1 - Create Resources**: Creates 8 test scenarios (PodNetwork, PNI, Pods with HTTP servers on port 8080)
3. **Job 2 - Connectivity Tests**: Tests HTTP connectivity between pods (9 test cases), then waits 20 minutes
Curious. What is the purpose of waiting for 20 minutes?
Updated the README to reflect the latest tests.
| elif [ "$ACTION" == "delete" ]; then | ||
| echo "Removing Storage Blob Data Contributor role from service principal" | ||
|
|
||
| for SA in $STORAGE_ACCOUNTS; do | ||
| echo "Processing storage account: $SA" | ||
| SA_SCOPE="/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.Storage/storageAccounts/${SA}" | ||
|
|
||
| ASSIGNMENT_ID=$(az role assignment list \ | ||
| --assignee "$SP_OBJECT_ID" \ | ||
| --role "Storage Blob Data Contributor" \ | ||
| --scope "$SA_SCOPE" \ | ||
| --query "[0].id" -o tsv 2>/dev/null || echo "") | ||
|
|
||
| if [ -z "$ASSIGNMENT_ID" ]; then | ||
| echo "[OK] No role assignment found for $SA (already deleted or never existed)" | ||
| continue | ||
| fi | ||
|
|
||
| az role assignment delete --ids "$ASSIGNMENT_ID" --output none \ | ||
| && echo "[OK] Role removed from service principal for $SA" \ | ||
| || echo "[WARNING] Failed to remove role for $SA (may not exist)" | ||
| done | ||
| fi | ||
|
|
||
| echo "RBAC management completed successfully." |
Can you do a sanity check at the end of the delete to confirm everything was deleted properly? Ideally, anything that would help us go through the CI/CD to confirm that RBAC is not leaking would be beneficial.
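A sketch of the kind of post-delete check being asked for, reusing the `SP_OBJECT_ID`, `STORAGE_ACCOUNTS`, and scope variables from the script above (the exit behaviour shown is an assumption about how the pipeline should surface a leak):

```bash
# Post-delete sanity check: fail loudly if any role assignment survived.
leaked=0
for SA in $STORAGE_ACCOUNTS; do
  SA_SCOPE="/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.Storage/storageAccounts/${SA}"
  remaining=$(az role assignment list \
    --assignee "$SP_OBJECT_ID" \
    --role "Storage Blob Data Contributor" \
    --scope "$SA_SCOPE" \
    --query "length(@)" -o tsv)
  if [ "$remaining" != "0" ]; then
    echo "[ERROR] $remaining role assignment(s) still present on $SA"
    leaked=1
  fi
done
[ "$leaked" -eq 0 ] && echo "[OK] No leaked RBAC assignments found" || exit 1
```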
```go
func RunConnectivityTest(test ConnectivityTest) error {
	// Get kubeconfig for the source cluster
	sourceKubeconfig := getKubeconfigPath(test.Cluster)

	// Get kubeconfig for the destination cluster (default to source cluster if not specified)
	destKubeconfig := sourceKubeconfig
	if test.DestCluster != "" {
		destKubeconfig = getKubeconfigPath(test.DestCluster)
	}

	// Get destination pod's eth1 IP (delegated subnet IP for cross-VNet connectivity).
	// This is the IP that is subject to NSG rules, not the overlay eth0 IP.
	destIP, err := helpers.GetPodDelegatedIP(destKubeconfig, test.DestNamespace, test.DestinationPod)
	if err != nil {
		return fmt.Errorf("failed to get destination pod delegated IP: %w", err)
	}

	fmt.Printf("Testing connectivity from %s/%s (cluster: %s) to %s/%s (cluster: %s, eth1: %s) on port 8080\n",
		test.SourceNamespace, test.SourcePod, test.Cluster,
		test.DestNamespace, test.DestinationPod, test.DestCluster, destIP)

	// Run curl command from source pod to destination pod using eth1 IP.
	// Using -m 3 for a 3 second timeout (short because netcat closes the connection immediately).
	// Using --interface eth1 to force traffic through the delegated subnet interface.
	// Using --http0.9 to allow HTTP/0.9 responses from netcat (which sends raw text without proper HTTP headers).
	// Exit code 28 (timeout) is OK if we received data, since netcat doesn't properly close the connection.
	curlCmd := fmt.Sprintf("curl --http0.9 --interface eth1 -m 3 http://%s:8080/", destIP)

	output, err := helpers.ExecInPod(sourceKubeconfig, test.SourceNamespace, test.SourcePod, curlCmd)
	// Check if we received data even if curl timed out (exit code 28).
	// Netcat closes the connection without a proper HTTP close, causing curl to time out,
	// but if we got the expected response, the connectivity test is successful.
```
I'd recommend looking at https://github.com/kubernetes/kubernetes/blob/c180d6762d7ac5059d9b50457cafb0d7f4cf74a9/test/e2e/framework/network/utils.go#L329-L373. Exec-ing + curl is not going to work 100% of the time. I would put up some reasonable guardrails for retries on these flaky operations.
Yes, HTTP curl is unreliable. Switched to TCP netcat tests. Also, the pods have a TCP server running on port 8080.
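For illustration, a TCP-level probe of that kind, run from inside the source pod with a bounded retry. The variable names and namespace are placeholders consistent with the snippet above, not the PR's actual test code, and whether the pod image's `nc` supports pinning the source address (`-s`) to force eth1 is an assumption worth checking.

```bash
# TCP connectivity probe with bounded retries, executed inside the source pod.
# $DEST_IP is the destination pod's eth1 (delegated subnet) address.
# If the pod's nc build supports it, add: -s "$SOURCE_ETH1_IP" to pin the source interface.
for attempt in 1 2 3; do
  if kubectl --kubeconfig "$SOURCE_KUBECONFIG" exec -n "$SOURCE_NS" "$SOURCE_POD" -- \
      nc -z -w 3 "$DEST_IP" 8080; then
    echo "TCP connectivity to $DEST_IP:8080 succeeded on attempt $attempt"
    break
  fi
  echo "Attempt $attempt failed, retrying..."
  sleep 5
done
```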
go.sum (outdated)
```
sigs.k8s.io/structured-merge-diff/v6 v6.3.0/go.mod h1:M3W8sfWvn2HhQDIbGWj3S099YozAsymCo/wrT5ohRUE=
sigs.k8s.io/yaml v1.6.0 h1:G8fkbMSAFqgEFgh4b1wmtzDnioxFCUgTZhlbj5P9QYs=
sigs.k8s.io/yaml v1.6.0/go.mod h1:796bPqUfzR/0jLAl6XjHl3Ck7MiyVv8dbTdyT3/pMf4=
sigs.k8s.io/yaml v1.6.0/go.mod h1:796bPqUfzR/0jLAl6XjHl3Ck7MiyVv8dbTdyT3/pMf4=
```
`git checkout master -- go.sum` to revert and not have to deal with your IDE.
Fetch storage account names for tests run without setup stage.
/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).
Pipeline to run repeated tests on long-running SwiftV2 AKS clusters.
Test pipeline: tests are scheduled to run every 3 hours on Central US EUAP. Link to Pipeline
Recent test run
Testing Approach
Test Lifecycle (per stage):
Create 8 pod scenarios with PodNetwork, PodNetworkInstance, Pods
Run 9 connectivity tests (HTTP-based)
Run private endpoint tests (storage access)
Delete all resources (Phase 1: Pods, Phase 2: PNI/PN/Namespaces); a sketch of this ordering follows this list
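For illustration, the two-phase ordering could look like the sketch below. The namespace list variable and the CRD resource names passed to kubectl (podnetworkinstance, podnetwork) are assumptions, and the actual deletion logic lives in datapath_delete_test.go.

```bash
# Phase 1: delete pods first so delegated-NIC reservations are released
# before their PodNetworkInstances are removed (avoids ReservationInUse).
for ns in $TEST_NAMESPACES; do
  kubectl delete pods --all -n "$ns" --timeout=120s
done

# Phase 2: with no pods left, remove PNIs, PodNetworks, and namespaces.
# The resource names below are assumed CRD names, not confirmed by this PR.
for ns in $TEST_NAMESPACES; do
  kubectl delete podnetworkinstance --all -n "$ns" --timeout=120s --ignore-not-found
  kubectl delete namespace "$ns" --timeout=120s --ignore-not-found
done
kubectl delete podnetwork --all --timeout=120s --ignore-not-found
```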
Node Selection:
Tests filter by workload-type=$WORKLOAD_TYPE AND nic-capacity labels (see the example after this list)
Ensures isolation between different workload type stages
Currently: WORKLOAD_TYPE=swiftv2-linux
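As an example, node selection by these labels could be checked with a selector like the one below; the nic-capacity value shown is a placeholder, since the actual values are assigned during cluster setup.

```bash
# List nodes matching the workload-type and nic-capacity labels used by the tests.
# The nic-capacity value here is a placeholder.
WORKLOAD_TYPE="swiftv2-linux"
kubectl get nodes -l "workload-type=${WORKLOAD_TYPE},nic-capacity=high" \
  -o custom-columns=NAME:.metadata.name,AGENTPOOL:.metadata.labels.agentpool
```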
Files Changed
Pipeline Configuration
pipeline.yaml: Main pipeline with schedule trigger
long-running-pipeline-template.yaml: Stage definitions with VM SKU constants
Setup Scripts
create_aks.sh: AKS cluster creation with node labeling
create_vnets.sh: Customer VNet creation
create_peerings.sh: VNet peering mesh
create_storage.sh: Storage accounts with public access disabled (SA1 only)
create_nsg.sh: NSG rule application with retry logic
create_pe.sh: Private endpoint and DNS zone setup
Test Code
datapath.go: Enhanced with node label filtering, private endpoint testing
datapath_create_test.go: Resource creation scenarios
datapath_connectivity_test.go: HTTP connectivity validation
datapath_private_endpoint_test.go: Private endpoint access/isolation tests
datapath_delete_test.go: Resource cleanup
Documentation
README.md:
Reason for Change:
Issue Fixed:
Requirements:
Notes: