Commit 99a8727

Merge pull request #2292 from AndrewSirenko/scale-tests-pre-allocated
Create EBS CSI Driver scale-test tool
2 parents fd7a6d1 + 3acb83d commit 99a8727

9 files changed, +570 -0 lines changed

hack/ebs-scale-test/README.md

+77
@@ -0,0 +1,77 @@
# EBS CSI Driver Scalability Tests

We use EBS CSI Driver scalability tests to validate that each release of our driver can manage the EBS volume lifecycle for large-scale clusters.

Set up and run an EBS CSI Driver scalability test with our `scale-test` tool:

```shell
# Set scalability parameters
export CLUSTER_TYPE="pre-allocated"
export TEST_TYPE="scale-sts"
export REPLICAS="1000"

# Set up an EKS scalability cluster and install the EBS CSI Driver.
./scale-test setup

# Run a scalability test and export results.
./scale-test run

# Clean up all AWS resources related to the scalability cluster.
./scale-test cleanup
```

Results will be exported to a local directory (`$EXPORT_DIR`) and an S3 bucket in your AWS account (`$S3_BUCKET`).

Note: Any `ebs-csi-controller` pod(s) will be restarted at the beginning of every scale run to clear metrics/logs.

## Prerequisites

You will need access to an AWS IAM role that has [eksctl's minimum IAM policies](https://eksctl.io/usage/minimum-iam-policies/) and permission to sync to your `$S3_BUCKET`.

Additionally, please install the following command-line tools:
- [gomplate](https://github.com/hairyhenderson/gomplate) - used to render configuration files based on environment variables.
- [aws cli v2](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
- [eksctl](https://eksctl.io/installation/)
- [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl)
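
Before running `scale-test setup`, it can help to confirm that the tools listed above are on your `PATH`. The following is a minimal sketch, not part of the `scale-test` tool; the `missing_tools` function name is hypothetical.

```shell
# Hypothetical helper: print the name of every tool in the argument list
# that cannot be found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Check the command-line tools named in this README.
missing=$(missing_tools gomplate aws eksctl kubectl)
if [ -n "$missing" ]; then
  echo "Missing command-line tools:" $missing
else
  echo "All required command-line tools found."
fi
```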

## Overridable parameters

You can change the type of scalability cluster and test that `scale-test` runs, or the names of script artifacts, through environment variables.

Note: The environment variables set when you run `scale-test setup` must remain the same for future `scale-test run`/`scale-test cleanup` commands on that scalability cluster.

```sh
# Affect test
CLUSTER_TYPE              # Type of scalability cluster to create.
TEST_TYPE                 # Type of scale test to run.
REPLICAS                  # Number of StatefulSet replicas to create.
DRIVER_VALUES_FILEPATH    # Custom values file passed to EBS CSI Driver Helm chart.

# Names
CLUSTER_NAME              # Base name used by `eksctl` to create AWS resources.
EXPORT_DIR                # Where to export scale test metrics/logs locally.
S3_BUCKET                 # Name of S3 bucket used for holding scalability run results.
SCALABILITY_TEST_RUN_NAME # Name of test run. Used as name of directory for adding run results in $S3_BUCKET.

# Find default values at top of `scale-test` script.
```
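
A common way to make environment variables overridable in a shell script is the `${VAR:=default}` expansion, which assigns a default only when the variable is unset or empty. The sketch below illustrates the idea; the `apply_defaults` function is hypothetical, and the values are placeholders copied from the example at the top of this README, not necessarily the defaults in the `scale-test` script.

```shell
# Hypothetical sketch: assign a default only if the caller has not set a value.
# The values here are placeholders, not the real defaults of `scale-test`.
apply_defaults() {
  : "${CLUSTER_TYPE:=pre-allocated}"
  : "${TEST_TYPE:=scale-sts}"
  : "${REPLICAS:=1000}"
  export CLUSTER_TYPE TEST_TYPE REPLICAS
}

apply_defaults
echo "CLUSTER_TYPE=$CLUSTER_TYPE TEST_TYPE=$TEST_TYPE REPLICAS=$REPLICAS"
```

With this pattern, running `REPLICAS=50 ./some-script` keeps the caller's value, while leaving `REPLICAS` unset falls back to the placeholder default.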

## Types of scalability tests

Set the `CLUSTER_TYPE` and `TEST_TYPE` environment variables to set up and run different scalability tests.

- `CLUSTER_TYPE` dictates what type of scalability cluster `scale-test` creates and which nodes are used during a scalability test run. Options include:
    - `pre-allocated`: Additional worker nodes are created during cluster setup. By default, we pre-allocate 1 `m7a.48xlarge` EC2 instance for every 100 StatefulSet replicas.

- `TEST_TYPE` dictates what type of scalability test we want to run. Options include:
    - `scale-sts`: Scales a StatefulSet to `$REPLICAS`. Waits for all pods to be ready. Deletes the StatefulSet. Waits for all PVs to be deleted. Exercises the complete dynamic provisioning lifecycle for block volumes.

You can mix and match `CLUSTER_TYPE` and `TEST_TYPE`.
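
The "1 instance for every 100 StatefulSet replicas" ratio for `pre-allocated` clusters can be expressed as a ceiling division. This is a sketch under the assumption that partial groups of replicas round up to a whole node; the `nodes_for_replicas` function is hypothetical and the `scale-test` script may compute `PRE_ALLOCATED_NODES` differently.

```shell
# Hypothetical helper: number of nodes needed for a given replica count,
# assuming a fixed number of replicas per node (100 by default) and rounding up.
nodes_for_replicas() {
  local replicas="$1"
  local replicas_per_node="${2:-100}"
  echo $(( (replicas + replicas_per_node - 1) / replicas_per_node ))
}

nodes_for_replicas 1000 # prints 10
nodes_for_replicas 250  # prints 3
```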

## Contributing scalability tests

`scale-test` parses arguments and wraps scripts and configuration files in the `helpers` directory. These helper scripts manage the scalability cluster and test runs.

The `helpers` directory includes:
- `/helpers/cluster-setup`: Holds scripts and configuration for cluster setup/cleanup.
- `/helpers/scale-test`: Holds a directory for each scale test. Also holds utility scripts used by every test (like exporting logs/metrics to S3).
@@ -0,0 +1,69 @@
#!/bin/bash
# Copyright 2025 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

### Helper script to create/delete EKS ebs-scale-test clusters and install add-ons.

set -euo pipefail

# We expect this helper script is sourced from hack/ebs-scale-test
path_to_cluster_setup_dir="${BASE_DIR}/helpers/cluster-setup/"

## Cluster

create_cluster() {
  if eksctl get cluster --name "$CLUSTER_NAME" --region "$AWS_REGION" >/dev/null 2>&1; then
    echo "EKS cluster '$CLUSTER_NAME' already up in $AWS_REGION."
    aws eks update-kubeconfig --name "$CLUSTER_NAME" --region "$AWS_REGION"
  else
    echo "Deploying EKS cluster. See configuration in $EXPORT_DIR/cluster-config.yaml"
    gomplate -f "$path_to_cluster_setup_dir/scale-cluster-config.yaml" -o "$EXPORT_DIR/cluster-config.yaml"
    eksctl create cluster -f "$EXPORT_DIR/cluster-config.yaml"
  fi
}

cleanup_cluster() {
  eksctl delete cluster "$CLUSTER_NAME"
}

## Misc

check_lingering_volumes() {
  lingering_vol_count=$(aws ec2 describe-volumes \
    --filters "Name=tag-key,Values=ebs-scale-test" \
    --query 'length(Volumes[*])' \
    --output text)

  # Use `if` instead of `[[ ... ]] && echo`: with a count of 0 the `&&` list would make
  # this function return non-zero, which aborts the calling script under `set -e`.
  if [[ "$lingering_vol_count" -ne 0 ]]; then
    echo "WARNING: detected $lingering_vol_count lingering ebs-scale-test EBS volumes in $AWS_ACCOUNT_ID. Please run \`aws ec2 describe-volumes --filters 'Name=tag-key,Values=ebs-scale-test'\` and audit their AWS resource tags. Note these volumes may belong to a different scalability run than $SCALABILITY_TEST_RUN_NAME"
  fi
}

## EBS CSI Driver

deploy_ebs_csi_driver() {
  path_to_chart="${BASE_DIR}/../../charts/aws-ebs-csi-driver"
  echo "Deploying EBS CSI driver from chart $path_to_chart"

  # We use helm install instead of upgrade to ensure the release does not already exist
  helm install aws-ebs-csi-driver \
    --namespace kube-system \
    --values "$DRIVER_VALUES_FILEPATH" \
    --wait \
    --timeout 15m \
    "$path_to_chart"
}

(return 0 2>/dev/null) || (
  echo "This script is not meant to be run directly, only sourced as a helper!"
  exit 1
)
@@ -0,0 +1,41 @@
# Copyright 2025 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: {{ .Env.CLUSTER_NAME }}
  version: {{ .Env.K8S_VERSION }}
  region: {{ .Env.AWS_REGION }}
  tags:
    karpenter.sh/discovery: {{ .Env.CLUSTER_NAME }}

iam:
  withOIDC: true
  podIdentityAssociations:
    - namespace: kube-system
      serviceAccountName: ebs-csi-controller-sa
      wellKnownPolicies:
        ebsCSIController: true

managedNodeGroups:
{{- if eq ( getenv "CLUSTER_TYPE" ) "pre-allocated" }}
  - instanceType: m7a.48xlarge
    amiFamily: AmazonLinux2
    name: pre-allocated-ng
    desiredCapacity: {{ .Env.PRE_ALLOCATED_NODES }}
{{- end }}

addons:
  - name: eks-pod-identity-agent
@@ -0,0 +1,29 @@
# Copyright 2025 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Default values.yaml for ebs-scale-test installation of aws-ebs-csi-driver
image:
  pullPolicy: Always
controller:
  logLevel: 7
  # Having one controller simplifies metrics collection and helps track restarts
  replicaCount: 1
  enableMetrics: true
sidecars:
  provisioner:
    additionalArgs: ["--http-endpoint=:8081"]
  resizer:
    additionalArgs: ["--http-endpoint=:8082"]
  attacher:
    additionalArgs: ["--http-endpoint=:8084"]
@@ -0,0 +1,81 @@
#!/bin/bash
# Copyright 2025 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

### Helper script for exporting EBS CSI Driver metrics to S3 bucket

set -euo pipefail

collect-and-export-metrics() {
  export CONTROLLER_POD_NAME
  CONTROLLER_POD_NAME=$(kubectl get pod -n kube-system -l app=ebs-csi-controller -o jsonpath='{.items[0].metadata.name}')
  export METRICS_FILEPATH="$EXPORT_DIR/metrics.txt"

  collect_metrics
  clean_metrics

  echo "Collecting ebs-plugin logs"
  kubectl logs "$CONTROLLER_POD_NAME" -n kube-system >"$EXPORT_DIR/ebs-plugin-logs.txt"

  echo "Collecting ebs-csi-controller Deployment and ebs-csi-node Daemonset yaml"
  kubectl get deployment ebs-csi-controller -n kube-system -o yaml >"$EXPORT_DIR/ebs-csi-controller.yaml"
  kubectl get daemonset ebs-csi-node -n kube-system -o yaml >"$EXPORT_DIR/ebs-csi-node.yaml"

  echo "Exporting everything in $EXPORT_DIR to S3 bucket s3://$S3_BUCKET/$SCALABILITY_TEST_RUN_NAME"

  aws s3 sync "$EXPORT_DIR" "s3://$S3_BUCKET/$SCALABILITY_TEST_RUN_NAME"
  echo "Metrics exported to s3://$S3_BUCKET/$SCALABILITY_TEST_RUN_NAME/"
}

collect_metrics() {
  echo "Port-forwarding ebs-csi-controller containers"
  kubectl port-forward "$CONTROLLER_POD_NAME" 3301:3301 -n kube-system &
  kubectl port-forward "$CONTROLLER_POD_NAME" 8081:8081 -n kube-system &
  kubectl port-forward "$CONTROLLER_POD_NAME" 8082:8082 -n kube-system &
  kubectl port-forward "$CONTROLLER_POD_NAME" 8084:8084 -n kube-system &

  echo "Collecting metrics"
  for port in 3301 8081 8082 8084; do
    curl "http://localhost:${port}/metrics" >>"$METRICS_FILEPATH" && continue
    echo "Failed to collect metrics from port ${port}, retrying after 5s..."
    sleep 5
    curl "http://localhost:${port}/metrics" >>"$METRICS_FILEPATH" && continue
    echo "Failed to collect metrics from port ${port} AGAIN, retrying after 10s..."
    sleep 10
    curl "http://localhost:${port}/metrics" >>"$METRICS_FILEPATH" && continue
    echo "Failed to collect metrics from port ${port} THRICE, retrying one more time after 20s..."
    sleep 20
    # Final attempt promised by the message above; it was missing in the original loop.
    curl "http://localhost:${port}/metrics" >>"$METRICS_FILEPATH" && continue
    echo "WARNING: Could not collect metrics from port ${port}. Something may be wrong in cluster."
  done
}

clean_metrics() {
  echo "Generating clean version of exported data at $EXPORT_DIR/cleaned_data.txt"
  # Keep histogram totals and counters, then drop metric families that are not relevant to the scale test.
  grep -e "+Inf" -e "total" "$METRICS_FILEPATH" |
    grep -v "workqueue" |
    grep -v "go_" |
    grep -v "Identity" |
    grep -v "Capabili" |
    grep -v "TYPE" |
    grep -v "HELP" |
    grep -v "cloudprovider" |
    grep -v "promhttp" |
    grep -v "registered_metrics" >"$EXPORT_DIR/cleaned_data.txt"
}

(return 0 2>/dev/null) || (
  echo "This script is not meant to be run directly, only sourced as a helper!"
  exit 1
)
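
Review note: the unrolled retries in `collect_metrics` follow a fixed schedule of delays between attempts. The same pattern could be written as a small loop. The sketch below is a suggestion only and is not part of this commit; the `retry_with_backoff` name and its argument layout (a space-separated list of delays, then the command) are hypothetical.

```shell
# Hypothetical helper: run a command, retrying after each delay in the list.
# Usage: retry_with_backoff "5 10 20" some_command arg1 arg2
# Returns 0 as soon as the command succeeds; otherwise returns the status
# of the final attempt made after the last delay.
retry_with_backoff() {
  local delays="$1"
  shift
  local delay
  for delay in $delays; do
    "$@" && return 0
    echo "Command failed, retrying after ${delay}s..." >&2
    sleep "$delay"
  done
  "$@"
}
```

In `collect_metrics`, each port could then be handled by a single call that wraps the existing `curl` invocation, with the warning printed only if the call returns non-zero.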
@@ -0,0 +1,31 @@
#!/bin/bash
# Copyright 2025 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Creates $S3_BUCKET if it does not exist and checks that there hasn't been a run with $SCALABILITY_TEST_RUN_NAME
pre_test_validation() {
  if ! aws s3 ls "s3://$S3_BUCKET"; then
    aws s3 mb "s3://$S3_BUCKET" --region "${AWS_REGION}"
  fi

  # Force JSON output so an empty result is always the literal string "null",
  # regardless of the output format configured in the user's AWS CLI profile.
  result=$(aws s3api list-objects-v2 --bucket "$S3_BUCKET" --prefix "$SCALABILITY_TEST_RUN_NAME" --query 'Contents[]' --output json)
  if [[ "$result" != "null" ]]; then
    echo "ERROR: Your S3 bucket already contains directory with name \$SCALABILITY_TEST_RUN_NAME: 's3://$S3_BUCKET/$SCALABILITY_TEST_RUN_NAME'. Please pick a unique SCALABILITY_TEST_RUN_NAME."
    exit 1
  fi

  echo "Updating kubeconfig and restarting ebs-csi-controller pod"
  aws eks update-kubeconfig --name "$CLUSTER_NAME"
  kubectl delete pod -n kube-system -l app=ebs-csi-controller
}
@@ -0,0 +1,56 @@
#!/bin/bash
# Copyright 2025 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

### Helper script for running EBS-backed StatefulSet scaling test

# We expect this helper script is sourced from hack/ebs-scale-test
path_to_scale_test_dir="${BASE_DIR}/helpers/scale-test/scale-sts-test"

sts_scale_test() {
  manifest_path="$path_to_scale_test_dir/scale-sts.yaml"
  export_manifest_path="$EXPORT_DIR/scale-manifest.yaml"

  echo "Applying $manifest_path. Exported to $export_manifest_path"
  gomplate -f "$manifest_path" -o "$export_manifest_path"
  kubectl apply -f "$export_manifest_path"

  echo "Scaling StatefulSet to $REPLICAS replicas"
  kubectl scale sts --replicas "$REPLICAS" ebs-scale-test
  kubectl rollout status statefulset ebs-scale-test

  echo "Deleting StatefulSet"
  kubectl delete -f "$export_manifest_path"

  echo "Waiting for all PVs to be deleted"
  wait_for_pvs_to_delete
}

# Polls every 5 seconds until no PersistentVolumes remain in the cluster. There is no timeout.
wait_for_pvs_to_delete() {
  while true; do
    pv_count=$(kubectl get pv --no-headers | wc -l)
    if [ "$pv_count" -eq 0 ]; then
      echo "No PVs exist in the cluster, proceeding..."
      break
    else
      echo "$pv_count PVs still exist, waiting..."
      sleep 5
    fi
  done
}

(return 0 2>/dev/null) || (
  echo "This script is not meant to be run directly, only sourced as a helper!"
  exit 1
)
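
Review note: `wait_for_pvs_to_delete` polls until no PVs remain and never gives up, so a stuck volume deletion would hang the run. One possible approach is a generic polling helper with a deadline. The sketch below is a suggestion only and is not part of this commit; `wait_until` and `no_pvs_remaining` are hypothetical names, and any timeout value passed to it would be an assumption to tune per cluster size.

```shell
# Hypothetical helper: poll a command until it succeeds or a timeout elapses.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS command [args...]
# Returns 0 when the command succeeds, 1 if the deadline passes first.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline
  deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "Timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Hypothetical predicate mirroring the check in wait_for_pvs_to_delete.
# It is only defined here, not called, so this sketch runs without a cluster.
no_pvs_remaining() {
  [ "$(kubectl get pv --no-headers 2>/dev/null | wc -l)" -eq 0 ]
}
```

A call such as `wait_until 1800 5 no_pvs_remaining` would then poll every 5 seconds and fail the run if PVs still exist after the chosen timeout.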
