Description
/kind bug
What happened?
I have two clusters: eks-poc on v1.27, and a new one created recently on v1.28 called eks-dev. On eks-dev, I noticed that despite setting a securityContext that forces the fsGroup of the mounted volume to 1035 [1], the driver does not respect it and instead sets the group to 1999 (one below the gidRangeEnd of 2000 set in the storage class [2]). We didn't have this problem on eks-poc, but we updated it this morning to v1.28 and the problem appeared there too, so the issue seems to be related to Kubernetes 1.28.
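For what it's worth, the GID the driver actually assigned can also be confirmed on the AWS side by listing the access points it created; the PosixUser field in the output shows the allocated UID/GID (fs-redacted stands in for our real file system ID):

# Inspect the access points created by dynamic provisioning; the
# PosixUser field shows the UID/GID the driver allocated.
aws efs describe-access-points --file-system-id fs-redacted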
I installed the EFS driver manually on eks-poc and via the EKS add-on on eks-dev. The image of the efs-plugin container in the efs-csi-controller pod is 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/aws-efs-csi-driver:v1.5.7 on eks-poc and 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/aws-efs-csi-driver:v1.7.1 on eks-dev - so the driver versions differ, but the common factor that makes it stop working is Kubernetes 1.28.
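For reference, this is roughly how I read those image versions (the namespace and label may differ between a manual install and the add-on):

# Print the efs-plugin image of the controller pods; assumes the default
# app=efs-csi-controller label from the standard install.
kubectl -n kube-system get pods -l app=efs-csi-controller \
  -o jsonpath='{.items[*].spec.containers[?(@.name=="efs-plugin")].image}'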
One other thing we've tried: rolling back the eks-dev EFS driver to 1.5.7. The problem still happens, but the POSIX user is now 1000 rather than 199x. I haven't checked the changelog, but I assume the allocator was switched around to count down from the maximum GID, as suggested by the log line "Allocator found GID which is already in use: -1 - trying next one."
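That log line came from the controller logs, grabbed with something like this (deployment name assumed from the default install; adjust if yours differs):

# Search the efs-plugin container logs for the GID allocator messages.
kubectl -n kube-system logs deploy/efs-csi-controller -c efs-plugin | grep -i gid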
What you expected to happen?
The mounted volume should be owned by user/group 1035 from the securityContext, not by the GID allocated by the provisioner.
How to reproduce it (as minimally and precisely as possible)?
[1] StatefulSet YAML (this is also happening with other Deployments; note the command overriding the entrypoint):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  minReadySeconds: 10
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: postgres
  serviceName: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - command:
        - sleep
        - infinity
        env:
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        image: timescale/timescaledb:2.12.1-pg14
        imagePullPolicy: IfNotPresent
        name: postgres
        ports:
        - containerPort: 5432
          protocol: TCP
        resources: {}
        volumeMounts:
        - mountPath: /var/lib/postgresql/data
          name: postgresdb
          subPath: pgdata
      restartPolicy: Always
      securityContext:
        fsGroup: 1035
        fsGroupChangePolicy: Always
        runAsNonRoot: true
        runAsUser: 1035
      terminationGracePeriodSeconds: 30
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: postgresdb
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: efs-sc
      volumeMode: Filesystem
[2] StorageClass:
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  name: efs-sc
parameters:
  basePath: /dynamic_provisioning
  directoryPerms: "700"
  fileSystemId: fs-redacted
  gidRangeEnd: "2000"
  gidRangeStart: "1000"
  provisionerID: efs.csi.aws.com
  provisioningMode: efs-ap
provisioner: efs.csi.aws.com
reclaimPolicy: Retain
volumeBindingMode: Immediate
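As an aside, a possible workaround we are considering but have not yet verified: the driver's dynamic provisioning parameters also accept fixed uid/gid values for the access point instead of a gidRange, which should sidestep the allocator entirely. A sketch, untested in our setup:

# Untested sketch: pin the access point's POSIX identity via uid/gid
# instead of relying on the gidRangeStart/gidRangeEnd allocator.
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc-fixed-ids
parameters:
  basePath: /dynamic_provisioning
  directoryPerms: "700"
  fileSystemId: fs-redacted
  gid: "1035"
  provisioningMode: efs-ap
  uid: "1035"
provisioner: efs.csi.aws.com
reclaimPolicy: Retain
volumeBindingMode: Immediate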
With driver 1.5.7, this results in the following - note the data directory is owned by UID/GID 1000, not 1035:
postgres-0:/$ ls -lah /var/lib/postgresql/
total 4K
drwxr-xr-x 1 postgres postgres 18 Oct 6 01:04 .
drwxr-xr-x 1 root root 24 Oct 6 01:04 ..
drwx------ 2 1000 1000 6.0K Nov 29 13:23 data
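The same check can also be run non-interactively, e.g.:

# -n prints numeric UID/GID; assumes the first replica of the StatefulSet above.
kubectl exec postgres-0 -- ls -lan /var/lib/postgresql/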
Environment
- Kubernetes version (use kubectl version): see above
- Driver version: see above
Please also attach debug logs to help us better diagnose
Attached:
- csi-provisioner.txt
- efs-plugin.txt