Skip to content

[WIP] CNTRLPLANE-371: Update to Kubernetes v1.33 #2261

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2,381 commits into
base: master
Choose a base branch
from

Conversation

bertinatto
Copy link
Member

This is a temporary PR create to get the initial steps of the kube bump ready.

This will be closed before starting the payload-testing phase.

munnerz and others added 30 commits March 20, 2025 20:19
…ImgChangeE2E

Add e2e test for Regular Container image change
Optimize DS Controller Performance: Reduce Work Duration Time & Minimize Cache Locking.
test: switch gotestsum quiet output format
DRA device taints: fix some race conditions
[PodLevelResources] Pod Level Hugepage Resources
…opagation

APIServerTracing: Respect trace context only for privileged users
CI integration scripts: reduce log noise from installing etcd
KEP-4742: Copy topology labels from Node objects to Pods upon binding/scheduling
…d_of_caching_cluster_events_in_binding

Call queue.Done() before PreBind phase, removing the pod in binding from inFlightPods to save memory
[KEP-2371] add test about container metrics from cadvisor
…size

disable in-place pod vertical scaling for swap enabled pods
Remove general available feature-gate CPUManager
…ularContainerImgChangeE2E

Revert "Add e2e test for Regular Container image change"
The defaulting of TimeAdded randomly broke some of the tests:

   TestList:
       resttest.go:1393: expected:
       []runtime.Object{(*resource.DeviceTaintRule)(0xc000b83080), (*resource.DeviceTaintRule)(0xc000b831e0)},
       got:
       []runtime.Object{(*resource.DeviceTaintRule)(0xc0003db608), (*resource.DeviceTaintRule)(0xc0003db750)}
       ...

   TestCreate:
    resttest.go:346: unexpected obj: &resource.DeviceTaintRule{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"foo2", GenerateName:"", Namespace:"", SelfLink:"", UID:"18d3084d-7d11-4575-8730-4650b81cf1a7", ResourceVersion:"8", Generation:1, CreationTimestamp:time.Date(2025, time.March, 21, 8, 27, 23, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:resource.DeviceTaintRuleSpec{DeviceSelector:(*resource.DeviceTaintSelector)(nil), Taint:resource.DeviceTaint{Key:"example.com/taint", Value:"", Effect:"NoExecute", TimeAdded:time.Date(2025, time.March, 21, 8, 27, 23, 0, time.Local)}}}, expected &resource.DeviceTaintRule{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"foo2", GenerateName:"", Namespace:"", SelfLink:"", UID:"18d3084d-7d11-4575-8730-4650b81cf1a7", ResourceVersion:"8", Generation:1, CreationTimestamp:time.Date(2025, time.March, 21, 8, 27, 23, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:resource.DeviceTaintRuleSpec{DeviceSelector:(*resource.DeviceTaintSelector)(nil), Taint:resource.DeviceTaint{Key:"example.com/taint", Value:"", Effect:"NoExecute", TimeAdded:time.Date(2025, time.March, 21, 8, 27, 24, 0, time.Local)}}}

Failure rate before: 3m40s: 1332 runs so far, 7 failures (0.53%)

It's not obvious from the test failure, but the difference is the
TimeAdded. Setting it beforehand to a value that can be encoded (i.e. truncated
to seconds) fixes the flake.

Failure rate after: 5m0s: 1825 runs so far, 0 failures
soltysh and others added 26 commits April 28, 2025 18:05
Adding a new mutation plugin that handles the following:

1. In case of `workload.openshift.io/enable-shared-cpus` request, it
   adds an annotation to hint runtime about the request. runtime
   is not aware of extended resources, hence we need the annotation.
2. It validates the pod's QoS class and return an error if it's not a
   guaranteed QoS class
3. It validates that no more than a single resource is being request.
4. It validates that the pod deployed in a namespace that has mixedcpus
   workloads allowed annotation.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <[email protected]>

UPSTREAM: <carry>: Update management webhook pod admission logic

Updating the logic for pod admission to allow a pod creation with workload partitioning annotations to be run in a namespace that has no workload allow annoations.

The pod will be stripped of its workload annotations and treated as if it were normal, a warning annoation will be placed to note the behavior on the pod.

Signed-off-by: ehila <[email protected]>

UPSTREAM: <carry>: add support for cpu limits into management workloads

Added support to allow workload partitioning to use the CPU limits for a container, to allow the runtime to make better decisions around workload cpu quotas we are passing down the cpu limit as part of the cpulimit value in the annotation. CRI-O will take that information and calculate the quota per node. This should support situations where workloads might have different cpu period overrides assigned.

Updated kubelet for static pods and the admission webhook for regular to support cpu limits.

Updated unit test to reflect changes.

Signed-off-by: ehila <[email protected]>
…ject openshift feature gates into pkg/features

Signed-off-by: Swarup Ghosh <[email protected]>
This is a short term fix, once we improve the cert rotation logic
in library-go that does not depend on this hack, then we can
remove this carry patch.

squash with the previous PR during the rebase
openshift#1924

squash with the previous PRs during the rebase
openshift#1924
openshift#1929
…phase and graceful termination phase

This reverts commit 85f0f2c.

UPSTREAM: <carry>: fix request Host storing in openshift.io/during-graceful audit log annotation

request URL doesn't contain the host used in the request, instead it
should be fetched from request headers

Note for rebase: squash it into the following commit
vrutkovs@a83d289 UPSTREAM: <carry>: annotate audit events for requests during unready phase and graceful termination phase (openshift#2077)

When audit message is being processed https://github.com/openshift/kubernetes/blob/309f240e18f1da87bbe86c18746774d6d302f8ef/staging/src/k8s.io/apimachinery/pkg/util/proxy/transport.go#L136-L174 may strip `Host` from `r.URL`, however `r.Host` is always filled in. This value may be different for proxy requests, but in most cases `r.Host` should be used instead of `r.URL.Host`
…navailable errors for the etcd health checker client

UPSTREAM: <carry>: replace newETCD3ProberMonitor with etcd3RetryingProberMonitor
This commit fixes bug 1919737.

https://bugzilla.redhat.com/show_bug.cgi?id=1919737

* pkg/proxy/iptables/proxier.go (syncProxyRules): Prefer a local endpoint
for the cluster DNS service.
similarly to what we do for the managed CPU (aka workload partitioning)
feature, introduce a master configuration file
`/etc/kubernetes/openshift-llc-alignment` which needs to be present for
the LLC alignment feature to be activated, in addition to the policy
option being required.

Note this replace the standard upstream feature gate check.

This can be dropped when the feature per  KEP
kubernetes/enhancements#4800 goes beta.

Signed-off-by: Francesco Romani <[email protected]>
Explicitly exclude etcd and etcd-readiness checks (OCPBUGS-48177)
and have etcd operator take responsibility for properly reporting etcd readiness.
Justification: kube-apiserver instances get removed from a load balancer when etcd starts
to report not ready (as will KA's /readyz). Client connections can withstand etcd unreadiness
longer than the readiness timeout is. Thus, it is not necessary to drop connections
in case etcd resumes its readiness before a client connection times out naturally.
This is a downstream patch only as OpenShift's way of using etcd is unique.
The existing patch retried any etcd error returned from storage with the code "Unavailable". Writes
can only be safely retried if the client can be absolutely sure that the initial attempt ended
before persisting any changes. The "Unavailable" code includes errors like "timed out" that can't be
safely retried for writes.
Signed-off-by: Peter Hunt <[email protected]>

UPSTREAM: <carry>: authorization: add minimumkubeletversion package

MinimumKubeletVersion is a way for an admin to declare that nodes any older than the
minimum version cannot authorize with the apiserver. This effectively prevents them from joining.

Doing so means the apiservers can trust newer features are usable on clusters with version skews

Signed-off-by: Peter Hunt <[email protected]>

UPSTREAM: <carry>: authorizer: move mininum kubelet version authorizer to pkg/kubeapiserver and add authorization mode

this does require a line of code be moved from the enablement package to stop a cyclical import

Signed-off-by: Peter Hunt <[email protected]>

UPSTREAM: <carry>: crdvalidation: move latency profile file to be agnostic of field

Signed-off-by: Peter Hunt <[email protected]>

UPSTREAM: <carry>: features: add MinimumKubeletVersion feature

Signed-off-by: Peter Hunt <[email protected]>
Upstream enables volume group snapshots by editing yaml files in a shell
script [1]. We can't use this script in openshift-tests.

Create a brand new, OCP specific test driver based on csi-driver-hostpath,
only with the --feature-gate=VolumeGroupSnapshot on external-snapshotter command line.

We will need to carry this patch until the feature graduates to GA. I've
chosen to create brand new files in this carry patch, so it can't conflict
with the existing ones.

1: https://github.com/kubernetes/kubernetes/blob/91d6fd3455c4a071408df20c7f48df221f2b6d30/test/e2e/testing-manifests/storage-csi/external-snapshotter/volume-group-snapshots/run_group_snapshot_e2e.sh
@openshift-ci-robot
Copy link

@bertinatto: the contents of this pull request could not be automatically validated.

The following commits are valid:

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@bertinatto
Copy link
Member Author

/test integration

Copy link

openshift-ci bot commented Apr 29, 2025

@bertinatto: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-crun-wasm 9cd691d link true /test e2e-aws-crun-wasm
ci/prow/e2e-aws-ovn-serial 9cd691d link true /test e2e-aws-ovn-serial
ci/prow/e2e-agnostic-ovn-cmd 9cd691d link false /test e2e-agnostic-ovn-cmd
ci/prow/okd-scos-e2e-aws-ovn 9cd691d link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-gcp 9cd691d link true /test e2e-gcp
ci/prow/e2e-aws-ovn-cgroupsv2 9cd691d link true /test e2e-aws-ovn-cgroupsv2

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. vendor-update Touching vendor dir or related files
Projects
None yet
Development

Successfully merging this pull request may close these issues.