-
Notifications
You must be signed in to change notification settings - Fork 3.5k
Description
Description
I was looking into the test failure at https://github.com/project-chip/connectedhomeip/actions/runs/18296436055/attempts/3?pr=41235 (and in particular https://github.com/project-chip/connectedhomeip/actions/runs/18296436055/job/52165072259?pr=41235) and the problem seems to be that the test job was losing timeslice for about a second at a time over and over again, so that things ended up timing out.
From the logs for that job, showing two timeskips (but there are many more in that log):
2025-10-07T16:46:29.5974210Z 2025-10-07 16:46:22.820734+0000 xctest[31743:128747] [DataManagement] AttributeDataIB =
2025-10-07T16:46:29.5974330Z 2025-10-07 16:46:23.816516+0000 xctest[31743:128747] [DataManagement] {
...
2025-10-07T16:46:29.6046700Z 2025-10-07 16:46:23.820045+0000 xctest[31743:128747] [DataManagement]
2025-10-07T16:46:29.6046810Z 2025-10-07 16:46:24.805417+0000 xctest[31743:128747] [DataManagement] },
(the second timestamps are the "real" ones; the first ones are just when the job as a whole failed). That shows that it took just about a second to print a few characters to stdout; normally the corresponding log lines look more like this:
2025-10-07T16:46:29.6189490Z 2025-10-07 16:46:25.760699+0000 xctest[31743:128747] [DataManagement] AttributeDataIB =
2025-10-07T16:46:29.6189710Z 2025-10-07 16:46:25.760709+0000 xctest[31743:128747] [DataManagement] {
with the logs coming about 10 microseconds apart, not 1 second apart. As a result, something that normally takes a few milliseconds (parsing and logging a 1KB message) took over 11 second to complete.
A different thread from the same process running at the same time also shows a 1-second time skip:
2025-10-07T16:46:29.5974050Z 2025-10-07 16:46:22.820677+0000 xctest[31743:134280] [Default] <MTRDevice_Concrete: 0x10aded200, node: 69497B43CD974FD5-0000000012344322 (305414946), VID: 65521, PID: 32769, WiFi: NO, Thread: NO, state: Resubscribing, last subscription attempt wait: 0s, queued work: 0, last report: 2025-10-07 16:45:56 +0000 (27s ago), last subscription failure: 2025-10-07 16:46:11 +0000 (11s ago), controller: 4C76CAD5-6A29-4625-B6B5-84D8B199F918> callFirstDelegateSynchronouslyWithBlock: successfully called <MTRDeviceDelegateInfo: 0x10ae92370 delegate value 0x10be09360 interested attribute paths count 0 event paths count 0>
2025-10-07T16:46:29.5976310Z 2025-10-07 16:46:23.817107+0000 xctest[31743:134280] [Default] <MTRDevice_Concrete: 0x10aded200, node: 69497B43CD974FD5-0000000012344322 (305414946), VID: 65521, PID: 32769, WiFi: NO, Thread: NO, state: Resubscribing, last subscription attempt wait: 0s, queued work: 0, last report: 2025-10-07 16:45:56 +0000 (28s ago), last subscription failure: 2025-10-07 16:46:11 +0000 (12s ago), controller: 4C76CAD5-6A29-4625-B6B5-84D8B199F918> report <MTRAttributePath endpoint 1 cluster 0x80 (128, BooleanStateConfiguration) attribute 0x1 (1, SupportedSensitivityLevels)> value {
and as far as I can tell those were the only things running at the time, so it's not that there was contention from other parts of this test job. As far as I can tell, the entire runner lost timeslice.
This is not unique to this job; we retriggered the job multiple times, and each time it fails because of timeslice losses randomly happening at various points.
Platforms affected
- Azure DevOps
- GitHub Actions - Standard Runners
- GitHub Actions - Larger Runners
Runner images affected
- Ubuntu 22.04
- Ubuntu 24.04
- macOS 13
- macOS 13 Arm64
- macOS 14
- macOS 14 Arm64
- macOS 15
- macOS 15 Arm64
- macOS 26 Arm64
- Windows Server 2019
- Windows Server 2022
- Windows Server 2025
Image version and build link
Image: macos-14-arm64
Version: 20250928.1833
Included Software: https://github.com/actions/runner-images/blob/macos-14-arm64/20250928.1833/images/macos/macos-14-arm64-Readme.md
Image Release: https://github.com/actions/runner-images/releases/tag/macos-14-arm64%2F20250928.1833
Is it regression?
Sort of. The macos-13 (Intel) runners are not showing this behavior, and our attempt to move forward to the macos-14 runners is blocked on this
Expected behavior
Not randomly take 11s to do something that normally takes a few milliseconds.
Actual behavior
Multiple timeslice losses, all almost exactly one second, adding up to about 11s.
Repro steps
Pretty much any run of the above-linked job on this runner type shows this behavior.