Skip to content

macos-14 runners randomly lose timeslice for a very long time #13145

@bzbarsky-apple

Description

@bzbarsky-apple

Description

I was looking into the test failure at https://github.com/project-chip/connectedhomeip/actions/runs/18296436055/attempts/3?pr=41235 (and in particular https://github.com/project-chip/connectedhomeip/actions/runs/18296436055/job/52165072259?pr=41235) and the problem seems to be that the test job was losing timeslice for about a second at a time over and over again, so that things ended up timing out.

From the logs for that job, showing two timeskips (but there are many more in that log):

2025-10-07T16:46:29.5974210Z 2025-10-07 16:46:22.820734+0000 xctest[31743:128747] [DataManagement] 			AttributeDataIB =
2025-10-07T16:46:29.5974330Z 2025-10-07 16:46:23.816516+0000 xctest[31743:128747] [DataManagement] 			{
...
2025-10-07T16:46:29.6046700Z 2025-10-07 16:46:23.820045+0000 xctest[31743:128747] [DataManagement] 
2025-10-07T16:46:29.6046810Z 2025-10-07 16:46:24.805417+0000 xctest[31743:128747] [DataManagement] 		},

(the second timestamps are the "real" ones; the first ones are just when the job as a whole failed). That shows that it took just about a second to print a few characters to stdout; normally the corresponding log lines look more like this:

2025-10-07T16:46:29.6189490Z 2025-10-07 16:46:25.760699+0000 xctest[31743:128747] [DataManagement] 			AttributeDataIB =
2025-10-07T16:46:29.6189710Z 2025-10-07 16:46:25.760709+0000 xctest[31743:128747] [DataManagement] 			{

with the logs coming about 10 microseconds apart, not 1 second apart. As a result, something that normally takes a few milliseconds (parsing and logging a 1KB message) took over 11 second to complete.

A different thread from the same process running at the same time also shows a 1-second time skip:

2025-10-07T16:46:29.5974050Z 2025-10-07 16:46:22.820677+0000 xctest[31743:134280] [Default] <MTRDevice_Concrete: 0x10aded200, node: 69497B43CD974FD5-0000000012344322 (305414946), VID: 65521, PID: 32769, WiFi: NO, Thread: NO, state: Resubscribing, last subscription attempt wait: 0s, queued work: 0, last report: 2025-10-07 16:45:56 +0000 (27s ago), last subscription failure: 2025-10-07 16:46:11 +0000 (11s ago), controller: 4C76CAD5-6A29-4625-B6B5-84D8B199F918> callFirstDelegateSynchronouslyWithBlock: successfully called <MTRDeviceDelegateInfo: 0x10ae92370 delegate value 0x10be09360 interested attribute paths count 0 event paths count 0>
2025-10-07T16:46:29.5976310Z 2025-10-07 16:46:23.817107+0000 xctest[31743:134280] [Default] <MTRDevice_Concrete: 0x10aded200, node: 69497B43CD974FD5-0000000012344322 (305414946), VID: 65521, PID: 32769, WiFi: NO, Thread: NO, state: Resubscribing, last subscription attempt wait: 0s, queued work: 0, last report: 2025-10-07 16:45:56 +0000 (28s ago), last subscription failure: 2025-10-07 16:46:11 +0000 (12s ago), controller: 4C76CAD5-6A29-4625-B6B5-84D8B199F918> report <MTRAttributePath endpoint 1 cluster 0x80 (128, BooleanStateConfiguration) attribute 0x1 (1, SupportedSensitivityLevels)> value {

and as far as I can tell those were the only things running at the time, so it's not that there was contention from other parts of this test job. As far as I can tell, the entire runner lost timeslice.

This is not unique to this job; we retriggered the job multiple times, and each time it fails because of timeslice losses randomly happening at various points.

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 22.04
  • Ubuntu 24.04
  • macOS 13
  • macOS 13 Arm64
  • macOS 14
  • macOS 14 Arm64
  • macOS 15
  • macOS 15 Arm64
  • macOS 26 Arm64
  • Windows Server 2019
  • Windows Server 2022
  • Windows Server 2025

Image version and build link

Image: macos-14-arm64
Version: 20250928.1833
Included Software: https://github.com/actions/runner-images/blob/macos-14-arm64/20250928.1833/images/macos/macos-14-arm64-Readme.md
Image Release: https://github.com/actions/runner-images/releases/tag/macos-14-arm64%2F20250928.1833

Is it regression?

Sort of. The macos-13 (Intel) runners are not showing this behavior, and our attempt to move forward to the macos-14 runners is blocked on this

Expected behavior

Not randomly take 11s to do something that normally takes a few milliseconds.

Actual behavior

Multiple timeslice losses, all almost exactly one second, adding up to about 11s.

Repro steps

Pretty much any run of the above-linked job on this runner type shows this behavior.

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions