Problem Description
We've encountered a critical performance issue with the .NET 9.0 thread pool (using SDK 9.0.303, a stable release) in our high-load production environment.
Under heavy load, the application becomes extremely slow or unresponsive. Upon diagnosis with dotnet-stack, we observed a pathological state:
The ThreadPool's global work item queue (QueueSize) was backlogged with thousands of items (in our log, the count was 6759).
Simultaneously, dozens of ThreadPool worker threads (PortableThreadPool+WorkerThread) were completely idle, waiting on LowLevelLifoSemaphore.Wait.
Expected Behavior:
When there are pending work items in the global queue, idle worker threads should be woken up immediately to process them.
Actual Behavior:
Work items are severely backlogged in the global queue while a large number of worker threads remain idle, leading to effective thread pool starvation.
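For anyone who wants to cross-check the backlog from inside the process, here is a minimal sketch that samples the public ThreadPool counters. It is not taken from our application: the helper name SampleThreadPool, the sample count, and the interval are illustrative assumptions, and we have not confirmed that PendingWorkItemCount corresponds exactly to the QueueSize value reported in our log.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of any library): capture one snapshot of the
// public thread pool counters exposed by System.Threading.ThreadPool.
static (long Pending, int Threads, long Completed) SampleThreadPool() =>
    (ThreadPool.PendingWorkItemCount,    // work items currently queued
     ThreadPool.ThreadCount,             // thread pool threads that currently exist
     ThreadPool.CompletedWorkItemCount); // work items processed so far

// Assumed values for illustration only; tune for your environment.
const int sampleCount = 3;
const int intervalMs = 500;

for (int i = 0; i < sampleCount; i++)
{
    var s = SampleThreadPool();
    Console.WriteLine(
        $"{DateTime.UtcNow:O} pending={s.Pending} threads={s.Threads} completed={s.Completed}");
    Thread.Sleep(intervalMs);
}
```

If the pending count stays high across samples while the completed count barely moves, that would be consistent with the state described above. It would not by itself show why the idle workers are not being woken.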
Analysis
This issue appears to be related to the internal scheduling or signaling/wakeup mechanism of the Portable Thread Pool (PortableThreadPool), the managed thread pool implementation used by .NET 9. The queue size and stack traces strongly suggest a failure in the mechanism responsible for waking idle worker threads when a large number of work items are pending.
Our application is an Orleans-based service that uses ASP.NET Core Kestrel, MongoDB, Nacos, OpenTelemetry, and other libraries, involving a high degree of asynchronous I/O.
Evidence: dotnet-stack Trace
Below are the key parts of the dotnet-stack log we captured. The full log file can be provided if needed.
Log Summary:
QueueSize: 6759
Typical stack trace for the numerous idle threads:
Thread (0x3D6B):
CPU_TIME
System.Private.CoreLib!System.Threading.LowLevelLifoSemaphore.WaitNative(...)
System.Private.CoreLib!System.Threading.LowLevelLifoSemaphore.WaitForSignal(int32)
System.Private.CoreLib!System.Threading.LowLevelLifoSemaphore.Wait(int32,bool)
System.Private.CoreLib!System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()
At the same time, at least one worker thread was active, processing an HTTP/2 frame write in Kestrel.
We believe this is a critical bug within the .NET 9.0 runtime. We hope this information is helpful for the investigation.