
Application freezes during GC #108144

Open
FlorianRoudautOrchestrade opened this issue Sep 23, 2024 · 22 comments
Labels
area-GC-coreclr tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly
Milestone

Comments

@FlorianRoudautOrchestrade

Description

One of our WinForms applications, with workstation GC, randomly freezes. The application runs on laptops and performs multithreaded (around 24 or 32 threads) calculations that involve heavy memory allocation and deallocation.

Freezes can happen 2 or 3 times a day. The desktop application runs almost 24/7 on the user's machine. On a usual cycle, we can see with DotMemory that around 35% of the time is spent in GC.

During the freeze we see in the Task Manager that the application is using around 5 GB, while total used memory is around 40 GB and the machine has 64 GB available. The Task Manager also shows CPU activity (around 4% on a machine with 24 virtual CPUs).

We have taken dumps and used various tools. Every tool indicates that the process is stuck in a GC cycle. We can analyze the dump using dotnet-dump; the output of clrstack -all is in the attached file.

clrstack.txt
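
For reference, a minimal sketch of the dotnet-dump workflow that produces this kind of output (the PID and dump file name are placeholders):

dotnet-dump collect -p <pid>
dotnet-dump analyze <dump-file>
> clrstack -all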

Reproduction Steps

No controlled reproduction available.

Expected behavior

We expect the garbage collection to complete and not freeze the application.

Actual behavior

The application is unresponsive

Regression?

No response

Known Workarounds

We are decreasing the number of threads launched. With 8 threads, the freeze happens much less often.

Configuration

.NET Runtime 8.0.8
Windows 11
x64

Other information

No response

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Sep 23, 2024
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Sep 23, 2024
@mangod9 mangod9 added this to the Future milestone Sep 23, 2024
@mangod9
Member

mangod9 commented Sep 23, 2024

Hello @FlorianRoudautOrchestrade, can you please indicate which thread you feel is causing the deadlock? I don't see any indication in the stacks attached. Also, is the freeze transient, or once you hit it does it never unfreeze?

@FlorianRoudautOrchestrade
Author

Hello,

We have seen no clue indicating a deadlock. Most of the threads are shown as "Active" when opening the dump in Visual Studio. It's just that they are all waiting for GC. When we take other dumps a bit later (could be 10 seconds or 10 minutes later), all the threads are in the exact same stack trace.

The freeze does not stop on its own; we've left the application running for hours and it did not unfreeze.

@markples
Member

Can you please elaborate on "Every tool indicates that the process is stuck in a GC cycle"? Are you seeing a single GC spin effectively forever, or are there repeated GCs occurring? To get more insight into the GC behavior, a gc collect only trace would be very helpful; please see https://github.com/Maoni0/mem-doc/blob/master/doc/.NETMemoryPerformanceAnalysis.md#how-to-collect-top-level-gc-metrics.

@FlorianRoudautOrchestrade
Author

To reply to the second question, "Are you seeing a single GC spin effectively forever, or are there repeated GCs occurring?":
We reduced the number of threads used by the process this weekend (from 32 to 8). Since then we have not had any freezes and we are not able to reproduce on our machines. My point being that I cannot use PerfView to collect more information now. The only thing we can see is from the dumps of previous freezes, and when taking multiple dumps of the same frozen process at multiple points in time, the threads are in the exact same position.

To reply "Every tool indicates that the process is stuck in a GC cycle" , I've attached the debug diag analysis
dump_20240913_171719_CrashHangAnalysis.pdf

We can see numerous threads in :

ntdll!NtWaitForSingleObject+14
KERNELBASE!WaitForSingleObjectEx+8e
coreclr!CLREventWaitHelper+20 [D:\a_work\1\s\src\coreclr\vm\synch.cpp @ 397 + 6]
coreclr!WKS::GCHeap::WaitUntilGCComplete+2e [D:\a_work\1\s\src\coreclr\gc\gcee.cpp @ 287 + 18]
coreclr!Thread::RareDisablePreemptiveGC+1c5 [D:\a_work\1\s\src\coreclr\vm\threadsuspend.cpp @ 2172]

or

ntdll!NtWaitForSingleObject+14
KERNELBASE!WaitForSingleObjectEx+8e
coreclr!CLREventWaitHelper+20 [D:\a_work\1\s\src\coreclr\vm\synch.cpp @ 397 + 6]
coreclr!WKS::GCHeap::WaitUntilGCComplete+2e [D:\a_work\1\s\src\coreclr\gc\gcee.cpp @ 287 + 18]
coreclr!Thread::RareDisablePreemptiveGC+1c5 [D:\a_work\1\s\src\coreclr\vm\threadsuspend.cpp @ 2172]
coreclr!Thread::UserSleep+10a [D:\a_work\1\s\src\coreclr\vm\threads.cpp @ 4294 + 4d]
coreclr!ThreadNative::Sleep+9b [D:\a_work\1\s\src\coreclr\vm\comsynchronizable.cpp @ 469 + 24]
System.Threading.Thread.Sleep(Int32)+15
OrchestradeCommon.MarketData.Streaming.

or

ntdll!NtWaitForSingleObject+14
KERNELBASE!WaitForSingleObjectEx+8e
coreclr!CLREventWaitHelper+20 [D:\a_work\1\s\src\coreclr\vm\synch.cpp @ 397 + 6]
coreclr!WKS::gc_heap::wait_for_gc_done+45 [D:\a_work\1\s\src\coreclr\gc\gc.cpp @ 13072 + 15]
coreclr!WKS::gc_heap::try_allocate_more_space+5b4b1 [D:\a_work\1\s\src\coreclr\gc\gc.cpp @ 16882]
coreclr!WKS::gc_heap::allocate_more_space+31 [D:\a_work\1\s\src\coreclr\gc\gc.cpp @ 17450 + 11]
coreclr!WKS::GCHeap::Alloc+84 [D:\a_work\1\s\src\coreclr\gc\gc.cpp @ 43681 + 17]
coreclr!JIT_New+318 [D:\a_work\1\s\src\coreclr\vm\jithelpers.cpp @ 2324 + 237]

Each time, there is a function in the stack mentioning waiting for GC to complete.

@markples
Member

I don't see any thread actually performing GC, so this seems to be in thread suspension for GC or something interfering with that. One idea could be to look at the threads that are not waiting for GC to see if there is anything in the app. I suspect that for us to find anything more, we would need to look at the dumps if that is possible - if so, could you please share some from a few different freezes via https://developercommunity.visualstudio.com/dotnet? This would let us look for more details, and seeing more than one might expose things in common between them.

@Maoni0
Member

Maoni0 commented Sep 24, 2024

the recommendation from DebugDiag just says

When a GC is running the .NET objects are not in a valid state and the reported analysis may be inaccurate. Also, the thread that triggered the Garbage collection may or may not be a problematic thread. Too many garbage collections in a process are bad for the performance of the application. Too many GC's in a process may indicate a memory pressure or symptoms of fragmenation.

this is not saying "this process is stuck in a GC cycle". it just says too many GCs are bad for perf in a process.

it's very understandable you'd see threads waiting for GC to complete because a GC probably did just happen - you have 65 threads, most of which are managed threads, and one GC heap since you are using Workstation GC. so you probably were triggering lots of GCs (but DebugDiag does not have this knowledge because it doesn't have a way to know that from a dump).

and when you were capturing the dump, the profiler (JetBrains) is trying to enumerate threads in this process and it's waiting on the thread store lock which is probably still held by the thread that last triggered a GC -

ntdll!NtWaitForAlertByThreadId+14
ntdll!RtlpWaitOnAddressWithTimeout+81
ntdll!RtlpWaitOnAddress+ae
ntdll!RtlpWaitOnCriticalSection+fd
ntdll!RtlpEnterCriticalSectionContended+1c4
ntdll!RtlEnterCriticalSection+42
coreclr!CrstBase::Enter+5a [D:\a\_work\1\s\src\coreclr\vm\crst.cpp @ 328]
coreclr!ThreadSuspend::LockThreadStore+90 [D:\a\_work\1\s\src\coreclr\vm\threadsuspend.cpp @ 1885]
coreclr!ProfilerThreadEnum::Init+25 [D:\a\_work\1\s\src\coreclr\vm\profilingenumerators.cpp @ 527 + 13]
coreclr!ProfToEEInterfaceImpl::EnumThreads+136 [D:\a\_work\1\s\src\coreclr\vm\proftoeeinterfaceimpl.cpp @ 10158 + 8]
JetBrains_Profiler_Core!DllGetClassObject+1ffd3
JetBrains_Profiler_Core!DllGetClassObject+1f2c3
JetBrains_Profiler_Core!DllGetClassObject+1fcf1
JetBrains_Profiler_Core!DllGetClassObject+1ecd7
coreclr!EEToProfInterfaceImpl::ProfilerAttachComplete+d1 [D:\a\_work\1\s\src\coreclr\vm\eetoprofinterfaceimpl.cpp @ 2859 + 14] 

this could mean you happened to have just triggered a GC which landed you in this situation, or you were triggering lots of GCs (which was most likely your case since you mentioned your time in GC is so high) so it's very likely that every time you capture a dump you will see something similar. this is where a trace would be useful if you still have the repro. but from just these callstacks I don't see anything unusual, especially for a process that is doing lots of GCs.

@FlorianRoudautOrchestrade
Author

Thank you both for having a look.

@markples I've created this case and uploaded the dumps to it: https://developercommunity.visualstudio.com/t/Application-freezes-during-GC-github-ca/10753082

@Maoni0 We take the point that having a lot of GCs is bad. Our code allocates a fair amount of memory, especially when doing computations on 24 or 32 threads, but that's the best way we found to minimize computation time. It can run fine for hours or days.

The issue is that in this case the WinForms application is Not Responding, and when we take dumps at different times, the stacks are exactly the same, so we have ruled out that a GC completed, calculation resumed, and a new GC happened to be running when we took the next dump.

We tried to trace what is happening during the freeze, but dotnet-trace gave a TCP timeout exception and dotTrace was not able to attach to the process in either Timeline or Sampling mode.

Do you have an idea on what tool we could use to analyze further?

@markples
Member

The most useful thing would be a trace so that the history up to the freeze can be seen. From the link above, you can use the command perfview /GCCollectOnly /AcceptEULA /nogui collect. Launch it before or early in the process, then stop it with s after the process has been frozen for a while. The file will be smaller than a dump. Note that by default this will capture events from all processes on the machine, so things like process names will be shared.
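
A minimal sketch of that collection flow, assuming PerfView.exe is on the PATH of the affected machine:

PerfView.exe /GCCollectOnly /AcceptEULA /nogui collect
(let the application run until it freezes, then leave it frozen for a while)
(press S in the PerfView console window to stop collecting and write the trace file)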

@FlorianRoudautOrchestrade
Author

Hello,

Thank you. I'll check with the users to set that up. It may take a couple of weeks

@jkotas
Member

jkotas commented Sep 26, 2024

In dumps that you have shared, the thread that is expected to complete the GC is blocked by:

    Image path: C:\Program Files\SentinelOne\Sentinel Agent 23.2.3.358\InProcessClient64.dll
    Image name: InProcessClient64.dll
    Timestamp:        Mon Sep 18 04:27:23 2023 (6508341B)
    File version:     23.2.3.358
    Product version:  23.2.3.358
    Information from resource tables:
        CompanyName:      Sentinel Labs, Inc.
        ProductName:      Sentinel Agent

The code in this .dll intercepted execution of the GC thread and it is spinning in some sort of spinlock.

This looks like a bug in SentinelOne software. To validate this theory, you can try disabling the SentinelOne software and see whether the hang will go away. You may want to contact SentinelOne support to diagnose it further.

@jkotas jkotas added the tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly label Sep 26, 2024
@FlorianRoudautOrchestrade
Author

Thank you very much Jan. How did you spot that?

@jkotas
Member

jkotas commented Sep 26, 2024

To find the thread that is expected to execute the GC, open the crash dump in windbg debugger, run dt coreclr!ThreadStore::s_pThreadStore command. m_HoldingThread is the thread that is expected to execute the GC.

The stacktrace of the thread that is expected to execute the GC as shown by the windbg debugger is incomprehensible since there are no symbols and stackunwinding information for InProcessClient64:

0:047> k
 # Child-SP          RetAddr               Call Site
00 00000069`3a57bfa0 00007ffe`5489c941     InProcessClient64+0x6d082
01 00000069`3a57bfe0 00007ffe`5489b6a8     InProcessClient64+0x6c941
02 00000069`3a57c020 00007ffe`549d5332     InProcessClient64+0x6b6a8
03 00000069`3a57c0c0 00007ffe`577d0d14     InProcessClient64+0x1a5332
04 00000069`3a57c0c8 00000069`3a57c110     ntdll!ZwSuspendThread+0x14
05 00000069`3a57c0d0 00000001`ffffffff     0x00000069`3a57c110
06 00000069`3a57c0d8 00000000`00000000     0x00000001`ffffffff

@FlorianRoudautOrchestrade
Author

Thank you very much Jan

@FlorianRoudautOrchestrade
Author

Hello,

We have told the client to whitelist the application, so that it's not scanned by Sentinel. They still had the freeze.

In the latest dump, we still see the UI thread in WaitUntilGCComplete.

Image

I tried to look up the m_HoldingThread, but I could not find any thread with this id.

Regards,
Florian

@jkotas
Member

jkotas commented Oct 9, 2024

m_HoldingThread is the runtime Thread pointer. Try to run !threads command and match the m_HoldingThread value with ThreadOBJ column.
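
As an illustration, a heavily abbreviated windbg/SOS session for those two steps might look like the following (addresses, offsets, and thread IDs are placeholders, and the exact output formatting differs between versions):

0:000> dt coreclr!ThreadStore::s_pThreadStore
0x000001d2`30001000 ThreadStore
0:000> dt coreclr!ThreadStore 0x000001d2`30001000
   ...
   m_HoldingThread : 0x000001d2`34567890 Thread
0:000> !threads
   ...
   DBG  ID  OSID  ThreadOBJ         ...
   47   12  4f3c  000001d234567890  ...   <-- ThreadOBJ matches m_HoldingThread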

@rickbrew
Contributor

rickbrew commented Nov 13, 2024

Looks like this is affecting Paint.NET 5.1 on .NET 9.0.0. I'm getting reports of it hanging at app startup, and the dump I looked at shows that it's parked on PeekMessageW which my code is calling, not the runtime. SentinelOne's DLL shows up in the list of loaded modules. One report included a screenshot of Task Manager and PDN is burning at least 1 core of CPU (so probably livelock/spinlock).

Is there any known reliable workaround, like disabling Tiered Compilation or ReadyToRun or something?
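
For reference, both of those can be turned off per process with documented runtime environment variables; this is noted only because the question raises them, and nothing in this thread establishes that they affect the SentinelOne interaction:

set DOTNET_TieredCompilation=0
set DOTNET_ReadyToRun=0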

@rickbrew
Contributor

Link to discussion about this over on the Paint.NET forum: https://forums.getpaint.net/topic/131939-paintnet-51-not-starting-properly/

@jkotas
Member

jkotas commented Nov 14, 2024

the dump I looked at shows that it's parked on PeekMessageW which my code is calling, not the runtime. SentinelOne's DLL shows up in the list of loaded modules

Did you see a thread spinning inside InProcessClient64.dll with a callstack like #108144 (comment)?

Is there any known reliable workaround, like disabling Tiered Compilation or ReadyToRun or something?

My guess is that the hang in SentinelOne got exposed by changes in GC thread suspension. There have been quite a few changes around GC thread suspension to make it simpler, more scalable and compatible with CET.

For example, #99076 may be related (just speculation). If you would like to go on an adventure to prove or disprove this theory, you can try setting s_pfnQueueUserAPC2Proc to NULL (either by reaching into coreclr.dll and setting the static variable to zero, or by detouring GetProcAddress) to see whether it makes the hang stop reproducing.
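
A heavily abbreviated sketch of the first option, for a live process with a debugger attached (this assumes the coreclr symbols from the Microsoft symbol server resolve the static; addresses are placeholders and this is purely an experiment, not a supported workaround):

0:000> x coreclr!*s_pfnQueueUserAPC2Proc*
00007ffe`xxxxxxxx coreclr!s_pfnQueueUserAPC2Proc = <pointer>
0:000> eq coreclr!s_pfnQueueUserAPC2Proc 0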

@rickbrew
Contributor

rickbrew commented Nov 14, 2024

Did you see a thread spinning inside InProcessClient64.dll with a callstack like #108144 (comment)?

Yeah, I see on the UI thread: (this is from VS looking at the DMP)

 	win32u.dll!NtUserPeekMessage()	Unknown
 	user32.dll!_PeekMessage()	Unknown
 	user32.dll!PeekMessageW()	Unknown
 	InProcessClient64.dll!00007ff857c60b79()	Unknown
 	InProcessClient64.dll!00007ff857c60be8()	Unknown
 	InProcessClient64.dll!00007ff857c60abc()	Unknown
 	[Managed to Native Transition]	
	PaintDotNet.SystemLayer.dll!PaintDotNet.SystemLayer.UIHelpers.BeginModalUIFrameCore(System.Windows.Forms.IWin32Window owner, bool useWaitCursor, System.Action<PaintDotNet.SystemLayer.IModalUIFrame> continuationFn) Line 1084	C#

The BeginModalUIFrameCore method at the bottom is calling into a P/Invoke for PeekMessageW.
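
For context, such a P/Invoke is typically declared along these lines (an illustrative C# sketch with assumed names and layout, not the actual Paint.NET declaration):

using System;
using System.Runtime.InteropServices;

internal static class NativeMethods
{
    // Managed mirror of the native MSG structure (POINT flattened into two ints).
    [StructLayout(LayoutKind.Sequential)]
    internal struct MSG
    {
        public IntPtr hwnd;
        public uint message;
        public IntPtr wParam;
        public IntPtr lParam;
        public uint time;
        public int ptX;
        public int ptY;
    }

    // PM_REMOVE removes the peeked message from the queue.
    internal const uint PM_REMOVE = 0x0001;

    [DllImport("user32.dll", ExactSpelling = true)]
    internal static extern bool PeekMessageW(out MSG lpMsg, IntPtr hWnd,
        uint wMsgFilterMin, uint wMsgFilterMax, uint wRemoveMsg);
}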

This might not even be related to GC changes in .NET 9 -- I say this because you mention QueueUserAPC2. I recently started using QueueUserAPC (non-2) for some Task awaiters (for a rather specific reason), so I wonder if Sentinel One borked both of those detours.

Oh, and the ".NET System Events" thread has this, although I can't tell if it's also snagged or doing just fine:

	win32u.dll!NtUserGetMessage()	Unknown
 	user32.dll!GetMessageW()	Unknown
 	InProcessClient64.dll!00007ff857c5f116()	Unknown
 	InProcessClient64.dll!00007ff857c5f065()	Unknown
 	Microsoft.Win32.SystemEvents.dll!00007ff801026979()	Unknown
 	[Managed to Native Transition]	
 	Microsoft.Win32.SystemEvents.dll!Microsoft.Win32.SystemEvents.WindowThreadProc()	Unknown
 	[Native to Managed Transition]	
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

@rickbrew
Contributor

QueueUserAPC2 requires Win11, so I don't think that's the issue. I'm getting reports from all supported Windows versions (Win10 v21H2+).

@rickbrew
Contributor

Well, cross your fingers

https://x.com/ATGardner/status/1857192263564296646

Image
