Weird issue with Completion + CancelRequests #336
-
I am not sure whether I am doing something wrong. Basically, it seems that Completion and CancelRequests must be strongly serialized to work around an issue in the current version (LLMUnity v2.5.0). I am issuing commands from the main thread, which should already serialize the operations. When I call cancel, the current completion blocks inside the native method:
The issue disappears if I change the code to this (which I know is not how it is supposed to work, but it touches the hotspot):
Without enforcing strong serialization, it seems that the Task.Run executes on another thread and overlaps the execution of LLM_Cancel. Without this guard, the editor hangs forever after stopping, which calls Destroy. The Destroy operation completes correctly, but the Completion task does not return. This leads to the classic "Reloading Domain" stuck issue in the Unity editor.

I noticed that the Cancel implementation on the native side is trivial. Completion is not, and may take several frames to complete, which explains the intention behind await Task.Run(). Using lock (staticLock) as I did for the test makes no sense: when you call cancel, you want the completion to terminate asynchronously. Otherwise a cancel will cause a stutter, which is not acceptable.

What do you think? Am I the only one seeing this? Am I using the wrong version?

Side notes: I am using multiple LLMs, but I ruled this out as a factor; it also happens with a single LLM object. Let me know what information can help reproduce this.

Also: the code that I am using seems to work much better. Besides, the exception handler on the native side uses a global, which would require strong static mutex serialization like the staticLock in my test. I believe sigjmp_buf point should be part of the LLM class in undreamai.h, so that calls do not share this symbol when concurrent slots or concurrent LLM objects fail simultaneously; otherwise a catastrophic error will happen. Just my 2 cents; I know it is a rare condition. I can do some tests on an empty project to build a test case.
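To illustrate the behaviour described above (cancel returns immediately and the completion winds down on its own thread, with no lock held for the whole completion), here is a minimal C++ sketch. It is not LLMUnity's or the native library's actual code: `CompletionSlot`, `run_completion`, `cancel`, and `demo_cancel_returns` are hypothetical names, and the sleep only stands in for decoding a token.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for one native completion slot (not the real type).
struct CompletionSlot {
    std::atomic<bool> cancel_requested{false};
    std::atomic<bool> finished{false};
    int tokens_generated = 0;  // written only by the worker thread
};

// Simulated completion loop: it checks the cancel flag between tokens,
// so it can stop early without the caller holding any lock.
void run_completion(CompletionSlot& slot, int max_tokens) {
    for (int i = 0; i < max_tokens; ++i) {
        if (slot.cancel_requested.load(std::memory_order_acquire)) {
            break;
        }
        // Pretend to decode one token.
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        ++slot.tokens_generated;
    }
    // Always mark the slot finished so a waiting caller can return.
    slot.finished.store(true, std::memory_order_release);
}

// Cancel only sets a flag; it never waits for the completion,
// so calling it from the main thread cannot cause a stutter.
void cancel(CompletionSlot& slot) {
    slot.cancel_requested.store(true, std::memory_order_release);
}

// Starts a long completion, cancels it shortly after, and reports whether
// the worker terminated before producing all of its tokens.
bool demo_cancel_returns() {
    CompletionSlot slot;
    const int max_tokens = 100000;
    std::thread worker(run_completion, std::ref(slot), max_tokens);
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    cancel(slot);
    worker.join();  // returns promptly because the loop saw the flag
    return slot.finished.load() && slot.tokens_generated < max_tokens;
}
```

The point of the design is that the completion path itself is responsible for marking the slot finished on every exit, including the cancelled one; if a cancel path skipped that step, the awaiting task would never return, which matches the hang described above.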
Replies: 2 comments 8 replies
-
Sorry for bringing up sigjmp_buf point; I only mentioned it for record keeping. I see that a correct, fully reentrant implementation is far from trivial. I am currently interested in understanding the first issue: why LLM_Completion does not return after cancel is called.
-
Thank you, wow, that's a really deep investigation ⭐!
I just fixed it and released a new version.
The solution is exactly what you suggested, using the same slot release process for both handle_cancel_action and stop_service (link)!
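For readers following along, the idea of routing both paths through one release routine can be sketched roughly as below. Only the names `handle_cancel_action` and `stop_service` come from this thread; `Slot`, `Server`, `release_slot`, `start_completion`, and `active_count` are hypothetical, and this is not the actual code from the fix.

```cpp
#include <vector>

// Hypothetical slot state; the real native structure is not shown in this thread.
struct Slot {
    int id = -1;
    bool processing = false;  // a completion is currently running in this slot
    bool released = false;    // the slot went through the shared release path
};

class Server {
public:
    explicit Server(int slot_count) {
        for (int i = 0; i < slot_count; ++i) {
            slots_.push_back(Slot{i, false, false});
        }
    }

    void start_completion(int id) {
        Slot& slot = slots_.at(id);
        slot.processing = true;
        slot.released = false;
    }

    // Cancelling one request uses the shared release path...
    void handle_cancel_action(int id) { release_slot(slots_.at(id)); }

    // ...and so does shutting down the whole service.
    void stop_service() {
        for (Slot& slot : slots_) {
            release_slot(slot);
        }
    }

    int active_count() const {
        int count = 0;
        for (const Slot& slot : slots_) {
            if (slot.processing) ++count;
        }
        return count;
    }

private:
    // Single release routine, so cancel and stop cannot drift apart.
    void release_slot(Slot& slot) {
        if (!slot.processing) return;  // idle slot: nothing to release
        slot.processing = false;
        slot.released = true;
    }

    std::vector<Slot> slots_;
};
```

With one release routine, a slot that was cancelled and a slot that was stopped end up in the same state, so a pending completion has a single, well-defined way to finish.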
Yes please, I would love such contributions because these things get a bit out of my depth to be honest.
Again I really appreciate this issue and your investigation!