Concurrent requests throw memory exception #1033
-
Creating a new context doesn't load the entire model again. Re-using the same context for multiple conversations is slightly better (there is some overhead per-context). The problem with your implementation is that you're never clearing the context, so everything that has previously been said is still in the KV cache. Adding in a call to KvCacheClear at the end of each conversation should fix that.
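
For illustration, a minimal sketch of that pattern with LLamaSharp: one context re-used across conversations and cleared between them. The model path, parameter values, and the `RunConversationAsync` helper are placeholders, not code from this thread.

```csharp
using System.Text;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

var parameters = new ModelParams("path/to/phi-3.5-mini.gguf") // placeholder path
{
    ContextSize = 4096,   // placeholder values
    GpuLayerCount = 32
};

// The expensive part: loading the weights. Done once.
using var weights = LLamaWeights.LoadFromFile(parameters);

// One context, re-used for every conversation.
using var context = weights.CreateContext(parameters);

async Task<string> RunConversationAsync(string prompt) // hypothetical helper
{
    var executor = new InteractiveExecutor(context);
    var sb = new StringBuilder();
    await foreach (var token in executor.InferAsync(prompt, new InferenceParams { MaxTokens = 256 }))
        sb.Append(token);

    // Clear the KV cache so the next conversation starts from an empty context.
    context.NativeHandle.KvCacheClear();
    return sb.ToString();
}
```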
-
Thanks for the clarification, so I have two options:

1. re-use the same static context and clear the KV cache between conversations, or
2. keep the weights static and create a new context for each request.

For the second case I have made the necessary changes: I keep the weights static and create a new context each time, but I observed the following. The server starts with 0.0 GB of GPU memory and 17.0 GB of RAM in use. Of course, when a request finishes, the GPU and RAM reserved for that request are released by calling context_llama.NativeHandle.KvCacheClear() at the end, as you said. According to my results the per-context overhead is too big, which means the GPU and RAM allow only very limited parallel requests, unless I am doing something wrong.
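
For comparison, a minimal sketch of that second option: weights held in a static field, a fresh context per request, disposed when the request completes. `LlmService` and `HandleRequestAsync` are hypothetical names; the path and parameter values are placeholders.

```csharp
using System.Text;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

public static class LlmService // hypothetical holder class
{
    private static readonly ModelParams Params = new("path/to/phi-3.5-mini.gguf") // placeholder
    {
        ContextSize = 4096,   // placeholder values
        GpuLayerCount = 32
    };

    // Loaded once, shared by all requests.
    private static readonly LLamaWeights Weights = LLamaWeights.LoadFromFile(Params);

    public static async Task<string> HandleRequestAsync(string prompt) // hypothetical handler
    {
        // Each request gets its own context; disposing it releases the
        // per-context GPU/RAM reservation when the request completes.
        using var context = Weights.CreateContext(Params);
        var executor = new InteractiveExecutor(context);

        var sb = new StringBuilder();
        await foreach (var token in executor.InferAsync(prompt, new InferenceParams { MaxTokens = 256 }))
            sb.Append(token);
        return sb.ToString();
    }
}
```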
-
I have a Blazor Hosted WebAssembly app in .NET 8 and I am testing LLamaSharp with Phi-3.5-mini. First results seem fine: a question gets its answer. But then some problems occur in more complex scenarios:

Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

I use the same static context because it seems each new context loads the model again and multiplies the consumption of RAM and GPU memory. Below is a sample of the code:
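
A minimal sketch of the pattern described here, assuming a single static context shared by every request and never cleared; `SharedLlm`, `AskAsync`, the model path, and the parameter values are all hypothetical, not the original snippet. Concurrent requests racing on one context, with old conversations left in its KV cache, is the failure mode the replies above point at.

```csharp
using System.Text;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

public static class SharedLlm // hypothetical holder class
{
    private static readonly ModelParams Params = new("path/to/phi-3.5-mini.gguf") // placeholder
    {
        ContextSize = 4096 // placeholder value
    };
    private static readonly LLamaWeights Weights = LLamaWeights.LoadFromFile(Params);

    // One static context shared by all requests: concurrent calls race on the
    // same KV cache, and previous conversations are never cleared out of it.
    private static readonly LLamaContext Context = Weights.CreateContext(Params);

    public static async Task<string> AskAsync(string prompt) // hypothetical endpoint handler
    {
        var executor = new InteractiveExecutor(Context);
        var sb = new StringBuilder();
        await foreach (var token in executor.InferAsync(prompt, new InferenceParams { MaxTokens = 256 }))
            sb.Append(token);
        return sb.ToString(); // no KvCacheClear and no locking between requests
    }
}
```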