
Conversation

@samfundev

@samfundev samfundev commented Jul 13, 2025

This is a very rough version of a proxy for KoboldCpp so that it can swap models for each request.

  • Specify models by name
  • Multimodal support
  • Try using reload_config instead of subprocessing
  • Possibly split out into a separate file?

Only text models are supported, but that'll be fixed as well. First step towards #1623.

This can currently be used with tools like Open WebUI to chat with multiple models.

@samfundev samfundev marked this pull request as draft July 13, 2025 20:12
@LostRuins
Owner

Hmm, I think:

  1. A proxy, if added, should be in a separate external file, instead of directly in KoboldCpp.py
  2. Why use subprocesses to open/close every time? Just use the admin API, which is already designed for switching models.

@henk717
Collaborator

henk717 commented Jul 14, 2025

Would a separate file play nice with the PyInstaller builds?

@LostRuins
Owner

LostRuins commented Jul 15, 2025

If communicating solely through the API, I don't see why not.

  1. Proxy accepts request, stalls the user
  2. Proxy calls /api/admin/reload_config to switch model and waits until model switched
  3. Proxy calls /api/v1/generate with request (or v1/chat/completions for openai mode)
  4. Proxy receives reply from KoboldCpp
  5. Proxy sends reply back to original requestor transparently.

This should be doable cleanly as an entirely separate program. However, SSE streaming will be more challenging.
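The five-step flow above might look roughly like this in Python. The `/api/admin/reload_config` route comes from this thread; the default port, the `filename` field in the request body, and the helper names are assumptions for illustration, not a documented contract:

```python
import json
import urllib.request

# Assumption: KoboldCpp's default port; adjust to the admin instance's address.
KOBOLD_URL = "http://localhost:5001"

def generate_endpoint(openai_mode):
    """Step 3: pick the native route, or the OpenAI-compatible one."""
    return "/v1/chat/completions" if openai_mode else "/api/v1/generate"

def _post_json(path, payload):
    """POST a JSON body and decode the JSON reply."""
    req = urllib.request.Request(
        KOBOLD_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def proxy_request(payload, config_name, openai_mode=False):
    """Steps 1-5: stall the caller, switch configs, forward, relay the reply.
    The {"filename": ...} body shape is an assumption about the admin API."""
    _post_json("/api/admin/reload_config", {"filename": config_name})  # step 2
    return _post_json(generate_endpoint(openai_mode), payload)         # steps 3-5
```

A real proxy would also need to wait for the switch to complete before step 3 (the race discussed later in this thread), and handle SSE streaming separately.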

@henk717
Collaborator

henk717 commented Jul 16, 2025

Either way it should be the regular koboldcpp setting it up. I don't want the mess of users having to start separate things for this feature. However we do it, it should be something the main koboldcpp launcher / CLI starts when admin mode is in use.

Personally I do think integrating it into koboldcpp.py makes sense.

@samfundev
Author

I've updated this to use the model name. I also implemented support for getting the list of models instead of just the currently active model. This was enough to get it working in Open WebUI. I've updated the original comment with a checklist.

@henk717
Collaborator

henk717 commented Aug 7, 2025

The idea is that it would show the configs the admin API already shows. You get multimodal support for free since it accepts kcpps files.

@samfundev
Author

I could be missing something, but I don't think config files give us multimodal support for free. If the goal is to keep at most one model loaded per modality, a config file either comes with a drawback (implementation 1) or requires splitting configs per modality (implementation 2).

There are two ways I can think of to implement multimodality:

  1. Have one server, and if that server doesn't have the right model, swap to one that does. The drawback: if you have multiple modalities loaded, you have to load/unload multiple models on every swap of the server.
  2. Split each modality into a separate server, so each modality can be swapped without affecting the others. The drawback: more overhead, since multiple servers are running, but it avoids the drawback of implementation 1.
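Implementation 2 could be sketched as a small routing table that maps each modality to its own server; the ports and modality names below are hypothetical:

```python
# Hypothetical registry for implementation 2: one KoboldCpp instance per
# modality, so swapping the text model never unloads the image model.
# Ports and modality names are made up for illustration.
SERVERS = {
    "text": "http://localhost:5001",
    "image": "http://localhost:5002",
}

def route(modality):
    """Return the base URL of the server that owns this modality."""
    try:
        return SERVERS[modality]
    except KeyError:
        raise ValueError(f"no server registered for modality {modality!r}") from None
```

The proxy would then only reload the config on the one server that owns the requested modality, leaving the others untouched.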

@samfundev
Author

> @LostRuins: Why use subprocesses to open/close every time? Just use the admin API, which is already designed for switching models.

I've just tried to implement it that way, but I ran into a problem: the API responds before the server has actually switched over. If I connect to the server right after the response, the old server is still active, so the request errors out when the connection gets closed. I added a sleep to wait until the old server closes, but this method feels unreliable.

@pqnet

pqnet commented Sep 13, 2025

> @LostRuins: Why use subprocesses to open/close every time? Just use the admin API, which is already designed for switching models.
>
> I've just tried to implement it that way, but I ran into a problem: the API responds before the server has actually switched over. If I connect to the server right after the response, the old server is still active, so the request errors out when the connection gets closed. I added a sleep to wait until the old server closes, but this method feels unreliable.

For this particular feature you could poll the model endpoint until it matches the model you want (or until you get "no model loaded" in case of an error).
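A minimal sketch of that polling idea, assuming the model endpoint is `/api/v1/model` returning `{"result": "<name>"}` (the endpoint path, response shape, and timeout values are assumptions). The poll loop takes the reader as a callable so it can tolerate the restart window where neither server accepts connections:

```python
import json
import time
import urllib.request

def fetch_current_model(base_url="http://localhost:5001"):
    """Read the active model name. Assumption: /api/v1/model
    returns a JSON body of the form {"result": "<model name>"}."""
    with urllib.request.urlopen(base_url + "/api/v1/model") as resp:
        return json.loads(resp.read())["result"]

def wait_for_model(fetch, target, timeout=60.0, interval=0.25):
    """Poll until fetch() reports `target`, or give up after `timeout` seconds.
    urllib's URLError subclasses OSError, so connection errors during the
    restart window are swallowed and the loop simply retries."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if fetch() == target:
                return True
        except OSError:
            pass  # old server shutting down, or new one not up yet
        time.sleep(interval)
    return False
```

Usage would be something like `wait_for_model(fetch_current_model, "my-model")` after calling `reload_config`, replacing the fixed sleep with a bounded retry loop.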

@samfundev
Author

Based on what @pqnet suggested, I've swapped over to the admin API.

@LostRuins LostRuins added the enhancement New feature or request label Oct 31, 2025