Description
Is your feature request related to a problem? Please describe.
Currently, there are two separate networked modes: federated (load balanced) and distributed (llama-cpp distributed inference). This split is quite wasteful, as it forces users to set up one load-balanced cluster for HA of non-llama-cpp workloads and a separate cluster for llama-cpp distributed inference. Another issue is that switching between the federated and distributed modes requires restarting the entire cluster, since the mode is fixed at startup.
Describe the solution you'd like
There should be just one networked mode that does load balancing for workloads by default, with a toggle that switches the llama-cpp backend between load balancing and distributed inferencing.
This way:
- GPUs can be shared between backends even when llama-cpp is in distributed mode
- llama-cpp mode can be changed without restarting all nodes
- there is only one configuration path for networked mode - currently there are separate (and outdated) instructions for the federated and distributed modes
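
To make the idea concrete, here is a purely hypothetical sketch of what a unified configuration could look like. None of these keys exist today; they are invented for illustration only:

```yaml
# Hypothetical unified networked-mode config (illustrative only;
# these keys are not part of any current configuration schema).
network:
  mode: clustered            # one networked mode for all backends
  backends:
    llama-cpp:
      # Runtime-switchable toggle, e.g. via API or CLI, without
      # restarting the cluster:
      #   load_balanced | distributed
      inference: distributed
```

The key point is that `inference` would be a per-backend, runtime-changeable setting rather than a cluster-wide, startup-defined mode, so GPUs stay shareable across backends and no full restart is needed to switch.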
I'm putting aside the issue that the distributed mode is currently broken and doesn't work at all.