Description
Is your feature request related to a problem? Please describe.
Currently, there are two separate networked modes: federated (load balanced) and distributed (llama-cpp distributed inference). This split is quite wasteful, as it forces users to set up one load-balanced cluster for HA of non-llama-cpp workloads and a separate cluster for llama-cpp distributed inference. Another issue is that switching between the federated and distributed modes requires restarting the entire cluster, since the mode is fixed at startup.
Describe the solution you'd like
There should be just one networked mode that does load balancing for workloads by default, with a toggle that switches the llama-cpp backend between load balancing and distributed inferencing.
This way:
- GPUs can be shared between backends even when llama-cpp is in distributed mode
- llama-cpp mode can be changed without restarting all nodes
- there is only one configuration path for networked mode - currently there are separate (and outdated) instructions for the federated and distributed modes
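
To make the idea concrete, here is a purely hypothetical sketch of what a unified configuration could look like. None of these keys exist today; they are invented for illustration only:

```yaml
# Hypothetical unified networked-mode config (illustrative only;
# these keys are not part of any current configuration schema).
network:
  mode: clustered            # one networked mode for all backends
  backends:
    llama-cpp:
      # Runtime-switchable toggle, e.g. via API or CLI, without
      # restarting the cluster:
      #   load_balanced | distributed
      inference: distributed
```

The key point is that `inference` would be a per-backend, runtime-changeable setting rather than a cluster-wide, startup-defined mode, so GPUs stay shareable across backends and no full restart is needed to switch.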
I'm putting aside the issue that the distributed mode is currently broken and doesn't work at all.