Replies: 5 comments 2 replies
-
That is to be expected. The computation for a single token has a part with constant runtime (the weights) and a part with runtime proportional to the context depth (the attention). So as the context fills up, more and more computation needs to be done for each token and the speed decreases. Not using -nkvo should help, since the KV cache (and with it the attention work) then stays on the GPU.
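To make that concrete with a rough model (the numbers below are made up for illustration, not measured):

$$ t_\text{token}(n) \approx t_\text{weights} + t_\text{attn} \cdot n, \qquad \text{tokens/s} = \frac{1}{t_\text{token}(n)} $$

With a hypothetical $t_\text{weights} = 40\,\text{ms}$ and $t_\text{attn} = 0.005\,\text{ms}$ per context token, you would get about 25 tokens/s on an empty context but only about 12.5 tokens/s once 8000 tokens of conversation have accumulated, which matches the "each query is a bit slower than the last" pattern.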
-
I have been unable to load the model fully without using -nkvo and -fa. I get a malloc error with a Qwen model whose ctx is 40960. llama-server will load the model with up to about -c 16384, but says the entire model will not be used. I am loading it across two 16 GB GPUs, and I guess the KV cache ends up in CPU RAM. I am new to this. Is there a way to get it to load entirely on the GPUs to solve it? Another command-line switch config? Thanks
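For what it's worth, here is a sketch of the switches that control how much VRAM the weights and the KV cache take. The model file name is an assumption, and whether this actually fits in 2×16 GB depends on the model size and quantization:

```
# Hypothetical sketch only: the model file name is assumed and none of these
# values are taken from the post above.
# -ngl 99              offload all layers to the GPUs
# -ts 1,1              split the weights evenly across the two cards
# -c 16384             a smaller context means a smaller KV cache
# -ctk q8_0 -ctv q8_0  quantize the KV cache (the quantized V cache needs -fa)
llama-server -m qwen-model.gguf -ngl 99 -ts 1,1 -c 16384 -ctk q8_0 -ctv q8_0 -fa
```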
-
Is there a way to get the KV cache to reset at a certain threshold? Some way to get llama-server to clear it without starting a new conversation or restarting llama-server?
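A possibly related sketch (not verified against this setup; the endpoint and port below are assumptions based on my reading of the llama-server docs): the server exposes a slots endpoint that can erase the cache of a single slot over HTTP.

```
# Assumes the server listens on the default port 8080 and the conversation is
# in slot 0; depending on the build, the slots endpoint may need to be enabled.
curl -X POST "http://localhost:8080/slots/0?action=erase"
```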
-
I think I figured it out, but I'm not 100% sure. I am seeing a massive increase in tp/s per query now.
-
On an official Qwen with the 40960 context at bf16 I still see slower and slower responses, but this model: https://huggingface.co/DavidAU/Qwen3-The-Xiaolong-Omega-Directive-22B-uncensored-abliterated-GGUF is like the online models: it replies instantly. Hope this helps someone; I wish there were more discussions about this kind of stuff for newbies.
-
Hi,
after running llama-server -m model.gguf -nkvo -fa
I can get the model I want to load fully, but with every query the tp/s
slowly diminishes.
1st query 20 tps, 2nd query 18 tps, 3rd query 16 tps and so on.
Any idea how to fix it? Is this normal?
Thanks
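For context, the two switches in that command are the ones that matter for the slowdown; as I understand them (worth double-checking against llama-server --help for your build):

```
# -nkvo (--no-kv-offload): keep the KV cache in system RAM instead of VRAM,
#   so the per-token attention over the past context is not done on the GPU
# -fa (--flash-attn): enable the flash-attention kernels
llama-server -m model.gguf -nkvo -fa
```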