Replies: 5 comments 2 replies
-
That is to be expected. The computation for a single token has a part with constant runtime (the weights) and a part with runtime proportional to the context depth (the attention). So as the context fills up, more and more computation needs to be done for each token and the speed decreases. Not using -nkvo should help, since the KV cache (and with it the attention work) then stays on the GPU.
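To make that concrete with a rough model (the numbers below are made up for illustration, not measured):

$$ t_\text{token}(n) \approx t_\text{weights} + t_\text{attn} \cdot n, \qquad \text{tokens/s} = \frac{1}{t_\text{token}(n)} $$

With a hypothetical $t_\text{weights} = 40\,\text{ms}$ and $t_\text{attn} = 0.005\,\text{ms}$ per context token, you would get about 25 tokens/s on an empty context but only about 12.5 tokens/s once 8000 tokens of conversation have accumulated, which matches the "each query is a bit slower than the last" pattern.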
-
I have been unable to load the model fully without using -nkvo and -fa. I get a malloc error with a Qwen model whose ctx is 40960. llama-server will load the model with up to about -c 16384, but says the entire model will not be used. I am loading it across two 16 GB GPUs, and I guess the KV cache ends up in CPU RAM. I am new to this. Is there a way to get it to load entirely on the GPUs to solve it? Another command-line switch config? Thanks
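For what it's worth, here is a sketch of the switches that control how much VRAM the weights and the KV cache take. The model file name is an assumption, and whether this actually fits in 2×16 GB depends on the model size and quantization:

```
# Hypothetical sketch only: the model file name is assumed and none of these
# values are taken from the post above.
# -ngl 99              offload all layers to the GPUs
# -ts 1,1              split the weights evenly across the two cards
# -c 16384             a smaller context means a smaller KV cache
# -ctk q8_0 -ctv q8_0  quantize the KV cache (the quantized V cache needs -fa)
llama-server -m qwen-model.gguf -ngl 99 -ts 1,1 -c 16384 -ctk q8_0 -ctv q8_0 -fa
```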
-
Is there a way to get the KV cache to reset at a certain threshold? Some way to get llama-server to clear it without starting a new conversation or restarting llama-server?
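A possibly related sketch (not verified against this setup; the endpoint and port below are assumptions based on my reading of the llama-server docs): the server exposes a slots endpoint that can erase the cache of a single slot over HTTP.

```
# Assumes the server listens on the default port 8080 and the conversation is
# in slot 0; depending on the build, the slots endpoint may need to be enabled.
curl -X POST "http://localhost:8080/slots/0?action=erase"
```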
-
I think I figured it out, but I'm not 100% sure. I am seeing a massive increase in tp/s per query now.
-
On an official Qwen with the 40960 context at bf16 I still see slower and slower responses, but this model: https://huggingface.co/DavidAU/Qwen3-The-Xiaolong-Omega-Directive-22B-uncensored-abliterated-GGUF is like the online models: it replies instantly. Hope this helps someone; I wish there were more discussions about this kind of stuff for newbies.
-
Hi,
after running llama-server -m model.gguf -nkvo -fa
I can get the model I want to load fully, but with every query the tp/s
slowly diminishes.
1st query 20 tps, 2nd query 18 tps, 3rd query 16 tps and so on.
Any idea how to fix it? Is this normal?
Thanks
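For context, the two switches in that command are the ones that matter for the slowdown; as I understand them (worth double-checking against llama-server --help for your build):

```
# -nkvo (--no-kv-offload): keep the KV cache in system RAM instead of VRAM,
#   so the per-token attention over the past context is not done on the GPU
# -fa (--flash-attn): enable the flash-attention kernels
llama-server -m model.gguf -nkvo -fa
```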