How to make token generation deterministic? #16428

ThiloteE · 2025-10-05T08:50:45Z


For testing and benchmarking purposes, I want to get the same response every time.

I tried the following commands:

llama-server 8080 --top-k 1 --n-predict 128 --reasoning-budget 0 --threads -1 --jinja --flash-attn auto --cache-type-k q8_0 --cache-type-v q8_0
llama-server 8080 --temp 0 --n-predict 128 --reasoning-budget 0 --threads -1 --jinja --flash-attn auto --cache-type-k q8_0 --cache-type-v q8_0

I was hoping that forcing top-k to 1 or temp to 0 would do the trick.

Unfortunately, when I press the "regenerate" button in the llama-server GUI, I still get varying responses.
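To take the GUI out of the picture, the same comparison can be scripted against the server directly. This is only a minimal sketch of a reproduction check, not a fix. It assumes the server is reachable at a base URL such as `http://127.0.0.1:8080` and that the `/completion` endpoint accepts the `prompt`, `n_predict`, `temperature`, `top_k`, `seed`, and `cache_prompt` fields and returns the generated text under `content`, as I understand the server documentation. The helper names (`build_payload`, `complete`, `first_divergence`, `check_determinism`) are made up for illustration.

```python
import json
import urllib.request


def build_payload(prompt, n_predict=128, seed=42):
    """Build a greedy-sampling request body (field names assumed from the server docs)."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "temperature": 0.0,     # same intent as --temp 0
        "top_k": 1,             # same intent as --top-k 1
        "seed": seed,           # fixed seed, in case any sampling still happens
        "cache_prompt": False,  # assumption: avoids reusing a cached prompt between runs
    }


def complete(base_url, payload, timeout=120):
    """POST the payload to <base_url>/completion and return the generated text."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["content"]


def first_divergence(a, b):
    """Return the index of the first differing character, or None if a == b."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def check_determinism(base_url, prompt, runs=3):
    """Send the same request several times and compare each output to the first one."""
    payload = build_payload(prompt)
    outputs = [complete(base_url, payload) for _ in range(runs)]
    return [first_divergence(outputs[0], other) for other in outputs[1:]]
```

With the server running, `check_determinism("http://127.0.0.1:8080", "Hello")` returns one entry per repeated run: `None` where the output matched the first run, otherwise the character offset at which the two outputs start to differ. A late offset would hint at small numerical differences accumulating during generation rather than at the sampler settings being ignored.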

System info:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
build: 6683 (946f71ed) with MSVC 19.41.34120.0 for x64
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

I found some related (old) issues:
