Description
Environment
- OS: Ubuntu 24.04
- OpenVINO GenAI: 2025.3
- CPU: Intel Core Ultra 5 125H
Steps to Reproduce
Run four_prompt_mutations_benchmark.py; a sketch of what the script does is included below.
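For context, this is roughly what the script does. A minimal sketch assuming a local OpenVINO model directory (model_dir is a placeholder) and the openvino_genai Python API; the mutation offsets are illustrative, not the exact ones from my script:

```python
import time
import openvino_genai as ov_genai

# Placeholder path to an exported OpenVINO model directory.
pipe = ov_genai.LLMPipeline("model_dir", "CPU")

base = "You are a helpful assistant. " + "Summarize the following text. " * 20

# Four variants of the same base prompt, mirroring the scenarios below.
prompts = {
    "original": base,
    "drop_prefix": base[50:],                # remove the first n characters
    "append_suffix": base + " EXTRA",        # add extra characters at the end
    "drop_middle": base[:100] + base[150:],  # remove n characters in the middle
}

for name, prompt in prompts.items():
    start = time.perf_counter()
    pipe.generate(prompt, max_new_tokens=32)
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```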
Summary
I would like clarification on two points:
- Is prefix-cache enabled by default in 2025.3, and does it apply additional optimizations beyond a traditional KV-Cache?
- Is the observed drop in average inference time after several runs an expected mechanism of OpenVINO GenAI?
Question 1: Prefix-cache default behavior
From my observation, prefix-cache appears to be enabled by default in OpenVINO GenAI 2025.3, because in the following scenarios I can still see the cache being reused:
- Removing the first n tokens
- Adding extra characters at the end
- Removing n characters in the middle
Result: the first run takes longer, while subsequent runs are noticeably faster → this differs from my understanding of a traditional KV-Cache, where any mismatch in the input leads to a new cache being created.
👉 Question: Does GenAI's KV-Cache apply additional optimizations, for example reuse based on prefix similarity?
Additionally, if I delete characters at random positions, every run is slower, as expected when no prefix can be reused; however, the first run is still sometimes significantly slower than the later ones, and I cannot fully explain this behavior.
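One way to isolate this would be to toggle prefix caching explicitly and compare timings. A minimal sketch, under the assumption that passing a SchedulerConfig to LLMPipeline selects the continuous-batching backend and that enable_prefix_caching controls the reuse (model_dir is again a placeholder); whether these defaults match 2025.3 is exactly what I am asking:

```python
import openvino_genai as ov_genai

# SchedulerConfig selects the continuous-batching backend; enable_prefix_caching
# controls whether KV blocks of a matching prompt prefix are reused.
scheduler_config = ov_genai.SchedulerConfig()
scheduler_config.enable_prefix_caching = False  # flip to True to compare timings

pipe = ov_genai.LLMPipeline("model_dir", "CPU", scheduler_config=scheduler_config)
print(pipe.generate("Hello", max_new_tokens=16))
```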
Question 2: Change in average inference time
I also wrote a script to run the same system_prompt + user_input 50 times and measure the average (a sketch of the loop follows the results below):
- On the very first script execution: average ~1.28 s
- After several executions: the average suddenly drops to ~1.05 s and then consistently stays around 1.0x s
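For reference, a minimal sketch of the measurement loop, with placeholder prompt text and model path standing in for my actual inputs:

```python
import time
import statistics
import openvino_genai as ov_genai

# Placeholder model directory, as in the sketches above.
pipe = ov_genai.LLMPipeline("model_dir", "CPU")

system_prompt = "You are a helpful assistant."  # placeholder text
user_input = "Explain KV-cache reuse briefly."  # placeholder text
prompt = system_prompt + "\n" + user_input

times = []
for _ in range(50):
    start = time.perf_counter()
    pipe.generate(prompt, max_new_tokens=32)
    times.append(time.perf_counter() - start)

print(f"avg: {statistics.mean(times):.3f}s  first run: {times[0]:.3f}s")
```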
👉 Question: Is this behavior due to OpenVINO GenAI’s internal mechanism (e.g., warm-up, optimization path, JIT compilation, or lower-level CPU caching effects)?
Results (Plots)
- Timing curve plots