Description
Environment
- OS: Ubuntu 24.04
- OpenVINO GenAI: 2025.3
- CPU: Intel Core Ultra 5 125H
Steps to Reproduce
Run four_prompt_mutations_benchmark.py; a sketch of what the script does is included below.
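For context, this is roughly what the script does. A minimal sketch assuming a local OpenVINO model directory (model_dir is a placeholder) and the openvino_genai Python API; the mutation offsets are illustrative, not the exact ones from my script:

```python
import time
import openvino_genai as ov_genai

# Placeholder path to an exported OpenVINO model directory.
pipe = ov_genai.LLMPipeline("model_dir", "CPU")

base = "You are a helpful assistant. " + "Summarize the following text. " * 20

# Four variants of the same base prompt, mirroring the scenarios below.
prompts = {
    "original": base,
    "drop_prefix": base[50:],                # remove the first n characters
    "append_suffix": base + " EXTRA",        # add extra characters at the end
    "drop_middle": base[:100] + base[150:],  # remove n characters in the middle
}

for name, prompt in prompts.items():
    start = time.perf_counter()
    pipe.generate(prompt, max_new_tokens=32)
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```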
Summary
I would like clarification on two points:
- Is prefix-cache enabled by default in 2025.3, and does it apply additional optimizations beyond a traditional KV-Cache?
- Is the observed drop in average inference time after several runs an expected mechanism of OpenVINO GenAI?
Question 1: Prefix-cache default behavior
From my observation, prefix-cache appears to be enabled by default in OpenVINO GenAI 2025.3, because in the following scenarios I can still see the cache being reused:
- Removing the first n tokens
- Adding extra characters at the end
- Removing n characters in the middle
Result: the first run takes longer, while subsequent runs are noticeably faster → this differs from my understanding of a traditional KV-Cache, where any mismatch in the input leads to a new cache being created.
👉 Question: Does GenAI's KV-Cache apply additional optimizations, for example reuse based on prefix similarity?
Additionally, if I delete characters at random positions, every run is slower, as expected when no prefix can be reused; however, the first run is still sometimes significantly slower than the later ones, and I cannot fully explain this behavior.
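One way to isolate this would be to toggle prefix caching explicitly and compare timings. A minimal sketch, under the assumption that passing a SchedulerConfig to LLMPipeline selects the continuous-batching backend and that enable_prefix_caching controls the reuse (model_dir is again a placeholder); whether these defaults match 2025.3 is exactly what I am asking:

```python
import openvino_genai as ov_genai

# SchedulerConfig selects the continuous-batching backend; enable_prefix_caching
# controls whether KV blocks of a matching prompt prefix are reused.
scheduler_config = ov_genai.SchedulerConfig()
scheduler_config.enable_prefix_caching = False  # flip to True to compare timings

pipe = ov_genai.LLMPipeline("model_dir", "CPU", scheduler_config=scheduler_config)
print(pipe.generate("Hello", max_new_tokens=16))
```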
Question 2: Change in average inference time
I also wrote a script to run the same system_prompt + user_input 50 times and measure the average (a sketch of the loop follows the results below):
- On the very first script execution: average ~1.28 s
- After several executions: the average suddenly drops to ~1.05 s and then consistently stays around 1.0x s
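For reference, a minimal sketch of the measurement loop, with placeholder prompt text and model path standing in for my actual inputs:

```python
import time
import statistics
import openvino_genai as ov_genai

# Placeholder model directory, as in the sketches above.
pipe = ov_genai.LLMPipeline("model_dir", "CPU")

system_prompt = "You are a helpful assistant."  # placeholder text
user_input = "Explain KV-cache reuse briefly."  # placeholder text
prompt = system_prompt + "\n" + user_input

times = []
for _ in range(50):
    start = time.perf_counter()
    pipe.generate(prompt, max_new_tokens=32)
    times.append(time.perf_counter() - start)

print(f"avg: {statistics.mean(times):.3f}s  first run: {times[0]:.3f}s")
```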
👉 Question: Is this behavior due to OpenVINO GenAI’s internal mechanism (e.g., warm-up, optimization path, JIT compilation, or lower-level CPU caching effects)?
Results (Plots)
- Timing curve plots