Prefix-cache default behavior and inference latency drop in OpenVINO GenAI 2025.3 #2790

@e950280

Environment

  • OS: Ubuntu 24.04
  • OpenVINO GenAI: 2025.3
  • CPU: Intel Core Ultra 5 125H

Steps to Reproduce

four_prompt_mutations_benchmark.py


Summary

I would like clarification on two points:

  1. Is prefix-cache enabled by default in 2025.3, and does it add optimizations beyond what a traditional KV-Cache does?
  2. Is the drop in average inference time that I observe after several runs expected behavior in OpenVINO GenAI?

Question 1: Prefix-cache default behavior

From what I observe, prefix-cache appears to be enabled by default in OpenVINO GenAI 2025.3, because the cache still seems to be reused in the following scenarios:

  1. Removing the first n tokens
  2. Adding extra characters at the end
  3. Removing n characters in the middle

Result: in each scenario the first run takes longer, while subsequent runs are noticeably faster. This differs from my understanding of a traditional KV-Cache, where any input mismatch leads to a new cache being created.
👉 Question: Does GenAI's KV-Cache apply additional optimizations, for example reusing cached entries based on a shared prefix?

Additionally, if I delete characters at random positions, every run becomes slower, as expected. Even so, the first run is sometimes significantly slower than the rest, and I do not fully understand why.
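To make the three scenarios concrete, here is a minimal sketch of how much of a prompt could be reused under strict prefix matching. It is only a toy model and not OpenVINO GenAI's actual implementation: the helper names `common_prefix_len` and `reusable_fraction` are hypothetical, and the token IDs are stand-in integers rather than real tokenizer output.

```python
from typing import Sequence


def common_prefix_len(cached: Sequence[int], new: Sequence[int]) -> int:
    """Count the leading tokens shared by the cached prompt and the new prompt."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def reusable_fraction(cached: Sequence[int], new: Sequence[int]) -> float:
    """Fraction of the new prompt covered by the shared prefix (0.0 if empty)."""
    if not new:
        return 0.0
    return common_prefix_len(cached, new) / len(new)


# Stand-in token IDs (assumption: not produced by a real tokenizer).
base = list(range(100))

mutations = {
    "remove first 10 tokens": base[10:],
    "append 5 tokens at the end": base + [1000, 1001, 1002, 1003, 1004],
    "remove 10 tokens in the middle": base[:50] + base[60:],
}

for name, prompt in mutations.items():
    shared = common_prefix_len(base, prompt)
    print(f"{name}: {shared} shared tokens, "
          f"{reusable_fraction(base, prompt):.2f} of the new prompt")
```

Under this model, removing tokens from the start leaves nothing to reuse from the original prompt, while appending or deleting in the middle keeps everything up to the first changed position. A repeated identical prompt always matches in full, which may be part of why only the first run of each mutated prompt is slow.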


Question 2: Change in average inference time

I also wrote a script that runs the same system_prompt + user_input 50 times and reports the average time per run:

  • On the very first script execution: average ~ 1.28s
  • After several executions: the average suddenly drops to ~ 1.05s and then stays consistently at about 1.0 s

👉 Question: Is this behavior caused by an internal mechanism of OpenVINO GenAI (e.g., warm-up, an optimized execution path, JIT compilation) or by lower-level CPU caching effects?
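One way to narrow this down is to report the first run, the warm-up runs, and the steady-state runs separately instead of a single 50-run average. The sketch below assumes the generation call can be wrapped in a zero-argument callable; `time_runs`, `summarize`, and `fake_generate` are hypothetical names, and `fake_generate` is a placeholder workload, not an OpenVINO GenAI call.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(fn: Callable[[], object], runs: int) -> List[float]:
    """Call fn `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float], warmup: int = 5) -> Dict[str, float]:
    """Split the timings into warm-up and steady-state parts and summarize both."""
    if warmup < 1 or len(durations) <= warmup:
        raise ValueError("need at least one warm-up run and one steady-state run")
    steady = durations[warmup:]
    return {
        "first": durations[0],
        "warmup_mean": statistics.mean(durations[:warmup]),
        "steady_mean": statistics.mean(steady),
        "steady_median": statistics.median(steady),
    }


def fake_generate() -> int:
    """Placeholder workload standing in for the real generation call."""
    return sum(range(10_000))


stats = summarize(time_runs(fake_generate, runs=50), warmup=5)
for key, value in stats.items():
    print(f"{key}: {value * 1e3:.3f} ms")
```

If the steady-state mean is stable within one process but differs between script executions, the cause is more likely outside the per-run loop (for example, process start-up or system-level caching) than a per-request warm-up.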


Results (Plots)

  • Timing curve plots (five images attached to the original issue)
