Vector index merge time regression in 8.16.0 onwards, with MGLRU enabled in Linux kernel #124499
Labels
>bug
:Performance
Team:Performance
Elasticsearch Version
8.16.0+
Installed Plugins
No response
Java Version
bundled
OS Version
Linux kernel 6.3+
Problem Description
Starting from 8.16.0, on Linux kernel 6.3+ with MGLRU enabled (cat /sys/kernel/mm/lru_gen/enabled returns 0x0007), the vector index merge process might take a very long time to complete. This applies to non-quantized vector formats, like the one in the example below. Note that the default vector type has been quantized (int8_hnsw) since version 8.14.
The most characteristic symptom is that the Lucene merge thread is constantly calculating vector dot products.
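The OS-side conditions above can be checked with a short script. This is a minimal sketch, not part of the original report: the function names mglru_enabled and kernel_at_least are hypothetical, and it only inspects the kernel version and the MGLRU switch, not the Elasticsearch version or the vector format in use.

```shell
#!/usr/bin/env bash
# Sketch: check the OS-side conditions described in this issue.
# Function names are hypothetical; paths can be overridden for testing.

mglru_enabled() {
  # Prints "yes" if the lru_gen 'enabled' file holds a non-zero bitmask, else "no".
  local file="${1:-/sys/kernel/mm/lru_gen/enabled}"
  [ -r "$file" ] || { echo "no"; return; }
  local value
  value="$(cat "$file")"
  case "$value" in
    0x0000|0|"") echo "no" ;;
    *) echo "yes" ;;
  esac
}

kernel_at_least() {
  # Usage: kernel_at_least MAJOR MINOR [RELEASE]; RELEASE defaults to `uname -r`.
  local want_major="$1" want_minor="$2" release="${3:-$(uname -r)}"
  local major="${release%%.*}"
  local rest="${release#*.}"
  local minor="${rest%%[!0-9]*}"
  [ "$major" -gt "$want_major" ] ||
    { [ "$major" -eq "$want_major" ] && [ "${minor:-0}" -ge "$want_minor" ]; }
}

mglru_report() {
  # Combines both checks into a one-line summary.
  if [ "$(mglru_enabled)" = "yes" ] && kernel_at_least 6 3; then
    echo "MGLRU is enabled on kernel $(uname -r): host matches the reported conditions"
  else
    echo "host does not match the reported OS-side conditions"
  fi
}
```

Calling mglru_report on an Elasticsearch data node prints the summary; a positive result only means the host matches the kernel-side preconditions, not that merges are actually degraded.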
(note: the outputs below are from ES 8.17.2 and Linux kernel 6.8)
Hot threads show high I/O wait, notice other=88.1%:
From the OS perspective, there is a high level of major page faults...
... and I/O wait, see the wa column:
(note: this is on a reproduction system that has nothing else to do, just the merge; on busy systems, CPU I/O wait might be "hidden")
... and I/O PSI metrics:
The exact level of major page faults will depend on the system. The above was seen on a system with 2 concurrent Lucene merge threads and an NVMe SSD with 100 us read latency (r_await). With higher read latency, major page faults might be proportionally lower.
The regression is a combination of:
- MADV_RANDOM madvise for VEC files
- MADV_RANDOM in Linux kernel with MGLRU, specifically after torvalds/linux@8788f67
Steps to Reproduce
The following steps reproduce the problem reliably on hosts with 4 or 8 CPU cores, 16 GiB of RAM, and Intel processors supporting the AVX-512 extension. The problem might be present elsewhere, but that is how the repros were run. The data is based on the SO Vector Rally track, which creates a single vectors index with 2 shards. This specific payload uses the hnsw vector index type, which has not been the default since 8.14.0.
Logs (if relevant)
~10 minutes after force merge initiation, the graph build process starts. Here, each 1000-vector chunk is processed in around a second:
But later the process degrades, and each chunk can take much longer:
(note the process ends at 1M vectors in each shard)
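While the graph build is running, the OS-level symptoms from the problem description (major page faults and I/O pressure) can be sampled alongside these logs. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and it relies only on the standard pgmajfault counter in /proc/vmstat and the "some" line of /proc/pressure/io.

```shell
#!/usr/bin/env bash
# Sketch: sample major page faults and I/O pressure during a merge.
# Function names are hypothetical; file paths can be overridden for testing.

pgmajfault() {
  # Prints the cumulative major page fault counter from a vmstat-format file.
  awk '$1 == "pgmajfault" { print $2 }' "${1:-/proc/vmstat}"
}

io_psi_some_avg10() {
  # Prints the 10-second average of the "some" line from a PSI-format file.
  awk '$1 == "some" { sub("avg10=", "", $2); print $2 }' "${1:-/proc/pressure/io}"
}

major_faults_per_second() {
  # Samples the counter twice, INTERVAL seconds apart, and prints the rate.
  local interval="${1:-5}" file="${2:-/proc/vmstat}"
  local before after
  before="$(pgmajfault "$file")"
  sleep "$interval"
  after="$(pgmajfault "$file")"
  echo $(( (after - before) / interval ))
}
```

Running major_faults_per_second 5 and io_psi_some_avg10 in a loop during a force merge gives a rough timeline to compare against the per-chunk timings; the counters are system-wide, so other workloads on the host will be included.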
Workaround
Disable MGLRU:
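A minimal sketch of one way to do this, assuming the sysfs kill switch described in the kernel's multi-gen LRU admin guide, where writing n to the enabled file turns off all MGLRU components. The function name disable_mglru is hypothetical, the write requires root, and the setting does not persist across reboots.

```shell
#!/usr/bin/env bash
# Sketch: turn MGLRU off through its sysfs switch (run as root).
# The function name is hypothetical; the path can be overridden for testing.

disable_mglru() {
  local file="${1:-/sys/kernel/mm/lru_gen/enabled}"
  # "n" applies to all MGLRU components, per the kernel admin guide.
  echo n > "$file" || return 1
  # Read the value back; on a real system this is expected to show 0x0000.
  cat "$file"
}
```

Invoke disable_mglru from a root shell on each affected node. To make the change survive a reboot, the same write has to be repeated at boot time, for example from a startup unit.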