
Vector index merge time regression in 8.16.0 onwards, with MGLRU enabled in Linux kernel #124499

Open · gbanasiak opened this issue Mar 10, 2025 · 1 comment
Labels
>bug, :Performance (All issues related to Elasticsearch performance including regressions and investigations), Team:Performance (Meta label for performance team)

Comments


gbanasiak commented Mar 10, 2025

Elasticsearch Version

8.16.0+

Installed Plugins

No response

Java Version

bundled

OS Version

Linux kernel 6.3+

Problem Description

Starting from 8.16.0, with MGLRU enabled in the Linux kernel (cat /sys/kernel/mm/lru_gen/enabled returns 0x0007) and a kernel version of 6.3+, the vector index merge process might take a very long time to complete. This applies to the non-quantized vector format, as in the example below. Note that the default vector type has been quantized (int8_hnsw) since version 8.14.

        "titleVector" : {
          "type" : "dense_vector",
          "dims" : 768,
          "index" : true,
          "similarity" : "dot_product",
          "index_options" : {
            "type" : "hnsw", <--- HERE
            "m" : 16,
            "ef_construction" : 100
          }
        },
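
A quick way to confirm whether an existing field actually uses the non-quantized hnsw format (rather than the int8_hnsw default) is to inspect the live mapping. A minimal sketch; the index name and credentials are placeholders:

$ curl -u <REDACTED> "http://localhost:9200/vectors/_mapping?pretty" | grep -A8 '"index_options"'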

The most characteristic symptom is that the Lucene merge thread is constantly calculating vector dot products.

(note: below outputs for ES 8.17.2 and Linux kernel 6.8)

$ ./jdk/bin/jstack -l -e $(./jdk/bin/jps | grep Elasticsearch | awk '{print $1}') | grep "Lucene Merge Thread" -A35
"elasticsearch[<REDACTED>][[vectors][0]: Lucene Merge Thread #0]" #85 [14473] daemon prio=5 os_prio=0 cpu=2086890.49ms elapsed=9886.33s allocated=11887M defined_classes=212 tid=0x00007f865c0188f0 nid=14473 runnable  [0x00007f86f4afc000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport.dotProductBody(org.apache.lucene.core@9.12.0/PanamaVectorUtilSupport.java:122)
        at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport.dotProduct(org.apache.lucene.core@9.12.0/PanamaVectorUtilSupport.java:102)
        at org.apache.lucene.util.VectorUtil.dotProduct(org.apache.lucene.core@9.12.0/VectorUtil.java:66)
        at org.apache.lucene.index.VectorSimilarityFunction$2.compare(org.apache.lucene.core@9.12.0/VectorSimilarityFunction.java:55)
        at org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatScoringSupplier$1.score(org.apache.lucene.core@9.12.0/DefaultFlatVectorScorer.java:149)
        at org.apache.lucene.util.hnsw.HnswGraphSearcher.searchLevel(org.apache.lucene.core@9.12.0/HnswGraphSearcher.java:237)
        at org.apache.lucene.util.hnsw.HnswGraphBuilder.addGraphNode(org.apache.lucene.core@9.12.0/HnswGraphBuilder.java:265)
        at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$ConcurrentMergeWorker.addGraphNode(org.apache.lucene.core@9.12.0/HnswConcurrentMergeBuilder.java:200)
        at org.apache.lucene.util.hnsw.HnswGraphBuilder.addVectors(org.apache.lucene.core@9.12.0/HnswGraphBuilder.java:197)
        at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$ConcurrentMergeWorker.run(org.apache.lucene.core@9.12.0/HnswConcurrentMergeBuilder.java:180)
        at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder.lambda$build$0(org.apache.lucene.core@9.12.0/HnswConcurrentMergeBuilder.java:86)
        at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$$Lambda/0x00007f86a06496a8.call(org.apache.lucene.core@9.12.0/Unknown Source)
        at java.util.concurrent.FutureTask.run(java.base@23/FutureTask.java:317)
        at org.apache.lucene.search.TaskExecutor$TaskGroup$1.run(org.apache.lucene.core@9.12.0/TaskExecutor.java:120)
        at org.apache.lucene.search.TaskExecutor$TaskGroup.invokeAll(org.apache.lucene.core@9.12.0/TaskExecutor.java:176)
        at org.apache.lucene.search.TaskExecutor.invokeAll(org.apache.lucene.core@9.12.0/TaskExecutor.java:84)
        at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder.build(org.apache.lucene.core@9.12.0/HnswConcurrentMergeBuilder.java:90)
        at org.apache.lucene.util.hnsw.IncrementalHnswGraphMerger.merge(org.apache.lucene.core@9.12.0/IncrementalHnswGraphMerger.java:147)
        at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsWriter.mergeOneField(org.apache.lucene.core@9.12.0/Lucene99HnswVectorsWriter.java:377)
        at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsWriter.mergeOneField(org.apache.lucene.core@9.12.0/PerFieldKnnVectorsFormat.java:121)
        at org.apache.lucene.codecs.KnnVectorsWriter.merge(org.apache.lucene.core@9.12.0/KnnVectorsWriter.java:106)
        at org.apache.lucene.index.SegmentMerger.mergeVectorValues(org.apache.lucene.core@9.12.0/SegmentMerger.java:271)
        at org.apache.lucene.index.SegmentMerger$$Lambda/0x00007f86a0638000.merge(org.apache.lucene.core@9.12.0/Unknown Source)
        at org.apache.lucene.index.SegmentMerger.mergeWithLogging(org.apache.lucene.core@9.12.0/SegmentMerger.java:314)
        at org.apache.lucene.index.SegmentMerger.merge(org.apache.lucene.core@9.12.0/SegmentMerger.java:158)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(org.apache.lucene.core@9.12.0/IndexWriter.java:5292)
        at org.apache.lucene.index.IndexWriter.merge(org.apache.lucene.core@9.12.0/IndexWriter.java:4758)
        at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(org.apache.lucene.core@9.12.0/IndexWriter.java:6581)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(org.apache.lucene.core@9.12.0/ConcurrentMergeScheduler.java:660)
        at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(org.elasticsearch.server@8.17.2/ElasticsearchConcurrentMergeScheduler.java:103)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(org.apache.lucene.core@9.12.0/ConcurrentMergeScheduler.java:721)

   Locked ownable synchronizers:
        - None

Hot threads show high I/O wait; note other=88.1%:

$ curl -u <REDACTED> "http://localhost:9200/_nodes/hot_threads?threads=10000" -s | grep "\[vectors\]\[0\]: Lucene Merge Thread #0" -A25
   100.0% [cpu=11.9%, other=88.1%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[ip-172-31-32-115][[vectors][0]: Lucene Merge Thread #0]'
     10/10 snapshots sharing following 31 elements
       app/org.apache.lucene.core@9.12.0/org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport.dotProductBody(PanamaVectorUtilSupport.java:122)
       app/org.apache.lucene.core@9.12.0/org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport.dotProduct(PanamaVectorUtilSupport.java:102)
       app/org.apache.lucene.core@9.12.0/org.apache.lucene.util.VectorUtil.dotProduct(VectorUtil.java:66)
       app/org.apache.lucene.core@9.12.0/org.apache.lucene.index.VectorSimilarityFunction$2.compare(VectorSimilarityFunction.java:55)
       app/org.apache.lucene.core@9.12.0/org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatScoringSupplier$1.score(DefaultFlatVectorScorer.java:149)
       app/org.apache.lucene.core@9.12.0/org.apache.lucene.util.hnsw.HnswGraphSearcher.searchLevel(HnswGraphSearcher.java:237)
       app/org.apache.lucene.core@9.12.0/org.apache.lucene.util.hnsw.HnswGraphBuilder.addGraphNode(HnswGraphBuilder.java:265)
       app/org.apache.lucene.core@9.12.0/org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$ConcurrentMergeWorker.addGraphNode(HnswConcurrentMergeBuilder.java:200)
       app/org.apache.lucene.core@9.12.0/org.apache.lucene.util.hnsw.HnswGraphBuilder.addVectors(HnswGraphBuilder.java:197)
[..]

From the OS perspective, there is a high level of major page faults...

$ sar -B 1 5
Linux 6.8.0-1021-aws (<REDACTED>)         03/10/25        _x86_64_        (4 CPU)

13:30:01     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
13:30:02     77240.00     60.00  19313.00  19310.00  25420.00  25408.00      0.00  50816.00    200.00
13:30:03     77640.00      0.00  19410.00  19410.00     11.00      0.00      0.00      0.00      0.00
13:30:04     78244.00      0.00  19561.00  19561.00  25419.00  25408.00      0.00  50816.00    200.00
13:30:05     78756.00      0.00  19689.00  19689.00  25420.00  25408.00      0.00  50816.00    200.00
13:30:06     78344.00      0.00  19586.00  19586.00  25418.00  25408.00      0.00  50816.00    200.00
Average:     78044.80     12.00  19511.80  19511.20  20337.60  20326.40      0.00  40652.80    200.00

... and high I/O wait, see the wa column:

(note: this is a reproduction system that has nothing else to do but the merge; on busy systems, CPU I/O wait might be "hidden")

$ vmstat -a -SM 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id wa st
 2  1      0    331   8646   6168    0    0  6662   656   96   96  3  2 82 13  0
 0  2      0    264   8715   6169    0    0 71296     0 20083 35919  1  5 51 43  0
 0  2      0    287   8690   6169    0    0 75572     0 21292 38060  1  5 51 43  0
 0  2      0    311   8659   6169    0    0 69932     0 19752 35268  2  4 51 43  0
 0  2      0    350   8632   6169    0    0 74724     8 21119 37623  2  5 51 42  0

... and I/O PSI metrics:

$ cat /proc/pressure/io
some avg10=54.40 avg60=51.46 avg300=51.83 total=15287830871
full avg10=54.39 avg60=51.41 avg300=51.79 total=15251110144

The exact level of major page faults will depend on the system. The figures above were observed on a system with 2 concurrent Lucene merge threads and an NVMe SSD disk with ~100 us read latency (r_await). With higher read latency, major page faults might be proportionally lower.

$ iostat -xdz 1
[..]
Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1       19402.00  77612.00     0.00   0.00    0.09     4.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.73  98.40


Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1       19681.00  78720.00     0.00   0.00    0.09     4.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.75  98.30
[..]
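
To attribute these major page faults to the Elasticsearch process itself rather than reading only system-wide counters, per-process fault counters can also be sampled. A minimal sketch, reusing the jps-based PID lookup from the jstack invocation above:

$ PID=$(./jdk/bin/jps | grep Elasticsearch | awk '{print $1}')
$ ps -o pid,maj_flt,min_flt -p "$PID"   # cumulative major/minor fault counts for the Elasticsearch process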

The regression is a combination of:

Steps to Reproduce

The following steps reproduce the problem reliably on hosts with 4 or 8 CPU cores, 16 GiB of RAM, and Intel processors supporting the AVX-512 extension. The problem might also be present elsewhere, but that is how the reproductions were run. The data is based on the SO Vector Rally track, which creates a single vectors index with 2 shards. This specific payload uses the hnsw vector index type, which has not been the default since 8.14.0.

# make sure you have 6.3+ Linux kernel
uname -a

# make sure MGLRU is enabled (should return 0x0007)
cat /sys/kernel/mm/lru_gen/enabled

# make sure AVX-512 extensions are available
grep avx512 /proc/cpuinfo

# make sure you have 16 GiB of memory
grep MemTotal /proc/meminfo

# for fancy download
sudo apt-get update
sudo apt install aria2

# download (607 MiB) and unpack Elasticsearch version 8.17.2
aria2c -x 4 https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.17.2-linux-x86_64.tar.gz
tar zxf elasticsearch-8.17.2-linux-x86_64.tar.gz
cd elasticsearch-8.17.2

# download Lucene 9.12.0 with modified logging
mv lib/lucene-core-9.12.0.jar lib/lucene-core-9.12.0.jar.orig
wget https://storage.googleapis.com/so-vector-test-public/lucene-core-9.12.0.jar -O lib/lucene-core-9.12.0.jar

# configure Elasticsearch user/password
./bin/elasticsearch-users useradd test -p test123 -r superuser

# set heap size to 4 GiB
echo "-Xms4g" >> config/jvm.options
echo "-Xmx4g" >> config/jvm.options

# set force merge thread pool size to 2
echo "thread_pool.force_merge.size: 2" >> config/elasticsearch.yml

# download the SO Vector Rally track data with "hnsw" vector index type ingested into ES (31 GiB) and untar it
aria2c -x 4 https://storage.googleapis.com/so-vector-test-public/8.17.2-data.tar
tar -xf 8.17.2-data.tar

# start tmux session
tmux

# start Elasticsearch
./bin/elasticsearch

# wait for this log line:
# [2025-03-10T10:12:57,968][INFO ][o.e.c.r.a.AllocationService] [<REDACTED>] current.health="GREEN" message="Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[vectors][0]]])." previous.health="RED" reason="shards started [[vectors][0]]"
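
# (alternative, a sketch) instead of watching the log, poll cluster health until it turns green:
# curl -u test:test123 "http://localhost:9200/_cluster/health?wait_for_status=green&timeout=30m"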

# --- in a new window

# list indices
curl -u test:test123 "http://localhost:9200/_cat/indices?v"

# --- example output ---
# health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size dataset.size
# green  open   vectors zJfxtEMfR5-V8o84Fsr2-w   2   0    2000000            0     30.7gb         30.7gb       30.7gb

# initiate force merge of the "vectors" index
curl -u test:test123 -X POST "http://localhost:9200/vectors/_forcemerge?max_num_segments=1&wait_for_completion=false"

# --- example output ---
# {"task":"KbsRAY4bScCt_8EcwpO1bw:212"}

Logs (if relevant)

About 10 minutes after the force merge is initiated, the graph build process starts. Initially, each 1000-vector chunk is processed in around a second:

[2025-03-10T10:47:17,579][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] addVectors [0 2048]
[2025-03-10T10:47:17,672][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 0 in 92/92 ms
[2025-03-10T10:47:17,869][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] addVectors [0 2048]
[2025-03-10T10:47:18,098][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 0 in 228/228 ms
[2025-03-10T10:47:19,121][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 1000 in 1449/1541 ms
[2025-03-10T10:47:20,290][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 2000 in 1168/2710 ms
[2025-03-10T10:47:20,346][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] addVectors [2048 4096]
[2025-03-10T10:47:21,435][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 3000 in 1089/1089 ms
[2025-03-10T10:47:22,613][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 4000 in 1178/2267 ms

But later the process degrades, and each chunk can take far longer:

[2025-03-10T11:15:22,990][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] addVectors [659456 661504]
[2025-03-10T11:15:27,159][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] addVectors [591872 593920]
[2025-03-10T11:15:30,313][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 592000 in 3154/3154 ms
[2025-03-10T11:15:36,239][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 660000 in 13249/13249 ms
[2025-03-10T11:15:54,574][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 593000 in 24260/27414 ms
[2025-03-10T11:15:59,629][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 661000 in 23389/36639 ms

(note the process ends at 1M vectors in each shard)
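
Since the modified lucene-core jar logs every built chunk, overall progress toward the ~1M vectors per shard can be followed by tailing those lines. A sketch, assuming the default cluster name and therefore the default log file name:

$ tail -f logs/elasticsearch.log | grep HnswGraphBuilder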

Workaround

Disable MGLRU:

sudo sh -c 'echo 0 > /sys/kernel/mm/lru_gen/enabled'
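
The echo above does not survive a reboot. One possible way to persist the setting, a sketch using systemd-tmpfiles (not part of the original workaround):

$ echo 'w /sys/kernel/mm/lru_gen/enabled - - - - 0' | sudo tee /etc/tmpfiles.d/disable-mglru.conf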
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-perf (Team:Performance)
