
Threadpool merge scheduler #120869

Conversation

@albertzaharovits (Contributor) commented Jan 26, 2025

This adds a new merge scheduler implementation that uses a (new) dedicated thread pool to run the merges. This way the number of concurrent merges is limited to the number of threads in the pool (i.e. the number of processors allocated to the ES JVM).

It implements dynamic IO throttling (roughly the same target IO rate for all merges, with caveats) that is adjusted based on the number of currently active (queued + running) merges.
Smaller merges are always preferred to larger ones, irrespective of the index shard they come from.
The implementation also supports the per-shard "max thread count" and "max merge count" settings, the latter being used today for indexing throttling.
Note that IO throttling, max merge count, and max thread count work similarly, but not identically, to their siblings in the ConcurrentMergeScheduler.

The per-shard merge statistics are not affected, and the thread-pool statistics should reflect the merge ones (i.e. the completed thread-pool stats reflect the total number of merges, across shards, per node).
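For a rough illustration of the two mechanisms above (this is a sketch, not the actual scheduler code; the class, record, and constants below are made up), smaller merges can be taken first from a single node-wide priority queue, and the shared target IO rate can be nudged up or down as the merge backlog grows or shrinks:

import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

class MergeSchedulingSketch {
    record PendingMerge(String shard, long estimatedSizeInBytes) {}

    // Node-wide queue: the smallest merge runs next, regardless of which shard it comes from.
    final PriorityBlockingQueue<PendingMerge> queue =
        new PriorityBlockingQueue<>(64, Comparator.comparingLong(PendingMerge::estimatedSizeInBytes));

    static final double MIN_IO_RATE_MB_PER_SEC = 5.0;      // hypothetical bounds
    static final double MAX_IO_RATE_MB_PER_SEC = 10240.0;

    // Nudge the shared target IO rate based on how many merges are active (queued + running).
    static double adjustTargetIORate(double currentMBPerSec, int activeMerges, int maxConcurrentMerges) {
        if (activeMerges > maxConcurrentMerges * 2) {
            return Math.min(currentMBPerSec * 1.1, MAX_IO_RATE_MB_PER_SEC); // falling behind: throttle less
        } else if (activeMerges < maxConcurrentMerges) {
            return Math.max(currentMBPerSec * 0.9, MIN_IO_RATE_MB_PER_SEC); // keeping up: throttle more
        }
        return currentMBPerSec;
    }
}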

@henningandersen (Contributor) left a comment

Looks good, though I wonder if there is a test instability problem, and I'd like to avoid the static variables in the new test.

public class ThreadPoolMergeSchedulerStressTestIT extends ESSingleNodeTestCase {

private static final AtomicReference<ThreadPoolMergeExecutorService> MERGE_EXECUTOR_SERVICE_REFERENCE = new AtomicReference<>();
private static final Set<OneMerge> ENQUEUED_MERGES_SET = ConcurrentCollections.newConcurrentSet();
Contributor:

I think we can avoid the statics here by putting them on the plugin instead. You can grab the plugin using getInstance and then get to the variables. I prefer to avoid non-constant statics.

Contributor Author:

I agree... when I embarked on the static-variables strategy here there were only 1-2 variables... I'll use getInstance to grab the plugin instance.
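For illustration, a minimal sketch of that direction (names are hypothetical, and the exact plugin-lookup helpers depend on the test framework version):

// Hypothetical test plugin that owns the shared test state instead of static fields.
public static class TestEnginePlugin extends Plugin {
    final AtomicReference<ThreadPoolMergeExecutorService> mergeExecutorServiceReference = new AtomicReference<>();
    final Set<OneMerge> enqueuedMergesSet = ConcurrentCollections.newConcurrentSet();
}

// In the test, look the plugin instance up rather than touching statics
// (assuming PluginsService#filterPlugins returns a stream in this version).
TestEnginePlugin testPlugin = getInstanceFromNode(PluginsService.class)
    .filterPlugins(TestEnginePlugin.class)
    .findFirst()
    .orElseThrow();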

Contributor Author:

Refactored it in e4216d2

assertAllSuccessful(indicesAdmin().prepareRefresh("index").get());
var segmentsAfter = getSegmentsCountForAllShards("index");
// there should be way fewer segments after merging completed
assertThat(segmentsBefore, greaterThan(segmentsAfter));
Contributor:

I am not exactly sure we are guaranteed this. In the worst case, all merging concluded before we grab segmentsBefore?

Contributor Author:

> In the worst case, all merging concluded before we grab segmentsBefore?

No, I don't think that's possible. There should be at least 50 merges blocked (because we grab segmentsBefore before the semaphore release), where each merge covers between 2 and 3 segments, and the segments covered by these blocked merges are not available to any other merges (IndexWriter handles this). Do you think it's possible that the merge policy selects merges of 2 segments which are then not reduced down to 1 segment? (There are no deletes.)

Contributor:

I think the "frequently" if-statement around the acquire means that we risk no acquire at all - or just a few, like the up to 5 initially released. But we could still have enough merges build up that the assertBusy wait above (waiting for ENQUEUED_MERGES_SET to be large enough) is satisfied.

Or did I misunderstand that?

Contributor Author:

Ah, I see - you're thinking that it is possible for ENQUEUED_MERGES_SET to build up (to at least 50) before the thread pool's merge threads are blocked at the SEMAPHORE. I will also assertBusy that the SEMAPHORE is queueing threads.
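Something along these lines, for instance (a sketch of the extra assertion only, not the change itself):

// Wait until merge threads are actually parked at the test's merge-blocking semaphore
// before relying on the enqueued-merges count.
assertBusy(() -> assertThat(SEMAPHORE.getQueueLength(), greaterThan(0)));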

Contributor Author:

Pushed 0e300a7

Contributor:

I am not convinced it does, since the threads may have no more iterations/work to do regardless of whether you capture it before or after they complete - and merging may complete between the assertBusy and capturing the set of segments.

Contributor:

I wonder if we could instead check that we merged as far as we should. I.e., once all merging is done, grab the segments. Then do a force-merge (with no options/no max-segments, i.e., just trigger maybeMerge) and check that it did not do anything, i.e., did not invoke the scheduler (make it fail in that case).
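Sketched out, the suggested check might look roughly like this (the scheduler-invocation flag is hypothetical test plumbing):

// After merging has fully caught up, a bare force-merge should find nothing to merge,
// so the merge scheduler must not be invoked again.
mergeSchedulerInvoked.set(false); // hypothetical AtomicBoolean flipped by the test's merge scheduler
assertAllSuccessful(indicesAdmin().prepareForceMerge("index").get());
assertThat("force-merge should not have triggered any merge", mergeSchedulerInvoked.get(), is(false));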

Contributor Author:

I've pushed a3f73b7.

Here's how I thought about it: the test is designed to accumulate (enqueue or backlog) outstanding merges, but without pausing merging completely (because I think this is more realistic). So some merging keeps completing while most merges are stopped at a semaphore. Once a specific limit of enqueued/backlogged merges is reached, new merges no longer stop at the semaphore. In this state, the number of enqueued/backlogged merges oscillates below that limit. I don't think it's worth pausing merging entirely at the limit: it's artificial and not elegant to code.

It is true that, as you pointed out, the "oscillation" can in theory be so large that at some point in time the number of enqueued/backlogged merges is actually 0. I think this is very unlikely (given the limit value of 50), but if we measure the number of segments at exactly that point the test will fail.
In order to account for this "oscillation", I've put the sampling of the number of segments in an assertBusy and asserted a minimum value for it (given the limit of outstanding merges and the number of segments in a single merge). This way, we know that we're looking at a plausible number of segments to be merged. Later, when the test releases the merging semaphore, we expect these segments to be merged away.
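Roughly like this, as an illustration (the constant names are hypothetical, and the actual a3f73b7 change may differ):

// Sample the number of segments inside an assertBusy and require a plausible minimum
// (~50 outstanding merges of at least 2 segments each), so we know real merge work is pending.
assertBusy(() -> {
    long segmentsBeforeUnblockingMerges = getSegmentsCountForAllShards("index");
    assertThat(segmentsBeforeUnblockingMerges,
        greaterThanOrEqualTo((long) OUTSTANDING_MERGES_LIMIT * MIN_SEGMENTS_PER_MERGE));
});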

Contributor Author:

> I wonder if we could instead check that we merged as far as we should. I.e., once all merging is done, grab the segments. Then do a force-merge (with no options/no max-segments, i.e., just trigger maybeMerge) and check that it did not do anything, i.e., did not invoke the scheduler (make it fail in that case).

I've pushed b187e1c

Contributor Author:

As discussed, I've removed the merge count taken before merging caught up, in 1a7db64.

@henningandersen (Contributor) left a comment

LGTM.

Left one final comment on the stress test that needs sorting.

assertAllSuccessful(indicesAdmin().prepareRefresh("index").get());
var segmentsAfter = getSegmentsCountForAllShards("index");
// there should be way fewer segments after merging completed
assertThat(segmentsBefore, greaterThan(segmentsAfter));
Contributor:

I wonder if we could instead check that we merged as far as we should. I.e., once all merging is done, grab the segments. Then do a force-merge (with no options/no max-segments, i.e., just trigger maybeMerge) and check that it did not do anything, i.e., did not invoke the scheduler (make it fail in that case).

@albertzaharovits merged commit fa46b87 into elastic:main on Mar 18, 2025 (16 of 17 checks passed).
@albertzaharovits deleted the threadpool-merge-scheduler-sort-all-merges-take-2 branch on March 19, 2025 at 13:44.
// NEVER do this on a merge thread since we acquire some locks blocking here and if we
// concurrently rollback the writer we deadlock on engine#close for instance.
engineConfig.getThreadPool().executor(ThreadPool.Names.FLUSH).execute(new AbstractRunnable() {


If the queue in the target thread pool is full, we will wait for other flushes to complete. Does it make sense to queue multiple flush requests if they are all identical? Maybe we should schedule a flush only when no other flush has been scheduled yet, so I wonder whether we actually need a thread pool for this at all (a simple thread that checks the pre-condition and flushes from time to time would serve as well).
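For reference, the kind of deduplication being suggested could look something like this (a sketch only, with hypothetical names; the reply below notes why it isn't necessary here):

// Only hand a flush off to the FLUSH executor if one is not already pending.
private final AtomicBoolean flushPending = new AtomicBoolean();

void maybeScheduleFlush() {
    if (flushPending.compareAndSet(false, true)) {
        engineConfig.getThreadPool().executor(ThreadPool.Names.FLUSH).execute(() -> {
            try {
                flush(); // hypothetical helper doing the actual flush work
            } finally {
                flushPending.set(false);
            }
        });
    }
}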

Contributor:

Scheduling a new flush will not wait; the underlying queue is unbounded. Also, this is not a hot path at all, since it is only entered for write-idle shards. If multiple flushes enter the queue, only the first one is likely to do real work.

We do prefer a thread pool to ensure we can do other merges while the flush happens.

This behavior is unchanged by this PR.

smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Mar 21, 2025
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
albertzaharovits added a commit that referenced this pull request Mar 30, 2025
…ep up with the merge load (#125654)

Fixes an issue where indexing throttling kicks in while disk IO is still being throttled.
Instead, disk IO should first unthrottle, and only then, if we still can't keep up with the merge load, should indexing throttling kick in.

Fixes elastic/elasticsearch-benchmarks#2437
Relates #120869
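As a rough sketch of that ordering (hypothetical names, not the actual #125654 code):

// When merges fall behind, first drop the disk IO throttle; only if merges still
// cannot keep up does indexing throttling kick in.
static boolean shouldThrottleIndexing(int queuedAndRunningMerges, int maxOutstandingMerges, boolean ioThrottleActive) {
    if (queuedAndRunningMerges <= maxOutstandingMerges) {
        return false; // merging keeps up, nothing to do
    }
    if (ioThrottleActive) {
        return false; // unthrottle IO first and give merges full disk bandwidth
    }
    return true; // IO already unthrottled and still behind: throttle indexing
}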
Labels
:Distributed Indexing/Engine, >feature, serverless-linked, Team:Distributed Indexing, v9.1.0