Ensure that RefreshListener do not access engine under refresh lock #124328

tlrx · 2025-03-07T14:22:42Z

Today RefreshListener implementations can access the current engine reference during the reader(s) refresh by using the IndexShard#getEngineOrNull() method (or another method that ultimately calls it).

In a future change we'd like to guard the mutation of shard's engine using a read/write lock instead of the actual engineMutex object monitor. It means that closing or resetting the engine will be done while holding a write lock and should be able to complete without depending on some other thread doing some work requiring a read lock, otherwise the mutation is at risk to deadlock. Such a deadlock has been identified within RefreshListener that access the engine while holding the refreshLock.

This change introduces an EngineAwarRefreshListener interface that can be implemented by refresh listeners to be notified when a new engine is created. This way refresh listeners already know the current engine instance at the time of the reader is refreshed, and therefore do not need to call getEngineOrNull().

To ensure that refresh listeners are not accessing getEngineOrNull() during refreshes, they are now wrapped into an AssertingRefreshListener when they are added to the reader reference manager. AssertingRefreshListener uses a SafeEngineAccessThreadLocal to set a thread-local flag before executing the beforeRefresh() and afterRefresh() methods and resets the flag after the methods are executed. Finally, the flag is checked when the getEngineOrNull() is executed, throwing if it is set and printing the stack trace.

tlrx · 2025-03-07T14:31:15Z

server/src/main/java/org/elasticsearch/action/support/replication/PostWriteRefresh.java

        Translog.Location location,
        ActionListener<Boolean> listener,
        boolean forced,
        @Nullable TimeValue postWriteRefreshTimeout
    ) {
-        Engine engineOrNull = indexShard.getEngineOrNull();


This invocation has been caught by the new SafeEngineAccessThreadLocal

tlrx · 2025-03-07T14:33:57Z

server/src/main/java/org/elasticsearch/search/SearchService.java

@@ -1120,7 +1120,7 @@ public void openReaderContext(ShardId shardId, TimeValue keepAlive, ActionListen
            Engine.SearcherSupplier searcherSupplier = null;
            ReaderContext readerContext = null;
            try {
-                searcherSupplier = shard.acquireSearcherSupplier();
+                searcherSupplier = shard.acquireSearcherSupplier(); // TODO I think we need to fork here


By reading the code it invokes getEngine and therefore might cause the new assertion to fail on CI.

It's likely that we don't have many tests using PITs and that's why I haven't seen it yet, but let's see if that fails.

tlrx · 2025-03-07T17:10:05Z

This one is going to be complex to fix:

[2025-03-07T18:43:45,511][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [test-cluster-0] fatal error in thread [elasticsearch[test-cluster-0][refresh][T#3]], exiting
java.lang.AssertionError: Current thread has made an unsafe access to the engine
	at org.elasticsearch.index.engine.SafeEngineAccessThreadLocal.getAccessorSafe(SafeEngineAccessThreadLocal.java:63) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.SafeEngineAccessThreadLocal.accessEnd(SafeEngineAccessThreadLocal.java:77) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.AbstractReaderManager$AssertingRefreshListener.afterRefresh(AbstractReaderManager.java:100) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275) ~[lucene-core-10.1.0.jar:?]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182) ~[lucene-core-10.1.0.jar:?]
	at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:213) ~[lucene-core-10.1.0.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:2044) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.lambda$maybeRefresh$9(InternalEngine.java:2017) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:367) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.maybeRefresh(InternalEngine.java:2017) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.lambda$scheduledRefresh$47(IndexShard.java:4056) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.ActionListener.run(ActionListener.java:463) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:4037) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:1086) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1222) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:138) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:977) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1575) ~[?:?]
Caused by: java.lang.AssertionError: thread [Thread[#109,elasticsearch[test-cluster-0][refresh][T#3],5,main]] should not access the engine using the getEngineOrNull() method
	at org.elasticsearch.index.engine.SafeEngineAccessThreadLocal.assertNoAccessByCurrentThread(SafeEngineAccessThreadLocal.java:90) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.getEngineOrNull(IndexShard.java:3317) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.getEngine(IndexShard.java:3305) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.getLocalCheckpoint(IndexShard.java:2974) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.localCheckpoint(TransportReplicationAction.java:1192) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation.updateCheckPoints(ReplicationOperation.java:384) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:202) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:197) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.TransportWriteAction$WritePrimaryResult$1.onSuccess(TransportWriteAction.java:341) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction.maybeFinish(TransportWriteAction.java:498) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction$1.onResponse(TransportWriteAction.java:524) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction$1.onResponse(TransportWriteAction.java:516) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.RefreshListeners.lambda$addOrNotify$1(RefreshListeners.java:152) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.RefreshListeners.fireListeners(RefreshListeners.java:432) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.RefreshListeners.afterRefresh(RefreshListeners.java:420) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.AbstractReaderManager$AssertingRefreshListener.afterRefresh(AbstractReaderManager.java:98) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	... 17 more

Ensure that RefreshListener do not access engine under refresh lock

efebd52

tlrx added >non-issue :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v9.1.0 labels Mar 7, 2025

elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Mar 7, 2025

[CI] Auto commit changes from spotless

94632da

tlrx commented Mar 7, 2025

View reviewed changes

more registration

6bcf4a8

tlrx and others added 13 commits March 10, 2025 07:06

Merge branch 'main' into ensure-safe-engine-access-in-refresh-listeners

f2da65e

defer engine

de3f416

fix TransportReplicationActionTests

af744e9

also replica

20d5dfc

fix TransportReplicationActionTests

1f31512

fork in rewriteAndFetchShardRequest

f7c193f

fork in rewriteAndFetchShardRequest

f5c1ea0

[CI] Auto commit changes from spotless

37c3409

remove replica

b1e673d

remove replica

e33744c

replica again

d88b476

replication action

0b34aa8

Merge branch 'main' into ensure-safe-engine-access-in-refresh-listeners

9d51165

tlrx force-pushed the ensure-safe-engine-access-in-refresh-listeners branch from 26a7719 to 9d51165 Compare March 11, 2025 12:25

tlrx added 6 commits March 11, 2025 13:42

replication action

e053aa0

bug

cd2333c

bug

c5153c1

bug

a422586

bug

110426f

revert replication changes

b9e646c

Merge branch 'main' into ensure-safe-engine-access-in-refresh-listeners

3e18fc8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure that RefreshListener do not access engine under refresh lock #124328

Ensure that RefreshListener do not access engine under refresh lock #124328

tlrx commented Mar 7, 2025 •

edited

Loading

tlrx Mar 7, 2025

tlrx Mar 7, 2025

tlrx commented Mar 7, 2025

Ensure that RefreshListener do not access engine under refresh lock #124328

Are you sure you want to change the base?

Ensure that RefreshListener do not access engine under refresh lock #124328

Conversation

tlrx commented Mar 7, 2025 • edited Loading

tlrx Mar 7, 2025

Choose a reason for hiding this comment

tlrx Mar 7, 2025

Choose a reason for hiding this comment

tlrx commented Mar 7, 2025

tlrx commented Mar 7, 2025 •

edited

Loading