Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure that RefreshListener do not access engine under refresh lock #124328

Draft
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

tlrx
Copy link
Member

@tlrx tlrx commented Mar 7, 2025

Today RefreshListener implementations can access the current engine reference during the reader(s) refresh by using the IndexShard#getEngineOrNull() method (or another method that ultimately calls it).

In a future change we'd like to guard the mutation of shard's engine using a read/write lock instead of the actual engineMutex object monitor. It means that closing or resetting the engine will be done while holding a write lock and should be able to complete without depending on some other thread doing some work requiring a read lock, otherwise the mutation is at risk to deadlock. Such a deadlock has been identified within RefreshListener that access the engine while holding the refreshLock.

This change introduces an EngineAwarRefreshListener interface that can be implemented by refresh listeners to be notified when a new engine is created. This way refresh listeners already know the current engine instance at the time of the reader is refreshed, and therefore do not need to call getEngineOrNull().

To ensure that refresh listeners are not accessing getEngineOrNull() during refreshes, they are now wrapped into an AssertingRefreshListener when they are added to the reader reference manager. AssertingRefreshListener uses a SafeEngineAccessThreadLocal to set a thread-local flag before executing the beforeRefresh() and afterRefresh() methods and resets the flag after the methods are executed. Finally, the flag is checked when the getEngineOrNull() is executed, throwing if it is set and printing the stack trace.

@tlrx tlrx added >non-issue :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v9.1.0 labels Mar 7, 2025
@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Mar 7, 2025
Translog.Location location,
ActionListener<Boolean> listener,
boolean forced,
@Nullable TimeValue postWriteRefreshTimeout
) {
Engine engineOrNull = indexShard.getEngineOrNull();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This invocation has been caught by the new SafeEngineAccessThreadLocal

@@ -1120,7 +1120,7 @@ public void openReaderContext(ShardId shardId, TimeValue keepAlive, ActionListen
Engine.SearcherSupplier searcherSupplier = null;
ReaderContext readerContext = null;
try {
searcherSupplier = shard.acquireSearcherSupplier();
searcherSupplier = shard.acquireSearcherSupplier(); // TODO I think we need to fork here
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By reading the code it invokes getEngine and therefore might cause the new assertion to fail on CI.

It's likely that we don't have many tests using PITs and that's why I haven't seen it yet, but let's see if that fails.

@tlrx
Copy link
Member Author

tlrx commented Mar 7, 2025

This one is going to be complex to fix:

[2025-03-07T18:43:45,511][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [test-cluster-0] fatal error in thread [elasticsearch[test-cluster-0][refresh][T#3]], exiting
java.lang.AssertionError: Current thread has made an unsafe access to the engine
	at org.elasticsearch.index.engine.SafeEngineAccessThreadLocal.getAccessorSafe(SafeEngineAccessThreadLocal.java:63) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.SafeEngineAccessThreadLocal.accessEnd(SafeEngineAccessThreadLocal.java:77) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.AbstractReaderManager$AssertingRefreshListener.afterRefresh(AbstractReaderManager.java:100) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275) ~[lucene-core-10.1.0.jar:?]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182) ~[lucene-core-10.1.0.jar:?]
	at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:213) ~[lucene-core-10.1.0.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:2044) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.lambda$maybeRefresh$9(InternalEngine.java:2017) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:367) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.maybeRefresh(InternalEngine.java:2017) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.lambda$scheduledRefresh$47(IndexShard.java:4056) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.ActionListener.run(ActionListener.java:463) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:4037) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:1086) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1222) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:138) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:977) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1575) ~[?:?]
Caused by: java.lang.AssertionError: thread [Thread[#109,elasticsearch[test-cluster-0][refresh][T#3],5,main]] should not access the engine using the getEngineOrNull() method
	at org.elasticsearch.index.engine.SafeEngineAccessThreadLocal.assertNoAccessByCurrentThread(SafeEngineAccessThreadLocal.java:90) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.getEngineOrNull(IndexShard.java:3317) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.getEngine(IndexShard.java:3305) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.getLocalCheckpoint(IndexShard.java:2974) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.localCheckpoint(TransportReplicationAction.java:1192) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation.updateCheckPoints(ReplicationOperation.java:384) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:202) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:197) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.TransportWriteAction$WritePrimaryResult$1.onSuccess(TransportWriteAction.java:341) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction.maybeFinish(TransportWriteAction.java:498) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction$1.onResponse(TransportWriteAction.java:524) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction$1.onResponse(TransportWriteAction.java:516) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.RefreshListeners.lambda$addOrNotify$1(RefreshListeners.java:152) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.RefreshListeners.fireListeners(RefreshListeners.java:432) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.RefreshListeners.afterRefresh(RefreshListeners.java:420) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.AbstractReaderManager$AssertingRefreshListener.afterRefresh(AbstractReaderManager.java:98) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	... 17 more

@tlrx tlrx force-pushed the ensure-safe-engine-access-in-refresh-listeners branch from 26a7719 to 9d51165 Compare March 11, 2025 12:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >non-issue serverless-linked Added by automation, don't add manually v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants