-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure that RefreshListener do not access engine under refresh lock #124328
base: main
Are you sure you want to change the base?
Ensure that RefreshListener do not access engine under refresh lock #124328
Conversation
Translog.Location location, | ||
ActionListener<Boolean> listener, | ||
boolean forced, | ||
@Nullable TimeValue postWriteRefreshTimeout | ||
) { | ||
Engine engineOrNull = indexShard.getEngineOrNull(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This invocation has been caught by the new SafeEngineAccessThreadLocal
@@ -1120,7 +1120,7 @@ public void openReaderContext(ShardId shardId, TimeValue keepAlive, ActionListen | |||
Engine.SearcherSupplier searcherSupplier = null; | |||
ReaderContext readerContext = null; | |||
try { | |||
searcherSupplier = shard.acquireSearcherSupplier(); | |||
searcherSupplier = shard.acquireSearcherSupplier(); // TODO I think we need to fork here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By reading the code it invokes getEngine
and therefore might cause the new assertion to fail on CI.
It's likely that we don't have many tests using PITs and that's why I haven't seen it yet, but let's see if that fails.
This one is going to be complex to fix: [2025-03-07T18:43:45,511][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [test-cluster-0] fatal error in thread [elasticsearch[test-cluster-0][refresh][T#3]], exiting
java.lang.AssertionError: Current thread has made an unsafe access to the engine
at org.elasticsearch.index.engine.SafeEngineAccessThreadLocal.getAccessorSafe(SafeEngineAccessThreadLocal.java:63) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.index.engine.SafeEngineAccessThreadLocal.accessEnd(SafeEngineAccessThreadLocal.java:77) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.index.engine.AbstractReaderManager$AssertingRefreshListener.afterRefresh(AbstractReaderManager.java:100) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275) ~[lucene-core-10.1.0.jar:?]
at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182) ~[lucene-core-10.1.0.jar:?]
at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:213) ~[lucene-core-10.1.0.jar:?]
at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:2044) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.index.engine.InternalEngine.lambda$maybeRefresh$9(InternalEngine.java:2017) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:367) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.index.engine.InternalEngine.maybeRefresh(InternalEngine.java:2017) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.index.shard.IndexShard.lambda$scheduledRefresh$47(IndexShard.java:4056) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.action.ActionListener.run(ActionListener.java:463) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:4037) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:1086) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1222) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:138) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:977) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1575) ~[?:?]
Caused by: java.lang.AssertionError: thread [Thread[#109,elasticsearch[test-cluster-0][refresh][T#3],5,main]] should not access the engine using the getEngineOrNull() method
at org.elasticsearch.index.engine.SafeEngineAccessThreadLocal.assertNoAccessByCurrentThread(SafeEngineAccessThreadLocal.java:90) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.index.shard.IndexShard.getEngineOrNull(IndexShard.java:3317) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.index.shard.IndexShard.getEngine(IndexShard.java:3305) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.index.shard.IndexShard.getLocalCheckpoint(IndexShard.java:2974) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.localCheckpoint(TransportReplicationAction.java:1192) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.action.support.replication.ReplicationOperation.updateCheckPoints(ReplicationOperation.java:384) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:202) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:197) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.action.support.replication.TransportWriteAction$WritePrimaryResult$1.onSuccess(TransportWriteAction.java:341) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction.maybeFinish(TransportWriteAction.java:498) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction$1.onResponse(TransportWriteAction.java:524) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction$1.onResponse(TransportWriteAction.java:516) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.index.shard.RefreshListeners.lambda$addOrNotify$1(RefreshListeners.java:152) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.index.shard.RefreshListeners.fireListeners(RefreshListeners.java:432) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.index.shard.RefreshListeners.afterRefresh(RefreshListeners.java:420) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
at org.elasticsearch.index.engine.AbstractReaderManager$AssertingRefreshListener.afterRefresh(AbstractReaderManager.java:98) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
... 17 more |
26a7719
to
9d51165
Compare
Today
RefreshListener
implementations can access the current engine reference during the reader(s) refresh by using theIndexShard#getEngineOrNull()
method (or another method that ultimately calls it).In a future change we'd like to guard the mutation of shard's engine using a read/write lock instead of the actual
engineMutex
object monitor. It means that closing or resetting the engine will be done while holding a write lock and should be able to complete without depending on some other thread doing some work requiring a read lock, otherwise the mutation is at risk to deadlock. Such a deadlock has been identified withinRefreshListener
that access the engine while holding therefreshLock
.This change introduces an
EngineAwarRefreshListener
interface that can be implemented by refresh listeners to be notified when a new engine is created. This way refresh listeners already know the current engine instance at the time of the reader is refreshed, and therefore do not need to callgetEngineOrNull()
.To ensure that refresh listeners are not accessing
getEngineOrNull()
during refreshes, they are now wrapped into anAssertingRefreshListener
when they are added to the reader reference manager.AssertingRefreshListener
uses aSafeEngineAccessThreadLocal
to set a thread-local flag before executing thebeforeRefresh()
andafterRefresh()
methods and resets the flag after the methods are executed. Finally, the flag is checked when thegetEngineOrNull()
is executed, throwing if it is set and printing the stack trace.