Error restoring backup from version 6.8 to 8.6 #93389

Open
lnowicki10 opened this issue Jan 31, 2023 · 11 comments
Labels
>bug :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. Team:Distributed Indexing Meta label for Distributed Indexing team

Comments

@lnowicki10

Elasticsearch Version

8.6.1

Installed Plugins

No response

Java Version

bundled

OS Version

CentOS 7

Problem Description

We are unable to restore a large backup taken on version 6.8.13 onto an 8.6.1 cluster.

The restore fails on some shards with these messages:
Caused by: [users/owINpknoTCmTe2kMmzuRhg][[users][25]] org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
    ... 20 more
Caused by: [users/owINpknoTCmTe2kMmzuRhg][[users][25]] org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
    ... 18 more
Caused by: java.lang.IllegalStateException: Maximum sequence number [117122753] from last commit does not match global checkpoint [117122751]
The same snapshot restores without a problem on a 7.x cluster. The problem occurs on random shards of large indices with lots of data (100 shards, 2 TB of data).

Steps to Reproduce

Try to restore a large dataset from 6.x to 8.x.

Logs (if relevant)

[2023-01-30T15:17:10,547][WARN ][o.e.c.r.a.AllocationService] [es-master1-archive] failing shard [FailedShard[routingEntry=[users][13], node[do0HqvNyRluI84MCgDzBHA], [P], recovery_source[snapshot recovery [LVdaXt-nQ8i0D2lEy0QWAw] from backupS3_1674680402:snapshot_1674939601/6h2NrRU3RcmQgmnZizHIoA], s[INITIALIZING], a[id=2JHqNugaSfSRYkGsKHrYBA], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-01-30T14:17:10.140Z], failed_attempts[4], failed_nodes[[do0HqvNyRluI84MCgDzBHA]], delayed=false, last_node[do0HqvNyRluI84MCgDzBHA], details[failed shard on node [do0HqvNyRluI84MCgDzBHA]: failed recovery, failure
org.elasticsearch.indices.recovery.RecoveryFailedException: [users][13]: Recovery failed on {es16-archive-2}{do0HqvNyRluI84MCgDzBHA}{6RlBvSk_QjOUpaDcpJEkhg}{es16-archive-2}{10.94.121.5}{10.94.121.5:9301}{d}{xpack.installed=true}
    at org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$24(IndexShard.java:3123)
    at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:170)
    at org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:385)
    at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:170)
    at org.elasticsearch.index.shard.StoreRecovery.lambda$restore$8(StoreRecovery.java:518)
    at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:170)
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:164)
    at org.elasticsearch.action.ActionListener$DelegatingActionListener.onResponse(ActionListener.java:212)
    at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onResponse(ActionListener.java:397)
    at org.elasticsearch.repositories.blobstore.FileRestoreContext.lambda$restore$1(FileRestoreContext.java:166)
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)
    at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:127)
    at org.elasticsearch.action.support.GroupedActionListener.onResponse(GroupedActionListener.java:55)
    at org.elasticsearch.action.ActionListener$DelegatingActionListener.onResponse(ActionListener.java:212)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$11.executeOneFileRestore(BlobStoreRepository.java:3066)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$11.lambda$executeOneFileRestore$1(BlobStoreRepository.java:3075)
    at org.elasticsearch.action.ActionRunnable$3.doRun(ActionRunnable.java:72)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:917)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.lang.Thread.run(Thread.java:1589)
Caused by: [users/owINpknoTCmTe2kMmzuRhg][[users][13]] org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
    ... 20 more
Caused by: [users/owINpknoTCmTe2kMmzuRhg][[users][13]] org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
    ... 18 more
Caused by: java.lang.IllegalStateException: Maximum sequence number [105904628] from last commit does not match global checkpoint [105904627]
    at org.elasticsearch.index.engine.ReadOnlyEngine.ensureMaxSeqNoEqualsToGlobalCheckpoint(ReadOnlyEngine.java:184)
    at org.elasticsearch.index.engine.ReadOnlyEngine.<init>(ReadOnlyEngine.java:121)
    at org.elasticsearch.xpack.lucene.bwc.OldLuceneVersions.lambda$getEngineFactory$4(OldLuceneVersions.java:248)
    at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1949)
    at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1913)
    at org.elasticsearch.index.shard.StoreRecovery.lambda$restore$7(StoreRecovery.java:513)
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)
    ... 15 more

@lnowicki10 lnowicki10 added >bug needs:triage Requires assignment of a team area label labels Jan 31, 2023
@DaveCTurner DaveCTurner added :Search/Search Search-related issues that do not fall into other categories and removed needs:triage Requires assignment of a team area label labels Jan 31, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jan 31, 2023
@DaveCTurner
Contributor

Maximum sequence number [105904628] from last commit does not match global checkpoint [105904627]

I think this can legitimately happen if the snapshot was taken while indexing was ongoing. We don't restore regular snapshots into a ReadOnlyEngine so I think it's not an issue there, hence labelling this for the search team. It's possible this affects searchable snapshots too, although I think less frequently because of how ILM typically manages them.

@benwtrent benwtrent added the priority:high A label for assessing bug priority to be used by ES engineers label Jul 9, 2024
@javanna javanna added :Search Foundations/Search Catch all for Search Foundations and removed :Search/Search Search-related issues that do not fall into other categories labels Jul 17, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jul 17, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine elasticsearchmachine removed the Team:Search Meta label for search team label Jul 17, 2024
@javanna
Member

javanna commented Nov 21, 2024

Hey @lnowicki10, sorry for the lag. I wonder how you restored the backup taken from 6.8. Did you mean to restore it as an archive index, in read-only mode? The error you got is confusing, but indices created before 7.0 are not supported in 8.x; see https://www.elastic.co/guide/en/elasticsearch/reference/7.17/setup-upgrade.html . They can, however, be imported as archive indices with limited functionality; see https://www.elastic.co/guide/en/elasticsearch/reference/current/archive-indices.html . Otherwise, they need to be reindexed.
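For reference, restoring such a snapshot as an archive index goes through the regular snapshot restore API. Below is a minimal sketch using the low-level Java REST client; the repository, snapshot and index names are taken from the logs in this issue, while the host and port are placeholder assumptions.

```java
// Sketch only: restore the 6.8 snapshot onto the 8.x cluster as an archive index.
// Repository/snapshot/index names come from the logs above; host and port are placeholders.
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class RestoreArchiveIndexSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // POST /_snapshot/<repository>/<snapshot>/_restore
            Request restore = new Request("POST",
                "/_snapshot/backupS3_1674680402/snapshot_1674939601/_restore");
            restore.setJsonEntity("{\"indices\": \"users\"}");
            Response response = client.performRequest(restore);
            System.out.println(response.getStatusLine());
        }
    }
}
```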

@javanna
Member

javanna commented Nov 21, 2024

Never mind, you can ignore my question above. I looked more closely at the stack trace you provided, and it does show that you imported the index as an archive index, which should work.

@javanna
Member

javanna commented Mar 14, 2025

After further consideration, I am not quite sure how this error would be specific to archive indices. Archive indices rely on a read-only engine, which searchable snapshots also rely on, as mentioned above. It is peculiar that the error appears only when restoring against a read-only engine in 8.x, whereas restoring to 7.x succeeds; since the read-only engine is the only way to restore such an index to 8.x, we can't test whether it would restore in 8.x against an ordinary engine. I am not sure how we can reproduce this issue, I suspect it has to do with how the snapshot was created, and I am not familiar with the read-only engine assertion that trips. Rerouting back to distrib for more info; this does not appear to be a search problem after all.

@javanna javanna added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs and removed priority:high A label for assessing bug priority to be used by ES engineers :Search Foundations/Search Catch all for Search Foundations labels Mar 14, 2025
@elasticsearchmachine elasticsearchmachine added Team:Distributed Coordination Meta label for Distributed Coordination team and removed Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch labels Mar 14, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@DaveCTurner
Contributor

It is peculiar that the error appears only when restoring against a read only engine in 8.x, as opposed to restoring to 7.x which succeeds

That's expected to me: when restoring into 7.x the data just goes into a regular index, which doesn't mind that the GCP and MSN are not equal. It's only 8.x that uses the OldLuceneVersions feature, where this is a problem.

We would expect to see GCP != MSN if taking a snapshot while indexing is ongoing, so we need the OldLuceneVersions feature to handle this case. With searchable snapshots (via ILM at least) we take the snapshot after indexing has finished, so we would have GCP == MSN there and again this'd be no problem.
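To make the GCP vs MSN gap concrete, here is a tiny standalone sketch (not Elasticsearch code) of how a commit captured while indexing is still in flight ends up with a maximum sequence number ahead of the global checkpoint:

```java
// Minimal, self-contained illustration: the primary assigns sequence numbers as operations
// arrive, but the global checkpoint only advances once all in-sync copies have acknowledged
// them, so a commit taken in between records maxSeqNo > globalCheckpoint.
public class SeqNoVsGlobalCheckpointSketch {

    private long maxSeqNo = -1;           // highest sequence number assigned on the primary
    private long globalCheckpoint = -1;   // highest seq no acknowledged by all in-sync copies

    void indexOperationOnPrimary() {
        maxSeqNo++;                        // assigned immediately on the primary
    }

    void replicasAcknowledgeUpTo(long seqNo) {
        globalCheckpoint = Math.max(globalCheckpoint, seqNo);
    }

    public static void main(String[] args) {
        SeqNoVsGlobalCheckpointSketch shard = new SeqNoVsGlobalCheckpointSketch();
        shard.indexOperationOnPrimary();   // seq no 0
        shard.indexOperationOnPrimary();   // seq no 1
        shard.replicasAcknowledgeUpTo(0);  // replica responses for seq no 1 still in flight

        // A snapshot taken at this instant commits maxSeqNo = 1 but globalCheckpoint = 0,
        // which is exactly the mismatch reported when the archive index is restored.
        System.out.println("maxSeqNo=" + shard.maxSeqNo + ", globalCheckpoint=" + shard.globalCheckpoint);
    }
}
```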

I'm going to ask the engine/recovery folks to take a look.

@DaveCTurner DaveCTurner added :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination Meta label for Distributed Coordination team labels Mar 14, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing Meta label for Distributed Indexing team label Mar 14, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@javanna
Member

javanna commented Mar 14, 2025

Ok, thanks for clarifying @DaveCTurner. If the outcome is that we need to adjust some assertion in the archive indices codepath, we can take ownership back for it. I am just not sure at the moment how to proceed with fixing the issue or with adding tests for it.

@tlrx
Member

tlrx commented Mar 18, 2025

I agree we can relax the GCP == MSN assertion/check, and the easiest way is to not require the complete history for archive indices when opening the read-only engine (i.e. requireCompleteHistory: false). This is what is already done for searchable snapshots today because, as David mentioned, there is no guarantee that the GCP is equal to the max. sequence number at the time the snapshot is created.
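A hypothetical sketch of what that relaxation could look like; this is illustrative, not the actual ReadOnlyEngine code, and only the requireCompleteHistory flag and the exception message come from this issue:

```java
// Illustrative only: the GCP == MSN verification is skipped when a complete history is not
// required, which is the behaviour proposed above for archive indices and what searchable
// snapshots already rely on.
final class ReadOnlyEngineCheckSketch {

    static void ensureMaxSeqNoEqualsToGlobalCheckpoint(long maxSeqNo,
                                                       long globalCheckpoint,
                                                       boolean requireCompleteHistory) {
        if (requireCompleteHistory && globalCheckpoint != maxSeqNo) {
            throw new IllegalStateException("Maximum sequence number [" + maxSeqNo
                + "] from last commit does not match global checkpoint [" + globalCheckpoint + "]");
        }
        // requireCompleteHistory == false: tolerate a snapshot taken while indexing was ongoing.
    }
}
```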

I think the check is really important when we re-initialize shards in place and we know we won't be able to write to them, for example when closing indices or when opening read-only compatible (N-2) indices.

I suppose that creating a snapshot while a thread is continuously indexing should be enough to get GCP != MSN on CI. Otherwise, blocking TransportWriteAction responses from replicas (which return the replica's LCP, which allows the GCP to advance) should also do the trick.
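A rough sketch of the first suggestion, continuous indexing while the snapshot runs, using the low-level Java REST client; the repository name, index name, host and port are placeholders:

```java
// Keep indexing in a background thread while a snapshot is created, so some shard commits
// are captured with the global checkpoint lagging behind the maximum sequence number.
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class SnapshotWhileIndexingSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            AtomicBoolean stop = new AtomicBoolean(false);

            Thread indexer = new Thread(() -> {
                long i = 0;
                while (stop.get() == false) {
                    try {
                        Request index = new Request("POST", "/users/_doc");
                        index.setJsonEntity("{\"n\": " + (i++) + "}");
                        client.performRequest(index);
                    } catch (Exception e) {
                        // ignore transient failures in this sketch
                    }
                }
            });
            indexer.start();

            // Take the snapshot while the indexer is still running.
            Request snapshot = new Request("PUT", "/_snapshot/my_repository/snapshot_1");
            snapshot.addParameter("wait_for_completion", "true");
            snapshot.setJsonEntity("{\"indices\": \"users\"}");
            client.performRequest(snapshot);

            stop.set(true);
            indexer.join();
        }
    }
}
```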
