Error restoring backup from version 6.8 to 8.6 #93389

Open
lnowicki10 opened this issue Jan 31, 2023 · 11 comments
Labels
>bug :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. Team:Distributed Indexing Meta label for Distributed Indexing team

Comments

@lnowicki10

Elasticsearch Version

8.6.1

Installed Plugins

No response

Java Version

bundled

OS Version

CentOS 7

Problem Description

We are unable to restore a large backup taken on version 6.8.13 onto an 8.6.1 cluster.

The restore fails on some shards with these messages:
Caused by: [users/owINpknoTCmTe2kMmzuRhg][[users][25]] org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
    ... 20 more
Caused by: [users/owINpknoTCmTe2kMmzuRhg][[users][25]] org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
    ... 18 more
Caused by: java.lang.IllegalStateException: Maximum sequence number [117122753] from last commit does not match global checkpoint [117122751]
The same snapshot restores without a problem on a 7.x cluster. The problem occurs on random shards of large indices with lots of data (100 shards, 2 TB of data).

Steps to Reproduce

Try to restore a large dataset from 6.x to 8.x.

Logs (if relevant)

[2023-01-30T15:17:10,547][WARN ][o.e.c.r.a.AllocationService] [es-master1-archive] failing shard [FailedShard[routingEntry=[users][13], node[do0HqvNyRluI84MCgDzBHA], [P], recovery_source[snapshot recovery [LVdaXt-nQ8i0D2lEy0QWAw] from backupS3_1674680402:snapshot_1674939601/6h2NrRU3RcmQgmnZizHIoA], s[INITIALIZING], a[id=2JHqNugaSfSRYkGsKHrYBA], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-01-30T14:17:10.140Z], failed_attempts[4], failed_nodes[[do0HqvNyRluI84MCgDzBHA]], delayed=false, last_node[do0HqvNyRluI84MCgDzBHA], details[failed shard on node [do0HqvNyRluI84MCgDzBHA]: failed recovery, failure
org.elasticsearch.indices.recovery.RecoveryFailedException: [users][13]: Recovery failed on {es16-archive-2}{do0HqvNyRluI84MCgDzBHA}{6RlBvSk_QjOUpaDcpJEkhg}{es16-archive-2}{10.94.121.5}{10.94.121.5:9301}{d}{xpack.installed=true}
    at org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$24(IndexShard.java:3123)
    at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:170)
    at org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:385)
    at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:170)
    at org.elasticsearch.index.shard.StoreRecovery.lambda$restore$8(StoreRecovery.java:518)
    at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:170)
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:164)
    at org.elasticsearch.action.ActionListener$DelegatingActionListener.onResponse(ActionListener.java:212)
    at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onResponse(ActionListener.java:397)
    at org.elasticsearch.repositories.blobstore.FileRestoreContext.lambda$restore$1(FileRestoreContext.java:166)
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)
    at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:127)
    at org.elasticsearch.action.support.GroupedActionListener.onResponse(GroupedActionListener.java:55)
    at org.elasticsearch.action.ActionListener$DelegatingActionListener.onResponse(ActionListener.java:212)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$11.executeOneFileRestore(BlobStoreRepository.java:3066)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$11.lambda$executeOneFileRestore$1(BlobStoreRepository.java:3075)
    at org.elasticsearch.action.ActionRunnable$3.doRun(ActionRunnable.java:72)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:917)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.lang.Thread.run(Thread.java:1589)
Caused by: [users/owINpknoTCmTe2kMmzuRhg][[users][13]] org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
    ... 20 more
Caused by: [users/owINpknoTCmTe2kMmzuRhg][[users][13]] org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
    ... 18 more
Caused by: java.lang.IllegalStateException: Maximum sequence number [105904628] from last commit does not match global checkpoint [105904627]
    at org.elasticsearch.index.engine.ReadOnlyEngine.ensureMaxSeqNoEqualsToGlobalCheckpoint(ReadOnlyEngine.java:184)
    at org.elasticsearch.index.engine.ReadOnlyEngine.<init>(ReadOnlyEngine.java:121)
    at org.elasticsearch.xpack.lucene.bwc.OldLuceneVersions.lambda$getEngineFactory$4(OldLuceneVersions.java:248)
    at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1949)
    at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1913)
    at org.elasticsearch.index.shard.StoreRecovery.lambda$restore$7(StoreRecovery.java:513)
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)
    ... 15 more

@lnowicki10 lnowicki10 added >bug needs:triage Requires assignment of a team area label labels Jan 31, 2023
@DaveCTurner DaveCTurner added :Search/Search Search-related issues that do not fall into other categories and removed needs:triage Requires assignment of a team area label labels Jan 31, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jan 31, 2023
@DaveCTurner
Contributor

Maximum sequence number [105904628] from last commit does not match global checkpoint [105904627]

I think this can legitimately happen if the snapshot was taken while indexing was ongoing. We don't restore regular snapshots into a ReadOnlyEngine so I think it's not an issue there, hence labelling this for the search team. It's possible this affects searchable snapshots too, although I think less frequently because of how ILM typically manages them.

@benwtrent benwtrent added the priority:high A label for assessing bug priority to be used by ES engineers label Jul 9, 2024
@javanna javanna added :Search Foundations/Search Catch all for Search Foundations and removed :Search/Search Search-related issues that do not fall into other categories labels Jul 17, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jul 17, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine elasticsearchmachine removed the Team:Search Meta label for search team label Jul 17, 2024
@javanna
Member

javanna commented Nov 21, 2024

Hey @lnowicki10, sorry for the lag. I wonder how you restored the backup taken from 6.8. Did you mean to restore it as an archive index, in read-only mode? The error you got is confusing, but indices created before 7.0 are not supported in 8.x; see https://www.elastic.co/guide/en/elasticsearch/reference/7.17/setup-upgrade.html . They can, however, be imported as archive indices with limited functionality; see https://www.elastic.co/guide/en/elasticsearch/reference/current/archive-indices.html . Otherwise, they need to be reindexed.
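For reference, restoring such a snapshot as an archive index goes through the regular snapshot restore API. Below is a minimal sketch using the low-level Java REST client; the repository, snapshot and index names are taken from the logs in this issue, while the host and port are placeholder assumptions.

```java
// Sketch only: restore the 6.8 snapshot onto the 8.x cluster as an archive index.
// Repository/snapshot/index names come from the logs above; host and port are placeholders.
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class RestoreArchiveIndexSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // POST /_snapshot/<repository>/<snapshot>/_restore
            Request restore = new Request("POST",
                "/_snapshot/backupS3_1674680402/snapshot_1674939601/_restore");
            restore.setJsonEntity("{\"indices\": \"users\"}");
            Response response = client.performRequest(restore);
            System.out.println(response.getStatusLine());
        }
    }
}
```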

@javanna
Member

javanna commented Nov 21, 2024

Never mind, you can ignore my question above. I looked more closely at the stack trace you provided, and it does show that you imported the index as an archive index, which should work.

@javanna
Member

javanna commented Mar 14, 2025

After further consideration, I am not quite sure how this error would be specific to archive indices. Archive indices rely on a read-only engine, which searchable snapshots also rely on, as mentioned above. It is peculiar that the error appears only when restoring against a read-only engine in 8.x, whereas restoring to 7.x succeeds; since the read-only engine is the only way to restore such an index to 8.x, we can't test whether it would restore in 8.x against an ordinary engine. I am not sure how we can reproduce this issue, I suspect it has to do with how the snapshot was created, and I am not familiar with the read-only engine assertion that trips. Rerouting back to distrib for more info; this does not appear to be a search problem after all.

@javanna javanna added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs and removed priority:high A label for assessing bug priority to be used by ES engineers :Search Foundations/Search Catch all for Search Foundations labels Mar 14, 2025
@elasticsearchmachine elasticsearchmachine added Team:Distributed Coordination Meta label for Distributed Coordination team and removed Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch labels Mar 14, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@DaveCTurner
Contributor

It is peculiar that the error appears only when restoring against a read only engine in 8.x, as opposed to restoring to 7.x which succeeds

That's expected to me: when restoring into 7.x the data just goes into a regular index, which doesn't mind that the GCP and MSN are not equal. It's only 8.x that uses the OldLuceneVersions feature, where this is a problem.

We would expect to see GCP != MSN if taking a snapshot while indexing is ongoing, so we need the OldLuceneVersions feature to handle this case. With searchable snapshots (via ILM at least) we take the snapshot after indexing has finished, so we would have GCP == MSN there and again this'd be no problem.
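To make the GCP vs MSN gap concrete, here is a tiny standalone sketch (not Elasticsearch code) of how a commit captured while indexing is still in flight ends up with a maximum sequence number ahead of the global checkpoint:

```java
// Minimal, self-contained illustration: the primary assigns sequence numbers as operations
// arrive, but the global checkpoint only advances once all in-sync copies have acknowledged
// them, so a commit taken in between records maxSeqNo > globalCheckpoint.
public class SeqNoVsGlobalCheckpointSketch {

    private long maxSeqNo = -1;           // highest sequence number assigned on the primary
    private long globalCheckpoint = -1;   // highest seq no acknowledged by all in-sync copies

    void indexOperationOnPrimary() {
        maxSeqNo++;                        // assigned immediately on the primary
    }

    void replicasAcknowledgeUpTo(long seqNo) {
        globalCheckpoint = Math.max(globalCheckpoint, seqNo);
    }

    public static void main(String[] args) {
        SeqNoVsGlobalCheckpointSketch shard = new SeqNoVsGlobalCheckpointSketch();
        shard.indexOperationOnPrimary();   // seq no 0
        shard.indexOperationOnPrimary();   // seq no 1
        shard.replicasAcknowledgeUpTo(0);  // replica responses for seq no 1 still in flight

        // A snapshot taken at this instant commits maxSeqNo = 1 but globalCheckpoint = 0,
        // which is exactly the mismatch reported when the archive index is restored.
        System.out.println("maxSeqNo=" + shard.maxSeqNo + ", globalCheckpoint=" + shard.globalCheckpoint);
    }
}
```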

I'm going to ask the engine/recovery folks to take a look.

@DaveCTurner DaveCTurner added :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination Meta label for Distributed Coordination team labels Mar 14, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing Meta label for Distributed Indexing team label Mar 14, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@javanna
Member

javanna commented Mar 14, 2025

Ok, thanks for clarifying @DaveCTurner. If the outcome is that we need to adjust some assertion in the archive indices codepath, we can take ownership back for it. I am just not sure at the moment how to proceed with fixing the issue or with adding tests for it.

@tlrx
Member

tlrx commented Mar 18, 2025

I agree we can relax the GCP == MSN assertion/check, and the easiest way is to not require the complete history for archive indices when opening the read-only engine (i.e. requireCompleteHistory: false). This is what is already done for searchable snapshots today because, as David mentioned, there is no guarantee that the GCP is equal to the max. sequence number at the time the snapshot is created.
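A hypothetical sketch of what that relaxation could look like; this is illustrative, not the actual ReadOnlyEngine code, and only the requireCompleteHistory flag and the exception message come from this issue:

```java
// Illustrative only: the GCP == MSN verification is skipped when a complete history is not
// required, which is the behaviour proposed above for archive indices and what searchable
// snapshots already rely on.
final class ReadOnlyEngineCheckSketch {

    static void ensureMaxSeqNoEqualsToGlobalCheckpoint(long maxSeqNo,
                                                       long globalCheckpoint,
                                                       boolean requireCompleteHistory) {
        if (requireCompleteHistory && globalCheckpoint != maxSeqNo) {
            throw new IllegalStateException("Maximum sequence number [" + maxSeqNo
                + "] from last commit does not match global checkpoint [" + globalCheckpoint + "]");
        }
        // requireCompleteHistory == false: tolerate a snapshot taken while indexing was ongoing.
    }
}
```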

I think the check is really important when we re-initialize shards in place and we know we won't be able to write to them, for example when closing indices or when opening read-only compatible (N-2) indices.

I suppose that creating a snapshot while a thread is continuously indexing should be enough to get GCP != MSN on CI. Otherwise, blocking TransportWriteAction responses from replicas (which return the replica's LCP, which allows the GCP to advance) should also do the trick.
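A rough sketch of the first suggestion, continuous indexing while the snapshot runs, using the low-level Java REST client; the repository name, index name, host and port are placeholders:

```java
// Keep indexing in a background thread while a snapshot is created, so some shard commits
// are captured with the global checkpoint lagging behind the maximum sequence number.
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class SnapshotWhileIndexingSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            AtomicBoolean stop = new AtomicBoolean(false);

            Thread indexer = new Thread(() -> {
                long i = 0;
                while (stop.get() == false) {
                    try {
                        Request index = new Request("POST", "/users/_doc");
                        index.setJsonEntity("{\"n\": " + (i++) + "}");
                        client.performRequest(index);
                    } catch (Exception e) {
                        // ignore transient failures in this sketch
                    }
                }
            });
            indexer.start();

            // Take the snapshot while the indexer is still running.
            Request snapshot = new Request("PUT", "/_snapshot/my_repository/snapshot_1");
            snapshot.addParameter("wait_for_completion", "true");
            snapshot.setJsonEntity("{\"indices\": \"users\"}");
            client.performRequest(snapshot);

            stop.set(true);
            indexer.join();
        }
    }
}
```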
