[CI] MinioRepositoryAnalysisRestIT testRepositoryAnalysis failing #122670

Open
elasticsearchmachine opened this issue Feb 15, 2025 · 4 comments
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs low-risk An open issue or test failure that is a low risk to future releases Team:Distributed Coordination Meta label for Distributed Coordination team >test-failure Triaged test failures from CI

Comments

elasticsearchmachine commented Feb 15, 2025

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:snapshot-repo-test-kit:qa:minio:javaRestTest" --tests "org.elasticsearch.repositories.blobstore.testkit.analyze.MinioRepositoryAnalysisRestIT.testRepositoryAnalysis" -Dtests.seed=6B294144CF0B45D0 -Dtests.locale=ast-ES -Dtests.timezone=America/Dominica -Druntime.java=23

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.net.SocketTimeoutException: 60.000 milliseconds timeout on connection http-outgoing-0 [ACTIVE]

Issue Reasons:

  • [main] 2 failures in test testRepositoryAnalysis (0.3% fail rate in 763 executions)
  • [main] 2 failures in pipeline elasticsearch-periodic-platform-support (15.4% fail rate in 13 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Feb 15, 2025
@elasticsearchmachine

This has been muted on branch 9.0

Mute Reasons:

  • [9.0] 2 failures in test testRepositoryAnalysis (1.4% fail rate in 138 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue Feb 15, 2025
@elasticsearchmachine elasticsearchmachine added the needs:risk Requires assignment of a risk label (low, medium, blocker) label Feb 15, 2025
@elasticsearchmachine

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Feb 15, 2025
@ywangd ywangd self-assigned this Feb 17, 2025
@ywangd ywangd added low-risk An open issue or test failure that is a low risk to future releases and removed needs:risk Requires assignment of a risk label (low, medium, blocker) labels Feb 17, 2025
ywangd commented Feb 17, 2025

The repo analysis request timed out. This should be a test or test-environment issue, so I'm marking it low risk.

ywangd commented Feb 17, 2025

It seems to me that Minio is slow at aborting multipart uploads, or that its behaviour for concurrent multipart uploads is not as robust as we expect.

[2025-02-16T06:24:31,081][INFO ][o.e.r.b.t.a.RepositoryAnalyzeAction] [test-cluster-0] running analysis of repository [repository] using path [temp-analysis-pDnBcyt2T32puRcorBLDpQ]
...
[2025-02-16T06:24:46,618][WARN ][o.e.r.s.S3BlobContainer  ] [test-cluster-0] cleaning up stale compare-and-swap upload [YmU4YWQ4NTctZWNhZC00NzE3LWI3OTgtZmUxY2M1NDAxZjlhLjk3MjUyNmFkLTkwM2ItNGRhZC04OGFhLTRkYmVkNTMyNzI3MngxNzM5NzE5NDcxNDA1ODg1NDUw] initiated at [2025-02-16T06:24:31.405-0900]
... Many more lines of similar logging messages
[YmU4YWQ4NTctZWNhZC00NzE3LWI3OTgtZmUxY2M1NDAxZjlhLjNkZWFjN2Q1LWVkOWEtNGJlZS1iYzJmLWRiZGMxNjlmMzYxMngxNzM5NzE5NDcxNDA1ODg1NTE2] initiated at [2025-02-16T06:24:31.405-0900]
[2025-02-16T06:25:28,492][WARN ][o.e.r.s.S3BlobContainer  ] [test-cluster-0] cleaning up stale compare-and-swap upload 
...
... TIMED OUT

It kept attempting to abort the same stale multipart upload for ~42s, and there were other aborts as well. The aborts had likely not completed when the request timed out at 60 seconds. In the cleanup logs, we see:

[2025-02-16T06:25:32,014][INFO ][o.e.r.b.t.a.MinioRepositoryAnalysisRestIT] [testRepositoryAnalysis] There are still tasks running after this test that might break subsequent tests [cluster:admin/repository/analyze, cluster:admin/repository/analyze/register, health-node[c]].

So the analysis was still running. I'm not sure what we can do on the ES side if this is a Minio issue. Maybe we could add and/or enable more logging to get a better picture of the progress.
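
For anyone re-checking the timings, here is a minimal sketch that recomputes the ~42s figure from the timestamps in the log excerpt. It is not part of the Elasticsearch test suite; the function names are hypothetical, and the log lines are copied from the excerpt with the message text truncated after the fixed prefix. It also assumes the "running analysis" log line marks roughly when the request started, which may not be exact.

```python
from datetime import datetime

# Log lines copied from the excerpt in this comment; message text is truncated
# after the fixed prefix (upload IDs omitted), timestamps are unchanged.
ANALYSIS_START = "[2025-02-16T06:24:31,081][INFO ][o.e.r.b.t.a.RepositoryAnalyzeAction] [test-cluster-0] running analysis of repository"
CLEANUP_LINES = [
    "[2025-02-16T06:24:46,618][WARN ][o.e.r.s.S3BlobContainer  ] [test-cluster-0] cleaning up stale compare-and-swap upload",
    "[2025-02-16T06:25:28,492][WARN ][o.e.r.s.S3BlobContainer  ] [test-cluster-0] cleaning up stale compare-and-swap upload",
]
CLEANUP_MARKER = "cleaning up stale compare-and-swap upload"


def parse_timestamp(line: str) -> datetime:
    """Parse the leading [yyyy-MM-ddTHH:mm:ss,SSS] field of a log line."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S,%f")


def cleanup_span_seconds(lines) -> float:
    """Seconds between the first and last stale-upload cleanup warnings."""
    stamps = [parse_timestamp(l) for l in lines if CLEANUP_MARKER in l]
    return (max(stamps) - min(stamps)).total_seconds()


def seconds_since_start(start_line: str, lines) -> float:
    """Seconds from the analysis start line to the last cleanup warning."""
    last = max(parse_timestamp(l) for l in lines if CLEANUP_MARKER in l)
    return (last - parse_timestamp(start_line)).total_seconds()


if __name__ == "__main__":
    print(f"cleanup span: {cleanup_span_seconds(CLEANUP_LINES):.3f}s")
    print(f"last cleanup after start: {seconds_since_start(ANALYSIS_START, CLEANUP_LINES):.3f}s")
```

On the quoted lines this gives a cleanup span of 41.874s, with the last cleanup warning logged 57.411s after the analysis start line, which is close to the 60-second client timeout.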
