Skip to content

Conversation

@slaskawi
Copy link
Contributor

@slaskawi slaskawi commented Jul 12, 2023

Summary

This Pull Request changes the core deadlock detection algorithm and replaces it with detecting Wait Steps (the steps with isWaitStep: true. This loosens up the way the Probe behaves and makes it less restrictive.

The end user impact is completely opposite now. In the worst case scenario, we will report the database to be ready (instead of not ready and blocking the rollout as we've seen before). The end user experience is much better as it may experience a short outage instead of full database blackout.

This Pull Request also introduces name refactoring and aligns it with the field names in JSON file. This way, it's less confusing when discussing this with a wider Team.

Technical details

A lot of this design is related to this conversation: #1332

Testing results in both Community and EA

  • All tests against Community seem to be fine. I can successfully delete any number of Pods (being careful not to lose quorum, which causes a short outage - until a new quorum is established - see RAFT algorithm)
  • All tests against EA seem to be fine. When introducing AutomationConfig changes and deleting Pods during the Rolling Update, it's easy to lose quorum. But even then, the system recovers in a few seconds.

All tests against the Enterprise are successful link.

@slaskawi slaskawi added the safe-to-test Add this label to PRs from forks to trigger E2E tests label Jul 12, 2023
fifteenSecondsAgo := time.Now().Add(time.Duration(-15) * time.Second)
if contains.String(riskySteps, status.Step) && status.Completed == nil && status.Started.Before(fifteenSecondsAgo) {
logger.Infof("Indicated a possible deadlock, status: %s, started at %s but hasn't finished "+
if status.IsWaitStep && status.Completed == nil && status.Started.Before(fifteenSecondsAgo) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please verify if this will return "ready" too early in the following conditions:

  • database is in a stable state
  • rolling update is performed without any changes to the automation config
  • rolling update is performed and we changed some things in automation config:
  • I'm not sure what changes that would be in AC that would cause new plans to be generated

This is to ensure we won't cause intermittent downtimes when we're performing normal operator upgrades. Now we are not concerned about it and we even skip mentioning it in the release notes. We should be confident that operator upgrades or changes performed by the users to mdb (changing sts or ac) are safe and won't ever cause database downtimes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot - I'll perform all the test in both Community and Enterprise.

I'm not sure what changes that would be in AC that would cause new plans to be generated

This one is very, very tricky. I only managed to artificially test it by handcrafting the changes (and it worked fine). The thing is that under certain circumstances, the Automation Agent might completely recompute the plan. Then, the new plan is appended to the (therefore, it's the last one). So far I've seen only once such situation - someone seemed to delete the automation agent user when rolling out new deployment. However, I failed to recreate this scenario with more recent versions of the Agent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I performed all the test you mentioned. They all seem to be fine and at the same time, the probe seems to properly guard the cluster from rolling multiple members at once.

@slaskawi slaskawi force-pushed the CLOUDP-189434-Use-Wait-Steps-instead-of-risky-Steps branch from 14c0423 to 821eeb4 Compare July 13, 2023 06:29
@github-actions github-actions bot removed the safe-to-test Add this label to PRs from forks to trigger E2E tests label Jul 13, 2023
@slaskawi slaskawi force-pushed the CLOUDP-189434-Use-Wait-Steps-instead-of-risky-Steps branch 2 times, most recently from 5bb965b to c8a69e3 Compare July 13, 2023 07:41
@slaskawi slaskawi force-pushed the CLOUDP-189434-Use-Wait-Steps-instead-of-risky-Steps branch from c8a69e3 to a40f7bd Compare July 14, 2023 10:16
Copy link
Collaborator

@nammn nammn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for finding this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants