-
Notifications
You must be signed in to change notification settings - Fork 522
CLOUDP-189434 Using Wait Steps instead of risky Steps #1332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLOUDP-189434 Using Wait Steps instead of risky Steps #1332
Conversation
| fifteenSecondsAgo := time.Now().Add(time.Duration(-15) * time.Second) | ||
| if contains.String(riskySteps, status.Step) && status.Completed == nil && status.Started.Before(fifteenSecondsAgo) { | ||
| logger.Infof("Indicated a possible deadlock, status: %s, started at %s but hasn't finished "+ | ||
| if status.IsWaitStep && status.Completed == nil && status.Started.Before(fifteenSecondsAgo) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please verify if this will return "ready" too early in the following conditions:
- database is in a stable state
- rolling update is performed without any changes to the automation config
- rolling update is performed and we changed some things in automation config:
- I'm not sure what changes that would be in AC that would cause new plans to be generated
This is to ensure we won't cause intermittent downtimes when we're performing normal operator upgrades. Now we are not concerned about it and we even skip mentioning it in the release notes. We should be confident that operator upgrades or changes performed by the users to mdb (changing sts or ac) are safe and won't ever cause database downtimes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot - I'll perform all the test in both Community and Enterprise.
I'm not sure what changes that would be in AC that would cause new plans to be generated
This one is very, very tricky. I only managed to artificially test it by handcrafting the changes (and it worked fine). The thing is that under certain circumstances, the Automation Agent might completely recompute the plan. Then, the new plan is appended to the (therefore, it's the last one). So far I've seen only once such situation - someone seemed to delete the automation agent user when rolling out new deployment. However, I failed to recreate this scenario with more recent versions of the Agent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I performed all the test you mentioned. They all seem to be fine and at the same time, the probe seems to properly guard the cluster from rolling multiple members at once.
14c0423 to
821eeb4
Compare
5bb965b to
c8a69e3
Compare
c8a69e3 to
a40f7bd
Compare
nammn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for finding this!
Summary
This Pull Request changes the core deadlock detection algorithm and replaces it with detecting Wait Steps (the steps with
isWaitStep: true. This loosens up the way the Probe behaves and makes it less restrictive.The end user impact is completely opposite now. In the worst case scenario, we will report the database to be ready (instead of not ready and blocking the rollout as we've seen before). The end user experience is much better as it may experience a short outage instead of full database blackout.
This Pull Request also introduces name refactoring and aligns it with the field names in JSON file. This way, it's less confusing when discussing this with a wider Team.
Technical details
A lot of this design is related to this conversation: #1332
Testing results in both Community and EA
All tests against the Enterprise are successful link.