Contributor

@shivlaks shivlaks commented Sep 10, 2021

Summary

Currently, integration tests fail intermittently in a few different ways. These tests run
as part of the PR build as well as on pushes to branches.

createModel and createEndpoint fail most frequently, and the failures are
primarily:

  • Rate exceeded - ThrottlingException.

This change defines a default retry strategy that retries up to 5 times, waiting 5 seconds
before the first retry and doubling the wait on each subsequent retry (a backoff rate of 2).
The methodology behind this strategy is naive and may need some calibration, but it should
reduce the frequency of failures in the short term.

We can adjust the retry strategy as we go and expand to something more API-specific as
the need arises.
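
As a rough illustration of what these values imply, the sketch below computes the wait before each retry. It assumes the wait starts at `IntervalSeconds` and is multiplied by `BackoffRate` for every later retry, with no cap or jitter applied. The helper name `retry_delays` is hypothetical and is not part of this change.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Return the wait, in seconds, before each retry attempt.

    Assumes the first retry waits `interval_seconds` and each later
    retry multiplies the previous wait by `backoff_rate`.
    """
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


# Values taken from the retry strategy described above.
delays = retry_delays(interval_seconds=5, max_attempts=5, backoff_rate=2)
print(delays)       # [5, 10, 20, 40, 80]
print(sum(delays))  # 155
```

Under these assumptions, a step that keeps getting throttled would spend about 155 seconds waiting across its retries before the error propagates, which is one number to keep in mind when calibrating the strategy.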

Testing

  • ran integ tests a few times locally - ensured they had the retry in the ASL definition and
    executed through successfully.

Rendered retry from the state machine definition on SageMaker steps:

"Retry": [
  {
    "ErrorEquals": [
      "SageMaker.AmazonSageMakerException"
    ],
    "IntervalSeconds": 5,
    "MaxAttempts": 5,
    "BackoffRate": 2
  }
]
  • also pushed some dummy / trivial commits to this PR to trigger simultaneous builds. haven't
    seen any state machine failures yet 🤞
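
The manual check described above (confirming the retry appears in the ASL definition) could be automated along these lines. This is only a sketch: the helper `task_states_missing_retry` and the two-state example definition are hypothetical, and the helper inspects top-level states only, so it does not descend into Parallel or Map states.

```python
EXPECTED_ERROR = "SageMaker.AmazonSageMakerException"


def task_states_missing_retry(definition, error_name=EXPECTED_ERROR):
    """Return names of top-level Task states with no retrier for `error_name`."""
    missing = []
    for name, state in definition.get("States", {}).items():
        if state.get("Type") != "Task":
            continue
        retriers = state.get("Retry", [])
        if not any(error_name in r.get("ErrorEquals", []) for r in retriers):
            missing.append(name)
    return missing


# Hypothetical definition, used only to illustrate the check.
example_definition = {
    "StartAt": "Create Model",
    "States": {
        "Create Model": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createModel",
            "Retry": [
                {
                    "ErrorEquals": [EXPECTED_ERROR],
                    "IntervalSeconds": 5,
                    "MaxAttempts": 5,
                    "BackoffRate": 2,
                }
            ],
            "Next": "Create Endpoint",
        },
        "Create Endpoint": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createEndpoint",
            "End": True,
        },
    },
}

print(task_states_missing_retry(example_definition))  # ['Create Endpoint']
```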

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Contributor

@ca-nguyen ca-nguyen left a comment

Awesome work!
This will allow us to push changes without worrying about triggering build failures!

@StepFunctions-Bot
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildProject6AEA49D1-sEHrOdk7acJc
  • Commit ID: c788e7b
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@shivlaks shivlaks merged commit 74d0f07 into main Sep 10, 2021
@wong-a wong-a deleted the shivlaks/add-retry branch September 10, 2021 21:07

4 participants