Open
Labels
- P2: Important issue, but not time-critical
- data: Ray Data-related issues
- enhancement: Request for new feature and/or capability
Description
Implement fault tolerance for the actor pool strategy in MapOperator. This involves enabling actor restarts on failure for the actors that MapOperator manages, and testing the behavior in practice.
There are two levels of testing:
- Actor faults: randomly kill x% of actors (e.g., call os._exit() inside the inference call) and check whether the job recovers.
- Lineage faults: randomly kill x% of raylets (e.g., run pkill raylet inside the inference call) and check whether lost objects are recovered.
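The two fault-injection ideas above could be sketched roughly as follows. This is a minimal, stdlib-only sketch rather than the actual test harness: the names `FaultyInference`, `should_inject_fault`, `kill_raylet`, and the `fail_fraction`, `seed`, and `kill` parameters are all hypothetical. It also assumes the fault is injected with a fixed probability per inference call, which only approximates "kill x% of actors".

```python
import os
import random
import subprocess


def should_inject_fault(fail_fraction, rng):
    """Return True with probability fail_fraction (expected in [0, 1])."""
    if not 0.0 <= fail_fraction <= 1.0:
        raise ValueError("fail_fraction must be between 0 and 1")
    return rng.random() < fail_fraction


def kill_actor_process():
    """Actor fault: exit the current process immediately, skipping cleanup."""
    os._exit(1)


def kill_raylet():
    """Lineage fault: ask the OS to kill raylet processes on this node."""
    subprocess.run(["pkill", "raylet"], check=False)


class FaultyInference:
    """Hypothetical callable class meant to stand in for an inference UDF."""

    def __init__(self, fail_fraction=0.1, seed=None, kill=kill_actor_process):
        self._fail_fraction = fail_fraction
        self._rng = random.Random(seed)
        # The kill action is injectable so the logic can be tested
        # without actually terminating a process.
        self._kill = kill

    def __call__(self, batch):
        if should_inject_fault(self._fail_fraction, self._rng):
            self._kill()
        # Placeholder for the real inference step: return the batch unchanged.
        return batch
```

In an actual test, a class like this would presumably be passed to the map stage that runs on the actor pool, with `kill=kill_raylet` for the lineage-fault variant. How the class is wired into the pipeline depends on the Ray Data version in use.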
One limitation of our actor fault tolerance right now is that reconstruction can only be completed by the identical original actor. This may be a problem for an actor pool.