[data] [streaming] Support/test fault tolerance with ActorPoolStrategy #31794

@ericl

Implement fault tolerance for the actor pool strategy in MapOperator. This involves enabling actor restarts on failure for MapOperator, and testing this out in practice.
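
As a rough illustration of what "enabling actor restarts" could look like at the Ray Core level, the sketch below wraps a class with the `max_restarts` and `max_task_retries` actor options. The helper name `make_fault_tolerant` and the constant `FAULT_TOLERANT_ACTOR_OPTIONS` are hypothetical, and how MapOperator would actually thread these options through to its actor pool is not specified in this issue.

```python
# Hypothetical names; only max_restarts / max_task_retries are Ray actor options.
FAULT_TOLERANT_ACTOR_OPTIONS = {
    "max_restarts": -1,      # restart the actor process indefinitely after a crash
    "max_task_retries": -1,  # retry failed actor tasks indefinitely after a restart
}


def make_fault_tolerant(actor_cls, **overrides):
    """Wrap a plain class as a Ray actor class that is restarted on failure."""
    # Imported lazily so the options above can be inspected without Ray installed.
    import ray

    options = {**FAULT_TOLERANT_ACTOR_OPTIONS, **overrides}
    return ray.remote(**options)(actor_cls)
```

Using `-1` asks for unlimited restarts and retries; a finite bound passed through `overrides` may be a safer choice for tests, so that a persistently failing actor surfaces as an error instead of looping.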

There are two levels of testing:

  • Actor faults: randomly kill x% of actors (e.g., call os._exit() in the inference call) and see whether the pipeline recovers.
  • Lineage faults: randomly kill x% of raylets (e.g., run pkill raylet in the inference call) and see whether lost objects are recovered.
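
One possible way to inject the actor faults described above is to wrap the inference callable so that each call kills the worker process with some probability. This is a minimal sketch: the class name `FaultInjectingFn`, its parameters, and the per-call interpretation of "x%" are assumptions, not part of Ray.

```python
import os
import random


def kill_process():
    """Terminate the current process immediately, skipping cleanup handlers."""
    os._exit(1)


class FaultInjectingFn:
    """Hypothetical wrapper: kill the process on a random fraction of calls."""

    def __init__(self, fn, kill_fraction, seed=None, kill=kill_process):
        if not 0.0 <= kill_fraction <= 1.0:
            raise ValueError("kill_fraction must be in [0, 1]")
        self._fn = fn
        self._kill_fraction = kill_fraction
        self._rng = random.Random(seed)
        self._kill = kill  # injectable so the wrapper can be unit-tested safely

    def __call__(self, batch):
        # Decide independently on every call whether to simulate a crash.
        if self._rng.random() < self._kill_fraction:
            self._kill()
        return self._fn(batch)
```

For the lineage-fault case, the `kill` argument could instead run `pkill raylet` (for example via `subprocess.run`), as suggested in the second bullet; that variant takes down the whole node's raylet and not only the calling actor.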

One limitation of our actor fault tolerance right now is that it requires the identical original actor to complete reconstruction. This may be a problem.

Labels: P2 (important issue, but not time-critical), data (Ray Data-related issues), enhancement (request for new feature and/or capability)
