-
Notifications
You must be signed in to change notification settings - Fork 30
[REP] Native streaming support in Ray Datasets #18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This REP propose improving the usability and performance of Datasets for ML ingest and training workloads via native pipelining support. Signed-off-by: Eric Liang <ekhliang@gmail.com>
Signed-off-by: Eric Liang <ekhliang@gmail.com>
+ Map | ||
+ Sort | ||
+ RandomShuffle | ||
+ ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would windowed all-to-all operations be supported here?
For example, a random shuffle, but it doesn't need to operate globally and it's sufficient to shuffle within a certain window size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should be able to support this. We could add a special Window
physical operator that buffers groups of blocks and releases them. We would also have to extend the operator interface slightly to introduce the notion of resetting the state for each window.
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Hi Eric, can you give some future plans for this? What is the current major gap between Ray pipelining support and flink? Thanks |
@wilsonwang371 sure. I'm not sure Flink is the right class of system to compare against, but in general the goals of Ray pipelining are to:
Big data streaming systems are designed for reliable data processing on infinite data streams. In contrast, Ray data is designed for high throughput operations (e.g., 100+GB/s) on finite data and heterogeneous CPU/GPU pipelines. Hence there isn't a gap per se; this REP is mostly about improving Ray Data's performance for these key ML use cases. The closest competing systems would be tf.data service and DPP from Meta (https://engineering.fb.com/2022/09/19/ml-applications/data-ingestion-machine-learning-training-meta/). |
Hi Eric, Thanks for your answer. Agreed, Flink might not be the right system to compare Ray with. I saw that Ray can do some the work that other steam processing systems can do. From what I can see in Ray, there is nothing actually blocking Ray from doing such things. Therefore, I wonder what else is needed to make Ray capable of doing the things that flink and spark are designed to do.
I assume that all these points are the things Ray is not focusing right now. Am I correct? Thanks Wilson |
Indeed, there's not much missing if you look purely at the execution layer. For example, there is an active project to port Apache Beam to run on Ray: https://github.com/ray-project/ray_beam_runner. Beam on Ray would support streaming APIs, checkpointing, barriers, watermarking, etc. as provided by the Beam runtime.
In short, yes. I would put it this way:
|
This PR adds the basic interfaces and feature flags; split out from https://github.com/ray-project/ray/pull/30903/files See REP ray-project/enhancements#18 for more details.
@stephanie-wang , any further comments? |
This PR adds the basic interfaces and feature flags; split out from https://github.com/ray-project/ray/pull/30903/files See REP ray-project/enhancements#18 for more details.
…ecutor (#31579) Initial implementation of ray-project/enhancements#18, dependent on #30903 Streaming execution can be toggled with the following env var: RAY_DATASET_USE_STREAMING_EXECUTOR=0|1.
…ecutor (ray-project#31579) Initial implementation of ray-project/enhancements#18, dependent on ray-project#30903 Streaming execution can be toggled with the following env var: RAY_DATASET_USE_STREAMING_EXECUTOR=0|1. Signed-off-by: Andrea Pisoni <andreapiso@gmail.com>
Initial implementation of ray-project/enhancements#18 Original prototype: https://github.com/ray-project/ray/pull/30222/files Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: jianoaix <iamjianxiao@gmail.com>
…t#31216) This PR adds the basic interfaces and feature flags; split out from https://github.com/ray-project/ray/pull/30903/files See REP ray-project/enhancements#18 for more details. Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
This REP improves the usability and performance of Datasets for ML ingest and training workloads via native pipelining support.
Signed-off-by: Eric Liang ekhliang@gmail.com