
Conversation

Contributor

@ericl commented Dec 14, 2022

This REP improves the usability and performance of Datasets for ML ingest and training workloads via native pipelining support.

Signed-off-by: Eric Liang <ekhliang@gmail.com>

This REP proposes improving the usability and performance of Datasets for ML ingest and training workloads via native pipelining support.

Signed-off-by: Eric Liang <ekhliang@gmail.com>
@ericl changed the title from "[WIP] Native pipelining support in Ray Datasets" to "[REP] Native pipelining support in Ray Datasets" on Dec 14, 2022
Signed-off-by: Eric Liang <ekhliang@gmail.com>
+ Map
+ Sort
+ RandomShuffle
+ ...
Contributor


Would windowed all-to-all operations be supported here?

For example, a random shuffle that doesn't need to operate globally; shuffling within a certain window size would be sufficient.

Contributor Author


Yes, we should be able to support this. We could add a special Window physical operator that buffers groups of blocks and releases them. We would also have to extend the operator interface slightly to introduce the notion of resetting the state for each window.
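
For illustration, here is a minimal, hypothetical sketch of such an operator (this is not the actual Ray Datasets operator interface; the class and method names are invented for the example): it buffers blocks until a window fills, applies an all-to-all transform such as a shuffle within that window, and resets its state between windows.

```python
import random
from typing import Any, Callable, List


class WindowOperator:
    """Hypothetical physical operator: buffers up to `window_size` blocks,
    applies an all-to-all transform to each full window, then resets."""

    def __init__(self, window_size: int,
                 transform: Callable[[List[Any]], List[Any]]):
        self.window_size = window_size
        self.transform = transform
        self._buffer: List[Any] = []

    def add_input(self, block: Any) -> List[Any]:
        # Accept one block; emit a transformed window once the buffer fills.
        self._buffer.append(block)
        if len(self._buffer) >= self.window_size:
            return self._flush()
        return []

    def inputs_done(self) -> List[Any]:
        # Flush any partially filled final window.
        return self._flush() if self._buffer else []

    def _flush(self) -> List[Any]:
        out = self.transform(self._buffer)
        self._buffer = []  # reset per-window state
        return out


# Example: a windowed (non-global) random shuffle over 10 blocks, window of 4.
op = WindowOperator(window_size=4,
                    transform=lambda blocks: random.sample(blocks, len(blocks)))
shuffled = []
for block in range(10):
    shuffled.extend(op.add_input(block))
shuffled.extend(op.inputs_done())
print(shuffled)
```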

Signed-off-by: Eric Liang <ekhliang@gmail.com>
@wilsonwang371
Contributor

Hi Eric, can you share the future plans for this? What is the current major gap between Ray pipelining support and Flink?

Thanks

@ericl
Contributor Author

ericl commented Dec 19, 2022

@wilsonwang371 sure. I'm not sure Flink is the right class of system to compare against, but in general the goals of Ray pipelining are to:

  • Minimize memory usage for streaming ML preprocessing and ingest into ML training jobs
  • Process large datasets for batch inference without needing to spill to disk

Big data streaming systems are designed for reliable data processing on infinite data streams. In contrast, Ray Data is designed for high-throughput operations (e.g., 100+ GB/s) on finite data and heterogeneous CPU/GPU pipelines.

Hence there isn't a gap per se; this REP is mostly about improving Ray Data's performance for these key ML use cases. The closest competing systems would be tf.data service and DPP from Meta (https://engineering.fb.com/2022/09/19/ml-applications/data-ingestion-machine-learning-training-meta/).

@wilsonwang371
Contributor

Big data streaming systems are designed for reliable data processing on infinite data streams. In contrast, Ray Data is designed for high-throughput operations (e.g., 100+ GB/s) on finite data and heterogeneous CPU/GPU pipelines.

Hi Eric,

Thanks for your answer. Agreed, Flink might not be the right system to compare Ray with.

I saw that Ray can do some of the work that other stream processing systems can do. From what I can see in Ray, there is nothing actually blocking Ray from doing such things. Therefore, I wonder what else is needed to make Ray capable of doing the things that Flink and Spark are designed to do.

  1. Streamed data. I can see that we use Ray Dataset to process ML-related training and inference data, mostly in the form of dataframes. However, is there anything blocking Ray Dataset from holding generic data formats?
  2. Unbounded data. Yes, Ray Data is currently designed to work with a finite amount of data. Are any significant changes needed to support a continuous (unbounded) stream of data?
  3. Flink has higher-level API support, such as Table and SQL APIs. With Ray Dataset stream processing, I think it is also possible to build similar APIs on top, right?
  4. Checkpointing, barriers, watermarks, time windows, etc. From what I can see, these are advanced features in Flink. With these features, Flink can do timely data processing and fault tolerance. It looks like Ray is not planning to do these right now.

I assume that all of these are things Ray is not focusing on right now. Am I correct?

Thanks

Wilson

@ericl
Contributor Author

ericl commented Dec 19, 2022

I saw that Ray can do some of the work that other stream processing systems can do. From what I can see in Ray, there is nothing actually blocking Ray from doing such things. Therefore, I wonder what else is needed to make Ray capable of doing the things that Flink and Spark are designed to do.

Indeed, there's not much missing if you look purely at the execution layer. For example, there is an active project to port Apache Beam to run on Ray: https://github.com/ray-project/ray_beam_runner. Beam on Ray would support streaming APIs, checkpointing, barriers, watermarking, etc. as provided by the Beam runtime.

I assume that all of these are things Ray is not focusing on right now. Am I correct?

In short, yes. I would put it this way:

  • Ray Data: as part of Ray AIR, the focus is on solving data problems for large-scale ML workloads (i.e., training ingest and batch inference).
  • Ray Core: the focus is on general-purpose distributed computing (including projects like Beam on Ray, etc.).

ericl added a commit to ray-project/ray that referenced this pull request Dec 22, 2022
This PR adds the basic interfaces and feature flags; split out from https://github.com/ray-project/ray/pull/30903/files

See REP ray-project/enhancements#18 for more details.
@ericl
Contributor Author

ericl commented Jan 6, 2023

@stephanie-wang , any further comments?

@zhe-thoughts merged commit 44a499c into main Jan 13, 2023
ericl added a commit to ray-project/ray that referenced this pull request Jan 17, 2023
…ecutor (#31579)

Initial implementation of ray-project/enhancements#18, dependent on #30903

Streaming execution can be toggled with the following env var: RAY_DATASET_USE_STREAMING_EXECUTOR=0|1.
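
As a quick illustration, here is one way to flip that flag from Python before building a dataset. This is a minimal sketch; the environment variable name comes from the commit message above, and exporting RAY_DATASET_USE_STREAMING_EXECUTOR=1 in the shell before launching the script is equivalent.

```python
import os

# Opt in to the streaming executor; set to "0" to fall back to bulk execution.
# Set this before any dataset is executed (e.g., at the top of the script).
os.environ["RAY_DATASET_USE_STREAMING_EXECUTOR"] = "1"

import ray

# A trivial pipeline just to trigger execution under the chosen executor.
ds = ray.data.range(1000).map(lambda x: x)
print(ds.take(5))
```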
andreapiso pushed a commit to andreapiso/ray that referenced this pull request Jan 22, 2023
…ecutor (ray-project#31579)

Initial implementation of ray-project/enhancements#18, dependent on ray-project#30903

Streaming execution can be toggled with the following env var: RAY_DATASET_USE_STREAMING_EXECUTOR=0|1.

Signed-off-by: Andrea Pisoni <andreapiso@gmail.com>
ericl added a commit to ray-project/ray that referenced this pull request Jan 25, 2023
Initial implementation of ray-project/enhancements#18

Original prototype: https://github.com/ray-project/ray/pull/30222/files

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: jianoaix <iamjianxiao@gmail.com>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
…t#31216)

This PR adds the basic interfaces and feature flags; split out from https://github.com/ray-project/ray/pull/30903/files

See REP ray-project/enhancements#18 for more details.

Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
@jovany-wang deleted the dataset-native-pipelining branch March 2, 2023 15:34
@ericl changed the title from "[REP] Native pipelining support in Ray Datasets" to "[REP] Native streaming support in Ray Datasets" on Mar 29, 2023