
Conversation

Contributor

@ericl commented Dec 14, 2022

This REP improves the usability and performance of Datasets for ML ingest and training workloads via native pipelining support.

Signed-off-by: Eric Liang <ekhliang@gmail.com>

This REP proposes improving the usability and performance of Datasets for ML ingest and training workloads via native pipelining support.

Signed-off-by: Eric Liang <ekhliang@gmail.com>
@ericl changed the title from "[WIP] Native pipelining support in Ray Datasets" to "[REP] Native pipelining support in Ray Datasets" on Dec 14, 2022
Signed-off-by: Eric Liang <ekhliang@gmail.com>
+ Map
+ Sort
+ RandomShuffle
+ ...
Contributor


Would windowed all-to-all operations be supported here?

For example, a random shuffle that doesn't need to operate globally; shuffling within a certain window size would be sufficient.

Contributor Author


Yes, we should be able to support this. We could add a special Window physical operator that buffers groups of blocks and releases them. We would also have to extend the operator interface slightly to introduce the notion of resetting the state for each window.
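
For illustration, here is a minimal, hypothetical sketch of such an operator (this is not the actual Ray Datasets operator interface; the class and method names are invented for the example): it buffers blocks until a window fills, applies an all-to-all transform such as a shuffle within that window, and resets its state between windows.

```python
import random
from typing import Any, Callable, List


class WindowOperator:
    """Hypothetical physical operator: buffers up to `window_size` blocks,
    applies an all-to-all transform to each full window, then resets."""

    def __init__(self, window_size: int,
                 transform: Callable[[List[Any]], List[Any]]):
        self.window_size = window_size
        self.transform = transform
        self._buffer: List[Any] = []

    def add_input(self, block: Any) -> List[Any]:
        # Accept one block; emit a transformed window once the buffer fills.
        self._buffer.append(block)
        if len(self._buffer) >= self.window_size:
            return self._flush()
        return []

    def inputs_done(self) -> List[Any]:
        # Flush any partially filled final window.
        return self._flush() if self._buffer else []

    def _flush(self) -> List[Any]:
        out = self.transform(self._buffer)
        self._buffer = []  # reset per-window state
        return out


# Example: a windowed (non-global) random shuffle over 10 blocks, window of 4.
op = WindowOperator(window_size=4,
                    transform=lambda blocks: random.sample(blocks, len(blocks)))
shuffled = []
for block in range(10):
    shuffled.extend(op.add_input(block))
shuffled.extend(op.inputs_done())
print(shuffled)
```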

Signed-off-by: Eric Liang <ekhliang@gmail.com>
@wilsonwang371
Contributor

Hi Eric, can you share the future plans for this? What is the current major gap between Ray pipelining support and Flink?

Thanks

@ericl
Contributor Author

ericl commented Dec 19, 2022

@wilsonwang371 sure. I'm not sure Flink is the right class of system to compare against, but in general the goals of Ray pipelining are to:

  • Minimize memory usage for streaming ML preprocessing and ingest into ML training jobs
  • Process large datasets for batch inference without needing to spill to disk

Big data streaming systems are designed for reliable data processing on infinite data streams. In contrast, Ray Data is designed for high-throughput operations (e.g., 100+ GB/s) on finite data and heterogeneous CPU/GPU pipelines.

Hence there isn't a gap per se; this REP is mostly about improving Ray Data's performance for these key ML use cases. The closest competing systems would be tf.data service and DPP from Meta (https://engineering.fb.com/2022/09/19/ml-applications/data-ingestion-machine-learning-training-meta/).

@wilsonwang371
Contributor

Big data streaming systems are designed for reliable data processing on infinite data streams. In contrast, Ray Data is designed for high-throughput operations (e.g., 100+ GB/s) on finite data and heterogeneous CPU/GPU pipelines.

Hi Eric,

Thanks for your answer. Agreed, Flink might not be the right system to compare Ray with.

I saw that Ray can do some of the work that other stream processing systems can do. From what I can see in Ray, there is nothing actually blocking Ray from doing such things. Therefore, I wonder what else is needed to make Ray capable of doing the things that Flink and Spark are designed to do.

  1. Streamed data. I can see that we use Ray Dataset to process ML-related training and inference data, mostly in the form of dataframes. However, is there anything blocking Ray Dataset from holding generic data formats?
  2. Unbounded data. Yes, Ray Data is currently designed to work with a finite amount of data. Are any significant changes needed to support a continuous (unbounded) stream of data?
  3. Flink has higher-level API support, such as Table and SQL APIs. With Ray Dataset stream processing, I think it is also possible to build similar APIs on top, right?
  4. Checkpointing, barriers, watermarks, time windows, etc. From what I can see, these are advanced features in Flink. With these features, Flink can do timely data processing and fault tolerance. It looks like Ray is not planning to do these right now.

I assume that all of these are things Ray is not focusing on right now. Am I correct?

Thanks

Wilson

@ericl
Contributor Author

ericl commented Dec 19, 2022

I saw that Ray can do some of the work that other stream processing systems can do. From what I can see in Ray, there is nothing actually blocking Ray from doing such things. Therefore, I wonder what else is needed to make Ray capable of doing the things that Flink and Spark are designed to do.

Indeed, there's not much missing if you look purely at the execution layer. For example, there is an active project to port Apache Beam to run on Ray: https://github.com/ray-project/ray_beam_runner. Beam on Ray would support streaming APIs, checkpointing, barriers, watermarking, etc. as provided by the Beam runtime.

I assume that all of these are things Ray is not focusing on right now. Am I correct?

In short, yes. I would put it this way:

  • Ray Data: as part of Ray AIR, the focus is on solving data problems for large-scale ML workloads (i.e., training ingest and batch inference).
  • Ray Core: the focus is on general-purpose distributed computing (including projects like Beam on Ray, etc.).

ericl added a commit to ray-project/ray that referenced this pull request Dec 22, 2022
This PR adds the basic interfaces and feature flags; split out from https://github.com/ray-project/ray/pull/30903/files

See REP ray-project/enhancements#18 for more details.
@ericl
Contributor Author

ericl commented Jan 6, 2023

@stephanie-wang , any further comments?

@zhe-thoughts merged commit 44a499c into main Jan 13, 2023
ericl added a commit to ray-project/ray that referenced this pull request Jan 17, 2023
…ecutor (#31579)

Initial implementation of ray-project/enhancements#18, dependent on #30903

Streaming execution can be toggled with the following env var: RAY_DATASET_USE_STREAMING_EXECUTOR=0|1.
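
As a quick illustration, here is one way to flip that flag from Python before building a dataset. This is a minimal sketch; the environment variable name comes from the commit message above, and exporting RAY_DATASET_USE_STREAMING_EXECUTOR=1 in the shell before launching the script is equivalent.

```python
import os

# Opt in to the streaming executor; set to "0" to fall back to bulk execution.
# Set this before any dataset is executed (e.g., at the top of the script).
os.environ["RAY_DATASET_USE_STREAMING_EXECUTOR"] = "1"

import ray

# A trivial pipeline just to trigger execution under the chosen executor.
ds = ray.data.range(1000).map(lambda x: x)
print(ds.take(5))
```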
andreapiso pushed a commit to andreapiso/ray that referenced this pull request Jan 22, 2023
…ecutor (ray-project#31579)

Initial implementation of ray-project/enhancements#18, dependent on ray-project#30903

Streaming execution can be toggled with the following env var: RAY_DATASET_USE_STREAMING_EXECUTOR=0|1.

Signed-off-by: Andrea Pisoni <andreapiso@gmail.com>
ericl added a commit to ray-project/ray that referenced this pull request Jan 25, 2023
Initial implementation of ray-project/enhancements#18

Original prototype: https://github.com/ray-project/ray/pull/30222/files

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: jianoaix <iamjianxiao@gmail.com>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
…t#31216)

This PR adds the basic interfaces and feature flags; split out from https://github.com/ray-project/ray/pull/30903/files

See REP ray-project/enhancements#18 for more details.

Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
@jovany-wang deleted the dataset-native-pipelining branch March 2, 2023 15:34
@ericl changed the title from "[REP] Native pipelining support in Ray Datasets" to "[REP] Native streaming support in Ray Datasets" on Mar 29, 2023