[Roadmap] Ray Q3 2025 #54923

@cszhu

Hello everyone! 👋 I'm excited to share what we have planned for Ray in Q3 2025. I will try to keep this updated as features get merged and rolled out.

Goal: Deliver foundational reliability, performance, and developer experience (DX) improvements across Ray Core, Data, Train, LLM, Serve, RLlib, Observability, Technical Content, and KubeRay.

Ray Core

Reliability & Fault Tolerance

  • Improve system stability under node and network failures
    • Make RPCs tolerant to transient errors
  • Add robust support for preemptible instances

Scheduling & Performance

Developer Experience

Ecosystem Integrations

Ray Data

Reliability

  • Ensure workloads complete successfully despite cluster failures

Performance

  • Enhance training ingest pipelines with advanced sampling and caching support (see the ingest sketch below)
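
A minimal sketch of a Ray Data training-ingest pipeline of the kind this item targets; the S3 path, batch size, and preprocessing function are placeholders, not part of the roadmap:

```python
import ray


def preprocess(batch):
    # Placeholder: normalize features, drop columns, tokenize, etc.
    return batch


ds = (
    ray.data.read_parquet("s3://my-bucket/training-data/")  # placeholder path
    .map_batches(preprocess)
    .random_shuffle()  # global shuffle of the dataset before iteration
)

# Stream batches into the training loop without materializing the full dataset.
for batch in ds.iter_batches(batch_size=1024):
    pass  # feed `batch` to the trainer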

Connectors

Usability

  • Schema UDFs
  • Enhanced internal query planning

Ray Train

API

  • Finalize the Train v2 API (a minimal usage sketch follows)
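
A minimal sketch of the Train v2-style reporting pattern (per current Ray docs); the worker count, epoch loop, and checkpoint contents are placeholders:

```python
import os
import tempfile

from ray import train
from ray.train import Checkpoint, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    for epoch in range(config["epochs"]):
        # ... one epoch of training goes here ...
        with tempfile.TemporaryDirectory() as tmpdir:
            # Placeholder checkpoint file; a real loop would save model state.
            with open(os.path.join(tmpdir, "model.txt"), "w") as f:
                f.write("weights-placeholder")
            # Report metrics and attach a checkpoint for this epoch.
            train.report(
                {"epoch": epoch, "loss": 0.0},  # placeholder metrics
                checkpoint=Checkpoint.from_directory(tmpdir),
            )


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 2},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
```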

Performance

  • Implement asynchronous checkpointing

LLM

Goal: Run large models (e.g., DeepSeek) at scale via vLLM on Ray Serve (a minimal serving sketch follows this list):

  • Prefill disaggregation
  • Large-scale data parallelism (DP)
  • Custom request routing
  • Elastic expert parallelism
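
A minimal sketch, assuming the ray.serve.llm module available in recent Ray releases, of serving a model with vLLM behind Ray Serve. The model ID, model source, replica bounds, and tensor_parallel_size are placeholders:

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",                       # name exposed to clients
        "model_source": "Qwen/Qwen2.5-0.5B-Instruct",  # placeholder HF model
    },
    deployment_config={
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 2},
    },
    engine_kwargs={"tensor_parallel_size": 1},  # passed through to vLLM
)

# Builds an OpenAI-compatible app (/v1/chat/completions) and runs it on Serve.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```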

Performance & Efficiency

  • Implement prefill disaggregation to optimize performance for large-context models
  • Develop an intelligent, KV cache-aware router with a pluggable architecture
  • Implement data-parallel (DP) attention within Ray Serve

Operations

  • Publish updated performance benchmarks

Ecosystem

  • Support SkyRL for reinforcement learning from human feedback (RLHF) workloads

Ray Serve

Serving Flexibility

  • Custom autoscaling and routing patterns (an autoscaling config sketch follows this list)
  • Async inference support
  • MCP server patterns
  • Integrate label-based scheduling
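
A minimal sketch of today's deployment-level autoscaling configuration in Ray Serve, as a baseline for the custom patterns above; the replica bounds and target load are placeholder values:

```python
from ray import serve


@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_ongoing_requests": 5,  # scale on in-flight requests per replica
    },
    max_ongoing_requests=10,  # per-replica concurrency cap
)
class Echo:
    async def __call__(self, request):
        # Echo back the request body (placeholder handler).
        return await request.body()


serve.run(Echo.bind())
```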

Observability

  • Enhanced tracing support

RLlib

  • RLlib V2 stack general availability (GA)
  • Algorithm composability enhancements

Observability

API Release

  • Public launch of unified event export API

Optimization

  • Refactor internals to leverage new export API

Technical Content

  • New technical templates
  • More examples & deep dives

KubeRay

Upgrades

  • Productionize the incremental upgrade feature for seamless cluster updates

Hardware Support

  • Streamline support for diverse accelerators, including multiple GPU types, Dynamic Resource Allocation (DRA), and MIG

Autoscaling

  • Continue to improve the functionality and reliability of Autoscaler V2

We love hearing from the community! If there is a feature you'd like to see in a future Ray release, let us know by filing a feature request or commenting here. Thank you for supporting Ray!
