[Roadmap] Ray Q3 2025 #54923

@cszhu

Hello everyone! 👋 I'm excited to share what we have planned for Ray in Q3 2025. I will try to keep this updated as features get merged and rolled out.

Goal: Deliver foundational reliability, performance, and developer experience (DX) improvements across Ray Core, Data, Train, LLM, Serve, RLlib, Observability, Technical Content, and KubeRay.

Ray Core

Reliability & Fault Tolerance

  • Improve system stability under node and network failures
    • Make RPCs tolerant to transient errors
  • Add robust support for preemptible instances

Scheduling & Performance

Developer Experience

Ecosystem Integrations

Ray Data

Reliability

  • Ensure workloads complete successfully despite cluster failures

Performance

  • Enhance training ingest pipelines with advanced sampling and caching support (see the ingest sketch below)
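
A minimal sketch of a Ray Data training-ingest pipeline of the kind this item targets; the S3 path, batch size, and preprocessing function are placeholders, not part of the roadmap:

```python
import ray


def preprocess(batch):
    # Placeholder: normalize features, drop columns, tokenize, etc.
    return batch


ds = (
    ray.data.read_parquet("s3://my-bucket/training-data/")  # placeholder path
    .map_batches(preprocess)
    .random_shuffle()  # global shuffle of the dataset before iteration
)

# Stream batches into the training loop without materializing the full dataset.
for batch in ds.iter_batches(batch_size=1024):
    pass  # feed `batch` to the trainer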

Connectors

Usability

  • Schema UDFs
  • Enhanced internal query planning

Ray Train

API

  • Finalize the Train v2 API (a minimal usage sketch follows)
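
A minimal sketch of the Train v2-style reporting pattern (per current Ray docs); the worker count, epoch loop, and checkpoint contents are placeholders:

```python
import os
import tempfile

from ray import train
from ray.train import Checkpoint, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    for epoch in range(config["epochs"]):
        # ... one epoch of training goes here ...
        with tempfile.TemporaryDirectory() as tmpdir:
            # Placeholder checkpoint file; a real loop would save model state.
            with open(os.path.join(tmpdir, "model.txt"), "w") as f:
                f.write("weights-placeholder")
            # Report metrics and attach a checkpoint for this epoch.
            train.report(
                {"epoch": epoch, "loss": 0.0},  # placeholder metrics
                checkpoint=Checkpoint.from_directory(tmpdir),
            )


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 2},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
```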

Performance

  • Implement asynchronous checkpointing

LLM

Goal: Run large models (e.g., DeepSeek) at scale via vLLM on Ray Serve (a minimal serving sketch follows this list):

  • Prefill disaggregation
  • Large-scale data parallelism (DP)
  • Custom request routing
  • Elastic expert parallelism
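
A minimal sketch, assuming the ray.serve.llm module available in recent Ray releases, of serving a model with vLLM behind Ray Serve. The model ID, model source, replica bounds, and tensor_parallel_size are placeholders:

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",                       # name exposed to clients
        "model_source": "Qwen/Qwen2.5-0.5B-Instruct",  # placeholder HF model
    },
    deployment_config={
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 2},
    },
    engine_kwargs={"tensor_parallel_size": 1},  # passed through to vLLM
)

# Builds an OpenAI-compatible app (/v1/chat/completions) and runs it on Serve.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```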

Performance & Efficiency

  • Implement prefill disaggregation to optimize performance for large-context models
  • Develop an intelligent, KV cache-aware router with a pluggable architecture
  • Implement data-parallel (DP) attention within Ray Serve

Operations

  • Publish updated performance benchmarks

Ecosystem

  • Support SkyRL for reinforcement learning from human feedback (RLHF) workloads

Ray Serve

Serving Flexibility

  • Custom autoscaling and routing patterns (an autoscaling config sketch follows this list)
  • Async inference support
  • MCP server patterns
  • Integrate label-based scheduling
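
A minimal sketch of today's deployment-level autoscaling configuration in Ray Serve, as a baseline for the custom patterns above; the replica bounds and target load are placeholder values:

```python
from ray import serve


@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_ongoing_requests": 5,  # scale on in-flight requests per replica
    },
    max_ongoing_requests=10,  # per-replica concurrency cap
)
class Echo:
    async def __call__(self, request):
        # Echo back the request body (placeholder handler).
        return await request.body()


serve.run(Echo.bind())
```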

Observability

  • Enhanced tracing support

RLlib

  • RLlib V2 stack general availability (GA)
  • Algorithm composability enhancements

Observability

API Release

  • Public launch of unified event export API

Optimization

  • Refactor internals to leverage new export API

Technical Content

  • New technical templates
  • More examples & deep dives

KubeRay

Upgrades

  • Productionize the incremental upgrade feature for seamless cluster updates

Hardware Support

  • Streamline support for diverse accelerators, including multiple GPU types, Dynamic Resource Allocation (DRA), and MIG

Autoscaling

  • Continue to improve the functionality and reliability of Autoscaler V2

We love hearing from the community! If there is a feature you'd like to see in a future Ray release, let us know by filing a feature request or commenting here. Thank you for supporting Ray!
