[Serve] ServeHandles fail if GCS crashes before first request #29539

@shrekris-anyscale

Description

What happened + What you expected to happen

If the GCS crashes before a ServeHandle's first request, the ServeHandle cannot process any requests, even if there are replicas alive. However, if it crashes after the ServeHandle's first request, the ServeHandle can process requests as long as there are replicas alive.

I would expect the ServeHandle to fulfill any requests even if the GCS crashes before its first request.

Note: this issue has fairly minor impact since it's unlikely that the GCS crashes before the ServeHandle processes even one request.

Versions / Dependencies

Ray on the latest master.

Reproduction script

Repro script:

# File name: repro.py

import os
import signal
import psutil

import ray
from ray import serve

gcs_dead = False

def kill_gcs():
    print("Killing gcs...")
    gcs_pid = None
    for proc in psutil.process_iter():
        if "gcs_server" in proc.name():
            if gcs_pid is not None:
                raise ValueError("Got two pids!")
            gcs_pid = proc.pid

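    # Fail loudly if no GCS process was found, rather than passing None to os.kill.
    if gcs_pid is None:
        raise RuntimeError("Could not find a gcs_server process.")
    os.kill(gcs_pid, signal.SIGTERM)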
    global gcs_dead
    gcs_dead = True
    print(
        "Killed gcs! If you did this before any requests succeeded, the "
        "first request afterwards should hang."
    )

ray.init()

@serve.deployment
class C():

    def __call__(self, *args):
        return os.getpid()

graph = C.bind()
serve.run(graph)

handle = serve.get_deployment("C").get_handle()

# Comment this out if you want to kill GCS later:
kill_gcs()

ref = handle.remote()
print("Issued request 1...")

result = ray.get(ref)
print(f"Request 1 succeeded. Got: {result}")

if not gcs_dead:
    kill_gcs()

ref = handle.remote()
print("Issued request 2...")

result = ray.get(ref)
print(f"Request 2 succeeded. Got: {result}")

You can run the file like a regular Python script:

$ python repro.py

The script makes two requests to a deployment with one replica, and it calls kill_gcs() in two places. If you leave the first kill_gcs() call uncommented, the GCS dies before the handle's first request and the script hangs on that request. If you comment it out, the first request succeeds, the second kill_gcs() call then kills the GCS, and the second request still succeeds.
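Since the failure mode is a hang rather than an exception, it can be handy to run the repro under a time limit so the outcome is reported instead of blocking the terminal. Below is a minimal sketch using only the Python standard library. The helper name `run_with_timeout` and the 60-second budget are illustrative assumptions, not part of Ray or of the original script.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome as 'completed', 'failed', or 'hung'.

    Hypothetical helper for illustration only. subprocess.run kills the
    child process when the timeout expires and then raises TimeoutExpired.
    """
    try:
        # Output is inherited from the parent rather than captured, so the
        # repro's print statements still show up in the terminal.
        proc = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hung"
    return "completed" if proc.returncode == 0 else "failed"


# Example (assumes repro.py is in the current directory; 60 s is arbitrary):
# print(run_with_timeout([sys.executable, "repro.py"], timeout_s=60))
```

With the first kill_gcs() call left in, this would be expected to report "hung"; with it commented out, "completed". Only the Python driver is killed on timeout, so Ray processes started by the script may be left running and may need to be cleaned up separately, for example with `ray stop`.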

Issue Severity

Low: It annoys or frustrates me.

Metadata

Assignees

No one assigned

    Labels

    P3: Issue moderate in impact or severity
    bug: Something that is supposed to be working; but isn't
    pending-cleanup: This issue is pending cleanup. It will be removed in 2 weeks after being assigned.
    serve: Ray Serve Related Issue
