
[Core] Set num_cpus to 1 causes runtime error on WSL2 #56988

@dcalsky

Description


What happened + What you expected to happen

When I don't set num_gpus on the actor, calling torch.cuda.is_available() inside it always aborts the worker (a double free followed by SIGABRT) instead of returning normally.
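For clarity, the only difference between the failing configuration and the one I expect to work is the resource request on the decorator; num_gpus=1 below is my assumption for this single-GPU machine:

@ray.remote()              # no num_gpus -> worker aborts inside torch.cuda.is_available()
class GpuActor: ...

@ray.remote(num_gpus=1)    # explicit GPU request (assumed working configuration)
class GpuActor: ...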

nvcc --version                    
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0
nvidia-smi                       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.04              Driver Version: 576.52         CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:01:00.0  On |                  Off |
|  0%   35C    P8             11W /  450W |    1095MiB /  24564MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Error:

(GpuActor pid=13275) free(): double free detected in tcache 2
(GpuActor pid=13275) *** SIGABRT received at time=1759026077 on cpu 15 ***
(GpuActor pid=13275) PC: @     0x7bcb68c969fc  (unknown)  pthread_kill
(GpuActor pid=13275)     @     0x7bcb68c42520  (unknown)  (unknown)
(GpuActor pid=13275) [2025-09-28 10:21:17,740 E 13275 13275] logging.cc:501: *** SIGABRT received at time=1759026077 on cpu 15 ***
(GpuActor pid=13275) [2025-09-28 10:21:17,740 E 13275 13275] logging.cc:501: PC: @     0x7bcb68c969fc  (unknown)  pthread_kill
(GpuActor pid=13275) [2025-09-28 10:21:17,740 E 13275 13275] logging.cc:501:     @     0x7bcb68c42520  (unknown)  (unknown)
(GpuActor pid=13275) Fatal Python error: Aborted
(GpuActor pid=13275) 
(GpuActor pid=13275) Stack (most recent call first):
(GpuActor pid=13275)   File "/home/mako/venvs/skyrl-train/lib/python3.12/site-packages/torch/cuda/__init__.py", line 177 in is_available
(GpuActor pid=13275)   File "/home/mako/www/SkyRL/skyrl-train/test.py", line 15 in __init__
(GpuActor pid=13275)   File "/home/mako/venvs/skyrl-train/lib/python3.12/site-packages/ray/util/tracing/tracing_helper.py", line 461 in _resume_span
(GpuActor pid=13275)   File "/home/mako/venvs/skyrl-train/lib/python3.12/site-packages/ray/_private/function_manager.py", line 689 in actor_method_executor
(GpuActor pid=13275)   File "/home/mako/venvs/skyrl-train/lib/python3.12/site-packages/ray/_private/worker.py", line 974 in main_loop
(GpuActor pid=13275)   File "/home/mako/venvs/skyrl-train/lib/python3.12/site-packages/ray/_private/workers/default_worker.py", line 321 in <module>

Versions / Dependencies

ray = 2.48.0
OS = WSL2 (Ubuntu 22) on Windows
python = 3.12
installed with uv

Reproduction script

Command:

ray start --head

Script:

import ray

if not ray.is_initialized():
    ray.init()

@ray.remote()  # <- no num_gpus specified here
class GpuActor:
    def __init__(self):
        # The actor will claim its GPU resource upon initialization.
        # Let's confirm it can see it.
        import torch
        self.cuda_available = torch.cuda.is_available()
        if self.cuda_available:
            self.device = torch.device("cuda")
            print(f"Actor {ray.get_runtime_context().get_actor_id()} initialized on GPU: {torch.cuda.get_device_name(0)}")
        else:
            self.device = torch.device("cpu")
            print("Actor failed to initialize on GPU.")

    def is_cuda_available(self):
        return self.cuda_available

# Create an instance of the actor.
# Note: without num_gpus in the decorator, Ray does not reserve a GPU for it.
gpu_actor = GpuActor.remote()

# Call a method on the actor to get the result.
is_available_future = gpu_actor.is_cuda_available.remote()
result = ray.get(is_available_future)

print(f"\nResult from actor: CUDA is available = {result}")

ray.shutdown()
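
If it helps triage, here is a small probe actor (names are mine; it relies only on ray.get_gpu_ids() and the CUDA_VISIBLE_DEVICES environment variable) to report what GPU visibility the unconstrained actor actually gets before torch is touched:

import os
import ray

@ray.remote  # same failing configuration: no num_gpus requested
class VisibilityProbe:
    def probe(self):
        # With no num_gpus requested, Ray should assign no GPUs to this worker,
        # so both values below are expected to be empty.
        return {
            "ray_gpu_ids": ray.get_gpu_ids(),
            "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
        }

ray.init(ignore_reinit_error=True)
print(ray.get(VisibilityProbe.remote().probe.remote()))
ray.shutdown()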

Issue Severity

None
