What happened + What you expected to happen
When I don't set num_gpus in the @ray.remote decorator, calling torch.cuda.is_available() inside the actor always crashes the worker process with a double free / SIGABRT (see the error log below).
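For comparison, a minimal sketch with the GPU requested explicitly (assuming a single-GPU node; with num_gpus set the worker does not abort):

import ray

ray.init()

@ray.remote(num_gpus=1)  # reserve one GPU so Ray keeps it visible to the worker
class GpuActor:
    def __init__(self):
        import torch
        # With the GPU reserved, this call is expected to return True
        # instead of aborting the worker.
        self.cuda_available = torch.cuda.is_available()

    def is_cuda_available(self):
        return self.cuda_available

actor = GpuActor.remote()
print(ray.get(actor.is_cuda_available.remote()))
ray.shutdown()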
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.04 Driver Version: 576.52 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 On | Off |
| 0% 35C P8 11W / 450W | 1095MiB / 24564MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Error:
(GpuActor pid=13275) free(): double free detected in tcache 2
(GpuActor pid=13275) *** SIGABRT received at time=1759026077 on cpu 15 ***
(GpuActor pid=13275) PC: @ 0x7bcb68c969fc (unknown) pthread_kill
(GpuActor pid=13275) @ 0x7bcb68c42520 (unknown) (unknown)
(GpuActor pid=13275) [2025-09-28 10:21:17,740 E 13275 13275] logging.cc:501: *** SIGABRT received at time=1759026077 on cpu 15 ***
(GpuActor pid=13275) [2025-09-28 10:21:17,740 E 13275 13275] logging.cc:501: PC: @ 0x7bcb68c969fc (unknown) pthread_kill
(GpuActor pid=13275) [2025-09-28 10:21:17,740 E 13275 13275] logging.cc:501: @ 0x7bcb68c42520 (unknown) (unknown)
(GpuActor pid=13275) Fatal Python error: Aborted
(GpuActor pid=13275)
(GpuActor pid=13275) Stack (most recent call first):
(GpuActor pid=13275) File "/home/mako/venvs/skyrl-train/lib/python3.12/site-packages/torch/cuda/__init__.py", line 177 in is_available
(GpuActor pid=13275) File "/home/mako/www/SkyRL/skyrl-train/test.py", line 15 in __init__
(GpuActor pid=13275) File "/home/mako/venvs/skyrl-train/lib/python3.12/site-packages/ray/util/tracing/tracing_helper.py", line 461 in _resume_span
(GpuActor pid=13275) File "/home/mako/venvs/skyrl-train/lib/python3.12/site-packages/ray/_private/function_manager.py", line 689 in actor_method_executor
(GpuActor pid=13275) File "/home/mako/venvs/skyrl-train/lib/python3.12/site-packages/ray/_private/worker.py", line 974 in main_loop
(GpuActor pid=13275) File "/home/mako/venvs/skyrl-train/lib/python3.12/site-packages/ray/_private/workers/default_worker.py", line 321 in <module>
Versions / Dependencies
ray==2.48.0
OS: WSL2 on Windows (Ubuntu 22)
Python 3.12
Package manager: uv
Reproduction script
First start a local head node:
ray start --head
Then run the following script:
import ray

if not ray.is_initialized():
    ray.init()

@ray.remote()  # <- num_gpus is not set here
class GpuActor:
    def __init__(self):
        # The actor will claim its GPU resource upon initialization.
        # Let's confirm it can see it.
        import torch
        self.cuda_available = torch.cuda.is_available()
        if self.cuda_available:
            self.device = torch.device("cuda")
            print(f"Actor {ray.get_runtime_context().get_actor_id()} initialized on GPU: {torch.cuda.get_device_name(0)}")
        else:
            self.device = torch.device("cpu")
            print("Actor failed to initialize on GPU.")

    def is_cuda_available(self):
        return self.cuda_available

# Create an instance of the actor.
# Ray will schedule this on a worker with a GPU.
gpu_actor = GpuActor.remote()

# Call a method on the actor to get the result.
is_available_future = gpu_actor.is_cuda_available.remote()
result = ray.get(is_available_future)
print(f"\nResult from actor: CUDA is available = {result}")

ray.shutdown()
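For reference, a diagnostic sketch that prints what the actor's worker process sees without importing torch (my assumption being that Ray controls GPU visibility for zero-GPU workers through ray.get_gpu_ids() and the CUDA_VISIBLE_DEVICES environment variable):

import os
import ray

ray.init()

@ray.remote  # no num_gpus, same as the failing case above
class EnvProbe:
    def probe(self):
        # Report the GPU assignment and environment the worker sees
        # before any CUDA initialization happens.
        return {
            "ray_gpu_ids": ray.get_gpu_ids(),
            "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        }

print(ray.get(EnvProbe.remote().probe.remote()))
ray.shutdown()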
Issue Severity
None