Torch support for CUDA and DDP #552

Closed
lukasgd opened this issue Feb 3, 2025 · 4 comments · Fixed by #553

Comments

lukasgd commented Feb 3, 2025

Trying to run some basic examples on a system with 4 GH200 modules, using a container image based on nvcr.io/nvidia/pytorch:25.01-py3 with viztracer 1.0.1 installed on top, fails for me as follows.

For moving tensors to a CUDA device with test_cuda.py

import torch
from viztracer import VizTracer

with VizTracer(log_torch=True) as tracer:
    initial_value = torch.tensor([3.0]).cuda(0)
    print("done!")

I'm getting

/workspace$ python test_cuda.py 
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 330, in _lazy_init
    queued_call()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 1567, in _register_triton_kernels
    torch._TritonLibrary.registerOp(
  File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 2585, in registerOp
    cls.lib.define(full_schema)
  File "/usr/local/lib/python3.12/dist-packages/torch/library.py", line 153, in define
    result = self.m.define(schema, alias_analysis, tuple(tags))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: VizTracer: Unexpected type. Might be an event mismatch.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/test_cuda.py", line 5, in <module>
    initial_value = torch.tensor([3.0]).cuda(0)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 336, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: VizTracer: Unexpected type. Might be an event mismatch.

CUDA call was originally invoked at:

  File "/workspace/test_cuda.py", line 1, in <module>
    import torch
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 2007, in <module>
    _C._initExtension(_manager_path())
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 1585, in <module>
    _lazy_call(_register_triton_kernels)
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 261, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 330, in _lazy_init
    queued_call()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 1567, in _register_triton_kernels
    torch._TritonLibrary.registerOp(
  File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 2585, in registerOp
    cls.lib.define(full_schema)
  File "/usr/local/lib/python3.12/dist-packages/torch/library.py", line 153, in define
    result = self.m.define(schema, alias_analysis, tuple(tags))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Tried to register an operator (triton::_triton_bsr_dense_mm_out(Tensor bsr, Tensor dense, *, Tensor(a!) out) -> Tensor(a!)) with the same name and overload name multiple times. Each overload's schema should only be registered with a single call to def(). Duplicate registration: registered at /dev/null:2578. Original registration: registered at /dev/null:2578

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/test_cuda.py", line 4, in <module>
    with VizTracer(log_torch=True) as tracer:
  File "/usr/local/lib/python3.12/dist-packages/viztracer/viztracer.py", line 170, in __exit__
    self.stop()
  File "/usr/local/lib/python3.12/dist-packages/viztracer/viztracer.py", line 241, in stop
    self.torch_profile.__exit__(None, None, None)
  File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 777, in __exit__
    self.stop()
  File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 793, in stop
    self._transit_action(self.current_action, None)
  File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 836, in _transit_action
    action()
  File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 239, in stop_trace
    self.profiler.__exit__(None, None, None)
  File "/usr/local/lib/python3.12/dist-packages/torch/autograd/profiler.py", line 369, in __exit__
    device_module.synchronize()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 965, in synchronize
    _lazy_init()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 336, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: Tried to register an operator (triton::_triton_bsr_dense_mm_out(Tensor bsr, Tensor dense, *, Tensor(a!) out) -> Tensor(a!)) with the same name and overload name multiple times. Each overload's schema should only be registered with a single call to def(). Duplicate registration: registered at /dev/null:2578. Original registration: registered at /dev/null:2578

CUDA call was originally invoked at:

  File "/workspace/test_cuda.py", line 1, in <module>
    import torch
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 2007, in <module>
    _C._initExtension(_manager_path())
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 1585, in <module>
    _lazy_call(_register_triton_kernels)
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 261, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))

[nid006679:53988:0:53988] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0xf86a280)
==== backtrace (tid:  53988) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2cc) [0x4000c1cd14dc]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x3168c) [0x4000c1cd168c]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x319b8) [0x4000c1cd19b8]
 3  linux-vdso.so.1(__kernel_rt_sigreturn+0) [0x4000239507dc]
 4  [0xf86a280]
=================================
Segmentation fault (core dumped)

and for DDP with test_ddp.py

import torch
import torch.distributed as dist
from viztracer import VizTracer

with VizTracer(log_torch=True) as tracer:
    dist.init_process_group(backend='nccl', init_method='env://')   #  having set DDP env vars
    print("done!")

I'm getting

/workspace$ MASTER_ADDR=$(hostname) MASTER_PORT=29500 RANK=0 WORLD_SIZE=1 LOCAL_RANK=1 LOCAL_WORLD_SIZE=1 python test_ddp.py 
Loading finish                                        
Total Entries: 73                                                               
Use the following command to open the report:
vizviewer /workspace/viztracer.json
Traceback (most recent call last):
  File "/workspace/test_ddp.py", line 6, in <module>
    dist.init_process_group(backend='nccl', init_method='env://')   #  having set DDP env vars
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 94, in wrapper
    with _WaitCounter(f"pytorch.wait_counter.c10d.{func.__name__}").guard():
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: VizTracer: Unexpected type. Might be an event mismatch.

Using only the CPU and no DDP, a simple test runs fine. Does viztracer support CUDA and DDP workloads with PyTorch?
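
For reference, the working CPU-only test looks roughly like this (a minimal sketch, not my exact script):

import torch
from viztracer import VizTracer

with VizTracer(log_torch=True) as tracer:
    x = torch.tensor([3.0]) + 1.0   # stays on the CPU, no DDP
    print("done!")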

@gaogaotiantian
Owner

VizTracer requires a consistent stack, which means all function entries and returns have to match. For example, this pattern is invalid:

  • A call
  • B call
  • A return (missing B return)

I don't believe this is a VizTracer-specific issue; it is probably a violation of the consistent-stack requirement by torch (maybe just the dist module). You can confirm that with a very simple tracing function, along the lines of the sketch below: just log calls and returns and check whether they match. If the stack is not consistent, there's nothing VizTracer can do, because that's basically just illegal data. (gevent does something like this, see #531)
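
A minimal sketch of such a checker (my own illustration, not code from viztracer or torch): push the frame on every entry, pop and compare on every exit, so any unbalanced entry/return shows up as an assertion failure.

import sys

frames = []

def tracefunc(frame, event, arg):
    if event in ("call", "c_call"):
        frames.append(frame)
    elif event in ("return", "c_return", "c_exception"):
        if frames:  # ignore exits of frames entered before tracing started
            assert frames.pop() == frame, "entry/return mismatch"

sys.setprofile(tracefunc)
# ... run the torch code in question here ...
sys.setprofile(None)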

lukasgd commented Feb 4, 2025

Hi @gaogaotiantian, thank you for the reply. Regarding the suggestion in #531: for neither the CUDA nor the DDP example can I produce an assertion failure with the tracefunc when VizTracer is not used (setting sys.setprofile(None) before exit). However, when I reintroduce VizTracer in the with-statement as above, it gives me the error

    assert frames.pop() == frame
           ^^^^^^^^^^^^
IndexError: pop from empty list

in both cases. With DDP it doesn't progress far enough to also produce the wait_counter error, while in the CUDA example I additionally get the segfault shown above.

@gaogaotiantian
Owner

Okay, this is actually related to a CPython bug I fixed before: python/cpython#122029. viztracer copied some code from lsprof, and that code has some issues. I can fix this.

lukasgd commented Feb 5, 2025

Thanks for the quick fix!
