Torch support for CUDA and DDP #552
Comments
VizTracer requires a consistent stack, which means all function entries and returns have to match. For example, a pattern where a function returns without a matching entry is invalid.
I don't believe this is a VizTracer-specific issue; this is probably a violation of the consistent-stack requirement by torch (maybe just the dist module). You can confirm that with a very simple tracing function (just log calls and returns and see whether they match). If the stack is not consistent, there's nothing VizTracer can do, because that's basically just illegal data. (gevent does something like this, see #531)
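A minimal version of such a tracing function (my sketch, not code from the maintainer or from VizTracer) can be built on sys.setprofile: push each Python-level call frame onto a stack and check that every return event matches the most recent unmatched call.

```python
import sys

def check_stack_consistency(func):
    """Run func under a profile hook and report every 'return' event
    that does not match the most recent unmatched 'call' event.
    An empty result means the observed stack was consistent."""
    stack = []        # frames of entered-but-not-yet-returned calls
    mismatches = []   # names of functions whose return had no matching entry

    def tracer(frame, event, arg):
        # Only Python-level events are tracked; C-level events
        # (c_call/c_return) are ignored to keep the sketch simple.
        if event == "call":
            stack.append(frame)
        elif event == "return":
            if stack and stack[-1] is frame:
                stack.pop()
            else:
                mismatches.append(frame.f_code.co_name)

    sys.setprofile(tracer)
    try:
        func()
    finally:
        sys.setprofile(None)
    return mismatches
```

For ordinary code this returns an empty list; running the failing torch workload inside `func` and getting mismatches back would support the illegal-data hypothesis.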
Hi @gaogaotiantian, thank you for the reply. Regarding the suggestion in #531: for neither the CUDA nor the DDP example can I generate an assertion failure with the tracefunc when not using VizTracer (setting … in both cases). In DDP it doesn't progress far enough to also produce the …
Okay, this is actually related to a CPython bug I fixed before: python/cpython#122029. VizTracer copied some code from lsprof, and that code has some issues. I can fix this.
Thanks for the quick fix!
Trying to run some basic examples on a system with 4 GH200 modules, using a container image based on nvcr.io/nvidia/pytorch:25.01-py3 with viztracer 1.0.1 installed on top, fails for me as follows. For moving tensors to a CUDA device with test_cuda.py I'm getting …, and for DDP with test_ddp.py it is …. Using only the CPU and no DDP, a simple test runs fine. Does VizTracer support CUDA and DDP workloads with PyTorch?
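The scripts themselves are not reproduced in the thread. A hypothetical reconstruction of a minimal test_cuda.py along the lines the report describes (moving a tensor to a CUDA device, to be run under VizTracer) might look like the sketch below; the tensor shape and the function name are my assumptions, not the reporter's actual code:

```python
# Hypothetical minimal test_cuda.py; the actual script from the
# report is not shown in the thread.
try:
    import torch  # assumes a PyTorch environment such as the NGC container
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:  # lets the sketch degrade gracefully without torch
    torch = None
    HAVE_CUDA = False

def move_tensor():
    """Create a tensor and move it to the first CUDA device, if present.
    Returns the resulting device type ('cpu' or 'cuda'), or 'no-torch'."""
    if torch is None:
        return "no-torch"
    x = torch.randn(256, 256)
    return x.to("cuda:0").device.type if HAVE_CUDA else x.device.type

if __name__ == "__main__":
    # Typically launched as:  viztracer test_cuda.py
    print(move_tensor())
```

Running this under the viztracer CLI would exercise the same CUDA code path the report points at; the DDP case would additionally need torch.distributed process-group setup.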