[Kineto] Add XPU profiler with Kineto plugin support(c10) #5

zejun-chen · 2024-05-23T02:57:55Z

RFC: https://github.com/intel-innersource/frameworks.ai.pytorch.private-gpu/issues/234

register function

Signed-off-by: Chen, Zejun <zejun.chen@intel.com>

c10/core/KinetoPluginAPI.h

c10/core/KinetoPluginAPI.cpp

zejun-chen · 2024-05-24T08:12:57Z

The ATen version of this PR is #6.

Signed-off-by: Chen, Zejun <zejun.chen@intel.com>

@zdevito

…ytorch#139659)

### Motivation

Today, the watchdog only reports that it found a collective timeout:

```
[rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out.
```

While this is useful, it is hard to associate the error with the user's program or library stack.

### This PR

This PR gives the watchdog the ability to report the call-time stack of the collective, so that it is easier to trace the error back to the program's behavior.

The call-time stack is recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito). In `ProcessGroupNCCL`, we only track and report the Python part of the stack, which fits most PyTorch users.

### Demo

[stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09).

```
TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py
```

`TORCH_NCCL_TRACE_BUFFER_SIZE` turns on the Flight Recorder.

Output:

```
[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
#3 foo from /data/users/kw2501/sync_async/repro.py:24
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40

[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
#3 foo from /data/users/kw2501/sync_async/repro.py:26
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
```

From the log above, we can tell that `bar()` and `baz()` are where the two ranks diverge.

Pull Request resolved: pytorch#139659
Approved by: https://github.com/wconstab, https://github.com/fduwjj
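Reading the divergence point off two stack traces by eye gets tedious with more ranks, so here is a minimal sketch of how it could be automated. It is not part of the PR: the helper names `parse_stacks` and `first_divergence` are hypothetical, the regular expressions assume the exact log format shown in the output above, and frames are compared by function name only.

```python
import re

# Both patterns assume the log format shown in the PR description;
# other PyTorch versions may print something different.
HEADER = re.compile(
    r"\[rank(\d+)\]:\[[^\]]*\] Stack trace of the timedout collective operation:"
)
FRAME = re.compile(r"#(\d+) (\S+) from (\S+):(\d+)")

# Log text copied from the demo output in the PR description.
LOG = """\
[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
#3 foo from /data/users/kw2501/sync_async/repro.py:24
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
#3 foo from /data/users/kw2501/sync_async/repro.py:26
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
"""


def parse_stacks(log):
    """Map each rank to its frames, innermost first, as (function, file, line)."""
    stacks = {}
    headers = list(HEADER.finditer(log))
    for i, header in enumerate(headers):
        # A rank's frames run from its header up to the next header (or end of log).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(log)
        body = log[header.end():end]
        stacks[int(header.group(1))] = [
            (m.group(2), m.group(3), int(m.group(4))) for m in FRAME.finditer(body)
        ]
    return stacks


def first_divergence(stack_a, stack_b):
    """Walk both stacks from the outermost frame inward and return the first
    pair of frames whose function names differ, or None if none differ."""
    for frame_a, frame_b in zip(reversed(stack_a), reversed(stack_b)):
        if frame_a[0] != frame_b[0]:
            return frame_a, frame_b
    return None


if __name__ == "__main__":
    stacks = parse_stacks(LOG)
    print(first_divergence(stacks[0], stacks[1]))
```

For the log above, the first differing pair is `bar` (repro.py:15) on rank 0 and `baz` (repro.py:20) on rank 1, matching the conclusion in the PR description. Comparing by function name alone is a simplification: the two `foo` frames already differ in line number (24 vs 26), so a stricter comparison would flag the call sites inside `foo` instead.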

See pytorch#140725 (comment)

Running `torch.mps.synchronize()` after metal kernel resulted in infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`

```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```

Pull Request resolved: pytorch#141296
Approved by: https://github.com/huydhn

zejun-chen force-pushed the zejun/kineto_plugin branch 2 times, most recently from ee0ba78 to 899fc91 Compare May 24, 2024 05:18

[XPU][Kineto] register XPU kineto profiler and implement

4f480ff

register function

Signed-off-by: Chen, Zejun <zejun.chen@intel.com>

zejun-chen force-pushed the zejun/kineto_plugin branch from 899fc91 to 4f480ff Compare May 24, 2024 06:05

zejun-chen commented May 24, 2024

c10/core/KinetoPluginAPI.h (outdated)

zejun-chen commented May 24, 2024

c10/core/KinetoPluginAPI.cpp (outdated)

zejun-chen commented May 24, 2024

c10/core/KinetoPluginAPI.cpp (outdated)

zejun-chen added 2 commits May 24, 2024 17:03

remove unregister method

32fa39e

Signed-off-by: Chen, Zejun <zejun.chen@intel.com>

fix build error

53faf8c

Signed-off-by: Chen, Zejun <zejun.chen@intel.com>

zejun-chen changed the title ~~[Kineto] Add XPU profiler with Kineto plugin support~~ [Kineto] Add XPU profiler with Kineto plugin support(c10) May 27, 2024
