Introduction to Holistic Trace Analysis
========================================

**Author:** `Anupam Bhatnagar <https://github.com/anupambhatnagar>`_

In this tutorial, we demonstrate how to use Holistic Trace Analysis (HTA) to
analyze traces from a distributed training job. To get started, follow the steps
below.

Installing HTA
~~~~~~~~~~~~~~

We recommend using a Conda environment to install HTA. To install Anaconda, see
`the official Anaconda documentation <https://docs.anaconda.com/anaconda/install/index.html>`_.

1. Install HTA using pip:

   .. code-block:: python

      pip install HolisticTraceAnalysis

2. (Optional and recommended) Set up a Conda environment:

   .. code-block:: python

      # create the environment env_name
      conda create -n env_name

      # activate the environment
      conda activate env_name

      # when you are done, deactivate the environment by running: conda deactivate

Getting Started
~~~~~~~~~~~~~~~

Launch a Jupyter notebook and set the ``trace_dir`` variable to the location of the traces.

.. code-block:: python

   from hta.trace_analysis import TraceAnalysis
   trace_dir = "/path/to/folder/with/traces"
   analyzer = TraceAnalysis(trace_dir=trace_dir)

Temporal Breakdown
------------------

To effectively utilize the GPUs, it is crucial to understand how they are spending
time for a specific job. Are they primarily engaged in computation, communication,
memory events, or are they idle? The temporal breakdown feature provides a detailed
analysis of the time spent in the following three categories:

* Idle time - GPU is idle.
* Compute time - GPU is being used for matrix multiplications or vector operations.
* Non-compute time - GPU is being used for communication or memory events.

To achieve high training efficiency, the code should maximize compute time and
minimize idle time and non-compute time. The following function generates a
dataframe that provides a detailed breakdown of the temporal usage for each rank.

.. code-block:: python

   analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")
   time_spent_df = analyzer.get_temporal_breakdown()

.. image:: ../_static/img/hta/temporal_breakdown_df.png

When the ``visualize`` argument is set to ``True`` in the `get_temporal_breakdown
<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_temporal_breakdown>`_
function, it also generates a bar graph representing the breakdown by rank.
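
As a minimal sketch of using the argument described above (``visualize`` is named
in the API documentation linked above; its exact default is not shown in this
excerpt):

.. code-block:: python

   # return the per-rank dataframe and also render the bar graph
   time_spent_df = analyzer.get_temporal_breakdown(visualize=True)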

Idle Time Breakdown
-------------------

Gaining insight into the amount of time the GPU spends idle and the
reasons behind it can help guide optimization strategies. A GPU is
considered idle when no kernel is running on it. We have developed an
algorithm to categorize the `Idle` time into three distinct categories:

* **Host wait:** This refers to the idle time on the GPU that is caused by
  the CPU not enqueuing kernels quickly enough to keep the GPU fully utilized.
  These types of inefficiencies can be addressed by examining the CPU
  operators that are contributing to the slowdown, increasing the batch
  size, and applying operator fusion.

* **Kernel wait:** This refers to the brief overhead associated with launching
  consecutive kernels on the GPU. The idle time attributed to this category
  can be minimized by using CUDA Graph optimizations.

* **Other wait:** This category includes idle time that cannot currently
  be attributed due to insufficient information. The likely causes include
  synchronization among CUDA streams using CUDA events and delays in launching
  kernels.

The host wait time can be interpreted as the time when the GPU is stalling due
to the CPU. To attribute the idle time as kernel wait, we use the following

.. tip::

   By default, the idle time breakdown presents the percentage of each of the
   idle time categories. Setting the ``visualize_pctg`` argument to ``False``
   renders the graph with absolute time on the y-axis.
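
The call that produces the idle time breakdown is not shown in this excerpt; as an
assumed minimal usage with the ``analyzer`` object created earlier (argument
defaults may differ across HTA versions):

.. code-block:: python

   # compute the idle time breakdown; visualize_pctg=False plots absolute
   # time on the y-axis, as described in the tip above
   idle_time_df = analyzer.get_idle_time_breakdown(visualize_pctg=False)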

Kernel Breakdown
----------------

The kernel breakdown feature breaks down the time spent for each kernel type,
such as communication (COMM), computation (COMP), and memory (MEM), across all
ranks and presents the proportion of time spent in each category. Here is the
percentage of time spent in each category as a pie chart:

.. image:: ../_static/img/hta/kernel_type_breakdown.png
   :align: center

The kernel breakdown can be calculated as follows:

.. code-block:: python

   kernel_type_metrics_df, kernel_metrics_df = analyzer.get_gpu_kernel_breakdown()

The first dataframe returned by the function contains the raw values used to
generate the pie chart.

Kernel Duration Distribution
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The second dataframe returned by `get_gpu_kernel_breakdown
<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_gpu_kernel_breakdown>`_
contains duration summary statistics for each kernel. In particular, this
includes the count, min, max, average, standard deviation, sum, and kernel type
for each kernel on each rank.

.. image:: ../_static/img/hta/kernel_metrics_df.png

.. image:: ../_static/img/hta/pie_charts.png

.. tip::

   All images are generated using plotly. Hovering on the graph shows the
   mode bar on the top right, which allows the user to zoom, pan, select, and
   download the graph.

The pie charts above show the top 5 computation, communication, and memory
kernels. Similar pie charts are generated for each rank. The pie charts can be
configured to show the top k kernels using the ``num_kernels`` argument passed
to the `get_gpu_kernel_breakdown` function. Additionally, the

Communication Computation Overlap
---------------------------------

In distributed training, a significant amount of time is spent in communication
and synchronization events between GPUs. To achieve high GPU efficiency (such as
TFLOPS/GPU), it is crucial to keep the GPU oversubscribed with computation
kernels. In other words, the GPU should not be blocked due to unresolved data
dependencies. One way to measure the extent to which computation is blocked by
data dependencies is to calculate the communication computation overlap. Higher
GPU efficiency is observed if communication events overlap computation events.
Lack of communication and computation overlap will lead to the GPU being idle,
resulting in low efficiency. To sum up, a higher communication computation
overlap is desirable. To calculate the overlap percentage for each rank, we
measure the following ratio:

| **(time spent in computation while communicating) / (time spent in communication)**

The communication computation overlap can be calculated as follows:

.. code-block:: python
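
   # The exact snippet is not shown in this excerpt; as a sketch, the overlap
   # percentages are obtained from the analyzer created earlier via the
   # get_comm_comp_overlap API (assumed available in your HTA version).
   overlap_df = analyzer.get_comm_comp_overlap()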

The ``generate_trace_with_counters`` API outputs a new trace file with the
memory bandwidth and queue length counters. The new trace file contains tracks
which indicate the memory bandwidth used by memcpy/memset operations and tracks
for the queue length on each stream. By default, these counters are generated
using the rank 0 trace file, and the new file contains the suffix
``_with_counters`` in its name. Users have the option to generate the counters
for multiple ranks by using the ``ranks`` argument in the
``generate_trace_with_counters`` API.

.. code-block:: python
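
   # Sketch: write a new trace file augmented with the counters described above.
   # Pass the optional ranks argument to cover ranks other than rank 0.
   analyzer.generate_trace_with_counters()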

HTA also provides a summary of the memory copy bandwidth and queue length
counters as well as the time series of the counters for the profiled portion of
the code using the following APIs:

* `get_memory_bw_summary <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_summary>`_

* `get_queue_length_summary <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_summary>`_

* `get_memory_bw_time_series <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_time_series>`_

* `get_queue_length_time_series <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_time_series>`_

To view the summary and time series, use:

.. code-block:: python
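
   # Sketch: the exact snippet is not shown in this excerpt; these are the four
   # APIs listed above, called with their defaults.
   mem_bw_summary = analyzer.get_memory_bw_summary()
   queue_len_summary = analyzer.get_queue_length_summary()
   mem_bw_series = analyzer.get_memory_bw_time_series()
   queue_len_series = analyzer.get_queue_length_time_series()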

The memory bandwidth and queue length time series functions return a dictionary
whose key is the rank and the value is the time series for that rank. By
default, the time series is computed for rank 0 only.

CUDA Kernel Launch Statistics
-----------------------------

.. image:: ../_static/img/hta/cuda_kernel_launch.png

For each event launched on the GPU, there is a corresponding scheduling event on
the CPU, such as ``CudaLaunchKernel``, ``CudaMemcpyAsync``, and ``CudaMemsetAsync``.
These events are linked by a common correlation ID in the trace; see the figure
above. This feature computes the duration of the CPU runtime event, its
corresponding GPU kernel, and the launch delay, that is, the difference between
the GPU kernel start and the CPU operator end. The kernel launch info can be
generated as follows:

.. code-block:: python
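
   # Sketch: compute per-kernel launch statistics with the analyzer created
   # earlier; the resulting dataframe is the one shown in the screenshot below.
   kernel_info_df = analyzer.get_cuda_kernel_launch_stats()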

A screenshot of the generated dataframe is given below.

The durations of the CPU op, the GPU kernel, and the launch delay allow us to
identify the following:

* **Short GPU kernels** - GPU kernels with duration less than the corresponding
  CPU runtime event.

* **Runtime event outliers** - CPU runtime events with excessive duration.

* **Launch delay outliers** - GPU kernels that take too long to be scheduled.

HTA generates distribution plots for each of the aforementioned three categories.

**Short GPU kernels**

Typically, the launch time on the CPU side ranges from 5 to 20 microseconds. In
some cases, the GPU execution time is lower than the launch time itself. The
graph below helps us find how frequently such instances occur in the code.

.. image:: ../_static/img/hta/short_gpu_kernels.png

hence the `get_cuda_kernel_launch_stats` API provides the
``launch_delay_cutoff`` argument to configure the value.
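
As a hedged sketch of overriding that cutoff (the default value and its units are
not shown in this excerpt, so treat the number below as purely illustrative):

.. code-block:: python

   # illustrative only: adjust the cutoff to control which kernels are
   # flagged as launch delay outliers
   kernel_info_df = analyzer.get_cuda_kernel_launch_stats(launch_delay_cutoff=100)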

.. image:: ../_static/img/hta/launch_delay_outliers.png


Conclusion
~~~~~~~~~~

In this tutorial, you have learned how to install and use HTA,
a performance tool that enables you to analyze bottlenecks in your distributed
training workflows. To learn how you can use the HTA tool to perform trace
diff analysis, see `Trace Diff using Holistic Trace Analysis <https://pytorch.org/tutorials/beginner/hta_trace_diff_tutorial.html>`__.