
Commit cc80c68

An editorial pass on the HTA tutorials (#2726)
* Editorial pass on the HTA tutorials
1 parent 765d428 commit cc80c68

2 files changed: +114 −109 lines changed

beginner_source/hta_intro_tutorial.rst

Lines changed: 98 additions & 92 deletions
@@ -1,76 +1,73 @@
 Introduction to Holistic Trace Analysis
 =======================================
-**Author:** `Anupam Bhatnagar <https://github.com/anupambhatnagar>`_
 
-Setup
------
+**Author:** `Anupam Bhatnagar <https://github.com/anupambhatnagar>`_
 
-In this tutorial we demonstrate how to use Holistic Trace Analysis (HTA) to
+In this tutorial, we demonstrate how to use Holistic Trace Analysis (HTA) to
 analyze traces from a distributed training job. To get started follow the steps
-below:
+below.
 
 Installing HTA
-^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~
 
 We recommend using a Conda environment to install HTA. To install Anaconda, see
-`here <https://docs.anaconda.com/anaconda/install/index.html>`_.
+`the official Anaconda documentation <https://docs.anaconda.com/anaconda/install/index.html>`_.
 
-1) Install HTA using pip
+1. Install HTA using pip:
 
-.. code-block:: python
+.. code-block:: python
 
-   pip install HolisticTraceAnalysis
+   pip install HolisticTraceAnalysis
 
-2) [Optional and recommended] Setup a conda environment
+2. (Optional and recommended) Set up a Conda environment:
 
-.. code-block:: python
+.. code-block:: python
 
-   # create the environment env_name
-   conda create -n env_name
+   # create the environment env_name
+   conda create -n env_name
 
-   # activate the environment
-   conda activate env_name
+   # activate the environment
+   conda activate env_name
 
-   # deactivate the environment
-   conda deactivate
+   # When you are done, deactivate the environment by running ``conda deactivate``
 
-Getting started
-^^^^^^^^^^^^^^^
+Getting Started
+~~~~~~~~~~~~~~~
 
-Launch a jupyter notebook and set the ``trace_dir`` variable to the location of the traces.
+Launch a Jupyter notebook and set the ``trace_dir`` variable to the location of the traces.
 
 .. code-block:: python
 
-   from hta.trace_analysis import TraceAnalysis
-   trace_dir = "/path/to/folder/with/traces"
-   analyzer = TraceAnalysis(trace_dir=trace_dir)
+   from hta.trace_analysis import TraceAnalysis
+   trace_dir = "/path/to/folder/with/traces"
+   analyzer = TraceAnalysis(trace_dir=trace_dir)
 
 
 Temporal Breakdown
 ------------------
 
-To best utilize the GPUs it is vital to understand where the GPU is spending
-time for a given job. Is the GPU spending time on computation, communication,
-memory events, or is it idle? The temporal breakdown feature breaks down the
-time spent in three categories
+To effectively utilize the GPUs, it is crucial to understand how they are spending
+time for a specific job. Are they primarily engaged in computation, communication,
+memory events, or are they idle? The temporal breakdown feature provides a detailed
+analysis of the time spent in these three categories.
 
-1) Idle time - GPU is idle.
-2) Compute time - GPU is being used for matrix multiplications or vector operations.
-3) Non-compute time - GPU is being used for communication or memory events.
+* Idle time - GPU is idle.
+* Compute time - GPU is being used for matrix multiplications or vector operations.
+* Non-compute time - GPU is being used for communication or memory events.
 
-To achieve high training efficiency the code should maximize compute time and
-minimize idle time and non-compute time. The function below returns
-a dataframe containing the temporal breakdown for each rank.
+To achieve high training efficiency, the code should maximize compute time and
+minimize idle time and non-compute time. The following function generates a
+dataframe that provides a detailed breakdown of the temporal usage for each rank.
 
 .. code-block:: python
 
-   analyzer = TraceAnalysis(trace_dir = "/path/to/trace/folder")
-   time_spent_df = analyzer.get_temporal_breakdown()
+   analyzer = TraceAnalysis(trace_dir = "/path/to/trace/folder")
+   time_spent_df = analyzer.get_temporal_breakdown()
 
 
 .. image:: ../_static/img/hta/temporal_breakdown_df.png
 
-When the ``visualize`` argument is set to True in the `get_temporal_breakdown
+When the ``visualize`` argument is set to ``True`` in the `get_temporal_breakdown
 <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_temporal_breakdown>`_
 function it also generates a bar graph representing the breakdown by rank.
 
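As a minimal sketch of the call described above (reusing the ``analyzer`` object created earlier; ``visualize`` is the argument named in the linked API reference):

.. code-block:: python

   # returns the per-rank temporal breakdown dataframe and, with visualize=True,
   # also renders the bar graph of idle/compute/non-compute time per rank
   time_spent_df = analyzer.get_temporal_breakdown(visualize=True)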
@@ -79,23 +76,26 @@ function it also generates a bar graph representing the breakdown by rank.
 
 Idle Time Breakdown
 -------------------
-Understanding how much time the GPU is idle and its causes can help direct
-optimization strategies. A GPU is considered idle when no kernel is running on
-it. We developed an algorithm to categorize the Idle time into 3 categories:
 
-#. Host wait: is the idle duration on the GPU due to the CPU not enqueuing
-   kernels fast enough to keep the GPU busy. These kinds of inefficiencies can
-   be resolved by examining the CPU operators that are contributing to the slow
-   down, increasing the batch size and applying operator fusion.
+Gaining insight into the amount of time the GPU spends idle and the
+reasons behind it can help guide optimization strategies. A GPU is
+considered idle when no kernel is running on it. We have developed an
+algorithm to categorize the `Idle` time into three distinct categories:
+
+* **Host wait:** refers to the idle time on the GPU that is caused by
+  the CPU not enqueuing kernels quickly enough to keep the GPU fully utilized.
+  These types of inefficiencies can be addressed by examining the CPU
+  operators that are contributing to the slowdown, increasing the batch
+  size and applying operator fusion.
 
-#. Kernel wait: constitutes the short overhead to launch consecutive kernels on
-   the GPU. The idle time attributed to this category can be minimized by using
-   CUDA Graph optimizations.
+* **Kernel wait:** This refers to brief overhead associated with launching
+  consecutive kernels on the GPU. The idle time attributed to this category
+  can be minimized by using CUDA Graph optimizations.
 
-#. Other wait: Lastly, this category includes idle we could not currently
-   attribute due to insufficient information. The likely causes include
-   synchronization among CUDA streams using CUDA events and delays in launching
-   kernels.
+* **Other wait:** This category includes idle time that cannot currently
+  be attributed due to insufficient information. The likely causes include
+  synchronization among CUDA streams using CUDA events and delays in launching
+  kernels.
 
 The host wait time can be interpreted as the time when the GPU is stalling due
 to the CPU. To attribute the idle time as kernel wait we use the following
@@ -132,6 +132,7 @@ on each rank.
    :scale: 100%
 
 .. tip::
+
    By default, the idle time breakdown presents the percentage of each of the
    idle time categories. Setting the ``visualize_pctg`` argument to ``False``,
    the function renders with absolute time on the y-axis.
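For illustration, a minimal sketch of generating this breakdown with absolute time on the y-axis (the ``get_idle_time_breakdown`` function name is assumed from the HTA API reference; only the ``visualize_pctg`` argument is taken from the tip above, and the exact signature may differ):

.. code-block:: python

   # assumed API name; plot absolute idle time per category instead of percentages
   idle_time_df = analyzer.get_idle_time_breakdown(visualize_pctg=False)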
@@ -140,10 +141,10 @@ on each rank.
 Kernel Breakdown
 ----------------
 
-The kernel breakdown feature breaks down the time spent for each kernel type
-i.e. communication (COMM), computation (COMP), and memory (MEM) across all
-ranks and presents the proportion of time spent in each category. The
-percentage of time spent in each category as a pie chart.
+The kernel breakdown feature breaks down the time spent for each kernel type,
+such as communication (COMM), computation (COMP), and memory (MEM), across all
+ranks and presents the proportion of time spent in each category. Here is the
+percentage of time spent in each category as a pie chart:
 
 .. image:: ../_static/img/hta/kernel_type_breakdown.png
    :align: center
@@ -156,15 +157,15 @@ The kernel breakdown can be calculated as follows:
    kernel_type_metrics_df, kernel_metrics_df = analyzer.get_gpu_kernel_breakdown()
 
 The first dataframe returned by the function contains the raw values used to
-generate the Pie chart.
+generate the pie chart.
 
 Kernel Duration Distribution
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 The second dataframe returned by `get_gpu_kernel_breakdown
 <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_gpu_kernel_breakdown>`_
 contains duration summary statistics for each kernel. In particular, this
-includes the count, min, max, average, standard deviation, sum and kernel type
+includes the count, min, max, average, standard deviation, sum, and kernel type
 for each kernel on each rank.
 
 .. image:: ../_static/img/hta/kernel_metrics_df.png
@@ -181,11 +182,12 @@ bottlenecks.
 .. image:: ../_static/img/hta/pie_charts.png
 
 .. tip::
+
    All images are generated using plotly. Hovering on the graph shows the
-   mode bar on the top right which allows the user to zoom, pan, select and
+   mode bar on the top right which allows the user to zoom, pan, select, and
    download the graph.
 
-The pie charts above shows the top 5 computation, communication and memory
+The pie charts above show the top 5 computation, communication, and memory
 kernels. Similar pie charts are generated for each rank. The pie charts can be
 configured to show the top k kernels using the ``num_kernels`` argument passed
 to the `get_gpu_kernel_breakdown` function. Additionally, the
@@ -212,21 +214,21 @@ in the examples folder of the repo.
 Communication Computation Overlap
 ---------------------------------
 
-In distributed training a significant amount of time is spent in communication
-and synchronization events between GPUs. To achieve high GPU efficiency (i.e.
-TFLOPS/GPU) it is vital to keep the GPU oversubscribed with computation
+In distributed training, a significant amount of time is spent in communication
+and synchronization events between GPUs. To achieve high GPU efficiency (such as
+TFLOPS/GPU), it is crucial to keep the GPU oversubscribed with computation
 kernels. In other words, the GPU should not be blocked due to unresolved data
 dependencies. One way to measure the extent to which computation is blocked by
 data dependencies is to calculate the communication computation overlap. Higher
 GPU efficiency is observed if communication events overlap computation events.
 Lack of communication and computation overlap will lead to the GPU being idle,
-thus the efficiency would be low. To sum up, higher communication computation
-overlap is desirable. To calculate the overlap percentage for each rank we
-measure the following ratio:
+resulting in low efficiency.
+To sum up, a higher communication computation overlap is desirable. To calculate
+the overlap percentage for each rank, we measure the following ratio:
 
 | **(time spent in computation while communicating) / (time spent in communication)**
 
-Communication computation overlap can be calculated as follows:
+The communication computation overlap can be calculated as follows:
 
 .. code-block:: python
 
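   # Illustrative sketch only -- the ``get_comm_comp_overlap`` name is assumed from
   # the HTA API reference and is not shown in the diff above; the actual code in
   # the file may differ.
   overlap_df = analyzer.get_comm_comp_overlap(visualize=True)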
@@ -266,9 +268,9 @@ API outputs a new trace file with the memory bandwidth and queue length
 counters. The new trace file contains tracks which indicate the memory
 bandwidth used by memcpy/memset operations and tracks for the queue length on
 each stream. By default, these counters are generated using the rank 0
-trace file and the new file contains the suffix ``_with_counters`` in its name.
+trace file, and the new file contains the suffix ``_with_counters`` in its name.
 Users have the option to generate the counters for multiple ranks by using the
-``ranks`` argument in the `generate_trace_with_counters` API.
+``ranks`` argument in the ``generate_trace_with_counters`` API.
 
 .. code-block:: python
 
@@ -284,19 +286,15 @@ HTA also provides a summary of the memory copy bandwidth and queue length
 counters as well as the time series of the counters for the profiled portion of
 the code using the following API:
 
-#. `get_memory_bw_summary
-   <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_summary>`_
+* `get_memory_bw_summary <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_summary>`_
 
-#. `get_queue_length_summary
-   <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_summary>`_
+* `get_queue_length_summary <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_summary>`_
 
-#. `get_memory_bw_time_series
-   <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_time_series>`_
+* `get_memory_bw_time_series <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_time_series>`_
 
-#. `get_queue_length_time_series
-   <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_time_series>`_
+* `get_queue_length_time_series <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_time_series>`_
 
-To view the summary and time series use:
+To view the summary and time series, use:
 
 .. code-block:: python
 
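   # Illustrative sketch of the four summary/time series calls listed above; the
   # exact signatures may differ, and by default the time series is computed for
   # rank 0 only.
   mem_bw_summary = analyzer.get_memory_bw_summary()
   queue_len_summary = analyzer.get_queue_length_summary()
   mem_bw_series = analyzer.get_memory_bw_time_series()
   queue_len_series = analyzer.get_queue_length_time_series()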
@@ -321,17 +319,16 @@ bandwidth and queue length time series functions return a dictionary whose key
 is the rank and the value is the time series for that rank. By default, the
 time series is computed for rank 0 only.
 
-
 CUDA Kernel Launch Statistics
 -----------------------------
 
 .. image:: ../_static/img/hta/cuda_kernel_launch.png
 
-For each event launched on the GPU there is a corresponding scheduling event on
-the CPU e.g. CudaLaunchKernel, CudaMemcpyAsync, CudaMemsetAsync. These events
-are linked by a common correlation id in the trace. See figure above. This
-feature computes the duration of the CPU runtime event, its corresponding GPU
-kernel and the launch delay i.e. the difference between GPU kernel starting and
+For each event launched on the GPU, there is a corresponding scheduling event on
+the CPU, such as ``CudaLaunchKernel``, ``CudaMemcpyAsync``, ``CudaMemsetAsync``.
+These events are linked by a common correlation ID in the trace - see the figure
+above. This feature computes the duration of the CPU runtime event, its corresponding GPU
+kernel and the launch delay, for example, the difference between GPU kernel starting and
 CPU operator ending. The kernel launch info can be generated as follows:
 
 .. code-block:: python
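   # Illustrative sketch -- ``get_cuda_kernel_launch_stats`` is the API named later
   # in this file; the ``kernel_info_df`` variable name is a placeholder and the
   # actual code in the file may differ.
   kernel_info_df = analyzer.get_cuda_kernel_launch_stats()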
@@ -345,23 +342,23 @@ A screenshot of the generated dataframe is given below.
    :scale: 100%
    :align: center
 
-The duration of the CPU op, GPU kernel and the launch delay allows us to find:
+The duration of the CPU op, GPU kernel, and the launch delay allow us to find
+the following:
 
-#. **Short GPU kernels** - GPU kernels with duration less than the
-   corresponding CPU runtime event.
+* **Short GPU kernels** - GPU kernels with duration less than the corresponding
+  CPU runtime event.
 
-#. **Runtime event outliers** - CPU runtime events with excessive duration.
+* **Runtime event outliers** - CPU runtime events with excessive duration.
 
-#. **Launch delay outliers** - GPU kernels which take too long to be scheduled.
+* **Launch delay outliers** - GPU kernels which take too long to be scheduled.
 
 HTA generates distribution plots for each of the aforementioned three categories.
 
-
 **Short GPU kernels**
 
-Usually, the launch time on the CPU side is between 5-20 microseconds. In some
-cases the GPU execution time is lower than the launch time itself. The graph
-below allows us to find how frequently such instances appear in the code.
+Typically, the launch time on the CPU side ranges from 5-20 microseconds. In some
+cases, the GPU execution time is lower than the launch time itself. The graph
+below helps us to find how frequently such instances occur in the code.
 
 .. image:: ../_static/img/hta/short_gpu_kernels.png
 
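As a rough sketch of how the categories above could be pulled out of the kernel launch dataframe (the ``kernel_info_df`` variable and the column names ``cpu_duration``, ``gpu_duration``, and ``launch_delay`` are hypothetical placeholders; check the generated dataframe for the actual schema):

.. code-block:: python

   # hypothetical column names -- inspect kernel_info_df.columns for the real ones
   short_gpu_kernels = kernel_info_df[
       kernel_info_df["gpu_duration"] < kernel_info_df["cpu_duration"]
   ]
   launch_delay_outliers = kernel_info_df[
       kernel_info_df["launch_delay"] > 100  # threshold in microseconds, for illustration
   ]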
@@ -382,3 +379,12 @@ hence the `get_cuda_kernel_launch_stats` API provides the
 ``launch_delay_cutoff`` argument to configure the value.
 
 .. image:: ../_static/img/hta/launch_delay_outliers.png
+
+
+Conclusion
+~~~~~~~~~~
+
+In this tutorial, you have learned how to install and use HTA,
+a performance tool that enables you to analyze bottlenecks in your distributed
+training workflows. To learn how you can use the HTA tool to perform trace
+diff analysis, see `Trace Diff using Holistic Trace Analysis <https://pytorch.org/tutorials/beginner/hta_trace_diff_tutorial.html>`__.
