Add Distributed tutorials landing page #1997

Merged: 11 commits, Oct 4, 2022
1 change: 0 additions & 1 deletion _static/css/custom.css
100644 → 100755
@@ -71,4 +71,3 @@
.sd-card:hover:after {
transform: scaleX(1);
}

6 changes: 3 additions & 3 deletions conf.py
@@ -275,16 +275,16 @@ def setup(app):
# and can be moved outside of this function (and the setup(app) function
# can be deleted).
#html_css_files = [
# 'https://cdn.jsdelivr.net/npm/katex@0.10.0-beta/dist/katex.min.css'
# 'https://cdn.jsdelivr.net/npm/katex@0.10.0-beta/dist/katex.min.css',
# 'css/custom.css'
#]
# In Sphinx 1.8 it was renamed to `add_css_file`, 1.7 and prior it is
# `add_stylesheet` (deprecated in 1.8).
#add_css = getattr(app, 'add_css_file', app.add_stylesheet)
#for css_file in html_css_files:
# add_css(css_file)

# Custom CSS
# app.add_stylesheet('css/pytorch_theme.css')
#app.add_stylesheet('css/pytorch_theme.css')
# app.add_stylesheet('https://fonts.googleapis.com/css?family=Lato')
# Custom directives
app.add_directive('includenodoc', IncludeDirective)
151 changes: 151 additions & 0 deletions distributed/home.rst
@@ -0,0 +1,151 @@
Distributed and Parallel Training Tutorials
===========================================

Distributed training is a model training paradigm that involves
spreading the training workload across multiple worker nodes, thereby
significantly improving training speed and model accuracy. While
distributed training can be used for any type of ML model training, it
is most beneficial for large models and compute-demanding tasks such as
deep learning.

There are a few ways you can perform distributed training in
PyTorch, and each method has its advantages in certain use cases:

* `DistributedDataParallel (DDP) <#learn-ddp>`__
* `Fully Sharded Data Parallel (FSDP) <#learn-fsdp>`__
* `Remote Procedure Call (RPC) distributed training <#learn-rpc>`__
* `Custom Extensions <#custom-extensions>`__

Read more about these options in `Distributed Overview <../beginner/dist_overview.html>`__.

.. _learn-ddp:

Learn DDP
---------

.. grid:: 3

.. grid-item-card:: :octicon:`file-code;1em`
DDP Intro Video Tutorials
:link: https://pytorch.org/tutorials/beginner/ddp_series_intro.html?utm_source=distr_landing&utm_medium=ddp_series_intro
:link-type: url

A step-by-step video series on how to get started with
`DistributedDataParallel` and advance to more complex topics.
+++
:octicon:`code;1em` Code :octicon:`square-fill;1em` :octicon:`video;1em` Video

.. grid-item-card:: :octicon:`file-code;1em`
Getting Started with Distributed Data Parallel
:link: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html?utm_source=distr_landing&utm_medium=intermediate_ddp_tutorial
:link-type: url

This tutorial provides a short and gentle intro to PyTorch
`DistributedDataParallel`.
+++
:octicon:`code;1em` Code

.. grid-item-card:: :octicon:`file-code;1em`
Distributed Training with Uneven Inputs Using
the Join Context Manager
:link: https://pytorch.org/tutorials/advanced/generic_join.html?utm_source=distr_landing&utm_medium=generic_join
:link-type: url

This tutorial describes the Join context manager and demonstrates
its use with `DistributedDataParallel`.
+++
:octicon:`code;1em` Code
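
Before diving into these tutorials, the snippet below is a minimal single-node
DDP sketch for orientation; the tiny `nn.Linear` model, the random data, and
the master address/port are placeholders rather than anything taken from the
linked tutorials.

.. code-block:: python

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def run(rank, world_size):
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"
        # One process per GPU; nccl is the usual backend for GPU training.
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        model = nn.Linear(10, 1).to(rank)
        ddp_model = DDP(model, device_ids=[rank])  # replicas kept in sync across ranks

        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        loss_fn = nn.MSELoss()
        for _ in range(5):
            optimizer.zero_grad()
            inputs = torch.randn(20, 10, device=rank)
            labels = torch.randn(20, 1, device=rank)
            loss_fn(ddp_model(inputs), labels).backward()  # gradients all-reduced here
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = torch.cuda.device_count()
        mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)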

.. _learn-fsdp:

Learn FSDP
----------

.. grid:: 3

.. grid-item-card:: :octicon:`file-code;1em`
Getting Started with FSDP
:link: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_getting_started
:link-type: url

This tutorial demonstrates how you can perform distributed training
with FSDP on the MNIST dataset.
+++
:octicon:`code;1em` Code

.. grid-item-card:: :octicon:`file-code;1em`
FSDP Advanced
:link: https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_advanced
:link-type: url

In this tutorial, you will learn how to fine-tune a HuggingFace (HF) T5
model with FSDP for text summarization.
+++
:octicon:`code;1em` Code
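
As a quick sketch of what FSDP usage looks like, the snippet below wraps a toy
model with `FullyShardedDataParallel`; it assumes the process group has already
been initialized (for example, as in the DDP sketch above) and that the model
and data are placeholders.

.. code-block:: python

    import torch
    import torch.nn as nn
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    # Assumes torch.distributed.init_process_group(...) has already been
    # called and that this process owns one GPU.
    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
    fsdp_model = FSDP(model)  # parameters are sharded across ranks

    optimizer = torch.optim.Adam(fsdp_model.parameters(), lr=1e-3)
    loss = fsdp_model(torch.randn(8, 1024, device="cuda")).sum()
    loss.backward()   # gradients are reduced and re-sharded
    optimizer.step()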

.. _learn-rpc:

Learn RPC
---------

.. grid:: 3

.. grid-item-card:: :octicon:`file-code;1em`
Getting Started with Distributed RPC Framework
:link: https://pytorch.org/tutorials/intermediate/rpc_tutorial.html?utm_source=distr_landing&utm_medium=rpc_getting_started
:link-type: url

This tutorial demonstrates how to get started with RPC-based distributed
training.
+++
:octicon:`code;1em` Code

.. grid-item-card:: :octicon:`file-code;1em`
Implementing a Parameter Server Using Distributed RPC Framework
:link: https://pytorch.org/tutorials/intermediate/rpc_param_server_tutorial.html?utm_source=distr_landing&utm_medium=rpc_param_server_tutorial
:link-type: url

This tutorial walks you through a simple example of implementing a
parameter server using PyTorch’s Distributed RPC framework.
+++
:octicon:`code;1em` Code

.. grid-item-card:: :octicon:`file-code;1em`
Implementing Batch RPC Processing Using Asynchronous Executions
:link: https://pytorch.org/tutorials/intermediate/rpc_async_execution.html?utm_source=distr_landing&utm_medium=rpc_async_execution
:link-type: url

In this tutorial you will build batch-processing RPC applications
with the @rpc.functions.async_execution decorator.
+++
:octicon:`code;1em` Code

.. grid:: 3

.. grid-item-card:: :octicon:`file-code;1em`
Combining Distributed DataParallel with Distributed RPC Framework
:link: https://pytorch.org/tutorials/advanced/rpc_ddp_tutorial.html?utm_source=distr_landing&utm_medium=rpc_plus_ddp
:link-type: url

In this tutorial you will learn how to combine distributed data
parallelism with distributed model parallelism.
+++
:octicon:`code;1em` Code
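
For orientation, a minimal RPC sketch is shown below; the worker names, the toy
`add` function, and the master address/port are illustrative placeholders
rather than anything from the linked tutorials.

.. code-block:: python

    import os
    import torch
    import torch.distributed.rpc as rpc
    import torch.multiprocessing as mp

    def add(a, b):
        return a + b

    def run(rank, world_size):
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29501"
        rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
        if rank == 0:
            # Run add() on worker1 and block until the result is returned.
            result = rpc.rpc_sync("worker1", add, args=(torch.ones(2), torch.ones(2)))
            print(result)  # tensor([2., 2.])
        rpc.shutdown()  # waits for all outstanding work before tearing down

    if __name__ == "__main__":
        mp.spawn(run, args=(2,), nprocs=2, join=True)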

.. _custom-extensions:

Custom Extensions
-----------------

.. grid:: 3

.. grid-item-card:: :octicon:`file-code;1em`
Customize Process Group Backends Using Cpp Extensions
:link: https://pytorch.org/tutorials/intermediate/process_group_cpp_extension_tutorial.html?utm_source=distr_landing&utm_medium=custom_extensions_cpp
:link-type: url

In this tutorial you will learn to implement a custom `ProcessGroup`
backend and plug it into the PyTorch distributed package using
cpp extensions.
+++
:octicon:`code;1em` Code
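
As a rough illustration of how such a backend is consumed once it has been
built and registered, the snippet below imports a hypothetical extension
module and selects it by name; the `dummy_collectives` module and the
"dummy" backend name are stand-ins for whatever your extension provides.

.. code-block:: python

    import os
    import torch
    import torch.distributed as dist

    # Hypothetical C++ extension; importing it is assumed to register the
    # "dummy" backend (e.g. via dist.Backend.register_backend) on load.
    import dummy_collectives

    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29502"
    dist.init_process_group("dummy", rank=0, world_size=1)

    x = torch.ones(4)
    dist.all_reduce(x)  # dispatched to the custom backend's allreduce implementation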
1 change: 1 addition & 0 deletions index.rst
@@ -891,6 +891,7 @@ Additional Resources
:hidden:
:caption: Parallel and Distributed Training

distributed/home
beginner/dist_overview
beginner/ddp_series_intro
intermediate/model_parallel_tutorial