torch-ccl

This repository holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library (oneCCL).

Introduction

PyTorch is an open-source machine learning framework.

Intel® oneCCL (collective communications library) is a library for efficient distributed deep learning training that implements collectives such as allreduce, allgather, and alltoall. For more information on oneCCL, please refer to the oneCCL documentation.

The torch-ccl module implements the PyTorch C10D ProcessGroup API and can be dynamically loaded as an external ProcessGroup. It currently works only on Linux.

PyTorch API Alignment

We recommend Anaconda as the Python package management system. The following table lists the torch-ccl branches (tags) and the PyTorch versions they support.

`torch`	`torch-ccl`
`master`	`master`
v1.6.0	ccl_torch1.6
v1.5-rc3	2021.1-beta09

Usage details can be found in the README of the corresponding branch. The rest of this document describes the 2021.1-beta09 tag. If you want to use another version of torch-ccl, check out the corresponding branch (tag). For pytorch-1.5.0-rc3, #PR28068 and #PR32361 are needed to dynamically register an external ProcessGroup and to enable the alltoall collective communication primitive. A patch file containing these two PRs is in the patches directory and can be applied directly.

Requirements

Python 3.6 or later and a C++14 compiler.

PyTorch v1.5.0-rc3.

Installation

To install torch-ccl:

Clone the PyTorch source code.

   git clone https://github.com/pytorch/pytorch.git
   cd pytorch 
   git checkout v1.5.0-rc3
   cd ../

Clone torch-ccl.

   git clone https://github.com/intel/torch-ccl.git && cd torch-ccl 
   git submodule sync 
   git submodule update --init --recursive

Install PyTorch and torch-ccl.

   cd ../pytorch 
   git apply ../torch-ccl/patches/enable_torch_ccl_for_pytorch1.5.0-rc3.diff
   git submodule sync
   git submodule update --init --recursive
   python setup.py install 
   cd ../torch-ccl
   python setup.py install

oneCCL is bundled as a third-party repository of torch-ccl, but you need to source the oneCCL environment before running.

   source <torch_ccl_path>/ccl/env/setvars.sh

   For example:
   torch_ccl_path=$CONDA_PREFIX/lib/python3.7/site-packages/torch_ccl-1.0.1-py3.7-linux-x86_64.egg/
   source $torch_ccl_path/ccl/env/setvars.sh
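The egg path above depends on the Python and torch-ccl versions, so it can be convenient to look it up instead of typing it. The following is a minimal sketch, not part of torch-ccl: `find_setvars` and `locate_torch_ccl_setvars` are hypothetical helper names, and the sketch assumes that `ccl/env/setvars.sh` sits either inside the installed `torch_ccl` package directory or in its parent (the egg root in the example above).

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_setvars(package_dir: Path) -> Optional[Path]:
    """Return ccl/env/setvars.sh if it exists in package_dir or its parent."""
    # Assumption: the script lives next to the package or one level up
    # (for example, at the root of an .egg install).
    for base in (package_dir, package_dir.parent):
        candidate = base / "ccl" / "env" / "setvars.sh"
        if candidate.is_file():
            return candidate
    return None


def locate_torch_ccl_setvars() -> Optional[Path]:
    """Look up the installed torch_ccl module without importing it."""
    spec = importlib.util.find_spec("torch_ccl")
    if spec is None or spec.origin is None:
        return None
    return find_setvars(Path(spec.origin).parent)


if __name__ == "__main__":
    path = locate_torch_ccl_setvars()
    print(f"source {path}" if path else "torch_ccl or setvars.sh not found")
```

The printed line can be pasted into the shell. If the layout of your installation differs, adjust the candidate directories accordingly.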

Usage

example.py

import os

import torch.nn.parallel
import torch.distributed as dist
import torch_ccl

...

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
# os.environ values must be strings, so the fallbacks are '-1', not -1.
os.environ['RANK'] = os.environ.get('PMI_RANK', '-1')
os.environ['WORLD_SIZE'] = os.environ.get('PMI_SIZE', '-1')

backend = 'ccl'
dist.init_process_group(backend, ...)
my_rank = dist.get_rank()
my_size = dist.get_world_size()
print("my rank = %d  my size = %d" % (my_rank, my_size))

...

model = torch.nn.parallel.DistributedDataParallel(model, ...)

...

$ source <torch_ccl_path>/ccl/env/setvars.sh
$ mpirun -n <N> -ppn <PPN> -f <hostfile> python example.py
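The environment setup in example.py can be factored into a small helper that copies the launcher's `PMI_RANK` and `PMI_SIZE` into the `RANK` and `WORLD_SIZE` variables that PyTorch reads. This is a sketch only: `export_pmi_env` is a hypothetical name, and falling back to a single-process job (`RANK=0`, `WORLD_SIZE=1`) when the PMI variables are absent is an assumption of this example, not documented torch-ccl behaviour. Note that `os.environ` accepts only string values.

```python
import os
from typing import MutableMapping


def export_pmi_env(
    env: MutableMapping[str, str],
    master_addr: str = "127.0.0.1",
    master_port: str = "29500",
) -> None:
    """Copy PMI_RANK/PMI_SIZE into the variables PyTorch reads."""
    # Keep any address or port the caller has already exported.
    env.setdefault("MASTER_ADDR", master_addr)
    env.setdefault("MASTER_PORT", master_port)
    # Assumption: without mpirun, behave as a single-process job.
    env["RANK"] = env.get("PMI_RANK", "0")
    env["WORLD_SIZE"] = env.get("PMI_SIZE", "1")


if __name__ == "__main__":
    export_pmi_env(os.environ)
    print(os.environ["RANK"], os.environ["WORLD_SIZE"])
```

Calling `export_pmi_env(os.environ)` before `dist.init_process_group` would replace the four assignments in example.py. Passing a plain dictionary instead makes the mapping easy to check without launching MPI.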

Performance Debugging

To debug the performance of communication primitives, PyTorch's Autograd profiler can be used to inspect the time spent inside oneCCL calls.

Example:

profiling.py

import torch.nn.parallel
import torch.distributed as dist
import torch_ccl

backend = 'ccl'
dist.init_process_group(backend, ...)
my_rank = dist.get_rank()
my_size = dist.get_world_size()
print("rank = %d, size = %d" % (my_rank, my_size))

x = torch.ones([2, 2])
y = torch.ones([4, 4])
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for _ in range(10):
        dist.all_reduce(x)
        dist.all_reduce(y)
dist.barrier()
print(prof.key_averages(group_by_input_shape=True).table(sort_by="self_cpu_time_total"))

$ mpirun -n 2 -l python profiling.py

[0] rank = 0, size = 2
[0] ------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
[0] Name                            Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes
[0] ------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
[0] pg::allreduce                   37.70%           61.935us         37.70%           61.935us         6.194us          10               [[2, 2]]
[0] pg::allreduce                   23.40%           38.438us         23.40%           38.438us         3.844us          10               [[4, 4]]
[0] pg::wait::allreduce::sz:16      19.64%           32.258us         19.64%           32.258us         3.226us          10               []
[0] pg::wait::allreduce::sz:4       19.26%           31.634us         19.26%           31.634us         3.163us          10               []
[0] ------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
[0] Self CPU time total: 164.265us
[0]
[1] rank = 1, size = 2
[1] ------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
[1] Name                            Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes
[1] ------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
[1] pg::allreduce                   50.27%           62.730us         50.27%           62.730us         6.273us          10               [[2, 2]]
[1] pg::allreduce                   28.96%           36.133us         28.96%           36.133us         3.613us          10               [[4, 4]]
[1] pg::wait::allreduce::sz:4       13.83%           17.254us         13.83%           17.254us         1.725us          10               []
[1] pg::wait::allreduce::sz:16      6.95%            8.672us          6.95%            8.672us          0.867us          10               []
[1] ------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
[1] Self CPU time total: 124.789us
[1]
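When more than a few ranks are involved, the interleaved `[N]`-prefixed output can be hard to compare by eye. The sketch below groups the lines by rank and extracts each rank's "Self CPU time total". The function names are hypothetical, and the parser assumes the `[N]` prefix format shown above and totals reported in microseconds (`us`); the profiler may print other units for longer runs, which this sketch does not handle.

```python
import re
from collections import defaultdict
from typing import Dict, List

RANK_PREFIX = re.compile(r"^\[(\d+)\]\s?(.*)$")
TOTAL_LINE = re.compile(r"Self CPU time total:\s*([\d.]+)us")


def split_by_rank(output: str) -> Dict[int, List[str]]:
    """Group rank-prefixed output lines by the rank in their [N] prefix."""
    per_rank: Dict[int, List[str]] = defaultdict(list)
    for line in output.splitlines():
        match = RANK_PREFIX.match(line)
        if match:
            per_rank[int(match.group(1))].append(match.group(2))
    return dict(per_rank)


def self_cpu_totals(output: str) -> Dict[int, float]:
    """Return each rank's 'Self CPU time total' in microseconds."""
    totals: Dict[int, float] = {}
    for rank, lines in split_by_rank(output).items():
        for line in lines:
            found = TOTAL_LINE.search(line)
            if found:
                totals[rank] = float(found.group(1))
    return totals


if __name__ == "__main__":
    sample = "\n".join([
        "[0] Self CPU time total: 164.265us",
        "[1] Self CPU time total: 124.789us",
    ])
    print(self_cpu_totals(sample))
```

To use it on a real run, redirect the `mpirun` output to a file and pass the file contents to `self_cpu_totals`.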

License

BSD License
