CIFAR10 Example with Ignite

In this example, we show how to use Ignite to:

  • train a neural network on 1 or more GPUs or TPUs
  • compute training/validation metrics
  • log the learning rate, metrics, etc.
  • save the best model weights
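
A minimal, self-contained sketch of these pieces using Ignite's high-level APIs is shown below; the model, the random stand-in data, and the /tmp/output-cifar10 output directory are illustrative placeholders and do not mirror main.py exactly.

# Sketch only (not main.py): trainer + evaluator with metrics,
# learning-rate logging, and best-model checkpointing, on random data.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18
from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator
from ignite.metrics import Accuracy, Loss
from ignite.handlers import Checkpoint, DiskSaver

model = resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# Random stand-ins for the CIFAR10 train/validation loaders.
data = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
train_loader = DataLoader(data, batch_size=16)
val_loader = DataLoader(data, batch_size=16)

trainer = create_supervised_trainer(model, optimizer, criterion)
evaluator = create_supervised_evaluator(model, metrics={"accuracy": Accuracy(), "loss": Loss(criterion)})

@trainer.on(Events.EPOCH_COMPLETED)
def validate_and_log(engine):
    evaluator.run(val_loader)
    metrics = evaluator.state.metrics
    lr = optimizer.param_groups[0]["lr"]
    print(f"epoch {engine.state.epoch}: lr={lr:.4f} "
          f"val accuracy={metrics['accuracy']:.3f} val loss={metrics['loss']:.3f}")

# Keep only the checkpoint with the best validation accuracy.
best_model = Checkpoint(
    {"model": model},
    DiskSaver("/tmp/output-cifar10", require_empty=False),
    n_saved=1,
    score_name="accuracy",
    score_function=lambda engine: engine.state.metrics["accuracy"],
)
evaluator.add_event_handler(Events.COMPLETED, best_model)

trainer.run(train_loader, max_epochs=2)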

Configurations:

  • single GPU
  • multiple GPUs on a single node
  • multiple GPUs on multiple nodes
  • TPUs on Colab

Requirements:

Install all the requirements with pip install -r requirements.txt.

Usage:

Run the example on a single GPU:

python main.py run

For more details on accepted arguments:

python main.py run -- --help

If you have already downloaded the dataset, you can point the example at it with the parameter

--data_path="/path/to/cifar10/"
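
Under the hood the data is the standard torchvision CIFAR10 dataset; roughly speaking, the path is used as the dataset root, as in the sketch below (the transform and download flag are illustrative, not necessarily identical to main.py).

# Illustrative only: load CIFAR10 from an already-downloaded local copy.
from torchvision import datasets, transforms

data_path = "/path/to/cifar10/"
train_dataset = datasets.CIFAR10(
    root=data_path,
    train=True,
    download=False,                    # data already on disk, do not re-download
    transform=transforms.ToTensor(),
)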

Distributed training

Single node, multiple GPUs

Let's start training on a single node with 2 GPUs:

# using torchrun
torchrun --nproc_per_node=2 main.py run --backend="nccl"

or

# using function spawn inside the code
python -u main.py run --backend="nccl" --nproc_per_node=2
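
The --nproc_per_node option works because the script can spawn its own worker processes via ignite.distributed.Parallel; a minimal sketch of that launch mode follows (the training function and config are placeholders, not main.py's actual code). The same pattern applies with the Horovod backend described in the next section.

# Sketch of the "spawn inside the code" launch mode with ignite.distributed.Parallel
# (placeholder training function; main.py wires in the full training logic).
import ignite.distributed as idist

def training(local_rank, config):
    rank = idist.get_rank()      # global rank of this worker process
    device = idist.device()      # device assigned to this worker
    print(f"worker {rank} running on {device}")

if __name__ == "__main__":
    # backend="nccl" with nproc_per_node=2 spawns two GPU workers on this node.
    with idist.Parallel(backend="nccl", nproc_per_node=2) as parallel:
        parallel.run(training, {"batch_size": 512})
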
Using Horovod as distributed backend

Please make sure Horovod is installed before running.

Let's start training on a single node with 2 GPUs:

# horovodrun
horovodrun -np=2 python -u main.py run --backend="horovod"

or

# using function spawn inside the code
python -u main.py run --backend="horovod" --nproc_per_node=2

Colab, on 8 TPUs

The same code can be run on TPUs: Open In Colab

Multiple nodes, multiple GPUs

Let's start training on two nodes with 2 GPUs each. We assume that the master node is reachable by the hostname master, e.g. ping master.

  1. Execute on master node
torchrun \
    --nnodes=2 \
    --nproc_per_node=2 \
    --node_rank=0 \
    --master_addr=master --master_port=2222 \
    main.py run --backend="nccl"
  2. Execute on worker node
torchrun \
    --nnodes=2 \
    --nproc_per_node=2 \
    --node_rank=1 \
    --master_addr=master --master_port=2222 \
    main.py run --backend="nccl"
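
The spawn-based launch mode can cover the multi-node case as well, since idist.Parallel accepts the same topology arguments; a hedged sketch (the training function is a placeholder, and the hostname/port mirror the torchrun commands above):

# Sketch: spawn-based multi-node launch; run with node_rank=0 on the master
# node and node_rank=1 on the worker node.
import ignite.distributed as idist

def training(local_rank, config):
    print(f"global rank {idist.get_rank()} on {idist.device()}")

with idist.Parallel(
    backend="nccl",
    nproc_per_node=2,
    nnodes=2,
    node_rank=0,
    master_addr="master",
    master_port=2222,
) as parallel:
    parallel.run(training, {})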

Check resume training

Single GPU

Initial training with a stop at iteration 1000 (~11 epochs):

python main.py run --stop_iteration=1000
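
Such a stop can be expressed with Ignite's event filters; the sketch below is illustrative and not necessarily how main.py implements it.

# Illustrative: terminate training once a given iteration is reached.
from ignite.engine import Engine, Events

def train_step(engine, batch):
    return 0.0  # placeholder training step

trainer = Engine(train_step)

@trainer.on(Events.ITERATION_COMPLETED(once=1000))
def stop_training(engine):
    print(f"stopping at iteration {engine.state.iteration}")
    engine.terminate()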

Resume from the latest checkpoint

python main.py run --resume-from=/tmp/output-cifar10/resnet18_backend-None-1_stop-on-1000/training_checkpoint_1000.pt
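
Resuming from such a file boils down to restoring the saved objects; below is a sketch using ignite.handlers.Checkpoint.load_objects, with placeholder model/optimizer/trainer objects and a hypothetical path. The keys in to_load must match those used when the checkpoint was saved.

# Illustrative: restore training state from a saved checkpoint file.
import torch
import torch.nn as nn
from ignite.engine import Engine
from ignite.handlers import Checkpoint

model = nn.Linear(10, 10)                                  # placeholder objects;
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)    # use the real model,
trainer = Engine(lambda engine, batch: 0.0)                # optimizer and trainer

checkpoint = torch.load("/path/to/training_checkpoint_1000.pt", map_location="cpu")
Checkpoint.load_objects(
    to_load={"model": model, "optimizer": optimizer, "trainer": trainer},
    checkpoint=checkpoint,
)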

Distributed training

Single node, multiple GPUs

Initial training on a single node with 2 GPUs, stopping at iteration 1000 (~11 epochs):

# using torchrun
torchrun --nproc_per_node=2 main.py run --backend="nccl" --stop_iteration=1000

Resume from the latest checkpoint

torchrun --nproc_per_node=2 main.py run --backend="nccl" \
    --resume-from=/tmp/output-cifar10/resnet18_backend-nccl-2_stop-on-1000/training_checkpoint_1000.pt

Similar commands can be adapted for other cases.

ClearML fileserver

If a ClearML server is used (i.e. the --with_clearml argument is passed), artifact upload must be configured by modifying the ClearML configuration file ~/clearml.conf generated by clearml-init. According to the documentation, the output_uri can be configured via sdk.development.default_output_uri, pointing to the fileserver URI. If the server is self-hosted, the ClearML fileserver URI is http://localhost:8081.
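
For a self-hosted server, the relevant excerpt of ~/clearml.conf would look roughly like this (excerpt only; keep the rest of the generated file unchanged):

sdk {
    development {
        # upload artifacts to the self-hosted ClearML fileserver
        default_output_uri: "http://localhost:8081"
    }
}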

For more details, see https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html