CIFAR10 Example with Ignite

In this example, we show how to use Ignite to:

  • train a neural network on one or more GPUs or TPUs
  • compute training/validation metrics
  • log the learning rate, metrics, etc.
  • save the best model weights
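
Below is a minimal, simplified sketch of how these pieces are typically wired together with Ignite. It is not the example's actual code; the model, optimizer, criterion and data loaders are assumed to be defined elsewhere:

from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator
from ignite.handlers import ModelCheckpoint
from ignite.metrics import Accuracy, Loss

trainer = create_supervised_trainer(model, optimizer, criterion, device="cuda")
evaluator = create_supervised_evaluator(
    model, metrics={"accuracy": Accuracy(), "loss": Loss(criterion)}, device="cuda"
)

@trainer.on(Events.EPOCH_COMPLETED)
def run_validation(engine):
    # compute and print validation metrics after every epoch
    evaluator.run(val_loader)
    metrics = evaluator.state.metrics
    print(f"Epoch {engine.state.epoch}: accuracy={metrics['accuracy']:.4f}")

# keep only the best weights according to validation accuracy
best_model_handler = ModelCheckpoint(
    dirname="/tmp/output", filename_prefix="best", n_saved=1,
    score_name="accuracy",
    score_function=lambda engine: engine.state.metrics["accuracy"],
)
evaluator.add_event_handler(Events.COMPLETED, best_model_handler, {"model": model})

trainer.run(train_loader, max_epochs=24)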

Configurations:

  • single GPU
  • multiple GPUs on a single node
  • multiple GPUs on multiple nodes
  • TPUs on Colab

Requirements:

Install all the requirements with pip install -r requirements.txt.

Usage:

Run the example on a single GPU:

python main.py run

For more details on accepted arguments:

python main.py run -- --help

If you would like to use an already downloaded dataset, pass its path via the --data_path argument:

--data_path="/path/to/cifar10/"
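
The run subcommand and the -- separator before --help suggest a python-fire style command line. The sketch below shows one plausible way such an entry point and the --data_path argument could be wired up; the function signature and defaults here are assumptions, not the example's actual code:

import fire
from torchvision import datasets, transforms

def run(data_path="/tmp/cifar10", batch_size=512, max_epochs=24, backend=None):
    # reuse an already downloaded dataset if present, otherwise download it into data_path
    train_ds = datasets.CIFAR10(root=data_path, train=True, download=True,
                                transform=transforms.ToTensor())
    test_ds = datasets.CIFAR10(root=data_path, train=False, download=True,
                               transform=transforms.ToTensor())
    print(f"Loaded {len(train_ds)} train / {len(test_ds)} test samples from {data_path}")

if __name__ == "__main__":
    # exposes "python main.py run --data_path=/path/to/cifar10" and "python main.py run -- --help"
    fire.Fire({"run": run})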

Distributed training

Single node, multiple GPUs

Let's start training on a single node with 2 GPUs:

# using torchrun
torchrun --nproc_per_node=2 main.py run --backend="nccl"

or

# using function spawn inside the code
python -u main.py run --backend="nccl" --nproc_per_node=2
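
Both launch modes are served by the same script because ignite.distributed.Parallel detects whether the worker processes were already created by torchrun or need to be spawned internally. A rough, simplified sketch of this pattern (not the example's exact code):

import ignite.distributed as idist

def training(local_rank, config):
    # every worker process lands here; idist reports its place in the process group
    print(f"rank {idist.get_rank()} of {idist.get_world_size()} started")

# with torchrun the processes already exist, so no spawn arguments are needed;
# for the internal-spawn variant you would pass nproc_per_node=2 here instead
with idist.Parallel(backend="nccl") as parallel:
    parallel.run(training, {"batch_size": 512})
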
Using Horovod as distributed backend

Please make sure Horovod is installed before running.
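
For GPU training, Horovod is typically built with NCCL support; one common way to install it is:

HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod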

Let's start training on a single node with 2 GPUs:

# horovodrun
horovodrun -np=2 python -u main.py run --backend="horovod"

or

# using function spawn inside the code
python -u main.py run --backend="horovod" --nproc_per_node=2

Colab, on 8 TPUs

The same code can be run on TPUs: Open In Colab
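
On Colab the notebook typically launches the same entry point with the xla-tpu backend and 8 processes, along the lines of (assuming torch_xla is available in the runtime):

python -u main.py run --backend="xla-tpu" --nproc_per_node=8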

Multiple nodes, multiple GPUs

Let's start training on two nodes with 2 GPUs each. We assume that the master node is reachable under the hostname master, e.g. ping master works.

  1. Execute on master node
torchrun \
    --nnodes=2 \
    --nproc_per_node=2 \
    --node_rank=0 \
    --master_addr=master --master_port=2222 \
    main.py run --backend="nccl"
  2. Execute on worker node
torchrun \
    --nnodes=2 \
    --nproc_per_node=2 \
    --node_rank=1 \
    --master_addr=master --master_port=2222 \
    main.py run --backend="nccl"

Check resume training

Single GPU

Initial training with a stop at iteration 1000 (~11 epochs):

python main.py run --stop_iteration=1000

Resume from the latest checkpoint:

python main.py run --resume-from=/tmp/output-cifar10/resnet18_backend-None-1_stop-on-1000/training_checkpoint_1000.pt
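
Resuming essentially restores the trainer, model, optimizer and lr scheduler states from the saved checkpoint before calling trainer.run again. A rough sketch with Ignite's Checkpoint handler, assuming these objects are already constructed and were saved under these names:

import torch
from ignite.handlers import Checkpoint

# objects to restore; the keys must match the names used when the checkpoint was saved
to_load = {"trainer": trainer, "model": model, "optimizer": optimizer, "lr_scheduler": lr_scheduler}

checkpoint = torch.load("/path/to/training_checkpoint_1000.pt", map_location="cpu")
Checkpoint.load_objects(to_load=to_load, checkpoint=checkpoint)

# trainer.state.iteration is now 1000, so trainer.run(...) continues from there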

Distributed training

Single node, multiple GPUs

Initial training on a single node with 2 GPUs, stopping at iteration 1000 (~11 epochs):

# using torchrun
torchrun --nproc_per_node=2 main.py run --backend="nccl" --stop_iteration=1000

Resume from the latest checkpoint:

torchrun --nproc_per_node=2 main.py run --backend="nccl" \
    --resume-from=/tmp/output-cifar10/resnet18_backend-nccl-2_stop-on-1000/training_checkpoint_1000.pt

Similar commands can be adapted for other cases.

ClearML fileserver

If a ClearML server is used (i.e. the --with_clearml argument is passed), artifact uploading must be configured in the ClearML configuration file ~/clearml.conf generated by clearml-init. According to the documentation, the output_uri argument can be configured via sdk.development.default_output_uri to point to the fileserver URI. For a self-hosted server, the ClearML fileserver URI is http://localhost:8081.
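
For example, the relevant part of ~/clearml.conf could look like this (shown for the default self-hosted fileserver; adjust the URI to your deployment):

sdk {
    development {
        # upload artifacts such as checkpoints to the ClearML fileserver
        default_output_uri: "http://localhost:8081"
    }
}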

For more details, see https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html