CIFAR10 Example with Ignite

In this example, we show how to use Ignite to:

  • train a neural network on 1 or more GPUs or TPUs
  • compute training/validation metrics
  • log the learning rate, metrics, etc.
  • save the best model weights
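
A minimal, self-contained sketch of these pieces using Ignite's high-level APIs is shown below; the model, the random stand-in data, and the /tmp/output-cifar10 output directory are illustrative placeholders and do not mirror main.py exactly.

# Sketch only (not main.py): trainer + evaluator with metrics,
# learning-rate logging, and best-model checkpointing, on random data.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18
from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator
from ignite.metrics import Accuracy, Loss
from ignite.handlers import Checkpoint, DiskSaver

model = resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# Random stand-ins for the CIFAR10 train/validation loaders.
data = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
train_loader = DataLoader(data, batch_size=16)
val_loader = DataLoader(data, batch_size=16)

trainer = create_supervised_trainer(model, optimizer, criterion)
evaluator = create_supervised_evaluator(model, metrics={"accuracy": Accuracy(), "loss": Loss(criterion)})

@trainer.on(Events.EPOCH_COMPLETED)
def validate_and_log(engine):
    evaluator.run(val_loader)
    metrics = evaluator.state.metrics
    lr = optimizer.param_groups[0]["lr"]
    print(f"epoch {engine.state.epoch}: lr={lr:.4f} "
          f"val accuracy={metrics['accuracy']:.3f} val loss={metrics['loss']:.3f}")

# Keep only the checkpoint with the best validation accuracy.
best_model = Checkpoint(
    {"model": model},
    DiskSaver("/tmp/output-cifar10", require_empty=False),
    n_saved=1,
    score_name="accuracy",
    score_function=lambda engine: engine.state.metrics["accuracy"],
)
evaluator.add_event_handler(Events.COMPLETED, best_model)

trainer.run(train_loader, max_epochs=2)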

Configurations:

  • single GPU
  • multiple GPUs on a single node
  • multiple GPUs on multiple nodes
  • TPUs on Colab

Requirements:

Install all the requirements with pip install -r requirements.txt.

Usage:

Run the example on a single GPU:

python main.py run

For more details on accepted arguments:

python main.py run -- --help

If you have already downloaded the dataset, you can point the example at it with the parameter

--data_path="/path/to/cifar10/"
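
Under the hood the data is the standard torchvision CIFAR10 dataset; roughly speaking, the path is used as the dataset root, as in the sketch below (the transform and download flag are illustrative, not necessarily identical to main.py).

# Illustrative only: load CIFAR10 from an already-downloaded local copy.
from torchvision import datasets, transforms

data_path = "/path/to/cifar10/"
train_dataset = datasets.CIFAR10(
    root=data_path,
    train=True,
    download=False,                    # data already on disk, do not re-download
    transform=transforms.ToTensor(),
)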

Distributed training

Single node, multiple GPUs

Let's start training on a single node with 2 GPUs:

# using torchrun
torchrun --nproc_per_node=2 main.py run --backend="nccl"

or

# using function spawn inside the code
python -u main.py run --backend="nccl" --nproc_per_node=2
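
The --nproc_per_node option works because the script can spawn its own worker processes via ignite.distributed.Parallel; a minimal sketch of that launch mode follows (the training function and config are placeholders, not main.py's actual code). The same pattern applies with the Horovod backend described in the next section.

# Sketch of the "spawn inside the code" launch mode with ignite.distributed.Parallel
# (placeholder training function; main.py wires in the full training logic).
import ignite.distributed as idist

def training(local_rank, config):
    rank = idist.get_rank()      # global rank of this worker process
    device = idist.device()      # device assigned to this worker
    print(f"worker {rank} running on {device}")

if __name__ == "__main__":
    # backend="nccl" with nproc_per_node=2 spawns two GPU workers on this node.
    with idist.Parallel(backend="nccl", nproc_per_node=2) as parallel:
        parallel.run(training, {"batch_size": 512})
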
Using Horovod as distributed backend

Please make sure Horovod is installed before running.

Let's start training on a single node with 2 GPUs:

# horovodrun
horovodrun -np=2 python -u main.py run --backend="horovod"

or

# using function spawn inside the code
python -u main.py run --backend="horovod" --nproc_per_node=2

Colab, on 8 TPUs

The same code can be run on TPUs: Open In Colab

Multiple nodes, multiple GPUs

Let's start training on two nodes with 2 GPUs each. We assume that the master node is reachable by the hostname master, e.g. ping master.

  1. Execute on master node
torchrun \
    --nnodes=2 \
    --nproc_per_node=2 \
    --node_rank=0 \
    --master_addr=master --master_port=2222 \
    main.py run --backend="nccl"
  2. Execute on worker node
torchrun \
    --nnodes=2 \
    --nproc_per_node=2 \
    --node_rank=1 \
    --master_addr=master --master_port=2222 \
    main.py run --backend="nccl"
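
The spawn-based launch mode can cover the multi-node case as well, since idist.Parallel accepts the same topology arguments; a hedged sketch (the training function is a placeholder, and the hostname/port mirror the torchrun commands above):

# Sketch: spawn-based multi-node launch; run with node_rank=0 on the master
# node and node_rank=1 on the worker node.
import ignite.distributed as idist

def training(local_rank, config):
    print(f"global rank {idist.get_rank()} on {idist.device()}")

with idist.Parallel(
    backend="nccl",
    nproc_per_node=2,
    nnodes=2,
    node_rank=0,
    master_addr="master",
    master_port=2222,
) as parallel:
    parallel.run(training, {})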

Check resume training

Single GPU

Initial training with a stop at iteration 1000 (~11 epochs):

python main.py run --stop_iteration=1000
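
Such a stop can be expressed with Ignite's event filters; the sketch below is illustrative and not necessarily how main.py implements it.

# Illustrative: terminate training once a given iteration is reached.
from ignite.engine import Engine, Events

def train_step(engine, batch):
    return 0.0  # placeholder training step

trainer = Engine(train_step)

@trainer.on(Events.ITERATION_COMPLETED(once=1000))
def stop_training(engine):
    print(f"stopping at iteration {engine.state.iteration}")
    engine.terminate()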

Resume from the latest checkpoint

python main.py run --resume-from=/tmp/output-cifar10/resnet18_backend-None-1_stop-on-1000/training_checkpoint_1000.pt
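
Resuming from such a file boils down to restoring the saved objects; below is a sketch using ignite.handlers.Checkpoint.load_objects, with placeholder model/optimizer/trainer objects and a hypothetical path. The keys in to_load must match those used when the checkpoint was saved.

# Illustrative: restore training state from a saved checkpoint file.
import torch
import torch.nn as nn
from ignite.engine import Engine
from ignite.handlers import Checkpoint

model = nn.Linear(10, 10)                                  # placeholder objects;
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)    # use the real model,
trainer = Engine(lambda engine, batch: 0.0)                # optimizer and trainer

checkpoint = torch.load("/path/to/training_checkpoint_1000.pt", map_location="cpu")
Checkpoint.load_objects(
    to_load={"model": model, "optimizer": optimizer, "trainer": trainer},
    checkpoint=checkpoint,
)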

Distributed training

Single node, multiple GPUs

Initial training on a single node with 2 GPUs, stopping at iteration 1000 (~11 epochs):

# using torchrun
torchrun --nproc_per_node=2 main.py run --backend="nccl" --stop_iteration=1000

Resume from the latest checkpoint

torchrun --nproc_per_node=2 main.py run --backend="nccl" \
    --resume-from=/tmp/output-cifar10/resnet18_backend-nccl-2_stop-on-1000/training_checkpoint_1000.pt

Similar commands can be adapted for other cases.

ClearML fileserver

If a ClearML server is used (i.e. the --with_clearml argument is passed), artifact upload must be configured by modifying the ClearML configuration file ~/clearml.conf generated by clearml-init. According to the documentation, the output_uri can be configured via sdk.development.default_output_uri, pointing to the fileserver URI. If the server is self-hosted, the ClearML fileserver URI is http://localhost:8081.
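
For a self-hosted server, the relevant excerpt of ~/clearml.conf would look roughly like this (excerpt only; keep the rest of the generated file unchanged):

sdk {
    development {
        # upload artifacts to the self-hosted ClearML fileserver
        default_output_uri: "http://localhost:8081"
    }
}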

For more details, see https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html