In this example, we show how to use Ignite to train a neural network:
- on 1 or more GPUs or TPUs
- compute training/validation metrics
- log learning rate, metrics, etc.
- save the best model weights
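Below is a rough, minimal sketch of how these pieces typically fit together with Ignite. It is a simplified stand-in for main.py, not an excerpt from it: the model, paths and hyperparameters are arbitrary placeholders, and the real script adds distributed support, LR scheduling, TensorBoard/ClearML logging, etc.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator
from ignite.metrics import Accuracy, Loss
from ignite.handlers import Checkpoint, DiskSaver, global_step_from_engine

device = "cuda" if torch.cuda.is_available() else "cpu"

transform = transforms.Compose([transforms.ToTensor()])
train_ds = datasets.CIFAR10("/tmp/cifar10", train=True, download=True, transform=transform)
val_ds = datasets.CIFAR10("/tmp/cifar10", train=False, download=True, transform=transform)
train_loader = DataLoader(train_ds, batch_size=512, shuffle=True, num_workers=4)
val_loader = DataLoader(val_ds, batch_size=512, num_workers=4)

model = models.resnet18(num_classes=10).to(device)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Training engine and evaluation engine with accuracy/loss metrics
trainer = create_supervised_trainer(model, optimizer, criterion, device=device)
evaluator = create_supervised_evaluator(
    model, metrics={"accuracy": Accuracy(), "loss": Loss(criterion)}, device=device
)

@trainer.on(Events.EPOCH_COMPLETED)
def run_validation(engine):
    evaluator.run(val_loader)
    print(f"Epoch {engine.state.epoch}: {evaluator.state.metrics}")

# Keep only the best model weights according to validation accuracy
best_model_handler = Checkpoint(
    {"model": model},
    DiskSaver("/tmp/output-cifar10/", require_empty=False),
    filename_prefix="best",
    n_saved=1,
    score_name="accuracy",
    score_function=lambda engine: engine.state.metrics["accuracy"],
    global_step_transform=global_step_from_engine(trainer),
)
evaluator.add_event_handler(Events.COMPLETED, best_model_handler)

trainer.run(train_loader, max_epochs=24)
```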
Configurations:
- single GPU
- multi GPUs on a single node
- multi GPUs on multiple nodes
- TPUs on Colab
Requirements:
- pytorch-ignite:
pip install pytorch-ignite
- torchvision:
pip install torchvision
- tqdm:
pip install tqdm
- tensorboardx:
pip install tensorboardX
- python-fire:
pip install fire
- Optional: clearml:
pip install clearml
Alternatively, install all the requirements using pip install -r requirements.txt.
Run the example on a single GPU:
python main.py run
For more details on accepted arguments:
python main.py run -- --help
If you would like to use an already downloaded dataset, its path can be set via the --data_path parameter:
--data_path="/path/to/cifar10/"
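Assuming the path is simply forwarded to torchvision's CIFAR10 dataset (a guess at what main.py does internally, not a verified excerpt), it would be used roughly like this; the download step is skipped when the files already exist under that directory:

```python
from torchvision import datasets, transforms

def get_datasets(data_path="/path/to/cifar10/"):
    # data_path comes from the --data_path CLI argument.
    transform = transforms.Compose([transforms.ToTensor()])
    train_ds = datasets.CIFAR10(root=data_path, train=True, download=True, transform=transform)
    test_ds = datasets.CIFAR10(root=data_path, train=False, download=True, transform=transform)
    return train_ds, test_ds
```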
Let's start training on a single node with 2 GPUs:
# using torchrun
torchrun --nproc_per_node=2 main.py run --backend="nccl"
or
# using function spawn inside the code
python -u main.py run --backend="nccl" --nproc_per_node=2
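This spawn mode corresponds to Ignite's `ignite.distributed.Parallel` context manager. A hedged sketch of the pattern (the `training` function and config dict are placeholders, not the actual main.py code):

```python
import ignite.distributed as idist

def training(local_rank, config):
    # local_rank is the process index on this node; the usual trainer/evaluator
    # setup goes here, with idist.auto_model / idist.auto_dataloader wrapping
    # the model and dataloaders for the chosen backend.
    print(idist.get_rank(), ": running with config:", config)

if __name__ == "__main__":
    # backend and nproc_per_node mirror the --backend / --nproc_per_node CLI flags
    with idist.Parallel(backend="nccl", nproc_per_node=2) as parallel:
        parallel.run(training, {"batch_size": 512})
```

Under the torchrun launch shown above the processes are created by the launcher itself, so `nproc_per_node` would typically not be passed to `Parallel`.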
Using Horovod as the distributed backend
Please make sure Horovod is installed before running.
Let's start training on a single node with 2 GPUs:
# horovodrun
horovodrun -np=2 python -u main.py run --backend="horovod"
or
# using function spawn inside the code
python -u main.py run --backend="horovod" --nproc_per_node=2
Let's start training on two nodes with 2 GPUs each. We assume that the master node can be reached as master, e.g. ping master.
- Execute on master node
torchrun \
--nnodes=2 \
--nproc_per_node=2 \
--node_rank=0 \
--master_addr=master --master_port=2222 \
main.py run --backend="nccl"
- Execute on worker node
torchrun \
--nnodes=2 \
--nproc_per_node=2 \
--node_rank=1 \
--master_addr=master --master_port=2222 \
main.py run --backend="nccl"
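The spawn-based launch can be used for multi-node runs as well; this is a sketch under the assumption that you prefer spawning from Python instead of torchrun (the values mirror the flags above, and the training function is again a placeholder):

```python
import ignite.distributed as idist

def training(local_rank, config):
    print("rank", idist.get_rank(), "of", idist.get_world_size())

if __name__ == "__main__":
    # On the master node use node_rank=0; on the worker node use node_rank=1.
    with idist.Parallel(
        backend="nccl",
        nproc_per_node=2,
        nnodes=2,
        node_rank=0,
        master_addr="master",
        master_port=2222,
    ) as parallel:
        parallel.run(training, {})
```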
Initial training, stopping at iteration 1000 (~11 epochs):
python main.py run --stop_iteration=1000
Resume from the latest checkpoint
python main.py run --resume-from=/tmp/output-cifar10/resnet18_backend-None-1_stop-on-1000/training_checkpoint_1000.pt
Initial training on a single node with 2 GPUs, stopping at iteration 1000 (~11 epochs):
# using torchrun
torchrun --nproc_per_node=2 main.py run --backend="nccl" --stop_iteration=1000
Resume from the latest checkpoint
torchrun --nproc_per_node=2 main.py run --backend="nccl" \
--resume-from=/tmp/output-cifar10/resnet18_backend-nccl-2_stop-on-1000/training_checkpoint_1000.pt
Similar commands can be adapted for other cases.
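For reference, the checkpoint/resume mechanics behind these commands can be expressed with Ignite's Checkpoint handler. The sketch below is an assumption about how main.py wires this up (the helper name and the exact set of saved objects are not taken from the script):

```python
import torch
from ignite.engine import Events
from ignite.handlers import Checkpoint, DiskSaver

def setup_checkpointing(trainer, model, optimizer, output_path, resume_from=None):
    # Objects whose state ends up in training_checkpoint_<iteration>.pt
    to_save = {"trainer": trainer, "model": model, "optimizer": optimizer}

    # Periodic checkpoints, e.g. every 1000 iterations
    handler = Checkpoint(
        to_save, DiskSaver(output_path, require_empty=False), filename_prefix="training", n_saved=2
    )
    trainer.add_event_handler(Events.ITERATION_COMPLETED(every=1000), handler)

    # --resume-from: restore the saved states before trainer.run(...) is called
    if resume_from is not None:
        checkpoint = torch.load(resume_from, map_location="cpu")
        Checkpoint.load_objects(to_load=to_save, checkpoint=checkpoint)
```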
If a ClearML server is used (i.e. the --with_clearml argument is passed), artifact uploading must be configured by modifying the ClearML configuration file ~/clearml.conf generated by clearml-init. According to the documentation, the output_uri argument can be configured via sdk.development.default_output_uri to point to the fileserver URI. If the server is self-hosted, the ClearML fileserver URI is http://localhost:8081.
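For example, the relevant part of ~/clearml.conf would look roughly like this (shown for a self-hosted server; adjust the URI to your deployment):

```
sdk {
    development {
        # upload artifacts (e.g. model checkpoints) to the ClearML fileserver
        default_output_uri: "http://localhost:8081"
    }
}
```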
For more details, see https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html