Implementation of SOTA vision transformer and MLP models on PaddlePaddle 2.0+
PaddlePaddle Visual Transformers (PPViT) is a collection of PaddlePaddle image models beyond convolution, mostly based on vision transformers, visual attention models, and MLPs. PPViT also integrates popular layers, utilities, optimizers, schedulers, data augmentations, and training/validation scripts for PaddlePaddle 2.0+. The aim is to reproduce a wide variety of SOTA ViT models with full training/validation procedures.
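For orientation, a vision transformer classifier splits an image into patches, embeds them as tokens, and runs a transformer encoder over the token sequence. The toy PaddlePaddle sketch below only illustrates that flow; it is not one of the repository's implementations, and all names in it are made up for illustration:

```python
import paddle
import paddle.nn as nn

class TinyViT(nn.Layer):
    """Toy ViT-style classifier: patch embedding -> transformer encoder -> classification head."""
    def __init__(self, image_size=224, patch_size=16, embed_dim=192, depth=4, num_heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding as a strided conv (equivalent to a linear projection of flattened patches).
        self.patch_embed = nn.Conv2D(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = paddle.create_parameter([1, 1, embed_dim], dtype='float32')
        self.pos_embed = paddle.create_parameter([1, num_patches + 1, embed_dim], dtype='float32')
        encoder_layer = nn.TransformerEncoderLayer(embed_dim, num_heads, embed_dim * 4)
        self.encoder = nn.TransformerEncoder(encoder_layer, depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose([0, 2, 1])   # [B, N, D] token sequence
        cls = self.cls_token.expand([x.shape[0], -1, -1])
        x = paddle.concat([cls, x], axis=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                                  # classify from the CLS token

logits = TinyViT()(paddle.randn([2, 3, 224, 224]))
print(logits.shape)  # [2, 1000]
```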
Image Classification:
- ViT (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)
- DeiT (Training data-efficient image transformers & distillation through attention)
- Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
- PVT (Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions)
- MLP-Mixer (MLP-Mixer: An all-MLP Architecture for Vision)
- ResMLP (ResMLP: Feedforward networks for image classification with data-efficient training)
- gMLP (Pay Attention to MLPs)
- VOLO (VOLO: Vision Outlooker for Visual Recognition)
- CaiT (Going deeper with Image Transformers)
- Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
- T2T-ViT (Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet)
- CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)
- Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
- HaloNet (Scaling Local Self-Attention for Parameter Efficient Visual Backbones)
- Refined-ViT (Refiner: Refining Self-attention for Vision Transformers)
- CrossViT (CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification)
Object Detection:
- Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
- Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
- PVT (Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions)
- Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
- CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)
- UP-DETR (UP-DETR: Unsupervised Pre-training for Object Detection with Transformers)
Semantic Segmentation:
- FTN (Fully Transformer Networks for Semantic Image Segmentation)
- Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
- Segmenter (Segmenter: Transformer for Semantic Segmentation)
- SegFormer (SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers)
- Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
- Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
- CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)
GAN:
- TransGAN (TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up)
- Styleformer (Styleformer: Transformer based Generative Adversarial Networks with Style Vector)
- ViTGAN (ViTGAN: Training GANs with Vision Transformers)
Image classification results on ImageNet2012 val:

Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link |
---|---|---|---|---|---|---|
vit_base_patch16_224 | 81.66 | 96.09 | 224 | 0.875 | bilinear | google/baidu(nxhy) |
vit_base_patch16_384 | 84.20 | 97.22 | 384 | 1.0 | bilinear | google/baidu(8ack) |
vit_large_patch16_224 | 83.00 | 96.47 | 224 | 0.875 | bicubic | google/baidu(g7ij) |
swin_base_patch4_window7_224 | 85.27 | 97.56 | 224 | 0.9 | bicubic | google/baidu(ps9m) |
swin_base_patch4_window12_384 | 86.43 | 98.07 | 384 | 1.0 | bicubic | google/baidu(ef9t) |
swin_large_patch4_window12_384 | 87.14 | 98.23 | 384 | 1.0 | bicubic | google/baidu(5shn) |
pvtv2_tiny_224 | 70.47 | 90.16 | 224 | 0.875 | bicubic | google/baidu(575w) |
pvtv2_medium_224 | 82.02 | 95.99 | 224 | 0.875 | bicubic | google/baidu(ezfc) |
pvtv2_large_224 | 83.77 | 96.61 | 224 | 0.875 | bicubic | google/baidu(fbc4) |
mixer_b16_224 | 76.60 | 92.23 | 224 | 0.875 | bicubic | google/baidu(xh8x) |
resmlp_24_224 | 79.38 | 94.55 | 224 | 0.875 | bicubic | google/baidu(jdcx) |
gmlp_s16_224 | 79.64 | 94.63 | 224 | 0.875 | bicubic | google/baidu(bcth) |
volo_d5_224_86.10 | 86.08 | 97.58 | 224 | 1.0 | bicubic | google/baidu(td49) |
volo_d5_512_87.07 | 87.05 | 97.97 | 512 | 1.15 | bicubic | google/baidu(irik) |
cait_xxs24_224 | 78.38 | 94.32 | 224 | 1.0 | bicubic | google/baidu(j9m8) |
cait_s24_384 | 85.05 | 97.34 | 384 | 1.0 | bicubic | google/baidu(qb86) |
cait_m48_448 | 86.49 | 97.75 | 448 | 1.0 | bicubic | google/baidu(imk5) |
deit_base_distilled_patch16_224 | 83.32 | 96.49 | 224 | 0.875 | bicubic | google/baidu(5f2g) |
deit_base_distilled_patch16_384 | 85.43 | 97.33 | 384 | 1.0 | bicubic | google/baidu(qgj2) |
shuffle_vit_tiny_patch4_window7 | 82.39 | 96.05 | 224 | 0.875 | bicubic | google/baidu(8a1i) |
shuffle_vit_small_patch4_window7 | 83.53 | 96.57 | 224 | 0.875 | bicubic | google/baidu(xwh3) |
shuffle_vit_base_patch4_window7 | 83.95 | 96.91 | 224 | 0.875 | bicubic | google/baidu(1gsr) |
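The Crop_pct and Interpolation columns describe the evaluation preprocessing: the short side is typically resized to image_size / crop_pct with the listed interpolation, then center-cropped to the evaluation size. A minimal sketch with paddle.vision.transforms (the exact pipeline used by each model in the repo may differ):

```python
from paddle.vision import transforms

def eval_transforms(image_size=224, crop_pct=0.875, interpolation='bicubic'):
    # Resize the short side to image_size / crop_pct, then center-crop to image_size.
    resize_size = int(image_size / crop_pct)
    return transforms.Compose([
        transforms.Resize(resize_size, interpolation=interpolation),
        transforms.CenterCrop(image_size),
        transforms.ToTensor(),
        # ImageNet statistics; some checkpoints (e.g. the original ViT weights) use 0.5/0.5 instead.
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
```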
Object detection results on COCO val2017:

Model | Backbone | box_mAP | Link |
---|---|---|---|
DETR | ResNet50 | 42.0 | google/baidu(n5gk) |
DETR | ResNet101 | 43.5 | google/baidu(bxz2) |
Semantic segmentation results on Pascal Context:

Model | Backbone | Batch_size | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
---|---|---|---|---|---|---|---|
SETR_Naive | ViT_large | 16 | 52.06 | 52.57 | google/baidu(owoj) | google/baidu(xdb8) | config |
SETR_PUP | ViT_large | 16 | 53.90 | 54.53 | google/baidu(owoj) | google/baidu(6sji) | config |
SETR_MLA | ViT_Large | 8 | 54.39 | 55.16 | google/baidu(owoj) | google/baidu(wora) | config |
SETR_MLA | ViT_large | 16 | 55.01 | 55.87 | google/baidu(owoj) | google/baidu(76h2) | config |
Semantic segmentation results on Cityscapes:

Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
---|---|---|---|---|---|---|---|---|
SETR_Naive | ViT_Large | 8 | 40k | 76.71 | - | google/baidu(owoj) | google/baidu(g7ro) | config |
SETR_Naive | ViT_Large | 8 | 80k | 77.31 | - | google/baidu(owoj) | google/baidu(wn6q) | config |
SETR_PUP | ViT_Large | 8 | 40k | 77.92 | - | google/baidu(owoj) | google/baidu(zmoi) | config |
SETR_PUP | ViT_Large | 8 | 80k | 78.81 | - | google/baidu(owoj) | baidu(f793) | config |
SETR_MLA | ViT_Large | 8 | 40k | 76.70 | - | google/baidu(owoj) | baidu(qaiw) | config |
SETR_MLA | ViT_Large | 8 | 80k | 77.26 | - | google/baidu(owoj) | baidu(6bgj) | config |
Semantic segmentation results on ADE20K:

Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
---|---|---|---|---|---|---|---|---|
SETR_Naive | ViT_Large | 16 | 160k | 47.57 | 48.12 | google/baidu(owoj) | baidu(lugq) | config |
SETR_PUP | ViT_Large | 16 | 160k | 49.12 | - | google/baidu(owoj) | baidu(udgs) | config |
SETR_MLA | ViT_Large | 8 | 160k | 47.80 | - | google/baidu(owoj) | baidu(mrrv) | config |
DPT | ViT_Large | 16 | 160k | 47.21 | - | google/baidu(owoj) | baidu(ts7h) | config |
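In these tables, mIoU (ss) is single-scale inference, while mIoU (ms+flip) averages predictions over several input scales plus horizontal flips. A rough sketch of that aggregation idea, not the repository's actual inference code, assuming the model returns a single per-pixel logits tensor:

```python
import paddle
import paddle.nn.functional as F

def multi_scale_flip_probs(model, image, scales=(0.75, 1.0, 1.25)):
    # image: [1, 3, H, W]; returns averaged per-pixel class probabilities at the original resolution.
    _, _, h, w = image.shape
    total = 0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear', align_corners=False)
        for flip in (False, True):
            x = paddle.flip(scaled, axis=[3]) if flip else scaled
            logits = model(x)                          # assumed shape: [1, num_classes, h', w']
            if flip:
                logits = paddle.flip(logits, axis=[3])  # flip back before accumulating
            logits = F.interpolate(logits, size=[h, w], mode='bilinear', align_corners=False)
            total = total + F.softmax(logits, axis=1)
    return total / (len(scales) * 2)
```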
To evaluate a model on a single GPU, run the script `sh run_eval.sh`, or run the Python script `python main_single_gpu.py` with proper settings. The script `run_eval.sh` calls the main Python script `main_single_gpu.py` with a number of options; usually you need to change the following settings, e.g., for the ViT base model:
```shell
python main_single_gpu.py \
    -cfg='./configs/vit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -data_path='/dataset/imagenet' \
    -batch_size=128 \
    -eval \
    -pretrained='./vit_base_patch16_224'
```
Note:
- The `-pretrained` option accepts the path of the pretrained weights file without the file extension (.pdparams); see the sketch below.
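In other words, the script is expected to append the extension itself before loading; roughly (a sketch, not the exact code in main_single_gpu.py):

```python
import paddle

def load_pretrained(model, pretrained_path):
    # pretrained_path is the value passed to -pretrained, e.g. './vit_base_patch16_224' (no extension);
    # the .pdparams extension is appended before loading the state dict.
    state_dict = paddle.load(pretrained_path + '.pdparams')
    model.set_state_dict(state_dict)
    return model
```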
To evaluate a model on multiple GPUs, run the script `sh run_eval_multi.sh`, or run the Python script `python main_multi_gpu.py` with proper settings. The script `run_eval_multi.sh` calls the main Python script `main_multi_gpu.py` with a number of options; usually you need to change the following settings, e.g., for the ViT base model:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python main_multi_gpu.py \
    -cfg='./configs/vit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -data_path='/dataset/imagenet' \
    -batch_size=128 \
    -eval \
    -pretrained='./vit_base_patch16_224' \
    -ngpus=8
```
Note:
- The `-pretrained` option accepts the path of the pretrained weights file without the file extension (.pdparams).
- If `-ngpus` is not set, all available GPU devices will be used (see the launch sketch below).
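Multi-GPU scripts in PaddlePaddle commonly launch one worker process per device via paddle.distributed; the sketch below shows that general pattern, and main_multi_gpu.py may be organized differently:

```python
import paddle.distributed as dist

def main_worker(*args):
    # One process per GPU; init_parallel_env sets up NCCL communication for this worker.
    dist.init_parallel_env()
    rank = dist.get_rank()
    # ... build the model and dataloader, wrap the model with paddle.DataParallel, then evaluate ...

if __name__ == '__main__':
    ngpus = 8  # corresponds to the -ngpus option
    dist.spawn(main_worker, nprocs=ngpus)
```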
To train a model on a single GPU, run the script `sh run_train.sh`, or run the Python script `python main_single_gpu.py` with proper settings. The script `run_train.sh` calls the main Python script `main_single_gpu.py` with a number of options; usually you need to change the following settings, e.g., for the ViT base model:
```shell
python main_single_gpu.py \
    -cfg='./configs/vit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -data_path='/dataset/imagenet' \
    -batch_size=128
```
Note:
- Training options such as the learning rate, image size, model layers, etc., can be changed in the `.yaml` file set in `-cfg`. All available settings can be found in `./config.py` (see the sketch below).
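The config system is typically a set of defaults in `./config.py` that is overridden by the YAML file passed via `-cfg` and then by command-line arguments. Conceptually, assuming a yacs-style CfgNode (field names here are illustrative and may differ from the repo's actual config.py):

```python
from yacs.config import CfgNode as CN

_C = CN()
_C.DATA = CN()
_C.DATA.IMAGE_SIZE = 224      # illustrative default, overridden by the YAML file
_C.DATA.BATCH_SIZE = 128
_C.TRAIN = CN()
_C.TRAIN.BASE_LR = 1e-3

def get_config(cfg_file=None):
    config = _C.clone()
    if cfg_file:
        config.merge_from_file(cfg_file)   # YAML values override the defaults above
    return config

# config = get_config('./configs/vit_base_patch16_224.yaml')
```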
To train a model on multiple GPUs, run the script `sh run_train_multi.sh`, or run the Python script `python main_multi_gpu.py` with proper settings. The script `run_train_multi.sh` calls the main Python script `main_multi_gpu.py` with a number of options; usually you need to change the following settings, e.g., for the ViT base model:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python main_multi_gpu.py \
    -cfg='./configs/vit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -data_path='/dataset/imagenet' \
    -batch_size=128 \
    -ngpus=8
```
Note:
- Training options such as the learning rate, image size, model layers, etc., can be changed in the `.yaml` file set in `-cfg`. All available settings can be found in `./config.py`.
- If `-ngpus` is not set, all available GPU devices will be used.
- Optimizers
- Schedulers
- DDP (distributed data parallel)
- Data Augmentation
- DropPath (see the sketch below)
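DropPath (stochastic depth) randomly drops entire residual branches per sample during training. A common PaddlePaddle implementation looks roughly like this; the version in the repo may differ in detail:

```python
import paddle
import paddle.nn as nn

class DropPath(nn.Layer):
    """Drop whole residual paths per sample (stochastic depth)."""
    def __init__(self, drop_prob=0.):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0. or not self.training:
            return x
        keep_prob = 1 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over all remaining dims.
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        mask = paddle.bernoulli(paddle.full(shape, keep_prob, dtype=x.dtype))
        return x / keep_prob * mask   # rescale so the expected output is unchanged
```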
We encourage and appreciate your contributions to the PPViT project; please refer to our workflow and code style guidelines in CONTRIBUTING.md.
This repo is under the Apache-2.0 license.