🤖 PaddlePaddle Visual Transformers (PaddleViT or PPViT) is a collection of vision models beyond convolution. Most of the models are based on visual transformers, visual attention, and MLPs. PaddleViT also integrates popular layers, utilities, optimizers, schedulers, data augmentations, and training/validation scripts for PaddlePaddle 2.0+. The aim is to reproduce a wide variety of state-of-the-art ViT and MLP models with full training/validation procedures. We are passionate about making cutting-edge CV techniques easier to use for everyone.
🤖 PaddleViT provides models and tools for a variety of vision tasks, such as classification, object detection, semantic segmentation, GAN, and more. Each model architecture is defined in a standalone Python module and can be modified to enable quick research experiments. At the same time, pretrained weights can be downloaded and used to finetune on your own datasets, as sketched below. PaddleViT also integrates popular tools and modules for customized datasets, data preprocessing, performance metrics, DDP, and more.
🤖 PaddleViT is backed by the popular deep learning framework PaddlePaddle; we also provide tutorials and projects on Paddle AI Studio. It's intuitive and straightforward to get started for new users.
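As a quick illustration of that finetuning workflow, the sketch below builds a model, loads downloaded `.pdparams` weights, and swaps the classification head. The helper names (`get_config`, `build_vit`) and the `classifier` attribute are assumptions for illustration only; the actual entry points live in each model's folder.

```python
import paddle
import paddle.nn as nn

# Hypothetical per-model helpers; check the concrete model folder for the real names.
from config import get_config        # assumption: reads the .yaml model config
from transformer import build_vit    # assumption: builds the ViT from that config

config = get_config('./configs/vit_base_patch16_224.yaml')
model = build_vit(config)

# Load .pdparams weights downloaded from the links in the tables below.
model.set_state_dict(paddle.load('./vit_base_patch16_224.pdparams'))

# Replace the classification head for a 10-class custom dataset
# (the attribute name `classifier` is an assumption).
model.classifier = nn.Linear(model.classifier.weight.shape[0], 10)
```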
- ViT (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)
- DeiT (Training data-efficient image transformers & distillation through attention)
- Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
- PVTv2 (PVTv2: Improved Baselines with Pyramid Vision Transformer)
- MLP-Mixer (MLP-Mixer: An all-MLP Architecture for Vision)
- ResMLP (ResMLP: Feedforward networks for image classification with data-efficient training)
- gMLP (Pay Attention to MLPs)
- VOLO (VOLO: Vision Outlooker for Visual Recognition)
- CaiT (Going deeper with Image Transformers)
- Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
- CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)
- T2T-ViT (Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet)
- Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
- HaloNet (Scaling Local Self-Attention for Parameter Efficient Visual Backbones)
- Refined-ViT (Refiner: Refining Self-attention for Vision Transformers)
- CrossViT (CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification)
- Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
- Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
- PVT (Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions)
- Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
- CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)
- UP-DETR (UP-DETR: Unsupervised Pre-training for Object Detection with Transformers)
- SETR (Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers)
- DPT (Vision Transformers for Dense Prediction)
- Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
- Segmenter (Segmenter: Transformer for Semantic Segmentation)
- FTN (Fully Transformer Networks for Semantic Image Segmentation)
- SegFormer (SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers)
- Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
- Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
- CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)
- TransGAN (TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up)
- Styleformer (Styleformer: Transformer based Generative Adversarial Networks with Style Vector)
- ViTGAN (ViTGAN: Training GANs with Vision Transformers)
Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link |
---|---|---|---|---|---|---|
vit_base_patch16_224 | 84.58 | 97.30 | 224 | 0.875 | bicubic | google/baidu(qv4n) |
vit_base_patch16_384 | 85.99 | 98.00 | 384 | 1.0 | bicubic | google/baidu(wsum) |
vit_large_patch16_224 | 85.81 | 97.82 | 224 | 0.875 | bicubic | google/baidu(1bgk) |
swin_base_patch4_window7_224 | 85.27 | 97.56 | 224 | 0.9 | bicubic | google/baidu(wyck) |
swin_base_patch4_window12_384 | 86.43 | 98.07 | 384 | 1.0 | bicubic | google/baidu(4a95) |
swin_large_patch4_window12_384 | 87.14 | 98.23 | 384 | 1.0 | bicubic | google/baidu(j71u) |
pvtv2_b0 | 70.47 | 90.16 | 224 | 0.875 | bicubic | google/baidu(dxgb) |
pvtv2_b1 | 78.70 | 94.49 | 224 | 0.875 | bicubic | google/baidu(2e5m) |
pvtv2_b2 | 82.02 | 95.99 | 224 | 0.875 | bicubic | google/baidu(are2) |
pvtv2_b3 | 83.14 | 96.47 | 224 | 0.875 | bicubic | google/baidu(nc21) |
pvtv2_b4 | 83.61 | 96.69 | 224 | 0.875 | bicubic | google/baidu(tthf) |
pvtv2_b5 | 83.77 | 96.61 | 224 | 0.875 | bicubic | google/baidu(9v6n) |
pvtv2_b2_linear | 82.06 | 96.04 | 224 | 0.875 | bicubic | google/baidu(a4c8) |
mlp_mixer_b16_224 | 76.60 | 92.23 | 224 | 0.875 | bicubic | google/baidu(xh8x) |
mlp_mixer_l16_224 | 72.06 | 87.67 | 224 | 0.875 | bicubic | google/baidu(8q7r) |
resmlp_24_224 | 79.38 | 94.55 | 224 | 0.875 | bicubic | google/baidu(jdcx) |
resmlp_36_224 | 79.77 | 94.89 | 224 | 0.875 | bicubic | google/baidu(33w3) |
resmlp_big_24_224 | 81.04 | 95.02 | 224 | 0.875 | bicubic | google/baidu(r9kb) |
resmlp_big_24_distilled_224 | 83.59 | 96.65 | 224 | 0.875 | bicubic | google/baidu(4jk5) |
gmlp_s16_224 | 79.64 | 94.63 | 224 | 0.875 | bicubic | google/baidu(bcth) |
volo_d5_224_86.10 | 86.08 | 97.58 | 224 | 1.0 | bicubic | google/baidu(td49) |
volo_d5_512_87.07 | 87.05 | 97.97 | 512 | 1.15 | bicubic | google/baidu(irik) |
cait_xxs24_224 | 78.38 | 94.32 | 224 | 1.0 | bicubic | google/baidu(j9m8) |
cait_s24_384 | 85.05 | 97.34 | 384 | 1.0 | bicubic | google/baidu(qb86) |
cait_m48_448 | 86.49 | 97.75 | 448 | 1.0 | bicubic | google/baidu(imk5) |
deit_base_distilled_patch16_224 | 83.32 | 96.49 | 224 | 0.875 | bicubic | google/baidu(5f2g) |
deit_base_distilled_patch16_384 | 85.43 | 97.33 | 384 | 1.0 | bicubic | google/baidu(qgj2) |
shuffle_vit_tiny_patch4_window7 | 82.39 | 96.05 | 224 | 0.875 | bicubic | google/baidu(8a1i) |
shuffle_vit_small_patch4_window7 | 83.53 | 96.57 | 224 | 0.875 | bicubic | google/baidu(xwh3) |
shuffle_vit_base_patch4_window7 | 83.95 | 96.91 | 224 | 0.875 | bicubic | google/baidu(1gsr) |
cswin_tiny_224 | 82.81 | 96.30 | 224 | 0.9 | bicubic | google/baidu(4q3h) |
cswin_small_224 | 83.60 | 96.58 | 224 | 0.9 | bicubic | google/baidu(gt1a) |
cswin_base_224 | 84.23 | 96.91 | 224 | 0.9 | bicubic | google/baidu(wj8p) |
cswin_large_224 | 86.52 | 97.99 | 224 | 0.9 | bicubic | google/baidu(b5fs) |
cswin_base_384 | 85.51 | 97.48 | 384 | 1.0 | bicubic | google/baidu(rkf5) |
cswin_large_384 | 87.49 | 98.35 | 384 | 1.0 | bicubic | google/baidu(6235) |
t2t_vit_7 | 71.68 | 90.89 | 224 | 0.9 | bicubic | google/baidu(1hpa) |
t2t_vit_10 | 75.15 | 92.80 | 224 | 0.9 | bicubic | google/baidu(ixug) |
t2t_vit_12 | 76.48 | 93.49 | 224 | 0.9 | bicubic | google/baidu(qpbb) |
t2t_vit_14 | 81.50 | 95.67 | 224 | 0.9 | bicubic | google/baidu(c2u8) |
t2t_vit_19 | 81.93 | 95.74 | 224 | 0.9 | bicubic | google/baidu(4in3) |
t2t_vit_24 | 82.28 | 95.89 | 224 | 0.9 | bicubic | google/baidu(4in3) |
t2t_vit_t_14 | 81.69 | 95.85 | 224 | 0.9 | bicubic | google/baidu(4in3) |
t2t_vit_t_19 | 82.44 | 96.08 | 224 | 0.9 | bicubic | google/baidu(mier) |
t2t_vit_t_24 | 82.55 | 96.07 | 224 | 0.9 | bicubic | google/baidu(6vxc) |
t2t_vit_14_384 | 83.34 | 96.50 | 384 | 1.0 | bicubic | google/baidu(r685) |
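The Image Size, Crop_pct, and Interpolation columns above describe the evaluation preprocessing. Assuming the common convention (resize the short side to image_size / crop_pct with the listed interpolation, then center-crop), the transform for a 224 model with crop_pct 0.875 looks roughly like this; the normalization statistics are the usual ImageNet values and may differ per model:

```python
from paddle.vision import transforms

image_size, crop_pct = 224, 0.875
scale_size = int(image_size / crop_pct)   # 256 for 224 @ 0.875

eval_transforms = transforms.Compose([
    transforms.Resize(scale_size, interpolation='bicubic'),  # matches the Interpolation column
    transforms.CenterCrop(image_size),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```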
Model | Backbone | box_mAP | Link |
---|---|---|---|
DETR | ResNet50 | 42.0 | google/baidu(n5gk) |
DETR | ResNet101 | 43.5 | google/baidu(bxz2) |
Model | Backbone | Batch_size | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
---|---|---|---|---|---|---|---|
SETR_Naive | ViT_large | 16 | 52.06 | 52.57 | google/baidu(owoj) | google/baidu(xdb8) | config |
SETR_PUP | ViT_large | 16 | 53.90 | 54.53 | google/baidu(owoj) | google/baidu(6sji) | config |
SETR_MLA | ViT_Large | 8 | 54.39 | 55.16 | google/baidu(owoj) | google/baidu(wora) | config |
SETR_MLA | ViT_large | 16 | 55.01 | 55.87 | google/baidu(owoj) | google/baidu(76h2) | config |
Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
---|---|---|---|---|---|---|---|---|
SETR_Naive | ViT_Large | 8 | 40k | 76.71 | 79.03 | google/baidu(owoj) | google/baidu(g7ro) | config |
SETR_Naive | ViT_Large | 8 | 80k | 77.31 | 79.43 | google/baidu(owoj) | google/baidu(wn6q) | config |
SETR_PUP | ViT_Large | 8 | 40k | 77.92 | 79.63 | google/baidu(owoj) | google/baidu(zmoi) | config |
SETR_PUP | ViT_Large | 8 | 80k | 78.81 | 80.43 | google/baidu(owoj) | baidu(f793) | config |
SETR_MLA | ViT_Large | 8 | 40k | 76.70 | 78.96 | google/baidu(owoj) | baidu(qaiw) | config |
SETR_MLA | ViT_Large | 8 | 80k | 77.26 | 79.27 | google/baidu(owoj) | baidu(6bgj) | config |
Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
---|---|---|---|---|---|---|---|---|
SETR_Naive | ViT_Large | 16 | 160k | 47.57 | 48.12 | google/baidu(owoj) | baidu(lugq) | config |
SETR_PUP | ViT_Large | 16 | 160k | 49.12 | - | google/baidu(owoj) | baidu(udgs) | config |
SETR_MLA | ViT_Large | 8 | 160k | 47.80 | - | google/baidu(owoj) | baidu(mrrv) | config |
DPT | ViT_Large | 16 | 160k | 47.21 | - | google/baidu(owoj) | baidu(ts7h) | config |
Segmenter | ViT_Tiny | 16 | 160k | 38.45 | - | TODO | baidu(1k97) | config |
Segmenter | ViT_Small | 16 | 160k | 46.07 | - | TODO | baidu(i8nv) | config |
Segmenter | ViT_Base | 16 | 160k | 49.08 | - | TODO | baidu(hxrl) | config |
Segmenter_Linear | DeiT_Base | 16 | 160k | 47.34 | - | TODO | baidu(5dpv) | config |
Segmenter | DeiT_Base | 16 | 160k | 49.27 | - | TODO | baidu(3kim) | config |
UperNet | Swin_Tiny | 16 | 160k | 44.90 | 45.37 | - | baidu(lkhg) | config |
UperNet | Swin_Small | 16 | 160k | 47.88 | 48.90 | - | baidu(vvy1) | config |
UperNet | Swin_Base | 16 | 160k | 48.59 | 49.04 | - | baidu(y040) | config |
Model | FID | Image Size | Crop_pct | Interpolation | Link |
---|---|---|---|---|---|
styleformer_cifar10 | 2.73 | 32 | 1.0 | lanczos | google/baidu(7cg2) |
styleformer_stl10 | 15.65 | 48 | 1.0 | lanczos | google/baidu(8pus) |
styleformer_celeba | 3.32 | 64 | 1.0 | lanczos | google/baidu(ymh7) |
styleformer_lsun | 9.68 | 128 | 1.0 | lanczos | google/baidu(ue28) |
*The results are evaluated on the CIFAR10, STL10, CelebA, and LSUN Church datasets using the fid50k_full metric.
To evaluate a model on ImageNet2012 using a single GPU, run `sh run_eval.sh`, or run the Python script `python main_single_gpu.py` directly with proper settings. The script `run_eval.sh` calls the main Python script `main_single_gpu.py` with a number of options; usually you need to change the following settings, e.g., for the ViT base model:
python main_single_gpu.py \
-cfg='./configs/vit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-data_path='/dataset/imagenet' \
-batch_size=128 \
-eval \
-pretrained='./vit_base_patch16_224'
Note:
- The `-pretrained` option accepts the path of the pretrained weights file without the file extension (`.pdparams`).
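For clarity, a minimal sketch of how a `-pretrained` path without the extension is typically resolved; the `nn.Linear` here is only a stand-in for the model built from the `.yaml` config:

```python
import paddle
import paddle.nn as nn

model = nn.Linear(768, 1000)  # stand-in for the model built from the .yaml config
paddle.save(model.state_dict(), './vit_base_patch16_224.pdparams')

pretrained = './vit_base_patch16_224'  # value passed to -pretrained, no extension
model.set_state_dict(paddle.load(pretrained + '.pdparams'))  # the script appends .pdparams
```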
To evaluate a model on ImageNet2012 using multiple GPUs, run `sh run_eval_multi.sh`, or run the Python script `python main_multi_gpu.py` directly with proper settings. The script `run_eval_multi.sh` calls the main Python script `main_multi_gpu.py` with a number of options; usually you need to change the following settings, e.g., for the ViT base model:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python main_multi_gpu.py \
-cfg='./configs/vit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-data_path='/dataset/imagenet' \
-batch_size=128 \
-eval \
-pretrained='./vit_base_patch16_224' \
-ngpus=8
Note:
- The `-pretrained` option accepts the path of the pretrained weights file without the file extension (`.pdparams`).
- If `-ngpus` is not set, all available GPU devices will be used.
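A rough sketch of how `-ngpus` maps onto PaddlePaddle's multi-process launch; the worker body is only a placeholder, not the actual `main_multi_gpu.py` internals:

```python
import paddle.distributed as dist

def worker():
    # Each spawned process drives one GPU; set up the process group first.
    dist.init_parallel_env()
    print(f'rank {dist.get_rank()} of {dist.get_world_size()} ready')
    # ... build the model, load the -pretrained weights, run the eval loop ...

if __name__ == '__main__':
    # nprocs corresponds to -ngpus; nprocs=-1 would use all visible GPUs.
    dist.spawn(worker, nprocs=8)
```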
To train a model on ImageNet2012 using a single GPU, run `sh run_train.sh`, or run the Python script `python main_single_gpu.py` directly with proper settings. The script `run_train.sh` calls the main Python script `main_single_gpu.py` with a number of options; usually you need to change the following settings, e.g., for the ViT base model:
python main_single_gpu.py \
-cfg='./configs/vit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-data_path='/dataset/imagenet' \
-batch_size=128
Note:
- The training options such as lr, image size, model layers, etc., can be changed in the `.yaml` file set in `-cfg`. All available settings can be found in `./config.py`.
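The split between `./config.py` defaults and the `-cfg` `.yaml` overrides can be pictured as below, assuming a yacs-style `CfgNode`; the field names are illustrative, not the exact keys:

```python
from yacs.config import CfgNode

# Defaults as they might appear in ./config.py (illustrative field names).
_C = CfgNode()
_C.DATA = CfgNode()
_C.DATA.IMAGE_SIZE = 224
_C.TRAIN = CfgNode()
_C.TRAIN.BASE_LR = 1e-3
_C.TRAIN.NUM_EPOCHS = 300

cfg = _C.clone()
# The real scripts would call cfg.merge_from_file(<the -cfg .yaml>); merge_from_list
# is used here so the snippet runs without an external file.
cfg.merge_from_list(['DATA.IMAGE_SIZE', 384, 'TRAIN.BASE_LR', 5e-4])
cfg.freeze()
print(cfg.DATA.IMAGE_SIZE, cfg.TRAIN.BASE_LR)
```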
To train a model on ImageNet2012 using multiple GPUs, run `sh run_train_multi.sh`, or run the Python script `python main_multi_gpu.py` directly with proper settings. The script `run_train_multi.sh` calls the main Python script `main_multi_gpu.py` with a number of options; usually you need to change the following settings, e.g., for the ViT base model:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python main_multi_gpu.py \
-cfg='./configs/vit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-data_path='/dataset/imagenet' \
-batch_size=128 \
-ngpus=8
Note:
- The training options such as lr, image size, model layers, etc., can be changed in the `.yaml` file set in `-cfg`. All available settings can be found in `./config.py`.
- If `-ngpus` is not set, all available GPU devices will be used.
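For reference, the DDP pattern behind multi-GPU training in PaddlePaddle looks roughly like the sketch below; the tiny linear model and random batch are stand-ins for the ViT and the ImageNet loader:

```python
import paddle
import paddle.distributed as dist

def train_worker():
    dist.init_parallel_env()
    model = paddle.DataParallel(paddle.nn.Linear(768, 1000))  # gradient all-reduce across GPUs
    opt = paddle.optimizer.AdamW(learning_rate=1e-3, parameters=model.parameters())

    x = paddle.randn([8, 768])   # stand-in batch
    loss = model(x).mean()
    loss.backward()              # gradients synchronized here
    opt.step()
    opt.clear_grad()

if __name__ == '__main__':
    dist.spawn(train_worker, nprocs=2)  # e.g. -ngpus=2
```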
- Optimizers
- Schedulers
- DDP
- Data Augmentation
- DropPath (see the sketch below)
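DropPath (stochastic depth) in particular can be summarized with the following sketch against the PaddlePaddle API; it shows the standard technique rather than PaddleViT's exact implementation:

```python
import paddle
import paddle.nn as nn

class DropPath(nn.Layer):
    """Randomly drop the residual branch for each sample (stochastic depth)."""
    def __init__(self, drop_prob=0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0.0 or not self.training:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over the remaining dims.
        shape = [x.shape[0]] + [1] * (x.ndim - 1)
        mask = paddle.floor(keep_prob + paddle.rand(shape, dtype=x.dtype))
        return x / keep_prob * mask  # rescale so the expected output is unchanged

# Typical use inside a transformer block: x = x + drop_path(attn(norm(x)))
drop_path = DropPath(drop_prob=0.1)
drop_path.train()
y = drop_path(paddle.randn([4, 197, 768]))
```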
We encourage and appreciate your contributions to the PPViT project; please refer to our workflow and code style in CONTRIBUTING.md.
This repo is under the Apache-2.0 license.