PaddleViT


State-of-the-art Visual Transformer and MLP Models for Paddle 2.0

🤖 PaddlePaddle Visual Transformers (PaddleViT or PPViT) is a collection of vision models beyond convolution. Most of the models are based on visual transformers, visual attention, and MLPs. PaddleViT also integrates popular layers, utilities, optimizers, schedulers, data augmentations, and training/validation scripts for PaddlePaddle 2.0+. The aim is to reproduce a wide variety of state-of-the-art ViT and MLP models with full training/validation procedures. We are passionate about making cutting-edge CV techniques easier to use for everyone.

🤖 PaddleViT provides models and tools for a variety of vision tasks, such as image classification, object detection, semantic segmentation, GAN, and more. Each model architecture is defined in a standalone Python module and can be modified to enable quick research experiments. At the same time, pretrained weights can be downloaded and used to finetune on your own datasets. PaddleViT also integrates popular tools and modules for customized datasets, data preprocessing, performance metrics, DDP, and more.

🤖 PaddleViT is backed by the popular deep learning framework PaddlePaddle. We also provide tutorials and projects on Paddle AI Studio, so it's intuitive and straightforward to get started for new users.
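
As a rough illustration of that workflow, the minimal sketch below builds a classification model from one of the standalone modules and loads ported weights for inference or finetuning. The import paths and helper names are placeholders rather than the exact API; the real builders and config helpers live in each model's own module and in ./config.py.

import paddle
from config import get_config          # placeholder name, see ./config.py in this repo
from transformer import build_vit      # placeholder import path for a standalone model module

config = get_config()                                        # default settings, optionally merged with a .yaml file
model = build_vit(config)                                    # the architecture defined in its standalone module
state_dict = paddle.load('./vit_base_patch16_224.pdparams')  # ported/pretrained weights
model.set_state_dict(state_dict)

model.eval()
logits = model(paddle.randn([1, 3, 224, 224]))               # dummy forward pass on a 224x224 image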

Quick Tour

- Image Classification

- Object Detection

- Semantic Segmentation

- GAN

Installation

Model architectures

Image Classification

Now:

  1. ViT (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)
  2. DeiT (Training data-efficient image transformers & distillation through attention)
  3. Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
  4. PVTv2 (PVTv2: Improved Baselines with Pyramid Vision Transformer)
  5. MLP-Mixer (MLP-Mixer: An all-MLP Architecture for Vision)
  6. ResMLP (ResMLP: Feedforward networks for image classification with data-efficient training)
  7. gMLP (Pay Attention to MLPs)
  8. VOLO (VOLO: Vision Outlooker for Visual Recognition)
  9. CaiT (Going deeper with Image Transformers)
  10. Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
  11. CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)
  12. T2T-ViT (Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet)

Coming Soon:

  1. Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
  2. HaloNet (Scaling Local Self-Attention for Parameter Efficient Visual Backbones)
  3. Refined-ViT (Refiner: Refining Self-attention for Vision Transformers)
  4. CrossViT (CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification)

Detection

Now:

  1. DETR (End-to-End Object Detection with Transformers)

Coming Soon:

  1. Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
  2. Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
  3. PVT (Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions)
  4. Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
  5. CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)
  6. UP-DETR (UP-DETR: Unsupervised Pre-training for Object Detection with Transformers)

Semantic Segmentation

Now:

  1. SETR (Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers)
  2. DPT (Vision Transformers for Dense Prediction)
  3. Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
  4. Segmenter (Segmenter: Transformer for Semantic Segmentation)

Coming Soon:

  1. FTN (Fully Transformer Networks for Semantic Image Segmentation)
  2. SegFormer (SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers)
  3. Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
  4. Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
  5. CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)

GAN

Coming Soon:

  1. TransGAN (TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up)
  2. Styleformer (Styleformer: Transformer based Generative Adversarial Networks with Style Vector)
  3. ViTGAN (ViTGAN: Training GANs with Vision Transformers)

Results (Ported Weights)

Image Classification

Model Acc@1 Acc@5 Image Size Crop_pct Interpolation Link
vit_base_patch16_224 84.58 97.30 224 0.875 bicubic google/baidu(qv4n)
vit_base_patch16_384 85.99 98.00 384 1.0 bicubic google/baidu(wsum)
vit_large_patch16_224 85.81 97.82 224 0.875 bicubic google/baidu(1bgk)
swin_base_patch4_window7_224 85.27 97.56 224 0.9 bicubic google/baidu(wyck)
swin_base_patch4_window12_384 86.43 98.07 384 1.0 bicubic google/baidu(4a95)
swin_large_patch4_window12_384 87.14 98.23 384 1.0 bicubic google/baidu(j71u)
pvtv2_b0 70.47 90.16 224 0.875 bicubic google/baidu(dxgb)
pvtv2_b1 78.70 94.49 224 0.875 bicubic google/baidu(2e5m)
pvtv2_b2 82.02 95.99 224 0.875 bicubic google/baidu(are2)
pvtv2_b3 83.14 96.47 224 0.875 bicubic google/baidu(nc21)
pvtv2_b4 83.61 96.69 224 0.875 bicubic google/baidu(tthf)
pvtv2_b5 83.77 96.61 224 0.875 bicubic google/baidu(9v6n)
pvtv2_b2_linear 82.06 96.04 224 0.875 bicubic google/baidu(a4c8)
mlp_mixer_b16_224 76.60 92.23 224 0.875 bicubic google/baidu(xh8x)
mlp_mixer_l16_224 72.06 87.67 224 0.875 bicubic google/baidu(8q7r)
resmlp_24_224 79.38 94.55 224 0.875 bicubic google/baidu(jdcx)
resmlp_36_224 79.77 94.89 224 0.875 bicubic google/baidu(33w3)
resmlp_big_24_224 81.04 95.02 224 0.875 bicubic google/baidu(r9kb)
resmlp_big_24_distilled_224 83.59 96.65 224 0.875 bicubic google/baidu(4jk5)
gmlp_s16_224 79.64 94.63 224 0.875 bicubic google/baidu(bcth)
volo_d5_224_86.10 86.08 97.58 224 1.0 bicubic google/baidu(td49)
volo_d5_512_87.07 87.05 97.97 512 1.15 bicubic google/baidu(irik)
cait_xxs24_224 78.38 94.32 224 1.0 bicubic google/baidu(j9m8)
cait_s24_384 85.05 97.34 384 1.0 bicubic google/baidu(qb86)
cait_m48_448 86.49 97.75 448 1.0 bicubic google/baidu(imk5)
deit_base_distilled_patch16_224 83.32 96.49 224 0.875 bicubic google/baidu(5f2g)
deit_base_distilled_patch16_384 85.43 97.33 384 1.0 bicubic google/baidu(qgj2)
shuffle_vit_tiny_patch4_window7 82.39 96.05 224 0.875 bicubic google/baidu(8a1i)
shuffle_vit_small_patch4_window7 83.53 96.57 224 0.875 bicubic google/baidu(xwh3)
shuffle_vit_base_patch4_window7 83.95 96.91 224 0.875 bicubic google/baidu(1gsr)
cswin_tiny_224 82.81 96.30 224 0.9 bicubic google/baidu(4q3h)
cswin_small_224 83.60 96.58 224 0.9 bicubic google/baidu(gt1a)
cswin_base_224 84.23 96.91 224 0.9 bicubic google/baidu(wj8p)
cswin_large_224 86.52 97.99 224 0.9 bicubic google/baidu(b5fs)
cswin_base_384 85.51 97.48 384 1.0 bicubic google/baidu(rkf5)
cswin_large_384 87.49 98.35 384 1.0 bicubic google/baidu(6235)
t2t_vit_7 71.68 90.89 224 0.9 bicubic google/baidu(1hpa)
t2t_vit_10 75.15 92.80 224 0.9 bicubic google/baidu(ixug)
t2t_vit_12 76.48 93.49 224 0.9 bicubic google/baidu(qpbb)
t2t_vit_14 81.50 95.67 224 0.9 bicubic google/baidu(c2u8)
t2t_vit_19 81.93 95.74 224 0.9 bicubic google/baidu(4in3)
t2t_vit_24 82.28 95.89 224 0.9 bicubic google/baidu(4in3)
t2t_vit_t_14 81.69 95.85 224 0.9 bicubic google/baidu(4in3)
t2t_vit_t_19 82.44 96.08 224 0.9 bicubic google/baidu(mier)
t2t_vit_t_24 82.55 96.07 224 0.9 bicubic google/baidu(6vxc)
t2t_vit_14_384 83.34 96.50 384 1.0 bicubic google/baidu(r685)

Object Detection

Model Backbone box_mAP Link
DETR ResNet50 42.0 google/baidu(n5gk)
DETR ResNet101 43.5 google/baidu(bxz2)

Semantic Segmentation

Pascal Context

Model Backbone Batch_size mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
SETR_Naive ViT_large 16 52.06 52.57 google/baidu(owoj) google/baidu(xdb8) config
SETR_PUP ViT_large 16 53.90 54.53 google/baidu(owoj) google/baidu(6sji) config
SETR_MLA ViT_Large 8 54.39 55.16 google/baidu(owoj) google/baidu(wora) config
SETR_MLA ViT_large 16 55.01 55.87 google/baidu(owoj) google/baidu(76h2) config

Cityscapes

Model Backbone Batch_size Iteration mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
SETR_Naive ViT_Large 8 40k 76.71 79.03 google/baidu(owoj) google/baidu(g7ro) config
SETR_Naive ViT_Large 8 80k 77.31 79.43 google/baidu(owoj) google/baidu(wn6q) config
SETR_PUP ViT_Large 8 40k 77.92 79.63 google/baidu(owoj) google/baidu(zmoi) config
SETR_PUP ViT_Large 8 80k 78.81 80.43 google/baidu(owoj) baidu(f793) config
SETR_MLA ViT_Large 8 40k 76.70 78.96 google/baidu(owoj) baidu(qaiw) config
SETR_MLA ViT_Large 8 80k 77.26 79.27 google/baidu(owoj) baidu(6bgj) config

ADE20K

Model Backbone Batch_size Iteration mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
SETR_Naive ViT_Large 16 160k 47.57 48.12 google/baidu(owoj) baidu(lugq) config
SETR_PUP ViT_Large 16 160k 49.12 - google/baidu(owoj) baidu(udgs) config
SETR_MLA ViT_Large 8 160k 47.80 - google/baidu(owoj) baidu(mrrv) config
DPT ViT_Large 16 160k 47.21 - google/baidu(owoj) baidu(ts7h) config
Segmenter ViT_Tiny 16 160k 38.45 - TODO baidu(1k97) config
Segmenter ViT_Small 16 160k 46.07 - TODO baidu(i8nv) config
Segmenter ViT_Base 16 160k 49.08 - TODO baidu(hxrl) config
Segmenter_Linear DeiT_Base 16 160k 47.34 - TODO baidu(5dpv) config
Segmenter DeiT_Base 16 160k 49.27 - TODO baidu(3kim) config
UperNet Swin_Tiny 16 160k 44.90 45.37 - baidu(lkhg) config
UperNet Swin_Small 16 160k 47.88 48.90 - baidu(vvy1) config
UperNet Swin_Base 16 160k 48.59 49.04 - baidu(y040) config

GAN

Model FID Image Size Crop_pct Interpolation Link
styleformer_cifar10 2.73 32 1.0 lanczos google/baidu(7cg2)
styleformer_stl10 15.65 48 1.0 lanczos google/baidu(8pus)
styleformer_celeba 3.32 64 1.0 lanczos google/baidu(ymh7)
styleformer_lsun 9.68 128 1.0 lanczos google/baidu(ue28)

*The results are evaluated on the CIFAR-10, STL-10, CelebA, and LSUN-church datasets using the fid50k_full metric.

Results (Self-Trained Weights)

Image Classification

Object Detection

Segmentation

GAN

Validation Scripts

Run on single GPU:

sh run_eval.sh

or you can run the python script:

python main_single_gpu.py

with proper settings.

The script run_eval.sh calls the main python script main_single_gpu.py with a number of options. Usually you need to change the following settings, e.g., for the ViT base model:

python main_single_gpu.py \
-cfg='./configs/vit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-data_path='/dataset/imagenet' \
-batch_size=128 \
-eval \
-pretrained='./vit_base_patch16_224'

Note:

  • The -pretrained option accepts the path to the pretrained weights file without the file extension (.pdparams), as shown in the sketch below.
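
For clarity, here is a minimal sketch of what that convention implies if you load the weights yourself in PaddlePaddle; the actual loading code lives in main_single_gpu.py and may differ.

import paddle

pretrained = './vit_base_patch16_224'               # value passed to -pretrained, no extension
state_dict = paddle.load(pretrained + '.pdparams')  # the script appends the .pdparams extension
# model.set_state_dict(state_dict)                  # then applied to the model built from -cfg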

Run on multi GPU:

sh run_eval_multi.sh

or you can run the python script:

python main_multi_gpu.py

with proper settings.

The script run_eval_multi.sh calls the main python script main_multi_gpu.py with a number of options. Usually you need to change the following settings, e.g., for the ViT base model:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python main_multi_gpu.py \
-cfg='./configs/vit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-data_path='/dataset/imagenet' \
-batch_size=128 \
-eval \
-pretrained='./vit_base_patch16_224' \
-ngpus=8

Note:

  • The -pretrained option accepts the path to the pretrained weights file without the file extension (.pdparams).

  • If -ngpus is not set, all the available GPU devices will be used.

Training Scripts

Train on single GPU:

sh run_train.sh

or you can run the python script:

python main_single_gpu.py

with proper settings.

The script run_train.sh calls the main python script main_single_gpu.py with a number of options. Usually you need to change the following settings, e.g., for the ViT base model:

python main_single_gpu.py \
-cfg='./configs/vit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-data_path='/dataset/imagenet' \
-batch_size=128

Note:

  • Training options such as the learning rate, image size, and model layers can be changed in the .yaml file set in -cfg. All the available settings can be found in ./config.py (see the sketch below).
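
As an illustration only (not the repo's actual config.py), merging a YAML override file into default options typically looks like the following; the option names here are hypothetical.

import yaml  # requires PyYAML

defaults = {'BASE_LR': 1e-3, 'IMAGE_SIZE': 224, 'NUM_EPOCHS': 300}  # hypothetical option names

with open('./configs/vit_base_patch16_224.yaml') as f:
    overrides = yaml.safe_load(f) or {}

config = {**defaults, **overrides}  # values from the -cfg file take precedence over defaults
print(config)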

Run on multi GPU:

sh run_train_multi.sh

or you can run the python script:

python main_multi_gpu.py

with proper settings.

The script run_train_multi.sh calls the main python script main_multi_gpu.py with a number of options. Usually you need to change the following settings, e.g., for the ViT base model:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python main_multi_gpu.py \
-cfg='./configs/vit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-data_path='/dataset/imagenet' \
-batch_size=128 \
-ngpus=8

Note:

  • Training options such as the learning rate, image size, and model layers can be changed in the .yaml file set in -cfg. All the available settings can be found in ./config.py.
  • If -ngpus is not set, all the available GPU devices will be used.

Features

  • Optimizers
  • Schedulers
  • DDP
  • Data Augmentation
  • DropPath

Contributing

We encourage and appreciate your contributions to the PPViT project. Please refer to our workflow and work style in CONTRIBUTING.md.

Licenses

Code

This repo is under the Apache-2.0 license.

Pretrained Weights

Citing