
PPViT

Implementation of SOTA visual transformer and MLP models on PaddlePaddle 2.0+

Introduction

PaddlePaddle Visual Transformers (PPViT) is a collection of PaddlePaddle image models beyond convolution, mostly based on visual transformers, visual attention, and MLP architectures. PPViT also integrates popular layers, utilities, optimizers, schedulers, data augmentations, and training/validation scripts for PaddlePaddle 2.0+. The aim is to reproduce a wide variety of SOTA ViT models with full training/validation procedures.

Models

Image Classification

Now:

  1. ViT (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)
  2. DeiT (Training data-efficient image transformers & distillation through attention)
  3. Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
  4. PVT (Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions)
  5. MLP-Mixer (MLP-Mixer: An all-MLP Architecture for Vision)
  6. ResMLP (ResMLP: Feedforward networks for image classification with data-efficient training)
  7. gMLP (Pay Attention to MLPs)
  8. VOLO (VOLO: Vision Outlooker for Visual Recognition)
  9. CaiT (Going deeper with Image Transformers)
  10. Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)

Coming Soon:

  1. T2T-ViT (Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet)

  2. CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)

  3. Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)

  4. HaloNet (Scaling Local Self-Attention for Parameter Efficient Visual Backbones)

  5. Refined-ViT (Refiner: Refining Self-attention for Vision Transformers)

  6. CrossViT (CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification)

Detection

Now:

  1. DETR (End-to-End Object Detection with Transformers)

Coming Soon:

  1. Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
  2. Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
  3. PVT (Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions)
  4. Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
  5. CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)
  6. UP-DETR (UP-DETR: Unsupervised Pre-training for Object Detection with Transformers)

Semantic Segmentation

Now:

  1. SETR (Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers)

Coming Soon:

  1. FTN (Fully Transformer Networks for Semantic Image Segmentation)
  2. Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
  3. Segmenter (Segmenter: Transformer for Semantic Segmentation)
  4. SegFormer (SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers)
  5. Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
  6. Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
  7. CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)

GAN

Coming Soon:

  1. TransGAN (TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up)
  2. Styleformer (Styleformer: Transformer based Generative Adversarial Networks with Style Vector)
  3. ViTGAN (ViTGAN: Training GANs with Vision Transformers)

Results (Ported Weights)

Image Classification

Model Acc@1 Acc@5 Image Size Crop_pct Interpolation Link
vit_base_patch16_224 81.66 96.09 224 0.875 bilinear google/baidu(nxhy)
vit_base_patch16_384 84.20 97.22 384 1.0 bilinear google/baidu(8ack)
vit_large_patch16_224 83.00 96.47 224 0.875 bicubic google/baidu(g7ij)
swin_base_patch4_window7_224 85.27 97.56 224 0.9 bicubic google/baidu(ps9m)
swin_base_patch4_window12_384 86.43 98.07 384 1.0 bicubic google/baidu(ef9t)
swin_large_patch4_window12_384 87.14 98.23 384 1.0 bicubic google/baidu(5shn)
pvtv2_tiny_224 70.47 90.16 224 0.875 bicubic google/baidu(575w)
pvtv2_medium_224 82.02 95.99 224 0.875 bicubic google/baidu(ezfc)
pvtv2_large_224 83.77 96.61 224 0.875 bicubic google/baidu(fbc4)
mixer_b16_224 76.60 92.23 224 0.875 bicubic google/baidu(xh8x)
resmlp_24_224 79.38 94.55 224 0.875 bicubic google/baidu(jdcx)
gmlp_s16_224 79.64 94.63 224 0.875 bicubic google/baidu(bcth)
volo_d5_224_86.10 86.08 97.58 224 1.0 bicubic google/baidu(td49)
volo_d5_512_87.07 87.05 97.97 512 1.15 bicubic google/baidu(irik)
cait_xxs24_224 78.38 94.32 224 1.0 bicubic google/baidu(j9m8)
cait_s24_384 85.05 97.34 384 1.0 bicubic google/baidu(qb86)
cait_m48_448 86.49 97.75 448 1.0 bicubic google/baidu(imk5)
deit_base_distilled_patch16_224 83.32 96.49 224 0.875 bicubic google/baidu(5f2g)
deit_base_distilled_patch16_384 85.43 97.33 384 1.0 bicubic google/baidu(qgj2)
shuffle_vit_tiny_patch4_window7 82.39 96.05 224 0.875 bicubic google/baidu(8a1i)
shuffle_vit_small_patch4_window7 83.53 96.57 224 0.875 bicubic google/baidu(xwh3)
shuffle_vit_base_patch4_window7 83.95 96.91 224 0.875 bicubic google/baidu(1gsr)
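
The Crop_pct and Interpolation columns describe the evaluation preprocessing: each image is typically resized with the listed interpolation method so that a center crop of Image Size covers roughly crop_pct of the resized image. Below is a minimal sketch with paddle.vision.transforms (illustrative only; the normalization constants are the common ImageNet values and may differ from what this repo uses):

import paddle.vision.transforms as T

image_size = 224
crop_pct = 0.875
resize_size = int(image_size / crop_pct)  # 256 when image_size=224 and crop_pct=0.875

# resize with the listed interpolation, then take a center crop of image_size
eval_transforms = T.Compose([
    T.Resize(resize_size, interpolation='bicubic'),
    T.CenterCrop(image_size),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])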

Object Detection

Model Backbone box_mAP Link
DETR ResNet50 42.0 google/baidu(n5gk)
DETR ResNet101 43.5 google/baidu(bxz2)

Semantic Segmentation

Pascal Context

Model Backbone Batch_size mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
SETR_Naive ViT_Large 16 52.06 52.57 google/baidu(owoj) google/baidu(xdb8) config
SETR_PUP ViT_Large 16 53.90 54.53 google/baidu(owoj) google/baidu(6sji) config
SETR_MLA ViT_Large 8 54.39 55.16 google/baidu(owoj) google/baidu(wora) config
SETR_MLA ViT_Large 16 55.01 55.87 google/baidu(owoj) google/baidu(76h2) config

Cityscapes

Model Backbone Batch_size Iteration mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
SETR_Naive ViT_Large 8 40k 76.71 - google/baidu(owoj) google/baidu(g7ro) config
SETR_Naive ViT_Large 8 80k 77.31 - google/baidu(owoj) google/baidu(wn6q) config
SETR_PUP ViT_Large 8 40k 77.92 - google/baidu(owoj) google/baidu(zmoi) config
SETR_PUP ViT_Large 8 80k 78.81 - google/baidu(owoj) baidu(f793) config
SETR_MLA ViT_Large 8 40k 76.70 - google/baidu(owoj) baidu(qaiw) config
SETR_MLA ViT_Large 8 80k 77.26 - google/baidu(owoj) baidu(6bgj) config

ADE20K

Model Backbone Batch_size Iteration mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
SETR_Naive ViT_Large 16 160k 47.57 48.12 google/baidu(owoj) baidu(lugq) config
SETR_PUP ViT_Large 16 160k 49.12 - google/baidu(owoj) baidu(udgs) config
SETR_MLA ViT_Large 8 160k 47.80 - google/baidu(owoj) baidu(mrrv) config
DPT ViT_Large 16 160k 47.21 - google/baidu(owoj) baidu(ts7h) config

GAN

Results (Self-Trained Weights)

Image Classification

Object Detection

Segmentation

GAN

Validation Scripts

Run on a single GPU:

sh run_eval.sh

or you can run the python script:

python main_single_gpu.py

with proper settings.

The script run_eval.sh calls the main Python script main_single_gpu.py with a number of options. Usually you only need to change the following settings, e.g., for the ViT base model:

python main_single_gpu.py \
-cfg='./configs/vit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-data_path='/dataset/imagenet' \
-batch_size=128 \
-eval \
-pretrained='./vit_base_patch16_224'

Note:

  • The -pretrained option accepts the path of the pretrained weights file without the file extension (.pdparams), as in the sketch below.
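
For reference, this is a minimal sketch of how a .pdparams checkpoint passed via -pretrained is typically restored in PaddlePaddle 2.x. The get_config and build_vit helpers are hypothetical stand-ins for the repo's config loader and model factory, not its exact API:

import paddle
from config import get_config   # hypothetical: assumes ./config.py exposes a yaml-based config loader
from vit import build_vit       # hypothetical model factory; the actual module name may differ

config = get_config('./configs/vit_base_patch16_224.yaml')
model = build_vit(config)

# -pretrained takes the path without the extension; '.pdparams' is appended before loading
state_dict = paddle.load('./vit_base_patch16_224' + '.pdparams')
model.set_state_dict(state_dict)
model.eval()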

Run on multiple GPUs:

sh run_eval_multi.sh

or you can run the python script:

python main_multi_gpu.py

with proper settings.

The script run_eval_multi.sh calls the main Python script main_multi_gpu.py with a number of options. Usually you only need to change the following settings, e.g., for the ViT base model:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python main_multi_gpu.py \
-cfg='./configs/vit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-data_path='/dataset/imagenet' \
-batch_size=128 \
-eval \
-pretrained='./vit_base_patch16_224' \
-ngpus=8

Note:

  • The -pretrained option accepts the path of the pretrained weights file without the file extension (.pdparams).

  • If -ngpus is not set, all the available GPU devices will be used (see the sketch below).
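
As a rough illustration of that default (not the repo's actual launcher), one worker process per visible GPU can be started with paddle.distributed.spawn:

import paddle
import paddle.distributed as dist

def evaluate():
    # placeholder for the per-process evaluation loop
    pass

ngpus = None  # value of -ngpus; None stands for "not set"
nprocs = ngpus if ngpus is not None else paddle.device.cuda.device_count()

dist.spawn(evaluate, nprocs=nprocs)  # one worker process per GPU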

Training Scripts

Train on a single GPU:

sh run_train.sh

or you can run the python script:

python main_single_gpu.py

with proper settings.

The script run_train.sh calls the main Python script main_single_gpu.py with a number of options. Usually you only need to change the following settings, e.g., for the ViT base model:

python main_single_gpu.py \
-cfg='./configs/vit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-data_path='/dataset/imagenet' \
-batch_size=128

Note:

  • Training options such as learning rate, image size, and model layers can be changed in the .yaml file set via -cfg. All available settings can be found in ./config.py (see the sketch below).
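
As an illustration, the default-plus-override behaviour can be pictured with a yacs-style CfgNode; the option names below are assumptions, and the authoritative list is in ./config.py:

from yacs.config import CfgNode as CN

# hypothetical subset of defaults; the full list lives in ./config.py
_C = CN()
_C.DATA = CN()
_C.DATA.IMAGE_SIZE = 224
_C.DATA.BATCH_SIZE = 128
_C.TRAIN = CN()
_C.TRAIN.BASE_LR = 0.001
_C.TRAIN.NUM_EPOCHS = 300

config = _C.clone()
# keys found in the yaml passed via -cfg override the defaults above
config.merge_from_file('./configs/vit_base_patch16_224.yaml')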

Train on multiple GPUs:

sh run_train_multi.sh

or you can run the python script:

python main_multi_gpu.py

with proper settings.

The script run_train_multi.sh calls the main Python script main_multi_gpu.py with a number of options. Usually you only need to change the following settings, e.g., for the ViT base model:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python main_multi_gpu.py \
-cfg='./configs/vit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-data_path='/dataset/imagenet' \
-batch_size=128 \
-ngpus=8

Note:

  • Training options such as learning rate, image size, and model layers can be changed in the .yaml file set via -cfg. All available settings can be found in ./config.py.
  • If -ngpus is not set, all the available GPU devices will be used.

Features

  • Optimizers
  • Schedulers
  • DDP (distributed data parallel)
  • Data Augmentation
  • DropPath (see the sketch below)
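
As one example of these building blocks, below is a generic DropPath (stochastic depth) layer written for PaddlePaddle; it is a sketch of the technique, not necessarily the repo's exact implementation:

import paddle
import paddle.nn as nn

class DropPath(nn.Layer):
    """Randomly drops the whole residual branch per sample during training."""
    def __init__(self, drop_prob=0.):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0. or not self.training:
            return x
        keep_prob = 1 - self.drop_prob
        # one Bernoulli draw per sample, broadcast over all remaining dims
        mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        mask = paddle.bernoulli(paddle.full(mask_shape, keep_prob, dtype=x.dtype))
        return x / keep_prob * mask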

Contributing

We encourage and appreciate contributions to the PPViT project. Please refer to our workflow and coding style in CONTRIBUTING.md.

Licenses

Code

This repo is under the Apache-2.0 license.

Pretrained Weights

Citing