Implementation of SOTA vision transformer and MLP models on PaddlePaddle 2.0+
PaddlePaddle Visual Transformers (PPViT) is a collection of PaddlePaddle image models beyond convolution, mostly based on vision transformers, visual attention models, and MLPs. PPViT also integrates popular layers, utilities, optimizers, schedulers, data augmentations, and training/validation scripts for PaddlePaddle 2.0+. The aim is to reproduce a wide variety of SOTA ViT models with full training/validation procedures.
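For orientation, a vision transformer classifier splits an image into patches, embeds them as tokens, and runs a transformer encoder over the token sequence. The toy PaddlePaddle sketch below only illustrates that flow; it is not one of the repository's implementations, and all names in it are made up for illustration:

```python
import paddle
import paddle.nn as nn

class TinyViT(nn.Layer):
    """Toy ViT-style classifier: patch embedding -> transformer encoder -> classification head."""
    def __init__(self, image_size=224, patch_size=16, embed_dim=192, depth=4, num_heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding as a strided conv (equivalent to a linear projection of flattened patches).
        self.patch_embed = nn.Conv2D(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = paddle.create_parameter([1, 1, embed_dim], dtype='float32')
        self.pos_embed = paddle.create_parameter([1, num_patches + 1, embed_dim], dtype='float32')
        encoder_layer = nn.TransformerEncoderLayer(embed_dim, num_heads, embed_dim * 4)
        self.encoder = nn.TransformerEncoder(encoder_layer, depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose([0, 2, 1])   # [B, N, D] token sequence
        cls = self.cls_token.expand([x.shape[0], -1, -1])
        x = paddle.concat([cls, x], axis=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                                  # classify from the CLS token

logits = TinyViT()(paddle.randn([2, 3, 224, 224]))
print(logits.shape)  # [2, 1000]
```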
Image Classification:
- ViT (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)
- DeiT (Training data-efficient image transformers & distillation through attention)
- Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
- PVT (Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions)
- MLP-Mixer (MLP-Mixer: An all-MLP Architecture for Vision)
- ResMLP (ResMLP: Feedforward networks for image classification with data-efficient training)
- gMLP (Pay Attention to MLPs)
- VOLO (VOLO: Vision Outlooker for Visual Recognition)
- CaiT (Going deeper with Image Transformers)
- Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
- T2T-ViT (Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet)
- CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)
- Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
- HaloNet (Scaling Local Self-Attention for Parameter Efficient Visual Backbones)
- Refined-ViT (Refiner: Refining Self-attention for Vision Transformers)
- CrossViT (CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification)
Object Detection:
- Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
- Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
- PVT (Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions)
- Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
- CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)
- UP-DETR (UP-DETR: Unsupervised Pre-training for Object Detection with Transformers)
Semantic Segmentation:
- FTN (Fully Transformer Networks for Semantic Image Segmentation)
- Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
- Segmenter (Segmenter: Transformer for Semantic Segmentation)
- SegFormer (SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers)
- Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
- Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
- CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)
GAN:
- TransGAN (TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up)
- Styleformer (Styleformer: Transformer based Generative Adversarial Networks with Style Vector)
- ViTGAN (ViTGAN: Training GANs with Vision Transformers)
Image classification results on ImageNet2012 val:

Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link |
---|---|---|---|---|---|---|
vit_base_patch16_224 | 81.66 | 96.09 | 224 | 0.875 | bilinear | google/baidu(nxhy) |
vit_base_patch16_384 | 84.20 | 97.22 | 384 | 1.0 | bilinear | google/baidu(8ack) |
vit_large_patch16_224 | 83.00 | 96.47 | 224 | 0.875 | bicubic | google/baidu(g7ij) |
swin_base_patch4_window7_224 | 85.27 | 97.56 | 224 | 0.9 | bicubic | google/baidu(ps9m) |
swin_base_patch4_window12_384 | 86.43 | 98.07 | 384 | 1.0 | bicubic | google/baidu(ef9t) |
swin_large_patch4_window12_384 | 87.14 | 98.23 | 384 | 1.0 | bicubic | google/baidu(5shn) |
pvtv2_tiny_224 | 70.47 | 90.16 | 224 | 0.875 | bicubic | google/baidu(575w) |
pvtv2_medium_224 | 82.02 | 95.99 | 224 | 0.875 | bicubic | google/baidu(ezfc) |
pvtv2_large_224 | 83.77 | 96.61 | 224 | 0.875 | bicubic | google/baidu(fbc4) |
mixer_b16_224 | 76.60 | 92.23 | 224 | 0.875 | bicubic | google/baidu(xh8x) |
resmlp_24_224 | 79.38 | 94.55 | 224 | 0.875 | bicubic | google/baidu(jdcx) |
gmlp_s16_224 | 79.64 | 94.63 | 224 | 0.875 | bicubic | google/baidu(bcth) |
volo_d5_224_86.10 | 86.08 | 97.58 | 224 | 1.0 | bicubic | google/baidu(td49) |
volo_d5_512_87.07 | 87.05 | 97.97 | 512 | 1.15 | bicubic | google/baidu(irik) |
cait_xxs24_224 | 78.38 | 94.32 | 224 | 1.0 | bicubic | google/baidu(j9m8) |
cait_s24_384 | 85.05 | 97.34 | 384 | 1.0 | bicubic | google/baidu(qb86) |
cait_m48_448 | 86.49 | 97.75 | 448 | 1.0 | bicubic | google/baidu(imk5) |
deit_base_distilled_patch16_224 | 83.32 | 96.49 | 224 | 0.875 | bicubic | google/baidu(5f2g) |
deit_base_distilled_patch16_384 | 85.43 | 97.33 | 384 | 1.0 | bicubic | google/baidu(qgj2) |
shuffle_vit_tiny_patch4_window7 | 82.39 | 96.05 | 224 | 0.875 | bicubic | google/baidu(8a1i) |
shuffle_vit_small_patch4_window7 | 83.53 | 96.57 | 224 | 0.875 | bicubic | google/baidu(xwh3) |
shuffle_vit_base_patch4_window7 | 83.95 | 96.91 | 224 | 0.875 | bicubic | google/baidu(1gsr) |
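The Crop_pct and Interpolation columns describe the evaluation preprocessing: the short side is typically resized to image_size / crop_pct with the listed interpolation, then center-cropped to the evaluation size. A minimal sketch with paddle.vision.transforms (the exact pipeline used by each model in the repo may differ):

```python
from paddle.vision import transforms

def eval_transforms(image_size=224, crop_pct=0.875, interpolation='bicubic'):
    # Resize the short side to image_size / crop_pct, then center-crop to image_size.
    resize_size = int(image_size / crop_pct)
    return transforms.Compose([
        transforms.Resize(resize_size, interpolation=interpolation),
        transforms.CenterCrop(image_size),
        transforms.ToTensor(),
        # ImageNet statistics; some checkpoints (e.g. the original ViT weights) use 0.5/0.5 instead.
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
```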
Object detection results on COCO val2017:

Model | Backbone | box_mAP | Link |
---|---|---|---|
DETR | ResNet50 | 42.0 | google/baidu(n5gk) |
DETR | ResNet101 | 43.5 | google/baidu(bxz2) |
Semantic segmentation results on Pascal Context:

Model | Backbone | Batch_size | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
---|---|---|---|---|---|---|---|
SETR_Naive | ViT_large | 16 | 52.06 | 52.57 | google/baidu(owoj) | google/baidu(xdb8) | config |
SETR_PUP | ViT_large | 16 | 53.90 | 54.53 | google/baidu(owoj) | google/baidu(6sji) | config |
SETR_MLA | ViT_Large | 8 | 54.39 | 55.16 | google/baidu(owoj) | google/baidu(wora) | config |
SETR_MLA | ViT_large | 16 | 55.01 | 55.87 | google/baidu(owoj) | google/baidu(76h2) | config |
Semantic segmentation results on Cityscapes:

Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
---|---|---|---|---|---|---|---|---|
SETR_Naive | ViT_Large | 8 | 40k | 76.71 | - | google/baidu(owoj) | google/baidu(g7ro) | config |
SETR_Naive | ViT_Large | 8 | 80k | 77.31 | - | google/baidu(owoj) | google/baidu(wn6q) | config |
SETR_PUP | ViT_Large | 8 | 40k | 77.92 | - | google/baidu(owoj) | google/baidu(zmoi) | config |
SETR_PUP | ViT_Large | 8 | 80k | 78.81 | - | google/baidu(owoj) | baidu(f793) | config |
SETR_MLA | ViT_Large | 8 | 40k | 76.70 | - | google/baidu(owoj) | baidu(qaiw) | config |
SETR_MLA | ViT_Large | 8 | 80k | 77.26 | - | google/baidu(owoj) | baidu(6bgj) | config |
Semantic segmentation results on ADE20K:

Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
---|---|---|---|---|---|---|---|---|
SETR_Naive | ViT_Large | 16 | 160k | 47.57 | 48.12 | google/baidu(owoj) | baidu(lugq) | config |
SETR_PUP | ViT_Large | 16 | 160k | 49.12 | - | google/baidu(owoj) | baidu(udgs) | config |
SETR_MLA | ViT_Large | 8 | 160k | 47.80 | - | google/baidu(owoj) | baidu(mrrv) | config |
DPT | ViT_Large | 16 | 160k | 47.21 | - | google/baidu(owoj) | baidu(ts7h) | config |
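In these tables, mIoU (ss) is single-scale inference, while mIoU (ms+flip) averages predictions over several input scales plus horizontal flips. A rough sketch of that aggregation idea, not the repository's actual inference code, assuming the model returns a single per-pixel logits tensor:

```python
import paddle
import paddle.nn.functional as F

def multi_scale_flip_probs(model, image, scales=(0.75, 1.0, 1.25)):
    # image: [1, 3, H, W]; returns averaged per-pixel class probabilities at the original resolution.
    _, _, h, w = image.shape
    total = 0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear', align_corners=False)
        for flip in (False, True):
            x = paddle.flip(scaled, axis=[3]) if flip else scaled
            logits = model(x)                          # assumed shape: [1, num_classes, h', w']
            if flip:
                logits = paddle.flip(logits, axis=[3])  # flip back before accumulating
            logits = F.interpolate(logits, size=[h, w], mode='bilinear', align_corners=False)
            total = total + F.softmax(logits, axis=1)
    return total / (len(scales) * 2)
```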
To evaluate a model on a single GPU, run the script `sh run_eval.sh`, or run the Python script `python main_single_gpu.py` with proper settings. The script `run_eval.sh` calls the main Python script `main_single_gpu.py` with a number of options; usually you need to change the following settings, e.g., for the ViT base model:
```shell
python main_single_gpu.py \
    -cfg='./configs/vit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -data_path='/dataset/imagenet' \
    -batch_size=128 \
    -eval \
    -pretrained='./vit_base_patch16_224'
```
Note:
- The `-pretrained` option accepts the path of the pretrained weights file without the file extension (.pdparams); see the sketch below.
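In other words, the script is expected to append the extension itself before loading; roughly (a sketch, not the exact code in main_single_gpu.py):

```python
import paddle

def load_pretrained(model, pretrained_path):
    # pretrained_path is the value passed to -pretrained, e.g. './vit_base_patch16_224' (no extension);
    # the .pdparams extension is appended before loading the state dict.
    state_dict = paddle.load(pretrained_path + '.pdparams')
    model.set_state_dict(state_dict)
    return model
```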
To evaluate a model on multiple GPUs, run the script `sh run_eval_multi.sh`, or run the Python script `python main_multi_gpu.py` with proper settings. The script `run_eval_multi.sh` calls the main Python script `main_multi_gpu.py` with a number of options; usually you need to change the following settings, e.g., for the ViT base model:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python main_multi_gpu.py \
    -cfg='./configs/vit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -data_path='/dataset/imagenet' \
    -batch_size=128 \
    -eval \
    -pretrained='./vit_base_patch16_224' \
    -ngpus=8
```
Note:
- The `-pretrained` option accepts the path of the pretrained weights file without the file extension (.pdparams).
- If `-ngpus` is not set, all available GPU devices will be used (see the launch sketch below).
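Multi-GPU scripts in PaddlePaddle commonly launch one worker process per device via paddle.distributed; the sketch below shows that general pattern, and main_multi_gpu.py may be organized differently:

```python
import paddle.distributed as dist

def main_worker(*args):
    # One process per GPU; init_parallel_env sets up NCCL communication for this worker.
    dist.init_parallel_env()
    rank = dist.get_rank()
    # ... build the model and dataloader, wrap the model with paddle.DataParallel, then evaluate ...

if __name__ == '__main__':
    ngpus = 8  # corresponds to the -ngpus option
    dist.spawn(main_worker, nprocs=ngpus)
```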
To train a model on a single GPU, run the script `sh run_train.sh`, or run the Python script `python main_single_gpu.py` with proper settings. The script `run_train.sh` calls the main Python script `main_single_gpu.py` with a number of options; usually you need to change the following settings, e.g., for the ViT base model:
```shell
python main_single_gpu.py \
    -cfg='./configs/vit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -data_path='/dataset/imagenet' \
    -batch_size=128
```
Note:
- Training options such as the learning rate, image size, model layers, etc., can be changed in the `.yaml` file set in `-cfg`. All available settings can be found in `./config.py` (see the sketch below).
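The config system is typically a set of defaults in `./config.py` that is overridden by the YAML file passed via `-cfg` and then by command-line arguments. Conceptually, assuming a yacs-style CfgNode (field names here are illustrative and may differ from the repo's actual config.py):

```python
from yacs.config import CfgNode as CN

_C = CN()
_C.DATA = CN()
_C.DATA.IMAGE_SIZE = 224      # illustrative default, overridden by the YAML file
_C.DATA.BATCH_SIZE = 128
_C.TRAIN = CN()
_C.TRAIN.BASE_LR = 1e-3

def get_config(cfg_file=None):
    config = _C.clone()
    if cfg_file:
        config.merge_from_file(cfg_file)   # YAML values override the defaults above
    return config

# config = get_config('./configs/vit_base_patch16_224.yaml')
```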
To train a model on multiple GPUs, run the script `sh run_train_multi.sh`, or run the Python script `python main_multi_gpu.py` with proper settings. The script `run_train_multi.sh` calls the main Python script `main_multi_gpu.py` with a number of options; usually you need to change the following settings, e.g., for the ViT base model:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python main_multi_gpu.py \
    -cfg='./configs/vit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -data_path='/dataset/imagenet' \
    -batch_size=128 \
    -ngpus=8
```
Note:
- Training options such as the learning rate, image size, model layers, etc., can be changed in the `.yaml` file set in `-cfg`. All available settings can be found in `./config.py`.
- If `-ngpus` is not set, all available GPU devices will be used.
- Optimizers
- Schedulers
- DDP (distributed data parallel)
- Data Augmentation
- DropPath (see the sketch below)
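DropPath (stochastic depth) randomly drops entire residual branches per sample during training. A common PaddlePaddle implementation looks roughly like this; the version in the repo may differ in detail:

```python
import paddle
import paddle.nn as nn

class DropPath(nn.Layer):
    """Drop whole residual paths per sample (stochastic depth)."""
    def __init__(self, drop_prob=0.):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0. or not self.training:
            return x
        keep_prob = 1 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over all remaining dims.
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        mask = paddle.bernoulli(paddle.full(shape, keep_prob, dtype=x.dtype))
        return x / keep_prob * mask   # rescale so the expected output is unchanged
```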
We encourage and appreciate your contributions to the PPViT project; please refer to our workflow and code style guidelines in CONTRIBUTING.md.
This repo is under the Apache-2.0 license.