🤖 PaddlePaddle Visual Transformers (PaddleViT or PPViT) is a collection of vision models beyond convolution. Most of the models are based on visual transformers, visual attention, and MLPs. PaddleViT also integrates popular layers, utilities, optimizers, schedulers, data augmentations, and training/validation scripts for PaddlePaddle 2.0+. The aim is to reproduce a wide variety of state-of-the-art ViT and MLP models with full training/validation procedures. We are passionate about making cutting-edge CV techniques easier to use for everyone.
🤖 PaddleViT provides models and tools for a variety of vision tasks, such as classification, object detection, semantic segmentation, GAN, and more. Each model architecture is defined in a standalone Python module and can be modified to enable quick research experiments. At the same time, pretrained weights can be downloaded and used to finetune on your own datasets, as sketched below. PaddleViT also integrates popular tools and modules for customized datasets, data preprocessing, performance metrics, DDP, and more.
🤖 PaddleViT is backed by the popular deep learning framework PaddlePaddle; we also provide tutorials and projects on Paddle AI Studio. It's intuitive and straightforward to get started for new users.
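As a quick illustration of that finetuning workflow, the sketch below builds a model, loads downloaded `.pdparams` weights, and swaps the classification head. The helper names (`get_config`, `build_vit`) and the `classifier` attribute are assumptions for illustration only; the actual entry points live in each model's folder.

```python
import paddle
import paddle.nn as nn

# Hypothetical per-model helpers; check the concrete model folder for the real names.
from config import get_config        # assumption: reads the .yaml model config
from transformer import build_vit    # assumption: builds the ViT from that config

config = get_config('./configs/vit_base_patch16_224.yaml')
model = build_vit(config)

# Load .pdparams weights downloaded from the links in the tables below.
model.set_state_dict(paddle.load('./vit_base_patch16_224.pdparams'))

# Replace the classification head for a 10-class custom dataset
# (the attribute name `classifier` is an assumption).
model.classifier = nn.Linear(model.classifier.weight.shape[0], 10)
```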
- ViT (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)
- DeiT (Training data-efficient image transformers & distillation through attention)
- Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
- PVTv2 (PVTv2: Improved Baselines with Pyramid Vision Transformer)
- MLP-Mixer (MLP-Mixer: An all-MLP Architecture for Vision)
- ResMLP (ResMLP: Feedforward networks for image classification with data-efficient training)
- gMLP (Pay Attention to MLPs)
- VOLO (VOLO: Vision Outlooker for Visual Recognition)
- CaiT (Going deeper with Image Transformers)
- Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
- CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)
- T2T-ViT (Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet)
- Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
- HaloNet (Scaling Local Self-Attention for Parameter Efficient Visual Backbones)
- Refined-ViT (Refiner: Refining Self-attention for Vision Transformers)
- CrossViT (CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification)
- Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
- Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
- PVT (Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions)
- Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
- CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)
- UP-DETR (UP-DETR: Unsupervised Pre-training for Object Detection with Transformers)
- SETR (Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers)
- DPT (Vision Transformers for Dense Prediction)
- Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
- Segmenter (Segmenter: Transformer for Semantic Segmentation)
- FTN (Fully Transformer Networks for Semantic Image Segmentation)
- SegFormer (SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers)
- Shuffle Transformer (Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer)
- Focal Self-attention (Focal Self-attention for Local-Global Interactions in Vision Transformers)
- CSwin (CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows)
- TransGAN (TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up)
- Styleformer (Styleformer: Transformer based Generative Adversarial Networks with Style Vector)
- ViTGAN (ViTGAN: Training GANs with Vision Transformers)
Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link |
---|---|---|---|---|---|---|
vit_base_patch16_224 | 84.58 | 97.30 | 224 | 0.875 | bicubic | google/baidu(qv4n) |
vit_base_patch16_384 | 85.99 | 98.00 | 384 | 1.0 | bicubic | google/baidu(wsum) |
vit_large_patch16_224 | 85.81 | 97.82 | 224 | 0.875 | bicubic | google/baidu(1bgk) |
swin_base_patch4_window7_224 | 85.27 | 97.56 | 224 | 0.9 | bicubic | google/baidu(wyck) |
swin_base_patch4_window12_384 | 86.43 | 98.07 | 384 | 1.0 | bicubic | google/baidu(4a95) |
swin_large_patch4_window12_384 | 87.14 | 98.23 | 384 | 1.0 | bicubic | google/baidu(j71u) |
pvtv2_b0 | 70.47 | 90.16 | 224 | 0.875 | bicubic | google/baidu(dxgb) |
pvtv2_b1 | 78.70 | 94.49 | 224 | 0.875 | bicubic | google/baidu(2e5m) |
pvtv2_b2 | 82.02 | 95.99 | 224 | 0.875 | bicubic | google/baidu(are2) |
pvtv2_b3 | 83.14 | 96.47 | 224 | 0.875 | bicubic | google/baidu(nc21) |
pvtv2_b4 | 83.61 | 96.69 | 224 | 0.875 | bicubic | google/baidu(tthf) |
pvtv2_b5 | 83.77 | 96.61 | 224 | 0.875 | bicubic | google/baidu(9v6n) |
pvtv2_b2_linear | 82.06 | 96.04 | 224 | 0.875 | bicubic | google/baidu(a4c8) |
mlp_mixer_b16_224 | 76.60 | 92.23 | 224 | 0.875 | bicubic | google/baidu(xh8x) |
mlp_mixer_l16_224 | 72.06 | 87.67 | 224 | 0.875 | bicubic | google/baidu(8q7r) |
resmlp_24_224 | 79.38 | 94.55 | 224 | 0.875 | bicubic | google/baidu(jdcx) |
resmlp_36_224 | 79.77 | 94.89 | 224 | 0.875 | bicubic | google/baidu(33w3) |
resmlp_big_24_224 | 81.04 | 95.02 | 224 | 0.875 | bicubic | google/baidu(r9kb) |
resmlp_big_24_distilled_224 | 83.59 | 96.65 | 224 | 0.875 | bicubic | google/baidu(4jk5) |
gmlp_s16_224 | 79.64 | 94.63 | 224 | 0.875 | bicubic | google/baidu(bcth) |
volo_d5_224_86.10 | 86.08 | 97.58 | 224 | 1.0 | bicubic | google/baidu(td49) |
volo_d5_512_87.07 | 87.05 | 97.97 | 512 | 1.15 | bicubic | google/baidu(irik) |
cait_xxs24_224 | 78.38 | 94.32 | 224 | 1.0 | bicubic | google/baidu(j9m8) |
cait_s24_384 | 85.05 | 97.34 | 384 | 1.0 | bicubic | google/baidu(qb86) |
cait_m48_448 | 86.49 | 97.75 | 448 | 1.0 | bicubic | google/baidu(imk5) |
deit_base_distilled_patch16_224 | 83.32 | 96.49 | 224 | 0.875 | bicubic | google/baidu(5f2g) |
deit_base_distilled_patch16_384 | 85.43 | 97.33 | 384 | 1.0 | bicubic | google/baidu(qgj2) |
shuffle_vit_tiny_patch4_window7 | 82.39 | 96.05 | 224 | 0.875 | bicubic | google/baidu(8a1i) |
shuffle_vit_small_patch4_window7 | 83.53 | 96.57 | 224 | 0.875 | bicubic | google/baidu(xwh3) |
shuffle_vit_base_patch4_window7 | 83.95 | 96.91 | 224 | 0.875 | bicubic | google/baidu(1gsr) |
cswin_tiny_224 | 82.81 | 96.30 | 224 | 0.9 | bicubic | google/baidu(4q3h) |
cswin_small_224 | 83.60 | 96.58 | 224 | 0.9 | bicubic | google/baidu(gt1a) |
cswin_base_224 | 84.23 | 96.91 | 224 | 0.9 | bicubic | google/baidu(wj8p) |
cswin_large_224 | 86.52 | 97.99 | 224 | 0.9 | bicubic | google/baidu(b5fs) |
cswin_base_384 | 85.51 | 97.48 | 384 | 1.0 | bicubic | google/baidu(rkf5) |
cswin_large_384 | 87.49 | 98.35 | 384 | 1.0 | bicubic | google/baidu(6235) |
t2t_vit_7 | 71.68 | 90.89 | 224 | 0.9 | bicubic | google/baidu(1hpa) |
t2t_vit_10 | 75.15 | 92.80 | 224 | 0.9 | bicubic | google/baidu(ixug) |
t2t_vit_12 | 76.48 | 93.49 | 224 | 0.9 | bicubic | google/baidu(qpbb) |
t2t_vit_14 | 81.50 | 95.67 | 224 | 0.9 | bicubic | google/baidu(c2u8) |
t2t_vit_19 | 81.93 | 95.74 | 224 | 0.9 | bicubic | google/baidu(4in3) |
t2t_vit_24 | 82.28 | 95.89 | 224 | 0.9 | bicubic | google/baidu(4in3) |
t2t_vit_t_14 | 81.69 | 95.85 | 224 | 0.9 | bicubic | google/baidu(4in3) |
t2t_vit_t_19 | 82.44 | 96.08 | 224 | 0.9 | bicubic | google/baidu(mier) |
t2t_vit_t_24 | 82.55 | 96.07 | 224 | 0.9 | bicubic | google/baidu(6vxc) |
t2t_vit_14_384 | 83.34 | 96.50 | 384 | 1.0 | bicubic | google/baidu(r685) |
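The Image Size, Crop_pct, and Interpolation columns above describe the evaluation preprocessing. Assuming the common convention (resize the short side to image_size / crop_pct with the listed interpolation, then center-crop), the transform for a 224 model with crop_pct 0.875 looks roughly like this; the normalization statistics are the usual ImageNet values and may differ per model:

```python
from paddle.vision import transforms

image_size, crop_pct = 224, 0.875
scale_size = int(image_size / crop_pct)   # 256 for 224 @ 0.875

eval_transforms = transforms.Compose([
    transforms.Resize(scale_size, interpolation='bicubic'),  # matches the Interpolation column
    transforms.CenterCrop(image_size),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```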
Model | Backbone | box_mAP | Link |
---|---|---|---|
DETR | ResNet50 | 42.0 | google/baidu(n5gk) |
DETR | ResNet101 | 43.5 | google/baidu(bxz2) |
Model | Backbone | Batch_size | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
---|---|---|---|---|---|---|---|
SETR_Naive | ViT_large | 16 | 52.06 | 52.57 | google/baidu(owoj) | google/baidu(xdb8) | config |
SETR_PUP | ViT_large | 16 | 53.90 | 54.53 | google/baidu(owoj) | google/baidu(6sji) | config |
SETR_MLA | ViT_Large | 8 | 54.39 | 55.16 | google/baidu(owoj) | google/baidu(wora) | config |
SETR_MLA | ViT_large | 16 | 55.01 | 55.87 | google/baidu(owoj) | google/baidu(76h2) | config |
Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
---|---|---|---|---|---|---|---|---|
SETR_Naive | ViT_Large | 8 | 40k | 76.71 | 79.03 | google/baidu(owoj) | google/baidu(g7ro) | config |
SETR_Naive | ViT_Large | 8 | 80k | 77.31 | 79.43 | google/baidu(owoj) | google/baidu(wn6q) | config |
SETR_PUP | ViT_Large | 8 | 40k | 77.92 | 79.63 | google/baidu(owoj) | google/baidu(zmoi) | config |
SETR_PUP | ViT_Large | 8 | 80k | 78.81 | 80.43 | google/baidu(owoj) | baidu(f793) | config |
SETR_MLA | ViT_Large | 8 | 40k | 76.70 | 78.96 | google/baidu(owoj) | baidu(qaiw) | config |
SETR_MLA | ViT_Large | 8 | 80k | 77.26 | 79.27 | google/baidu(owoj) | baidu(6bgj) | config |
Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
---|---|---|---|---|---|---|---|---|
SETR_Naive | ViT_Large | 16 | 160k | 47.57 | 48.12 | google/baidu(owoj) | baidu(lugq) | config |
SETR_PUP | ViT_Large | 16 | 160k | 49.12 | - | google/baidu(owoj) | baidu(udgs) | config |
SETR_MLA | ViT_Large | 8 | 160k | 47.80 | - | google/baidu(owoj) | baidu(mrrv) | config |
DPT | ViT_Large | 16 | 160k | 47.21 | - | google/baidu(owoj) | baidu(ts7h) | config |
Segmenter | ViT_Tiny | 16 | 160k | 38.45 | - | TODO | baidu(1k97) | config |
Segmenter | ViT_Small | 16 | 160k | 46.07 | - | TODO | baidu(i8nv) | config |
Segmenter | ViT_Base | 16 | 160k | 49.08 | - | TODO | baidu(hxrl) | config |
Segmenter_Linear | DeiT_Base | 16 | 160k | 47.34 | - | TODO | baidu(5dpv) | config |
Segmenter | DeiT_Base | 16 | 160k | 49.27 | - | TODO | baidu(3kim) | config |
UperNet | Swin_Tiny | 16 | 160k | 44.90 | 45.37 | - | baidu(lkhg) | config |
UperNet | Swin_Small | 16 | 160k | 47.88 | 48.90 | - | baidu(vvy1) | config |
UperNet | Swin_Base | 16 | 160k | 48.59 | 49.04 | - | baidu(y040) | config |
Model | FID | Image Size | Crop_pct | Interpolation | Link |
---|---|---|---|---|---|
styleformer_cifar10 | 2.73 | 32 | 1.0 | lanczos | google/baidu(7cg2) |
styleformer_stl10 | 15.65 | 48 | 1.0 | lanczos | google/baidu(8pus) |
styleformer_celeba | 3.32 | 64 | 1.0 | lanczos | google/baidu(ymh7) |
styleformer_lsun | 9.68 | 128 | 1.0 | lanczos | google/baidu(ue28) |
*The results are evaluated on the CIFAR10, STL10, CelebA, and LSUN Church datasets using the fid50k_full metric.
To evaluate a model on ImageNet2012 using a single GPU, run `sh run_eval.sh`, or run the Python script `python main_single_gpu.py` directly with proper settings. The script `run_eval.sh` calls the main Python script `main_single_gpu.py` with a number of options; usually you need to change the following settings, e.g., for the ViT base model:
python main_single_gpu.py \
-cfg='./configs/vit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-data_path='/dataset/imagenet' \
-batch_size=128 \
-eval \
-pretrained='./vit_base_patch16_224'
Note:
- The `-pretrained` option accepts the path of the pretrained weights file without the file extension (`.pdparams`).
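For clarity, a minimal sketch of how a `-pretrained` path without the extension is typically resolved; the `nn.Linear` here is only a stand-in for the model built from the `.yaml` config:

```python
import paddle
import paddle.nn as nn

model = nn.Linear(768, 1000)  # stand-in for the model built from the .yaml config
paddle.save(model.state_dict(), './vit_base_patch16_224.pdparams')

pretrained = './vit_base_patch16_224'  # value passed to -pretrained, no extension
model.set_state_dict(paddle.load(pretrained + '.pdparams'))  # the script appends .pdparams
```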
To evaluate a model on ImageNet2012 using multiple GPUs, run `sh run_eval_multi.sh`, or run the Python script `python main_multi_gpu.py` directly with proper settings. The script `run_eval_multi.sh` calls the main Python script `main_multi_gpu.py` with a number of options; usually you need to change the following settings, e.g., for the ViT base model:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python main_multi_gpu.py \
-cfg='./configs/vit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-data_path='/dataset/imagenet' \
-batch_size=128 \
-eval \
-pretrained='./vit_base_patch16_224' \
-ngpus=8
Note:
- The `-pretrained` option accepts the path of the pretrained weights file without the file extension (`.pdparams`).
- If `-ngpus` is not set, all available GPU devices will be used.
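A rough sketch of how `-ngpus` maps onto PaddlePaddle's multi-process launch; the worker body is only a placeholder, not the actual `main_multi_gpu.py` internals:

```python
import paddle.distributed as dist

def worker():
    # Each spawned process drives one GPU; set up the process group first.
    dist.init_parallel_env()
    print(f'rank {dist.get_rank()} of {dist.get_world_size()} ready')
    # ... build the model, load the -pretrained weights, run the eval loop ...

if __name__ == '__main__':
    # nprocs corresponds to -ngpus; nprocs=-1 would use all visible GPUs.
    dist.spawn(worker, nprocs=8)
```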
To train a model on ImageNet2012 using a single GPU, run `sh run_train.sh`, or run the Python script `python main_single_gpu.py` directly with proper settings. The script `run_train.sh` calls the main Python script `main_single_gpu.py` with a number of options; usually you need to change the following settings, e.g., for the ViT base model:
python main_single_gpu.py \
-cfg='./configs/vit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-data_path='/dataset/imagenet' \
-batch_size=128
Note:
- The training options such as lr, image size, model layers, etc., can be changed in the `.yaml` file set in `-cfg`. All available settings can be found in `./config.py`.
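The split between `./config.py` defaults and the `-cfg` `.yaml` overrides can be pictured as below, assuming a yacs-style `CfgNode`; the field names are illustrative, not the exact keys:

```python
from yacs.config import CfgNode

# Defaults as they might appear in ./config.py (illustrative field names).
_C = CfgNode()
_C.DATA = CfgNode()
_C.DATA.IMAGE_SIZE = 224
_C.TRAIN = CfgNode()
_C.TRAIN.BASE_LR = 1e-3
_C.TRAIN.NUM_EPOCHS = 300

cfg = _C.clone()
# The real scripts would call cfg.merge_from_file(<the -cfg .yaml>); merge_from_list
# is used here so the snippet runs without an external file.
cfg.merge_from_list(['DATA.IMAGE_SIZE', 384, 'TRAIN.BASE_LR', 5e-4])
cfg.freeze()
print(cfg.DATA.IMAGE_SIZE, cfg.TRAIN.BASE_LR)
```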
To train a model on ImageNet2012 using multiple GPUs, run `sh run_train_multi.sh`, or run the Python script `python main_multi_gpu.py` directly with proper settings. The script `run_train_multi.sh` calls the main Python script `main_multi_gpu.py` with a number of options; usually you need to change the following settings, e.g., for the ViT base model:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python main_multi_gpu.py \
-cfg='./configs/vit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-data_path='/dataset/imagenet' \
-batch_size=128 \
-ngpus=8
Note:
- The training options such as lr, image size, model layers, etc., can be changed in the `.yaml` file set in `-cfg`. All available settings can be found in `./config.py`.
- If `-ngpus` is not set, all available GPU devices will be used.
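For reference, the DDP pattern behind multi-GPU training in PaddlePaddle looks roughly like the sketch below; the tiny linear model and random batch are stand-ins for the ViT and the ImageNet loader:

```python
import paddle
import paddle.distributed as dist

def train_worker():
    dist.init_parallel_env()
    model = paddle.DataParallel(paddle.nn.Linear(768, 1000))  # gradient all-reduce across GPUs
    opt = paddle.optimizer.AdamW(learning_rate=1e-3, parameters=model.parameters())

    x = paddle.randn([8, 768])   # stand-in batch
    loss = model(x).mean()
    loss.backward()              # gradients synchronized here
    opt.step()
    opt.clear_grad()

if __name__ == '__main__':
    dist.spawn(train_worker, nprocs=2)  # e.g. -ngpus=2
```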
- Optimizers
- Schedulers
- DDP
- Data Augmentation
- DropPath (see the sketch below)
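DropPath (stochastic depth) in particular can be summarized with the following sketch against the PaddlePaddle API; it shows the standard technique rather than PaddleViT's exact implementation:

```python
import paddle
import paddle.nn as nn

class DropPath(nn.Layer):
    """Randomly drop the residual branch for each sample (stochastic depth)."""
    def __init__(self, drop_prob=0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0.0 or not self.training:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over the remaining dims.
        shape = [x.shape[0]] + [1] * (x.ndim - 1)
        mask = paddle.floor(keep_prob + paddle.rand(shape, dtype=x.dtype))
        return x / keep_prob * mask  # rescale so the expected output is unchanged

# Typical use inside a transformer block: x = x + drop_path(attn(norm(x)))
drop_path = DropPath(drop_prob=0.1)
drop_path.train()
y = drop_path(paddle.randn([4, 197, 768]))
```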
We encourage and appreciate your contributions to the PPViT project; please refer to our workflow and code style in CONTRIBUTING.md.
This repo is under the Apache-2.0 license.