data2vec is a framework for self-supervised representation learning for images, speech, and text, as described in data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language (Baevski et al., 2022). The algorithm uses the same learning mechanism for the different modalities. You can read more about this work in the paper on arXiv (https://arxiv.org/abs/2202.03555) and in the fairseq repo.
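At a high level, a student network encodes a masked view of the input while a teacher network, whose weights are an exponential moving average (EMA) of the student, encodes the unmasked view; the student then regresses the average of the teacher's top-K layer outputs at the masked positions. The sketch below is purely illustrative and is not the repository code (which lives in run_cyclical.py): `return_all_layers` and the `mask` argument are assumed interfaces, and the flags mentioned in the comments are the ones used in the commands further down.

```python
# Illustrative sketch of the data2vec objective, NOT the repository code.
# `student` / `teacher` are assumed to be identical ViT encoders that can
# return the outputs of every transformer block.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay):
    # Teacher weights track an exponential moving average of the student
    # (controlled by --ema_decay / --ema_decay_init / --ema_start_at below).
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)

def data2vec_loss(student, teacher, images, mask, target_layers, beta=2.0):
    # The teacher sees the unmasked image; the target is the average of the
    # top-K block outputs (--target_layers), layer-normalized
    # (cf. --post_target_layer_norm).
    with torch.no_grad():
        layers = teacher(images, return_all_layers=True)   # list of [B, T, C]
        target = torch.stack([layers[i] for i in target_layers]).mean(0)
        target = F.layer_norm(target, target.shape[-1:])
    # The student sees the masked image and regresses the targets at the
    # masked patch positions with a smooth L1 loss (beta maps to --l1_beta).
    pred = student(images, mask=mask)                        # [B, T, C]
    return F.smooth_l1_loss(pred[mask], target[mask], beta=beta)
```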
For details about how to set up your BEiT environment, please refer to the original BEiT README here. Below you can find the commands necessary to reproduce the vision results reported in data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language.
| Pretrained Model | Version | Link |
|---|---|---|
| data2vec ViT-B | 800 epochs pretrained | download |
| data2vec ViT-L | 800 epochs pretrained | download |
| data2vec ViT-L | 1600 epochs pretrained | download |
| data2vec ViT-B | Finetuned | download |
| data2vec ViT-L | Finetuned | download |
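The checkpoints are regular PyTorch files and can be inspected before use. The snippet below is only a sketch: the file name is hypothetical and the `"model"` key is an assumption based on BEiT-style checkpoints, so adjust it to the actual layout.

```python
# Illustrative only: load and inspect a downloaded checkpoint.
# File name is hypothetical; the "model" key is an assumption
# based on BEiT-style checkpoints.
import torch

ckpt = torch.load("data2vec_vit_b_800ep.pt", map_location="cpu")
state_dict = ckpt.get("model", ckpt)
print(f"{len(state_dict)} tensors", sorted(state_dict)[:3])
```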
Command to pretrain the ViT-B model for 800 epochs
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=16 run_cyclical.py \
--data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} --log_dir ${OUTPUT_DIR} --num_mask_patches 120 \
--model beit_base_patch16_224 \
--seed 0 \
--target_layers [6,7,8,9,10,11] \
--ema_decay 0.9998 --ema_start_at 0 --ema_decay_init 0.999 \
--batch_size 128 --lr 2e-3 --warmup_epochs 10 --epochs 800 \
--clip_grad 3.0 --drop_path 0.25 --layer_scale_init_value 1e-4 \
--layer_results 'end' \
--var_w0 0.0 --var_w1 0.0 \
--max_mask_patches_per_block 196 --min_mask_patches_per_block 16 \
--l1_beta=2.0 \
--weight_decay 0.05 \
--imagenet_default_mean_and_std --dist_url $dist_url --loss_scale -1 --mask_dropout_prob -1.0 \
--post_target_layer_norm --world_size 16 --attn_drop_rate 0.05
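The teacher's EMA decay is controlled by --ema_decay, --ema_decay_init and --ema_start_at. The exact schedule is implementation-specific; the function below is only a guess at a linear ramp between the two values (the anneal length is a made-up parameter), intended to show how the three flags could relate.

```python
# Hypothetical sketch of how the EMA-decay flags could interact; the actual
# schedule is defined in run_cyclical.py and may differ.
def ema_decay_at(step, ema_decay_init=0.999, ema_decay=0.9998,
                 ema_start_at=0, anneal_steps=100_000):
    if step < ema_start_at:
        return 0.0  # assumption: before this step the teacher is just a copy
    progress = min(1.0, (step - ema_start_at) / max(1, anneal_steps))
    return ema_decay_init + progress * (ema_decay - ema_decay_init)

print(ema_decay_at(0), ema_decay_at(50_000), ema_decay_at(200_000))
```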
Command to finetune the ViT-B model
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 run_class_finetuning.py \
--model beit_base_patch16_224 \
--finetune $CHECKPOINT \
--data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} --log_dir ${OUTPUT_DIR} --batch_size 128 --lr 4e-3 --update_freq 1 \
--warmup_epochs 10 --epochs 100 --layer_decay 0.65 --drop_path 0.2 --drop 0.0 \
--weight_decay 0.0 --mixup 0.8 --cutmix 1.0 --enable_deepspeed --nb_classes 1000 \
--target_layer -1 --world_size 8 --dist_url $dist_url
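--layer_decay 0.65 applies BEiT-style layer-wise learning-rate decay during fine-tuning: parameters closer to the input receive smaller learning rates than the head. The helper below is a simplified sketch of that scaling; the exact parameter grouping lives in run_class_finetuning.py.

```python
# Simplified sketch of layer-wise learning-rate decay (--layer_decay).
# Assumption: layer_id 0 covers the patch/positional embeddings,
# 1..num_layers the transformer blocks, and num_layers + 1 the head.
def lr_scale(layer_id, num_layers=12, layer_decay=0.65):
    return layer_decay ** (num_layers + 1 - layer_id)

base_lr = 4e-3
print(base_lr * lr_scale(0))    # embeddings: smallest learning rate
print(base_lr * lr_scale(13))   # classification head: full base learning rate
```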
Command to pretrain the ViT-L model for 800 epochs
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=64 run_cyclical.py \
--data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} --log_dir ${OUTPUT_DIR} --num_mask_patches 120 \
--model beit_large_patch16_224 \
--seed 0 \
--target_layers [18,19,20,21,22,23] \
--ema_decay 0.9998 --ema_start_at 0 \
--batch_size 64 --lr 1e-3 --warmup_epochs 80 --epochs 800 \
--clip_grad 3.0 --drop_path 0.2 --layer_scale_init_value 1e-5 \
--layer_results 'end' \
--l1_beta=2 \
--var_w0 0.0 --var_w1 0.0 --var_margin0 0.5 \
--max_mask_patches_per_block 196 --min_mask_patches_per_block 16 \
--imagenet_default_mean_and_std --dist_url $dist_url --world_size 64 \
--post_target_layer_norm --attn_drop_rate 0.15
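The learning rates in these commands go with specific effective batch sizes (world_size × per-GPU batch_size × update_freq), so running on a different number of GPUs typically calls for adjusting --batch_size, --update_freq, or --lr. A quick check, for illustration only:

```python
# Effective batch sizes implied by the pretraining commands above.
def effective_batch(world_size, batch_size, update_freq=1):
    return world_size * batch_size * update_freq

print(effective_batch(16, 128))  # ViT-B pretraining: 2048 images per update
print(effective_batch(64, 64))   # ViT-L pretraining: 4096 images per update
```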
Command to further pretrain the ViT-L model for another 800 epochs with a constant EMA decay, starting from the 800-epoch checkpoint passed via --seed_model
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=64 run_cyclical.py \
--data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} --log_dir ${OUTPUT_DIR} --num_mask_patches 120 \
--model beit_large_patch16_224 \
--seed 0 \
--target_layers [18,19,20,21,22,23] \
--ema_decay 0.9999 --ema_start_at 0 --ema_decay_init 0.999 \
--batch_size 64 --lr 1e-3 --warmup_epochs 40 --epochs 800 \
--clip_grad 3.0 --drop_path 0.2 --layer_scale_init_value 1e-5 \
--layer_results 'end' \
--l1_beta=2 \
--var_w0 0.0 --var_w1 0.0 --var_margin0 0.5 \
--max_mask_patches_per_block 196 --min_mask_patches_per_block 16 \
--imagenet_default_mean_and_std --dist_url $dist_url --world_size 64 \
--post_target_layer_norm --attn_drop_rate 0.15 \
--seed_model {PATH_TO_800EPOCH_MODEL}
Command to finetune the ViT-L model
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=16 run_class_finetuning.py \
--model beit_large_patch16_224 \
--finetune $CHECKPOINT \
--data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} --log_dir ${OUTPUT_DIR} --batch_size 64 --lr 5e-3 --update_freq 1 \
--warmup_epochs $WARMUP --epochs 50 --layer_decay 0.65 --drop_path 0.25 --drop 0.0 \
--weight_decay 0.05 --mixup 0.8 --cutmix 1.0 --enable_deepspeed --nb_classes 1000 --seed 0 \
--target_layer -1 --world_size 16 --dist_url $dist_url --attn_drop_rate 0.0
data2vec is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: Unilm is licensed under the MIT license.
If you find this repository useful, please consider citing our work:
@misc{https://doi.org/10.48550/arxiv.2202.03555,
  doi = {10.48550/ARXIV.2202.03555},
  url = {https://arxiv.org/abs/2202.03555},
  author = {Baevski, Alexei and Hsu, Wei-Ning and Xu, Qiantong and Babu, Arun and Gu, Jiatao and Auli, Michael},
  keywords = {Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}