[2022 CVPR] FLAVA: A Foundational Language And Vision Alignment Model #221

Jasonlee1995 commented Oct 8, 2024


Existing models in the vision-and-language space fall into two categories:

  1. dual encoder approach
    trained with contrastive pre-training (e.g. CLIP, ALIGN)
    strength: can handle uni-modal & cross-modal tasks
    weakness: cannot handle multi-modal tasks
  2. fusion encoder approach
    trained with a mix of pre-training tasks (e.g. MLM, prefixLM, ITM, ...)
    strength: can handle multi-modal tasks
    weakness: weak on uni-modal and cross-modal tasks

In other words, no existing vision-language model targets uni-modal, cross-modal, and multi-modal tasks all at once.

Naturally, it would be desirable for a single universal foundation model to handle tasks across all modalities.

So how should a model be trained to perform well on tasks across these different modalities?
→ Train it with uni-modal, cross-modal, and multi-modal objectives together.

Then how should the model be designed so that it can be trained with such multi-task objectives?
→ Combine a dual encoder and a fusion encoder into one architecture.

FLAVA, a foundation model trained on open-source datasets with multi-task objectives, can handle tasks across all of these modalities.

Below is a brief summary of only the parts I consider important.

1. FLAVA: A Foundational Language And Vision Alignment Model

model architecture

image encoder : ViT-B/16
add image classification token

text encoder : pre-norm Transformer with hidden size 768
text tokenizer : BERT tokenizer (WordPiece with 30,522 vocab size)
add text classification token

multi-modal encoder : pre-norm Transformer with hidden size 768
apply a linear projection layer to the image features and the text features respectively, then concatenate them (see the wiring sketch below)
add multi-modal classification token
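
A minimal PyTorch-style sketch of how the three encoders could be wired together; the module interfaces, projection dimensions, and names are assumptions for illustration, not the official implementation:

```python
import torch
import torch.nn as nn

class FlavaSketch(nn.Module):
    """Rough wiring of FLAVA's three encoders (illustrative, not the official code)."""

    def __init__(self, image_encoder, text_encoder, mm_encoder, hidden=768):
        super().__init__()
        self.image_encoder = image_encoder  # e.g. a ViT-B/16 returning [B, N_img + 1, hidden]
        self.text_encoder = text_encoder    # pre-norm Transformer returning [B, N_txt + 1, hidden]
        self.mm_encoder = mm_encoder        # pre-norm Transformer over the fused sequence
        self.img_proj = nn.Linear(hidden, hidden)  # linear projection before fusion
        self.txt_proj = nn.Linear(hidden, hidden)
        self.mm_cls = nn.Parameter(torch.zeros(1, 1, hidden))  # multi-modal classification token

    def forward(self, images, text_ids):
        h_img = self.image_encoder(images)    # image hidden states (incl. image [CLS])
        h_txt = self.text_encoder(text_ids)   # text hidden states (incl. text [CLS])
        fused = torch.cat([self.mm_cls.expand(h_img.size(0), -1, -1),
                           self.img_proj(h_img),
                           self.txt_proj(h_txt)], dim=1)
        h_mm = self.mm_encoder(fused)         # multi-modal hidden states
        return h_img, h_txt, h_mm
```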

pre-training objectives

uni-modal objective

Masked Image Modeling (MIM)
BEiT style: rectangular block-wise masking + train to predict the visual codebook index of each masked patch
(a pre-trained dVAE tokenizer assigns a visual codebook index to every image patch)

Masked Language Modeling (MLM)
BERT style: mask 15% of the text tokens + train to predict the word vocabulary index of each masked token
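
Both MIM and MLM come down to a cross-entropy loss computed only at the masked positions; a minimal sketch, with tensor names assumed for illustration:

```python
import torch
import torch.nn.functional as F

def masked_prediction_loss(logits, targets, mask):
    """Cross-entropy restricted to masked positions.

    logits : [B, N, V]  predictions over the visual codebook (MIM) or word vocabulary (MLM)
    targets: [B, N]     codebook indices from the dVAE tokenizer, or word vocabulary indices
    mask   : [B, N]     True where the input token was masked
    """
    logits = logits[mask]    # keep only masked positions -> [K, V]
    targets = targets[mask]  # -> [K]
    return F.cross_entropy(logits, targets)
```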

cross-modal objective

Global Contrastive (GC) loss
the open-source CLIP implementation back-propagates only through the local GPU embedding gradients
FLAVA changes this to full back-propagation across all GPUs, and calls the result the global contrastive loss (a sketch follows below)
no masking is applied to the image or the text when computing the GC loss
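
A hedged sketch of the full back-propagation idea behind the GC loss, assuming a PyTorch distributed setup; the custom all-gather, the helper names, and the temperature value are illustrative rather than the paper's exact implementation:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

class AllGatherWithGrad(torch.autograd.Function):
    """all_gather whose backward also routes gradients back to every GPU."""

    @staticmethod
    def forward(ctx, x):
        out = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
        dist.all_gather(out, x)
        return torch.cat(out, dim=0)

    @staticmethod
    def backward(ctx, grad):
        grad = grad.contiguous()
        dist.all_reduce(grad)  # sum the gradients computed on every rank
        rank, bsz = dist.get_rank(), grad.size(0) // dist.get_world_size()
        return grad[rank * bsz:(rank + 1) * bsz]  # return this rank's slice

def global_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Simplified symmetric contrastive loss over the globally gathered batch."""
    img_all = AllGatherWithGrad.apply(F.normalize(img_emb, dim=-1))
    txt_all = AllGatherWithGrad.apply(F.normalize(txt_emb, dim=-1))
    logits = img_all @ txt_all.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```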

multi-modal objective

Masked Multi-modal Modeling (MMM)
tokenize both the image and the text
mask the image BEiT style and the text BERT style
attach an MLP to the multi-modal encoder output and train it to predict the visual codebook indices and word vocabulary indices of the masked positions

Image-Text Matching (ITM)
matching and mismatching image-text pairs can be generated from an image-text pair dataset
attach a classifier to the class token feature of the multi-modal encoder and train it to predict whether the input image and text match
(binary classification)
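
A minimal sketch of the ITM head and loss, assuming the multi-modal [CLS] features for matched and mismatched pairs are already computed; module and argument names are mine:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ITMHead(nn.Module):
    """Binary classifier on the multi-modal [CLS] feature: does this image match this text?"""

    def __init__(self, hidden=768):
        super().__init__()
        self.classifier = nn.Linear(hidden, 2)

    def forward(self, mm_cls_feature):  # [B, hidden]
        return self.classifier(mm_cls_feature)

def itm_loss(itm_head, cls_matched, cls_mismatched):
    """Matched pairs come from the dataset; mismatched pairs can be made by
    pairing each image with a caption from another example in the batch."""
    logits = itm_head(torch.cat([cls_matched, cls_mismatched], dim=0))
    labels = torch.cat([torch.ones(cls_matched.size(0), dtype=torch.long),
                        torch.zeros(cls_mismatched.size(0), dtype=torch.long)]).to(logits.device)
    return F.cross_entropy(logits, labels)
```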

2-stage training

stage 1 : encoder initialization from uni-modal pre-training
initialize the image encoder with ImageNet-1K DINO pre-trained weights
initialize the text encoder with weights MLM pre-trained on CCNews and BookCorpus

stage 2 : joint uni-modal and multi-modal training
FLAVA is trained with three families of objectives (a sketch of the combined loss follows below)
uni-modal → MIM, MLM
cross-modal → GC
multi-modal → MMM, ITM

Stage 2 trains with a BEiT-style MIM objective, so why use DINO in stage 1?
→ Empirically, initializing with DINO gave better performance than initializing with BEiT, so DINO is used
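
A rough sketch of how the stage-2 losses could be combined into one training objective; the per-objective loss values are assumed to be computed elsewhere (e.g. with the sketches above), and the uniform default weights are placeholders, not values from the paper:

```python
import torch

def combine_flava_losses(losses, weights=None):
    """Combine per-objective losses into a single stage-2 training loss.

    losses: dict of scalar tensors, e.g.
      uni-modal   -> 'mim', 'mlm'
      cross-modal -> 'gc'
      multi-modal -> 'mmm', 'itm'
    """
    weights = weights or {name: 1.0 for name in losses}  # placeholder weights
    return sum(weights[name] * value for name, value in losses.items())

# toy usage with dummy scalars standing in for the real objective losses
dummy = {name: torch.tensor(1.0) for name in ("mim", "mlm", "gc", "mmm", "itm")}
total_loss = combine_flava_losses(dummy)
```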

2. Experiments

2.1. Setup

data : public multi-modal datasets (PMD)

the dataset is built from publicly available sources

the YFCC100M dataset is filtered as follows (a filter sketch follows after this list)

  • text-based filtering
    discard non-English captions
    only keep captions that contain more than two words
  • image-based filtering
    filter based on the image's description field and title field
    (the paper does not describe exactly how this is done)
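
A small sketch of the text-based filter described above; the paper does not say which language detector is used, so it is passed in as a callable here:

```python
def keep_caption(caption, is_english):
    """Text-based filter sketch: keep English captions with more than two words."""
    caption = caption.strip()
    return bool(caption) and is_english(caption) and len(caption.split()) > 2

# usage with a trivial stand-in detector (real filtering would need a proper language-ID model)
captions = ["a dog playing in the park", "chat", "日落时的大海"]
kept = [c for c in captions if keep_caption(c, is_english=lambda s: s.isascii())]
```
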
implementation details

1e-3 learning rate
8,192 batch size
0.1 weight decay
10,000 iteration warm-up
AdamW optimizer
use Fully Sharded Data Parallel (FSDP)
full FP16 precision training except layer norm

the authors report that a large batch size, large weight decay, and long warm-up were essential for training with such a large learning rate
reason for using a pre-norm Transformer: it provides more robust learning for the text encoder under a large learning rate
hyperparameters were searched by monitoring the learning curve and zero-shot image classification accuracy
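
A hedged sketch of the reported optimization setup in PyTorch; the post-warm-up schedule and the AdamW betas are not specified in the notes above and are left at defaults here:

```python
import torch

def build_optimizer_and_warmup(model, lr=1e-3, weight_decay=0.1, warmup_iters=10_000):
    """AdamW with the reported lr / weight decay and a linear learning-rate warm-up.
    The learning rate is held constant after warm-up in this sketch."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=lambda step: min(1.0, (step + 1) / warmup_iters))
    return optimizer, scheduler

# per-iteration usage: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```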

2.2. Ablation

Table 3

full FLAVA pre-training achieves the best results

Table 4

$\textrm{CLIP}$ vs $\textrm{FLAVA}_C$
global back-propagation implementation in contrastive loss is critical to effective pre-training

$\textrm{FLAVA}_C$ vs $\textrm{FLAVA}_{MM}$
multi-modal objectives allow FLAVA to learn powerful representations for both uni-modal and multi-modal downstream tasks

$\textrm{FLAVA}_{MM}$ vs $\textrm{FLAVA}$ w/o init
macro average decreases slightly
adding different tasks to the mix makes the optimization much harder, especially when the whole model is randomly initialized
having some vision and language understanding is important before learning multi-modal tasks

$\textrm{FLAVA}$ w/o init vs $\textrm{FLAVA}$
pre-trained encoders boost the performance of FLAVA on all tasks

Table D.1

CLIP architecture
text tokenizer : lower-cased byte pair encoding with 49,152 vocab size
text encoder : post-norm Transformer with 512 hidden dim

FLAVA architecture
text tokenizer : BERT tokenizer (Wordpiece with 30,522 vocab size)
text encoder : pre-norm Transformer with 768 hidden dim

our architecture optimizations help achieve a better macro average overall

2.3. Comparison

vision tasks → linear probing
NLP tasks → fine-tuning
multi-modal tasks → fine-tuning

Table 5

FLAVA largely outperforms previous multi-modal approaches pre-trained on public data on both language and multi-modal tasks, and approaches the well-established BERT model on several GLUE tasks

FLAVA outperforms the variant of the CLIP model pre-trained only on the PMD dataset

FLAVA also has comparable performance to SimVLM on language tasks, while underperforming on multi-modal tasks and ImageNet linear evaluation
FLAVA is pre-trained on a much smaller dataset than SimVLM's 1.8B image-text pairs
we anticipate that FLAVA's performance will improve substantially as the pre-training dataset size increases

Figure 4

FLAVA is pre-trained on 70M image-text pairs, whereas CLIP is trained on 400M
however, FLAVA works significantly better on language and multi-modal tasks while slightly worse than CLIP on some vision-only tasks

FLAVA performs notably poorly on the SST task
the PMD dataset contains very little scene-text information, so the model never learns the ability to read text from images
as evidence, CLIP trained on PMD also performs poorly on the SST task
→ we anticipate that FLAVA will also be able to perform scene text reading when pre-trained on a larger dataset with enough scene text information
