Prior work empirically observed that downstream task performance improves as pre-training loss decreases, and that pre-training loss follows a scaling law
scaling law : power-law relationships between model performance and training set size, model size, and compute
(as model size, data, and compute grow, performance improves following a power law)
Knowing the scaling law gives valuable guidance for getting the best possible performance under a fixed compute budget (see the power-law fit sketch below)
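As a concrete illustration of how such a law gives compute-budget guidance, here is a minimal sketch that fits a power law in log-log space and extrapolates it; the numbers are synthetic and purely illustrative, not taken from the paper:

```python
import numpy as np

# Synthetic illustration only: compute budgets (FLOPs) and downstream error rates.
compute = np.array([1e18, 1e19, 1e20, 1e21])
error = np.array([0.42, 0.33, 0.26, 0.205])

# A power law error = a * compute^(-b) is linear in log-log space:
# log(error) = log(a) - b * log(compute)
slope, intercept = np.polyfit(np.log(compute), np.log(error), deg=1)
a, b = np.exp(intercept), -slope
print(f"fitted law: error ~= {a:.3g} * compute^(-{b:.3f})")

# The fitted law can then be extrapolated to estimate the error reachable for a budget.
budget = 1e22
print(f"predicted error at {budget:.0e} FLOPs: {a * budget ** (-b):.3f}")
```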
Prior work on scaling laws falls into two categories
experiments only in the uni-modal setting
uni-modal language or vision
experiments in the multi-modal setting
uses private data & models and a customized multi-stage training procedure
no thorough scaling investigation
The authors address the shortcomings of the second category and study scaling laws for contrastive language-image pre-training (CLIP) thoroughly (using the public LAION dataset + the open-source OpenCLIP repository)
The paper's two main contributions are as follows
scaling laws are also observed in contrastive language-image pre-training
consistent increase in performance when scaling model, data, and compute
identify power law scaling for multiple downstream tasks
(zero-shot classification, zero-shot retrieval, linear probing, end-to-end fine-tuning)
downstream task performance depends on the pre-training dataset distribution
OpenAI's CLIP and OpenCLIP show different scaling behavior
→ OpenAI's CLIP : larger scaling coefficients on zero-shot classification
→ OpenCLIP : larger scaling coefficients on zero-shot retrieval
Since OpenAI's CLIP and OpenCLIP differ essentially only in their dataset, the authors conclude that the pre-training dataset distribution drives the task-dependent differences in scaling behavior
In other words, the pre-training dataset determines the different scaling behavior across downstream tasks
Since this is an experiment-driven paper without much conceptual content, only the parts that seem important are briefly summarized
1. Datasets and Methods
dataset
LAION-400M, LAION-2B, LAION-5B
model
uses the CLIP architecture
the visual encoder and the text encoder are scaled up together (contrastive objective sketched below)
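For reference, a minimal sketch of the symmetric contrastive (InfoNCE) objective that both encoders are trained with; the shapes and the learnable temperature follow common CLIP implementations and are not code from the paper:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    # image_emb, text_emb: (batch, dim) outputs of the visual and text encoders.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise cosine similarities scaled by a learnable temperature (logit_scale).
    logits = logit_scale * image_emb @ text_emb.t()            # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image i should match text i, and vice versa.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```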
training duration
3B, 13B, 34B samples seen
other details
learning rate schedule
each run is a separate training experiment with a cosine annealing learning rate schedule adapted to its number of samples seen (sketch below)
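A minimal sketch of such a per-run cosine schedule; the linear warmup phase is an assumed detail, and total_steps would come from the samples-seen budget divided by the global batch size:

```python
import math

def cosine_lr(step, base_lr, warmup_steps, total_steps):
    # Linear warmup followed by cosine decay over the remainder of the run.
    # total_steps is derived from the samples-seen budget (e.g. 13B samples)
    # divided by the global batch size; warmup is an assumed detail.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```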
hyperparameter tuning
tune a small number of hyperparameters to optimize validation loss and prevent training instabilities
mixed precision with bfloat16 for larger models
training large models (ViT-L/14, H/14, g/14) with float16 mixed precision causes loss spikes that degrade performance
loss spikes persist even after adjusting the learning rate, learning rate schedule, and gradient clipping
switching from float16 mixed precision to bfloat16 mixed precision resolves the issue
why does switching to bfloat16 fix this?
→ larger models typically show larger activation values
→ bfloat16 is more suitable with its wider dynamic range
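The dynamic-range argument can be checked directly in PyTorch; a small illustration (not the paper's code) showing that values which overflow to inf in float16 stay finite in bfloat16:

```python
import torch

print(torch.finfo(torch.float16).max)    # 65504.0  -> narrow dynamic range
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38 -> float32-like exponent range

x = torch.tensor(70000.0)                # a "large activation" value
print(x.to(torch.float16))               # inf: overflow, a source of loss spikes
print(x.to(torch.bfloat16))              # 70144.0: coarser precision, but finite
```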
2. Scaling laws for different downstream tasks
2.1. Zero-shot image classification & retrieval
Figure 1 (b) - zero-shot image retrieval
for image-text data, measure whether the original image is among the top-K images whose embeddings are closest to the text caption (see the recall@K sketch after this block)
scaling model, data, and samples seen consistently improves performance
bottleneck effect : increasing samples seen while keeping model and training dataset size fixed does not improve performance
→ all axes have to be scaled together for performance to improve
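A minimal sketch of this text-to-image recall@K measurement on precomputed, L2-normalized embeddings; function and variable names are illustrative, not OpenCLIP's evaluation code:

```python
import torch

def text_to_image_recall_at_k(image_emb, text_emb, k=5):
    # image_emb, text_emb: (N, dim), row i of each belongs to the i-th image-text
    # pair; both are assumed to be L2-normalized already.
    sims = text_emb @ image_emb.t()                          # (N, N) cosine similarities
    topk = sims.topk(k, dim=1).indices                       # k closest images per caption
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    hits = (topk == targets).any(dim=1)                      # paired image in the top-k?
    return hits.float().mean().item()
```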
Figure 1
OpenAI CLIP performs well on ImageNet classification
OpenCLIP performs well on image retrieval
→ OpenAI CLIP and OpenCLIP have distinct scaling advantages over each other depending on the downstream task
→ task-specific scaling differences originate from the different pre-training datasets
2.2. Full and few-shot linear probing
linear probing : freeze the backbone and train only a linear classifier (sketch below)
scaling up consistently improves the accuracy of a linear classifier
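A minimal linear-probing sketch, assuming a generic frozen backbone that maps images to feat_dim-dimensional features; all names and hyperparameters here are placeholders:

```python
import torch
import torch.nn as nn

def linear_probe(backbone, loader, feat_dim, num_classes,
                 epochs=10, lr=1e-3, device="cuda"):
    # Backbone stays frozen; only the linear head receives gradient updates.
    backbone.eval().to(device)
    head = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():                     # no gradients through the backbone
                feats = backbone(images)
            loss = nn.functional.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```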
2.3. Fine-tuning
the text encoder is frozen, and only the vision encoder is trained in the CLIP fashion (see the freezing sketch below)
scaling up consistently improves the accuracy
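A small sketch of this setup: freeze every parameter outside the visual tower and optimize only what remains; the "visual." parameter prefix mirrors OpenCLIP's naming but is treated here as an assumption about the model object:

```python
import torch

def freeze_text_tower(model, lr=1e-5):
    # Freeze everything except the visual encoder so that only the vision tower
    # is updated during contrastive fine-tuning. The "visual." prefix is an
    # assumption about how the model names its parameters.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("visual.")
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)        # lr is an illustrative value
```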