[2023 CVPR] Reproducible scaling laws for contrastive language-image learning #209

Jasonlee1995 commented Jul 14, 2024

Prior work empirically observes that downstream task performance improves as pre-training loss decreases, and that pre-training loss follows a scaling law.

scaling law : power-law relationships between model performance and training set size, model size, and compute
(as model size, data, and compute grow, performance improves along a power law)

Knowing the scaling law gives valuable guidance for getting the best possible performance out of a fixed compute budget.
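
As a side note (not from the paper): a power law E = a * C^b is a straight line in log-log space, so its coefficients can be estimated with a simple linear fit and then extrapolated to larger budgets. A minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical (compute, downstream error) pairs -- illustrative numbers only.
compute = np.array([1e18, 1e19, 1e20, 1e21])
error = np.array([0.52, 0.41, 0.33, 0.26])

# A power law E = a * C**b is linear in log-log space: log E = log a + b * log C.
b, log_a = np.polyfit(np.log(compute), np.log(error), deg=1)
a = np.exp(log_a)

# Extrapolating the fit gives a rough planning estimate for a bigger budget.
predicted = a * (1e22) ** b
print(f"E ~= {a:.3g} * C^{b:.3f}; predicted error at 1e22: {predicted:.3f}")
```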

Prior work on scaling laws falls into two categories:

  1. experiments only in the uni-modal setting
    uni-modal language or vision
  2. experiments in the multi-modal setting
    use private data & models and customized multi-stage training procedures
    no thorough scaling investigation

The authors address the shortcomings of the second line of work and run a careful scaling-law study of contrastive language-image pre-training (CLIP)
(using the public LAION dataset and the open-source OpenCLIP repository).

The paper's two main contributions are:

  1. scaling laws are also observed in contrastive language-image pre-training
    consistent increase in performance when scaling model, data, and compute
    identify power-law scaling for multiple downstream tasks
    (zero-shot classification, zero-shot retrieval, linear probing, end-to-end fine-tuning)
  2. downstream task performance depends on the pre-training dataset distribution
    OpenAI's CLIP and OpenCLIP show different scaling behavior
    → OpenAI's CLIP : larger scaling coefficients on zero-shot classification
    → OpenCLIP : larger scaling coefficients on zero-shot retrieval
    since the pre-training dataset is essentially the only difference between OpenAI's CLIP and OpenCLIP, the authors attribute the task-dependent differences in scaling behavior to the pre-training dataset distribution
    i.e., different pre-training datasets lead to different downstream scaling behavior

This is an experiment-driven paper without much conceptual machinery, so only the parts I consider important are briefly summarized below.

1. Datasets and Methods

  • dataset
    LAION-400M, LAION-2B, LAION-5B
  • model
    uses the CLIP architecture
    visual encoder and text encoder are scaled up together
  • training duration
    3B, 13B, 34B samples seen
  • other details
    • learning rate schedule
      each training duration is a separate experiment with a cosine annealing learning rate schedule adapted to its number of samples (see the schedule sketch after the tables below)
    • hyperparameter tuning
      tune a small number of hyperparameters to optimize validation loss and prevent training instabilities
    • mixed precision with bfloat16 for larger models
      training the large models (ViT-L/14, H/14, g/14) with float16 mixed precision produces loss spikes that hurt performance
      the loss spikes persist even after adjusting the learning rate, the learning rate schedule, and gradient clipping
      switching from float16 mixed precision to bfloat16 mixed precision fixes it
      why does switching to bfloat16 fix this?
      → larger models typically show larger activation values
      → bfloat16 is more suitable with its wider dynamic range
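
A minimal sketch of what the float16 → bfloat16 switch looks like in PyTorch; the model/loss call signatures here follow OpenCLIP's convention but are placeholders, not the paper's training code:

```python
import torch

def train_step(model, images, texts, loss_fn, optimizer):
    # The (images, texts) -> (image_features, text_features, logit_scale)
    # signature follows OpenCLIP's convention; treat it as a placeholder.
    optimizer.zero_grad(set_to_none=True)
    # bfloat16 keeps float32's exponent range, so the large activations of
    # ViT-L/H/g models are far less likely to overflow than with float16.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        image_features, text_features, logit_scale = model(images, texts)
        loss = loss_fn(image_features, text_features, logit_scale)
    # No GradScaler here: loss scaling works around float16's narrow dynamic
    # range, which bfloat16 does not share.
    loss.backward()
    optimizer.step()
    return loss.item()
```
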
[image: model details]
[image: hyperparameter details]
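
On the learning-rate bullet above: each samples-seen budget (3B / 13B / 34B) is trained as a separate run, with the cosine schedule stretched over that budget instead of resuming a shorter run. A rough sketch of such a schedule (warmup length, peak learning rate, and batch size are placeholders, not the paper's values):

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-3, warmup_steps=2000):
    """Linear warmup, then cosine decay stretched over the full run."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Each samples-seen budget is a separate run with its own schedule
# (32k is just an illustrative global batch size).
for samples_seen in (3e9, 13e9, 34e9):
    total_steps = int(samples_seen // 32_768)
    print(f"{samples_seen:.0e} samples -> {total_steps} steps, "
          f"mid-run lr = {lr_at_step(total_steps // 2, total_steps):.2e}")
```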

2. Scaling laws for different downstream tasks

2.1. Zero-shot image classification & retrieval

Figure 1 (a) - zero-shot image classification
scaling up model, data, and samples seen consistently improves performance
bottleneck effect : increasing the train dataset size while keeping model size and samples seen fixed does not improve performance
→ all three axes have to be scaled together for performance to improve
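
For reference, zero-shot classification with an OpenCLIP checkpoint looks roughly like this (the model tag, pretrained tag, image path, and prompts are illustrative, not the paper's evaluation code):

```python
import torch
import open_clip
from PIL import Image

# Illustrative model/pretrained tags; any OpenCLIP checkpoint works the same way.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image
class_names = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text = tokenizer(class_names)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # The class with the highest cosine similarity wins; no classifier is trained.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(class_names[probs.argmax().item()])
```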

Figure 1 (b) - zero-shot image retrieval
for image-text data, measures whether the original image appears among the top-K images whose embeddings are closest to the text caption's embedding
scaling up model, data, and samples seen consistently improves performance
bottleneck effect : increasing samples seen while keeping model size and train dataset size fixed does not improve performance
→ all three axes have to be scaled together for performance to improve
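
A minimal sketch of this Recall@K-style metric, assuming L2-normalized image and text embeddings have already been computed (text → image direction):

```python
import torch

def text_to_image_recall_at_k(image_embs, text_embs, k=5):
    """image_embs, text_embs: (N, D) L2-normalized; row i of each belongs to pair i."""
    sims = text_embs @ image_embs.T                 # (N, N) cosine similarities
    topk = sims.topk(k, dim=1).indices              # top-K image indices per caption
    targets = torch.arange(len(text_embs)).unsqueeze(1)
    hits = (topk == targets).any(dim=1)             # is the paired image in the top-K?
    return hits.float().mean().item()

# Example with random embeddings (real embeddings come from the CLIP encoders):
img = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
txt = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
print(text_to_image_recall_at_k(img, txt, k=5))
```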

Figure 1
OpenAI CLIP performs better on ImageNet classification
OpenCLIP performs better on image retrieval
→ OpenAI CLIP and OpenCLIP have distinct scaling advantages over each other depending on the downstream task
→ task-specific scaling differences originate from the different pre-training datasets

2.2. Full and few-shot linear probing

linear probing : freeze the backbone, then train only a linear classifier on top
scaling up consistently improves the accuracy of a linear classifier
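
A minimal linear-probing sketch, assuming features from a frozen backbone have already been extracted to disk (the file names and the scikit-learn probe are assumptions, not the paper's setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-extracted features from a frozen CLIP image encoder.
train_feats, train_labels = np.load("train_feats.npy"), np.load("train_labels.npy")
test_feats, test_labels = np.load("test_feats.npy"), np.load("test_labels.npy")

# Only this linear classifier is trained; the backbone never sees a gradient.
clf = LogisticRegression(max_iter=1000, C=1.0)
clf.fit(train_feats, train_labels)
print("linear probe accuracy:", clf.score(test_feats, test_labels))
```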

2.3. Fine-tuning

freeze the text encoder, then train only the vision encoder in the CLIP fashion
scaling up consistently improves the accuracy
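
A rough sketch of the setup as described here (freeze the text tower, keep training the image tower with the contrastive loss); the `visual` parameter prefix follows OpenCLIP's module layout but is an assumption, not the paper's fine-tuning code:

```python
import torch

def freeze_text_tower(model):
    # Freeze every parameter outside the image tower; in OpenCLIP the image
    # tower lives under `visual` (this naming is an assumption here).
    for name, param in model.named_parameters():
        if not name.startswith("visual"):
            param.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # Standard symmetric InfoNCE loss over the in-batch similarity matrix.
    logits = logit_scale * image_features @ text_features.T
    labels = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (torch.nn.functional.cross_entropy(logits, labels)
                  + torch.nn.functional.cross_entropy(logits.T, labels))

# Usage (model assumed to be an OpenCLIP model instance):
# trainable = freeze_text_tower(model)
# optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```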
