Prior work empirically observed that downstream task performance improves as pre-training loss decreases, and that pre-training loss follows a scaling law
scaling law : power-law relationships between model performance and training set size, model size, and compute
(as model size, data, and compute grow, performance improves following a power law)
Knowing the scaling law gives valuable guidance for getting the best possible performance under a fixed compute budget (see the power-law fit sketch below)
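As a concrete illustration of how such a law gives compute-budget guidance, here is a minimal sketch that fits a power law in log-log space and extrapolates it; the numbers are synthetic and purely illustrative, not taken from the paper:

```python
import numpy as np

# Synthetic illustration only: compute budgets (FLOPs) and downstream error rates.
compute = np.array([1e18, 1e19, 1e20, 1e21])
error = np.array([0.42, 0.33, 0.26, 0.205])

# A power law error = a * compute^(-b) is linear in log-log space:
# log(error) = log(a) - b * log(compute)
slope, intercept = np.polyfit(np.log(compute), np.log(error), deg=1)
a, b = np.exp(intercept), -slope
print(f"fitted law: error ~= {a:.3g} * compute^(-{b:.3f})")

# The fitted law can then be extrapolated to estimate the error reachable for a budget.
budget = 1e22
print(f"predicted error at {budget:.0e} FLOPs: {a * budget ** (-b):.3f}")
```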
Prior work on scaling laws falls into two categories
experiments only in the uni-modal setting
uni-modal language or vision
experiments in the multi-modal setting
uses private data & models and a customized multi-stage training procedure
no thorough scaling investigation
The authors address the shortcomings of the second category and study scaling laws for contrastive language-image pre-training (CLIP) thoroughly (using the public LAION dataset + the open-source OpenCLIP repository)
The paper's two main contributions are as follows
scaling laws are also observed in contrastive language-image pre-training
consistent increase in performance when scaling model, data, and compute
identify power law scaling for multiple downstream tasks
(zero-shot classification, zero-shot retrieval, linear probing, end-to-end fine-tuning)
downstream task performance depends on the pre-training dataset distribution
OpenAI's CLIP and OpenCLIP show different scaling behavior
→ OpenAI's CLIP : larger scaling coefficients on zero-shot classification
→ OpenCLIP : larger scaling coefficients on zero-shot retrieval
Since OpenAI's CLIP and OpenCLIP differ essentially only in their dataset, the authors conclude that the pre-training dataset distribution drives the task-dependent differences in scaling behavior
In other words, the pre-training dataset determines the different scaling behavior across downstream tasks
Since this is an experiment-driven paper without much conceptual content, only the parts that seem important are briefly summarized
1. Datasets and Methods
dataset
LAION-400M, LAION-2B, LAION-5B
model
uses the CLIP architecture
the visual encoder and the text encoder are scaled up together (contrastive objective sketched below)
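For reference, a minimal sketch of the symmetric contrastive (InfoNCE) objective that both encoders are trained with; the shapes and the learnable temperature follow common CLIP implementations and are not code from the paper:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    # image_emb, text_emb: (batch, dim) outputs of the visual and text encoders.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise cosine similarities scaled by a learnable temperature (logit_scale).
    logits = logit_scale * image_emb @ text_emb.t()            # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image i should match text i, and vice versa.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```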
training duration
3B, 13B, 34B samples seen
other details
learning rate schedule
each run is a separate training experiment with a cosine annealing learning rate schedule adapted to its number of samples seen (sketch below)
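A minimal sketch of such a per-run cosine schedule; the linear warmup phase is an assumed detail, and total_steps would come from the samples-seen budget divided by the global batch size:

```python
import math

def cosine_lr(step, base_lr, warmup_steps, total_steps):
    # Linear warmup followed by cosine decay over the remainder of the run.
    # total_steps is derived from the samples-seen budget (e.g. 13B samples)
    # divided by the global batch size; warmup is an assumed detail.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```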
hyperparameter tuning
tune a small number of hyperparameters to optimize validation loss and prevent training instabilities
mixed precision with bfloat16 for larger models
training large models (ViT-L/14, H/14, g/14) with float16 mixed precision causes loss spikes that degrade performance
loss spikes persist even after adjusting the learning rate, learning rate schedule, and gradient clipping
switching from float16 mixed precision to bfloat16 mixed precision resolves the issue
why does switching to bfloat16 fix this?
→ larger models typically show larger activation values
→ bfloat16 is more suitable with its wider dynamic range
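The dynamic-range argument can be checked directly in PyTorch; a small illustration (not the paper's code) showing that values which overflow to inf in float16 stay finite in bfloat16:

```python
import torch

print(torch.finfo(torch.float16).max)    # 65504.0  -> narrow dynamic range
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38 -> float32-like exponent range

x = torch.tensor(70000.0)                # a "large activation" value
print(x.to(torch.float16))               # inf: overflow, a source of loss spikes
print(x.to(torch.bfloat16))              # 70144.0: coarser precision, but finite
```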
2. Scaling laws for different downstream tasks
2.1. Zero-shot image classification & retrieval
Figure 1 (b) - zero-shot image retrieval
for image-text data, measure whether the original image is among the top-K images whose embeddings are closest to the text caption (see the recall@K sketch after this block)
scaling model, data, and samples seen consistently improves performance
bottleneck effect : increasing samples seen while keeping model and training dataset size fixed does not improve performance
→ all axes have to be scaled together for performance to improve
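A minimal sketch of this text-to-image recall@K measurement on precomputed, L2-normalized embeddings; function and variable names are illustrative, not OpenCLIP's evaluation code:

```python
import torch

def text_to_image_recall_at_k(image_emb, text_emb, k=5):
    # image_emb, text_emb: (N, dim), row i of each belongs to the i-th image-text
    # pair; both are assumed to be L2-normalized already.
    sims = text_emb @ image_emb.t()                          # (N, N) cosine similarities
    topk = sims.topk(k, dim=1).indices                       # k closest images per caption
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    hits = (topk == targets).any(dim=1)                      # paired image in the top-k?
    return hits.float().mean().item()
```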
Figure 1
OpenAI CLIP performs well on ImageNet classification
OpenCLIP performs well on image retrieval
→ OpenAI CLIP and OpenCLIP have distinct scaling advantages over each other depending on the downstream task
→ task-specific scaling differences originate from the different pre-training datasets
2.2. Full and few-shot linear probing
linear probing : freeze the backbone and train only a linear classifier (sketch below)
scaling up consistently improves the accuracy of a linear classifier
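A minimal linear-probing sketch, assuming a generic frozen backbone that maps images to feat_dim-dimensional features; all names and hyperparameters here are placeholders:

```python
import torch
import torch.nn as nn

def linear_probe(backbone, loader, feat_dim, num_classes,
                 epochs=10, lr=1e-3, device="cuda"):
    # Backbone stays frozen; only the linear head receives gradient updates.
    backbone.eval().to(device)
    head = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():                     # no gradients through the backbone
                feats = backbone(images)
            loss = nn.functional.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```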
2.3. Fine-tuning
the text encoder is frozen, and only the vision encoder is trained in the CLIP fashion (see the freezing sketch below)
scaling up consistently improves the accuracy
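A small sketch of this setup: freeze every parameter outside the visual tower and optimize only what remains; the "visual." parameter prefix mirrors OpenCLIP's naming but is treated here as an assumption about the model object:

```python
import torch

def freeze_text_tower(model, lr=1e-5):
    # Freeze everything except the visual encoder so that only the vision tower
    # is updated during contrastive fine-tuning. The "visual." prefix is an
    # assumption about how the model names its parameters.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("visual.")
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)        # lr is an illustrative value
```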