We provide scripts for evaluating and training models on task datasets. The following benchmark results are included for reference.

Pretraining | Training |
---|---|
COCO (download), Visual Genome (download), SBU (download), CC3M (download), CC12M (download) | script |

Tasks | Retrieval | R1 | R5 | R10 | Training | Evaluation |
---|---|---|---|---|---|---|
TR | COCO (download) | 77.6 | 94.1 | 97.2 | script | script |
IR | COCO (download) | 61.0 | 84.5 | 90.7 | script | script |
TR | Flickr30k (download) | 77.6 | 94.1 | 97.2 | script | script |
IR | Flickr30k (download) | 61.0 | 84.5 | 90.7 | script | script |

VQA | test-dev | test-std/test | Training | Evaluation |
---|---|---|---|---|
VQAv2 (download) | 76.35 | 76.54 | script | script |
OKVQA (download) | NA | 54.7 | script | NA |
AOKVQA (download) | 54.5 | NA | script | NA |

Multimodal Classification | val | test | Training | Evaluation |
---|---|---|---|---|
SNLI-VE (download) | 80.60 | 81.04 | script | script |
NLVR2 (download) | 82.47 | 82.91 | script | script |

Pretraining (14M) | Training |
---|---|
COCO (download), Visual Genome (download), SBU (download), CC3M (download), CC12M (download) | script |

Tasks | Retrieval | R1 | R5 | R10 | Training | Evaluation |
---|---|---|---|---|---|---|
TR | COCO (download) | 82.0 | 95.8 | 98.1 | script | script |
IR | COCO (download) | 64.5 | 86.0 | 91.7 | script | script |
TR | Flickr30k (download) | 96.9 | 99.9 | 100.0 | script | script |
IR | Flickr30k (download) | 87.5 | 97.6 | 98.9 | script | script |

VQA | test-dev | test-std/test | Training | Evaluation |
---|---|---|---|---|
VQAv2 (download) | 78.23 | 78.29 | script | script |
OKVQA (download) | NA | 55.4 | script | script |
AOKVQA (download) | 56.2 | 50.1 | script | script |

Image Captioning | BLEU@4 | CIDEr | SPICE | Training | Evaluation |
---|---|---|---|---|---|
COCO (download) | 39.9 | 133.5 | 23.7 | script | script |
NoCaps (download) | 31.9 | 109.1 | 14.7 | NA | script |

Multimodal Classification | val | test | Training | Evaluation |
---|---|---|---|---|
NLVR2 (download) | 82.48 | 83.25 | script | script |

Tasks | Retrieval (Zero-shot) | R1 | R5 | R10 | Evaluation |
---|---|---|---|---|---|
TR | COCO (download) | 57.2 | 80.5 | 87.8 | script |
IR | COCO (download) | 36.5 | 60.8 | 71.0 | script |
TR | Flickr30k (download) | 86.5 | 98.0 | 99.1 | script |
IR | Flickr30k (download) | 67.0 | 88.9 | 93.3 | script |

Multimodal Classification | val | Evaluation |
---|---|---|
ImageNet | 76.5 | script |

Tasks | Retrieval | R1 | R5 | R10 | Training | Evaluation |
---|---|---|---|---|---|---|
TR | MSRVTT (download) | 33.2 | 60.5 | 71.7 | script | script |
VR | MSRVTT (download) | 33.8 | 61.4 | 72.7 | script | script |
TR | DiDeMo (download) | 38.8 | 66.4 | 76.8 | script | script |
VR | DiDeMo (download) | 36.6 | 67.5 | 77.9 | script | script |

Video QA | test | Training | Evaluation |
---|---|---|---|
MSRVTT | 42.1 | script | script |
MSVD | 46.0 | script | script |
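
The R1, R5, and R10 columns in the retrieval tables above are recall-at-K scores: the percentage of queries for which at least one correct match appears among the top K ranked candidates. The sketch below illustrates that definition on a toy similarity matrix. It is not the evaluation script linked in the tables; the function name `recall_at_k` and its arguments are hypothetical and chosen only for illustration.

```python
def recall_at_k(sim, gt, ks=(1, 5, 10)):
    """Compute recall@K as a percentage (illustrative helper, not a library API).

    sim: list of rows, sim[i][j] is the similarity of query i to candidate j.
    gt:  list of sets, gt[i] holds the indices of the correct candidates for query i.
    """
    # Rank candidates for each query from most to least similar.
    rankings = [sorted(range(len(row)), key=lambda j: -row[j]) for row in sim]
    results = {}
    for k in ks:
        # A query counts as a hit if any correct candidate is in its top k.
        hits = sum(1 for i, ranking in enumerate(rankings) if gt[i] & set(ranking[:k]))
        results[k] = 100.0 * hits / len(sim)
    return results


# Toy data: 3 queries, 4 candidates, one correct candidate per query.
sim = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
    [0.5, 0.6, 0.1, 0.7],
]
gt = [{0}, {1}, {2}]
scores = recall_at_k(sim, gt, ks=(1, 2))
print(scores)
```

For text retrieval (TR) the queries are images or videos and the candidates are captions; for image or video retrieval (IR, VR) the roles are swapped. When a query has several correct candidates, as with multiple captions per image, `gt[i]` simply contains more than one index.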