Table of Contents
Title | Venue | Date | Code | Page |
---|---|---|---|---|
VCR: Visual Caption Restoration | arXiv | 2024-06-10 | Github | - |
SEED-Bench-2: Benchmarking Multimodal Large Language Models | NeurIPS | 2023-11-28 | Github | - |
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models | arXiv | 2023-05-13 | Github | Page |
WebSRC: A Dataset for Web-Based Structural Reading Comprehension | CVPR | 2023-01-23 | Github | Page |
OCR-VQA: Visual Question Answering by Reading Text in Images | ICDAR | 2019-09-20 | - | Page |
Towards VQA Models That Can Read | CVPR | 2019-04-18 | Github | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM | arXiv | 2025-03-18 | Github | Page |
MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | CVPR | 2024-11-27 | Github | Page |
MMMU-Pro: A More Robust Multi-Discipline Multimodal Understanding Benchmark | arXiv | 2024-09-04 | Github | - |
CMMU: A Benchmark for Chinese Multi-Modal Multi-Type Question Understanding and Reasoning | arXiv | 2024-01-25 | Github | Page |
CMMMU: A Chinese Massive Multi-Discipline Multimodal Understanding Benchmark | arXiv | 2024-01-22 | Github | Page |
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | NeurIPS | 2022-09-20 | Github | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs | arXiv | 2024-07-01 | Github | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs | NeurIPS | 2024-06-17 | Github | Page |
ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models | arXiv | 2024-05-29 | - | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | CVPR | 2024-11-27 | Github | Page |
Sparkles: Unlocking Chats Across Multiple Images for Multi-Modal Instruction-Following Models | arXiv | 2023-08-23 | Github | Page |
Vega: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | arXiv | 2024-06-14 | Github | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | arXiv | 2024-08-23 | Github | Page |
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | CVPR | 2023-12-21 | Github | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
ReferItGame: Referring to Objects in Photographs of Natural Scenes | EMNLP | 2014-10-25 | - | - |
Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models | arXiv | 2024-06-24 | Github | - |
Generation and Comprehension of Unambiguous Object Descriptions | CVPR | 2016-06-26 | Github | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples | NeurIPS | 2024-10-18 | Github | Page |
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | arXiv | 2024-06-20 | Github | - |
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | CVPR | 2024-06-11 | Github | Page |
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-Level Vision | ICLR | 2023-09-25 | Github | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Models | arXiv | 2024-06-20 | Github | - |
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | arXiv | 2023-11-06 | Github | Page |
MM-SPUBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs | arXiv | 2024-06-24 | - | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries? | arXiv | 2024-06-22 | Github | Page |
Efficiently Adversarial Examples Generation for Visual-Language Models under Targeted Transfer Scenarios Using Diffusion Models | arXiv | 2024-04-16 | - | - |
On Evaluating Adversarial Robustness of Large Vision-Language Models | NeurIPS | 2023-05-26 | Github | Page |
Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study | arXiv | 2024-06-11 | Github | Page |
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | arXiv | 2023-11-27 | Github | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
Cello: Causal Evaluation of Large Vision-Language Models | arXiv | 2024-06-27 | Github | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI | arXiv | 2024-08-06 | Github | Page |
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM | CVPR | 2024-02-14 | Github | - |
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | arXiv | 2023-05-17 | Github | Page |
Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data | arXiv | 2023-08-04 | Github | Page |
PathVQA: 30000+ Questions for Medical Visual Question Answering | arXiv | 2020-05-07 | - | - |
SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering | ISBI | 2021-04-13 | - | Page |
A Dataset of Clinically Generated Visual Questions and Answers about Radiology Images | Scientific Data | 2018-11-20 | - | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
EmoLLM: Multimodal Emotional Understanding Meets Large Language Models | arXiv | 2024-06-24 | Github | - |
VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding | arXiv | 2024-06-18 | Github | - |
Facial Affective Behavior Analysis with Instruction Tuning | ECCV | 2024-04-07 | Github | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | arXiv | 2024-08-23 | Github | Page |
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation | CVPR | 2023-12-19 | Github | - |
RSGPT: A Remote Sensing Vision Language Model and Benchmark | arXiv | 2023-07-28 | Github | - |
RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data | IEEE TGRS | 2022-10-23 | Github | - |
Visual Grounding in Remote Sensing Images | ACM MM | 2022-10-10 | - | Page |
Open-Ended Remote Sensing Visual Question Answering with Transformers | Int. J. Remote Sens. | 2022-08-10 | - | - |
Mutual Attention Inception Network for Remote Sensing Visual Question Answering | IEEE TGRS | 2021-05-20 | Github | - |
RSVQA: Visual Question Answering for Remote Sensing Data | IEEE TGRS | 2020-03-16 | - | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception | arXiv | 2024-01-29 | Github | - |
GPT4Tools: Teaching Large Language Model to Use Tools via Self-Instruction | NeurIPS | 2023-05-30 | Github | Page |
AppAgent: Multimodal Agents as Smartphone Users | arXiv | 2023-12-21 | Github | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
Web2Code: A Large-Scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs | arXiv | 2024-06-28 | Github | Page |
ChartMimic: Evaluating LLM’s Cross-Modal Reasoning Capability via Chart-to-Code Generation | arXiv | 2024-06-14 | Github | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
ScreenAI: A Vision-Language Model for UI and Infographics Understanding | IJCAI | 2024-02-07 | Github | - |
ScreenQA: Large-Scale Question-Answer Pairs Over Mobile App Screenshots | arXiv | 2022-09-16 | Github | - |
Towards Better Semantic Understanding of Mobile Interfaces | arXiv | 2022-10-06 | Github | - |
Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning | UIST | 2021-08-07 | Github | - |
Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements | arXiv | 2020-10-08 | Github | - |
Resolving Referring Expressions in Images with Labeled Elements | SLT | 2018-10-24 | - | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
Benchmarking Large Multimodal Models Against Common Corruptions | arXiv | 2024-01-22 | Github | - |
BenchLMM: Benchmarking Cross-Style Visual Capability of Large Multimodal Models | ECCV | 2023-12-05 | Github | - |
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | arXiv | 2023-11-27 | Github | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
VLKEB: A Large Vision-Language Model Knowledge Editing Benchmark | NeurIPS | 2024-03-12 | Github | Page |
Can We Edit Multimodal Large Language Models? | EMNLP | 2023-10-12 | Github | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents | arXiv | 2024-03-28 | - | Page |
EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI | CVPR | 2023-12-26 | Github | Page |
SQA3D: Situated Question Answering in 3D Scenes | ICLR | 2022-10-14 | Github | Page |
Episodic Memory Question Answering | CVPR | 2022-05-03 | - | Page |
Embodied Question Answering | CVPR | 2017-11-30 | Github | Page |
The EPIC-Kitchens Dataset: Collection, Challenges and Baselines | IEEE TPAMI | 2020-04-29 | - | Page |
Ego4D: Around the World in 3,000 Hours of Egocentric Video | CVPR | 2021-10-13 | - | Page |
A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility | ECCV | 2022-02-04 | Github | - |