Table of Contents
Title | Venue | Date | Code | Page |
---|---|---|---|---|
VCR: Visual Caption Restoration | arXiv | 2024-06-10 | Github | - |
SEED-Bench-2: Benchmarking Multimodal Large Language Models | NeurIPS | 2023-11-28 | Github | - |
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models | arXiv | 2023-05-13 | Github | Page |
WebSRC: A Dataset for Web-Based Structural Reading Comprehension | CVPR | 2023-01-23 | Github | Page |
OCR-VQA: Visual Question Answering by Reading Text in Images | ICDAR | 2019-09-20 | - | Page |
Towards VQA Models That Can Read | CVPR | 2019-04-18 | Github | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM | arXiv | 2025-03-18 | Github | Page |
MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | CVPR | 2024-11-27 | Github | Page |
MMMU-Pro: A More Robust Multi-Discipline Multimodal Understanding Benchmark | arXiv | 2024-09-04 | Github | - |
CMMU: A Benchmark for Chinese Multi-Modal Multi-Type Question Understanding and Reasoning | arXiv | 2024-01-25 | Github | Page |
CMMMU: A Chinese Massive Multi-Discipline Multimodal Understanding Benchmark | arXiv | 2024-01-22 | Github | Page |
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | NeurIPS | 2022-09-20 | Github | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs | arXiv | 2024-07-01 | Github | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs | NeurIPS | 2024-06-17 | Github | Page |
ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models | arXiv | 2024-05-29 | - | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | CVPR | 2024-11-27 | Github | Page |
Sparkles: Unlocking Chats Across Multiple Images for Multi-Modal Instruction-Following Models | arXiv | 2023-08-23 | Github | Page |
Vega: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | arXiv | 2024-06-14 | Github | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | arXiv | 2024-08-23 | Github | Page |
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | CVPR | 2023-12-21 | Github | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
ReferItGame: Referring to Objects in Photographs of Natural Scenes | EMNLP | 2014-10-25 | - | - |
Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models | arXiv | 2024-06-24 | Github | - |
Generation and Comprehension of Unambiguous Object Descriptions | CVPR | 2016-06-26 | Github | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples | NeurIPS | 2024-10-18 | Github | Page |
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | arXiv | 2024-06-20 | Github | - |
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | CVPR | 2024-06-11 | Github | Page |
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-Level Vision | ICLR | 2023-09-25 | Github | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Models | arXiv | 2024-06-20 | Github | - |
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | arXiv | 2023-11-06 | Github | Page |
MM-SPUBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs | arXiv | 2024-06-24 | - | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries? | arXiv | 2024-06-22 | Github | Page |
Efficiently Adversarial Examples Generation for Visual-Language Models under Targeted Transfer Scenarios Using Diffusion Models | arXiv | 2024-04-16 | - | - |
On Evaluating Adversarial Robustness of Large Vision-Language Models | NeurIPS | 2023-05-26 | Github | Page |
Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study | arXiv | 2024-06-11 | Github | Page |
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | arXiv | 2023-11-27 | Github | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
Cello: Causal Evaluation of Large Vision-Language Models | arXiv | 2024-06-27 | Github | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI | arXiv | 2024-08-06 | Github | Page |
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM | CVPR | 2024-02-14 | Github | - |
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | arXiv | 2023-05-17 | Github | Page |
Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data | arXiv | 2023-08-04 | Github | Page |
PathVQA: 30000+ Questions for Medical Visual Question Answering | arXiv | 2020-05-07 | - | - |
SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering | ISBI | 2021-04-13 | - | Page |
A Dataset of Clinically Generated Visual Questions and Answers about Radiology Images | Scientific Data | 2018-11-20 | - | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
EmoLLM: Multimodal Emotional Understanding Meets Large Language Models | arXiv | 2024-06-24 | Github | - |
VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding | arXiv | 2024-06-18 | Github | - |
Facial Affective Behavior Analysis with Instruction Tuning | ECCV | 2024-04-07 | Github | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | arXiv | 2024-08-23 | Github | Page |
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation | CVPR | 2023-12-19 | Github | - |
RSGPT: A Remote Sensing Vision Language Model and Benchmark | arXiv | 2023-07-28 | Github | - |
RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data | IEEE TGRS | 2022-10-23 | Github | - |
Visual Grounding in Remote Sensing Images | ACM MM | 2022-10-10 | - | Page |
Open-Ended Remote Sensing Visual Question Answering with Transformers | Int. J. Remote Sens. | 2022-08-10 | - | - |
Mutual Attention Inception Network for Remote Sensing Visual Question Answering | IEEE TGRS | 2021-05-20 | Github | - |
RSVQA: Visual Question Answering for Remote Sensing Data | IEEE TGRS | 2020-03-16 | - | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception | arXiv | 2024-01-29 | Github | - |
GPT4Tools: Teaching Large Language Model to Use Tools via Self-Instruction | NeurIPS | 2023-05-30 | Github | Page |
AppAgent: Multimodal Agents as Smartphone Users | arXiv | 2023-12-21 | Github | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
Web2Code: A Large-Scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs | arXiv | 2024-06-28 | Github | Page |
ChartMimic: Evaluating LLM’s Cross-Modal Reasoning Capability via Chart-to-Code Generation | arXiv | 2024-06-14 | Github | Page |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
ScreenAI: A Vision-Language Model for UI and Infographics Understanding | IJCAI | 2024-02-07 | Github | - |
ScreenQA: Large-Scale Question-Answer Pairs Over Mobile App Screenshots | arXiv | 2022-09-16 | Github | - |
Towards Better Semantic Understanding of Mobile Interfaces | arXiv | 2022-10-06 | Github | - |
Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning | UIST | 2021-08-07 | Github | - |
Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements | arXiv | 2020-10-08 | Github | - |
Resolving Referring Expressions in Images with Labeled Elements | SLT | 2018-10-24 | - | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
Benchmarking Large Multimodal Models Against Common Corruptions | arXiv | 2024-01-22 | Github | - |
BenchLMM: Benchmarking Cross-Style Visual Capability of Large Multimodal Models | ECCV | 2023-12-05 | Github | - |
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | arXiv | 2023-11-27 | Github | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
VLKEB: A Large Vision-Language Model Knowledge Editing Benchmark | NeurIPS | 2024-03-12 | Github | Page |
Can We Edit Multimodal Large Language Models? | EMNLP | 2023-10-12 | Github | - |
Title | Venue | Date | Code | Page |
---|---|---|---|---|
RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents | arXiv | 2024-03-28 | - | Page |
EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI | CVPR | 2023-12-26 | Github | Page |
SQA3D: Situated Question Answering in 3D Scenes | ICLR | 2022-10-14 | Github | Page |
Episodic Memory Question Answering | CVPR | 2022-05-03 | - | Page |
Embodied Question Answering | CVPR | 2017-11-30 | Github | Page |
The EPIC-Kitchens Dataset: Collection, Challenges and Baselines | IEEE TPAMI | 2020-04-29 | - | Page |
Ego4D: Around the World in 3,000 Hours of Egocentric Video | CVPR | 2021-10-13 | - | Page |
A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility | ECCV | 2022-02-04 | Github | - |