BradyFU/Awesome-Multimodal-Large-Language-Models

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs



Foundational Capability

Comprehensive Evaluation

Title Venue Date Code Page
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models arXiv 2025-02-14 Github Page
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? arXiv 2024-08-23 Github Page
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences arXiv 2024-06-16 Github Page
Are We on the Right Way for Evaluating Large Vision-Language Models? NeurIPS 2024-03-29 Github Page
BLINK: Multimodal Large Language Models Can See but Not Perceive ECCV 2024-04-18 Github Page
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI ICML 2024-04-24 Github Page
SEED-Bench-2: Benchmarking Multimodal Large Language Models NeurIPS 2023-11-28 Github -
InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models arXiv 2023-11-20 Github Page
TouchStone: Evaluating Vision-Language Models by Language Models arXiv 2023-08-31 Github -
VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use NeurIPS 2023-08-12 Github Page
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities ICML 2023-08-04 Github -
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension NeurIPS 2023-07-30 Github -
MMBench: Is Your Multi-modal Model an All-around Player? NeurIPS 2023-07-12 Github -
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models arXiv 2023-06-23 Github -
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark NeurIPS 2023-06-11 Github -
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models arXiv 2023-06-15 Github -
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering CVPR 2018-02-22 - -
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering arXiv 2016-12-02 Github Page

OCR

Title Venue Date Code Page
VCR: Visual Caption Restoration arXiv 2024-06-10 Github -
SEED-Bench-2: Benchmarking Multimodal Large Language Models NeurIPS 2023-11-28 Github -
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models arXiv 2023-05-13 Github Page
WebSRC: A Dataset for Web-Based Structural Reading Comprehension CVPR 2023-01-23 Github Page
OCR-VQA: Visual Question Answering by Reading Text in Images ICDAR 2019-09-20 - Page
Towards VQA Models That Can Read CVPR 2019-04-18 Github Page

Chart and Documentation

Title Venue Date Code Page
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content arXiv 2024-10-14 Github Page
DocVQA: A Dataset for VQA on Document Images WACV 2020-07-01 - Page
InfographicVQA WACV 2021-04-26 - Page
MMLongbench-Doc: Benchmarking Long-Context Document Understanding with Visualizations NeurIPS 2024-07-01 Github Page
Charxiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs NeurIPS 2024-06-26 Github Page
DocGenome: An Open Large-Scale Scientific Document Benchmark for Training and Testing Multi-Modal Large Language Models arXiv 2024-06-17 Github Page
A Diagram is Worth a Dozen Images ECCV 2016-10-08 Github -
TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy arXiv 2024-06-03 Github Page
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning arXiv 2022-03-19 Github Page
VisualMRC: Machine Reading Comprehension on Document Images AAAI 2021-01-27 Github -
Leaf-QA: Locate, Encode, Attend for Figure Question Answering WACV 2019-07-30 - -
FigureQA: An Annotated Figure Dataset for Visual Reasoning ICLR 2017-10-19 Github -

Mathematical

Title Venue Date Code Page
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models arXiv 2025-02-02 Github Page
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? arXiv 2024-07-01 Github Page
MathVerse: Does Your Multimodal LLM Truly See the Diagrams in Visual Math Problems? ECCV 2024-05-21 Github Page
Measuring Multimodal Mathematical Reasoning with Math-Vision Dataset arXiv 2024-02-22 Github Page
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems ACL 2024-02-21 Github Page
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts ICLR 2023-10-03 Github Page

Multidisciplinary

Title Venue Date Code Page
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM arXiv 2025-03-18 Github Page
MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI CVPR 2024-11-27 Github Page
MMMU-Pro: A More Robust Multi-Discipline Multimodal Understanding Benchmark arXiv 2024-09-04 Github -
CMMU: A Benchmark for Chinese Multi-Modal Multi-Type Question Understanding and Reasoning arXiv 2024-01-25 Github Page
CMMMU: A Chinese Massive Multi-Discipline Multimodal Understanding Benchmark arXiv 2024-01-22 Github Page
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering NeurIPS 2022-09-20 Github Page

Multilingual

Title Venue Date Code Page
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages arXiv 2024-11-25 Github Page
CMMU: A Benchmark for Chinese Multi-Modal Multi-Type Question Understanding and Reasoning arXiv 2024-01-25 Github Page
CMMMU: A Chinese Massive Multi-Discipline Multimodal Understanding Benchmark arXiv 2024-01-22 Github Page
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation arXiv 2024-07-01 Github -
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models arXiv 2024-06-13 Github Page
Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering arXiv 2024-05-21 Github -
The First Swahili Language Scene Text Detection and Recognition Dataset ICDAR 2024-05-19 Github -
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering arXiv 2024-05-20 Github Page
VIOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images arXiv 2024-04-29 Github -
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models NeurIPS 2023-06-08 Github -

Instruction Following

Title Venue Date Code Page
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs arXiv 2024-07-01 Github -

Multi-Round QA

Title Venue Date Code Page
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs NeurIPS 2024-06-17 Github Page
ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models arXiv 2024-05-29 - -

Multi-Image

Title Venue Date Code Page
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models arXiv 2024-08-05 Github Page
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs NeurIPS 2024-06-17 Github Page
A Corpus for Reasoning about Natural Language Grounded in Photographs ACL 2018-11-01 Github -
Sparkles: Unlocking Chats Across Multiple Images for Multi-Modal Instruction-Following Models arXiv 2023-08-23 Github Page
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences arXiv 2024-01-19 Github -
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning arXiv 2024-06-18 Github Page
REMI: A Dataset for Reasoning with Multiple Images arXiv 2024-06-13 - Page
MUIRBench: A Comprehensive Benchmark for Robust Multi-Image Understanding arXiv 2024-06-13 Github Page

Interleaved Data

Title Venue Date Code Page
MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI CVPR 2024-11-27 Github Page
Sparkles: Unlocking Chats Across Multiple Images for Multi-Modal Instruction-Following Models arXiv 2023-08-23 Github Page
Vega: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models arXiv 2024-06-14 Github Page

High Resolution

Title Venue Date Code Page
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? arXiv 2024-08-23 Github Page
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs CVPR 2023-12-21 Github Page

Visual Grounding

Title Venue Date Code Page
ReferItGame: Referring to Objects in Photographs of Natural Scenes EMNLP 2014-10-25 - -
Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models arXiv 2024-06-24 Github -
Generation and Comprehension of Unambiguous Object Descriptions CVPR 2016-06-26 Github -

Fine-Grained Perception

Title Venue Date Code Page
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples NeurIPS 2024-10-18 Github Page
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification arXiv 2024-06-20 Github -
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs CVPR 2024-06-11 Github Page
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-Level Vision ICLR 2023-09-25 Github Page

Video Understanding

Title Venue Date Code Page
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning arXiv 2025-04-10 Github Page
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding? arXiv 2025-04-09 Github Page
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning arXiv 2025-03-14 Github Page
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding NeurIPS 2024-09-26 Github Page
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-Modal LLMs in Video Analysis arXiv 2024-05-31 Github Page
MVBench: A Comprehensive Multi-Modal Video Understanding Benchmark CVPR 2023-11-28 Github -
MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding arXiv 2024-06-06 Github Page
LVBench: An Extreme Long Video Understanding Benchmark arXiv 2024-06-12 Github Page
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding arXiv 2024-06-20 Github -
Towards Event-Oriented Long Video Understanding arXiv 2024-06-20 Github Page
Needle in a Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs arXiv 2024-06-13 Github Page
EgoSchema: A Diagnostic Benchmark for Very Long-Form Video Language Understanding NeurIPS 2023-08-17 Github Page
TempCompass: Do Video LLMs Really Understand Videos? arXiv 2024-05-01 Github Page
Video Question Answering via Gradually Refined Attention over Appearance and Motion ACM MM 2017-10-23 Github -
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering CVPR 2017-07-22 Github -
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering AAAI 2019-06-06 Github -

Model Self-Analysis

Hallucination

Title Venue Date Code Page
Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models arXiv 2024-06-24 Github -
Evaluating and Analyzing Relationship Hallucinations in LVLMs ICML 2024-06-24 Github -
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models arXiv 2024-06-24 Github Page
VLIND-Bench: Measuring Language Priors in Large Vision-Language Models arXiv 2024-06-13 - -
PHD: A Prompted Visual Hallucination Evaluation Dataset arXiv 2024-05-17 Github -
VALOR-Eval: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models ACL Findings 2024-04-22 Github -
Visual Hallucinations of Multi-Modal Large Language Models arXiv 2024-02-22 Github -
Unified Hallucination Detection for Multimodal Large Language Models ACL 2024-02-05 Github -
MOCHA: Multi-Objective Reinforcement Mitigating Caption Hallucinations arXiv 2023-12-03 Github -
An LLM-Free Multi-Dimensional Benchmark for MLLMs Hallucination Evaluation arXiv 2023-11-13 Github -
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges arXiv 2023-11-06 Github -
Aligning Large Multimodal Models with Factually Augmented RLHF arXiv 2023-09-25 Github Page
Evaluation and Analysis of Hallucination in Large Vision-Language Models EMNLP 2023-08-29 - -
Detecting and Preventing Hallucinations in Large Vision-Language Models AAAI 2023-08-11 - -
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning ICLR 2023-06-26 Github -
Evaluating Object Hallucination in Large Vision-Language Models EMNLP 2023-05-17 Github -

Bias

Title Venue Date Code Page
VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Models arXiv 2024-06-20 Github -
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges arXiv 2023-11-06 Github Page
MM-SPUBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs arXiv 2024-06-24 - Page

Safety

Title Venue Date Code Page
MossBench: Is Your Multimodal Language Model Oversensitive to Safe Queries? arXiv 2024-06-22 Github Page
Efficiently Adversarial Examples Generation for Visual-Language Models under Targeted Transfer Scenarios Using Diffusion Models arXiv 2024-04-16 - -
On Evaluating Adversarial Robustness of Large Vision-Language Models NeurIPS 2024-05-26 Github Page
Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study arXiv 2024-06-11 Github Page
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs arXiv 2023-11-28 Github -

Causation

Title Venue Date Code Page
Cello: Causal Evaluation of Large Vision-Language Models arXiv 2024-06-27 Github -

Extended Applications

Medical Image

Title Venue Date Code Page
GMAI-MMbench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI arXiv 2024-08-06 Github Page
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM CVPR 2024-02-14 Github -
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering arXiv 2023-05-17 Github Page
Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data arXiv 2023-08-04 Github Page
PathVQA: 30000+ Questions for Medical Visual Question Answering arXiv 2020-05-07 - -
SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering ISBI 2021-04-13 - Page
A Dataset of Clinically Generated Visual Questions and Answers about Radiology Images Scientific Data 2018-11-20 - -

Sentiment Analysis

Title Venue Date Code Page
EMoLLM: Multimodal Emotional Understanding Meets Large Language Models arXiv 2024-06-24 Github -
VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding arXiv 2024-06-18 Github -
Facial Affective Behavior Analysis with Instruction Tuning ECCV 2024-04-07 Github Page

Remote Sensing

Title Venue Date Code Page
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? arXiv 2024-08-23 Github Page
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation CVPR 2023-12-19 Github -
RSGPT: A Remote Sensing Vision Language Model and Benchmark arXiv 2023-07-28 Github -
RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data IEEE TGRS 2022-10-23 Github -
Visual Grounding in Remote Sensing Images ACM MM 2022-10-10 - Page
Open-Ended Remote Sensing Visual Question Answering with Transformers Int. J. Remote Sens. 2022-08-10 - -
Mutual Attention Inception Network for Remote Sensing Visual Question Answering IEEE TGRS 2021-05-20 Github -
RSVQA: Visual Question Answering for Remote Sensing Data IEEE TGRS 2020-03-16 - Page

Agent

Title Venue Date Code Page
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception arXiv 2024-01-29 Github -
GPT4Tools: Teaching Large Language Model to Use Tools via Self-Instruction NeurIPS 2023-05-30 Github Page
AppAgent: Multimodal Agents as Smartphone Users arXiv 2023-12-21 Github Page

Code Generation

Title Venue Date Code Page
Web2Code: A Large-Scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs arXiv 2024-06-28 Github Page
ChartMimic: Evaluating LLM’s Cross-Modal Reasoning Capability via Chart-to-Code Generation arXiv 2024-06-14 Github Page

GUI

Title Venue Date Code Page
ScreenAI: A Vision-Language Model for UI and Infographics Understanding IJCAI 2024-02-07 Github -
ScreenQA: Large-Scale Question-Answer Pairs Over Mobile App Screenshots arXiv 2022-09-16 Github -
Towards Better Semantic Understanding of Mobile Interfaces arXiv 2022-10-06 Github -
Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning UIST 2021-08-07 Github -
Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements arXiv 2020-10-08 Github -
Resolving Referring Expressions in Images with Labeled Elements SLT 2018-10-24 - -

Transfer Capability

Title Venue Date Code Page
Benchmarking Large Multimodal Models Against Common Corruptions arXiv 2024-01-22 Github -
BenchLMM: Benchmarking Cross-Style Visual Capability of Large Multimodal Models ECCV 2023-12-05 Github -
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs arXiv 2023-11-27 Github -

Knowledge Editing

Title Venue Date Code Page
VLKEB: A Large Vision-Language Model Knowledge Editing Benchmark NeurIPS 2024-03-12 Github Page
Can We Edit Multimodal Large Language Models? EMNLP 2023-10-12 Github -

Embodied AI

Title Venue Date Code Page
RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents arXiv 2024-03-28 - Page
EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI CVPR 2023-12-26 Github Page
SQA3D: Situated Question Answering in 3D Scenes ICLR 2022-10-14 Github Page
Episodic Memory Question Answering CVPR 2022-05-03 - Page
Embodied Question Answering CVPR 2017-11-30 Github Page
The EPIC-Kitchens Dataset: Collection, Challenges and Baselines IEEE TPAMI 2020-04-29 - Page
EGO4D: Around the World in 3,000 Hours of Egocentric Video CVPR 2021-10-13 - Page
A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility ECCV 2022-02-04 Github -

Autonomous Driving

Title Venue Date Code Page
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? arXiv 2024-08-23 Github Page
Can LVLMs Obtain a Driver’s License? A Benchmark Towards Reliable AGI for Autonomous Driving arXiv 2024-09-02 Github Page
NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario AAAI 2023-05-24 Github -
Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning WACV 2023-09-12 - Page
Reason2Drive: Towards Interpretable and Chain-Based Reasoning for Autonomous Driving ECCV Benchmark 2023-12-21 Github -
LingoQA: Video Question Answering for Autonomous Driving arXiv 2023-12-15 Github -
DriveLM: Driving with Graph Visual Question Answering ECCV 2023-12-21 Github
Language Prompt for Autonomous Driving arXiv 2023-09-08 Github -
DRAMA: Joint Risk Localization and Captioning in Driving WACV 2022-09-22 - Page
Grounding Human-to-Vehicle Advice for Self-Driving Vehicles CVPR 2019-09-16 - Page
Talk2Car: Taking Control of Your Self-Driving Car arXiv 2019-09-24 - Page