University of Chinese Academy of Sciences
Beijing

Stars
A MemAgent framework that can extrapolate to 3.5M-token contexts, along with a framework for RL training of any agent workflow.
[ICCV 2025] LVBench: An Extreme Long Video Understanding Benchmark
(ICCV2025) Official repository of paper "ViSpeak: Visual Instruction Feedback in Streaming Videos"
verl: Volcano Engine Reinforcement Learning for LLMs
verl-agent is an extension of veRL for training LLM/VLM agents via RL; it is also the official code for the paper "Group-in-Group Policy Optimization for LLM Agent Training".
StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding
Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.
Pixel-Level Reasoning Model trained with RL
Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first paper to explore R1 for video]
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 500+ LLMs (Qwen3, Qwen3-MoE, Llama4, GLM4.5, InternLM3, DeepSeek-R1, ...) and 200+ MLLMs (Qwen2.5-VL, Qwen2.5-Omni, InternVL3.5, Ovis2.5, Llava, GLM4v…
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment
✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Mulberry, an o1-like Reasoning and Reflection MLLM Implemented via Collective MCTS
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL
Skywork-R1V is an advanced multimodal AI model series developed by Skywork AI (Kunlun Inc.), specializing in vision-language reasoning.
This is the first paper to explore how to effectively use R1-like RL for MLLMs; it introduces Vision-R1, a reasoning MLLM that leverages cold-start initialization and RL training to incentivize reasoning capability.
MM-EUREKA: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
A Comprehensive Survey on Evaluating Reasoning Capabilities in Multimodal Large Language Models.
Exploring the multimodal "Aha Moment" on a 2B model
MM-Eureka V0, also called R1-Multimodal-Journey; the latest version is in MM-Eureka.
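Several of the starred RL frameworks above (verl-agent, EasyR1) build on GRPO-style training. A minimal sketch of the core idea, group-relative advantage estimation — not code from any of these repos, just an illustration of the technique named in the GiGPO paper title:

```python
# Illustrative sketch: GRPO-style group-relative advantages.
# Several rollouts are sampled from the same prompt, and each rollout's
# reward is normalized against the group's mean and std, so no learned
# critic is needed to estimate a baseline.

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward within its rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 rollouts of one prompt with binary correctness rewards.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts receive positive advantages and incorrect ones negative, so the policy gradient pushes toward the better completions within each group.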
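The "rule-based reinforcement learning" in the MM-EUREKA entry refers to rewards computed by fixed rules rather than a learned reward model. A generic sketch of such a reward, assuming a `\boxed{}` answer format — the exact format checks and reward values here are hypothetical, not MM-EUREKA's actual code:

```python
# Illustrative sketch of an R1-style rule-based reward: a format check
# (answer wrapped in \boxed{...}) combined with an exact-match accuracy
# check. The 0.1 partial credit for correct format is an assumption.
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no boxed answer: fails the format rule
    answer = match.group(1).strip()
    # Full reward for a correct answer, small format-only reward otherwise.
    return 1.0 if answer == ground_truth.strip() else 0.1

r = rule_based_reward(r"The answer is \boxed{42}", "42")
```

Because the reward is a deterministic function of the model's text output, it is cheap to compute at scale and cannot be gamed the way a learned reward model can.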