CVPR 2024 decisions are now available on OpenReview!
Note 1: You are welcome to open issues and share CVPR 2024 papers and open-source projects!
Note 2: For papers from previous years' top CV conferences, along with other high-quality CV papers and reviews, see: https://github.com/amusi/daily-paper-computer-vision
Scan the QR code to join the [CVer Academic Exchange Group], the largest computer vision and AI Knowledge Planet community! It is updated daily with the latest learning materials on computer vision, AI painting, image processing, deep learning, autonomous driving, medical imaging, AIGC, and other fields.

- 3DGS (Gaussian Splatting)
- Avatars
- Backbone
- CLIP
- MAE
- Embodied AI
- GAN
- GNN
- Multimodal Large Language Model (MLLM)
- Large Language Model (LLM)
- NAS
- OCR
- NeRF
- DETR
- Prompt
- Diffusion Models
- ReID (Re-identification)
- Long-Tail
- Vision Transformer
- Vision-Language
- Self-supervised Learning
- Data Augmentation
- Object Detection
- Anomaly Detection
- Visual Tracking
- Semantic Segmentation
- Instance Segmentation
- Panoptic Segmentation
- Medical Image
- Medical Image Segmentation
- Video Object Segmentation
- Video Instance Segmentation
- Referring Image Segmentation
- Image Matting
- Image Editing
- Low-level Vision
- Super-Resolution
- Denoising
- Deblur
- Autonomous Driving
- 3D Point Cloud
- 3D Object Detection
- 3D Semantic Segmentation
- 3D Object Tracking
- 3D Semantic Scene Completion
- 3D Registration
- 3D Human Pose Estimation
- 3D Human Mesh Estimation
- Image Generation
- Video Generation
- 3D Generation
- Video Understanding
- Action Detection
- Text Detection
- Knowledge Distillation
- Model Pruning
- Image Compression
- 3D Reconstruction
- Depth Estimation
- Trajectory Prediction
- Lane Detection
- Image Captioning
- Visual Question Answering (VQA)
- Sign Language Recognition
- Video Prediction
- Novel View Synthesis
- Zero-Shot Learning
- Stereo Matching
- Feature Matching
- Scene Graph Generation
- Implicit Neural Representations
- Image Quality Assessment
- Video Quality Assessment
- Datasets
- New Tasks
- Others
Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering
- Homepage: https://city-super.github.io/scaffold-gs/
- Paper: https://arxiv.org/abs/2312.00109
- Code: https://github.com/city-super/Scaffold-GS
GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis
- Homepage: https://shunyuanzheng.github.io/GPS-Gaussian
- Paper: https://arxiv.org/abs/2312.02155
- Code: https://github.com/ShunyuanZheng/GPS-Gaussian
GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians
GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting
Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction
- Homepage: https://ingra14m.github.io/Deformable-Gaussians/
- Paper: https://arxiv.org/abs/2309.13101
- Code: https://github.com/ingra14m/Deformable-3D-Gaussians
SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes
- Homepage: https://yihua7.github.io/SC-GS-web/
- Paper: https://arxiv.org/abs/2312.14937
- Code: https://github.com/yihua7/SC-GS
Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis
- Homepage: https://oppo-us-research.github.io/SpacetimeGaussians-website/
- Paper: https://arxiv.org/abs/2312.16812
- Code: https://github.com/oppo-us-research/SpacetimeGaussians
DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization
- Homepage: https://fictionarry.github.io/DNGaussian/
- Paper: https://arxiv.org/abs/2403.06912
- Code: https://github.com/Fictionarry/DNGaussian
4D Gaussian Splatting for Real-Time Dynamic Scene Rendering
GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models
GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians
Real-Time Simulated Avatar from Head-Mounted Sensors
- Homepage: https://www.zhengyiluo.com/SimXR/
- Paper: https://arxiv.org/abs/2403.06862
RepViT: Revisiting Mobile CNN From ViT Perspective
TransNeXt: Robust Foveal Visual Perception for Vision Transformers
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
FairCLIP: Harnessing Fairness in Vision-Language Learning
- Paper: https://arxiv.org/abs/2403.19949
- Code: https://github.com/Harvard-Ophthalmology-AI-Lab/FairCLIP
EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
- Homepage: https://tai-wang.github.io/embodiedscan/
- Paper: https://arxiv.org/abs/2312.16170
- Code: https://github.com/OpenRobotLab/EmbodiedScan
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
- Homepage: https://iranqin.github.io/MP5.github.io/
- Paper: https://arxiv.org/abs/2312.07472
- Code: https://github.com/IranQin/MP5
LEMON: Learning 3D Human-Object Interaction Relation from 2D Images
An Empirical Study of Scaling Law for OCR
- Paper: https://arxiv.org/abs/2401.00028
- Code: https://github.com/large-ocr-model/large-ocr-model.github.io
ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting
PIE-NeRF🍕: Physics-based Interactive Elastodynamics with NeRF
DETRs Beat YOLOs on Real-time Object Detection
Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
- Paper: https://arxiv.org/abs/2311.04257
- Code: https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2
Link-Context Learning for Multimodal LLMs
- Paper: https://arxiv.org/abs/2308.07891
- Code: https://github.com/isekai-portal/Link-Context-Learning/tree/main
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Making Large Multimodal Models Understand Arbitrary Visual Prompts
- Homepage: https://vip-llava.github.io/
- Paper: https://arxiv.org/abs/2312.00784
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
OneLLM: One Framework to Align All Modalities with Language
VTimeLLM: Empower LLM to Grasp Video Moments
Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification
Noisy-Correspondence Learning for Text-to-Image Person Re-identification
InstanceDiffusion: Instance-level Control for Image Generation
Residual Denoising Diffusion Models
DeepCache: Accelerating Diffusion Models for Free
DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations
- Homepage: https://tianhao-qi.github.io/DEADiff/
SVGDreamer: Text Guided SVG Generation with Diffusion Model
InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model
MMA-Diffusion: MultiModal Attack on Diffusion Models
VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models
- Homepage: https://video-motion-customization.github.io/
- Paper: https://arxiv.org/abs/2312.00845
- Code: https://github.com/HyeonHo99/Video-Motion-Customization
TransNeXt: Robust Foveal Visual Perception for Vision Transformers
RepViT: Revisiting Mobile CNN From ViT Perspective
A General and Efficient Training for Transformer via Token Expansion
PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
FairCLIP: Harnessing Fairness in Vision-Language Learning
- Paper: https://arxiv.org/abs/2403.19949
- Code: https://github.com/Harvard-Ophthalmology-AI-Lab/FairCLIP
DETRs Beat YOLOs on Real-time Object Detection
Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation
- Paper: https://arxiv.org/abs/2312.01220
- Code: https://github.com/ZPDu/Boosting-Object-Detection-with-Zero-Shot-Day-Night-Domain-Adaptation
YOLO-World: Real-Time Open-Vocabulary Object Detection
Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement
Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection
Delving into the Trajectory Long-tail Distribution for Muti-object Tracking
- Paper: https://arxiv.org/abs/2403.04700
- Code: https://github.com/chen-si-jia/Trajectory-Long-tail-Distribution-for-MOT
Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation
Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational Pathology
VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis
ChAda-ViT: Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images
UniPAD: A Universal Pre-training Paradigm for Autonomous Driving
Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications
Memory-based Adapters for Online 3D Scene Perception
Symphonize 3D Semantic Scene Completion with Contextual Instance Queries
A Real-world Large-scale Dataset for Roadside Cooperative Perception
Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving
Traffic Scene Parsing through the TSP6K Dataset
PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection
UniMODE: Unified Monocular 3D Object Detection
Edit One for All: Interactive Batch Image Editing
- Homepage: https://thaoshibe.github.io/edit-one-for-all
- Paper: https://arxiv.org/abs/2401.10219
- Code: https://github.com/thaoshibe/edit-one-for-all
MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers
- Homepage: https://maskint.github.io
Residual Denoising Diffusion Models
Boosting Image Restoration via Priors from Pre-trained Models
SeD: Semantic-Aware Discriminator for Image Super-Resolution
APISR: Anime Production Inspired Real-World Anime Super-Resolution
Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation
InstanceDiffusion: Instance-level Control for Image Generation
ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
- Homepage: https://eclipse-t2i.vercel.app/
Instruct-Imagen: Image Generation with Multi-modal Instruction
Residual Denoising Diffusion Models
UniGS: Unified Representation for Image Generation and Segmentation
Multi-Instance Generation Controller for Text-to-Image Synthesis
SVGDreamer: Text Guided SVG Generation with Diffusion Model
InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model
Ranni: Taming Text-to-Image Diffusion for Accurate Prompt Following
Vlogger: Make Your Dream A Vlog
VBench: Comprehensive Benchmark Suite for Video Generative Models
- Homepage: https://vchitect.github.io/VBench-project/
- Paper: https://arxiv.org/abs/2311.17982
- Code: https://github.com/Vchitect/VBench
VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models
- Homepage: https://video-motion-customization.github.io/
- Paper: https://arxiv.org/abs/2312.00845
- Code: https://github.com/HyeonHo99/Video-Motion-Customization
CityDreamer: Compositional Generative Model of Unbounded 3D Cities
- Homepage: https://haozhexie.com/project/city-dreamer/
- Paper: https://arxiv.org/abs/2309.00610
- Code: https://github.com/hzxie/city-dreamer
LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
- Paper: https://arxiv.org/abs/2311.17005
- Code: https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2
Logit Standardization in Knowledge Distillation
- Paper: https://arxiv.org/abs/2403.01427
- Code: https://github.com/sunshangquan/logit-standardization-KD
Efficient Dataset Distillation via Minimax Diffusion
Neural Markov Random Field for Stereo Matching
HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation
- Homepage: https://zhangce01.github.io/HiKER-SGG/
- Paper: https://arxiv.org/abs/2403.12033
- Code: https://github.com/zhangce01/HiKER-SGG
KVQ: Kaleidoscope Video Quality Assessment for Short-form Videos
# Datasets
A Real-world Large-scale Dataset for Roadside Cooperative Perception
Traffic Scene Parsing through the TSP6K Dataset
Object Recognition as Next Token Prediction
ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks
Seamless Human Motion Composition with Blended Positional Encodings
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
- Homepage: https://ll3da.github.io/
CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update
- Homepage: https://clova-tool.github.io/
- Paper: https://arxiv.org/abs/2312.10908
MoMask: Generative Masked Modeling of 3D Human Motions
Amodal Ground Truth and Completion in the Wild
- Homepage: https://www.robots.ox.ac.uk/~vgg/research/amodal/
- Paper: https://arxiv.org/abs/2312.17247
- Code: https://github.com/Championchess/Amodal-Completion-in-the-Wild
Improved Visual Grounding through Self-Consistent Explanations
ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object
- Homepage: https://chenshuang-zhang.github.io/imagenet_d/
- Paper: https://arxiv.org/abs/2403.18775
- Code: https://github.com/chenshuang-zhang/imagenet_d
Learning from Synthetic Human Group Activities
- Homepage: https://cjerry1243.github.io/M3Act/
- Paper: https://arxiv.org/abs/2306.16772
- Code: https://github.com/cjerry1243/M3Act
MindBridge: A Cross-Subject Brain Decoding Framework
- Homepage: https://littlepure2333.github.io/MindBridge/
- Paper: https://arxiv.org/abs/2404.07850
- Code: https://github.com/littlepure2333/MindBridge
Multi-Task Dense Prediction via Mixture of Low-Rank Experts
Contrastive Mean-Shift Learning for Generalized Category Discovery
- Homepage: https://postech-cvlab.github.io/cms/
- Paper: https://arxiv.org/abs/2404.09451
- Code: https://github.com/sua-choi/CMS