
CVPR 2024 Papers and Open Source Projects Collection (Papers with Code)

CVPR 2024 decisions are now available on OpenReview!

Note 1: Feel free to open issues to share CVPR 2024 papers and open source projects!

Note 2: For papers from previous years' top CV conferences and other high-quality CV papers and surveys, see: https://github.com/amusi/daily-paper-computer-vision

Scan the QR code to join the [CVer Academic Exchange Group], the largest computer vision AI knowledge community. It is updated daily with the latest learning materials on computer vision, AI painting, image processing, deep learning, autonomous driving, medical imaging, AIGC, and more. Let's learn together!

![](CVer academic exchange group.png)

[CVPR 2024 Paper Open Source Directory]

3DGS(Gaussian Splatting)

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis

GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians

GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting

Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction

SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes

Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis

DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization

4D Gaussian Splatting for Real-Time Dynamic Scene Rendering

GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models

Avatars

GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians

Real-Time Simulated Avatar from Head-Mounted Sensors

Backbone

RepViT: Revisiting Mobile CNN From ViT Perspective

TransNeXt: Robust Foveal Visual Perception for Vision Transformers

CLIP

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

FairCLIP: Harnessing Fairness in Vision-Language Learning


Embodied AI

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

LEMON: Learning 3D Human-Object Interaction Relation from 2D Images


OCR

An Empirical Study of Scaling Law for OCR

ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting

NeRF

PIE-NeRF🍕: Physics-based Interactive Elastodynamics with NeRF

DETR

DETRs Beat YOLOs on Real-time Object Detection

Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement

Prompt

Multimodal Large Language Model (MLLM)

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Link-Context Learning for Multimodal LLMs

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Making Large Multimodal Models Understand Arbitrary Visual Prompts

Pink: Unveiling the power of referential comprehension for multi-modal llms

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

OneLLM: One Framework to Align All Modalities with Language

Large Language Model (LLM)

VTimeLLM: Empower LLM to Grasp Video Moments


ReID (Re-identification)

Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification

Noisy-Correspondence Learning for Text-to-Image Person Re-identification

Diffusion Models

InstanceDiffusion: Instance-level Control for Image Generation

Residual Denoising Diffusion Models

DeepCache: Accelerating Diffusion Models for Free

DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations

SVGDreamer: Text Guided SVG Generation with Diffusion Model

InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model

MMA-Diffusion: MultiModal Attack on Diffusion Models

VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models

Vision Transformer

TransNeXt: Robust Foveal Visual Perception for Vision Transformers

RepViT: Revisiting Mobile CNN From ViT Perspective

A General and Efficient Training for Transformer via Token Expansion

Vision-Language

PromptKD: Unsupervised Prompt Distillation for Vision-Language Models

FairCLIP: Harnessing Fairness in Vision-Language Learning

Object Detection

DETRs Beat YOLOs on Real-time Object Detection

Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation

YOLO-World: Real-Time Open-Vocabulary Object Detection

Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement

Anomaly Detection

Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection

Object Tracking

Delving into the Trajectory Long-tail Distribution for Muti-object Tracking

Semantic Segmentation

Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation

Medical Image

Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational Pathology

VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis

ChAda-ViT : Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images

Medical Image Segmentation

Autonomous Driving

UniPAD: A Universal Pre-training Paradigm for Autonomous Driving

Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications

Memory-based Adapters for Online 3D Scene Perception

Symphonize 3D Semantic Scene Completion with Contextual Instance Queries

A Real-world Large-scale Dataset for Roadside Cooperative Perception

Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving

Traffic Scene Parsing through the TSP6K Dataset

3D-Point-Cloud

3D Object Detection

PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection

UniMODE: Unified Monocular 3D Object Detection

3D Semantic Segmentation

Image Editing

Edit One for All: Interactive Batch Image Editing

Video Editing

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers

Low-level Vision

Residual Denoising Diffusion Models

Boosting Image Restoration via Priors from Pre-trained Models

Super-Resolution

SeD: Semantic-Aware Discriminator for Image Super-Resolution

APISR: Anime Production Inspired Real-World Anime Super-Resolution

Denoising

Image Denoising

3D Human Pose Estimation

Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation

Image Generation

InstanceDiffusion: Instance-level Control for Image Generation

ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations

Instruct-Imagen: Image Generation with Multi-modal Instruction

Residual Denoising Diffusion Models

UniGS: Unified Representation for Image Generation and Segmentation

Multi-Instance Generation Controller for Text-to-Image Synthesis

SVGDreamer: Text Guided SVG Generation with Diffusion Model

InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model

Ranni: Taming Text-to-Image Diffusion for Accurate Prompt Following

Video Generation

Vlogger: Make Your Dream A Vlog

VBench: Comprehensive Benchmark Suite for Video Generative Models

VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models

3D Generation

CityDreamer: Compositional Generative Model of Unbounded 3D Cities

LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching

Video Understanding

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Knowledge Distillation

Logit Standardization in Knowledge Distillation

Efficient Dataset Distillation via Minimax Diffusion

Stereo Matching

Neural Markov Random Field for Stereo Matching

Scene Graph Generation

HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation

Video Quality Assessment

KVQ: Kaleidoscope Video Quality Assessment for Short-form Videos

Datasets

A Real-world Large-scale Dataset for Roadside Cooperative Perception

Traffic Scene Parsing through the TSP6K Dataset

Others

Object Recognition as Next Token Prediction

ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks

Seamless Human Motion Composition with Blended Positional Encodings

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update

MoMask: Generative Masked Modeling of 3D Human Motions

Amodal Ground Truth and Completion in the Wild

Improved Visual Grounding through Self-Consistent Explanations

ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object

Learning from Synthetic Human Group Activities

A Cross-Subject Brain Decoding Framework

Multi-Task Dense Prediction via Mixture of Low-Rank Experts

Contrastive Mean-Shift Learning for Generalized Category Discovery
