amusi/CVPR2024-Papers-with-Code

CVPR 2024 Papers and Open-Source Projects (Papers with Code)

CVPR 2024 decisions are now available on OpenReview!

Note 1: Everyone is welcome to submit an issue to share CVPR 2024 papers and open-source projects!

Note 2: For papers from previous years' top CV conferences, as well as other high-quality CV papers and roundups, see: https://github.com/amusi/daily-paper-computer-vision

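Scan the QR code to join the [CVer Academic Exchange Group], the largest computer vision and AI community on Knowledge Planet (知识星球)! It is updated daily and shares the latest learning materials on computer vision, AI art generation, image processing, deep learning, autonomous driving, medical imaging, AIGC, and more.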

[CVPR 2024 Open-Source Paper Directory]

3DGS (Gaussian Splatting)

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis

GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians

GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting

Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction

SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes

Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis

DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization

4D Gaussian Splatting for Real-Time Dynamic Scene Rendering

GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models

Avatars

GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians

Real-Time Simulated Avatar from Head-Mounted Sensors

Backbone

RepViT: Revisiting Mobile CNN From ViT Perspective

TransNeXt: Robust Foveal Visual Perception for Vision Transformers

CLIP

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

FairCLIP: Harnessing Fairness in Vision-Language Learning

MAE

Embodied AI

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

LEMON: Learning 3D Human-Object Interaction Relation from 2D Images

GAN

OCR

An Empirical Study of Scaling Law for OCR

ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting

NeRF

PIE-NeRF🍕: Physics-based Interactive Elastodynamics with NeRF

DETR

DETRs Beat YOLOs on Real-time Object Detection

Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement

Prompt

Multimodal Large Language Models (MLLM)

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Link-Context Learning for Multimodal LLMs

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Making Large Multimodal Models Understand Arbitrary Visual Prompts

Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

OneLLM: One Framework to Align All Modalities with Language

Large Language Models (LLM)

VTimeLLM: Empower LLM to Grasp Video Moments

NAS

ReID (Re-Identification)

Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification

Noisy-Correspondence Learning for Text-to-Image Person Re-identification

Diffusion Models

InstanceDiffusion: Instance-level Control for Image Generation

Residual Denoising Diffusion Models

DeepCache: Accelerating Diffusion Models for Free

DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations

SVGDreamer: Text Guided SVG Generation with Diffusion Model

InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model

MMA-Diffusion: MultiModal Attack on Diffusion Models

VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models

Vision Transformer

TransNeXt: Robust Foveal Visual Perception for Vision Transformers

RepViT: Revisiting Mobile CNN From ViT Perspective

A General and Efficient Training for Transformer via Token Expansion

Vision-Language

PromptKD: Unsupervised Prompt Distillation for Vision-Language Models

FairCLIP: Harnessing Fairness in Vision-Language Learning

Object Detection

DETRs Beat YOLOs on Real-time Object Detection

Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation

YOLO-World: Real-Time Open-Vocabulary Object Detection

Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement

Anomaly Detection

Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection

Object Tracking

Delving into the Trajectory Long-tail Distribution for Muti-object Tracking

Semantic Segmentation

Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation

Medical Image

Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational Pathology

VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis

ChAda-ViT: Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images

Medical Image Segmentation

Autonomous Driving

UniPAD: A Universal Pre-training Paradigm for Autonomous Driving

Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications

Memory-based Adapters for Online 3D Scene Perception

Symphonize 3D Semantic Scene Completion with Contextual Instance Queries

A Real-world Large-scale Dataset for Roadside Cooperative Perception

Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving

Traffic Scene Parsing through the TSP6K Dataset

3D Point Cloud

3D Object Detection

PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection

UniMODE: Unified Monocular 3D Object Detection

3D Semantic Segmentation

Image Editing

Edit One for All: Interactive Batch Image Editing

Video Editing

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers

Low-level Vision

Residual Denoising Diffusion Models

Boosting Image Restoration via Priors from Pre-trained Models

Super-Resolution

SeD: Semantic-Aware Discriminator for Image Super-Resolution

APISR: Anime Production Inspired Real-World Anime Super-Resolution

Denoising

Image Denoising

3D Human Pose Estimation

Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation

Image Generation

InstanceDiffusion: Instance-level Control for Image Generation

ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations

Instruct-Imagen: Image Generation with Multi-modal Instruction

Residual Denoising Diffusion Models

UniGS: Unified Representation for Image Generation and Segmentation

Multi-Instance Generation Controller for Text-to-Image Synthesis

SVGDreamer: Text Guided SVG Generation with Diffusion Model

InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model

Ranni: Taming Text-to-Image Diffusion for Accurate Prompt Following

Video Generation

Vlogger: Make Your Dream A Vlog

VBench: Comprehensive Benchmark Suite for Video Generative Models

VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models

3D Generation

CityDreamer: Compositional Generative Model of Unbounded 3D Cities

LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching

Video Understanding

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Knowledge Distillation

Logit Standardization in Knowledge Distillation

Efficient Dataset Distillation via Minimax Diffusion

Stereo Matching

Neural Markov Random Field for Stereo Matching

Scene Graph Generation

HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation

Video Quality Assessment

KVQ: Kaleidoscope Video Quality Assessment for Short-form Videos

Datasets

A Real-world Large-scale Dataset for Roadside Cooperative Perception

Traffic Scene Parsing through the TSP6K Dataset

Others

Object Recognition as Next Token Prediction

ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks

Seamless Human Motion Composition with Blended Positional Encodings

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update

MoMask: Generative Masked Modeling of 3D Human Motions

Amodal Ground Truth and Completion in the Wild

Improved Visual Grounding through Self-Consistent Explanations

ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object

Learning from Synthetic Human Group Activities

A Cross-Subject Brain Decoding Framework

Multi-Task Dense Prediction via Mixture of Low-Rank Experts

Contrastive Mean-Shift Learning for Generalized Category Discovery