- Super-brief summaries of AI papers I've read
- May not perfectly align with authors' claims or intentions
- Papers I consider important include a detailed summary link, which leads to my blog post
Augmenting Perceptual Super-Resolution via Image Quality Predictors
CVPR 2025, arxiv
Task: single image super-resolution
Improve the perceptual quality of SR using NR-IQA metric (MUSIQ) for patch selection and direct optimization.
Dual Prompting Image Restoration with Diffusion Transformers
CVPR 2025, arxiv
Task: image restoration
- Diffusion Transformers (DiTs) have recently shown promising generative capabilities.
- However, effectively incorporating low-quality (LQ) image information into DiTs remains underexplored.
- Propose Dual Prompting Image Restoration (DPIR), which effectively integrates control signals from LQ images into the DiT.
- 3 key components: degradation-robust VAE encoder, low-quality image conditioning branch, dual prompting control branch.
- Low-quality image conditioning: use VAE latent of LQ image for conditioning.
- Dual prompting control: use VAE output of LQ image for prompt instead of text.
Acquire and then Adapt: Squeezing out Text-to-Image Model for Image Restoration
CVPR 2025, arxiv
Task: image restoration
- Restoration approaches based on pre-trained T2I models can enrich low-quality images of any type with realistic details, but require immense datasets and extensive training to prevent alteration of image content.
- Propose FluxGen: generate diverse realistic high-quality images using Flux with empty prompt, and then filter using NR-IQA metrics.
- Propose FluxIR: a lightweight ControlNet-like adapter that controls all the MM-DiT blocks of the Flux model using squeeze-and-excitation (SE) layers.
- Train FluxIR on FluxGen images with modified timestep sampling & an additional pixel-space loss.
Progressive Focused Transformer for Single Image Super-Resolution
CVPR 2025, arxiv, code
Task: single-image super-resolution
- Previous Transformer-based SR methods use vanilla or sparse self-attention, which still compute attention between query tokens and irrelevant tokens, leading to unnecessary computations.
- Intuition: highly relevant tokens will be consistently similar to each other across layers, so use previous layer's attention maps to identify relevant tokens.
- Propose Progressive Focused Attention (PFA): calculate current layer's PFA maps by the Hadamard product of the previous layer's PFA maps and the current layer's self-attention map, followed by top-k selection to construct sparse attention maps.
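A minimal sketch of the PFA update described above, not the authors' implementation. The function name `pfa_step`, the dense list-of-rows layout, and the renormalization after top-k are my assumptions; the actual method presumably skips the computation for pruned positions instead of multiplying by zero.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def pfa_step(prev_pfa, scores, k):
    """One layer: Hadamard product with the previous map, then top-k per query.

    prev_pfa: previous layer's PFA map (list of rows), or None for the first layer.
    scores:   current layer's raw attention logits (list of rows).
    k:        number of keys kept per query (hypothetical hyperparameter name).
    """
    out = []
    for i, row in enumerate(scores):
        attn = softmax(row)
        if prev_pfa is not None:
            # Hadamard product: keys pruned earlier stay at zero.
            attn = [a * p for a, p in zip(attn, prev_pfa[i])]
        keep = sorted(range(len(attn)), key=lambda j: attn[j], reverse=True)[:k]
        total = sum(attn[j] for j in keep)
        sparse = [0.0] * len(attn)
        for j in keep:
            sparse[j] = attn[j] / total  # renormalize (assumption)
        out.append(sparse)
    return out
```

Calling `pfa_step(None, scores, k)` for the first layer and feeding each result into the next call keeps the set of attended keys from growing across layers.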
Exploring Semantic Feature Discrimination for Perceptual Image Super-Resolution and Opinion-Unaware No-Reference Image Quality Assessment
CVPR 2025, arxiv, code
Task: single-image super-resolution
- Previous GAN-based SR methods directly discriminate on images without semantic awareness, causing the generated textures to misalign with image semantics.
- Propose Semantic Feature Discrimination (SFD): discriminate multi-scale CLIP semantic features.
- Since previous discriminator-based opinion-unaware no-reference image quality assessment (OU NR-IQA) methods ignore semantics, the discriminator trained with SFD achieves better performance on OU NR-IQA.
Uncertainty-guided Perturbation for Image Super-Resolution Diffusion Model
CVPR 2025, arxiv, code
Task: single image super-resolution
- Previous diffusion-based SR methods inject a large initial noise into the LR image, which is neither specialized nor effective for SR.
- Four core methods: Uncertainty-guided Noise Weighting (UNW), additional SR image conditioning, loss design, and network architecture design.
- UNW: low noise in flat areas, large noise in edge and texture areas.
- SR image conditioning: combine the pre-trained SR model output to provide more accurate conditional information.
- Loss design: fidelity loss (RMSE) + perceptual loss (LPIPS)
- Network architecture design: encoder & decoder → PixelUnshuffle & nearest neighbor sampling.
The Power of Context: How Multimodality Improves Image Super-Resolution
CVPR 2025, arxiv
Task: single image super-resolution
- Previous text-prompt-driven SR methods use captions generated by VLMs, but VLMs cannot accurately represent spatial information, leading to hallucinated details.
- Propose Multimodal Super-Resolution (MMSR): incorporate additional spatial modalities (depth, segmentation, edge) into a diffusion model to implicitly align language descriptions with spatial regions.
- Instead of using ControlNet-style conditioning, use a pretrained VQGAN image tokenizer to encode diverse modalities into a unified token representation; the tokens are then concatenated with text embeddings and used in cross-attention within the diffusion model.
Vision-Language Models Do Not Understand Negation
CVPR 2025, arxiv, website, code
Task: understanding how well VLMs handle negation
- How well do current VLMs understand negation?
- To comprehensively evaluate how well VLMs handle negation, propose NegBench.
- Joint embedding-based VLMs, such as CLIP, frequently collapse affirmative and negated statements into similar embeddings.
- Data-centric approach is effective: fine-tuning CLIP-based models on large-scale datasets containing millions of negated captions.
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
CVPR 2025 Highlight, arxiv, code
Task: image generation
Mitigate the reconstruction-generation trade-off in LDMs by aligning the VAE latent space with vision foundation models, which promotes spread-out latent distribution.
Arbitrary-steps Image Super-resolution via Diffusion Inversion
CVPR 2025, arxiv, code
Task: single-image super-resolution
Diffusion inversion for SR by training a noise prediction network, enabling pre-trained diffusion models to reconstruct HR image from noise-perturbed LR input.
Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
CVPR 2025 Highlight, arxiv, website, code
Task: generative photography
Scene-consistent text-to-image generation with camera intrinsics control by fine-tuning a T2V with a differential camera encoder.
Classifier-Free Guidance inside the Attraction Basin May Cause Memorization
CVPR 2025, arxiv, code
Task: mitigate memorization in diffusion models
- Applying classifier-free guidance (CFG) before a certain timestep (transition point) tends to produce memorized samples.
- Applying CFG after the transition point is unlikely to yield a memorized image.
- Although every prompt and initialization pair leads to a different transition point, the transition point can be found by identifying the first local minimum of the graph of $$\epsilon_{\theta}(x_t, e_{prompt}) - \epsilon_{\theta}(x_t, e_{\phi})$$.
- Propose Opposite Guidance (OG): apply opposite CFG until the transition point, and switch to traditional positive CFG after the transition point.
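A rough sketch of the two steps above: locating the transition point and switching the guidance sign. The function names are hypothetical, steps are indexed in sampling order, and I am assuming that "opposite CFG" means negating the guidance scale and that the curve being searched is a per-step magnitude of the conditional/unconditional difference.

```python
def first_local_min(values):
    """Index of the first interior local minimum, or None if there is none."""
    for t in range(1, len(values) - 1):
        if values[t] < values[t - 1] and values[t] <= values[t + 1]:
            return t
    return None

def guidance_scale(step, transition, scale):
    """Negative scale before the transition point, standard CFG from it onward."""
    return -scale if step < transition else scale

def cfg(eps_cond, eps_uncond, s):
    """Standard CFG combination: eps_uncond + s * (eps_cond - eps_uncond)."""
    return [u + s * (c - u) for c, u in zip(eps_cond, eps_uncond)]
```

In a sampler loop, `cfg(eps_cond, eps_uncond, guidance_scale(step, transition, scale))` would replace the usual fixed-scale call.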
ArtiFade: Learning to Generate High-quality Subject from Blemished Images
CVPR 2025, arxiv
Task: blemished subject-driven generation
- Current subject-driven T2I methods are vulnerable to artifacts such as watermarks, stickers, or adversarial noise.
- This limitation arises because current methods lack the discriminative power to distinguish subject-related features from disruptive artifacts.
- Propose ArtiFade, which generates high-quality artifact-free images from blemished datasets.
- Core method: artifact rectification training, which first reconstructs the blemished image and then learns to rectify it into an unblemished version.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
arXiv 2024, arxiv
Task: multi-modal generation
- Previous works quantize continuous modalities and train with next-token prediction.
- Propose Transfusion: generate discrete and continuous modalities using a different objective for each modality.
- Use next-token prediction for text and diffusion for images.
- For images, use unrestricted (bidirectional) attention.
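A small sketch of the attention pattern this implies: causal attention over the whole sequence, with bidirectional attention inside each image. The label scheme (`"text"`, `"img0"`, ...) and the function name are my own illustration, not the paper's code.

```python
def transfusion_mask(modalities):
    """mask[i][j] is True if position i may attend to position j.

    modalities: one label per position, e.g. ["text", "img0", "img0", "text"].
    Positions sharing a non-text label are assumed to belong to the same image.
    """
    n = len(modalities)
    mask = [[j <= i for j in range(n)] for i in range(n)]  # causal base
    for i in range(n):
        for j in range(n):
            if modalities[i] != "text" and modalities[i] == modalities[j]:
                mask[i][j] = True  # same image: attend in both directions
    return mask
```

With `["text", "img0", "img0", "text"]`, the first image patch can see the second one, while the leading text token still cannot see anything after itself.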
Detecting, Explaining, and Mitigating Memorization in Diffusion Models
ICLR 2024 Oral, arxiv, review, code
Task: detect and mitigate memorization in diffusion models
- For memorized prompts, the text condition consistently guides the generation towards the memorized solution, regardless of the initializations.
- Thus, the text-conditional noise predictions of memorized prompts tend to have larger magnitudes than those of non-memorized ones.
- Use $$\epsilon_{\theta}(x_t, e_{prompt}) - \epsilon_{\theta}(x_t, e_{\phi})$$ to detect and mitigate memorization.
- Detect trigger tokens by measuring the influence of each token on $$\epsilon_{\theta}(x_t, e_{prompt}) - \epsilon_{\theta}(x_t, e_{\phi})$$.
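One way to turn this quantity into a scalar detection score, written as my assumption about the form rather than a quote from the paper, is to average its L2 norm over the $$T$$ sampling steps:

$$
d = \frac{1}{T}\sum_{t=1}^{T} \left\lVert \epsilon_{\theta}(x_t, e_{prompt}) - \epsilon_{\theta}(x_t, e_{\phi}) \right\rVert_2
$$

A larger $$d$$ would then flag a prompt as likely memorized; the threshold is not specified here and would have to be calibrated.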
DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance
AAAI 2025, arxiv, website, code
Task: higher-resolution image generation
Training-free higher-resolution image generation via input image sharpening and Discrete Wavelet Transform (DWT)-based structural guidance.
Image Neural Field Diffusion Models
CVPR 2024 Highlight, arxiv, website, code
Task: image generation
Any-resolution image generation using a latent diffusion model trained on neural field representations via a neural field autoencoder.
Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models
NeurIPS 2024, arxiv, review, website, code
Task: multi-subject personalization
- Previous works on personalization suffer from identity mixing when composing multiple subjects.
- During training, use detailed descriptions and Seg-Mix augmentation, which randomly composes segmented subjects.
- During inference, use mean-shifted noise, initialized from the segmented subjects, instead of Gaussian noise.
- Propose a new metric, Detect-and-Compare (D&C), to evaluate multi-subject fidelity.
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
ECCV 2024, arxiv, code
Task: referring, grounding, and reasoning on mobile UI screens
- Directly adapting MLLMs to UI screens has limitations, since UI screens exhibit more elongated aspect ratios and contain smaller objects of interest than natural images.
- Incorporate "any resolution" (anyres) on top of Ferret, and then train with a curated dataset.
- During training, both the decoder and the projection layer are updated while the vision encoder is kept frozen.
Measuring Style Similarity in Diffusion Models
ECCV 2024, arxiv, website, code
Task: image style retrieval
In contrast to existing feature extractors that prioritize image content, propose Contrastive Style Descriptors (CSD), specifically designed to extract image style.
Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention
ECCV 2024, arxiv, code
Task: detect and mitigate memorization in text-to-image diffusion models
- Since memorized images are usually triggered by specific text tokens, use the entropy of cross-attention to detect and mitigate memorization.
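For reference, the entropy of one query's cross-attention distribution over text tokens can be computed as below. This is only the generic Shannon entropy; how the paper aggregates it over heads, layers, and timesteps, and which direction signals memorization, is not covered by this sketch.

```python
import math

def attention_entropy(attn):
    """Shannon entropy (in nats) of an attention distribution that sums to 1."""
    return -sum(p * math.log(p) for p in attn if p > 0)
```

Attention focused on a single token gives entropy 0, while attention spread uniformly over `n` tokens gives `log(n)`.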
Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder
CVPR 2024, arxiv, code
Task: image generation, single-image super-resolution
Arbitrary-scale image generation using LDM with implicit neural decoder on VAE, and arbitrary-scale super-resolution by conditioning the diffusion process through concatenation of low-resolution features with the noisy latent.
AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection
ICLR 2024, arxiv, review, code, summary
Task: zero-shot anomaly detection (ZSAD)
Since anomaly patterns remain quite similar regardless of foreground object semantics, use CLIP with learnable object-agnostic text prompts.
Tiny and Efficient Model for the Edge Detection Generalization
ICCV 2023 Workshop, arxiv, code
Task: edge detection
Propose Tiny and Efficient Edge Detector (TEED), which generates thinner and clearer edge-maps by training the model with paired dataset using weighted cross-entropy and tracing loss.
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
CoRL 2023, arxiv, review, website
Task: robot manipulation
Propose RT-2, which directly integrates large pre-trained VLMs into low-level robot control by tokenizing the actions into text tokens & co-fine-tuning robotics data with the original web data.
Understanding and Mitigating Copying in Diffusion Models
NeurIPS 2023, arxiv, review, code
Task: analyze and mitigate memorization in T2I diffusion models
Since text conditioning plays a major role in memorization, propose train-time mitigation (use multiple captions) and inference-time mitigation (use random token replacement or addition).
Implicit Diffusion Models for Continuous Super-Resolution
CVPR 2023, arxiv, code
Task: single-image super-resolution
Continuous image super-resolution by replacing the U-Net decoder with an implicit neural representation and conditioning on multi-resolution LR features.
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
RSS 2023, arxiv, website, code
Task: robot manipulation
Diffusion formulation for visuomotor policy, which predicts high-dimensional action sequences given visual representation, works effectively for real-world robot control.
Adding Conditional Control to Text-to-Image Diffusion Models
ICCV 2023 Oral, arxiv, code, summary
Task: image-based conditional image generation
Fine-tune a trainable copy of a T2I diffusion model, connected via zero convolution, to achieve fine-grained spatial control using additional images as conditioning inputs.
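A toy scalar sketch of why zero convolutions are safe at the start of training: with zero-initialized weights the controlled block returns exactly the frozen block's output. The scalar "convolution" and the function names are simplifications of mine, not the actual implementation.

```python
def zero_conv(x, weight=0.0, bias=0.0):
    """Stand-in for a 1x1 convolution on a scalar feature; zero-initialized."""
    return weight * x + bias

def controlled_block(x, cond, frozen, trainable_copy, w_in=0.0, w_out=0.0):
    """y = frozen(x) + zero_conv(trainable_copy(x + zero_conv(cond)))."""
    control = trainable_copy(x + zero_conv(cond, w_in))
    return frozen(x) + zero_conv(control, w_out)
```

Once `w_in` and `w_out` move away from zero during fine-tuning, the conditioning image starts to influence the output.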
Learning Universal Policies via Text-Guided Video Generation
NeurIPS 2023 Spotlight, arxiv, review, website
Task: robot manipulation
Plans actions by generating a goal-directed video using a T2V diffusion model, and then infers control actions from the video using an inverse dynamics model.
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
CVPR 2023 Award Candidate, arxiv, website, code, summary
Task: subject-driven image generation
Generate novel photorealistic images of the subject contextualized in different scenes via fine-tuning with rare tokens and class-specific prior preservation loss.
Prompt-to-Prompt Image Editing with Cross Attention Control
ICLR 2023 Spotlight, arxiv, review, website, code, summary
Task: text-driven image editing
Text-driven image editing by injecting the cross-attention maps of the original prompt into those of the edited prompt.
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
ICLR 2023 Spotlight, arxiv, review, website, code, summary
Task: personalized text-to-image generation
Generate novel photorealistic images of the subject via optimizing only a single word embedding.
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
CoRL 2022 Oral, arxiv, review, website, code
Task: robot manipulation and navigation
Enable robots to perform complex real-world tasks by selecting appropriate low-level skills through high-level planning using LLM + affordance model.
MuLUT: Cooperating Multiple Look-Up Tables for Efficient Image Super-Resolution
ECCV 2022, paper, website, code
Task: single-image super-resolution
Increase receptive field size of LUT efficiently by using complementary indexing (parallel), hierarchical indexing (cascade), and fine-tuning interpolation values.
Learning to generate line drawings that convey geometry and semantics
CVPR 2022, arxiv, website, code
Task: automatic line generation
Line drawing via unpaired image-to-image translation with 4 losses: adversarial loss (LSGAN), geometry loss (pseudo depth map), semantic loss (CLIP), appearance loss (cycle consistency).
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
ICLR 2022, arxiv, review, website, code, summary
Task: guided image synthesis & editing
Generate realistic images by adding small noise and denoising with score-based models trained on the target domain.
Learning Continuous Image Representation with Local Implicit Image Function
accept info, arxiv, website, code
Task: single-image super-resolution
Learn a continuous image representation that enables arbitrary super-resolution using a coordinate-based decoder.
Tackling the Ill-Posedness of Super-Resolution Through Adaptive Target Generation
CVPR 2021, paper, code
Task: single-image super-resolution
Modeling the one-to-many problem in SR by creating an adaptive target, which is an affine transformed version of the HR patch designed to be closer to the SR patch.
Practical Single-Image Super-Resolution Using Look-Up Table
CVPR 2021, paper, code
Task: single-image super-resolution
Practical SR by transferring a small-receptive-field SR model into a LUT, achieving similar runtime but better performance compared to interpolation methods.
Which Tasks Should Be Learned Together in Multi-task Learning?
ICML 2020, arxiv, review, website, code
Task: multi-task learning
Many common assumptions do not seem to be true: more similar tasks don't necessarily work better together & task relationships are sensitive to dataset size and network capacity.
Generalisation in humans and deep neural networks
NeurIPS 2018, arxiv, review, code, summary
Task: understanding the differences between DNNs and humans
Compared to the human visual system, DNNs (VGG, GoogLeNet, ResNet) generalize poorly under non-i.i.d. settings.
The Perception-Distortion Tradeoff
CVPR 2018 Oral, arxiv
Task: image restoration
For non-invertible degradations, a perception-distortion tradeoff always exists.
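As I understand it, the tradeoff is formalized through a perception-distortion function, where $$Y$$ is the degraded input, $$\hat{X}$$ the restored output, $$\Delta$$ a distortion measure, and $$d$$ a divergence between distributions:

$$
P(D) = \min_{p_{\hat{X}|Y}} d(p_X, p_{\hat{X}}) \quad \text{s.t.} \quad \mathbb{E}[\Delta(X, \hat{X})] \le D
$$

$$P(D)$$ is non-increasing in $$D$$, so allowing more distortion never makes the best achievable perceptual quality worse.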
Enhanced Deep Residual Networks for Single Image Super-Resolution
CVPR 2017 Workshop, arxiv, code
Task: single-image super-resolution
Optimize network and training for SR: remove batch normalization layer, train with residual scaling and L1 loss.
f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization
NIPS 2016 Spotlight, arxiv, review, summary
Task: image generation
Generalize GAN training objectives for all f-divergences using variational lower bound.
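The bound in question, with $$f^{*}$$ the convex (Fenchel) conjugate of $$f$$ and $$T$$ ranging over a class of functions $$\mathcal{T}$$:

$$
D_f(P \,\|\, Q) \ge \sup_{T \in \mathcal{T}} \left( \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}[f^{*}(T(x))] \right)
$$

Parameterizing $$T$$ with a network and $$Q$$ with a generator turns this into a saddle-point objective, with the original GAN loss as one particular choice of $$f$$.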
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
accept info, arxiv, code, summary
Task: image generation
Tractable & flexible probabilistic model by learning the reverse of a forward diffusion process.
NICE: Non-linear Independent Components Estimation
ICLR 2015 Workshop, arxiv, code, summary
Task: image generation
Maximize exact log-likelihood via a change of variables, using a carefully designed invertible transformation with a tractable Jacobian.
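A minimal sketch of an additive coupling layer, the kind of invertible transformation meant here. The coupling function `m` can be arbitrary (it is never inverted); the list-based layout and names are mine.

```python
def coupling_forward(x1, x2, m):
    """y1 = x1, y2 = x2 + m(x1). The Jacobian is unit-triangular, so log|det J| = 0."""
    return x1, [b + s for b, s in zip(x2, m(x1))]

def coupling_inverse(y1, y2, m):
    """Exact inverse: x1 = y1, x2 = y2 - m(y1)."""
    return y1, [b - s for b, s in zip(y2, m(y1))]
```

Stacking such layers while alternating which half is left unchanged gives a flexible transformation whose log-likelihood stays exactly computable.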
Generative Adversarial Networks
NIPS 2014, arxiv, review, code, summary
Task: image generation
Train generative models through adversarial training, without requiring explicit likelihood estimation.
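The adversarial training is the two-player minimax game between generator $$G$$ and discriminator $$D$$:

$$
\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]
$$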
Auto-Encoding Variational Bayes
ICLR 2014 Oral, arxiv, review, summary
Task: image generation
Train directed probabilistic models by maximizing variational lower bound with reparameterization trick for efficient gradient-based optimization.
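A small sketch of the reparameterization trick for a diagonal Gaussian posterior, plus the closed-form KL term against a standard normal prior. Function names and the `log_var` parameterization are common conventions that I am assuming, not taken from the paper.

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    """z = mu + sigma * eps with eps ~ N(0, 1); the randomness is isolated in eps."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))
```

Because `z` is a deterministic function of `mu`, `log_var`, and the noise, gradients can flow through the sampling step to the encoder parameters.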
format
> **paper title**
> *accept info*, [arxiv](), [review](), [website](), [code](), [summary]()
> Task:
> super-brief summary