AI Paper Digest

Super-brief summaries of AI papers I've read
May not perfectly align with authors' claims or intentions
Papers I consider important include detailed summary links, which lead to my blog posts

Augmenting Perceptual Super-Resolution via Image Quality Predictors
CVPR 2025, arxiv
Task: single image super-resolution
Improve the perceptual quality of SR using NR-IQA metric (MUSIQ) for patch selection and direct optimization.

Dual Prompting Image Restoration with Diffusion Transformers
CVPR 2025, arxiv
Task: image restoration

Diffusion Transformers (DiTs) have recently shown promising generative capabilities.

However, effectively incorporating low-quality (LQ) image information into DiTs remains underexplored.

Propose Dual Prompting Image Restoration (DPIR), which effectively integrates control signals from LQ images into the DiT.

3 key components: degradation-robust VAE encoder, low-quality image conditioning branch, dual prompting control branch.

Low-quality image conditioning: use VAE latent of LQ image for conditioning.

Dual prompting control: use VAE output of LQ image for prompt instead of text.

Acquire and then Adapt: Squeezing out Text-to-Image Model for Image Restoration
CVPR 2025, arxiv
Task: image restoration

Restoration approaches based on pre-trained T2I models can enrich low-quality images of any type with realistic details, but require an immense dataset and extensive training to prevent alterations of image content.

Propose FluxGen: generate diverse realistic high-quality images using Flux with empty prompt, and then filter using NR-IQA metrics.

Propose FluxIR: lightweight ControlNet-like adapter, which controls all the MM-DiT blocks of the Flux model using squeeze-and-excitation (SE) layers.

Train FluxIR with FluxGen with modified timestep sampling & additional pixel-space loss.

Progressive Focused Transformer for Single Image Super-Resolution
CVPR 2025, arxiv, code
Task: single-image super-resolution

Previous Transformer-based SR methods use vanilla or sparse self-attention, which still compute attention between query tokens and irrelevant tokens, leading to unnecessary computations.

Intuition: highly relevant tokens will be consistently similar to each other across layers, so use previous layer's attention maps to identify relevant tokens.

Propose Progressive Focused Attention (PFA): calculate current layer's PFA maps by the Hadamard product of the previous layer's PFA maps and the current layer's self-attention map, followed by top-k selection to construct sparse attention maps.
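
The PFA update above can be sketched in plain Python. This is only an illustration: the function name, the list-of-rows layout, and the renormalization over the kept entries are assumptions, not details taken from the paper.

```python
def progressive_focused_attention(prev_pfa, attn, k):
    """Combine the previous PFA map with the current attention map.

    prev_pfa, attn: lists of rows, one row of weights per query token.
    Returns sparse maps that keep only the top-k entries of each row.
    """
    out = []
    for prev_row, attn_row in zip(prev_pfa, attn):
        # Hadamard (element-wise) product of the two maps.
        prod = [p * a for p, a in zip(prev_row, attn_row)]
        # Indices of the k largest entries in this row.
        keep = sorted(range(len(prod)), key=lambda j: prod[j], reverse=True)[:k]
        total = sum(prod[j] for j in keep)
        row = [0.0] * len(prod)
        for j in keep:
            # Renormalizing the kept weights is an assumption of this sketch.
            row[j] = prod[j] / total if total > 0 else 0.0
        out.append(row)
    return out


sparse = progressive_focused_attention([[0.5, 0.3, 0.2]], [[0.2, 0.5, 0.3]], k=2)
```

Tokens dropped in one layer stay at zero in every later product, which is how the map becomes progressively more focused.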

Exploring Semantic Feature Discrimination for Perceptual Image Super-Resolution and Opinion-Unaware No-Reference Image Quality Assessment
CVPR 2025, arxiv, code
Task: single-image super-resolution

Previous GAN-based SR methods directly discriminate on images without semantic awareness, causing the generated textures to misalign with image semantics.

Propose Semantic Feature Discrimination (SFD): discriminate multi-scale CLIP semantic features.

Since previous discriminator-based opinion-unaware no-reference image quality assessment (OU NR-IQA) methods ignore semantic assessment, the discriminator trained with SFD achieves better performance on OU NR-IQA.

Uncertainty-guided Perturbation for Image Super-Resolution Diffusion Model
CVPR 2025, arxiv, code
Task: single image super-resolution

Previous diffusion-based SR methods inject a large initial noise into the LR image, which is neither specialized nor effective for SR.

Four core methods: Uncertainty-guided Noise Weighting (UNW), additional SR image conditioning, loss design, and network architecture design.

UNW: low noise in flat areas, large noise in edge and texture areas.

SR image conditioning: combine the pre-trained SR model output to provide more accurate conditional information.

Loss design: fidelity loss (RMSE) + perceptual loss (LPIPS)

Network architecture design: encoder & decoder → PixelUnshuffle & nearest neighbor sampling.
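
A rough sketch of the UNW idea (low noise in flat areas, large noise in edge and texture areas). How the uncertainty map is estimated is not covered here, so the map is taken as a given input; `noise_weights`, `perturb`, `w_min`, and `w_max` are hypothetical names and values chosen for illustration.

```python
import random


def noise_weights(uncertainty, w_min=0.2, w_max=1.0):
    """Map per-pixel uncertainty in [0, 1] to a noise weight in [w_min, w_max]."""
    return [[w_min + (w_max - w_min) * min(max(u, 0.0), 1.0) for u in row]
            for row in uncertainty]


def perturb(image, weights, sigma, rng):
    """Add Gaussian noise whose per-pixel standard deviation is sigma * weight."""
    return [[p + sigma * w * rng.gauss(0.0, 1.0) for p, w in zip(img_row, w_row)]
            for img_row, w_row in zip(image, weights)]


# Hypothetical uncertainty map: 0 = flat region, 1 = edge or texture.
uncertainty = [[0.0, 1.0], [0.5, 0.0]]
weights = noise_weights(uncertainty)
noisy = perturb([[0.5, 0.5], [0.5, 0.5]], weights, sigma=0.1, rng=random.Random(0))
```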

The Power of Context: How Multimodality Improves Image Super-Resolution
CVPR 2025, arxiv
Task: single image super-resolution

Previous text-prompt-driven SR methods use captions generated by VLMs, but VLMs cannot accurately represent spatial information, leading to hallucinated details.

Propose Multimodal Super-Resolution (MMSR): incorporate additional spatial modalities (depth, segmentation, edge) into a diffusion model to implicitly align language descriptions with spatial regions.

Instead of using ControlNet-style conditioning, use a pretrained VQGAN image tokenizer to encode diverse modalities into a unified token representation, which is then concatenated with text embeddings and used in cross-attention within the diffusion model.

Vision-Language Models Do Not Understand Negation
CVPR 2025, arxiv, website, code
Task: understanding how well VLMs handle negation

How well do current VLMs understand negation?

To comprehensively evaluate how well VLMs handle negation, propose NegBench.

Joint embedding-based VLMs, such as CLIP, frequently collapse affirmative and negated statements into similar embeddings.

Data-centric approach is effective: fine-tuning CLIP-based models on large-scale datasets containing millions of negated captions.

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
CVPR 2025 Highlight, arxiv, code
Task: image generation
Mitigate the reconstruction-generation trade-off in LDMs by aligning the VAE latent space with vision foundation models, which promotes spread-out latent distribution.

Arbitrary-steps Image Super-resolution via Diffusion Inversion
CVPR 2025, arxiv, code
Task: single-image super-resolution
Diffusion inversion for SR by training a noise prediction network, enabling pre-trained diffusion models to reconstruct HR image from noise-perturbed LR input.

Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
CVPR 2025 Highlight, arxiv, website, code
Task: generative photography
Scene-consistent text-to-image generation with camera intrinsics control by fine-tuning a T2V model with a differential camera encoder.

Classifier-Free Guidance inside the Attraction Basin May Cause Memorization
CVPR 2025, arxiv, code
Task: mitigate memorization in diffusion models

Applying classifier-free guidance (CFG) before a certain timestep (transition point) tends to produce memorized samples.

Applying CFG after the transition point is unlikely to yield a memorized image.

Although every prompt and initialization pair leads to a different transition point, the transition point can be found by identifying the first local minimum of the graph of $$\epsilon_{\theta}(x_t, e_{prompt}) - \epsilon_{\theta}(x_t, e_{\phi})$$.

Propose Opposite Guidance (OG): apply opposite CFG until the transition point, and switch to traditional positive CFG after the transition point.
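
A minimal sketch of the schedule described above. It assumes the per-step magnitudes of the conditional/unconditional difference are already available as a list indexed in sampling order (step 0 is the first denoising step), and it models "opposite CFG" as a negated guidance scale; both are assumptions of this sketch, and all function names are hypothetical.

```python
def first_local_minimum(values):
    """Return the index of the first interior local minimum, or None."""
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i] <= values[i + 1]:
            return i
    return None


def guidance_scale(step, transition_step, scale=7.5):
    """Negated scale before the transition point, regular scale afterwards."""
    return -scale if step < transition_step else scale


def cfg(eps_uncond, eps_cond, s):
    """Standard classifier-free guidance combination with scale s."""
    return [u + s * (c - u) for u, c in zip(eps_uncond, eps_cond)]


magnitudes = [5.0, 3.0, 2.0, 2.5, 1.0]  # hypothetical per-step magnitudes
transition = first_local_minimum(magnitudes)
scales = [guidance_scale(t, transition) for t in range(len(magnitudes))]
```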

ArtiFade: Learning to Generate High-quality Subject from Blemished Images
CVPR 2025, arxiv
Task: blemished subject-driven generation

Current subject-driven T2I methods are vulnerable to artifacts such as watermarks, stickers, or adversarial noise.

This limitation arises because current methods lack the discriminative power to distinguish subject-related features from disruptive artifacts.

Propose ArtiFade, which generates high-quality artifact-free images from blemished datasets.

Core method: artifact rectification training, which first reconstructs the blemished image and then learns to rectify it into an unblemished version.

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
arXiv 2024, arxiv
Task: multi-modal generation

Previous works quantize continuous modalities and train with next-token prediction.

Propose Transfusion: generate discrete and continuous modalities using a different objective for each modality.

Use next token prediction for text and diffusion for images.

For images, use unrestricted (bidirectional) attention.
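
One way to picture the attention pattern: causal attention everywhere, plus unrestricted attention among tokens of the same image. The sketch below builds such a mask; the `(kind, length)` segment format and the function name are hypothetical, chosen only for illustration.

```python
def transfusion_mask(segments):
    """Build a boolean attention mask; mask[i][j] is True if token i may attend to j.

    segments: list of (kind, length) pairs, with kind in {"text", "image"}.
    """
    ids = []
    for seg_id, (kind, length) in enumerate(segments):
        ids += [(seg_id, kind)] * length
    n = len(ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i
            # Tokens inside the same image segment see each other in both directions.
            same_image = ids[i][0] == ids[j][0] and ids[i][1] == "image"
            mask[i][j] = causal or same_image
    return mask


mask = transfusion_mask([("text", 2), ("image", 2), ("text", 1)])
```

Here the first image token can attend to the second one even though it comes later, while text tokens still cannot look ahead.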

Detecting, Explaining, and Mitigating Memorization in Diffusion Models
ICLR 2024 Oral, arxiv, review, code
Task: detect and mitigate memorization in diffusion models

For memorized prompts, the text condition consistently guides the generation towards the memorized solution, regardless of the initializations.

Thus, memorized prompts tend to exhibit larger text-conditional noise prediction magnitudes than non-memorized ones.

Use $$\epsilon_{\theta}(x_t, e_{prompt}) - \epsilon_{\theta}(x_t, e_{\phi})$$ to detect and mitigate memorization.

Detect trigger tokens by measuring the influence of $$\epsilon_{\theta}(x_t, e_{prompt}) - \epsilon_{\theta}(x_t, e_{\phi})$$ on each token.
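
A minimal sketch of a magnitude-based detection score, assuming the conditional and unconditional noise predictions have already been collected per step as flat lists. Averaging the L2 norm over steps and comparing against a threshold are assumptions of this sketch; the threshold value would have to be calibrated.

```python
import math


def text_condition_magnitude(eps_cond_steps, eps_uncond_steps):
    """Average L2 norm of (conditional - unconditional) prediction over steps."""
    norms = []
    for cond, uncond in zip(eps_cond_steps, eps_uncond_steps):
        norms.append(math.sqrt(sum((c - u) ** 2 for c, u in zip(cond, uncond))))
    return sum(norms) / len(norms)


def is_memorized(magnitude, threshold):
    """Flag a prompt when its score exceeds a (hypothetical) calibrated threshold."""
    return magnitude > threshold


score = text_condition_magnitude([[3.0, 4.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
```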

DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance
AAAI 2025, arxiv, website, code
Task: higher-resolution image generation
Training-free higher-resolution image generation via input image sharpening and Discrete Wavelet Transform (DWT)-based structural guidance.

Image Neural Field Diffusion Models
CVPR 2024 Highlight, arxiv, website, code
Task: image generation
Any-resolution image generation using a latent diffusion model trained on neural field representations via a neural field autoencoder.

Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models
NeurIPS 2024, arxiv, review, website, code
Task: multi-subject personalization

Previous works on personalization suffer from identity mixing when composing multiple subjects.

During training, use detailed descriptions and Seg-Mix augmentation, which randomly composes segmented subjects.

During inference, use mean-shifted noise instead of Gaussian noise, which is initialized using the segmented subjects.

Propose new metric Detect-and-Compare (D&C) to evaluate multi-subject fidelity.

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
ECCV 2024, arxiv, code
Task: referring, grounding, and reasoning on mobile UI screens

Directly adapting MLLMs to UI screens has limitations, since UI screens exhibit more elongated aspect ratios and contain smaller objects of interest than natural images.

Incorporate "any resolution" (anyres) on top of Ferret, and then train with a curated dataset.

During training, both the decoder and the projection layer are updated while the vision encoder is kept frozen.

Measuring Style Similarity in Diffusion Models
ECCV 2024, arxiv, website, code
Task: image style retrieval
In contrast to existing feature extractors that prioritize image content, propose Contrastive Style Descriptors (CSD), specifically designed to extract image style.

Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention
ECCV 2024, arxiv, code
Task: detect and mitigate memorization in text-to-image diffusion models

Since memorized images are usually triggered by specific text tokens, use the entropy of cross-attention to detect and mitigate memorization.
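
For reference, the entropy of one cross-attention distribution over text tokens can be computed as below. This sketch only shows the entropy itself; how the scores are aggregated across layers and steps, and which threshold separates memorized from non-memorized samples, is not specified here.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution over text tokens."""
    return -sum(w * math.log(w) for w in weights if w > 0)


uniform = attention_entropy([0.25, 0.25, 0.25, 0.25])  # attention spread over all tokens
peaked = attention_entropy([1.0, 0.0, 0.0, 0.0])       # attention on a single token
```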

Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder
CVPR 2024, arxiv, code
Task: image generation, single-image super-resolution
Arbitrary-scale image generation using LDM with implicit neural decoder on VAE, and arbitrary-scale super-resolution by conditioning the diffusion process through concatenation of low-resolution features with the noisy latent.

AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection
ICLR 2024, arxiv, review, code, summary
Task: zero-shot anomaly detection (ZSAD)
Since anomaly patterns remain quite similar regardless of foreground object semantics, use CLIP with learnable object-agnostic text prompts.

Tiny and Efficient Model for the Edge Detection Generalization
ICCV 2023 Workshop, arxiv, code
Task: edge detection
Propose Tiny and Efficient Edge Detector (TEED), which generates thinner and clearer edge-maps by training the model with a paired dataset using weighted cross-entropy and tracing loss.

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
CoRL 2023, arxiv, review, website
Task: robot manipulation
Propose RT-2, which directly integrates large pre-trained VLMs into low-level robot control by tokenizing the actions into text tokens & co-fine-tuning robotics data with the original web data.

Understanding and Mitigating Copying in Diffusion Models
NeurIPS 2023, arxiv, review, code
Task: analyze and mitigate memorization in T2I diffusion models
Since text conditioning plays a major role in memorization, propose train-time mitigation (use multiple captions) and inference-time mitigation (use random token replacement or addition).

Implicit Diffusion Models for Continuous Super-Resolution
CVPR 2023, arxiv, code
Task: single-image super-resolution
Continuous image super-resolution by replacing the U-Net decoder with an implicit neural representation and conditioning on multi-resolution LR features.

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
RSS 2023, arxiv, website, code
Task: robot manipulation
Diffusion formulation for visuomotor policy, which predicts high-dimensional action sequences given visual representation, works effectively for real-world robot control.

Adding Conditional Control to Text-to-Image Diffusion Models
ICCV 2023 Oral, arxiv, code, summary
Task: image-based conditional image generation
Fine-tune a trainable copy of a T2I diffusion model, connected via zero convolution, to achieve fine-grained spatial control using additional images as conditioning inputs.
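
A toy illustration of why the zero-initialized connection matters: at the start of fine-tuning, the trainable branch contributes nothing, so the combined output equals that of the frozen model. The 1x1 convolution is reduced to a per-position linear map, and all names are hypothetical.

```python
def conv1x1(x, weight, bias):
    """Apply a 1x1 convolution at a single spatial position (a linear map over channels)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def controlled_output(frozen_feat, control_feat, weight, bias):
    """Add the zero-conv projection of the trainable branch to the frozen features."""
    return [f + z for f, z in zip(frozen_feat, conv1x1(control_feat, weight, bias))]


# Zero convolution: weights and bias both start at zero.
zero_w = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]
out = controlled_output([1.5, -2.0], [9.0, 7.0], zero_w, zero_b)
```

Once training updates the weights away from zero, the conditioning branch gradually starts to influence the output.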

Learning Universal Policies via Text-Guided Video Generation
NeurIPS 2023 Spotlight, arxiv, review, website
Task: robot manipulation
Plans actions by generating a goal-directed video using a T2V diffusion model, and then infers control actions from the video using an inverse dynamics model.

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
CVPR 2023 Award Candidate, arxiv, website, code, summary
Task: subject-driven image generation
Generate novel photorealistic images of the subject contextualized in different scenes via fine-tuning with rare tokens and class-specific prior preservation loss.

Prompt-to-Prompt Image Editing with Cross Attention Control
ICLR 2023 Spotlight, arxiv, review, website, code, summary
Task: text-driven image editing
Text-driven image editing by injecting the cross-attention maps of original prompt to the cross-attention maps of edited prompt.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
ICLR 2023 Spotlight, arxiv, review, website, code, summary
Task: personalized text-to-image generation
Generate novel photorealistic images of the subject via optimizing only a single word embedding.

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
CoRL 2022 Oral, arxiv, review, website, code
Task: robot manipulation and navigation
Enable robots to perform complex real-world tasks by selecting appropriate low-level skills through high-level planning using LLM + affordance model.

MuLUT: Cooperating Multiple Look-Up Tables for Efficient Image Super-Resolution
ECCV 2022, paper, website, code
Task: single-image super-resolution
Increase receptive field size of LUT efficiently by using complementary indexing (parallel), hierarchical indexing (cascade), and fine-tuning interpolation values.

Learning to generate line drawings that convey geometry and semantics
CVPR 2022, arxiv, website, code
Task: automatic line generation
Line drawing via unpaired image-to-image translation with 4 losses: adversarial loss (LSGAN), geometry loss (pseudo depth map), semantic loss (CLIP), appearance loss (cycle consistency).

SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
ICLR 2022, arxiv, review, website, code, summary
Task: guided image synthesis & editing
Generate realistic images by adding small noise and denoising with score-based models trained on the target domain.

Learning Continuous Image Representation with Local Implicit Image Function
accept info, arxiv, website, code
Task: single-image super-resolution
Learn a continuous image representation that enables arbitrary super-resolution using a coordinate-based decoder.

Tackling the Ill-Posedness of Super-Resolution Through Adaptive Target Generation
CVPR 2021, paper, code
Task: single-image super-resolution
Modeling the one-to-many problem in SR by creating an adaptive target, which is an affine transformed version of the HR patch designed to be closer to the SR patch.

Practical Single-Image Super-Resolution Using Look-Up Table
CVPR 2021, paper, code
Task: single-image super-resolution
Practical SR by approximating small receptive field SR model into LUT, achieving similar runtime but better performance compared to interpolation methods.

Which Tasks Should Be Learned Together in Multi-task Learning?
ICML 2020, arxiv, review, website, code
Task: multi-task learning
Many common assumptions do not seem to be true: more similar tasks don't necessarily work better together & task relationships are sensitive to dataset size and network capacity.

Generalisation in humans and deep neural networks
NeurIPS 2018, arxiv, review, code, summary
Task: understanding the differences between DNNs and humans
Compared to the human visual system, DNNs (VGG, GoogLeNet, ResNet) generalize poorly under non-i.i.d. settings.

The Perception-Distortion Tradeoff
CVPR 2018 Oral, arxiv
Task: image restoration
For non-invertible degradations, a perception-distortion tradeoff always exists.

Enhanced Deep Residual Networks for Single Image Super-Resolution
CVPR 2017 Workshop, arxiv, code
Task: single-image super-resolution
Optimize network and training for SR: remove batch normalization layer, train with residual scaling and L1 loss.
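
Residual scaling can be sketched in a few lines: the output of the block body is multiplied by a small constant before being added back to the input. Here `body` is a stand-in for the convolutional layers of the block (without batch normalization), and the default `res_scale=0.1` is used for illustration.

```python
def residual_block(x, body, res_scale=0.1):
    """Residual block with scaling: x + res_scale * body(x)."""
    return [xi + res_scale * bi for xi, bi in zip(x, body(x))]


# Hypothetical body that simply amplifies its input.
y = residual_block([1.0, 2.0], lambda v: [10.0 * a for a in v])
```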

f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization
NIPS 2016 Spotlight, arxiv, review, summary
Task: image generation
Generalize GAN training objectives for all f-divergences using variational lower bound.

Deep Unsupervised Learning using Nonequilibrium Thermodynamics
accept info, arxiv, code, summary
Task: image generation
Tractable & flexible probabilistic model by learning the reverse of a forward diffusion process.
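
A small sketch of the forward (noising) half only, for Gaussian diffusion; the reverse process is what the model learns. It uses the closed-form marginal in the notation common in later DDPM-style work (`betas`, `alpha_bar`), which is an assumption of this sketch rather than the paper's own notation.

```python
import math


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t) over the noise schedule."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, alpha_bar, eps):
    """Sample x_t given x_0: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + s * e for x, e in zip(x0, eps)]


schedule = alpha_bars([0.1, 0.2])
x_t = q_sample([1.0, -1.0], schedule[-1], eps=[0.3, 0.3])  # eps would normally be N(0, 1) draws
```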

NICE: Non-linear Independent Components Estimation
ICLR 2015 Workshop, arxiv, code, summary
Task: image generation
Maximize exact log-likelihood via a change of variables, using a carefully designed invertible transformation with a tractable Jacobian.
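
A minimal sketch of an additive coupling layer, the kind of invertible transformation this refers to. The input is split into two parts; one part passes through unchanged and shifts the other. The shift function `m` can be arbitrary (here a hypothetical toy function), and the Jacobian determinant is 1, so the log-determinant term is 0.

```python
def coupling_forward(x1, x2, m):
    """y1 = x1, y2 = x2 + m(x1)."""
    return x1, [b + s for b, s in zip(x2, m(x1))]


def coupling_inverse(y1, y2, m):
    """x1 = y1, x2 = y2 - m(y1); exact inverse of coupling_forward."""
    return y1, [b - s for b, s in zip(y2, m(y1))]


def toy_shift(v):
    """Hypothetical stand-in for a neural network."""
    return [3.0 * a + 1.0 for a in v]


y1, y2 = coupling_forward([1.0, 2.0], [0.5, -0.5], toy_shift)
x1, x2 = coupling_inverse(y1, y2, toy_shift)
```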

Generative Adversarial Networks
NIPS 2014, arxiv, review, code, summary
Task: image generation
Train generative models through adversarial training, without requiring explicit likelihood estimation.

Auto-Encoding Variational Bayes
ICLR 2014 Oral, arxiv, review, summary
Task: image generation
Train directed probabilistic models by maximizing the variational lower bound with the reparameterization trick for efficient gradient-based optimization.
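
A minimal sketch of the two pieces that make this objective trainable for a diagonal Gaussian posterior and a standard normal prior: the reparameterized sample and the closed-form KL term. The function names are hypothetical, and `eps` is passed in explicitly so the example is deterministic.

```python
import math


def reparameterize(mu, logvar, eps):
    """z = mu + sigma * eps, with sigma = exp(0.5 * logvar) and eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)) summed over dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


z = reparameterize([1.0], [0.0], eps=[2.0])
kl = kl_to_standard_normal([1.0], [0.0])
```

Because the randomness lives in `eps`, gradients can flow through `mu` and `logvar` during optimization.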

format

> **paper title**  
> *accept info*, [arxiv](), [review](), [website](), [code](), [summary]()  
> Task:  
> super-brief summary
