QUSR: Quality-Aware and Uncertainty-Guided Image Super-Resolution Diffusion Model

📝 Paper Summary

Image Super-Resolution (ISR), Image Restoration

QUSR enhances super-resolution diffusion models by using a multimodal LLM to describe image degradation quality and an uncertainty map to spatially adapt noise injection, balancing detail generation with fidelity.

Core Problem

Real-world super-resolution faces a guidance dilemma: high-level semantic prompts ignore specific degradation details (blur, noise), while low-level image features are corrupted by that same degradation, so either source of guidance can lead to hallucinations or artifacts.

Why it matters:

Existing diffusion SR methods struggle with unknown, non-uniform degradations in real-world scenarios
Sole reliance on text prompts overlooks critical degradation information necessary for accurate restoration
Direct feature extraction from low-quality images transmits noise and artifacts into the final output

Concrete Example: In a real-world image with both flat backgrounds and complex textures, standard diffusion models might over-smooth the textures or hallucinate artifacts in the flat areas because they apply uniform denoising. QUSR detects high uncertainty in the textures and injects stronger noise there to stimulate detail generation, while keeping the background clean.

Key Novelty

Dual-Guidance Framework (Quality-Aware Prior + Uncertainty-Guided Noise)

Uses a Multimodal Large Language Model (Qwen2.5-VL) to generate a text description of the image's *quality* (e.g., 'blur', 'noise level'), not just content, providing explicit degradation cues to the diffusion model
Estimates a pixel-wise uncertainty map to modulate noise injection: high-uncertainty regions (edges) receive stronger noise to force detail reconstruction, while low-uncertainty regions (flat areas) receive minimal noise to preserve fidelity

Architecture

The overall QUSR framework, illustrating the dual path: (1) Uncertainty Estimation modifying the latent noise, and (2) Qwen2.5-VL generating quality prompts for the UNet.

Evaluation Highlights

Reduces FID (Fréchet Inception Distance) by 16.74 compared to the second-best method on the DRealSR dataset
Increases MUSIQ (perceptual quality metric) by 0.89 compared to the second-best method on the DRealSR dataset
Achieves State-of-the-Art (SOTA) results across all metrics on the DRealSR dataset

Breakthrough Assessment

8/10

Integrates MLLM diagnostics directly into the restoration loop with spatially adaptive diffusion, effectively addressing the 'blind' nature of real-world super-resolution. Strong quantitative gains on real-world benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Blind Image Super-Resolution (Real-World ISR)

Inputs: Low-Quality (LQ) image x_lq

Outputs: High-Quality (HQ) image x_hq

Pipeline Flow

Uncertainty Estimation: LQ Image → Uncertainty Map → Guided Latent
Quality Prior Extraction: LQ Image → MLLM → Quality Description → CLIP Embeddings
Denoising: Guided Latent + CLIP Embeddings → UNet → HQ Latent
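
The three stages above can be read as a simple data flow. The sketch below wires them together with caller-supplied functions. Every name in it (`run_qusr_pipeline` and the five stage arguments) is hypothetical and chosen for illustration; it shows only the order of operations listed above, not the authors' code.

```python
def run_qusr_pipeline(lq_image, estimate_uncertainty, guide_latent,
                      describe_quality, encode_text, denoise):
    """Compose the stages in the order given by the Pipeline Flow list.

    All stage functions are placeholders supplied by the caller.
    """
    # Path 1: LQ Image -> Uncertainty Map -> Guided Latent
    uncertainty_map = estimate_uncertainty(lq_image)
    guided_latent = guide_latent(lq_image, uncertainty_map)

    # Path 2: LQ Image -> MLLM -> Quality Description -> text embeddings
    description = describe_quality(lq_image)
    text_embeddings = encode_text(description)

    # Denoising: Guided Latent + embeddings -> HQ Latent
    return denoise(guided_latent, text_embeddings)


# Stand-in stages that only record what they received, to make the flow visible.
stages = dict(
    estimate_uncertainty=lambda img: ("u_map", img),
    guide_latent=lambda img, u: ("guided_latent", img, u),
    describe_quality=lambda img: ("description", img),
    encode_text=lambda text: ("embeddings", text),
    denoise=lambda latent, emb: ("hq_latent", latent, emb),
)
trace = run_qusr_pipeline("lq", **stages)
```

Passing the stages in as arguments keeps the sketch independent of any particular model library; the two paths do not depend on each other until the denoising step.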

System Modules

Uncertainty Estimation Module (UEM)

Estimate pixel-wise restoration difficulty to control noise injection

Model or implementation: Lightweight Encoder-Decoder (3x3 Convs + ELU)

Quality-Aware Prior (QAP) Generator (Prior Extraction)

Generate text description of image content and degradation attributes

Model or implementation: Qwen2.5-VL-7B-Instruct

Text Encoder (Prior Extraction)

Convert quality description into embeddings for the diffusion model

Model or implementation: CLIP Text Encoder

Denoising Network

Reconstruct high-resolution details from guided latents

Model or implementation: Stable Diffusion 2.1 UNet (with LoRA)

Novel Architectural Elements

Uncertainty-Guided Noise Generation mechanism that injects spatially variant noise into the latent space based on estimated aleatoric uncertainty
Integration of MLLM-generated 'quality descriptions' (diagnosing blur/noise/lighting) as explicit conditioning via Cross-Attention
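
A minimal sketch of spatially variant noise injection may help make the first element concrete. It assumes the uncertainty map is normalized to [0, 1] and used as a per-pixel multiplier on Gaussian noise; the function name, the `sigma_max` parameter, and the normalization are illustrative assumptions, not the exact formulation in the paper.

```python
import numpy as np

def inject_uncertainty_guided_noise(latent, uncertainty, sigma_max=1.0, rng=None):
    """Add Gaussian noise whose per-pixel strength follows the uncertainty map.

    latent:      array of shape (C, H, W)
    uncertainty: array of shape (H, W), any non-negative scale
    sigma_max:   hypothetical cap on the noise standard deviation
    """
    rng = np.random.default_rng() if rng is None else rng
    # Normalize the map to [0, 1] so it acts as a per-pixel noise weight.
    u = uncertainty - uncertainty.min()
    span = u.max()
    weight = u / span if span > 0 else np.zeros_like(u)
    noise = rng.standard_normal(latent.shape)
    # Broadcast the (H, W) weight across the channel axis.
    return latent + sigma_max * weight[None, :, :] * noise


latent = np.zeros((4, 8, 8))
uncertainty = np.zeros((8, 8))
uncertainty[:, 4:] = 1.0  # right half stands in for texture, left half for flat background
guided = inject_uncertainty_guided_noise(
    latent, uncertainty, rng=np.random.default_rng(0)
)
```

In this toy case the left half of `guided` stays untouched while the right half receives full-strength noise, which mirrors the intended behavior: minimal noise in low-uncertainty regions, stronger noise where detail has to be regenerated.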

Modeling

Base Model: Stable Diffusion 2.1

Training Method: Supervised Fine-Tuning with LoRA

Objective Functions:

Purpose: Ensure pixel-level fidelity.
Formally: L2 Loss

Purpose: Enhance visual realism via deep feature similarity.
Formally: LPIPS Loss

Purpose: Align generation with semantic quality prompts using implicit classifier guidance.
Formally: Classifier Score Distillation (CSD) Loss

Purpose: Prioritize fidelity in low-uncertainty regions while relaxing constraints in complex areas.
Formally: L_un = ||x_hq - x_gt||_1 * exp(-U_n) + alpha * U_n (Uncertainty Loss)
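
The uncertainty loss can be sketched directly from the formula above. The version below assumes U_n is a per-pixel uncertainty map and that the per-pixel terms are averaged; the value of alpha is not given in this summary, so the default here is a placeholder.

```python
import numpy as np

def uncertainty_loss(x_hq, x_gt, u, alpha=1.0):
    """Mean over pixels of |x_hq - x_gt| * exp(-u) + alpha * u.

    alpha and the mean reduction are assumptions made for illustration.
    """
    abs_err = np.abs(x_hq - x_gt)
    # exp(-u) down-weights the error where uncertainty is high;
    # alpha * u penalizes predicting high uncertainty everywhere.
    return float(np.mean(abs_err * np.exp(-u) + alpha * u))


x_gt = np.zeros((2, 2))
x_hq = np.ones((2, 2))
loss_confident = uncertainty_loss(x_hq, x_gt, u=np.zeros((2, 2)), alpha=0.1)
loss_uncertain = uncertainty_loss(x_hq, x_gt, u=np.ones((2, 2)), alpha=0.1)
```

With the same reconstruction error, the high-uncertainty case yields a smaller loss, which is how the constraint is relaxed in complex areas while the `alpha * u` term keeps the map from growing without bound.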

Adaptation: LoRA (rank=4)
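
To illustrate what rank-4 adaptation means for a single linear layer, here is a small NumPy sketch. The layer size (320 x 320), the initialization, and the `scale` factor are assumptions for illustration; which UNet layers receive adapters is not stated in this summary.

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    """Compute x @ (W + scale * B @ A).T with W frozen; only A and B are trained."""
    return x @ W.T + scale * (x @ A.T) @ B.T


d_out, d_in, rank = 320, 320, 4  # hypothetical layer size, rank from the summary
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))       # frozen pre-trained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # zero init: adapter starts as a no-op
x = rng.standard_normal((2, d_in))

y = lora_forward(x, W, A, B)
full_params = W.size            # 102400 weights in the frozen layer
lora_params = A.size + B.size   # 2560 trainable adapter weights
```

For this hypothetical layer the adapter trains 2,560 values instead of 102,400, which is why a low rank keeps fine-tuning feasible on 24 GB GPUs.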

Training Data:

Training: LSDIR dataset + first 10k FFHQ images
Degradation: RealESRGAN pipeline
Testing: RealSR and DRealSR datasets
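
As a rough illustration of how low-quality training inputs are synthesized from high-quality images, the sketch below applies a blur, a 4x downsample, and additive noise. It is a deliberately simplified stand-in: the actual RealESRGAN pipeline chains more operations (such as resizing and JPEG compression, applied in repeated rounds), and all parameter values here are assumptions.

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def degrade(hq, scale=4, blur_sigma=1.5, noise_std=0.02, rng=None):
    """Blur, downsample, and add noise to a grayscale image in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    radius = 3
    k = gaussian_kernel1d(blur_sigma, radius)
    padded = np.pad(hq, radius, mode="reflect")
    # Separable Gaussian blur: filter along rows, then along columns.
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, blurred)
    # Downsample by keeping every `scale`-th pixel.
    lq = blurred[::scale, ::scale]
    lq = lq + noise_std * rng.standard_normal(lq.shape)
    return np.clip(lq, 0.0, 1.0)


hq = np.random.default_rng(0).random((32, 32))
lq = degrade(hq, rng=np.random.default_rng(1))
flat = degrade(np.full((32, 32), 0.5), noise_std=0.0)
```

A 32 x 32 input yields an 8 x 8 output, matching the 4x setting used in the evaluation.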

Key Hyperparameters:

learning_rate: 3e-5
batch_size: 4
iterations: 15000
lora_rank: 4
loss_weights: {'lambda_1 (L2)': 0.5, 'lambda_2 (LPIPS)': 2, 'lambda_3 (CSD)': 2, 'lambda_4 (Uncertainty)': 0.3}
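
The loss weights above suggest that the four objectives are combined as a weighted sum. The sketch below assumes exactly that; the individual loss values in `example_terms` are made-up numbers used only to show the arithmetic.

```python
# Weights taken from the hyperparameter list above (lambda_1 .. lambda_4).
LOSS_WEIGHTS = {"l2": 0.5, "lpips": 2.0, "csd": 2.0, "uncertainty": 0.3}

def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of loss terms; assumes a plain linear combination."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


# Illustrative per-term values, not measurements from the paper.
example_terms = {"l2": 0.02, "lpips": 0.30, "csd": 0.10, "uncertainty": 0.50}
combined = total_loss(example_terms)  # 0.01 + 0.60 + 0.20 + 0.15
```

Under this assumption the perceptual terms (LPIPS and CSD) carry four times the weight of the L2 term.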

Compute: 4x NVIDIA RTX 3090 (24GB)

Comparison to Prior Work

vs. SeeSR/PiSA-SR: These use tags that overlook degradation details; QUSR uses MLLM to explicitly describe degradation (blur, noise)
vs. XPSR: QUSR adds spatially adaptive uncertainty-guided noise, whereas XPSR relies solely on prompting
vs. StableSR/DiffBIR: QUSR uses adaptive noise injection to handle non-uniform degradation, unlike global conditioning in baselines

Limitations

Inference is likely slower because a large MLLM (Qwen2.5-VL) must run for every input image
Heavily dependent on the accuracy of the MLLM's quality description; an incorrect diagnosis may mislead restoration
Requires a synthetic degradation pipeline (RealESRGAN) for training data, which may still leave a domain gap with some real-world artifacts

Reproducibility

Code: https://github.com/oTvTog/QUSR

Source code is publicly available. The model relies on pre-trained Stable Diffusion 2.1 and Qwen2.5-VL-7B-Instruct. Training uses the standard RealESRGAN degradation pipeline for data synthesis.

📊 Experiments & Results

Evaluation Setup

4x Super-Resolution on real-world datasets

Benchmarks:

RealSR (Real-world Super-Resolution)
DRealSR (Real-world Super-Resolution)

Metrics:

PSNR
SSIM
LPIPS
FID
MUSIQ
CLIPIQA
MANIQA
DISTS

Statistical methodology: Not explicitly reported in the paper
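
Of the metrics listed, PSNR is simple enough to state in a few lines. The sketch below uses the standard definition on 8-bit images. Whether the paper evaluates on RGB or on a luminance channel is not stated in this summary, so that choice is left to the caller.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


reference = np.zeros((4, 4), dtype=np.uint8)
off_by_one = np.full((4, 4), 1, dtype=np.uint8)
score = psnr(reference, off_by_one)  # MSE is 1, so this equals 20 * log10(255)
```

The learned metrics in the list (LPIPS, FID, MUSIQ, CLIPIQA, MANIQA, DISTS) depend on pre-trained networks and are not reproduced here.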

Key Results

State-of-the-art comparison on DRealSR dataset showing significant improvements in both fidelity and perceptual quality (Δ is relative to the second-best method).

Benchmark	Metric	Baseline	This Paper	Δ
DRealSR	FID	Not reported in the paper	Not reported in the paper	-16.74
DRealSR	MUSIQ	Not reported in the paper	Not reported in the paper	+0.89

Ablation study demonstrating the necessity of both the Quality-Aware Prior (QAP) and Uncertainty-Guided Noise Generation (UNG) modules.

Benchmark	Metric	Baseline	This Paper	Δ
Real-world datasets	MUSIQ	Significantly lower	Higher	Positive

Experiment Figures

Visual comparison of super-resolved images from QUSR vs. SOTA methods on real-world samples.

Main Takeaways

The Uncertainty-Guided Noise Generation (UNG) module is critical for texture reconstruction; removing it degrades every reported metric.
The Quality-Aware Prior (QAP) is essential for perceptual quality; without it, the model defaults to higher fidelity (PSNR) but lower realism (MUSIQ), failing to align with human perception.
QUSR achieves a superior trade-off between fidelity and photorealism compared to existing methods like SeeSR and DiffBIR, particularly in handling dense, repetitive textures.

📚 Prerequisite Knowledge

Prerequisites

Diffusion Models (specifically Stable Diffusion)
Image Super-Resolution concepts
Multimodal Large Language Models (MLLMs)

Key Terms

MLLM: Multimodal Large Language Model, an AI model capable of processing and understanding both text and image inputs

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pre-trained weights and trains only small low-rank matrices added to them

FID: Fréchet Inception Distance, a metric for evaluating generated images by comparing the distribution of their features with that of real images; lower is better

LPIPS: Learned Perceptual Image Patch Similarity, a metric that measures the perceptual distance between two images using deep network features; lower is better

MUSIQ: Multi-scale Image Quality Transformer, a no-reference metric that predicts the perceptual quality of an image without a ground-truth reference; higher is better

Aleatoric uncertainty: Uncertainty arising from inherent noise or randomness in the data itself (e.g., blur or sensor noise)

CFG: Classifier-Free Guidance, a technique in diffusion models to control how strongly the generation follows the conditioning signal (e.g., text prompt)
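
For readers new to CFG, the standard combination rule is short enough to show. The sketch below uses the common formulation in which the conditional and unconditional noise predictions are linearly extrapolated; the function and argument names are illustrative, and the guidance scale used by QUSR is not stated in this summary.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction toward
    (and, for scales above 1, past) the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 3.0])
no_guidance = cfg_combine(eps_uncond, eps_cond, guidance_scale=1.0)  # equals eps_cond
amplified = cfg_combine(eps_uncond, eps_cond, guidance_scale=2.0)    # [2.0, 5.0]
```

A scale of 0 ignores the prompt entirely, 1 reproduces the conditional prediction, and larger values push the output further toward the conditioning signal.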