Machine Learning · Remote Sensing · Deep Learning

Generative AI for Satellite Cloud Removal

A benchmarking study comparing a conditional GAN, diffusion inpainting, and multi-temporal compositing on 15,000 Sentinel-2 patches across five cloud coverage levels.

GAN / U-Net · Diffusion / DDPM · Multi-temporal Fusion · Sentinel-2 · PyTorch · Google Earth Engine
Chuan Zou · Demi Yang · Christine Cui · Penn Engineering · April 2026
01 · Overview

The Problem with Clouds

Cloud-free satellite imagery is essential for land-cover mapping, vegetation monitoring (NDVI), urban expansion analysis, and disaster response. Yet in humid regions — particularly southern China — the majority of Sentinel-2 acquisitions are partially or fully occluded by clouds.

This project asks: can cloud-covered image patches be reconstructed reliably enough to support downstream visual inspection and vegetation-sensitive analysis when a fully cloud-free scene isn't available?

We benchmark three approaches across five cloud coverage levels (5%, 10%, 30%, 50%, 70%), evaluating reconstruction quality via PSNR, SSIM, cloud-region L1 error, and NDVI MAE.


Key Finding
Once cloud coverage reaches 30%, deep learning methods substantially outperform temporal fusion on PSNR, SSIM, and cloud-region L1. At 5–10% coverage, multi-temporal compositing is nearly perfect. The conditional GAN outperforms diffusion on every reported metric.
Base Patches: 3,000 cloud-free images
Total Samples: 15,000 across 5 coverage levels
Patch Size: 128 × 128 pixels at 10 m resolution
Methods: 3 benchmarked
02 · Dataset

Sentinel-2 over Beijing

We pull COPERNICUS/S2_SR_HARMONIZED (Level-2A surface reflectance) from Google Earth Engine, covering Beijing and its urban-rural fringe during the growing season (April–October 2023). This window captures vegetated periods most sensitive to NDVI accuracy.

Each 128×128 pixel patch captures four spectral bands: Red (B4), Green (B3), Blue (B2), and NIR (B8) at 10m native resolution. Synthetic clouds are generated via Gaussian-filtered noise thresholded to exact coverage targets, ensuring ground-truth availability at all occlusion levels.
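The mask generation described above can be sketched as follows. This is a minimal NumPy illustration, not the project's actual code: the blur width `sigma`, the seed handling, and the rank-based thresholding (marking the largest values of the smoothed field as cloud so the coverage is exact) are assumptions made for the example.

```python
import numpy as np

def gaussian_kernel1d(sigma: float) -> np.ndarray:
    """Normalized 1-D Gaussian kernel truncated at 3 sigma."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth(field: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur: filter rows, then columns (shape is preserved)."""
    k = gaussian_kernel1d(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, field)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def synthetic_cloud_mask(size=128, coverage=0.30, sigma=8.0, seed=0) -> np.ndarray:
    """Boolean mask (True = cloud) whose cloud fraction matches `coverage`."""
    rng = np.random.default_rng(seed)
    field = smooth(rng.standard_normal((size, size)), sigma)  # assumed sigma
    n_cloud = int(round(coverage * size * size))
    # Rank-based threshold: the n_cloud largest values become cloud.
    order = np.argsort(field, axis=None)
    mask = np.zeros(size * size, dtype=bool)
    mask[order[size * size - n_cloud:]] = True
    return mask.reshape(size, size)

for level in (0.05, 0.10, 0.30, 0.50, 0.70):
    m = synthetic_cloud_mask(coverage=level)
    print(f"target {level:.0%}, actual {m.mean():.2%}")
```

How the masked pixels are then rendered as cloud in the image (brightness, opacity) is not specified in the text, so the sketch stops at the mask.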

Dataset split is done at the base-image level before cloud augmentation to prevent train-test leakage (70% train / 10% val / 20% test).
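A small sketch of what splitting at the base-image level means in practice: base patch IDs are partitioned first, and only then is each ID expanded into its five cloud-augmented samples. The function names, the seed, and the use of integer IDs are hypothetical; only the 70/10/20 fractions, the 3,000 base patches, and the five coverage levels come from the text.

```python
import random

COVERAGE_LEVELS = [0.05, 0.10, 0.30, 0.50, 0.70]

def split_base_ids(n_base=3000, fractions=(0.7, 0.1, 0.2), seed=42):
    """Shuffle base-image IDs and partition them into train/val/test."""
    ids = list(range(n_base))
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * n_base)
    n_val = round(fractions[1] * n_base)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

def expand_with_clouds(split):
    """Attach every coverage level to every base ID, within its own split only."""
    return {
        name: [(base_id, level) for base_id in ids for level in COVERAGE_LEVELS]
        for name, ids in split.items()
    }

split = split_base_ids()
samples = expand_with_clouds(split)
print({name: len(items) for name, items in samples.items()})
```

Because augmentation happens after the partition, no base image can contribute a 5%-cloud version to training and a 70%-cloud version to testing.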


Coverage Distribution

5%: 3,000 patches
10%: 3,000 patches
30%: 3,000 patches
50%: 3,000 patches
70%: 3,000 patches
Why synthetic clouds? Synthetic masking guarantees known ground truth at every coverage level, which allows a controlled comparison that real cloudy images cannot provide.
03 · Methods

Three Approaches Benchmarked

M-01
Multi-temporal Baseline
Baseline

Averages cloud-free pixels across 4 temporal observations (the original plus 3 neighbors). Falls back to the cloudy value when all frames are masked.

Failure probability at 70% coverage: if cloud masks are independent across frames, 0.7⁴ ≈ 24% of pixels have no clean alternative.

Params: None (non-learned)
Training: Not required
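The baseline's averaging-with-fallback rule can be written in a few lines of NumPy. This is a sketch under assumptions: the array layout `(T, H, W, C)`, the convention that `frames[0]` is the target observation, and the choice of the target's own cloudy value as the fallback are illustrative and not confirmed by the text.

```python
import numpy as np

def temporal_composite(frames: np.ndarray, cloud_masks: np.ndarray) -> np.ndarray:
    """Average cloud-free pixels over time, with a cloudy fallback.

    frames:      (T, H, W, C) float array; frames[0] is the target observation.
    cloud_masks: (T, H, W) bool array, True where a pixel is cloudy.
    """
    clear = ~cloud_masks                                  # (T, H, W)
    n_clear = clear.sum(axis=0)                           # clear observations per pixel
    summed = (frames * clear[..., None]).sum(axis=0)      # sum over clear frames only
    mean_clear = summed / np.maximum(n_clear, 1)[..., None]
    all_cloudy = (n_clear == 0)[..., None]
    # Assumed fallback: keep the target frame's cloudy value where no frame is clear.
    return np.where(all_cloudy, frames[0], mean_clear)

# Tiny example: 2 frames, a 1x2 image, 1 band.
frames = np.array([[[[1.0], [5.0]]], [[[3.0], [7.0]]]])
masks = np.array([[[False, True]], [[False, True]]])
print(temporal_composite(frames, masks)[0, :, 0])  # first pixel averaged, second falls back
```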
M-02
Conditional GAN
Best Result

U-Net generator (5-channel input: 4-band image + mask) paired with a PatchGAN discriminator. Adversarial loss plus 100× L1 loss trains a constrained image translator, not a free-form generator.

Generator: 468,484 params
Discriminator: 170,209 params
Training: 8 epochs, lr 5e-5
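To make the "adversarial + 100× L1" objective concrete, here is a framework-agnostic NumPy sketch of the generator loss and the 5-channel conditioning input. The project itself uses PyTorch; the binary cross-entropy form of the adversarial term, the L1 term taken over the whole patch, and all function names are assumptions in the style of pix2pix, not the authors' confirmed implementation.

```python
import numpy as np

LAMBDA_L1 = 100.0  # weight on the L1 term, as stated in the text

def build_generator_input(cloudy: np.ndarray, cloud_mask: np.ndarray) -> np.ndarray:
    """Stack the 4-band cloudy patch (H, W, 4) with its mask (H, W) into 5 channels."""
    return np.concatenate([cloudy, cloud_mask[..., None].astype(cloudy.dtype)], axis=-1)

def bce_with_logits(logits: np.ndarray, target: float) -> float:
    """Numerically stable binary cross-entropy on raw discriminator logits."""
    return float(np.mean(
        np.maximum(logits, 0) - logits * target + np.log1p(np.exp(-np.abs(logits)))
    ))

def generator_loss(disc_logits_fake: np.ndarray, fake: np.ndarray, real: np.ndarray) -> float:
    """Adversarial term (fool the discriminator) plus weighted L1 to the ground truth."""
    adversarial = bce_with_logits(disc_logits_fake, 1.0)
    l1 = float(np.mean(np.abs(fake - real)))
    return adversarial + LAMBDA_L1 * l1

real = np.full((128, 128, 4), 0.30)
fake = real + 0.01                       # a reconstruction that is off by 0.01 everywhere
patch_logits = np.zeros((14, 14))        # hypothetical PatchGAN output grid
print(generator_loss(patch_logits, fake, real))
```

With a weight of 100, even a mean absolute error of 0.01 contributes more to the loss than the adversarial term does, which is consistent with the text's description of the generator as a constrained translator.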
M-03
Diffusion Inpainting
Comparator

DDPM-style 100-step denoising with a U-Net denoiser (6-channel input: conditioned image + mask + timestep). Known clean pixels are re-imposed at each reverse step to constrain generation.

Denoiser: 468,772 params
Timesteps: 100 (DDPM, β: 1e-4 → 0.02)
Training: 50 epochs
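The noise schedule and the re-imposition step can be sketched as below. The linear spacing of β between 1e-4 and 0.02 is an assumption (the text gives only the endpoints), and pasting the clean observed values directly, rather than a noised copy of them, is one possible reading of "re-imposes known clean pixels". The denoiser itself is omitted.

```python
import numpy as np

T = 100
betas = np.linspace(1e-4, 0.02, T)     # assumed linear schedule between the stated endpoints
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)         # cumulative signal retention after each step

def q_sample(x0: np.ndarray, t: int, noise: np.ndarray) -> np.ndarray:
    """Forward process: noise a clean patch x0 to timestep t."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def reimpose_known(x_t: np.ndarray, observed: np.ndarray, cloud_mask: np.ndarray) -> np.ndarray:
    """Keep generated values under cloud; paste observed values everywhere else.

    x_t, observed: (H, W, C) arrays; cloud_mask: (H, W) bool, True = cloud.
    """
    return np.where(cloud_mask[..., None], x_t, observed)

print(f"alpha_bar at the final step: {alpha_bar[-1]:.3f}")
```

Under this assumed schedule, `alpha_bar` at the last of the 100 steps is roughly 0.36, so the forward process does not reach pure noise. That may be one reason the choice of starting point matters for such a short chain.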
04 · Results

Performance Across Coverage Levels

Note: At 5–10% coverage, multi-temporal compositing is nearly perfect (PSNR ≈ 163 dB, SSIM ≈ 0.98) because the probability that all four temporal frames are clouded at the same pixel is negligible (0.05⁴ ≈ 0.0006% at 5% coverage, 0.1⁴ = 0.01% at 10%). The table below focuses on the challenging ≥30% regime.
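The failure probabilities quoted in this section follow from one line of arithmetic, tabulated here for all five coverage levels. The calculation assumes that cloud masks are independent across the four frames and that coverage is uniform over the patch; both are simplifications.

```python
LEVELS = [0.05, 0.10, 0.30, 0.50, 0.70]
N_FRAMES = 4
PIXELS_PER_PATCH = 128 * 128

def p_all_cloudy(coverage: float, n_frames: int = N_FRAMES) -> float:
    """Probability that a pixel is cloudy in every one of n_frames independent frames."""
    return coverage ** n_frames

for level in LEVELS:
    p = p_all_cloudy(level)
    print(f"{level:>4.0%} coverage: P(no clear frame) = {p:.4%}, "
          f"about {p * PIXELS_PER_PATCH:.1f} pixels per patch")
```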
Method           PSNR (dB) ↑   SSIM ↑   L1 Cloud ↓   NDVI MAE ↓   Avg. Coverage
Multi-temporal   15.09         0.142    0.1453       0.0877       30–70%
GAN (ours)       32.76         0.850    0.0243       0.1127       30–70%
Diffusion        29.20         0.711    0.0375       0.1850       30–70%

Metrics are averaged over the 30%, 50%, and 70% coverage levels. ↑ = higher is better · ↓ = lower is better.
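For reference, three of the four reported metrics can be computed as in the sketch below. It assumes reflectance scaled to [0, 1], a band order of (Red, Green, Blue, NIR) matching the order in which the bands are listed earlier, and a small `eps` in the NDVI denominator; none of these details are confirmed by the text. SSIM is omitted because it is normally taken from a library rather than written by hand.

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB over the whole patch."""
    mse = float(np.mean((pred - target) ** 2))
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def cloud_l1(pred: np.ndarray, target: np.ndarray, cloud_mask: np.ndarray) -> float:
    """Mean absolute error restricted to cloud-covered pixels (all bands)."""
    return float(np.abs(pred - target)[cloud_mask].mean())

def ndvi(img: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """NDVI = (NIR - Red) / (NIR + Red); assumes band 0 is Red and band 3 is NIR."""
    red, nir = img[..., 0], img[..., 3]
    return (nir - red) / (nir + red + eps)

def ndvi_mae(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute difference between predicted and reference NDVI maps."""
    return float(np.mean(np.abs(ndvi(pred) - ndvi(target))))

target = np.full((128, 128, 4), 0.5)
pred = target + 0.1
mask = np.zeros((128, 128), dtype=bool)
mask[:32, :] = True
print(psnr(pred, target), cloud_l1(pred, target, mask), ndvi_mae(pred, target))
```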

GAN improvement over temporal baseline by coverage level

30%: +10.6 dB PSNR
50%: +19.2 dB PSNR
70%: +23.2 dB PSNR
Why does diffusion underperform on NDVI MAE? Diffusion's NDVI MAE is 1.5–1.7× that of the GAN at every coverage level. This suggests that while the RGB reconstruction may look plausible, the NIR band (B8), which is critical for NDVI, is reconstructed less faithfully. That matters for downstream vegetation monitoring applications.
05 · Insights & Limitations

What We Learned

Key Technical Insights

01

The 30% coverage level marks a phase transition. At 5–10% coverage, temporal compositing is near-perfect, and even at 30% the probability that a pixel is clouded in all 4 frames is only 0.3⁴ ≈ 0.8%. Beyond that, the failure probability climbs quickly (0.5⁴ ≈ 6% at 50%, 0.7⁴ ≈ 24% at 70%), and the GAN's context-based reconstruction pulls ahead.

02

The GAN wins because it is constrained, not because it is generative. The conditioned U-Net translates cloudy images to clean ones as a supervised regression, not as free-form generation. This makes it more reliable and spectrally consistent than diffusion under limited training budgets.

03

PSNR and cloud-region L1 can tell different stories at 30%. About 133 pixels per patch (0.3⁴ × 128² ≈ 133) have no cloud-free temporal observation, so the baseline leaves them cloudy. PSNR squares these large errors, which causes a sharp drop, while cloud-region L1 averages them over all cloud pixels and shows a smaller change. The GAN avoids this spike by reconstructing from spatial context.
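A back-of-the-envelope calculation shows how a small set of badly wrong pixels moves the two metrics. The error magnitude of 0.5 on the failed pixels (reflectance scaled to [0, 1]) is a hypothetical value chosen for illustration, and every other pixel is assumed to be reconstructed exactly.

```python
import math

PIXELS = 128 * 128
COVERAGE = 0.30
N_FAILED = 133     # pixels with no clear observation in any frame, from the text
ERR = 0.5          # hypothetical absolute error on each failed pixel

def psnr_from_sparse_errors(n_failed: int, err: float, n_pixels: int = PIXELS) -> float:
    """PSNR (dB, peak value 1.0) when only n_failed pixels are wrong by `err`."""
    mse = n_failed * err ** 2 / n_pixels
    return 10.0 * math.log10(1.0 / mse)

def cloud_l1_from_sparse_errors(n_failed: int, err: float,
                                coverage: float = COVERAGE,
                                n_pixels: int = PIXELS) -> float:
    """Mean absolute error over the cloud region for the same sparse errors."""
    return n_failed * err / (coverage * n_pixels)

print(f"PSNR:     {psnr_from_sparse_errors(N_FAILED, ERR):.1f} dB")
print(f"Cloud L1: {cloud_l1_from_sparse_errors(N_FAILED, ERR):.4f}")
```

Under these assumptions a patch that would otherwise be near-perfect lands at roughly 27 dB, while the cloud-region L1 stays near 0.01, since the same errors are spread over about 4,900 cloud pixels.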

04

Diffusion initialization matters. Starting the diffusion process from the mean of clean-region pixels (rather than the cloudy proxy) was critical. A linear final activation (not sigmoid) was also required to avoid output range collapse in the denoised patches.
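One way to read the initialization described above is to fill the cloud region with the per-band mean of the clean pixels before the reverse process starts. The sketch below implements that reading; the per-band granularity and the function name are assumptions, and whether noise is added on top of this fill is not stated in the text.

```python
import numpy as np

def init_from_clean_mean(cloudy: np.ndarray, cloud_mask: np.ndarray) -> np.ndarray:
    """Replace cloud pixels with the per-band mean of the clean pixels.

    cloudy:     (H, W, C) patch containing cloud-contaminated values.
    cloud_mask: (H, W) bool array, True = cloud. Assumes at least one clean pixel,
                which holds for the 5-70% coverage levels used here.
    """
    clean_mean = cloudy[~cloud_mask].mean(axis=0)   # shape (C,)
    init = cloudy.copy()
    init[cloud_mask] = clean_mean                   # broadcast over all cloud pixels
    return init

patch = np.zeros((4, 4, 4))
patch[..., 3] = 0.4            # NIR band of the clean surface
mask = np.zeros((4, 4), dtype=bool)
mask[0, :] = True
patch[mask] = 0.9              # bright synthetic cloud
print(init_from_clean_mean(patch, mask)[0, 0])
```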

Limitations & Honest Assessment

Synthetic vs. Real Clouds

Our synthetic masks use Gaussian-filtered noise, not real atmospheric scattering patterns. Results may not transfer directly to optically thin clouds or cloud shadows.

Geographic Generalization

All training data comes from Beijing (April–October 2023). The motivating application, cloud removal for humid southern China (e.g., Yunnan), has not been validated.

Diffusion Budget Constraint

The diffusion result represents a limited-capacity, limited-training-budget baseline. With 100+ epochs, a larger backbone, and cloud-region-weighted objectives, diffusion may substantially close the gap.

No Downstream Task Validation

Evaluation is image-centric. The benefit of cloud removal for actual land-cover classification or vegetation monitoring workflows has not been directly measured.

06 · Future Work

Next Steps

Multi-regional Validation

Expand to Yunnan province and other cloud-prone regions in southern China to test the geographic transferability of the GAN model.

Enhanced Diffusion Training

Provide more data, an extended training budget (100+ epochs), a larger backbone, and cloud-region-weighted objectives to better gauge diffusion's potential.

Downstream Task Integration

Validate the effect of cloud removal directly on land-cover classification accuracy and vegetation index (NDVI/EVI) computation in real-world monitoring pipelines.

Real Atmospheric Clouds

Incorporate optically derived cloud masks from Sentinel-2 QA bands and evaluate whether synthetic-trained models generalize to real cloud morphology.
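As a starting point for the real-cloud evaluation, the Sentinel-2 QA60 band encodes opaque cloud in bit 10 and cirrus in bit 11. The sketch below decodes those bits from an array of QA60 values with NumPy. Using QA60 specifically is an assumption, since the text says only "QA bands", and QA60 availability for the chosen dates should be checked in the data catalog before relying on it.

```python
import numpy as np

OPAQUE_CLOUD_BIT = 10
CIRRUS_BIT = 11

def qa60_cloud_mask(qa60) -> np.ndarray:
    """Boolean mask, True where QA60 flags opaque cloud or cirrus."""
    qa = np.asarray(qa60, dtype=np.uint16)
    opaque = (qa >> OPAQUE_CLOUD_BIT) & 1
    cirrus = (qa >> CIRRUS_BIT) & 1
    return (opaque | cirrus).astype(bool)

# 0 = clear, 1024 = opaque cloud, 2048 = cirrus, 3072 = both flags set
print(qa60_cloud_mask([0, 1024, 2048, 3072]))
```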
