Technical University of Munich and others developed a satellite image generation method based on SD3 and constructed the largest-scale remote sensing dataset to date.
A team from the Technical University of Munich in Germany and the University of Zurich in Switzerland proposed a new method to generate satellite images using Stable Diffusion 3 (SD3) conditioned on geographical and climatic cues. They also created EcoMapper, the largest and most comprehensive remote sensing dataset to date.
Satellite images are images of the Earth's surface obtained through satellite remote sensing. By providing a view from space, they digitize information about the Earth and enable large-scale detection, dynamic tracking, and data-driven decision support. In daily life, from macro-level environmental governance to the details of urban living, satellite imagery is indispensable. In forestry monitoring, for example, satellite images can quickly delineate forest distribution, calculate the coverage ratio of different forest types, and detect changes in forest cover caused by logging, planting, pests, and disease.
However, satellite monitoring is susceptible to multiple interfering factors that undermine its performance and applicability. Cloud cover is particularly disruptive: in frequently cloudy regions, satellite monitoring may be interrupted for days or even weeks. This not only hinders real-time dynamic monitoring but also creates a need to combine satellite imagery with climate data to improve prediction accuracy. The rapid development of artificial intelligence and machine learning offers a way to meet this need, yet most current methods are designed for specific tasks or regions and lack the generality required for global application.
To address these issues, the team proposed a new method that generates satellite images with Stable Diffusion 3 (SD3) conditioned on geographic and climatic cues, and built EcoMapper, the largest and most comprehensive remote sensing dataset to date. The dataset collects over 2.9 million RGB Sentinel-2 satellite images from 104,424 locations worldwide, covering 15 land cover types with corresponding climate records, and underpins two satellite image generation methods built on the fine-tuned SD3 model. By combining synthetic image generation with climate and land cover data, the proposed method advances generative modeling in remote sensing, fills observation gaps in areas affected by persistent cloud cover, and provides new tools for global climate adaptation and geospatial analysis.
The research results, titled "EcoMapper: Generative Modeling for Climate-Aware Satellite Imagery," were accepted to ICML 2025.
Research highlights:
* Constructed EcoMapper, the largest and most comprehensive remote sensing dataset to date, containing over 2.9 million satellite images
* Developed a text-to-image generative model based on fine-tuned Stable Diffusion 3 that uses text prompts containing climate and land cover details to generate realistic synthetic images of specific regions
* Developed a multi-conditional (text + image) model framework using ControlNet to map climate data onto imagery or generate time series that simulate landscape evolution
Paper link:
https://go.hyper.ai/VFRWu
Dataset download link:
https://go.hyper.ai/uhOIw
More cutting-edge AI papers:
https://go.hyper.ai/owxf6
Dataset: The largest and most comprehensive remote sensing dataset to date
EcoMapper is the largest and most comprehensive remote sensing dataset to date. It consists of 2,904,000 satellite images with climate metadata, sampled from 104,424 geographic locations worldwide and covering 15 different land cover types, as shown in the figure below:
Annual observation volume and the total number of images per batch (note: some locations are missing due to fitting of the land cover distribution)
The training set contains 98,930 geographic locations, each observed for 24 months. For every location, the researchers selected one observation per month over two years, choosing the day with the least cloud cover, which yields a sequence of 24 images per location. The two-year observation windows were randomly distributed between 2017 and 2022.
The test set contains 5,494 geographic locations, each observed monthly over 96 months (8 years) spanning 2017 to 2024.
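This per-month, least-cloudy selection can be illustrated with a small pandas sketch; the catalog columns here are hypothetical and not the dataset's actual schema:

```python
import pandas as pd

# Hypothetical catalog of Sentinel-2 observations: one row per image.
catalog = pd.DataFrame({
    "location_id": [17, 17, 17, 17],
    "date": pd.to_datetime(["2019-03-02", "2019-03-17", "2019-04-01", "2019-04-21"]),
    "cloud_cover": [0.42, 0.08, 0.15, 0.55],
})

catalog["month"] = catalog["date"].dt.to_period("M")
# For each location and month, keep the single least-cloudy observation,
# yielding one image per month (24 per location over two years).
monthly = catalog.loc[catalog.groupby(["location_id", "month"])["cloud_cover"].idxmin()]
print(monthly)
```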
Spatially, each observation covers approximately 26.21 square kilometers, and the entire dataset covers approximately 2,704,000 square kilometers, about 2.05% of Earth's total land area. This design ensures sufficient spatial and temporal independence during evaluation and enables a robust assessment of model generalization across regions and unseen climate conditions.
In addition, each sampled location is enriched with metadata: geographic location (latitude and longitude), observation date (year and month), land cover type, cloud cover rate, and monthly mean temperature, solar radiation, and total precipitation from NASA POWER. This metadata makes the dataset valuable for agriculture, forestry, land cover mapping, and biodiversity studies.
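A minimal sketch of what one such record might look like; the field names and types are our assumption, not the published schema:

```python
from dataclasses import dataclass

@dataclass
class EcoMapperRecord:
    """One satellite observation with EcoMapper-style metadata (hypothetical schema)."""
    latitude: float          # degrees
    longitude: float         # degrees
    year: int
    month: int
    land_cover: str          # one of 15 land cover types, e.g. "croplands"
    cloud_cover: float       # fraction in [0, 1]
    temperature_c: float     # monthly mean temperature (NASA POWER)
    solar_radiation: float   # monthly mean solar radiation (NASA POWER)
    precipitation_mm: float  # monthly total precipitation (NASA POWER)

record = EcoMapperRecord(48.14, 11.58, 2021, 7, "croplands", 0.05, 19.3, 5.8, 96.0)
```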
Model architecture: Text-to-image generative model and multi-conditional generative model
The goal of this research is to synthesize satellite images conditioned on geographic and climate metadata to achieve realistic predictions of environmental conditions. To this end, the researchers addressed two key tasks: text-to-image generation and multi-conditional image generation.
The researchers evaluated the ability of two generative models to integrate climate metadata into satellite image synthesis:
The first is Stable Diffusion 3, a multimodal latent diffusion model that integrates CLIP and T5 text encoders, enabling flexible prompt conditioning. The researchers fine-tuned Stable Diffusion 3 on the collected dataset so that it can generate realistic satellite images from geographic, climate, and temporal metadata (an inference sketch follows below).
The second is DiffusionSat, a foundation model designed specifically for satellite imagery. It extends Stable Diffusion 2 with a dedicated metadata embedding layer for numerical conditioning. Unlike general-purpose diffusion models, it is built for remote sensing tasks, encodes key spatial and temporal attributes, and supports super-resolution, image inpainting, and temporal prediction.
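As a hedged illustration, such a fine-tuned SD3 checkpoint could be run for inference with Hugging Face diffusers as below; the checkpoint path is hypothetical, and the paper does not specify its tooling:

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Hypothetical path to an SD3 checkpoint fine-tuned on EcoMapper-style data.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "path/to/sd3-ft-ecomapper", torch_dtype=torch.float16
).to("cuda")

# Example climate-aware prompt in the spirit of the paper's conditioning.
prompt = ("Satellite image of croplands at latitude 48.14, longitude 11.58, "
          "July 2021, 5% cloud cover, mean temperature 19.3 C, "
          "total precipitation 96 mm.")
image = pipe(prompt=prompt, height=512, width=512,
             num_inference_steps=28, guidance_scale=7.0).images[0]
image.save("synthetic_tile.png")
```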
For the text-to-image generation task, the researchers compared Stable Diffusion 3 and DiffusionSat under various configurations, including fine-tuned and non-fine-tuned models and different resolutions:
* Baseline models: evaluate both models without fine-tuning at a resolution of 512 x 512.
* Fine-tuned models (-FT): evaluate both models after fine-tuning with climate metadata at a resolution of 512 x 512.
* High-resolution SD3 model: fine-tune and test SD3 with climate metadata at a resolution of 1024 x 1024, labeled SD3-FT-HR.
For the multi-conditional image generation task, the researchers used the fine-tuned Stable Diffusion 3 model, enhanced with LoRA (Low-Rank Adaptation) and trained at a resolution of 512 x 512, as the basis for generating high-quality, context-relevant images. ControlNet is then used to build a dual-conditioning mechanism (a code sketch follows the figure below):
* ControlNet enhances the diffusion model by integrating explicit spatial control into the generation process. Its design ensures that the control block's initial influence on the main block is minimal, functioning much like a skip connection.
* Satellite images as control signals: satellite images from the preceding months serve as control signals that preserve the spatial structure of the generated images, keeping landforms, urban layouts, and other geographic features unchanged. This lets the model incorporate changes over time that reflect real-world environmental change.
* Climate prompts: the text-conditioning mechanism specifies the climate and atmospheric conditions under which satellite images are generated.
By combining these two conditioning signals, the model can generate realistic satellite images that incorporate climate change while maintaining spatial consistency. The method also supports time-series generation and can simulate landscape evolution under changing climate conditions, as shown in the figure below:
The framework integrating Stable Diffusion 3 and ControlNet enables multi - conditional satellite image generation
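A minimal sketch of this dual conditioning using the SD3 ControlNet pipeline from Hugging Face diffusers; all checkpoint paths and file names are placeholders, and the authors' ControlNet is trained on prior-month satellite frames, not any public checkpoint assumed here:

```python
import torch
from diffusers import SD3ControlNetModel, StableDiffusion3ControlNetPipeline
from diffusers.utils import load_image

# Placeholder checkpoints: the paper's ControlNet conditions on
# previous-month satellite imagery; no public weights are assumed.
controlnet = SD3ControlNetModel.from_pretrained(
    "path/to/sd3-controlnet-satellite", torch_dtype=torch.float16
)
pipe = StableDiffusion3ControlNetPipeline.from_pretrained(
    "path/to/sd3-ft-ecomapper", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Previous month's observation anchors the spatial structure;
# the text prompt injects the target climate conditions.
control_image = load_image("tile_2021_06.png")
prompt = ("Satellite image of croplands, July 2021, mean temperature 19.3 C, "
          "total precipitation 96 mm, high solar radiation.")
image = pipe(prompt=prompt, control_image=control_image,
             controlnet_conditioning_scale=0.8,
             num_inference_steps=28).images[0]
image.save("tile_2021_07_synthetic.png")
```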
For the prompt structure, the researchers designed two types of prompts to guide satellite image generation: a Spatial Prompt and a Climate Prompt. The former encodes basic metadata, including land cover type, location, date, and cloud cover, ensuring that generated images are consistent with their geographic and temporal context. The latter adds monthly climate variables (temperature, precipitation, and solar radiation) on top of the spatial prompt, providing more complete environmental information for image generation. Both prompt types use Stable Diffusion 3's text encoders, with spatial information processed by CLIP and climate data processed by the T5 encoder.
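A sketch of how the two prompt types might be assembled; the exact wording, field order, and units are our assumptions rather than the paper's templates:

```python
def spatial_prompt(land_cover: str, lat: float, lon: float,
                   year: int, month: str, cloud_cover: float) -> str:
    # Encodes the basic metadata: land cover, location, date, cloud cover.
    return (f"Satellite image of {land_cover} at latitude {lat:.2f}, "
            f"longitude {lon:.2f}, {month} {year}, "
            f"cloud cover {cloud_cover:.0%}.")

def climate_prompt(base: str, temp_c: float, precip_mm: float,
                   radiation: float) -> str:
    # Extends the spatial prompt with monthly climate variables.
    return (f"{base} Monthly mean temperature {temp_c:.1f} C, "
            f"total precipitation {precip_mm:.0f} mm, "
            f"solar radiation {radiation:.1f} kWh/m^2/day.")

base = spatial_prompt("croplands", 48.14, 11.58, 2021, "July", 0.05)
print(climate_prompt(base, 19.3, 96.0, 5.8))
```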
Experimental results: Generative performance surpasses the baseline models, with room for improvement
The researchers designed a multi-dimensional experimental protocol and verified the model's performance in generating climate-aware satellite images through multiple horizontal and vertical comparisons.
First, the researchers adopted five established metrics: FID (Fréchet Inception Distance), LPIPS (Learned Perceptual Image Patch Similarity), SSIM (Structural Similarity Index), PSNR (Peak Signal-to-Noise Ratio), and CLIP Score. FID and LPIPS evaluate distributional similarity and perceptual differences, SSIM and PSNR measure structural consistency and reconstruction quality, and CLIP Score evaluates text-image alignment.
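All five metrics are available off the shelf; a hedged sketch with torchmetrics, where random tensors stand in for real and generated tiles (the paper does not state which implementation it used):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
from torchmetrics.image import StructuralSimilarityIndexMeasure, PeakSignalNoiseRatio
from torchmetrics.multimodal.clip_score import CLIPScore

# Stand-ins for batches of real and generated 512 x 512 RGB tiles.
real = torch.randint(0, 255, (8, 3, 512, 512), dtype=torch.uint8)
fake = torch.randint(0, 255, (8, 3, 512, 512), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)   # distributional similarity
fid.update(real, real=True)
fid.update(fake, real=False)

lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg")  # perceptual distance
ssim = StructuralSimilarityIndexMeasure()       # structural consistency
psnr = PeakSignalNoiseRatio()                   # reconstruction quality
clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")  # text-image alignment

real_f = real.float() / 255.0
fake_f = fake.float() / 255.0
print("FID:", fid.compute().item())
print("LPIPS:", lpips(fake_f * 2 - 1, real_f * 2 - 1).item())  # LPIPS expects [-1, 1]
print("SSIM:", ssim(fake_f, real_f).item())
print("PSNR:", psnr(fake_f, real_f).item())
print("CLIPScore:", clip(fake, ["satellite image of croplands in summer"] * 8).item())
```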
For text-to-image generation, the researchers verified the model's effectiveness by comparing Stable Diffusion 3, DiffusionSat, their fine-tuned versions (SD3-FT and DiffusionSat-FT), and SD3-FT-HR across 5,500 geographic locations.
As shown in the figure below, the baseline versions of SD3 and DiffusionSat receive the lowest scores, though the latter performs significantly better than the former, indicating the benefit of remote sensing pre-training. All fine-tuned models improve markedly on every metric: SD3-FT performs better on CLIP Score, SSIM, and PSNR, while DiffusionSat-FT stands out on FID and LPIPS. SD3-FT-HR achieves the lowest FID at 49.48 (lower FID indicates higher realism), showing that its generated images have the finest detail.
Quantitative comparison of text-to-image generative models
Qualitative analysis shows that the model captures the regular textures of farmland and grassland as well as the characteristics of mountainous terrain. SD3-FT-HR in particular performs better on vegetation density variation and high-resolution detail.
In the climate sensitivity analysis, shown in the figures below, the vegetation density generated by the model correlates significantly with climate conditions. The researchers ran a quantitative stress test of SD3-FT on samples with extreme weather. Under high-temperature, high-radiation conditions, the generated images achieve lower FID (e.g., 107.34 under high radiation) and show more pronounced vegetation; under low-temperature, low-radiation conditions the opposite holds, and simulation quality is slightly worse.
Satellite images generated by SD3-FT for different regions under extreme climate conditions
Performance of SD3-FT under extreme weather conditions
In the multi-conditional image generation task, multi-conditional generation with ControlNet outperforms the text-to-image model on all metrics; for example, SD3 ControlNet reaches an FID of 48.20. Moreover, the generated images show strong spatial alignment with real-world imagery, preserving key geographic features while incorporating the specified climate changes. As shown in the figures below:
Metrics of the SD3 ControlNet model
Comparison of real-world, generated, and conditioning images in multi-conditional image generation across seasonal changes
In the robustness tests, land cover type strongly affects generation stability. Common types such as grasslands and savannas yield high stability and low FID, whereas complex or rare types such as wetlands and urban areas yield higher FID; urban areas reach an FID of 284.65, attributable to insufficient training data. The model also performs stably across the 2017-2024 test set, with no degradation on 2023-2024 data, demonstrating strong adaptability to unseen spatio-temporal scenarios.
In summary, EcoMapper introduces a generative framework for simulating satellite imagery from climate variables, aiming to model how environmental landscapes respond to weather and climate conditions.