Foundation models & multimodal remote sensing

SMARTIES learns unified representations of multi-sensor remote sensing images by leveraging spectrum-aware projections, enabling scalable training across diverse sensors and zero-shot generalization to unseen ones. Figure from Sumbul et al., 2025.
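To make the idea of a spectrum-aware projection concrete, here is a minimal, illustrative sketch (not the SMARTIES implementation; the module name, wavelength encoding, and dimensions are assumptions): each spectral band is projected with a shared layer conditioned on its central wavelength, so any sensor whose band wavelengths are known can be mapped into the same embedding space, including sensors never seen during training.

```python
# Illustrative sketch only: a hypothetical spectrum-aware projection, not the SMARTIES code.
import torch
import torch.nn as nn

class SpectrumAwareProjection(nn.Module):
    """Map image patches from any sensor into a shared embedding space.

    Each band is projected by a shared linear layer and shifted by an encoding of
    its central wavelength (in nm), so an unseen sensor can be handled as long as
    its band wavelengths are known. Hypothetical design for illustration.
    """
    def __init__(self, patch_pixels: int, embed_dim: int = 256):
        super().__init__()
        self.band_proj = nn.Linear(patch_pixels, embed_dim)   # shared per-band projection
        self.wavelength_enc = nn.Sequential(                  # conditions tokens on wavelength
            nn.Linear(1, 64), nn.GELU(), nn.Linear(64, embed_dim)
        )

    def forward(self, patches: torch.Tensor, wavelengths: torch.Tensor) -> torch.Tensor:
        # patches: (batch, bands, patch_pixels); wavelengths: (bands,) in nanometres
        band_tokens = self.band_proj(patches)                                  # (batch, bands, embed_dim)
        band_tokens = band_tokens + self.wavelength_enc(wavelengths[:, None] / 1000.0)
        return band_tokens.mean(dim=1)                                         # one token per patch

# A Sentinel-2-like patch (10 bands) and an RGB patch land in the same output space
# despite their different band counts.
proj = SpectrumAwareProjection(patch_pixels=64)
s2 = proj(torch.randn(4, 10, 64),
          torch.tensor([443., 490., 560., 665., 705., 740., 783., 842., 1610., 2190.]))
rgb = proj(torch.randn(4, 3, 64), torch.tensor([665., 560., 490.]))
print(s2.shape, rgb.shape)  # both (4, 256)
```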

Remote sensing data come in a variety of formats, acquired by sensors operating at different spatial scales and based on distinct physical principles. This diversity enables us to observe the Earth from multiple perspectives, from color and spectral composition to 3D geometry. Moreover, observations matched by geographical coordinates provide different viewpoints on the same objects: the same building can be observed by a satellite in orbit (showing its roof) or by a ground-based sensor mounted on a car (showing its facade). When co-registered in geographic space, these heterogeneous observations offer complementary views of the same environment.

Leveraging these multiple data sources together allows for a far richer understanding of environmental processes than any single modality could provide. By developing methodologies to align sensor data acquired through different signals and from different perspectives, we can build shared semantic spaces that enable advanced search and analysis across large spatial datasets, as sketched below.
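A common way to build such a shared space (shown here only as a generic sketch, not a specific method from the papers below) is to pull together embeddings of co-registered views of the same location, for example an aerial image and a ground-level photograph, with a symmetric contrastive loss; cross-modal search then reduces to nearest-neighbour lookup by cosine similarity.

```python
# Generic sketch of cross-sensor alignment and retrieval; function names are illustrative.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(z_aerial: torch.Tensor, z_ground: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    # z_aerial, z_ground: (batch, dim) embeddings of the same locations, row-aligned
    z_a = F.normalize(z_aerial, dim=-1)
    z_g = F.normalize(z_ground, dim=-1)
    logits = z_a @ z_g.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0))           # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def retrieve(query: torch.Tensor, database: torch.Tensor, k: int = 5) -> torch.Tensor:
    # Once aligned, cross-modal search is cosine-similarity nearest-neighbour lookup.
    sims = F.normalize(query, dim=-1) @ F.normalize(database, dim=-1).t()
    return sims.topk(k, dim=-1).indices           # indices of the k best matches

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
top5 = retrieve(torch.randn(2, 256), torch.randn(100, 256))
print(loss.item(), top5.shape)  # scalar loss, (2, 5)
```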

Today, foundation models are transforming multimodal remote sensing by providing a common representational backbone that bridges sensor types and data modalities. Pretrained on massive and diverse geospatial data, these models can generalize to new sensors and regions with minimal supervision. In our lab, we explore how these large-scale multimodal models can enhance transferability between heterogeneous data sources and open new pathways for scalable, knowledge-guided Earth observation.
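"Minimal supervision" typically means keeping the pretrained backbone frozen and fitting only a small task-specific head on a handful of labelled samples from the new sensor or region. The sketch below assumes a generic feature-extracting backbone and a toy stand-in model; it is not tied to any particular foundation model.

```python
# Illustrative linear-probing sketch under the assumption of a generic pretrained backbone.
import torch
import torch.nn as nn

def linear_probe(backbone: nn.Module, features_dim: int, num_classes: int,
                 x: torch.Tensor, y: torch.Tensor, epochs: int = 50) -> nn.Linear:
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)                    # keep the pretrained weights frozen
    with torch.no_grad():
        feats = backbone(x)                        # (num_samples, features_dim)
    head = nn.Linear(features_dim, num_classes)    # only this small head is trained
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(head(feats), y)
        loss.backward()
        opt.step()
    return head

# Toy example with a stand-in backbone; in practice the backbone would be a pretrained
# geospatial foundation model and x a few labelled patches from the target sensor.
toy_backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
head = linear_probe(toy_backbone, 128, num_classes=4,
                    x=torch.randn(32, 3, 32, 32), y=torch.randint(0, 4, (32,)))
```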

Papers

  • G. Sumbul, C. Xu, E. Dalsasso, and D. Tuia. SMARTIES: Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images. International Conference on Computer Vision 2025 (project).
  • L. Mi, M. Bechaz, Z. Chen, A. Bosselut, and D. Tuia. GeoExplorer: Active Geo-localization with Curiosity-Driven Exploration. International Conference on Computer Vision 2025 (project, dataset).
  • L. Mi, S. Montariol, J.C. Navarro, X. Dai, A. Bosselut, and D. Tuia. ConVQG: Contrastive visual question generation with multimodal guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 5, pp. 4207–4215, 2024 (paper, project).
  • L. Hughes, D. Marcos, S. Lobry, D. Tuia, and M. Schmitt. A deep learning framework for sparse matching of SAR and optical imagery. ISPRS J. Int. Soc. Photo. Remote Sens., 169:166–179, 2020 (paper on infoscience).
  • Z. Zhang, G. Vosselman, M. Gerke, C. Persello, D. Tuia, and M. Yang. Detecting and delineating building changes between airborne laser scanning and photogrammetric data. Remote Sens., 11(20):2417, 2019 (paper).
  • S. Srivastava, J. E. Vargas, and D. Tuia. Understanding urban landuse from the above and ground perspectives: a deep learning, multimodal solution. Remote Sens. Environ., 228:129–143, 2019 (paper preprint on arxiv).