Causal-based Multimodal Diffusion Models for Medical Image Analysis

Background and Problem Statement

The Challenge of Dynamic Causality in AI for Science

In the field of AI for Science, and in medical scenarios in particular, data often exhibits inherent causal relationships. Traditional Graph Neural Networks (GNNs) and graph-learning approaches rely on graph structures that are fixed during training, whereas medical and scientific entities interact through dynamic, evolving relationships; current methods often struggle to adapt to such changes without retraining. There is a need for a framework that supports unconditional diffusion training while enabling test-time optimization (TTO), using mechanisms such as factor graphs to construct and adjust graph structures automatically during inference.

The Data Scarcity Bottleneck

A critical bottleneck in medical image analysis is the scarcity of complete, paired multimodal data. For instance, obtaining perfectly paired CT, MRI, and Ultrasound scans for the same patient at the same time point is clinically difficult and expensive. Traditional multimodal learning often requires these complete pairs. When data is missing or unpaired (e.g., having only CT-Mask pairs and MRI-Mask pairs, but no CT-MRI pairs), standard supervised approaches fail to leverage the full dataset.

Proposed Methodology

We propose a Causal-based Multimodal Diffusion Framework that decouples training from specific modality pairings and recouples them via causal structures during inference.

Unpaired Training via Modular Diffusion

Building on the concept of DiffAtlas [1] and recent advancements in exploiting distributional correlations, we propose to train independent diffusion models on available partial pairs. Instead of requiring a complete tuple (X_CT, X_MRI, Y_mask), we train separate generative models:

  • Model A: Learns the distribution P(X_CT | Y_mask) or P(X_CT, Y_mask).
  • Model B: Learns the distribution P(X_MRI | Y_mask) or P(X_MRI, Y_mask).

This approach allows us to exploit large amounts of partially labeled or unpaired data for pretraining, significantly scaling up the effective dataset size; a minimal training sketch follows.
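To make the modular training concrete, below is a minimal sketch assuming a generic DDPM-style epsilon-prediction objective. The `CondDenoiser` architecture, hyperparameters, and data shapes are illustrative placeholders, not the project's actual implementation:

    import torch
    import torch.nn as nn

    # Illustrative mask-conditioned denoiser: the noisy image, the mask,
    # and a broadcast timestep are concatenated along the channel axis.
    # The same architecture serves Model A (CT | mask) and Model B
    # (MRI | mask), each trained on its own partial pairs.
    class CondDenoiser(nn.Module):
        def __init__(self, img_ch=1, mask_ch=1, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(img_ch + mask_ch + 1, hidden, 3, padding=1), nn.SiLU(),
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
                nn.Conv2d(hidden, img_ch, 3, padding=1),
            )

        def forward(self, x_t, mask, t):
            # Broadcast the normalized timestep as an extra channel.
            t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *x_t.shape[2:])
            return self.net(torch.cat([x_t, mask, t_map], dim=1))

    def ddpm_loss(model, x0, mask, alphas_bar):
        # Standard epsilon-prediction DDPM objective for one partial pair.
        T = alphas_bar.shape[0]
        t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
        a = alphas_bar[t].view(-1, 1, 1, 1)
        eps = torch.randn_like(x0)
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
        return ((model(x_t, mask, t.float() / T) - eps) ** 2).mean()

    # Two independent models; neither ever sees a CT-MRI pair.
    T = 1000
    betas = torch.linspace(1e-4, 2e-2, T)
    alphas_bar = torch.cumprod(1 - betas, dim=0)
    model_ct, model_mri = CondDenoiser(), CondDenoiser()

    # One training step for Model A on a dummy (CT, mask) batch:
    x_ct, y_mask = torch.randn(4, 1, 64, 64), torch.rand(4, 1, 64, 64)
    ddpm_loss(model_ct, x_ct, y_mask, alphas_bar).backward()

Because each model depends only on its own partial dataset, any mixture of CT-mask and MRI-mask cohorts can contribute to pretraining, with no requirement that the cohorts overlap.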

Test-Time Optimization with Factor Graphs

At inference time, we introduce a Test-Time Optimization (TTO) strategy using factor graphs. By treating the anatomical geometry (the mask/shape Y) as a causal anchor (a “bridge” variable), we can dynamically construct a graph that links disparate modalities, as sketched after the list below.

  • Zero-Shot Modality Bridging: Given a shape Y, the system can simultaneously generate consistent X_CT and X_MRI without ever seeing them paired during training.
  • Flexible Graph Construction: This method avoids fixed graph dependencies, allowing the model to adapt to available nodes (modalities) at runtime.
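As a hedged illustration of the bridging step, the sketch below runs the two reverse chains from the previous sketch in lockstep, conditioned on the same mask. The factor-graph machinery is abstracted to this shared conditioning variable, and the sampler is a standard DDPM ancestral sampler, not the project's actual TTO procedure:

    import torch

    @torch.no_grad()
    def bridge_sample(model_ct, model_mri, y_mask, alphas_bar):
        # Zero-shot modality bridging: both reverse chains share the same
        # anatomical anchor y_mask, so the generated CT and MRI images
        # stay geometrically consistent despite unpaired training.
        T = alphas_bar.shape[0]
        # Recover per-step betas from the cumulative products.
        betas = 1 - alphas_bar / torch.cat([torch.ones(1), alphas_bar[:-1]])
        x_ct = torch.randn(y_mask.shape[0], 1, *y_mask.shape[2:])
        x_mri = torch.randn_like(x_ct)
        for t in reversed(range(T)):
            a_bar, beta = alphas_bar[t], betas[t]
            for x, model in ((x_ct, model_ct), (x_mri, model_mri)):
                t_in = torch.full((x.shape[0],), t / T)
                eps = model(x, y_mask, t_in)
                mean = (x - beta / (1 - a_bar).sqrt() * eps) / (1 - beta).sqrt()
                x.copy_(mean + beta.sqrt() * torch.randn_like(x) * (t > 0))
        return x_ct, x_mri

    # Given only a shape Y, produce a consistent (CT, MRI) pair:
    # x_ct, x_mri = bridge_sample(model_ct, model_mri, y_mask, alphas_bar)

Under this view, adding another modality such as Ultrasound would only require a third mask-conditioned model in the loop, which is what flexible graph construction provides at runtime.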

Extension to Discrete Diffusion & Multiple Modalities

While the initial framework focuses on continuous diffusion, we aim to investigate Discrete Diffusion mechanisms to better handle categorical data (segmentation masks) and improve convergence. The framework is designed to be extensible to other modalities, such as Ultrasound, creating a unified generative environment for cardiac or abdominal imaging.
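As a rough sketch of what a discrete formulation could look like, the snippet below implements an absorbing-state forward process for categorical masks; the linear corruption schedule and the extra mask token are illustrative assumptions, not a committed design:

    import torch

    def absorbing_forward(y0, t, T, mask_token):
        # Each label token is independently replaced by the absorbing
        # mask_token with probability t / T (a linear schedule).
        corrupt = torch.rand(y0.shape) < (t / T)
        return torch.where(corrupt, torch.full_like(y0, mask_token), y0)

    # A 4-class segmentation map, roughly half-corrupted at t = T / 2:
    y0 = torch.randint(0, 4, (1, 64, 64))
    y_t = absorbing_forward(y0, t=500, T=1000, mask_token=4)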

Significance and Expected Outcomes

  1. Synthetic Data Generation & Augmentation: The model can generate high-fidelity, paired synthetic data (e.g., CT and MRI) from simple semantic masks, addressing the data scarcity issue in downstream tasks.
  2. Clinical World Models & Digital Twins: By learning the generative causal relationships between anatomy and various imaging modalities, this project contributes to building “Clinical World Models.” This supports simulation, data augmentation, and the creation of patient-specific digital twins.
  3. Cross-Modal Domain Adaptation: The framework naturally bridges the domain gap between different imaging modalities, enabling effective transfer learning even in the absence of paired ground truth.

Contact

Interested students are invited to email:
Hantao Zhang ([email protected])

References

[1] DiffAtlas: GenAI-fying Atlas Segmentation via Image-Mask Diffusion
Hantao Zhang*, Yuhe Liu*, Jiancheng Yang, Weidong Guo, Xinyuan Wang, Pascal Fua
International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2025 (Spotlight)
[2] LeFusion: Controllable Pathology Synthesis via Lesion-Focused Diffusion Models
Hantao Zhang, Yuhe Liu, Jiancheng Yang, Shouhong Wan, Xinyuan Wang, Wei Peng, Pascal Fua
International Conference on Learning Representations (ICLR), 2025 (Spotlight)
[3] Tuning vision foundation models for rectal cancer segmentation from CT scans
Hantao Zhang, Weidong Guo, Shouhong Wan, Bingbing Zou, Wanqin Wang, Chenyang Qiu, Kaige Liu, Peiquan Jin, Jiancheng Yang
Communications Medicine, 2025 (Nature Portfolio)