We develop models and systems designed to interpret complex, real-world data. Our goal is to engineer intelligent systems that accurately perceive multi-modal documents and environments while remaining transparent about how they reach their conclusions.
We advance three STI research clusters, namely AI, Imaging Science, and Robotics, through four interconnected research areas:
- Privacy-preserving machine learning
- Safety and alignment
- Robustness and interpretability
- Multi-modal perception
Privacy-preserving machine learning
We develop privacy-enhancing technologies that protect sensitive information during processing and detect violations in visual content. These systems empower individuals, developers, and regulators to uphold privacy rights and maintain legal compliance without requiring manual oversight.
Recent work explores
- automating the recognition of legal concepts in visual data, such as personal data as defined by the GDPR and other international privacy frameworks, and
- evaluating how proficiently Vision-Language Models (VLMs) detect private attributes relative to human observers (a minimal evaluation sketch follows this list).
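To make the second line of work concrete, here is a minimal sketch of how such a model-versus-human evaluation could be scored, assuming binary per-attribute labels; the label arrays and the attribute itself are hypothetical placeholders, not data from our studies.

```python
# Hypothetical sketch: scoring a VLM's private-attribute predictions against
# human annotations with precision, recall, and chance-corrected agreement.
from sklearn.metrics import precision_score, recall_score, cohen_kappa_score

# 1 = "image reveals the attribute" (e.g., a visible street address), 0 = it does not.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # ground truth from human observers
vlm_labels   = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]  # the model's binary predictions

precision = precision_score(human_labels, vlm_labels)
recall = recall_score(human_labels, vlm_labels)
kappa = cohen_kappa_score(human_labels, vlm_labels)  # agreement beyond chance

print(f"precision={precision:.2f}  recall={recall:.2f}  kappa={kappa:.2f}")
```

Cohen's kappa is useful here because it discounts the agreement two annotators would reach by chance alone.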
Safety and alignment
We advance AI safety by identifying harmful content and engineering technical guardrails that keep model outputs aligned with human intent and ethical standards.
Recent work centers on
- optimizing embedding models to improve the detection of implicit hate speech within heterogeneous text datasets, and
- developing a multi-modal model for video-based hate speech detection that fuses visual frames, audio, and text (transcripts and overlays) via cross-modal attention (a fusion sketch follows this list).
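As a rough illustration of the fusion step, the sketch below lets text tokens attend over visual and audio tokens before classification; the dimensions, module names, and two-class head are illustrative assumptions rather than our actual architecture.

```python
# Minimal PyTorch sketch of cross-modal attention fusion. Modality encoders
# are stubbed out as pre-computed embeddings of shape (batch, tokens, dim).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Text queries attend over visual and audio keys/values.
        self.text_to_vision = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(3 * dim, 2)  # hateful vs. not hateful

    def forward(self, text, vision, audio):
        tv, _ = self.text_to_vision(text, vision, vision)
        ta, _ = self.text_to_audio(text, audio, audio)
        # Pool over tokens and concatenate the text, text-vision, and text-audio views.
        fused = torch.cat([text.mean(1), tv.mean(1), ta.mean(1)], dim=-1)
        return self.classifier(fused)

model = CrossModalFusion()
logits = model(torch.randn(2, 16, 256),  # transcript/overlay text tokens
               torch.randn(2, 32, 256),  # sampled video-frame features
               torch.randn(2, 20, 256))  # audio features
print(logits.shape)  # torch.Size([2, 2])
```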
Robustness and interpretability
We investigate the inherent vulnerabilities of learning systems and their susceptibility to malicious attacks. We also design explainable-by-design pipelines that translate individual model decisions into human-understandable insights.
Recent work focuses on
- identifying critical neurons whose ablation triggers catastrophic collapse in Large Vision-Language Models (a minimal ablation sketch follows this list), and
- developing a concept-driven counterfactual framework that generates sparse, image-grounded explanations to isolate decision factors.
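The first line of work can be illustrated with a toy ablation experiment: zero out selected hidden units via a forward hook and measure how far the output distribution shifts. The tiny MLP and the ablated indices below are hypothetical stand-ins; the cited work probes units inside large vision-language models.

```python
# Minimal PyTorch sketch of neuron ablation via a forward hook.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(8, 16)

def ablate(indices):
    def hook(module, inputs, output):
        out = output.clone()
        out[:, indices] = 0.0  # silence the chosen hidden units
        return out
    return hook

with torch.no_grad():
    baseline = model(x).softmax(-1)
    handle = model[1].register_forward_hook(ablate([0, 3, 7]))  # ablate 3 ReLU units
    ablated = model(x).softmax(-1)
    handle.remove()

# A large shift from a small ablation flags those units as critical.
print("mean |Δp| across classes:", (baseline - ablated).abs().mean().item())
```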
Multi-modal perception
We enhance the reliability of autonomous perception across multiple data sources, including vision, audio, depth, and haptics. We aim to improve how autonomous agents interpret concepts and intentions within complex, dynamic environments.
Recent work includes developing
- an open-vocabulary VLM-based architecture for relative pose estimation of previously unseen objects (a minimal open-vocabulary matching sketch follows this list), and
- a framework for language-conditioned robot manipulation designed for sample-efficient learning of object-arrangement tasks from a few demonstrations.
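The open-vocabulary step of the first item can be sketched as CLIP-style matching: score an object crop against free-form text labels by cosine similarity of joint embeddings. The random vectors below stand in for a real VLM's image and text encoders, and the label set is a hypothetical example.

```python
# Minimal sketch of open-vocabulary object matching by cosine similarity.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
labels = ["red mug", "power drill", "stapler"]  # free-form, open vocabulary
text_emb = F.normalize(torch.randn(len(labels), 512), dim=-1)  # stand-in text encoder
crop_emb = F.normalize(torch.randn(1, 512), dim=-1)            # stand-in image encoder

scores = (crop_emb @ text_emb.T).squeeze(0)     # cosine similarities
best = scores.argmax().item()
print(f"best match: {labels[best]} (score={scores[best]:.3f})")
# A downstream pose head would then estimate the matched object's relative pose.
```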
We translate our research into high-stakes applications, prioritizing ethical and human-centric deployment. We facilitate synergy between humans and machines across digital platforms and shared physical environments, from smart factories to assistive living spaces.
Explore our full body of work in our scientific publications.