We develop models and systems designed to interpret complex, real-world data. Our goal is to engineer intelligent systems that accurately perceive multi-modal documents and environments while remaining transparent about how they reach their conclusions.
We advance three STI research clusters, namely AI, Imaging Science, and Robotics, through four interconnected research areas:
- Privacy-preserving machine learning
- Safety and alignment
- Robustness and interpretability
- Multi-modal perception
Privacy-preserving machine learning
We develop privacy-enhancing technologies that protect sensitive information during processing and detect violations in visual content. These systems empower individuals, developers, and regulators to uphold privacy rights and maintain legal compliance without requiring manual oversight.
Recent work explores
- automating the recognition of legal concepts in visual data, such as personal data as defined by the GDPR and other international privacy frameworks, and
- evaluating how proficiently Vision-Language Models (VLMs) detect private attributes relative to human observers (a minimal evaluation sketch follows this list).
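To make the second line of work concrete, here is a minimal sketch of how such a model-versus-human evaluation could be scored, assuming binary per-attribute labels; the label arrays and the attribute itself are hypothetical placeholders, not data from our studies.

```python
# Hypothetical sketch: scoring a VLM's private-attribute predictions against
# human annotations with precision, recall, and chance-corrected agreement.
from sklearn.metrics import precision_score, recall_score, cohen_kappa_score

# 1 = "image reveals the attribute" (e.g., a visible street address), 0 = it does not.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # ground truth from human observers
vlm_labels   = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]  # the model's binary predictions

precision = precision_score(human_labels, vlm_labels)
recall = recall_score(human_labels, vlm_labels)
kappa = cohen_kappa_score(human_labels, vlm_labels)  # agreement beyond chance

print(f"precision={precision:.2f}  recall={recall:.2f}  kappa={kappa:.2f}")
```

Cohen's kappa is useful here because it discounts the agreement two annotators would reach by chance alone.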
Safety and alignment
We advance AI safety by identifying harmful content and engineering technical guardrails that keep model outputs aligned with human intent and ethical standards.
Recent work centers on
- optimizing embedding models to improve the detection of implicit hate speech within heterogeneous text datasets, and
- developing a multi-modal model for video-based hate speech detection that fuses visual frames, audio, and text (transcripts and overlays) via cross-modal attention (a fusion sketch follows this list).
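As a rough illustration of the fusion step, the sketch below lets text tokens attend over visual and audio tokens before classification; the dimensions, module names, and two-class head are illustrative assumptions rather than our actual architecture.

```python
# Minimal PyTorch sketch of cross-modal attention fusion. Modality encoders
# are stubbed out as pre-computed embeddings of shape (batch, tokens, dim).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Text queries attend over visual and audio keys/values.
        self.text_to_vision = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(3 * dim, 2)  # hateful vs. not hateful

    def forward(self, text, vision, audio):
        tv, _ = self.text_to_vision(text, vision, vision)
        ta, _ = self.text_to_audio(text, audio, audio)
        # Pool over tokens and concatenate the text, text-vision, and text-audio views.
        fused = torch.cat([text.mean(1), tv.mean(1), ta.mean(1)], dim=-1)
        return self.classifier(fused)

model = CrossModalFusion()
logits = model(torch.randn(2, 16, 256),  # transcript/overlay text tokens
               torch.randn(2, 32, 256),  # sampled video-frame features
               torch.randn(2, 20, 256))  # audio features
print(logits.shape)  # torch.Size([2, 2])
```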
Robustness and interpretability
We investigate the inherent vulnerabilities of learning systems and their susceptibility to malicious attacks. We also design explainable-by-design pipelines that translate individual model decisions into human-understandable insights.
Recent work focuses on
- identifying critical neurons whose ablation triggers catastrophic collapse in Large Vision-Language Models (a minimal ablation sketch follows this list), and
- developing a concept-driven counterfactual framework that generates sparse, image-grounded explanations to isolate decision factors.
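The first line of work can be illustrated with a toy ablation experiment: zero out selected hidden units via a forward hook and measure how far the output distribution shifts. The tiny MLP and the ablated indices below are hypothetical stand-ins; the cited work probes units inside large vision-language models.

```python
# Minimal PyTorch sketch of neuron ablation via a forward hook.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(8, 16)

def ablate(indices):
    def hook(module, inputs, output):
        out = output.clone()
        out[:, indices] = 0.0  # silence the chosen hidden units
        return out
    return hook

with torch.no_grad():
    baseline = model(x).softmax(-1)
    handle = model[1].register_forward_hook(ablate([0, 3, 7]))  # ablate 3 ReLU units
    ablated = model(x).softmax(-1)
    handle.remove()

# A large shift from a small ablation flags those units as critical.
print("mean |Δp| across classes:", (baseline - ablated).abs().mean().item())
```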
Multi-modal perception
We enhance the reliability of autonomous perception across multiple data sources, including vision, audio, depth, and haptics. We aim to improve how autonomous agents interpret concepts and intentions within complex, dynamic environments.
Recent work includes developing
- an open-vocabulary VLM-based architecture for relative pose estimation of previously unseen objects (a minimal open-vocabulary matching sketch follows this list), and
- a framework for language-conditioned robot manipulation designed for sample-efficient learning of object-arrangement tasks from a few demonstrations.
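The open-vocabulary step of the first item can be sketched as CLIP-style matching: score an object crop against free-form text labels by cosine similarity of joint embeddings. The random vectors below stand in for a real VLM's image and text encoders, and the label set is a hypothetical example.

```python
# Minimal sketch of open-vocabulary object matching by cosine similarity.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
labels = ["red mug", "power drill", "stapler"]  # free-form, open vocabulary
text_emb = F.normalize(torch.randn(len(labels), 512), dim=-1)  # stand-in text encoder
crop_emb = F.normalize(torch.randn(1, 512), dim=-1)            # stand-in image encoder

scores = (crop_emb @ text_emb.T).squeeze(0)     # cosine similarities
best = scores.argmax().item()
print(f"best match: {labels[best]} (score={scores[best]:.3f})")
# A downstream pose head would then estimate the matched object's relative pose.
```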
We translate our research into high-stakes applications, prioritizing ethical and human-centric deployment. We facilitate synergy between humans and machines across digital platforms and shared physical environments, from smart factories to assistive living spaces.
Explore our full body of work in our scientific publications.