Adversarial learning corresponds to trying to build models which are robust to malicious adversaries.
Jailbreaking of LLMs
We focus on advancing the understanding of large language models’ susceptibility to jailbreak attacks, strategies that manipulate prompts to bypass safety mechanisms and elicit prohibited outputs. Our research encompasses the development of benchmarks, the analysis of adaptive attack methodologies, and the evaluation of generalization gaps in existing defense mechanisms. To facilitate consistent evaluation, we introduced JailbreakBench, an open-source benchmark comprising a curated set of adversarial prompts, a dataset of behaviors aligned with usage policies of major LLM providers, and a standardized evaluation framework. Our investigations reveal that even state-of-the-art, safety-aligned LLMs remain vulnerable to simple adaptive attacks. By employing techniques such as random search on prompt suffixes to maximize the likelihood of specific tokens (e.g., “Sure”), we achieved a 100% attack success rate across various models, including GPT-4o, Claude 3.5 Sonnet, and Llama-3-Instruct-8B. Additionally, we identified a notable generalization gap in refusal training approaches: simply rephrasing harmful requests into the past tense (e.g., “How did people make a Molotov cocktail?”) significantly increases the success rate of jailbreak attempts, indicating that current alignment techniques may not generalize effectively across different linguistic formulations. Our ongoing work focuses on similar problems applied specifically to LLM agents, such as computer use agents.
P. Chao, E. Debenedetti, A. Robey, M. Andriushchenko, F. Croce, V. Sehwag, E. Dobriban, N. Flammarion, G. J. Pappas, F. Tramer, H. Hassani, E. Wong, JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models, NeurIPS 2024 Datasets and Benchmarks Track
M. Andriushchenko, F. Croce, N. Flammarion, Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks, ICLR 2025
M. Andriushchenko, N. Flammarion, Does Refusal Training in LLMs Generalize to the Past Tense? ICLR 2025
Adversarial robustness
We are interested in developing a better understanding of robustness of machine learning models to small, worst-case changes in the inputs known as adversarial examples. For example, we have focused on studying intriguing phenomena in this area such as catastrophic and robust overfitting. Moreover, we are also interested in improving robustness evaluation standards and understanding the effect of adversarial robustness on other tasks (e.g., robustness to common image corruptions). Recently, we also got interested in understanding the role of robustness in the parameter space and its effect on generalization.
M. Andriushchenko, N. Flammarion, Understanding and Improving Fast Adversarial Training, Neurips 2020
M. Andriushchenko, F. Croce, N. Flammarion, M. Hein, Square Attack: a query-efficient black-box adversarial attack via random search, ECCV 2020
F. Croce, M. Andriushchenko, V. Sehwag, E. Debenedetti, N. Flammarion, M. Chiang, P. Mittal, M. Hein, RobustBench: a standardized adversarial robustness benchmark, Neurips Dataset and Benchmark Track 2021
F, Croce, M. Andriushchenko, N.D. Singh, N. Flammarion, M. Hein, Sparse-rs: a versatile framework for query-efficient sparse black-box adversarial attacks, AAAI 2022
K. Kireev, M. Andriushchenko, N. Flammarion, On the effectiveness of adversarial training against common corruptions, UAI 2022
K. Kireev, M. Andriushchenko, C. Troncoso, N. Flammarion, Transferable Adversarial Robustness for Categorical Data via Universal Robust Embeddings, NeurIPS 2023
Robust learning
Robust learning is a critical field that seeks to develop efficient algorithms that can recover an underlying model despite possibly malicious corruptions in the data. In recent decades, being able to deal with corrupted measurements has become of crucial importance. The applications are considerable, to name a few settings: computer vision, economics, astronomy, biology and above all, safety-critical systems.
S. Pesme, N. Flammarion, Online Robust Regression via SGD on the l1 loss, Neurips 2020
Y. Cherapanamjeri, N. Flammarion, P. L. Bartlett, Fast Mean Estimation with Sub-Gaussian Rates, COLT 2018