Theoretical foundations of LLMs

We focus on uncovering the mathematical structures that underlie the advanced learning capabilities of large language models (LLMs). Our goal is to provide theoretical guarantees for methods used by practitioners and to make LLMs more capable and effective.

Statistical analysis of LLMs

Language data often display long-range temporal dependencies that pose challenges for standard empirical risk minimization techniques. In our work, we study higher-order linear autoregressive models with extended context windows and establish the first statistical complexity bounds for this setting via novel martingale-based concentration methods. Our analysis shows that long contexts do not hinder learning efficiency; on the contrary, structural properties such as shared low-rank representations, and even misspecified context lengths, can improve sample efficiency. In the discrete setting relevant to language modeling, our estimator surpasses classical k-gram models and achieves optimal predictive performance. These findings provide foundational insight into the generalization of large language models, illustrating that temporal dependencies can be effectively leveraged with the appropriate theoretical tools.
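
As a minimal illustration of this setting, the Python sketch below fits a higher-order linear autoregressive model by ordinary least squares over stacked length-k contexts. It is a toy reconstruction, not the estimator or the bounds from the papers below, and all dimensions, dynamics, and noise levels are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy higher-order linear AR model: x_t = A_1 x_{t-1} + ... + A_k x_{t-k} + noise.
    d, k, T = 3, 5, 2000  # state dimension, context length, trajectory length (all illustrative)
    A = [rng.normal(scale=0.05, size=(d, d)) for _ in range(k)]  # small weights keep the dynamics stable

    # Simulate a single long trajectory.
    x = np.zeros((T, d))
    for t in range(k, T):
        x[t] = sum(A[i] @ x[t - 1 - i] for i in range(k)) + rng.normal(scale=0.1, size=d)

    # Least-squares estimation over stacked contexts [x_{t-1}, ..., x_{t-k}].
    contexts = np.stack([x[t - k:t][::-1].ravel() for t in range(k, T)])  # shape (T - k, k * d)
    targets = x[k:]                                                       # shape (T - k, d)
    W, *_ = np.linalg.lstsq(contexts, targets, rcond=None)                # shape (k * d, d)

    print("parameter estimation error:", np.linalg.norm(W.T - np.hstack(A)))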

O.K. Yüksel, M. Even, N. Flammarion, Long-Context Linear System Identification, ICLR 2025

O.K. Yüksel, N. Flammarion, On the Sample Complexity of Next-Token Prediction, AISTATS 2025

In-context learning

Transformers exhibit a powerful capability known as in-context learning (ICL), the ability to generalize from examples provided directly in the input, without any additional training. This allows large language models to perform tasks like translation or classification from just a few demonstrations.
Our work explores how transformers learn the algorithms that enable ICL, combining mechanistic interpretability with optimization analysis. We study attention circuits called induction heads, which copy tokens that followed earlier occurrences of the current token in a sequence. To move beyond such simple fixed patterns, we introduce a synthetic setup built on interleaved Markov chains and discover Selective Induction Heads, mechanisms that dynamically identify which causal structure is relevant in context. We also analyze how ICL emerges during training by studying the loss landscape of transformers. Our results show that structured predictors, such as k-gram models, arise naturally as stationary points, helping to explain phenomena such as stage-wise learning and emergent behavior during training.
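
To make the induction-head mechanism concrete, the sketch below implements its input-output behavior in plain Python: predict the token that followed the most recent earlier occurrence of the current token. This is a deliberate caricature of the attention circuit, not the transformer models analyzed in the papers below.

    # Schematic induction-head behavior (a caricature, not an attention implementation):
    # copy the successor of the most recent earlier occurrence of the current token.
    def induction_predict(tokens):
        preds = []
        for t, cur in enumerate(tokens):
            pred = None
            for s in range(t - 1, -1, -1):  # scan backwards for an earlier match
                if tokens[s] == cur:
                    pred = tokens[s + 1]    # tokens[s + 1] exists because s + 1 <= t
                    break
            preds.append(pred)
        return preds

    print(induction_predict(list("abcabcab")))
    # [None, None, None, 'b', 'c', 'a', 'b', 'c'] -- the repeating pattern is copied forward

On sequences generated by interleaved Markov chains, such a fixed copying rule is no longer sufficient, which is precisely what motivates the selective mechanism studied in the first paper below.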

F. D’Angelo, F. Croce, N. Flammarion, Selective Induction Heads: How Transformers Select Causal Structures in Context, ICLR 2025

G. Yüce, A. Varre, N. Flammarion, Learning In-context n-grams with Transformers: Sub-n-grams Are Near Stationary Points, ICML 2025

Generalization in fine-tuned LLMs

After large language models are pretrained, they must be fine-tuned to follow human instructions and behave as expected. Our research investigates how different types of data influence this alignment process and how to use them more effectively. We show that even small amounts of high-quality, information-rich data can dramatically improve alignment: for example, fine-tuning on the longest, most detailed responses outperforms more sophisticated data selection methods, leading to better instruction-following behavior across a range of models. We also compare supervised fine-tuning (SFT) with in-context learning, identifying when each method works best: while ICL can be effective, especially with carefully chosen examples, SFT still provides more reliable performance overall. Finally, to explain why good alignment does not always require massive datasets, we develop a new theoretical framework for preference-based fine-tuning, as in RLHF. We find that augmenting samples with pairwise comparisons, asking which of two responses is preferred, can significantly improve learning efficiency.
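
To illustrate why comparisons can carry substantial signal, the sketch below recovers a scalar parameter from pairwise preferences, under a Bradley-Terry-style model that we assume here purely for illustration; it is not the parametric families or the estimator analyzed in the paper cited below.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy preference model (an assumption for illustration): the probability that
    # response a is preferred over response b is sigmoid(theta * (s_a - s_b)),
    # where s_a and s_b are observed scores. We recover theta from comparisons.
    theta_true, n = 2.0, 500
    s_a, s_b = rng.normal(size=n), rng.normal(size=n)
    prefer_a = rng.random(n) < 1.0 / (1.0 + np.exp(-theta_true * (s_a - s_b)))

    # Maximum likelihood via gradient ascent on the concave logistic log-likelihood.
    theta = 0.0
    for _ in range(200):
        z = theta * (s_a - s_b)
        grad = np.mean((prefer_a - 1.0 / (1.0 + np.exp(-z))) * (s_a - s_b))
        theta += 1.0 * grad

    print(f"estimated theta: {theta:.2f} (true value: {theta_true})")
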
In summary, our work shows that alignment is not just about having more data, but about using the right data to train more capable models.

H. Zhao, M. Andriushchenko, F. Croce, N. Flammarion, Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning, ICML 2024

H. Zhao, M. Andriushchenko, F. Croce, N. Flammarion, Is In-Context Learning Sufficient for Instruction Following in LLMs?, ICLR 2025

M. Jourdan, G. Yüce, N. Flammarion, Learning Parametric Distributions from Samples and Preferences, ICML 2025