Understanding the performance of neural networks is one of the most exciting challenges for the machine learning community today.
Implicit regularisation in linear networks
Many questions concerning the remarkable performance of neural networks remain unanswered. One of them is why the training algorithms currently in use converge to solutions that generalise well, despite making very little use of explicit regularisation. The concept of implicit regularisation has emerged to explain this phenomenon: if over-fitting is benign, it must be because the optimisation procedure converges towards a particular global minimum that enjoys good generalisation properties. To shed light on this behaviour, we focus on linear and diagonal linear networks: simple yet sufficiently expressive models that strike a balance between tractability and complexity, making them a good intermediate step for studying the implicit regularisation of more complex architectures.
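As a toy illustration (a minimal numpy sketch of our own, not code from the papers below; the problem sizes, initialisation scale alpha and step size are illustrative assumptions), gradient descent on a diagonal linear network, i.e. the parametrisation w = u ⊙ v, started from a small initialisation tends to select a sparse interpolator of an under-determined regression problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# Under-determined sparse regression: fewer samples than features.
n, d = 20, 50
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:3] = 1.0                  # 3-sparse ground truth
y = X @ w_star

# Diagonal linear network: predictions use w = u * v (elementwise).
alpha = 1e-3                      # small initialisation scale
u = alpha * np.ones(d)
v = alpha * np.ones(d)
lr = 5e-3

for _ in range(50_000):
    g = X.T @ (X @ (u * v) - y) / n          # gradient w.r.t. w = u * v
    u, v = u - lr * g * v, v - lr * g * u    # chain rule through the product

w = u * v
print("train error:", np.linalg.norm(X @ w - y))           # ~ 0: interpolation
print("largest coordinates:", np.argsort(-np.abs(w))[:3])  # the true support
```

Although nothing in the loss penalises the ℓ1 norm of w, the small initialisation biases the trajectory toward an ℓ1-minimising-like interpolator: the coordinates outside the true support stay close to zero.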
S. Pesme, L. Pillaud-Vivien, N. Flammarion, Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity, NeurIPS 2021
L. Pillaud-Vivien, J. Reygner, N. Flammarion, Label noise (stochastic) gradient descent implicitly solves the Lasso for quadratic parametrisation, COLT 2022
M. Even, S. Pesme, S. Gunasekar, N. Flammarion, (S)GD over Diagonal Linear Networks: Implicit Regularisation, Large Stepsizes and Edge of Stability, NeurIPS 2023
S. Pesme, N. Flammarion, Saddle-to-Saddle Dynamics in Diagonal Linear Networks, NeurIPS 2023
A.V. Varre, M.L. Vladarean, L. Pillaud-Vivien, N. Flammarion, On the spectral bias of two-layer linear networks, NeurIPS 2023
S. Pesme, R.A. Dragomir, N. Flammarion, Implicit bias of mirror flow on separable data, NeurIPS 2024
H. Papazov, S. Pesme, N. Flammarion, Leveraging continuous time to understand momentum when training diagonal linear networks, AISTATS 2024
A.V. Varre, M. Sagitova, N. Flammarion, SGD vs GD: Rank Deficiency in Linear Networks, NeurIPS 2024
Implicit regularisation in ReLU networks
Characterising the implicit bias beyond linear networks requires analysing the full training trajectory, a significantly harder challenge. We tackle it in the noise-free setting, focusing on gradient flow with small initialisation, and study the early alignment phase: the initial stage of training during which neurons rapidly align along a few key directions. Even without explicit regularisation, these dynamics can yield simple solutions with strong generalisation, revealing an inherent bias toward sparse interpolators.
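The following qualitative sketch (our own numpy illustration, loosely inspired by the orthogonal-input setting of the NeurIPS 2022 paper below; the sizes, initialisation scale and learning rate are assumptions) tracks this early alignment: with a tiny initialisation, neuron directions rotate toward a few data directions while their norms are still negligible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Orthogonal inputs: take the canonical basis as the data.
d, n, m = 10, 10, 50              # input dim, samples, hidden neurons
X = np.eye(d)
y = np.sign(rng.standard_normal(n))

# Two-layer ReLU network f(x) = sum_j a_j * relu(<w_j, x>), tiny init.
alpha = 1e-4
W = alpha * rng.standard_normal((m, d))
a = alpha * rng.standard_normal(m)
lr = 0.05

for step in range(2001):
    H = np.maximum(X @ W.T, 0.0)                    # (n, m) hidden activations
    r = H @ a - y                                   # residuals of the square loss
    mask = (X @ W.T > 0).astype(float)              # ReLU derivative
    grad_a = (H.T @ r) / n
    grad_W = ((mask * np.outer(r, a)).T @ X) / n
    a -= lr * grad_a
    W -= lr * grad_W
    if step % 500 == 0:
        # Mean cosine between each neuron and its closest input direction:
        # it increases during the early phase, before the norms grow.
        cos = np.abs(W) / np.linalg.norm(W, axis=1, keepdims=True)
        print(step, "alignment:", cos.max(axis=1).mean().round(3))
```

Because the neurons collapse onto a handful of directions before fitting begins, the network found at the end of training is effectively a sparse combination of a few units.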
E. Boursier, L. Pillaud-Vivien, N. Flammarion, Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs, NeurIPS 2022
E. Boursier, N. Flammarion, Penalising the biases in norm regularisation enforces sparsity, NeurIPS 2023
E. Boursier, N. Flammarion, Early alignment in two-layer networks training is a two-edged sword, JMLR 2024
E. Boursier, N. Flammarion, Simplicity bias and optimization threshold in two-layer ReLU networks, ICML 2025
Implicit bias and sharpness of deep neural networks
To tackle the complexity of general deep neural networks, we combine theoretical analysis with large-scale controlled experiments, guided by insights from simpler models. We challenge the common belief that sharpness explains generalisation and show that it does not reliably predict performance. Instead, we uncover a rich implicit bias toward sparsity, driven by the dynamics of SGD and by learning-rate schedules. Our work shows how large step sizes, weight decay, and training schemes such as SAM promote simpler, sparser solutions. This synergy between theory and experiment points to sparsity as a key ingredient of the success of modern deep learning.
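As a concrete reference point, the SAM update (sharpness-aware minimisation, Foret et al., ICLR 2021) perturbs the weights along the normalised gradient with radius rho and then descends using the gradient evaluated at the perturbed point. Here is a minimal numpy sketch on a toy logistic regression (our own illustration; the problem and the hyperparameters lr and rho are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification with the logistic loss.
n, d = 100, 20
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))

def grad(w):
    p = 1.0 / (1.0 + np.exp(-y * (X @ w)))   # sigmoid(y_i <w, x_i>)
    return -(X.T @ (y * (1.0 - p))) / n      # gradient of the logistic loss

w = np.zeros(d)
lr, rho = 0.5, 0.05                          # rho = perturbation radius
for _ in range(500):
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent to a "sharp" neighbour
    w -= lr * grad(w + eps)                  # descend with the perturbed gradient

print("train accuracy:", np.mean(np.sign(X @ w) == y))
```

The inner ascent step makes the update sensitive to the loss landscape around w rather than at w alone; the feature-level effects of this mechanism, such as low-rank features, are studied in the papers below.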
M. Andriushchenko, N. Flammarion, Towards understanding sharpness-aware minimization, ICML 2022
M. Andriushchenko, F. Croce, M. Müller, M. Hein, N. Flammarion, A modern look at the relationship between sharpness and generalization, ICML 2023
M. Andriushchenko, D. Bahri, H. Mobahi, N. Flammarion, Sharpness-Aware Minimization Leads to Low-Rank Features, NeurIPS 2023
M. Andriushchenko, A.V. Varre, L. Pillaud-Vivien, N. Flammarion, SGD with large step sizes learns sparse features, ICML 2023
F. D’Angelo, M. Andriushchenko, A. Varre, N. Flammarion, Why do we need weight decay in modern deep learning? NeurIPS 2024