Internal Seminar: Theory of Neural Nets

What it is about

This seminar consists of internal talks about current research on the theory of neural networks. It alternates with the open Theory of Neural Nets Seminar (https://www.epfl.ch/labs/csft/theory-of-neural-nets-seminar/) and aims to promote communication between ML groups interested in Deep Learning Theory.

Every session lasts one hour and comprises a talk (30-40 minutes) followed by a discussion with questions from the audience. People from all groups can either present their own work (finished or in progress) or introduce papers they have found interesting.

Starting from May 2021, the aim is to hold two sessions per month, alternating with the open seminar, on Mondays at 16:30 CEST. Until the health situation allows in-person sessions, the seminar will be held online.

If you would like to present your work, or for any other information, please contact Loucas Pillaud-Vivien: loucas.pillaud-vivien[at]epfl.ch.

Upcoming talks

June 28th 2021 at 4.30 pm.

Zoom link: https://epfl.zoom.us/j/62400754289

Speaker: Scott Pesme

Title: Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity

Abstract: Understanding the implicit bias of training algorithms is of crucial importance in order to explain the success of overparametrised neural networks. In this paper, we study the dynamics of stochastic gradient descent over diagonal linear networks through its continuous time version, namely stochastic gradient flow. We explicitly characterise the solution chosen by the stochastic flow and prove that it always enjoys better generalisation properties than the solution chosen by gradient flow. Quite surprisingly, we show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias. To complete our analysis, we provide convergence guarantees for the dynamics. We also give experimental results which support our theoretical claims. Our findings highlight the fact that structured noise can induce better generalisation, and they help explain the better performance of stochastic gradient descent over gradient descent observed in practice.
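To make the setting concrete, here is a minimal sketch of single-sample SGD on a two-layer diagonal linear network, where the predictor is parametrised as beta = u * v. This is not the authors' code; the sparse teacher, the initialisation scale, and all hyperparameters are illustrative choices.

```python
# Minimal sketch: single-sample SGD on a diagonal linear network.
# The model is x -> <u * v, x>, i.e. the effective predictor is beta = u * v.
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 100                                # overparametrised: d > n
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:3] = 1.0                           # sparse ground-truth predictor
y = X @ beta_star

alpha = 0.1                                   # small initialisation scale
u = alpha * np.ones(d)
v = alpha * np.ones(d)
lr = 0.01

for step in range(100_000):
    i = rng.integers(n)                       # sample one data point
    r = (u * v) @ X[i] - y[i]                 # residual on that point
    gu, gv = r * v * X[i], r * u * X[i]       # gradients of the squared loss
    u -= lr * gu
    v -= lr * gv

beta = u * v
print("train MSE:", np.mean((X @ beta - y) ** 2))
print("largest |beta| coordinates:", np.argsort(-np.abs(beta))[:5])
```

With a small initialisation scale, this parametrisation is known to bias gradient methods towards sparse predictors; the talk concerns how the noise of SGD strengthens that bias compared to deterministic gradient flow.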

Slides link.

July 12th 2021 at 4.30 pm.

Zoom link: https://epfl.zoom.us/j/67799152011

Speaker: Leonardo Petrini

Title: Relative stability toward diffeomorphisms indicates performance in deep nets

Abstract: Understanding why deep nets can classify data in large dimensions remains a challenge. It has been proposed that they do so by becoming stable to diffeomorphisms, yet existing empirical measurements suggest that this is often not the case. We revisit this question by defining a maximum-entropy distribution on diffeomorphisms, which allows us to study typical diffeomorphisms of a given norm. We confirm that stability toward diffeomorphisms does not correlate strongly with performance on benchmark image data sets. By contrast, we find that the stability toward diffeomorphisms relative to that of generic transformations correlates remarkably well with the test error. It is of order unity at initialization but decreases by several orders of magnitude during training for state-of-the-art architectures.
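To make the notion of relative stability concrete, here is a minimal sketch that compares a model's sensitivity to a smooth deformation against generic noise of matched norm. Everything here is an illustrative assumption rather than the paper's protocol: the stand-in "network" f is a random feature map, and the smoothed random displacement field is only a crude substitute for a sample from the maximum-entropy diffeomorphism distribution.

```python
# Minimal sketch: relative stability = sensitivity to smooth deformations
# divided by sensitivity to generic noise of the same norm.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

rng = np.random.default_rng(0)
H = W = 32
W_out = rng.standard_normal((H * W, 10)) / np.sqrt(H * W)

def f(img):
    """Placeholder 'network'; in the paper this would be a trained net."""
    return np.tanh(img.ravel() @ W_out)

def smooth_displacement(scale=1.0, smooth=4.0):
    """Low-frequency random displacement field (crude stand-in for a
    typical diffeomorphism of a given norm)."""
    d = gaussian_filter(rng.standard_normal((2, H, W)), sigma=(0, smooth, smooth))
    return scale * d / np.sqrt((d ** 2).mean())

def deform(img, disp):
    """Warp the image along the displacement field (bilinear interpolation)."""
    yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([yy + disp[0], xx + disp[1]])
    return map_coordinates(img, coords, order=1, mode="reflect")

img = gaussian_filter(rng.standard_normal((H, W)), sigma=2.0)  # toy "image"

num, den = [], []
for _ in range(100):
    tau = deform(img, smooth_displacement()) - img       # diffeo perturbation
    eta = rng.standard_normal((H, W))
    eta *= np.linalg.norm(tau) / np.linalg.norm(eta)     # generic noise, matched norm
    num.append(np.sum((f(img + tau) - f(img)) ** 2))
    den.append(np.sum((f(img + eta) - f(img)) ** 2))

print("relative stability (diffeo / generic):", np.mean(num) / np.mean(den))
```

The abstract's claim is that, for trained state-of-the-art architectures, this ratio drops far below its order-unity value at initialization.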

Slides link.

Date to be announced.

Zoom link: 

Speaker: Francesco Cagnetta

Title: Beating the curse of dimensionality via locality in teacher-student scenarios

Abstract: What is the source of the successes of convolutional neural networks on high-dimensional tasks? The prime suspects are locality (input signals are cut into low-dimensional patches before processing) and shift-invariance (obtained by performing the same operation on each patch). In order to test the effect of these two features on generalisation, we introduce a set of teacher-student scenarios for kernel regression with 'convolutional' kernels inspired by the neural tangent kernel of simple convolutional architectures with a given filter size. The asymptotic analysis of the power-law decay of the generalisation error with the number of training samples reveals the following: if the filter size of the teacher is smaller than that of the student, the error decay is controlled by the student and is independent of the input dimension, implying that locality is key to a model's performance. By contrast, enforcing shift-invariance yields only pre-asymptotic improvements in the error decay. During the talk, I will introduce the aforementioned teacher-student scenarios, then present our asymptotic results for both ridgeless and finite-ridge kernel regression.
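As an illustration of this kind of setup, here is a minimal sketch of kernel regression with a 'convolutional' (patch-local) kernel: a base kernel acts on size-t patches and the patch contributions are summed, mimicking the kernel of a simple convolutional architecture with filter size t. The Laplacian base kernel, the toy local teacher, and the dimensions are assumptions for illustration, not the paper's exact construction.

```python
# Minimal sketch: (near-)ridgeless regression with a patch-local kernel,
# with a teacher whose filter size (2) is smaller than the student's (3).
import numpy as np

rng = np.random.default_rng(0)
d, t = 16, 3                                     # input dim, student filter size

def conv_kernel(X1, X2, t):
    """K(x, x') = sum over patches p of exp(-||x_p - x'_p||)."""
    K = np.zeros((len(X1), len(X2)))
    for p in range(d - t + 1):
        A, B = X1[:, p:p + t], X2[:, p:p + t]
        K += np.exp(-np.linalg.norm(A[:, None] - B[None, :], axis=-1))
    return K

def teacher(X, t_teacher=2):
    """Toy local target: a sum of functions of size-2 patches."""
    return sum(np.sin(X[:, p:p + t_teacher].sum(axis=1))
               for p in range(d - t_teacher + 1))

Xte = rng.standard_normal((500, d))
for n in [50, 100, 200, 400, 800]:
    Xtr = rng.standard_normal((n, d))
    K = conv_kernel(Xtr, Xtr, t) + 1e-8 * np.eye(n)   # tiny ridge for stability
    alpha = np.linalg.solve(K, teacher(Xtr))
    pred = conv_kernel(Xte, Xtr, t) @ alpha
    print(n, np.mean((pred - teacher(Xte)) ** 2))
```

The printed test errors trace out a learning curve; the power-law exponent of its decay with n is the quantity the talk's asymptotic analysis characterises.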

Slides link.

Past talks

May 17th 2021 at 4.30 pm.

Zoom link: https://epfl.zoom.us/j/69244976578

Speaker: Eugene Golikov

Title: Tensor Programs

Abstract: We shall discuss the Tensor Programs formalism, which allows one to express neural network computations (e.g. forward and backward passes) for a wide class of neural nets. The formalism is equipped with a theorem (the Master theorem) that characterises the distributions of the random variables of the program in the limit of infinite width. Several previous results about infinite-width nets, such as convergence to a Gaussian process at initialization and convergence of the neural tangent kernel to a deterministic limit, can be deduced as simple corollaries of the Master theorem.
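As a small numerical illustration of one such corollary (not of the Tensor Programs machinery itself), the sketch below checks that the outputs of a random one-hidden-layer ReLU network over two fixed inputs approach a Gaussian whose covariance is the analytic arccosine kernel as the width grows. The widths, inputs, and sample counts are arbitrary choices.

```python
# Minimal sketch: empirical output covariance of a random ReLU net vs. the
# analytic infinite-width (Gaussian process / arccosine) kernel, as width grows.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 10)) / np.sqrt(10)   # two fixed, normalised inputs

def nngp_relu(X):
    """Infinite-width covariance for f(x) = a . relu(W x) / sqrt(width):
    the degree-1 arccosine kernel (Cho & Saul)."""
    G = X @ X.T
    norms = np.sqrt(np.diag(G))
    cos = np.clip(G / np.outer(norms, norms), -1.0, 1.0)
    theta = np.arccos(cos)
    return np.outer(norms, norms) * (np.sin(theta) + (np.pi - theta) * cos) / (2 * np.pi)

print("analytic kernel:\n", nngp_relu(X))
for width in [100, 1_000, 10_000]:
    outs = []
    for _ in range(2_000):                       # resample the random network
        W = rng.standard_normal((width, X.shape[1]))
        a = rng.standard_normal(width)
        outs.append(a @ np.maximum(W @ X.T, 0.0) / np.sqrt(width))
    print(f"width {width}, empirical covariance:\n", np.cov(np.array(outs).T))
```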

Slides link: Tensor Programs.

June 7th 2021 at 4.30 pm.

Zoom link: https://epfl.zoom.us/j/61386325452

Speaker: Ido Nachum

Title: Regularization by Misclassification in ReLU Neural Networks

Abstract: We study the implicit bias of ReLU neural networks trained by a variant of SGD where, at each step, the label is changed with probability $p$ to a random label (label smoothing being a close variant of this procedure). Our experiments demonstrate that label noise propels the network to a sparse solution in the following sense: for a typical input, only a small fraction of neurons are active, and the firing pattern of the hidden layers is sparser. In fact, for some instances, an appropriate amount of label noise not only sparsifies the network but also reduces the test error. We then turn to the theoretical analysis of such sparsification mechanisms, focusing on the extremal case of $p=1$. We show that in this case the network withers as anticipated from the experiments, but, surprisingly, in ways that depend on the learning rate and the presence of bias: either the weights vanish or the neurons cease to fire.
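Here is a minimal sketch of the training variant described above: plain single-sample SGD on a small ReLU network, except that the sampled label is replaced by a random one with probability $p$. The toy data, squared loss on +/-1 targets, network sizes, and hyperparameters are illustrative assumptions, not the authors' setup; $p=1$ recovers the extremal case analysed in the talk.

```python
# Minimal sketch: SGD where each sampled label is randomised with prob. p.
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 256, 20, 128                      # samples, input dim, hidden width
X = rng.standard_normal((n, d))
y = (X[:, 0] > 0).astype(int)               # toy binary labels in {0, 1}

W1 = rng.standard_normal((h, d)) / np.sqrt(d)
w2 = rng.standard_normal(h) / np.sqrt(h)
lr, p = 0.01, 0.2

def forward(x):
    z = np.maximum(W1 @ x, 0.0)             # ReLU hidden activations
    return z, w2 @ z                        # activations, scalar output

for step in range(50_000):
    i = rng.integers(n)
    label = rng.integers(2) if rng.random() < p else y[i]   # label noise
    z, out = forward(X[i])
    r = out - (2 * label - 1)               # squared loss on +/-1 targets
    g2 = r * z                              # gradient w.r.t. w2
    gW1 = r * np.outer(w2 * (z > 0), X[i])  # gradient w.r.t. W1
    w2 -= lr * g2
    W1 -= lr * gW1

active = np.mean([(forward(x)[0] > 0).mean() for x in X])
print("average fraction of active neurons per input:", active)
```

The last line estimates how many hidden neurons fire for a typical input, i.e. the sparsity measure the abstract refers to.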

Slides link: Implicit bias with label noise.


Organizers:

Elisabetta Cornacchia, François Ged, Evgenii Golikov, Jan Hazla, Arthur Jacot, Loucas Pillaud-Vivien, Berfin Simsek.