July 5th, 2021 at 16:30 CEST

Logistic regression explicitly maximizes margins; should we stop training early?
Matus Telgarsky, University of Illinois
This talk will present two perspectives on the behavior of gradient descent with the logistic loss: on the one hand, it seems we should run as long as possible and achieve good margins; on the other, stopping early seems necessary for noisy (…)
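As a side note, the implicit-bias phenomenon the abstract alludes to fits in a few lines of NumPy. The sketch below is not from the talk; the data, step size, and iteration counts are invented for illustration. It runs gradient descent on the logistic loss over linearly separable data: the weight norm diverges, while the normalized margin keeps improving the longer we train.

```python
# Illustrative sketch only (not from the talk): gradient descent on the
# logistic loss over linearly separable data. The iterate's norm diverges,
# but its direction approaches the max-margin separator, so the normalized
# margin improves with longer training.
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([2.0, -1.0]))   # labels from a linear rule, so separable
w = np.zeros(2)
lr = 0.1

def grad(w):
    # gradient of mean_i log(1 + exp(-y_i <x_i, w>))
    margins = y * (X @ w)
    return ((-y * expit(-margins))[:, None] * X).mean(axis=0)

for t in range(1, 100_001):
    w -= lr * grad(w)
    if t in (100, 1_000, 10_000, 100_000):
        print(f"step {t:>7}  ||w|| = {np.linalg.norm(w):7.2f}  "
              f"normalized margin = {np.min(y * (X @ w)) / np.linalg.norm(w):.4f}")
```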

June 14th, 2021 at 16:30 CEST

Stochastic gradient descent for noise with ML-type scaling
Stephan Wojtowytsch, Princeton University
In the literature on stochastic gradient descent, there are two types of convergence results: (1) SGD finds minimizers of convex objective functions, and (2) SGD finds critical points of smooth objective functions. Classical results are obtained under the assumption that the stochastic noise (…)
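For readers unfamiliar with the setting, the iteration that both kinds of results analyze is the one below. This is a minimal, self-contained sketch and not from the talk; the least-squares objective, step size, and batch size are placeholders. Each update uses an unbiased minibatch estimate of the full gradient, and the classical assumptions mentioned in the abstract concern the noise in that estimate.

```python
# Illustrative sketch only (not from the talk): plain minibatch SGD on a smooth,
# convex least-squares objective. Each update uses an unbiased stochastic
# estimate g of the full gradient; classical analyses make assumptions on the
# noise g - grad f(theta).
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)   # noisy linear data

theta = np.zeros(d)
lr, batch = 0.01, 32
for _ in range(5000):
    idx = rng.integers(0, n, size=batch)                 # sample a minibatch
    g = A[idx].T @ (A[idx] @ theta - b[idx]) / batch     # unbiased gradient estimate
    theta -= lr * g

full_grad = A.T @ (A @ theta - b) / n
print("norm of full gradient at the SGD iterate:", np.linalg.norm(full_grad))
```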

June 21st, 2021 at 16:30 CEST

Mode Connectivity and Convergence of Gradient Descent for (Not So) Over-parameterized Deep Neural Networks
Marco Mondelli, IST Austria
Training a neural network is a non-convex problem that exhibits spurious and disconnected local minima. Yet, in practice, neural networks with millions of parameters are successfully optimized using gradient descent methods. In this talk, I will give some (…)
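To make the terminology concrete, here is a small self-contained experiment, not from the talk; the toy data, tiny architecture, and hyperparameters are all invented for illustration. It trains the same one-hidden-layer ReLU network from two random initializations and scans the training loss along the straight line between the two solutions. A loss bump ("barrier") on that line is the kind of apparent disconnectedness that mode-connectivity results are about.

```python
# Illustrative sketch only (not from the talk): train a tiny one-hidden-layer
# ReLU network twice from different initializations, then evaluate the training
# loss along the linear path between the two solutions. A barrier on this path
# is the apparent disconnectedness addressed by mode-connectivity results;
# wider networks typically show flatter paths.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)        # simple nonlinear target

def init(width, seed):
    r = np.random.default_rng(seed)
    return {"W1": r.normal(size=(2, width)),
            "b1": np.zeros(width),
            "w2": r.normal(scale=1.0 / np.sqrt(width), size=width),
            "b2": 0.0}

def forward(p, X):
    h = np.maximum(X @ p["W1"] + p["b1"], 0.0)   # ReLU hidden layer
    return h @ p["w2"] + p["b2"], h

def mse(p):
    out, _ = forward(p, X)
    return np.mean((out - y) ** 2)

def train(p, lr=0.05, steps=3000):
    for _ in range(steps):
        out, h = forward(p, X)
        err = 2.0 * (out - y) / len(y)           # d(mse)/d(out)
        dh = np.outer(err, p["w2"]) * (h > 0)    # backprop through ReLU
        p["b2"] -= lr * err.sum()
        p["w2"] -= lr * (h.T @ err)
        p["b1"] -= lr * dh.sum(axis=0)
        p["W1"] -= lr * (X.T @ dh)
    return p

width = 32
pa, pb = train(init(width, 1)), train(init(width, 2))

for alpha in np.linspace(0.0, 1.0, 11):          # loss along the linear path
    pm = {k: (1 - alpha) * pa[k] + alpha * pb[k] for k in pa}
    print(f"alpha = {alpha:.1f}   train loss = {mse(pm):.4f}")
```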

May 31st, 2021 at 16:30 CEST

On the Benefit of Using Differentiable Learning over Tangent Kernels
Eran Malach, Hebrew University
A popular line of research in recent years shows that, in some regimes, optimizing neural networks with gradient descent is equivalent to learning with the neural tangent kernel (NTK), a kernel induced by the network architecture and initialization. We study the (…)
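For concreteness, the kernel in question can be computed directly for a small network at initialization. The sketch below is not from the talk; the one-hidden-layer architecture and sizes are chosen only for illustration. It forms K(x, x') as the inner product of the parameter gradients of the network output at the two inputs, which depends only on the architecture and the random initialization.

```python
# Illustrative sketch only (not from the talk): the empirical neural tangent
# kernel of a one-hidden-layer ReLU network at initialization,
#     K(x, x') = <grad_theta f(x; theta0), grad_theta f(x'; theta0)>,
# i.e. a kernel determined by the architecture and the random initialization.
import numpy as np

rng = np.random.default_rng(0)
m, d = 512, 3                       # hidden width, input dimension
W = rng.normal(size=(m, d))         # hidden-layer weights at initialization
a = rng.normal(size=m)              # output-layer weights at initialization

def param_grad(x):
    """Gradient of f(x) = (1/sqrt(m)) * sum_j a_j * relu(W_j . x)
    with respect to all parameters (W and a), flattened into one vector."""
    pre = W @ x
    act = np.maximum(pre, 0.0)
    dW = (a * (pre > 0.0))[:, None] * x[None, :] / np.sqrt(m)   # df/dW
    da = act / np.sqrt(m)                                        # df/da
    return np.concatenate([dW.ravel(), da])

X = rng.normal(size=(5, d))                       # a handful of inputs
grads = np.stack([param_grad(x) for x in X])
K = grads @ grads.T                               # 5x5 NTK Gram matrix
print(np.round(K, 3))
```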

May 10th, 2021 at 16:30 CEST

Feature Learning in Infinite-Width Neural Networks
Greg Yang, Microsoft Research
As its width tends to infinity, a deep neural network’s behavior under gradient descent can become simplified and predictable (e.g. given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g. the NTK parametrization). However, we show that the standard and NTK parametrizations of (…)
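To pin down the terms, the sketch below (not from the talk; the widths and single-layer setup are invented for illustration) contrasts the two parametrizations of a dense layer: in the standard parametrization the 1/sqrt(fan_in) factor sits in the initialization scale, while in the NTK parametrization the weights are initialized at unit scale and the factor is applied in the forward pass. The two agree in distribution at initialization but scale differently under gradient descent as the width grows.

```python
# Illustrative sketch only (not from the talk): standard vs. NTK parametrization
# of a single dense layer. At initialization both give outputs of the same
# scale; the difference shows up in how gradient descent scales with width.
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 1024, 1024
x = rng.normal(size=fan_in)

# Standard parametrization: W ~ N(0, 1/fan_in), forward pass y = W x
W_std = rng.normal(scale=1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))
y_std = W_std @ x

# NTK parametrization: W ~ N(0, 1), forward pass y = (1/sqrt(fan_in)) W x
W_ntk = rng.normal(size=(fan_out, fan_in))
y_ntk = (W_ntk @ x) / np.sqrt(fan_in)

print("output std, standard parametrization:", y_std.std())
print("output std, NTK parametrization:     ", y_ntk.std())

# The per-entry output gradients differ: d y_i / d W_ij is x_j (order 1) in the
# standard parametrization but x_j / sqrt(fan_in) in the NTK parametrization,
# while the weight entries themselves have the opposite scales. This mismatch
# is what makes the two parametrizations behave differently at large width.
```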