#### EPFL School of Computer and Communication Sciences

## Summer Research Institute 2022

#### Learning: Optimization and Stochastics

June 29th to July 1st 2022

## Overview

The goal of the workshop is to bring together experts in various areas of mathematics and computer science related to the theory of machine learning, and to learn about recent and exciting developments in a relaxed atmosphere. The workshop will take place on EPFL campus, with social activities in the Lake Geneva area.

The event is **open to everyone** and attendance is **free of charge**.

If you plan to attend SuRI 2022, please register to help us organize the event.
The registration deadline is June 15th 2022.

## Workshop preview (from 2020)

The workshop was postponed from its original date in Summer 2020 due to the COVID-19 situation, but is now scheduled for Summer 2022. A short Zoom preview of the workshop was held in 2020, featuring several open problem/research presentations:

- Lower Bounds for Sampling (Peter Bartlett - slides)

- Black-box complexity of optimization in low dimensions (Sebastien Bubeck - slides)

- Learning neural networks (Adam Klivans)

- Cutting Convex Sets with Margin (Shay Moran - slides)

- From Nesterov's Estimate Sequence to Riemannian Acceleration (Suvrit Sra - slides)

## Program (Overview)

The workshop runs over three days, from 29.06.22 to 01.07.22; see the detailed program below.

## Program (Detailed)

**29.06.22**

The classic view of statistical learning tells us that we should balance model fit with model complexity instead of insisting on training error much lower than what we can expect to generalize to, or even lower than the noise level or Bayes error, and that this balance and control of model complexity ensure good generalization. But in recent years we have seen that in many situations we can learn and generalize without such a balance, despite (over?)fitting the training set well below the noise level. This has caused us to rethink the basic principles underlying statistical learning theory. In this talk I will discuss how much of our theory we can salvage and how much of it needs to be revised, focusing on the role of uniform convergence in understanding interpolation learning.

Based on joint work with Lijia Zhou, Fred Koehler, and Danica Sutherland.
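The interpolation phenomenon discussed above can be reproduced in a few lines. The sketch below is illustrative only (the random-ReLU-feature model and all parameters are my choices, not from the talk): a minimum-norm least-squares fit with more features than samples interpolates noisy labels exactly, i.e. trains well below the noise level, and one can then inspect how far the test error is from trivial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimum-norm least squares over random ReLU features, with more
# features (D) than samples (n): the fit interpolates the noisy labels
# (zero training error), the hallmark of interpolation learning.
n, d, D = 50, 5, 500                     # samples, input dim, feature count
w_true = rng.normal(size=d)              # a simple linear target

def features(X, W):
    return np.maximum(X @ W, 0.0)        # random ReLU features

W = rng.normal(size=(d, D)) / np.sqrt(d)
X_train = rng.normal(size=(n, d))
y_train = X_train @ w_true + 0.1 * rng.normal(size=n)   # noisy labels

Phi = features(X_train, W)
theta = np.linalg.pinv(Phi) @ y_train    # minimum-norm interpolating solution

train_err = np.mean((Phi @ theta - y_train) ** 2)

X_test = rng.normal(size=(1000, d))
test_err = np.mean((features(X_test, W) @ theta - X_test @ w_true) ** 2)

print(f"train MSE: {train_err:.2e}, test MSE: {test_err:.3f}")
```

The training error is numerically zero because `Phi` has full row rank, while the pseudo-inverse picks the minimum-norm interpolant among the infinitely many that fit.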

Standard statistical data analysis, developed by Fisher in the early 20th century, assumes that the observed data is an independent sample from the chosen statistical model. The resulting estimation algorithms often break down if the data exhibits minor (e.g., a vanishing fraction of outliers) departures from the chosen model. In the 1960s, Huber and Tukey founded the study of robust statistics to address this deficiency. While phenomenally successful in low dimensions, the estimators from classical robust statistics almost always incur an exponential computational cost in the underlying dimension.

Can we formulate general principles in the design of efficient robust estimators in high dimensions? What properties of the statistical model govern the existence of algorithms for robustly fitting expressive statistical models?

In the last few years, an algorithmic theory of high-dimensional robust statistics has emerged. This theory reveals a natural connection to computational versions of basic questions in high-dimensional probability: How can we use samples to efficiently confirm that low-degree polynomials of a random variable are sufficiently concentrated? How can we efficiently certify that every one-dimensional marginal of a random variable is sufficiently anti-concentrated?

In this talk, I will give a short overview of robust statistics based on the sum-of-squares method that reveals how such problems are intimately related to the above basic questions. As a case study, I will use the recent resolution of the problem of robustly learning a mixture of arbitrary high-dimensional Gaussians as a culmination of a sequence of works that developed new algorithmic certificates for basic probabilistic phenomena.
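The basic failure mode of classical estimators, and the simplest remedy, can be seen in a toy example. The sketch below is not the sum-of-squares machinery from the talk; it is the elementary filtering idea (remove points far from the current estimate and re-average), with all parameters chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A small fraction of adversarial outliers ruins the empirical mean in
# high dimensions; iteratively trimming the farthest points recovers it.
d, n_good, n_bad = 20, 950, 50
good = rng.normal(size=(n_good, d))      # inliers, true mean = 0
bad = 10.0 * np.ones((n_bad, d))         # a planted outlier cluster
X = np.vstack([good, bad])

naive = X.mean(axis=0)                   # pulled far from the true mean

def filtered_mean(X, rounds=5, keep=0.9):
    est = X.mean(axis=0)
    for _ in range(rounds):
        dist = np.linalg.norm(X - est, axis=1)
        thresh = np.quantile(dist, keep)  # drop the farthest 10%
        est = X[dist <= thresh].mean(axis=0)
    return est

robust = filtered_mean(X)

print("naive error: ", np.linalg.norm(naive))
print("robust error:", np.linalg.norm(robust))
```

With 5% outliers the naive mean is displaced by a constant per coordinate, while the filtered estimate lands close to the true mean; the algorithmic theory in the talk addresses the much harder settings where outliers are not so easily separable.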

**30.06.22**

Recent years have witnessed tremendous progress in the field of Machine Learning (ML). However, many of the recent breakthroughs demonstrate phenomena that lack explanations, and sometimes even contradict conventional wisdom. One main reason is that classical ML theory adopts a worst-case perspective, which seems too pessimistic to explain practical ML: in reality data is rarely worst-case, and experiments indicate that often much less data is needed than traditional theory predicts.

In this talk we will discuss two variations of classical learning theory. These models are based on a distribution- and data-dependent perspective which complements the distribution-free worst-case perspective of classical theory, and is suitable for exploiting specific properties of a given learning task.

A common theme of these models is their combinatorial nature. This can be seen as a continuation of the fruitful link between machine learning and combinatorics, which goes back to the discovery of the VC dimension more than 50 years ago.

Traditionally, optimization methods like (stochastic) gradient descent take time at least linear in the number of parameters. For problems with many parameters, each step is too expensive.

In this talk, I will discuss how designing robust algorithms can lead to faster algorithms. In particular, I will explain the robust interior point method. Its robustness allows us to speed up its iteration cost via data structures. This results in faster algorithms for many important problems such as linear programming, semidefinite programming, and the maximum flow problem.

SGD is a workhorse for optimization and thus for statistics and machine learning, and it is well understood in low dimensions. But understanding its behavior in very high dimensions is not yet a simple task, and is a work in progress. We describe here interesting new regimes for the limiting dynamics of “summary statistics” for SGD in high dimensions. These regimes may differ from the expected one given by the usual wisdom in finite dimensions, i.e. the population gradient flow. We find that a new corrector term may be needed and that the phase portrait of these dynamics is quite complex and substantially different from what the classical low-dimensional approach would predict, even for simple tasks like Tensor PCA or simple XOR classification. A related picture emerged in the recent work done at EPFL on two-layer networks by Veiga, Stephan, Loureiro, Krzakala and Zdeborova.

Joint work with Reza Gheissari (UC Berkeley) and Aukosh Jagannath (Waterloo).

The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution. Yet, despite some recent progress, a complete theory explaining its success is still missing. This talk presents, for orthogonal input vectors, a precise description of the gradient flow dynamics of training one-hidden-layer ReLU neural networks for the mean squared error at small initialisation. In this setting, despite non-convexity, we show that the gradient flow converges to zero loss and characterise its implicit bias towards minimum variation norm. Furthermore, some interesting phenomena are highlighted: a quantitative description of the initial alignment phenomenon and a proof that the process follows a specific saddle to saddle dynamics.

Joint work with Loucas Pillaud-Vivien and Nicolas Flammarion.
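The setting above is simple enough to simulate directly. Below is a minimal sketch (dimensions, width, learning rate and number of steps are illustrative choices, not from the paper): gradient descent, as a discretisation of the gradient flow, on a one-hidden-layer ReLU network with orthogonal inputs, squared loss and small initialisation.

```python
import numpy as np

rng = np.random.default_rng(2)

# One-hidden-layer ReLU network trained on orthogonal inputs (the
# standard basis) with mean squared error and small initialisation.
d, m = 4, 64                             # input dimension / hidden width
X = np.eye(d)                            # orthogonal inputs e_1, ..., e_d
y = np.array([1.0, -0.5, 2.0, 0.3])      # arbitrary targets

scale = 1e-3                             # small initialisation
W = scale * rng.normal(size=(m, d))      # hidden-layer weights
a = scale * rng.normal(size=m)           # output-layer weights

def forward(X):
    return np.maximum(X @ W.T, 0.0) @ a

lr = 0.05
for step in range(20000):
    H = np.maximum(X @ W.T, 0.0)         # hidden activations, shape (d, m)
    r = H @ a - y                        # residuals
    grad_a = H.T @ r / d
    grad_W = ((np.outer(r, a) * (H > 0)).T @ X) / d
    a -= lr * grad_a
    W -= lr * grad_W

loss = 0.5 * np.mean((forward(X) - y) ** 2)
print(f"final loss: {loss:.2e}")
```

Despite the non-convex landscape, the loss reaches (numerically) zero from small initialisation; logging the loss during training also shows the plateaus characteristic of saddle-to-saddle dynamics.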

We consider the Boolean version of the Pointer Value Retrieval (PVR) benchmark, where a ‘reasoning’ function acts on a string of digits to produce the output. More generally, we study the learning of logical functions with gradient descent (GD) on neural networks. We show that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as canonical holdout), the generalization error of GD admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.

Joint work with Emmanuel Abbé (EPFL), Samy Bengio (Apple), Elisabetta Cornacchia (EPFL), Jon Kleinberg (Cornell University), Maithra Raghu (Google Research), and Chiyuan Zhang (Google Research).
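The generalization characterization above is stated in terms of the Boolean influence, which is easy to compute exactly for small functions. The sketch below uses a toy "pointer" function in the PVR spirit; the function and its name are hypothetical, for illustration only.

```python
import itertools

# Boolean influence of coordinate i on f: the probability, over a
# uniform random input, that flipping bit i flips the output.
def influence(f, n, i):
    flips = 0
    for x in itertools.product([0, 1], repeat=n):
        y = list(x)
        y[i] ^= 1                        # flip coordinate i
        flips += f(x) != f(tuple(y))
    return flips / 2 ** n

# Toy pointer function: the first bit selects which of the last two
# bits is returned (a miniature pointer-value-retrieval task).
def pointer(x):
    return x[1] if x[0] == 0 else x[2]

infl = [influence(pointer, 3, i) for i in range(3)]
print(infl)  # each coordinate flips the output for half the inputs
```

Here every coordinate has influence 1/2: the pointer bit matters exactly when the two value bits disagree, and each value bit matters exactly when the pointer selects it.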

We consider the problem of learning sparse functions (functions that depend only on a latent low-dimensional subspace) with two-layer neural networks (NN). This setting is of interest since neural networks routinely tackle high-dimensional datasets, and it is still poorly understood when and how SGD-trained NNs can adapt to latent low-dimensional structure without suffering from the curse of dimensionality.

We consider training the two-layer NN with online SGD in the mean-field scaling. We show that the SGD trajectory can be well approximated for constant time by a dimension-independent dynamics, which we call DF-PDE. In particular, if the DF-PDE converges in constant time to zero test error, so does online SGD. We apply this characterization to sparse functions on the hypercube and show that a hierarchical property —the merged-staircase property— is necessary and nearly sufficient for learning in this regime.

This is a joint work with Emmanuel Abbe (EPFL) and Enric Boix-Adsera (MIT).

**01.07.22**

Stochastic gradient descent (SGD) is the workhorse of modern machine learning. While SGD has been thoroughly analyzed for independent data and tight finite-time guarantees are known, its finite-sample performance with dependent data is much less well understood. In this talk, we will consider SGD-style algorithms for two problems where the data is not independent but rather comes from a Markov chain: learning dynamical systems and Q-learning for reinforcement learning. While vanilla SGD is biased and does not converge to the correct solution for these problems, we show that SGD along with a technique known as “reverse experience replay” can efficiently find the optimal solutions.

Based on joint works with Naman Agarwal, Syomantak Chaudhuri, Prateek Jain, Suhas Kowshik and Dheeraj Nagaraj.
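The mechanics of reverse experience replay are simple to sketch. The toy below (a scalar linear system; all constants are illustrative, not from the talks) shows the device itself: data arrives as one dependent trajectory, and the algorithm buffers short chunks and runs its SGD updates over each chunk in reverse time order, which is what de-biases SGD on Markovian data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Identify a scalar linear system x_{t+1} = a * x_t + noise from a
# single dependent trajectory, using SGD with reverse experience replay.
a_true, noise, T, B, lr = 0.8, 0.1, 20000, 20, 0.05

# Generate one trajectory (dependent, Markovian data).
x = np.empty(T + 1)
x[0] = 0.0
for t in range(T):
    x[t + 1] = a_true * x[t] + noise * rng.normal()

a_hat = 0.0
for start in range(0, T, B):             # split the stream into chunks
    chunk = range(start, min(start + B, T))
    for t in reversed(chunk):            # replay each chunk in reverse
        pred_err = a_hat * x[t] - x[t + 1]
        a_hat -= lr * pred_err * x[t]    # SGD step on the squared error

print(f"estimated a: {a_hat:.3f} (true {a_true})")
```

On this easy scalar example the estimate lands close to the true parameter; the talks' results quantify when and why the reversed ordering is needed for harder dependent-data problems.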

In this talk, I will present two results that highlight the implicit benefits of making non-positive progress on the loss objective while training. In the first part, I will focus on the classical problem of minimizing convex quadratics using gradient descent and describe how a “fractal” learning rate schedule, which has locally unstable updates that make negative progress on the objective, allows for accelerated convergence to the optimum. In the second part, I will focus on another canonical problem family, learning sparse parities, and discuss how simple architectures with no sparsity prior learn this class of functions using (stochastic) gradient descent in a number of iterations matching the lower bound (in the SQ model). Furthermore, the loss curves exhibit a phase transition: no apparent progress for a while, followed by exponentially fast learning. For both parts, I will present both empirical and theoretical results quantifying these phenomena.

The first part is based on joint work with Naman Agarwal and Cyril Zhang. The second part is based on joint work with Boaz Barak, Ben L. Edelman, Sham Kakade, Eran Malach, and Cyril Zhang.
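The first phenomenon above can be illustrated with the classical Chebyshev step-size schedule, a close relative of the fractal schedule from the talk (this sketch is not the schedule from the paper itself, and all parameters are illustrative): some individual gradient steps are far above the stability threshold 2/L and temporarily increase the loss, yet the full schedule beats the best constant step size.

```python
import numpy as np

# Gradient descent on a convex quadratic 0.5 * x^T diag(lam) x with a
# schedule of step sizes, some of which are wildly unstable on their own.
mu, L, n = 0.01, 1.0, 64
lam = np.linspace(mu, L, 200)            # eigenvalues of the quadratic
x0 = np.ones_like(lam)

def loss(x):
    return 0.5 * np.sum(lam * x ** 2)

# Step sizes: inverses of the Chebyshev nodes on [mu, L], largest first.
k = np.arange(1, n + 1)
nodes = 0.5 * (L + mu) + 0.5 * (L - mu) * np.cos((2 * k - 1) * np.pi / (2 * n))
lrs = np.sort(1.0 / nodes)[::-1]         # most unstable steps first

x, went_up = x0.copy(), False
for lr in lrs:
    before = loss(x)
    x = x - lr * lam * x                 # gradient step
    went_up |= loss(x) > before          # negative progress on some steps
cheb_loss = loss(x)

x = x0.copy()
for _ in range(n):
    x = x - (2.0 / (L + mu)) * lam * x   # best constant step size, n steps
const_loss = loss(x)

print(f"schedule loss: {cheb_loss:.2e}  constant-lr loss: {const_loss:.2e}  "
      f"loss ever increased: {went_up}")
```

In exact arithmetic the final point is the same under any ordering of these steps; the fractal ordering studied in the talk is about controlling the intermediate blow-up, which this naive largest-first ordering makes extreme.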

## Speakers

### Gerard Ben Arous

(New York University)

### Sebastien Bubeck

(Microsoft Research Redmond)

### John Duchi

(Stanford University)

### Surbhi Goel

(Microsoft Research)

### Moritz Hardt

(MPI)

### Piotr Indyk

(MIT)

### Adam Klivans

(UT Austin)

### Pravesh Kothari

(CMU)

### James Lee

(University of Washington)

### Yin Tat Lee

(University of Washington)

### Shay Moran

(Technion)

### Praneeth Netrapalli

(Google Research India)

### Vianney Perchet

(CREST, ENSAE)

### Nati Srebro

(TTIC)

### Ashia Wilson

(MIT)

### Etienne Boursier

(EPFL)

### Aryo Lotfi

(EPFL)

### Theodor Misiakiewicz

(Stanford)

### Hugo Cui

(EPFL)

## Directions

If you're flying in, Genève-Cointrin is the nearest international airport. By train, it takes ~45 minutes to reach Lausanne. Zürich Airport is about ~2.5 hours away by train. To get information on train schedules, please see the Swiss Rail website.

To reach the SuRI venue from the EPFL M1 metro station, please check the map below.

## Archive

EPFL's Summer Research Institute has a long and successful history. For more information on past editions of the event, refer to the links below.