Summer Research Institute

EPFL School of Computer and Communication Sciences

Summer Research Institute 2022

Learning: Optimization and Stochastics

June 29th to July 1st 2022

Overview

The goal of the workshop is to bring together experts in various areas of mathematics and computer science related to the theory of machine learning, and to learn about recent and exciting developments in a relaxed atmosphere. The workshop will take place on EPFL campus, with social activities in the Lake Geneva area.

The event is open to everyone and attendance is free of charge.
In case you plan to attend SuRI 2022, please register to facilitate the event's organization. The registration deadline is June 15th 2022.

Workshop preview (from 2020)

The workshop was postponed from its original date in Summer 2020 due to the COVID-19 situation, but is now scheduled for Summer 2022. A short Zoom preview of the workshop was held in 2020, featuring several open problem/research presentations:

Lower Bounds for Sampling (Peter Bartlett - slides)

Black-box complexity of optimization in low dimensions (Sebastien Bubeck - slides)

Learning neural networks (Adam Klivans)

Cutting Convex Sets with Margin (Shay Moran - slides)

From Nesterov's Estimate Sequence to Riemannian Acceleration (Survit Sra - slides)

Program (Overview)

29.06.2022

09:30 – 10:00 Learning by Overfitting: A Statistical Learning view of Benign Overfitting – Nati Srebo

10:10 – 10:50 Unveiling Transformers with LEGO – Sebastien Bubeck

10:50 – 11:20 Break

11:50 – 12:30 Efficient High-Dimensional Robust Statistics via Algorithmic Certificates of (Anti)-Concentration – Pravesh Kothari

14:00 – Social activity (speakers only)

30.06.22

09:00 - 09:40 Multilevel Optimization for Inverse Problems – Ashia Wilson

09:40 - 10:20 Extending Generalization Theory Towards Addressing Modern Challenges in Machine Learning – Shay Moran

10:20 - 10:50 Break

10:50 - 11:30 How many labelers do you have? Some perspective on 'gold standard' labels – John Duchi

11:30 - 12:10 In Search of New Algorithms for Training Neural Networks – Adam Klivans

12:10 - 14:00 Lunch

14:00 - 14:40 From Robustness to Efficiency – Yin Tat Lee

14:40 - 15:20 Stochastic Gradient Descent in high dimensions: Effective dynamics and critical scaling. – Gerard Ben Arous

15:20 - 15:50 Break

15:50 - 16:00 Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs – Etienne Boursier

16:00 - 16:10 Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures – Aryo Lotfi

16:10 - 16:20 Learning sparse functions in the mean-field regime – Theodor Misiakiewicz

16:20 - 16:30 Error rates for kernel methods under source and capacity conditions – Hugo Cui

18:30 - Social activity (speakers only)

01.07.22

09:00 - 09:40 Reverse Experience Replay: An Efficient Way to Learn with Dependent Data. – Praneeth Netrapalli

09:40 - 10:20 An algorithmic solution to the Blotto game using multi-marginal couplings – Vianney Perchet

10:20 - 10:50 Break

10:50 - 11:30 The Hidden Progress Behind Loss Curves – Surbhi Goel

11:30 - 12:10 Learning-Based Algorithms for Sampling and Streaming – Piotr Indyk

12:10 - 14:00 Lunch

14:00 - 14:40 The invisible hand of prediction – Moritz Hardt

14:40 - 15:00 Lucky Vote

15:00 - 16:00 Lucky Talk

16:00 Closing session

Program (Detailed)

29.06.2022

09:30 – 10:00 Learning by Overfitting: A Statistical Learning view of Benign Overfitting – Nati Srebo

The classic view of statistical learning tells us that we should balance model fit with model complexity instead of insisting on training error that’s much lower than what we can expect to generalize to, or even lower than the noise level or Bayes error. And that this balance, and control on model complexity ensures good generalization. But in recent years we have seen that in many situations we can learn and generalize without such a balance, and despite (over?) fitting the training set well below the noise level. This has caused us to rethink the basic principles underlying statistical learning theory. In this talk I will discuss how much of our theory we can salvage and how much of it needs to be revised, focusing on the role of uniform convergence in understanding interpolation learning.

Based on joint work with Lijia Zhou, Fred Koehler, and Danica Sutherland

10:10 – 10:50 Unveiling Transformers with LEGO – Sebastien Bubeck

The discovery of the transformer architecture was a paradigm shifting event for deep learning. However, these architectures are arguably even harder to understand than say convolutional neural networks. In this work we propose a synthetic task, called LEGO, to probe the inner workings of transformers. We obtain some insights on multi-head attention, the effect of pretraining, as well as overfitting issues. Joint work with Yi Zhang, Arturs Backurs, Ronen Eldan, Suriya Gunasekar, and Tal Wagner.

11:50 – 12:30 Efficient High-Dimensional Robust Statistics via Algorithmic Certificates of (Anti)-Concentration – Pravesh Kothari

Standard statistical data analysis, developed by Fisher in the early 20th century, assumes that the observed data is an independent sample from the chosen statistical model. The resulting estimation algorithms often break down if the data exhibits minor (e.g., a vanishing fraction of outliers) departures from the chosen model. In the 1960s, Huber and Tukey founded the study of robust statistics to address this deficiency. While phenomenally successful in low dimensions, the estimators from classical robust statistics almost always incur an exponential computational cost in the underlying dimension.

Can we formulate general principles in the design of efficient robust estimators in high dimensions? What properties of the statistical model govern the existence of algorithms for robustly fitting expressive statistical models?

In the last few years, an algorithmic theory of high-dimensional robust statistics has emerged. This theory reveals a natural connection to computational versions of basic questions in high-dimensional probability: How can we use samples to efficiently confirm that low-degree polynomials of a random variable are sufficiently concentrated? How can we efficiently certify that every one-dimensional marginal of a random variable is sufficiently anti-concentrated?

In this talk, I will give a short overview of robust statistics based on the sum-of-squares method that reveals how such problems are intimately related to the above basic questions. As a case study, I will use the recent resolution of the problem of robustly learning a mixture of arbitrary high-dimensional Gaussians as a culmination of a sequence of works that developed new algorithmic certificates for basic probabilistic phenomena.

30.06.22

09:00 - 09:40 Multilevel Optimization for Inverse Problems – Ashia Wilson

Inverse problems occur in a variety of parameter identification tasks in engineering. Such problems are challenging in practice, as they require repeated evaluation of computationally expensive forward models. I will discuss some recent work with collaborators which introduces a unifying framework of multilevel optimization that can be applied to a wide range of optimization-based solvers. Our framework provably reduces the computational cost associated with evaluating the expensive forward maps stemming from various physical models. To demonstrate the versatility of our analysis, I will discuss its implications for various methodologies including multilevel (accelerated, stochastic) gradient descent, a multilevel ensemble Kalman inversion and a multilevel Langevin sampler.

09:40 - 10:20 Extending Generalization Theory Towards Addressing Modern Challenges in Machine Learning – Shay Moran

Recent years have witnessed tremendous progress in the field of Machine Learning (ML). However, many of the recent breakthroughs demonstrate phenomena that lack explanations, and sometimes even contradict conventional wisdom. One main reason for this is because classical ML theory adopts a worst-case perspective which seems too pessimistic to explain practical ML: in reality data is rarely worst-case, and experiments indicate that often much less data is needed than predicted by traditional theory.

In this talk we will discuss two variations of classical learning theory. These models are based on a distribution- and data-dependent perspective which complements the distribution-free worst-case perspective of classical theory, and is suitable for exploiting specific properties of a given learning task.

A common theme of these models is their combinatorial nature. This can be seen as a continuation of the fruitful link between machine learning and combinatorics, which goes back to the discovery of the VC dimension more than 50 years ago.

10:50 - 11:30 How many labelers do you have? Some perspective on 'gold standard' labels – John Duchi

The construction of most supervised learning datasets revolves around collecting multiple labels for each instance, then aggregating the labels to form a type of “gold-standard” label. We question the wisdom of this pipeline by developing a (stylized) model of this process and analyzing its statistical consequences, showing how access to non-aggregated label information can make training well-calibrated models easier or—in some cases—even feasible, whereas it is impossible with only gold-standard labels. The entire story, however, is subtle, and the contrasts between aggregated and fuller label information depend on the particulars of the problem. The theory we develop in the stylized model makes several predictions for real-world datasets, including when non-aggregate labels should improve learning performance, which we carefully test to corroborate the validity of our predictions.

11:30 - 12:10 In Search of New Algorithms for Training Neural Networks – Adam Klivans

It has been known for decades that a polynomial-size training sample suffices for core tasks in supervised learning. Many theoretical results, however, indicate that these learning tasks are computationally intractable. Can we find new algorithms to circumvent these hardness results? As a case study, we consider the fundamental problem of training neural networks. We present a new algorithm showing that even deep ReLU networks are fixed parameter tractable, a result that provably cannot be obtained using gradient descent.

14:00 - 14:40 From Robustness to Efficiency – Yin Tat Lee

Traditionally, optimization methods like (stochastic) gradient descent take time at least linear to the number of parameters. For problems with many parameters, each step is too expensive.

In this talk, I will discuss how designing robust algorithms can lead to faster algorithms. In particular, I will explain the robust interior point method. Its robustness allows us to speed up its iteration cost via data structures. This results in faster algorithms for many important problems such as linear programming, semidefinite programming, and the maximum flow problem.

14:40 - 15:20 Stochastic Gradient Descent in high dimensions: Effective dynamics and critical scaling. – Gerard Ben Arous

Abstract: SGD is a workhorse for optimization and thus for statistics and machine learning, and it is well understood in low dimensions. But understanding its behavior in very high dimensions is not yet a simple task, and is a work in progress. We describe here interesting and new regimes for the limiting dynamics of “summary statistics” for SGD in high dimensions. These regimes may differ from the expected one given by the usual wisdom in finite dimensions, i.e. the population gradient flow. We find that a new corrector term may be needed and that the phase portrait of these dynamics is quite complex and substantially different from what would be predicted using the classical low-dimensional approach, including for simple tasks, like Tensor PCA, or simple XOR classification. A related picture emerged rin the recent work done at EPFL on two-layers networks by Veiga, Stephan, Loureiro, Krzakala and Zdeborova.

Joint work with Reza Gheissari (UC Berkeley) and Aukosh Jagannath (Waterloo)

15:50 - 16:00 Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs – Etienne Boursier

The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution. Yet, despite some recent progress, a complete theory explaining its success is still missing. This article presents, for orthogonal input vectors, a precise description of the gradient flow dynamics of training one-hidden layer ReLU neural networks for the mean squared error at small initialisation. In this setting, despite non-convexity, we show that the gradient flow converges to zero loss and characterise its implicit bias towards minimum variation norm. Furthermore, some interesting phenomena are highlighted: a quantitative description of the initial alignment phenomenon and a proof that the process follows a specific saddle to saddle dynamics.

Joint work with Loucas Pillaud-Vivien and Nicolas Flammarion.

16:00 - 16:10 Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures – Aryo Lotfi

We consider the Boolean version of the Pointer Value Retrieval (PVR) benchmark, where a ‘reasoning’ function acts on a string of digits to produce the output. More generally, we study the learning of logical functions with gradient descent (GD) on neural networks. We show that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as canonical holdout), the generalization error of GD admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.

Joint work with Emmanuel Abbé (EPFL), Samy Bengio (Apple), Elisabetta Cornacchia (EPFL), Jon Kleinberg (Cornell University), Maithra Raghu (Google Research), and Chiyuan Zhang (Google Research).

16:10 - 16:20 Learning sparse functions in the mean-field regime – Theodor Misiakiewicz

We consider the problem of learning sparse functions (a function that depends on a latent low-dimensional subspace) with two-layer neural networks (NN). This setting is of interest since neural networks routinely tackle high-dimensional datasets and it is still poorly understood when and how SGD-trained NNs can adapt to latent low-dimensional structure without suffering from the curse of dimensionality.

We consider training the 2-layer NN with online SGD in the mean-field scaling. We show that the SGD trajectory can be well approximated for constant time by a dimension independent dynamics, which we call DF-PDE. In particular, if the DF-PDE converges in constant time to zero test error, so does online SGD. We apply this characterization to sparse functions on the hypercube and show that a hierarchical property —the merged-staircase property— is necessary and nearly sufficient for learning in this regime.

This is a joint work with Emmanuel Abbe (EPFL) and Enric Boix-Adsera (MIT).

16:20 - 16:30 Error rates for kernel methods under source and capacity conditions – Hugo Cui

We investigate the rates of decay of the prediction error for kernel methods under the Gaussian design and source/capacity assumptions. For kernel ridge regression, we derive all the observable rates, and characterize the regimes in which each hold. In particular, we show that the decay rate may transition from a fast, noiseless rate to a slow, noisy rate as the sample complexity is increased. For noiseless kernel classification, we derive the rates for two standard classifiers, margin-maximizing SVMs and ridge classifiers, and contrast the two methods. In both cases, the derived rates also describe to a good degree the learning curves of a number of real datasets.

01.07.22

09:00 - 09:40 Reverse Experience Replay: An Efficient Way to Learn with Dependent Data. – Praneeth Netrapalli

Stochastic gradient descent (SGD) is the workhorse of modern machine learning. While SGD has been thoroughly analyzed for independent data and tight finite time guarantees are known, its finite sample performance with dependent data has not been as thoroughly analyzed. In this talk, we will consider SGD-style algorithms for two problems where the data is not independent but rather comes from a Markov chain: learning dynamical systems and Q-learning for reinforcement learning. While vanilla SGD is biased and does not converge to the correct solution for these problems, we show that SGD along with a technique known as “reverse experience replay” can efficiently find the optimal solutions.

Based on joint works with Naman Agarwal, Syomantak Chaudhuri, Prateek Jain, Suhas Kowshik and Dheeraj Nagaraj.

09:40 - 10:20 An algorithmic solution to the Blotto game using multi-marginal couplings – Vianney Perchet

We describe an efficient algorithm to compute solutions for the general two-player Blotto game on n battlefields with heterogeneous values. While explicit constructions for such solutions have been limited to specific, largely symmetric or homogeneous, setups, this algorithmic resolution covers the most general situation to date: value-asymmetric game with asymmetric budget. The proposed algorithm rests on recent theoretical advances regarding Sinkhorn iterations for matrix and tensor scaling. An important case which had been out of reach of previous attempts is that of heterogeneous but symmetric battlefield values with asymmetric budget. In this case, the Blotto game is constant-sum so optimal solutions exist, and our algorithm samples from an ε-optimal solution in time ̃ (n2+ε−4), independently of budgets and battlefield values. In the case of asymmetric values where optimal solutions need not exist but Nash equilibria do, our algorithm samples from an ε-Nash equilibrium with similar complexity but where implicit constants depend on various parameters of the game such as battlefield values.

10:50 - 11:30 The Hidden Progress Behind Loss Curves – Surbhi Goel

In this talk, I will present two results that highlight the implicit benefits of making non-positive progress on the loss objective while training. In the first part, I will focus on the classical problem of minimizing convex quadratics using gradient descent and describe how a “fractal” learning rate schedule, which has locally unstable updates that make negative progress on the objective, allows for accelerated convergence to the optimum. In the second part, I will focus on another canonical family of problems, sparse parities problem, and discuss how simple architectures with no sparsity prior learn this class of functions using (stochastic) gradient descent in the number of iterations matching the lower bound (in the SQ model). Furthermore, the loss curves exhibit a phase transition where there is no apparent progress for a while and then exponentially fast learning. For both parts, I will present both empirical and theoretical results quantifying these phenomena.

The first part is based on joint work with Naman Agarwal and Cyril Zhang. The second part is based on joint work with Boaz Barak, Ben L. Edelman, Sham Kakade, Eran Malach, and Cyril Zhang.

11:30 - 12:10 Learning-Based Algorithms for Sampling and Streaming – Piotr Indyk

Classical algorithms typically provide “one size fits all” performance, and do not leverage properties or patterns in their inputs. A recent line of work aims to address this issue by developing algorithms that use machine learning predictions to improve their performance. In this talk I will present some examples of results this type, in the context of streaming and sampling algorithms. In particular, I will show how to use machine learning predictions to improve the performance of a sampling algorithm for estimating the number of distinct elements in a data set, and a streaming algorithm for counting triangles in a graph.

14:00 - 14:40 The invisible hand of prediction – Moritz Hardt

Algorithmic predictions steer markets, drive consumption, shape communities, and alter life trajectories. The theory and practice of machine learning, however, has long neglected the causal forces of prediction. A recent conceptual framework, called performative prediction, draws attention to the difference between learning from a population and steering a population through predictions. After covering some emerging insights on performative prediction, the talk turns to an application of performativity to the question of power in digital economies.

Speakers

Gerard Ben Arous
(New York University)

Sebastien Bubeck
(Microsoft Research Redmond)

John Duchi
(Stanford University)

Surbhi Goel
(Microsoft Research)

Moritz Hardt
(MPI)

Piotr Indyk
(MIT)

Adam Klivans
(UT Austin)

Pravesh Kothari
(CMU)

James Lee
(University of Washington)

Yin Tat Lee
(University of Washington)

Shay Moran
(Technion)

Praneeth Netrapalli
(Google Research India)

Vianney Perchet
(CREST, ENSAE)

Nati Srebro
(TTIC)

Ashia Wilson
(MIT)

Etienne Boursier
(EPFL)

Aryo Lotfi
(EPFL)

Theodor Misiakiewicz
(Stanford)

Hugo Cui
(EPFL)

Organizers

Emmanuel Abbe

Nicolas Flammarion

Martin Jaggi

Michael Kapralov

Directions

If you're flying in, Genève-Cointrin is the nearest international airport. By train, it takes ~45 minutes to reach Lausanne. Zürich Airport is about ~2.5 hours away by train. To get information on train schedules, please see the Swiss Rail website.

To reach the SuRI venue from the EPFL M1 metro station, please check the map below.