Date: February 25, 2026
Time: 14h - 18h
Place: Amphitheater Morgenstern (Kahn building), Inria centre at Université Côte d'Azur
For external visitors or online attendance, registration is free but mandatory. Register Now
14h - 14h45: Lecture 1: Machine Learning Regularization via Information Measures
Iñaki Esnaola (University of Sheffield, UK)
Abstract: The effect of relative entropy asymmetry is analyzed in the context of empirical risk minimization (ERM) with relative entropy regularization (ERM-RER). We show that regularization by relative entropy forces the support of the solution to collapse into the support of the reference measure, introducing a strong inductive bias that negates the evidence provided by the training data. Motivated by this insight, we extend the results to f-divergence regularization by obtaining a closed-form expression for the solution under mild assumptions on the structure of the regularizer. The dual formulation of empirical risk minimization with f-divergence regularization (ERM-fDR) is introduced, and the solution of the dual optimization problem is connected to the notion of a normalization function, introduced as an implicit function. This dual approach leverages the Legendre-Fenchel transform and the implicit function theorem to obtain an ordinary differential equation (ODE) for the normalization function. Furthermore, the ODE and its properties provide a computationally efficient method to calculate the normalization function of the ERM-fDR solution under a mild condition.
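As background, the ERM-RER problem and its well-known Gibbs solution can be sketched as follows (generic notation, not necessarily the lecture's): for a reference measure Q, empirical risk L_z, and regularization parameter λ > 0,

```latex
% ERM with relative entropy regularization: minimize over
% probability measures P on the model space
\min_{P}\; \int \mathsf{L}_{z}(\theta)\,\mathrm{d}P(\theta)
  \;+\; \lambda\, D\!\left(P \,\middle\|\, Q\right),
\qquad \lambda > 0,
% whose unique solution is the Gibbs measure
\frac{\mathrm{d}P^{\star}}{\mathrm{d}Q}(\theta)
  \;=\; \frac{\exp\!\big(-\tfrac{1}{\lambda}\,\mathsf{L}_{z}(\theta)\big)}
             {\int \exp\!\big(-\tfrac{1}{\lambda}\,\mathsf{L}_{z}(\nu)\big)\,\mathrm{d}Q(\nu)}.
```

Because P* is defined through a density with respect to Q, it is absolutely continuous with respect to Q; this is the support collapse mentioned in the abstract. The logarithm of the denominator is the normalization function whose ODE characterization the lecture develops, there in the more general f-divergence case.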
References: ArXiv:2410.02833
Slides: PDF
Video: Open in Canal-U
14h45 - 15h30: Lecture 2: A Compressibility Approach to Generalization and (No) Memorization
Abdellatif Zaidi (Université Gustave Eiffel)
Abstract: The lecture is composed of two parts. In the first, I will present a novel framework for studying the generalization error of statistical learning algorithms through the lens of a variable-size compressibility of algorithms. In this framework, the generalization error of an algorithm is linked to a variable-size 'compression rate' of its input data. This is shown to yield bounds that depend on the empirical measure of the given input data at hand, rather than on its unknown distribution. Moreover, these general bounds are shown to subsume, and possibly improve over, several existing PAC-Bayes and data-dependent intrinsic-dimension-based bounds, which are recovered as special cases, thus unveiling the unifying character of our approach. For instance, a new data-dependent intrinsic-dimension-based bound is established, which connects the generalization error to the optimization trajectories and reveals various interesting connections with the rate-distortion dimension of a process, the Rényi information dimension of a process, and the metric mean dimension. The second part revolves around the important, and not yet fully understood, question of the relationship between generalization and data memorization in machine learning. Indeed, the intuition that good algorithms should extract only relevant information and discard all that is irrelevant, which is also supported by some theoretical works, is challenged by the enormous success of modern overparametrized deep neural networks. I will show how suitable use of the compressibility framework from the first part leads to bounds that are non-vacuous for certain instances of problems for which classic MI and CMI bounds fail, and allows one to characterize whether memorization is necessary for generalization for certain instances of problems, as was recently claimed.
If time permits, I will also discuss various other implications, in particular for sample-compression schemes, fingerprinting codes, and privacy attacks.
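For context, the classical mutual-information generalization bound of Xu and Raginsky, against which such frameworks are typically compared, states that if the loss is σ-sub-Gaussian under the data distribution, then for a sample S of n i.i.d. points and output hypothesis W,

```latex
\Big|\, \mathbb{E}\big[\, R(W) - \widehat{R}_{S}(W) \,\big] \Big|
  \;\le\; \sqrt{\frac{2\sigma^{2}\, I(S; W)}{n}},
```

where I(S; W) is the mutual information between the sample and the algorithm's output. Such bounds can be vacuous when I(S; W) is large or infinite (e.g., for deterministic algorithms), which is one motivation for data-dependent compressibility-based bounds.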
References: ArXiv:2303.05369 and ArXiv:2510.23485
Slides: PDF
Video: Open in Canal-U
15h30 - 15h45: Break
15h45 - 16h30: Lecture 3: What is the Long-Run Distribution of Stochastic Gradient Descent?
Panayotis Mertikopoulos (CNRS and Inria)
Abstract: Even though it's been around for more than 70 years, stochastic gradient descent (SGD) remains the go-to method for solving large-scale non-convex optimization problems and training deep learning models and networks. However, we know surprisingly little about the algorithm's long-run behavior—for example, which local minimizers are more likely to be observed in the long run (and by how much)? Using an approach based on the theory of large deviations and randomly perturbed dynamical systems, we show that the long-run distribution of SGD resembles the Boltzmann-Gibbs distribution of equilibrium thermodynamics with temperature equal to the method's step-size and energy levels determined by the problem's objective and the statistics of the noise entering the process. In particular, we show that, in the long run (i) the iterates of SGD spend an exponentially small amount of time away from the problem's critical region; (ii) any given critical point (or manifold thereof) is visited with probability that is exponentially proportional to its energy; (iii) minimizers are visited exponentially more often than non-minimizers; and (iv) SGD becomes exponentially concentrated around the problem's “ground state” (which does not always coincide with the minimum of the objective).
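The Boltzmann-Gibbs picture can be illustrated numerically. The sketch below (a minimal example of my own, not the lecture's code) runs constant-step SGD on a hypothetical 1-D double-well objective f(x) = (x² − 1)² with additive Gaussian gradient noise, and checks that after a burn-in period the iterates concentrate near the minimizers x = ±1 rather than the critical point x = 0:

```python
import random

# Hypothetical 1-D double-well objective f(x) = (x^2 - 1)^2,
# with minimizers at x = +/-1 and a local maximum at x = 0.
def grad(x):
    return 4.0 * x * (x * x - 1.0)

def sgd_trajectory(steps=200_000, eta=0.01, sigma=1.0, seed=0):
    """Constant-step SGD with additive Gaussian gradient noise."""
    rng = random.Random(seed)
    x = 0.5  # start between the critical point and a minimizer
    xs = []
    for _ in range(steps):
        noisy_grad = grad(x) + sigma * rng.gauss(0.0, 1.0)
        x -= eta * noisy_grad
        xs.append(x)
    return xs

xs = sgd_trajectory()
tail = xs[50_000:]  # discard burn-in
# Fraction of time spent within 0.5 of a minimizer:
near_min = sum(1 for x in tail
               if min(abs(x - 1.0), abs(x + 1.0)) < 0.5) / len(tail)
print(f"fraction of iterates near a minimizer: {near_min:.3f}")
```

With temperature η·σ² far below the energy barrier f(0) = 1, the fraction is essentially one: the chain spends only an exponentially small amount of time away from the critical region, consistent with points (i) and (iii) of the abstract.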
References: ArXiv:2406.09241
Slides: PDF
Video: Open in Canal-U
16h30 - 17h15: Lecture 4: Rethinking Generalisation: Beyond KL with Geometry and Comparators
Benjamin Guedj (Inria-London and UCL, UK)
Abstract: Generalisation is arguably one of the central problems in machine learning and foundational AI. Generalisation theory has traditionally relied on KL-based PAC-Bayesian bounds, which, despite their elegance, often obscure geometry and limit applicability. In this talk, I will present recent advances that move beyond traditional bounds. One line of work replaces KL with Wasserstein distances, yielding high-probability bounds valid for heavy-tailed losses and leading to new, optimisable learning objectives. Another line introduces a general comparator framework, showing how optimal bounds naturally arise from convex conjugates of cumulant generating functions, unifying and extending many classical results. Together, these perspectives highlight how rethinking divergences and comparators opens new directions in both theory and practice. I will conclude by discussing links with information theory and how these ideas might shape the next generation of PAC-Bayesian learning algorithms.
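As a point of reference for "beyond KL", the classical KL-based PAC-Bayes bound (Maurer's tightening of the McAllester and Seeger bounds) reads: with probability at least 1 − δ over an i.i.d. sample of size n, simultaneously for all posteriors ρ,

```latex
\Pr_{S \sim \mathcal{D}^{n}}\!\left[\,
  \forall \rho:\;
  \mathrm{kl}\!\left( \widehat{R}_{S}(\rho) \,\middle\|\, R(\rho) \right)
  \;\le\; \frac{ \mathrm{KL}(\rho \,\|\, \pi) \;+\; \ln\!\frac{2\sqrt{n}}{\delta} }{ n }
\,\right] \;\ge\; 1 - \delta,
```

where kl(·‖·) is the binary relative entropy, π is the prior, and R and R̂_S are the population and empirical risks. The Wasserstein and comparator-based bounds of the talk generalise, respectively, the KL term and the kl comparator in this expression.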
References: ArXiv:2310.10534, ArXiv:2306.04375, and ArXiv:2309.04381
Slides: PDF
Video: Open in Canal-U
17h15 - 17h45: Junior Lecture: Minimization and Maximization of Empirical Risks: Applications in Data Unlearning and Decentralized Learning
Yaiza Bermudez (Inria)
Abstract: This talk presents a novel view on the minimization and maximization of empirical risks with relative-entropy regularization. These complementary problems, referred to as ERMin-RER and ERMax-RER, respectively, are both variational problems enabling the construction of algorithms for: (a) exact data unlearning; and (b) centralized-level performance in decentralized learning.
First, a constructive method for exact unlearning in Gibbs algorithms is presented. The key idea is to solve an ERMax-RER problem with respect to the data to be removed, using the original algorithm as the reference measure. The resulting algorithm matches the distribution of the algorithm that would be obtained by retraining from scratch on the retained data (exact unlearning).
Second, centralized ERMin-RER performance is shown to be achievable in decentralized settings by exchanging not local datasets but locally computed Gibbs measures, which other clients use sequentially as reference measures.
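One back-of-the-envelope way to see why risk maximization can unlearn, assuming empirical risks add over disjoint datasets and the Gibbs form of the solutions (the precise construction is in the cited reports):

```latex
% Gibbs posterior trained on retained data z and forget data z':
\frac{\mathrm{d}P_{z \cup z'}}{\mathrm{d}Q}(\theta)
  \;\propto\; \exp\!\Big( -\tfrac{1}{\lambda}\big( \mathsf{L}_{z}(\theta)
    + \mathsf{L}_{z'}(\theta) \big) \Big).
% Risk maximization on z' with reference measure P_{z \cup z'}
% tilts the exponent in the opposite direction:
\frac{\mathrm{d}P^{\star}}{\mathrm{d}P_{z \cup z'}}(\theta)
  \;\propto\; \exp\!\Big( +\tfrac{1}{\lambda}\, \mathsf{L}_{z'}(\theta) \Big)
\quad\Longrightarrow\quad
\frac{\mathrm{d}P^{\star}}{\mathrm{d}Q}(\theta)
  \;\propto\; \exp\!\Big( -\tfrac{1}{\lambda}\, \mathsf{L}_{z}(\theta) \Big),
```

which is exactly the Gibbs posterior one would obtain by retraining from scratch on z alone.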
References: INRIA Technical Report 9608 and INRIA Technical Report 9610
Slides: PDF
Video: Open in Canal-U
These lectures are supported in part by the French National Research Agency (ANR) through the project ANR-21-CE25-0013 and the project ANR-22-PEFT-0010 of the France 2030 program PEPR Réseaux du Futur; and in part by the Agence de l'Innovation de Défense (AID) through the project UK-FR 2024352.