I am a researcher at IRIF ("Chargé de recherche"), affiliated with CNRS and
Université Paris Cité.
I was previously a Marie-Sklodowska-Curie fellow at IST Austria, hosted by
Monika Henzinger. I did my PhD in Sorbonne
Université, in Paris,
where I was very glad to be have been advised by Vincent Cohen-Addad and Christoph Dürr.
My thesis received the Gilles Kahn award from the Société Francaise
d'Informatique.
I am broadly interested in algorithms, with a particular attention
to Clustering problems for which I strive for
a clear picture of their multiple facets:
for instance, when is it possible to recover clusters, with what precision,
how much data is necessary?
I am also trying to understand the success of k-Means objective
function among practitioners. Is it because we love to assume we live in
a Gaussian world? Or because of Lloyd's algorithm? If you have an idea on that matter, I'd
love to discuss it with you!
Besides clustering, I am more generally interested in theory of problems arising from data analysis. I would like to design algorithms for those problems, with provable guarantee, for instances with constraints on the memory usage, privacy or robustness requirements.
I am actively doing popular science, trying to share knowledge from TCS to wide audiance. Duing my PhD, I worked at la Cité des sciences et de l'industrie, in Paris, and am now designing an animation on Privacy. Get in touch if you are interested!
We focus on data selection and consider the problem of finding the best representative subset of a dataset to train a machine learning model. We provide a new data selection approach based on k-means clustering and sensitivity sampling. Assuming embedding representation of the data and that the model loss is Hölder continuous with respect to these embeddings, we prove that our new approach allows to select a set of ``typical'' k + 1/eps^2 elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative (1 +/- eps)-factor and an additive eps*L*Phi_k, where Phi_k represents the k-means cost for the input data and L is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show that our sampling strategy can be used to define new sampling scores for regression, leading to a new active learning strategy that is comparatively simpler and faster than previous ones like leverage score.
In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role.
For the classic k-median and k-means problems, there are no known deterministic dimensionality reduction procedure
or coreset construction that avoid an exponential dependency on the input
dimension d, the precision parameter 1/eps or k.
Furthermore, there is no coreset construction that succeeds with
probability 1-1/n and whose size does not depend
on the number of input points, n.
This has led researchers in the area to ask what is the power of randomness for clustering sketches
[Feldman, WIREs Data Mining Knowl. Discov'20].
Similarly, the best approximation ratio achievable deterministically without a complexity exponential in the
dimension are Omega(1) for both k-median and k-means, even when allowing a complexity FPT in the number of clusters k.
This stands in sharp contrast with the (1+eps)-approximation achievable in that case, when allowing randomization.
In this paper, we provide deterministic sketches constructions for
clustering, whose size bounds are close to the best-known randomized
ones. We also construct a deterministic algorithm for computing (1+eps)-approximation to k-median and
k-means in high dimensional Euclidean spaces in time 2^(k^2/eps^O(1)) poly(nd), close to the best randomized complexity.
Furthermore, our new insights on sketches also yield a randomized coreset construction that uses uniform sampling,
that immediately improves over the recent results of [Braverman et al.
FOCS '22] by a factor k.
Given a set of n points in d dimensions, the Euclidean k-means problem (resp. the Euclidean k-median problem) consists of finding k centers such that the
sum of squared distances (resp. sum of distances) from every point to its closest center is minimized. The arguably most popular way of dealing with this
problem in the big data setting is to first compress the data by computing a weighted subset known as a coreset and then run any algorithm on this subset.
The guarantee of the coreset is that for any candidate solution, the
ratio between coreset cost and the cost of the original instance is less
than a (1+/- \eps) factor.
The current state of the art coreset size is O(min(k^2 \eps^{-2}, k
\eps^{-4}) for Euclidean k-means and O(min(k^2\eps^{-2}, k \eps^{-3}))for Euclidean k-median.
The best known lower bound for both problems is \Omega(k \eps^{-2}).
In this paper, we improve the upper bounds to
O(min(k^{3/2}\eps^{-2},k\eps^{-4})) for k-means
and O(min(k^{4/3}\eps^{-2}, k\eps^{-3})) for k-median. In particular,
ours is the first provable bound that breaks through the k^2 barrier
while retaining an optimal dependency on \eps.
We study the private k-median and k-means clustering problem in d dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy to implement algorithm, that is empirically competitive with state of the art non private methods. We prove that our method computes a solution with cost at most O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2), where \epsilon is the privacy guarantee. (The dimension term, d, can be replaced with O(\log k) using standard dimension reduction techniques). Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical, runs in near-linear, \tilde{O}(nkd), time and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular we show that our private algorithms can be implemented in logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other privacy clustering baselines.
We consider the problem of recovering communities in a random directed graph with planted communities.
To model real-world directed graphs such as the Twitter or Instagram graphs that exhibit very heterogeneous
degree sequences, we introduce the Degree-Heterogeneous Stochastic Block Model (DHSBM), a generalization of
the classic Stochastic Block Model (SBM), where the vertex set is partitioned into communities and each
vertex u has two (unknown) associated probabilities, p_{u} and q_{u},
p_{u} > q_{u}.
An arc from u to v is generated with probability p_{u} if u and v are in the same community and with probability q_{u} otherwise.
Given a graph generated from this model, the goal is to retrieve the
communities.
The DHSBM allows to generate graphs with planted communities while allowing heterogeneous
degree distributions, a quite important feature of real-world
networks.
In the case where there are two communities, we present an iterative greedy linear-time algorithm that
recovers them whenever min_{u} \frac{p_{u} - q_{u}}{\sqrt{p_{u}}} = \Omega(\sqrt{\log (n)/n}). We also show that,
up to a constant, this condition is necessary.
Our results also extend to the standard (undirected) SBM, where p_{u} = p and q_{u}= q for all nodes u.
Our algorithm presents the first linear-time algorithm that recovers exactly the communities at the asymptotic
information-theoretic threshold, improving over previous near-linear time spectral approaches.
Graph clustering is one of the most basic and popular unsupervised learning problems. Among the different formulations
of the problem, the modularity objective has been particularly successful in helping design impactful algorithms;
Most notably, the Louvain algorithm has become one of the most used algorithm for clustering graphs. Yet, one major
limitation of the Louvain algorithm is its sequential nature which makes it impractical in distributed environments and on massive datasets.
In this paper, we provide a parallel version of Louvain which works in the massively parallel computation model (MPC).
We show that it recovers the ground-truth clusters in the classic stochastic block model in only a constant number of
parallel rounds, and so for a wider regime of parameters than the
standard Louvain algorithm as shown recently in Cohen-Addad, Kosowski,
Mallmann-Trenn and Saulpic [NeurIPS 2020]
Given a set of points in a metric space, the (k, z)-clustering problem consists of finding a set of k points called centers,
such that the sum of distances raised to the power of z of every data point to its closest center is minimized.
Special cases include the famous k-median problem (z = 1) and k-means problem (z = 2).
The k-median and k-means problems are at the heart of modern data analysis and massive data applications have given raise
to the notion of coreset: a small (weighted) subset of the input point set preserving the cost of any solution to the problem
up to a multiplicative (1 +/- eps) factor, hence reducing from large to small scale the input to the problem.
While there has been an intensive effort to understand what is the best coreset size possible for both problems in
various metric spaces, there is still a significant gap between the state- of-the-art upper and lower bounds.
In this paper, we make progress on both upper and lower bounds, obtaining tight bounds for several cases, namely:
in finite n point general metrics, any coreset must consist of
Omega(k/eps^2 log n) points.
This improves on the Omega(k/eps log n) lower bound of Braverman, Jiang, Krauthgamer, and Wu [ICML'19] and matches the upper bounds
proposed for k-median by Feldman and Langberg [STOC'11] and k-means by Cohen-Addad, Saulpic, and Schwiegelshohn [STOC'21] up
to polylog(1/eps) factors. The dependency in k, n is therefore optimal. For doubling metrics with doubling constant D,
any coreset must consist of Omega(k/eps^2 D) points.
This matches the k-median and k-means upper bounds by Cohen-Addad, Saulpic, and Schwiegelshohn [STOC'21] up to polylog(1/eps) factors.
The dependency in k, D is therefore optimal. In d-dimensional Euclidean space, any coreset for (k, z) clustering
requires Omega(k/eps^2 ) points. This improves on the Omega(k/ eps) lower bound of Baker, Braverman, Huang, Jiang, Krauthgamer, and Wu [ICML'20]
for k-median and complements the Omega(k min(d, 2^{z/20} )) lower bound of Huang and Vishnoi [STOC'20].
We complement our lower bound for d-dimensional Euclidean space with the construction of a coreset of size
O(k/eps^2 · min(eps^{-z} , k)). This improves over the O(k^2 eps^{-4}) upper bound for general power of z proposed by Braverman Jiang,
Krauthgamer, and Wu [SODA'21] and over the O(k/eps^4 ) upper bound for k-median by Huang and Vishnoi [STOC'20].
In fact, ours is the first construction breaking through the eps^{-2} min(d, eps^-2 ) barrier inherent in all previous coreset constructions.
To do this, we employ a novel chaining based analysis that may be of independent interest.
Together our upper and lower bounds for k-median in Euclidean spaces are
tight up to a factor O(eps^{-1} polylog k/eps).
We present a new local-search algorithm for the k-median clustering problem. We show that local optima for this algorithm give a (2.836+\epsilon)-approximation; our result improves upon the (3+\epsilon)-approximate local-search algorithm of Arya et al. [STOC'01]. Moreover, a computer-aided analysis of a natural extension suggests that this approach may lead to an improvement over the best-known approximation guarantee for the problem. The new ingredient in our algorithm is the use of a potential function based on both the closest and second-closest facilities to each client. Specifically, the potential is the sum over all clients, of the distance of the client to its closest facility, plus (a small constant times) the truncated distance to its second-closest facility. We move from one solution to another only if the latter can be obtained by swapping a constant number of facilities, and has a smaller potential than the former. This refined potential allows us to avoid the bad local optima given by Arya et al. for the local-search algorithm based only on the cost of the solution.
We study in this paper the geometric (1, z)-clustering problem: given
n points in R^d, find the point x that minimizes the sum of
Euclidean distance, raised to the power z, over all input points. This
problem interpolates between the well-known Fermat-Weber problem -- or
geometric median problem-- where z = 1, and the Minimum Enclosing Ball
problem, where z = infinity.
Our contribution is the design of a precise estimator that sample only a
constant number of points. Namely, for any \epsilon > 0, we show that
sampling uniformly at random O(\epsilon^{-z-3}) input points is enough
to find a center such that the sum of distances to the power z to that
center is within a (1+\epsilon)-factor of the optimum. We also provide
a lower bound, showing that any such algorithm must sample at least
\Omega(\epsilon^{-z+1}) points.
This implies an algorithm that computes a (1+\epsilon)-approximation
running in time O(d \epsilon^{-z-3}), generalizing the result from
Cohen et al [STOC '16] to arbitrary z. This also implies a
(1+\epsilon)-approximation in the streaming setting, with memory
independent of the number of input points.
Given a metric space, the (k,z)-clustering problem consists of finding k
centers such that the sum of the of distances raised to the power z of
every point to its closest center is minimized. This encapsulates the
famous k-median (z=1) and k-means (z=2) clustering problems. Designing
small-space sketches of the data that approximately preserves the cost of
the solutions, also known as coresets, has been an important research
direction over the last 15 years.
In this paper, we present a new, simple coreset framework that
simultaneously improves upon the best known bounds for a large variety of
settings, ranging from Euclidean space, doubling metric, minor-free
metric, and the general metric cases.
A classic problem in machine learning and data analysis is to partition
the vertices of a network in such a way that vertices in the same set are densely connected and vertices in different sets are loosely connected.
In practice, the most popular approaches rely on local search algorithms;
not only for the ease of implementation and the efficiency, but also because of the accuracy of these methods on
many real world graphs. For example, the Louvain algorithm -- a local search based algorithm -- has quickly become
the method of choice for clustering in social networks.
However, explaining the success of these methods remains an open problem:
in the worst-case, the runtime can be up to $\Omega(n^2)$, much worse than what is typically observed in practice, and no guarantee on the quality of its output can be established.
The goal of this paper is to shed light on the inner-workings of Louvain; only if we understand Louvain, can we rely on it and further improve it.
To achieve this goal, we study the behavior of Louvain in the
famous two-bloc Stochastic Block Model, which has a clear ground-truth and serves as the standard testbed for graph clustering algorithms.
We provide valuable tools for the analysis of Louvain, but also for many other combinatorial algorithms. For example,
we show that the probability for a node to have more edges towards its own community is
$1/2 + \Omega( \min( \Delta(p-q)/\sqrt{np},1 ))$ in the SBM($n,p,q$), where $\Delta$
is the imbalance. Note that this bound is asymptotically tight and useful for the analysis of a wide range of
algorithms (Louvain, Kernighan-Lin, Simulated Annealing etc).
We study clustering problems such as k-Median, k-Means, and Facility Location in graphs of low highway dimension, which is a graph parameter modeling transportation networks. It was previously shown that approximation schemes for these problems exist, which either run in quasi-polynomial time (assuming constant highway dimension) [Feldmann et al. SICOMP 2018] or run in FPT time (parameterized by the number of clusters k, the highway dimension, and the approximation factor) [Becker et al. ESA~2018, Braverman et al. 2020]. In this paper we show that a polynomial-time approximation scheme (PTAS) exists (assuming constant highway dimension). We also show that the considered problems are NP-hard on graphs of highway dimension 1.
We consider classic clustering problems in fully dynamic data streams, where data elements can be both inserted and deleted. In this context, several parameters are of importance: (1) the quality of the solution after each insertion or deletion, (2) the time it takes to update the solution, and (3) how different consecutive solutions are. The question of obtaining efficient algorithms in this context for facility location, $k$-median and $k$-means has been raised in a recent paper by Hubert-Chan et al. [WWW'18] and also appears as a natural follow-up on the online model with recourse studied by Lattanzi and Vassilvitskii [ICML'17] (i.e.: in insertion-only streams). In this paper, we focus on general metric spaces and mainly on the facility location problem. We give an arguably simple algorithm that maintains a constant factor approximation, with O(nlog n) update time, and total recourse O(n). This improves over the naive algorithm which consists in recomputing a solution at each time step and that can take up to O(n^2) update time, and O(n^2) total recourse. These bounds are nearly optimal: in general metric space, inserting a point take O(n) times to describe the distances to other points, and we give a simple lower bound of O(n) for the recourse. Moreover, we generalize this result for the k-medians and k-means problems: our algorithm maintains a constant factor approximation in time O((n+k^2) polylog n). We complement our analysis with experiments showing that the cost of the solution maintained by our algorithm at any time t is very close to the cost of a solution obtained by quickly recomputing a solution from scratch at time t while having a much better running time.
We consider the classic Facility Location, k-Median, and k-Means problems in metric spaces of doubling dimension d. We give nearly linear-time approximation schemes for each problem. The complexity of our algorithms is near-linear in n, making a significant improvement over the state-of-the-art algorithms which run in time $n^(d/\eps)^O(d). Moreover, we show how to extend the techniques used to get the first efficient approximation schemes for the problems of prize-collecting k-Medians and k-Means, and efficient bicriteria approximation schemes for k-Medians with outliers, k-Means with outliers and k-Center
In this paper we study the dynamic versions of two basic graph problems: Minimum Dominating Set and its variant Minimum Connected Dominating Set. For those two problems, we present algorithms that maintain a solution under edge insertions and edge deletions in time O(Δ polylog n) per update, where Δ is the maximum vertex degree in the graph. In both cases, we achieve an approximation ratio of O(log n), which is optimal up to a constant factor (under the assumption that P≠NP). Although those two problems have been widely studied in the static and in the distributed settings, to the best of our knowledge we are the first to present efficient algorithms in the dynamic setting. As a further application of our approach, we also present an algorithm that maintains a Minimal Dominating Set in O(min(Δ,√m))
The concept of bounded highway dimension was developed to capture observed properties of the metrics of road networks. We show that a graph with bounded highway dimension, for any vertex, can be embedded into a a graph of bounded treewidth in such a way that the distance between u and v is preserved up to an additive error of ε times the distance from u or v to the selected vertex. We show that this theorem yields a PTAS for Bounded-Capacity Vehicle Routing in graphs of bounded highway dimension. In this problem, the input specifies a depot and a set of clients, each with a location and demand; the output is a set of depot-to-depot tours, where each client is visited by some tour and each tour covers at most Q units of client demand. Our PTAS can be extended to handle penalties for unvisited clients. We extend this embedding result to handle a set S of distinguished vertices. The treewidth depends on |S|, and the distance between u and v is preserved up to an additive error of ε times the distance from u and v to S. This embedding result implies a PTAS for Multiple Depot Bounded-Capacity Vehicle Routing: the tours can go from one depot to another. The embedding result also implies that, for fixed k, there is a PTAS for k-Center in graphs of bounded highway dimension. In this problem, the goal is to minimize d such that there exist k vertices (the centers) such that every vertex is within distance d of some center. Similarly, for fixed k, there is a PTAS for k-Median in graphs of bounded highway dimension. In this problem, the goal is to minimize the sum of distances to the k centers.
The Capacitated Vehicle Routing problem is a generalization of the Traveling Salesman problem in which a set of clients must be visited by a collection of capacitated tours. Each tour can visit at most Q clients and must start and end at a specified depot. We present the first approximation scheme for Capacitated Vehicle Routing for non-Euclidean metrics. Specifically we give a quasi-polynomial-time approximation scheme for Capacitated Vehicle Routing with fixed capacities on planar graphs. We also show how this result can be extended to bounded-genus graphs and polylogarithmic capacities, as well as to variations of the problem that include multiple depots and charging penalties for unvisited clients.