David Saulpic

Publications

Faster and Simpler Greedy Algorithm for k-Median and k-Means
Joint work with Max Dupré la Tour
To appear at ICALP 2026. [arXiv]

Abstract
Clustering problems such as k-means and k-median are staples of unsupervised learning, and many algorithmic techniques have been developed to tackle their numerous aspects.
In this paper, we focus on the class of greedy approximation algorithms, which has attracted less attention than its local-search or primal-dual counterparts. In particular, we study the recursive greedy algorithm developed by Mettu and Plaxton [SIAM J. Comp 2003]. We provide a simplification of the algorithm, allowing for faster implementation: our algorithm matches the state-of-the-art running time for computing a constant-factor approximation in Euclidean space and graph metrics, and, in addition, is the first near-linear-time algorithm to compute a polylogarithmic approximation in Euclidean space.
Improved Lower Bounds for Privacy under Continual Release
Joint work with Bardiya Aryanfard, Monika Henzinger, and A. R. Sricharan
To appear at PODS 2026. [arXiv]

Abstract
We study the problem of continually releasing statistics of an evolving dataset under differential privacy. In the event-level setting, we show the first polynomial lower bounds on the additive error for insertions-only graph problems such as maximum matching, degree histogram and k-core. This is an exponential improvement on the polylogarithmic lower bounds of Fichtenberger et al. [ESA 2021] for the former two problems, and these are the first continual release lower bounds for the latter. Our results run counter to the intuition that the difference between insertions-only vs fully dynamic updates causes the gap between polylogarithmic and polynomial additive error. We show that for maximum matching and k-core, allowing small multiplicative approximations is what brings the additive error down to polylogarithmic.
Beyond graph problems, our techniques also show that polynomial additive error is unavoidable for Simultaneous Norm Estimation in the insertions-only setting. When multiplicative approximations are allowed, we circumvent this lower bound by giving the first continual mechanism with polylogarithmic additive error under $(1+\xi)$-multiplicative approximations, for $\xi > 0$, for estimating all monotone symmetric norms simultaneously.
In the item-level setting, we show polynomial lower bounds on the product of the multiplicative and the additive error of continual mechanisms for a large range of graph problems. To the best of our knowledge, these are the first lower bounds for any differentially private continual release mechanism with multiplicative error. To obtain this, we prove a new lower bound on the product of multiplicative and additive error for 1-Way-Marginals, from which we reduce to continual graph problems. This generalizes the lower bounds of Hardt and Talwar [STOC 2010] and Bun et al. [STOC 2014] on the additive error for mechanisms with no multiplicative error.
Almost-Optimal Upper and Lower Bounds for Clustering in Low Dimensional Euclidean Spaces
Joint work with Vincent Cohen-Addad, Karthik C.S. and Chris Schwiegelshohn
To appear at SoCG 2026. [arXiv]

Abstract
The k-median and k-means clustering objectives are classic objectives for modeling clustering in a metric space. Given a set of points in a metric space, the goal of the k-median (resp. k-means) problem is to find k representative points so as to minimize the sum of the distances (resp. sum of squared distances) from each point to its closest representative. Cohen-Addad, Feldmann, and Saulpic [JACM'21] showed how to obtain a $(1+\eps)$-factor approximation in low-dimensional Euclidean metric for both the k-median and k-means problems in near-linear time $2^{(1/\eps)^{O(d^2)}} n \, \mathrm{polylog}(n)$ (where $d$ is the dimension and $n$ is the number of input points).
We improve this running time to $2^{\tilde O(1/\eps)^{d-1}} n$, and show an almost matching lower bound: under the Gap Exponential Time Hypothesis for 3-SAT, there is no $2^{o(1/\eps^{d-1})} n^{o(1)}$ algorithm achieving a $(1+\eps)$-approximation for k-means.
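As a concrete reading of the two objectives defined above, the following minimal Python sketch evaluates the k-median and k-means cost of a given set of representatives. The function name `clustering_cost` and the toy points are illustrative assumptions, not taken from the paper.

```python
import math

def clustering_cost(points, centers, z=1):
    """Sum over all points of (distance to the closest center) ** z.

    z=1 gives the k-median cost, z=2 gives the k-means cost.
    """
    return sum(min(math.dist(p, c) for c in centers) ** z for p in points)

# Toy instance (hypothetical data): three points, two representatives.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centers = [(0.0, 1.0), (10.0, 0.0)]

k_median_cost = clustering_cost(points, centers, z=1)
k_means_cost = clustering_cost(points, centers, z=2)
```

This only evaluates a candidate solution; finding good representatives is the hard part that the paper addresses.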
Near-Optimal Bounds for Parameterized Euclidean k-means
Joint work with Vincent Cohen-Addad, Karthik C.S. and Chris Schwiegelshohn
To appear at SoCG 2026. [arXiv]

Abstract
The k-means problem is a classic objective for modeling clustering in a metric space. Given a set of points in a metric space, the goal is to find k representative points so as to minimize the sum of the squared distances from each point to its closest representative. In this work, we study the approximability of k-means in Euclidean spaces parameterized by the number of clusters, k.
In seminal works, de la Vega, Karpinski, Kenyon, and Rabani [STOC'03] and Kumar, Sabharwal, and Sen [JACM'10] showed how to obtain a $(1+\eps)$-approximation for high-dimensional Euclidean k-means in time $2^{(k/\eps)^{O(1)}} dn^{O(1)}$.
In this work, we introduce a new fine-grained hypothesis called Exponential Time for Expanders Hypothesis (XXH) which roughly asserts that there are no non-trivial exponential time approximation algorithms for the vertex cover problem on near perfect vertex expanders. Assuming XXH, we close the above long line of work on approximating Euclidean k-means by showing that there is no $2^{(k/\eps)^{1-o(1)}} n^{O(1)}$ time algorithm achieving a $(1+\eps)$-approximation for k-means in Euclidean space. This lower bound is tight as it matches the algorithm given by Feldman, Monemizadeh, and Sohler [SoCG'07] whose runtime is $2^{\tilde O(k/\eps)} + O(nkd)$.
Furthermore, assuming XXH, we show that the seminal $O(n^{kd+1})$ runtime exact algorithm of Inaba, Katoh, and Imai [SoCG'94] for k-means is optimal for small values of k.
Differentially Private Federated $k$-Means Clustering with Server-Side Data
Joint work with Jonathan Scott and Christoph Lampert
ICML 2025. [arXiv]

Abstract
Clustering is a cornerstone of data analysis that is particularly suited to identifying coherent subgroups or substructures in unlabeled data, as are generated continuously in large amounts these days.
However, in many cases traditional clustering methods are not applicable, because data are increasingly being produced and stored in a distributed way, e.g., on edge devices, and privacy concerns prevent them from being transferred to a central server.
To address this challenge, we present FedDP-KMeans, a new algorithm for $k$-means clustering that is fully-federated as well as differentially private.
Our approach leverages (potentially small and out-of-distribution) server-side data to overcome the primary challenge of differentially private clustering methods: the need for a good initialization.
Combining our initialization with a simple federated DP-Lloyds algorithm, we obtain an algorithm that achieves excellent results on synthetic and real-world benchmark tasks. Our code can be found here.
We also provide a theoretical analysis of our method that provides bounds on the convergence speed and cluster identification success.
Estimating the Electoral Consequences of Legislative Redistricting in France
Joint work with Evripidis Bampis, Thomas Ehrhard, Bruno Escoffier, Claire Mathieu and Fanny Pascual
FAccT 2025. [HAL]

Abstract
We study in this paper the influence that redistricting can have on the result of elections. First, we propose a definition of diversity of the distribution of legal electoral maps of a state, to capture the gerrymandering potential; then, for a specific electoral map of a state, we propose a definition of outliers (positive outlier or negative outlier with respect to a given political party), to identify the states where the electoral map ought to be audited.
We apply this approach to the French system, which has legal constraints very different from those of the U.S., much more granular and much less balanced. We first show that despite the legal constraints, there is a large diversity of maps, and as a consequence it is possible to draw the maps in order to favor one specific party in the aggregate (over all states). Then, we show that many states are positive or negative outliers, with a large imbalance between different parties. This calls for a specific audit of redistricting in those states, to examine whether the current map is indeed drawn in a way to advantage or disadvantage some party.
This is the first computer study of gerrymandering potential in France.
A Tight VC-Dimension Analysis of Clustering Coresets with Applications
Joint work with Vincent Cohen-Addad, Andrew Draganov, Matteo Russo, and Chris Schwiegelshohn
SODA 2025. [arXiv]

Abstract
We consider coresets for k-clustering problems, where the goal is to assign points to centers minimizing powers of distances. A popular example is the k-median objective \sum_{p \in P}\min_{c\in C}dist(p,c). Given a point set P, a coreset is a small weighted subset that approximates the cost of P for all candidate solutions C up to a (1 ± ε) multiplicative factor. In this paper, we give a sharp VC-dimension based analysis for coreset construction. As a consequence, we obtain improved k-median coreset bounds for the following metrics: Coresets of size O(k/ε^2) for shortest path metrics in planar graphs, improving over the bounds O(k/ε^6) by [Cohen-Addad, Saulpic, Schwiegelshohn, STOC'21] and O(k^2/ε^4) by [Braverman, Jiang, Krauthgamer, Wu, SODA'21]. Coresets of size O(kd\ell/ε^2 \log m) for clustering d-dimensional polygonal curves of length at most m with curves of length at most \ell with respect to Fréchet metrics, improving over the bounds O(k^3d\ell/ε^3 \log m) by [Braverman, Cohen-Addad, Jiang, Krauthgamer, Schwiegelshohn, Toftrup, and Wu, FOCS'22] and O(k^2d\ell/ε^2 \log m \log |P|) by [Conradi, Kolbe, Psarros, Rohde, SoCG'24].
Sensitivity Sampling for k-Means: Worst Case and Stability Optimal Coreset Bounds
Joint work with Nikhil Bansal, Vincent Cohen-Addad, Milind Prabhu, and Chris Schwiegelshohn
FOCS 2024. [arXiv] [Milind's talk at FOCS]

Abstract
Coresets are arguably the most popular compression paradigm for center-based clustering objectives such as k-means. Given a point set P, a coreset is a small, weighted summary that preserves the cost of all candidate solutions S up to a (1 +/- eps) factor. For k-means in d-dimensional Euclidean space the cost for solution S is sum_{p in P} min_{s in S} ||p-s||^2.
A very popular method for coreset construction, both in theory and practice, is Sensitivity Sampling, where points are sampled in proportion to their importance. We show that Sensitivity Sampling yields optimal coresets of size \tilde O(k/eps^2 min(\sqrt k, eps^{-2})) for worst-case instances. Uniquely among all known coreset algorithms, for well-clusterable data sets with Omega(1) cost stability, Sensitivity Sampling gives coresets of size O(k/eps^2), improving over the worst-case lower bound. Notably, Sensitivity Sampling does not have to know the cost stability in order to exploit it: It is appropriately sensitive to the clusterability of the data set while being oblivious to it.
We also show that any coreset for stable instances consisting of only input points must have size Omega(k/eps^2). Our results for Sensitivity Sampling also extend to the k-median problem, and more general metric spaces.
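To illustrate the idea of sampling points in proportion to their importance, here is a minimal sketch of a textbook-style sensitivity sampling step for k-means. It assumes an approximate solution `centers` is already available, and the importance score used (share of the total cost plus inverse cluster size) is an assumption that may differ from the exact distribution analyzed in the paper. All names are hypothetical.

```python
import math
import random

def sensitivity_sample(points, centers, m, rng):
    """Return m (point, weight) pairs drawn with sensitivity-style probabilities."""
    # Assign each point to its closest center and record its squared distance.
    assign, cost = [], []
    for p in points:
        d2 = [math.dist(p, c) ** 2 for c in centers]
        j = min(range(len(centers)), key=d2.__getitem__)
        assign.append(j)
        cost.append(d2[j])
    total = sum(cost)
    sizes = [assign.count(j) for j in range(len(centers))]
    # Importance proxy (assumed): share of total cost + 1 / size of own cluster.
    scores = [
        (cost[i] / total if total > 0 else 0.0) + 1.0 / sizes[assign[i]]
        for i in range(len(points))
    ]
    norm = sum(scores)
    probs = [s / norm for s in scores]
    idx = rng.choices(range(len(points)), weights=probs, k=m)
    # Weight 1 / (m * prob) makes the weighted cost an unbiased estimator.
    return [(points[i], 1.0 / (m * probs[i])) for i in idx]

# Toy usage with two well-separated pairs of points (hypothetical data).
pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
sample = sensitivity_sample(pts, [(0.5, 0.0), (10.5, 0.0)], 3, random.Random(0))
```

The sketch takes the sample size m as a free parameter; choosing it according to the bounds stated above is what the analysis in the paper is about.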
Fully Dynamic k-Means Coreset in Near-Optimal Update Time.
Joint work with Max Dupré la Tour and Monika Henzinger
ESA 2024. [arXiv]

Abstract
We study in this paper the problem of maintaining a solution to k-median and k-means clustering in a fully dynamic setting. To do so, we present an algorithm to efficiently maintain a coreset, a compressed version of the dataset, that allows easy computation of a clustering solution at query time. Our coreset algorithm has near-optimal update time of O(k) in general metric spaces, which reduces to O(d) in the Euclidean space R^d. The query time is O(k^2) in general metrics, and O(kd) in R^d.
To maintain a constant-factor approximation for k-median and k-means clustering in Euclidean space, this directly leads to an algorithm with update time O(d) and query time O(kd+k^2). To maintain an O(polylog k)-approximation, the query time is reduced to O(kd).
Making old things new: a unified algorithm for differentially private clustering
Joint work with Max Dupré la Tour and Monika Henzinger
Oral presentation at ICML 2024 (top 2%). [arXiv]
Sensitivity Sampling for Coreset-Based Data Selection
Joint work with Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Woodruff and Michael Wunder
Poster presentation at ICML 2024. [arXiv]

Abstract

We focus on data selection and consider the problem of finding the best representative subset of a dataset to train a machine learning model. We provide a new data selection approach based on k-means clustering and sensitivity sampling. Assuming an embedding representation of the data and that the model loss is Hölder continuous with respect to these embeddings, we prove that our new approach allows one to select a set of "typical" k + 1/eps^2 elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative (1 +/- eps)-factor and an additive eps*L*Phi_k, where Phi_k represents the k-means cost for the input data and L is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show that our sampling strategy can be used to define new sampling scores for regression, leading to a new active learning strategy that is comparatively simpler and faster than previous ones like leverage score.
Settling Time vs. Accuracy Tradeoffs for Clustering Big Data
Joint work with Andrew Draganov and Chris Schwiegelshohn
SIGMOD 2024. [arXiv], [Proc. ACM Manag. Data]
Experimental Evaluation of Fully Dynamic k-Means via Coresets
Joint work with Monika Henzinger and Leonhard Sidl
ALENEX 2024. [arXiv]
Deterministic Clustering in High Dimensional Spaces: Sketches and Approximation
Joint work with Vincent Cohen-Addad and Chris Schwiegelshohn
FOCS 2023, invited talk at HALG 2024! [arXiv]

Abstract

In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role. For the classic k-median and k-means problems, there is no known deterministic dimensionality reduction procedure or coreset construction that avoids an exponential dependency on the input dimension d, the precision parameter 1/eps or k. Furthermore, there is no coreset construction that succeeds with probability 1-1/n and whose size does not depend on the number of input points, n. This has led researchers in the area to ask what is the power of randomness for clustering sketches [Feldman, WIREs Data Mining Knowl. Discov'20]. Similarly, the best approximation ratio achievable deterministically without a complexity exponential in the dimension is Omega(1) for both k-median and k-means, even when allowing a complexity FPT in the number of clusters k. This stands in sharp contrast with the (1+eps)-approximation achievable in that case, when allowing randomization.
In this paper, we provide deterministic sketch constructions for clustering, whose size bounds are close to the best-known randomized ones. We also construct a deterministic algorithm for computing a (1+eps)-approximation to k-median and k-means in high dimensional Euclidean spaces in time 2^{k^2/eps^{O(1)}} poly(nd), close to the best randomized complexity.
Furthermore, our new insights on sketches also yield a randomized coreset construction that uses uniform sampling and immediately improves over the recent results of [Braverman et al. FOCS '22] by a factor k.
Improved Coresets for Euclidean k-Means
Joint work with Vincent Cohen-Addad, Kasper Green Larsen, Chris Schwiegelshohn and Omar Ali Sheikh-Omar
NeurIPS 2022. [arXiv]

Abstract

Given a set of n points in d dimensions, the Euclidean k-means problem (resp. the Euclidean k-median problem) consists of finding k centers such that the sum of squared distances (resp. sum of distances) from every point to its closest center is minimized. The arguably most popular way of dealing with this problem in the big data setting is to first compress the data by computing a weighted subset known as a coreset and then run any algorithm on this subset.
The guarantee of the coreset is that for any candidate solution, the coreset cost is within a (1 +/- \eps) factor of the cost of the original instance. The current state of the art coreset size is O(min(k^2 \eps^{-2}, k \eps^{-4})) for Euclidean k-means and O(min(k^2 \eps^{-2}, k \eps^{-3})) for Euclidean k-median. The best known lower bound for both problems is \Omega(k \eps^{-2}).
In this paper, we improve the upper bounds to O(min(k^{3/2}\eps^{-2},k\eps^{-4})) for k-means and O(min(k^{4/3}\eps^{-2}, k\eps^{-3})) for k-median. In particular, ours is the first provable bound that breaks through the k^2 barrier while retaining an optimal dependency on \eps.
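The coreset guarantee described above can be checked empirically on a finite list of candidate solutions. The sketch below is only such a sanity check under that assumption: the real guarantee quantifies over all candidate solutions, which no finite test can certify. The names `weighted_cost` and `max_distortion` are hypothetical.

```python
import math

def weighted_cost(weighted_points, centers, z=2):
    """Weighted sum of (distance to closest center) ** z; z=2 is k-means."""
    return sum(
        w * min(math.dist(p, c) for c in centers) ** z
        for p, w in weighted_points
    )

def max_distortion(points, coreset, candidates, z=2):
    """Largest |coreset cost / full cost - 1| over the given candidate solutions."""
    full = [(p, 1.0) for p in points]
    worst = 0.0
    for centers in candidates:
        full_cost = weighted_cost(full, centers, z)
        if full_cost == 0:
            continue  # ratio undefined when the full cost is zero
        ratio = weighted_cost(coreset, centers, z) / full_cost
        worst = max(worst, abs(ratio - 1.0))
    return worst

# Merging duplicate points into one weighted point loses nothing.
points = [(0.0, 0.0), (0.0, 0.0), (4.0, 0.0)]
coreset = [((0.0, 0.0), 2.0), ((4.0, 0.0), 1.0)]
candidates = [[(1.0, 0.0)], [(0.0, 0.0), (3.0, 0.0)]]
distortion = max_distortion(points, coreset, candidates)
```

A weighted set is an eps-coreset with respect to the tested candidates exactly when the returned distortion is at most eps.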
Scalable Differentially Private Clustering via Hierarchically Separated Trees
Joint work with Vincent Cohen-Addad, Alessandro Epasto, Silvio Lattanzi, Vahab Mirrokni, Andres Munoz Medina, Chris Schwiegelshohn and Sergei Vassilvitskii
KDD 2022. [arXiv]

Abstract

We study the private k-median and k-means clustering problem in d-dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy-to-implement algorithm that is empirically competitive with state-of-the-art non-private methods. We prove that our method computes a solution with cost at most O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2), where \epsilon is the privacy guarantee. (The dimension term, d, can be replaced with O(\log k) using standard dimension reduction techniques). Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical, runs in near-linear, \tilde{O}(nkd), time and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular we show that our private algorithms can be implemented in a logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other private clustering baselines.
Community Recovery in the Degree-Heterogeneous Stochastic Block Model
Joint work with Vincent Cohen-Addad and Frederik Mallmann-Trenn
COLT 2022

Abstract

We consider the problem of recovering communities in a random directed graph with planted communities. To model real-world directed graphs such as the Twitter or Instagram graphs that exhibit very heterogeneous degree sequences, we introduce the Degree-Heterogeneous Stochastic Block Model (DHSBM), a generalization of the classic Stochastic Block Model (SBM), where the vertex set is partitioned into communities and each vertex u has two (unknown) associated probabilities, p_u and q_u, p_u > q_u. An arc from u to v is generated with probability p_u if u and v are in the same community and with probability q_u otherwise. Given a graph generated from this model, the goal is to retrieve the communities.

The DHSBM makes it possible to generate graphs with planted communities while allowing heterogeneous degree distributions, an important feature of real-world networks.

In the case where there are two communities, we present an iterative greedy linear-time algorithm that recovers them whenever min_u \frac{p_u - q_u}{\sqrt{p_u}} = \Omega(\sqrt{\log (n)/n}). We also show that, up to a constant, this condition is necessary. Our results also extend to the standard (undirected) SBM, where p_u = p and q_u = q for all nodes u. Our algorithm is the first linear-time algorithm that exactly recovers the communities at the asymptotic information-theoretic threshold, improving over previous near-linear time spectral approaches.
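The generative process of the DHSBM described above can be sketched in a few lines of Python. This is a direct quadratic-time sampler written for illustration; the function name `sample_dhsbm` and its argument layout are assumptions, and the recovery algorithm itself is not shown.

```python
import random

def sample_dhsbm(community, p, q, rng):
    """Sample the arcs of a directed graph from the DHSBM.

    community[u] is the community label of vertex u; an arc (u, v) appears
    with probability p[u] if u and v share a community, and q[u] otherwise.
    """
    n = len(community)
    arcs = []
    for u in range(n):
        for v in range(n):
            if u == v:
                continue  # no self-loops
            prob = p[u] if community[u] == community[v] else q[u]
            if rng.random() < prob:
                arcs.append((u, v))
    return arcs

# Toy usage: two communities with heterogeneous per-vertex probabilities.
community = [0, 0, 1, 1]
p = [0.9, 0.5, 0.8, 0.6]
q = [0.1, 0.05, 0.2, 0.1]
arcs = sample_dhsbm(community, p, q, random.Random(0))
```

Setting p[u] = p and q[u] = q for every vertex gives a directed analogue of the standard SBM mentioned above.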
A Massively Parallel Modularity-Maximizing Algorithm With Provable Guarantees
Joint work with Vincent Cohen-Addad and Frederik Mallmann-Trenn
PODC 2022

Abstract

Graph clustering is one of the most basic and popular unsupervised learning problems. Among the different formulations of the problem, the modularity objective has been particularly successful in helping design impactful algorithms; most notably, the Louvain algorithm has become one of the most used algorithms for clustering graphs. Yet, one major limitation of the Louvain algorithm is its sequential nature, which makes it impractical in distributed environments and on massive datasets.

In this paper, we provide a parallel version of Louvain which works in the massively parallel computation model (MPC). We show that it recovers the ground-truth clusters in the classic stochastic block model in only a constant number of parallel rounds, and does so for a wider regime of parameters than the standard Louvain algorithm, as shown recently in Cohen-Addad, Kosowski, Mallmann-Trenn and Saulpic [NeurIPS 2020].
Towards Optimal Lower Bounds for k-median and k-means Coresets
Joint work with Vincent Cohen-Addad, Kasper Green Larsen and Chris Schwiegelshohn
STOC 2022. [arXiv]

Abstract

Given a set of points in a metric space, the (k, z)-clustering problem consists of finding a set of k points called centers, such that the sum of distances raised to the power of z of every data point to its closest center is minimized. Special cases include the famous k-median problem (z = 1) and k-means problem (z = 2). The k-median and k-means problems are at the heart of modern data analysis and massive data applications have given rise to the notion of coreset: a small (weighted) subset of the input point set preserving the cost of any solution to the problem up to a multiplicative (1 +/- eps) factor, hence reducing from large to small scale the input to the problem. While there has been an intensive effort to understand what is the best coreset size possible for both problems in various metric spaces, there is still a significant gap between the state-of-the-art upper and lower bounds.
In this paper, we make progress on both upper and lower bounds, obtaining tight bounds for several cases, namely: in finite n point general metrics, any coreset must consist of Omega(k/eps^2 log n) points. This improves on the Omega(k/eps log n) lower bound of Braverman, Jiang, Krauthgamer, and Wu [ICML'19] and matches the upper bounds proposed for k-median by Feldman and Langberg [STOC'11] and k-means by Cohen-Addad, Saulpic, and Schwiegelshohn [STOC'21] up to polylog(1/eps) factors. The dependency in k, n is therefore optimal. For doubling metrics with doubling constant D, any coreset must consist of Omega(k/eps^2 D) points. This matches the k-median and k-means upper bounds by Cohen-Addad, Saulpic, and Schwiegelshohn [STOC'21] up to polylog(1/eps) factors. The dependency in k, D is therefore optimal. In d-dimensional Euclidean space, any coreset for (k, z) clustering requires Omega(k/eps^2 ) points. This improves on the Omega(k/ eps) lower bound of Baker, Braverman, Huang, Jiang, Krauthgamer, and Wu [ICML'20] for k-median and complements the Omega(k min(d, 2^{z/20} )) lower bound of Huang and Vishnoi [STOC'20].
We complement our lower bound for d-dimensional Euclidean space with the construction of a coreset of size O(k/eps^2 · min(eps^{-z}, k)). This improves over the O(k^2 eps^{-4}) upper bound for general power of z proposed by Braverman, Jiang, Krauthgamer, and Wu [SODA'21] and over the O(k/eps^4) upper bound for k-median by Huang and Vishnoi [STOC'20]. In fact, ours is the first construction breaking through the eps^{-2} min(d, eps^{-2}) barrier inherent in all previous coreset constructions. To do this, we employ a novel chaining based analysis that may be of independent interest. Together, our upper and lower bounds for k-median in Euclidean spaces are tight up to a factor O(eps^{-1} polylog k/eps).
An Improved Local Search Algorithm for k-Median
Joint work with Vincent Cohen-Addad, Anupam Gupta, Lunjia Hu and Hoon Oh
SODA 2022. [arXiv]

Abstract

We present a new local-search algorithm for the k-median clustering problem. We show that local optima for this algorithm give a (2.836+\epsilon)-approximation; our result improves upon the (3+\epsilon)-approximate local-search algorithm of Arya et al. [STOC'01]. Moreover, a computer-aided analysis of a natural extension suggests that this approach may lead to an improvement over the best-known approximation guarantee for the problem. The new ingredient in our algorithm is the use of a potential function based on both the closest and second-closest facilities to each client. Specifically, the potential is the sum over all clients, of the distance of the client to its closest facility, plus (a small constant times) the truncated distance to its second-closest facility. We move from one solution to another only if the latter can be obtained by swapping a constant number of facilities, and has a smaller potential than the former. This refined potential allows us to avoid the bad local optima given by Arya et al. for the local-search algorithm based only on the cost of the solution.
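The potential function described above can be sketched as follows. The abstract does not specify the constant or how the second-closest distance is truncated, so both are assumptions here: `alpha` stands for the small constant and the truncation is taken to be `min(d2, cap * d1)` with a hypothetical factor `cap`. The paper's actual choices may differ.

```python
def potential(clients, facilities, dist, alpha=0.1, cap=3.0):
    """Sum over clients of d1 + alpha * truncated(d2).

    d1 and d2 are the distances to the closest and second-closest open
    facility. The truncation min(d2, cap * d1) and the default values of
    alpha and cap are illustrative assumptions, not taken from the paper.
    Requires at least two facilities.
    """
    total = 0.0
    for c in clients:
        ds = sorted(dist(c, f) for f in facilities)
        d1, d2 = ds[0], ds[1]
        total += d1 + alpha * min(d2, cap * d1)
    return total

# Toy usage on the real line (hypothetical data).
line_dist = lambda a, b: abs(a - b)
value = potential(clients=[0.0, 4.0], facilities=[0.0, 10.0], dist=line_dist)
```

A local-search step would then accept a swap of a constant number of facilities only if it lowers this value, rather than the plain k-median cost.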
Improved Coresets and Sublinear Algorithms for Power Means in Euclidean Spaces.
Joint work with Vincent Cohen-Addad and Chris Schwiegelshohn
Spotlight presentation at NeurIPS 2021. [Paper], [Online presentation at NeurIPS]

Abstract

We study in this paper the geometric (1, z)-clustering problem: given n points in R^d, find the point x that minimizes the sum of Euclidean distances, raised to the power z, over all input points. This problem interpolates between the well-known Fermat-Weber problem -- or geometric median problem -- where z = 1, and the Minimum Enclosing Ball problem, where z = infinity.

Our contribution is the design of a precise estimator that samples only a constant number of points. Namely, for any \epsilon > 0, we show that sampling uniformly at random O(\epsilon^{-z-3}) input points is enough to find a center such that the sum of distances to the power z to that center is within a (1+\epsilon)-factor of the optimum. We also provide a lower bound, showing that any such algorithm must sample at least \Omega(\epsilon^{-z+1}) points.

This implies an algorithm that computes a (1+\epsilon)-approximation running in time O(d \epsilon^{-z-3}), generalizing the result from Cohen et al [STOC '16] to arbitrary z. This also implies a (1+\epsilon)-approximation in the streaming setting, with memory independent of the number of input points.
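The sample-then-solve idea can be illustrated in the special case z = 2, where the minimizer of the sampled objective is simply the mean of the sample. This is only a sketch under that restriction: for general z one would need a convex solver on the sample, which is omitted, and the sample size `m` is left as a free parameter rather than set from the bound above. All names are hypothetical.

```python
import math
import random

def power_mean_cost(points, x, z):
    """Sum over all input points of (Euclidean distance to x) ** z."""
    return sum(math.dist(p, x) ** z for p in points)

def sampled_center_z2(points, m, rng):
    """Draw m points uniformly at random (with replacement) and return
    their mean, which minimizes the z = 2 objective on the sample."""
    sample = rng.choices(points, k=m)
    dim = len(points[0])
    return tuple(sum(p[i] for p in sample) / m for i in range(dim))

# Toy usage (hypothetical data): compare the sampled center to the true mean.
rng = random.Random(0)
points = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(1000)]
center = sampled_center_z2(points, m=50, rng=rng)
true_mean = tuple(sum(p[i] for p in points) / len(points) for i in range(2))
ratio = power_mean_cost(points, center, 2) / power_mean_cost(points, true_mean, 2)
```

Since the true mean is optimal for z = 2, `ratio` is at least 1, and it measures how much is lost by looking at only m points.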
A New Coreset Framework for Clustering.
Joint work with Vincent Cohen-Addad and Chris Schwiegelshohn
STOC 2021. [arXiv] [Long presentation at IRIF] [Short presentation at STOC]

Abstract

Given a metric space, the (k,z)-clustering problem consists of finding k centers such that the sum of the distances raised to the power z of every point to its closest center is minimized. This encapsulates the famous k-median (z=1) and k-means (z=2) clustering problems. Designing small-space sketches of the data that approximately preserve the cost of the solutions, also known as coresets, has been an important research direction over the last 15 years.
In this paper, we present a new, simple coreset framework that simultaneously improves upon the best known bounds for a large variety of settings, including Euclidean spaces, doubling metrics, minor-free metrics, and general metrics.
On the Power of Louvain for Graph Clustering.
Joint work with Vincent Cohen-Addad, Adrian Kosowski and Frederik Mallmann-Trenn
NeurIPS 2020. [Paper]

Abstract

A classic problem in machine learning and data analysis is to partition the vertices of a network in such a way that vertices in the same set are densely connected and vertices in different sets are loosely connected.
In practice, the most popular approaches rely on local search algorithms; not only for the ease of implementation and the efficiency, but also because of the accuracy of these methods on many real world graphs. For example, the Louvain algorithm -- a local search based algorithm -- has quickly become the method of choice for clustering in social networks. However, explaining the success of these methods remains an open problem: in the worst-case, the runtime can be up to $\Omega(n^2)$, much worse than what is typically observed in practice, and no guarantee on the quality of its output can be established.
The goal of this paper is to shed light on the inner workings of Louvain; only if we understand Louvain, can we rely on it and further improve it. To achieve this goal, we study the behavior of Louvain in the famous two-block Stochastic Block Model, which has a clear ground-truth and serves as the standard testbed for graph clustering algorithms. We provide valuable tools for the analysis of Louvain, but also for many other combinatorial algorithms. For example, we show that the probability for a node to have more edges towards its own community is $1/2 + \Omega( \min( \Delta(p-q)/\sqrt{np},1 ))$ in the SBM($n,p,q$), where $\Delta$ is the imbalance. Note that this bound is asymptotically tight and useful for the analysis of a wide range of algorithms (Louvain, Kernighan-Lin, Simulated Annealing etc).
Polynomial Time Approximation Schemes for Clustering in Low Highway Dimension Graphs.
Joint work with Andreas Emil Feldmann
ESA 2020, invited to JCSS. [arXiv], [Online presentation at ESA]

Abstract

We study clustering problems such as k-Median, k-Means, and Facility Location in graphs of low highway dimension, which is a graph parameter modeling transportation networks. It was previously shown that approximation schemes for these problems exist, which either run in quasi-polynomial time (assuming constant highway dimension) [Feldmann et al. SICOMP 2018] or run in FPT time (parameterized by the number of clusters k, the highway dimension, and the approximation factor) [Becker et al. ESA 2018, Braverman et al. 2020]. In this paper we show that a polynomial-time approximation scheme (PTAS) exists (assuming constant highway dimension). We also show that the considered problems are NP-hard on graphs of highway dimension 1.
Fully Dynamic Consistent Facility Location
Joint work with Vincent Cohen-Addad, Niklas Hjuler, Nikos Parotsidis and Chris Schwiegelshohn
NeurIPS 2019. [Paper]

Abstract

We consider classic clustering problems in fully dynamic data streams, where data elements can be both inserted and deleted. In this context, several parameters are of importance: (1) the quality of the solution after each insertion or deletion, (2) the time it takes to update the solution, and (3) how different consecutive solutions are. The question of obtaining efficient algorithms in this context for facility location, $k$-median and $k$-means has been raised in a recent paper by Hubert-Chan et al. [WWW'18] and also appears as a natural follow-up on the online model with recourse studied by Lattanzi and Vassilvitskii [ICML'17] (i.e.: in insertion-only streams). In this paper, we focus on general metric spaces and mainly on the facility location problem. We give an arguably simple algorithm that maintains a constant factor approximation, with O(n log n) update time, and total recourse O(n). This improves over the naive algorithm, which consists in recomputing a solution at each time step and can take up to O(n^2) update time and O(n^2) total recourse. These bounds are nearly optimal: in general metric spaces, inserting a point takes O(n) time to describe the distances to other points, and we give a simple lower bound of Ω(n) for the recourse. Moreover, we generalize this result for the k-medians and k-means problems: our algorithm maintains a constant factor approximation in time O((n+k^2) polylog n). We complement our analysis with experiments showing that the cost of the solution maintained by our algorithm at any time t is very close to the cost of a solution obtained by quickly recomputing a solution from scratch at time t while having a much better running time.
Linear-Time Approximation Schemes for Clustering in Doubling Metrics
Joint work with Vincent Cohen-Addad and Andreas Emil Feldmann
FOCS 2019, JACM. [arXiv], [Vincent's talk at ICERM]

Abstract

We consider the classic Facility Location, k-Median, and k-Means problems in metric spaces of doubling dimension d. We give nearly linear-time approximation schemes for each problem. The complexity of our algorithms is near-linear in n, making a significant improvement over the state-of-the-art algorithms which run in time $n^{(d/\varepsilon)^{O(d)}}$. Moreover, we show how to extend the techniques used to get the first efficient approximation schemes for the problems of prize-collecting k-Medians and k-Means, and efficient bicriteria approximation schemes for k-Medians with outliers, k-Means with outliers and k-Center.
Dominating Sets and Connected Dominating Sets in Dynamic Graphs.
Joint work with Niklas Hjuler, Giuseppe F. Italiano and Nikos Parotsidis
STACS 2019. [arXiv]

Abstract

In this paper we study the dynamic versions of two basic graph problems: Minimum Dominating Set and its variant Minimum Connected Dominating Set. For those two problems, we present algorithms that maintain a solution under edge insertions and edge deletions in time O(Δ polylog n) per update, where Δ is the maximum vertex degree in the graph. In both cases, we achieve an approximation ratio of O(log n), which is optimal up to a constant factor (under the assumption that P≠NP). Although those two problems have been widely studied in the static and in the distributed settings, to the best of our knowledge we are the first to present efficient algorithms in the dynamic setting. As a further application of our approach, we also present an algorithm that maintains a Minimal Dominating Set in O(min(Δ,√m))
Polynomial-Time Approximation Schemes for k-center, k-median, and Capacitated Vehicle Routing in Bounded Highway Dimension.
Joint work with Amariah Becker and Phil Klein
ESA 2018. [arXiv]

Abstract

The concept of bounded highway dimension was developed to capture observed properties of the metrics of road networks. We show that a graph with bounded highway dimension, for any vertex, can be embedded into a graph of bounded treewidth in such a way that the distance between u and v is preserved up to an additive error of ε times the distance from u or v to the selected vertex. We show that this theorem yields a PTAS for Bounded-Capacity Vehicle Routing in graphs of bounded highway dimension. In this problem, the input specifies a depot and a set of clients, each with a location and demand; the output is a set of depot-to-depot tours, where each client is visited by some tour and each tour covers at most Q units of client demand. Our PTAS can be extended to handle penalties for unvisited clients. We extend this embedding result to handle a set S of distinguished vertices. The treewidth depends on |S|, and the distance between u and v is preserved up to an additive error of ε times the distance from u and v to S. This embedding result implies a PTAS for Multiple Depot Bounded-Capacity Vehicle Routing: the tours can go from one depot to another. The embedding result also implies that, for fixed k, there is a PTAS for k-Center in graphs of bounded highway dimension. In this problem, the goal is to minimize d such that there exist k vertices (the centers) such that every vertex is within distance d of some center. Similarly, for fixed k, there is a PTAS for k-Median in graphs of bounded highway dimension. In this problem, the goal is to minimize the sum of distances to the k centers.
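The two objectives defined at the end of this abstract can be illustrated with a minimal Python sketch. This is not the PTAS from the paper: it only evaluates the k-Center and k-Median costs for a given set of centers, plus an exhaustive search that is practical only on tiny inputs. The function names and the four-vertex path metric `PATH` are hypothetical, chosen for illustration.

```python
from itertools import combinations


def k_center_cost(dist, centers):
    """k-Center objective: largest distance from any vertex to its nearest center."""
    return max(min(dist[v][c] for c in centers) for v in range(len(dist)))


def k_median_cost(dist, centers):
    """k-Median objective: sum over all vertices of the distance to the nearest center."""
    return sum(min(dist[v][c] for c in centers) for v in range(len(dist)))


def brute_force(dist, k, cost):
    """Try every set of k centers (exponential in k; tiny illustrations only)."""
    return min(combinations(range(len(dist)), k), key=lambda cs: cost(dist, cs))


# Hypothetical example metric: four vertices on a path with unit-length edges,
# so the distance between vertices i and j is |i - j|.
PATH = [[abs(i - j) for j in range(4)] for i in range(4)]

if __name__ == "__main__":
    best = brute_force(PATH, 2, k_median_cost)
    print(best, k_median_cost(PATH, best), k_center_cost(PATH, best))
```

Here `dist` is assumed to be a full matrix of shortest-path distances; both objectives take the same input and differ only in replacing the maximum by a sum.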
A Quasi-Polynomial-Time Approximation Scheme for Vehicle Routing on Planar and Bounded-Genus Graphs
Joint work with Amariah Becker and Phil Klein
ESA 2017. [Paper]

Abstract

The Capacitated Vehicle Routing problem is a generalization of the Traveling Salesman problem in which a set of clients must be visited by a collection of capacitated tours. Each tour can visit at most Q clients and must start and end at a specified depot. We present the first approximation scheme for Capacitated Vehicle Routing for non-Euclidean metrics. Specifically, we give a quasi-polynomial-time approximation scheme for Capacitated Vehicle Routing with fixed capacities on planar graphs. We also show how this result can be extended to bounded-genus graphs and polylogarithmic capacities, as well as to variations of the problem that include multiple depots and charging penalties for unvisited clients.
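The problem statement above can be made concrete with a minimal Python sketch. It is not the approximation scheme from the paper: it only checks whether a candidate solution respects the constraints named in the abstract and computes its total length. The function names, the representation of a tour as a list of stops, and the example metric `DIST` are assumptions made for illustration.

```python
def is_feasible(tours, depot, clients, capacity):
    """Check the constraints stated in the abstract for a candidate solution.

    Assumed representation: each tour is a list of stops, and every client
    appearing in a tour counts as visited by that tour.
    """
    clients = set(clients)
    visited = set()
    for tour in tours:
        # Each tour must start and end at the depot.
        if len(tour) < 2 or tour[0] != depot or tour[-1] != depot:
            return False
        served = {v for v in tour if v in clients}
        # Each tour can visit at most `capacity` (Q) clients.
        if len(served) > capacity:
            return False
        visited |= served
    # Every client must be visited by some tour.
    return visited == clients


def total_length(tours, dist):
    """Sum of the distances between consecutive stops over all tours."""
    return sum(dist[a][b] for tour in tours for a, b in zip(tour, tour[1:]))


# Hypothetical example: depot 0 and clients 1, 2, 3 on a path with unit edges.
DIST = [[abs(i - j) for j in range(4)] for i in range(4)]

if __name__ == "__main__":
    solution = [[0, 1, 2, 0], [0, 3, 0]]
    print(is_feasible(solution, 0, {1, 2, 3}, 2), total_length(solution, DIST))
```

With capacity Q = 2, the single tour visiting all three clients is rejected, which is why the example splits the clients across two tours.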
