Article in Proceedings INPROC-2020-53

Bibliography: Fritz, Manuel; Behringer, Michael; Schwarz, Holger: LOG-Means: Efficiently Estimating the Number of Clusters in Large Datasets.
In: Balazinska, Magdalena (ed.); Zhou, Xiaofang (ed.): Proceedings of the 46th International Conference on Very Large Data Bases (VLDB).
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology.
Proceedings of the VLDB Endowment; 13 (12), pp. 2118-2131, English.
ACM Digital Library, August 2020.
ISBN: ISSN 2150-8097; DOI:
Article in Proceedings (Conference Paper).
CR-Schema: H.2.8 (Database Applications)

Clustering is a fundamental primitive in manifold applications. In order to achieve valuable results, parameters of the clustering algorithm, e.g., the number of clusters, have to be set appropriately, which is a tremendous pitfall. To this end, analysts rely on their domain knowledge in order to define parameter search spaces. While experienced analysts may be able to define a small search space, especially novice analysts often define rather large search spaces due to the lack of in-depth domain knowledge. These search spaces can be explored in different ways by estimation methods for the number of clusters. In the worst case, estimation methods perform an exhaustive search in the given search space, which leads to infeasible runtimes for large datasets and large search spaces. We propose LOG-Means, which is able to overcome these issues of existing methods. We show that LOG-Means provides estimates in sublinear time regarding the defined search space, thus being a strong fit for large datasets and large search spaces. In our comprehensive evaluation on an Apache Spark cluster, we compare LOG-Means to 13 existing estimation methods. The evaluation shows that LOG-Means significantly outperforms these methods in terms of runtime and accuracy. To the best of our knowledge, this is the most systematic comparison on large datasets and search spaces as of today.

Department(s): University of Stuttgart, Institute of Parallel and Distributed Systems, Applications of Parallel and Distributed Systems
Entry date: November 19, 2020