Clustering is a fundamental primitive for grouping the points of a dataset and learning about the dataset’s structure. However, finding a clustering algorithm and parameters that suit a specific clustering task is challenging. Tschechlov et al. recently introduced AutoML4Clust (A4C), which effectively automates this search [TFS21]. A4C thus brings powerful clustering analyses within reach of novice analysts. Given this role of A4C, it is important to further improve the clusterings it generates. This work explores one potential way to do so: ensemble clustering.
In ensemble clustering, a consensus function turns multiple clusterings of a dataset (the ensemble) into a single consensus clustering. This work contributes A4C-based ensemble clustering, which combines ensemble clustering with A4C. In this approach, the ensemble is generated using A4C and subsequently processed by a consensus function.
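To make the notion of a consensus function concrete, the following minimal sketch combines an ensemble by label alignment and majority voting. It is an illustration only and not necessarily one of the consensus functions selected in this work. The function names `align_labels` and `voting_consensus` are hypothetical, and the sketch assumes that every ensemble member uses labels 0..k-1 with a small k.

```python
from collections import Counter
from itertools import permutations


def align_labels(reference, labels, k):
    """Relabel `labels` so its clusters overlap those of `reference` as much as possible."""
    # overlap[(r, l)] = number of points with reference label r and own label l.
    overlap = Counter(zip(reference, labels))
    # Exhaustive search over label permutations (assumption: k is small).
    # best[l] is the reference label assigned to own label l.
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(overlap[(perm[l], l)] for l in range(k)),
    )
    return [best[l] for l in labels]


def voting_consensus(ensemble, k):
    """Majority vote per point over label-aligned ensemble members."""
    reference = ensemble[0]  # assumption: the first member serves as reference
    aligned = [align_labels(reference, member, k) for member in ensemble]
    # For each point, pick the label chosen by most aligned members.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*aligned)]


ensemble = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same partition as the first, labels swapped
    [0, 0, 1, 1, 1, 1],  # disagrees on the third point
]
consensus = voting_consensus(ensemble, k=2)
```

In this sketch, the overlap counts and the vote each take one pass over the points, while the permutation search depends only on k. For larger k, the exhaustive search could be replaced by the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`.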
The consensus functions are selected based on an extensive literature search and a requirements analysis. Among other things, the requirements analysis ensures that only consensus functions whose runtime scales linearly in the number of points are selected, so that A4C’s ability to cluster large datasets carries over to A4C-based ensemble clustering. The selected consensus functions are then implemented. Additionally, this work implements clusterlab, a generic and extensible framework for ensemble clustering analyses. Clusterlab facilitates the automatic evaluation of ensemble clustering in this and future work. It is independent of A4C and can be extended with new ensemble generation methods and consensus functions.
The evaluation compares A4C, A4C-based ensemble clustering, and OPTICS, a density-based clustering algorithm capable of detecting arbitrarily shaped clusters. It finds that A4C-based ensemble clustering consistently improves accuracy over A4C without substantially increasing A4C’s total runtime. Furthermore, A4C-based ensemble clustering compares favorably to OPTICS in terms of accuracy for most dataset characteristics and is faster than OPTICS for datasets with many points and dimensions.
These results establish that A4C-based ensemble clustering leads to more valuable clusterings and scales well to datasets with many points and dimensions. If it were incorporated into A4C by default, novice analysts could conduct even more powerful clustering analyses.