Bibliography | Armbruster, Benedikt: Refinement of Partitioning for Multi-class Problems with Heterogeneous Groups. University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 54 (2023). 49 pages, English.
|
Abstract | Data analysis is a complex task in a world with a growing amount of complex data. Data often exhibits characteristics such as multi-class imbalance and heterogeneous groups. These characteristics pose a problem for classification algorithms, which tend to disregard classes with smaller sample sizes or fail to learn the patterns of heterogeneous classes. A first approach to this problem is data-driven partitioning, which improved classification performance on this type of data. However, the resulting partitions proved to still be too complex or to contain too few samples. Hence, the aim of this work is to refine the partitions further to enable more accurate classification. The workflow is based on data-driven partitioning. AutoML4Clust is used to obtain a clustering algorithm with optimized hyperparameters, which produces the initial partitioning. To increase classification performance, each partition is then either split further or merged with its nearest partition. Partitions are split using the clustering algorithm KMeans. Classification accuracy is determined with random forest classifiers. Based on this, the initial, split, and merged partitions are compared, and the best-performing clusters are chosen and assembled into a new clustering. To evaluate this approach, five datasets with different degrees of multi-class imbalance and heterogeneous groups were produced. The evaluation used the extrinsic clustering measures Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), the F1 score for classification accuracy, and the complexity metrics Fisher's discriminant ratio and fraction of border points. The NMI increased by at least 10% in three out of five tests; in the others it stayed the same. The ARI increased by at least 13% in four out of five tests; in the fifth it increased by 2%. The F1 scores remained the same between the initial and the refined partitioning in all tests.
Fisher's discriminant ratio increased by 11% in one out of five tests, while no major change occurred in the other four. No major change in the fraction of border points was observed in any of the tests. Taken together, the presented workflow substantially increased the extrinsic measures ARI and NMI. To improve the other measures, further refinements will have to be implemented in the future.
|
Department(s) | University of Stuttgart, Institute of Parallel and Distributed Systems, Applications of Parallel and Distributed Systems
|
Supervisor(s) | Mitschang, Prof. Bernhard; Treder-Tschechlov, Dennis |
Entry date | November 15, 2023 |
---|
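
The split-versus-keep decision described in the abstract could be sketched roughly as follows. This is a minimal illustration under several assumptions, not the thesis implementation. A plain scikit-learn `KMeans` run stands in for the AutoML4Clust-selected initial clustering, and the merge-with-nearest-partition candidate is omitted for brevity. The function names `partition_f1` and `refine_by_splitting`, the fold count, the forest size, the minimum partition size, and the synthetic data are all hypothetical choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict


def partition_f1(X, y, seed=0):
    """Cross-validated macro F1 of a random forest on a single partition."""
    classes, counts = np.unique(y, return_counts=True)
    if len(classes) < 2:
        return 1.0  # assumption: a single-class partition counts as trivially classified
    folds = min(3, int(counts.min()))
    if folds < 2:
        return 0.0  # assumption: too few samples of some class to validate
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    pred = cross_val_predict(clf, X, y, cv=folds)
    return f1_score(y, pred, average="macro")


def refine_by_splitting(X, y, labels, seed=0):
    """Split a partition in two with KMeans when that raises its F1 score."""
    refined = labels.copy()
    next_id = labels.max() + 1
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        if len(idx) < 4:  # hypothetical minimum size for a split attempt
            continue
        Xc, yc = X[idx], y[idx]
        keep_score = partition_f1(Xc, yc, seed)
        halves = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(Xc)
        # Size-weighted mean F1 over the two candidate sub-partitions.
        split_score = sum(
            np.mean(halves == h) * partition_f1(Xc[halves == h], yc[halves == h], seed)
            for h in (0, 1)
        )
        if split_score > keep_score:
            refined[idx[halves == 1]] = next_id  # one half receives a fresh label
            next_id += 1
    return refined


# Synthetic stand-in data; the initial partitioning is deliberately coarse.
X, y = make_blobs(n_samples=300, centers=4, random_state=0)
initial = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
refined = refine_by_splitting(X, y, initial)
```

In this sketch a partition is only ever kept or split, so every refined partition lies inside exactly one initial partition. Scoring a merge candidate the same way and choosing the best of the three options would bring it closer to the workflow the abstract describes.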