Master Thesis MSTR-2022-37

Bibliography: Balihalli, Tushar Rajendra: Data-driven partitioning of training data for complex multiclass problems.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 37 (2022).
66 pages, English.

Substantial technological advances have led to vast amounts of data that can be used for data analysis and decision-making. Data analysis tasks typically apply machine learning algorithms to real-world data. However, real-world data exhibits a variety of complex characteristics, such as missing features and multi-class imbalance, so directly applying machine learning methods does not lead to satisfactory results. The data therefore has to be pre-processed before the actual analysis, e.g., by partitioning it to reduce its complexity. The objective of this work is to develop a clustering-based partitioning approach that reduces the complexity arising from these challenges. To this end, various measures that reflect the impact of the challenges are analyzed in detail. These measures quantify the complexity associated with the data. The work concentrates on two major challenges related to data complexity, C1 and C2. Challenge C1 concerns multi-class imbalance, including a high number of classes and overlapping decision boundaries, whereas challenge C2 comprises heterogeneous feature characteristics involving missing features, sub-concepts, and the class membership problem. Although there are measures that address individual problems, this work addresses all of the challenges in a single dataset, thereby overcoming the shortcomings of approaches that address only one characteristic. The training data is partitioned in a data-driven manner using clustering, and the partitioning is optimized with respect to the value of a complexity measure. AutoML, i.e., automated machine learning, is employed for the hyper-parameter optimization. Further, a classifier is trained on each individual partition, and its hyper-parameters are optimized to improve the model’s performance.
The comprehensive evaluation discusses the results for different complexity measures and various state-of-the-art approaches on numerous validation datasets. It shows that partitioning data with complex characteristics and optimizing for an appropriate value of the complexity measure increases system performance. Hence, this work demonstrates that applying classification models to individual partitions improves prediction accuracy.
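
As a rough illustration of the pipeline summarized above, the following Python sketch partitions labeled data with a small k-means, selects the number of partitions by a complexity score, and fits one simple model per partition. It is not the thesis implementation: the function names, the leave-one-out 1-NN error used as the complexity measure, the nearest-centroid classifier, and the plain loop over candidate values of k (standing in for AutoML-based optimization) are all assumptions made for the example.

```python
import math
from collections import defaultdict

def nearest(x, centers):
    """Index of the center closest to x (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(points, k, iters=20):
    """Lloyd's algorithm with a deterministic start (evenly spaced samples)."""
    step = len(points) // k
    centers = [points[i * step] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

def split(data, centers):
    """Assign each (features, label) pair to its nearest center."""
    parts = defaultdict(list)
    for x, y in data:
        parts[nearest(x, centers)].append((x, y))
    return parts

def loo_1nn_errors(part):
    """Leave-one-out 1-NN misclassifications inside one partition
    (assumed stand-in for the complexity measures studied in the thesis)."""
    errors = 0
    for i, (x, y) in enumerate(part):
        others = [p for j, p in enumerate(part) if j != i]
        if others and min(others, key=lambda p: math.dist(x, p[0]))[1] != y:
            errors += 1
    return errors

def fit(data, candidate_ks):
    """Pick the k with the lowest size-weighted complexity, then fit one
    nearest-centroid model per non-empty partition."""
    points = [x for x, _ in data]
    best = None
    for k in candidate_ks:  # simple search; the thesis uses AutoML here
        parts = split(data, kmeans(points, k))
        score = sum(loo_1nn_errors(p) for p in parts.values()) / len(data)
        if best is None or score < best[0]:
            best = (score, parts)
    score, parts = best
    models = []
    for part in parts.values():
        by_class = defaultdict(list)
        for x, y in part:
            by_class[y].append(x)
        center = mean([x for x, _ in part])
        models.append((center, {y: mean(xs) for y, xs in by_class.items()}))
    return score, models

def predict(x, models):
    """Route x to the nearest partition, then to its nearest class centroid."""
    _, centroids = models[nearest(x, [c for c, _ in models])]
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))

# Toy data: four classes in two well-separated regions (hypothetical).
data = [((u, v), "a") for u in (0, 1) for v in (0, 1)]
data += [((u, v), "b") for u in (3, 4) for v in (0, 1)]
data += [((u, v), "c") for u in (20, 21) for v in (20, 21)]
data += [((u, v), "d") for u in (23, 24) for v in (20, 21)]

score, models = fit(data, candidate_ks=[1, 2, 3])
label = predict((0.2, 0.3), models)
```

The sketch needs Python 3.8 or later for `math.dist`. Note that a within-partition error score of this kind can be driven down simply by creating very small partitions, so a realistic search would presumably constrain partition sizes or the range of k.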

Department(s): University of Stuttgart, Institute of Parallel and Distributed Systems, Applications of Parallel and Distributed Systems
Supervisor(s): Schwarz, Prof. Holger; Tschechlov, Dennis
Entry date: September 16, 2022