Master Thesis MSTR-2024-20

Bibliography
Labes, Leon: Analysis and evaluation of data preprocessing methods for clustering analyses.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 20 (2024).
93 pages, English.
Abstract

Data is often used to extract knowledge and to guide decisions in many areas. Knowledge extraction is usually done with a machine-learning technique such as clustering. However, the data frequently contains imperfections such as missing values, outliers, or skewed distributions. These imperfections need to be addressed before knowledge can be extracted, because otherwise machine-learning models cannot be applied or achieve poor results. This process is called preprocessing. Many different preprocessing methods and corresponding hyperparameters exist. Therefore, finding a good selection of methods and hyperparameters that improves the machine-learning result is challenging, especially if the data imperfections are unknown. In addition, more than one preprocessing method is often used, which enlarges the search space and brings additional challenges, because the order in which the methods are performed and the interactions between them are unclear. This may result in a lengthy trial-and-error process, especially for inexperienced users, in which different preprocessing pipelines are evaluated until an acceptable one is found.

Some recent advances have been made in the area of AutoML that cover the automation of this selection as a small part of their approach. However, these are limited in the area of clustering, and most approaches propose only a single preprocessing method or consider only a limited number of methods in their configuration space. One reason for these limitations is that they focus on an end-to-end approach where, besides an appropriate preprocessing pipeline, a clustering algorithm and its parameters are suggested as well.

The AutoML approaches consider preprocessing to be a small part at best and focus primarily on model building. In contrast, this work focuses mainly on addressing the challenges named in the first paragraph and improving the results with appropriate preprocessing pipelines while using a model, in this case clustering, to evaluate the pipelines.

To mitigate the challenges described in the first paragraph, the overall goal is to make accurate suggestions for preprocessing pipelines that improve the clustering result. To achieve this, it is important to identify data imperfections and to understand the relationship between data imperfections and preprocessing pipelines. This thesis contributes a first step in that direction: a concept for creating a knowledge base that accurately measures the effects of preprocessing pipelines on data with imperfections and can identify the data imperfections of datasets. For the latter, meta-features are evaluated because they are a good way to describe unseen datasets and their imperfections. Such a knowledge base needs to contain information from many different datasets to be able to generalize. For that reason, synthetic data is used and further manipulated with data imperfections, which makes an almost unlimited supply of datasets available and allows a precise evaluation of which pipeline handles which data imperfection well. These manipulations skew the data distribution, remove parts of the data to create missing values, and add outliers to the data. Because of the challenges named in the first paragraph, it is not feasible to apply all combinations of preprocessing methods and their hyperparameters to a dataset, even if the selection of preprocessing methods is small. Instead, pipelines are generated and refined during an optimization process. Genetic optimization is used for this purpose because it provides high flexibility and can be customized well to address the named challenges.
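
The three manipulations named above (skewing a distribution, removing values, adding outliers) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the exponential transform used for skewing, and all parameter values are assumptions chosen for illustration, and only the Python standard library is used.

```python
import math
import random


def make_blobs(n_per_cluster, centers, spread, rng):
    """Draw Gaussian points around each center (synthetic clustered data)."""
    return [
        [rng.gauss(c, spread) for c in center]
        for center in centers
        for _ in range(n_per_cluster)
    ]


def skew(rows, column):
    """Skew one feature by exponentiating it, which gives a long right tail.

    The exponential transform is an assumption; other skewing transforms
    would serve the same purpose.
    """
    return [
        [math.exp(v) if j == column else v for j, v in enumerate(row)]
        for row in rows
    ]


def drop_values(rows, fraction, rng):
    """Replace roughly `fraction` of all cells with NaN (missing values)."""
    return [
        [math.nan if rng.random() < fraction else v for v in row]
        for row in rows
    ]


def add_outliers(rows, count, scale, rng):
    """Append `count` points drawn uniformly from a box of half-width `scale`.

    With `scale` much larger than the cluster extent, most of these points
    lie far away from every cluster.
    """
    dims = len(rows[0])
    extra = [
        [rng.uniform(-scale, scale) for _ in range(dims)] for _ in range(count)
    ]
    return rows + extra


# Hypothetical usage: one clean dataset and three manipulated variants.
rng = random.Random(42)
clean = make_blobs(100, [(0.0, 0.0), (5.0, 5.0)], 0.5, rng)
variants = {
    "skewed": skew(clean, column=0),
    "missing": drop_values(clean, fraction=0.1, rng=rng),
    "outliers": add_outliers(clean, count=10, scale=100.0, rng=rng),
}
```

Because the clean data and each manipulation are known, the effect of a preprocessing pipeline on one specific imperfection can be measured in isolation.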

The evaluation shows that for most datasets the optimizer finds a preprocessing pipeline that leads to a significant improvement in the clustering results compared to the results without preprocessing. The improvement is largest for data with skewed distributions, while datasets that have not been manipulated show the smallest improvement. Additionally, the evaluation of the optimization process shows that a well-performing pipeline is found relatively quickly, while later improvements exist but are comparatively small. Missing value imputation works best with the KNN-Imputer compared to other imputation techniques. For the other data imperfections, no preferred method or pipeline emerges. Concerning the order of the preprocessing methods within a pipeline, no significant difference could be shown. Additionally, it is demonstrated that meta-features correlate with data imperfections. This suggests that the meta-features used in this thesis make it possible to determine the need for preprocessing and to identify data imperfections.
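
To make the idea of meta-features that correlate with data imperfections concrete, the following sketch computes three simple per-column statistics. The abstract does not list the meta-features actually used in the thesis, so these three (missing ratio, skewness, and share of values outside the Tukey fences) are assumptions chosen because each maps naturally to one of the imperfections discussed.

```python
import math
import statistics


def missing_ratio(column):
    """Share of NaN entries in one feature column."""
    return sum(math.isnan(v) for v in column) / len(column)


def skewness(column):
    """Population skewness (third standardized moment), ignoring NaN."""
    values = [v for v in column if not math.isnan(v)]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return 0.0
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


def outlier_ratio(column, k=1.5):
    """Share of non-missing values outside Q1 - k*IQR and Q3 + k*IQR."""
    values = [v for v in column if not math.isnan(v)]
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sum(v < low or v > high for v in values) / len(values)


def describe(column):
    """Bundle the three illustrative meta-features for one column."""
    return {
        "missing_ratio": missing_ratio(column),
        "skewness": skewness(column),
        "outlier_ratio": outlier_ratio(column),
    }
```

A high missing ratio, a skewness far from zero, or a large outlier ratio would each hint at the corresponding imperfection in an unseen dataset.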

Department(s): University of Stuttgart, Institute of Parallel and Distributed Systems, Applications of Parallel and Distributed Systems
Supervisor(s): Schwarz, Prof. Holger; Treder-Tschechlov, Dennis
Entry date: August 8, 2024