Master Thesis MSTR-2018-54

Bibliography: Loutfi, Kinda: Evaluation of prediction mechanisms of parameters for data mining algorithms.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 54 (2018).
97 pages, English.

Extracting knowledge and useful information from huge amounts of data is one of the biggest current challenges in computer science. Data mining is an essential step in the knowledge discovery process because of its contribution to extracting this knowledge. One of the best-known and most widely used data mining techniques is clustering: the process of partitioning a group of objects into smaller sets called clusters, such that similar objects are assigned to the same cluster and dissimilar objects to different clusters. K-Means, DBSCAN, and OPTICS are three of the most popular clustering algorithms. Each of these algorithms requires input parameters, and the difficulty of knowing these parameters in advance is the weakness of these algorithms. Many previous approaches predict these parameters. However, they suffer from problems such as a long overall runtime, since the clustering algorithm is applied multiple times with varying parameters to identify the best parameter configuration. This thesis introduces a new approach for predicting the input parameters, called PROD (Position-Based Prediction Using Voronoi Diagrams). PROD overcomes the problems that emerged in previous approaches. It predicts the input parameters by applying space partitioning approaches and subsequently exploiting their properties. PROD is evaluated on various data sets. For the K-Means algorithm, PROD predicts the number of clusters as well as the initial centroids. The experimental results show that PROD is (a) more accurate and (b) faster by a factor of 26.5 than the previously best existing approaches.
Additionally, the evaluation results show that PROD can be used either to predict the number of clusters or as a standalone clustering algorithm. Despite the effectiveness of PROD's predictions for the K-Means parameters, the experiments show that this novel approach needs further improvement before it can predict the DBSCAN and OPTICS parameters. Specifically, changing some aspects of PROD might lead to better predictions for these two algorithms.

Department(s): University of Stuttgart, Institute of Parallel and Distributed Systems, Applications of Parallel and Distributed Systems
Supervisor(s): Schwarz, PD Dr. Holger; Fritz, Manuel; Behringer, Michael
Entry date: June 4, 2019