Master Thesis MSTR-2018-64

Bibliography: Muazzen, Osama: Framework for automatic selection of analytic platforms for data mining tasks.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 64 (2018).
71 pages, English.
Abstract

Data mining is becoming increasingly important and can be applied in many domains. Its importance stems from its ability to derive valuable knowledge from voluminous datasets. Numerous analytic platforms that execute data mining algorithms have been developed, such as Spark, Mahout, and WEKA. These platforms differ in their characteristics, purposes, performance, and the manner in which they process data. This plethora of platforms makes it difficult to select the analytic platform that best fits the needed data mining algorithm, the submitted dataset, and additional user-defined criteria. Several works have been introduced to help users select an appropriate analytic platform. These works are mainly benchmarks that evaluate the performance of analytic platforms. However, these benchmarks have several issues regarding their objectivity and the measures they consider. For example, several benchmarks focus only on execution runtime as a performance indicator and completely ignore other measures such as resource consumption.

This thesis introduces a novel approach to automatically selecting the most appropriate analytic platform for executing a needed data mining algorithm. The selection process rests on the hypothesis that executing a data mining algorithm on similar datasets, using the same analytic platform, yields comparable performance. Based on a previously conducted benchmark, the selection process compares the performance of different analytic platforms and selects the one with the best performance according to a user-defined criterion, e.g., the runtime of the data mining algorithm. The approach aggregates several analytic platforms with various data mining algorithms in order to offer a single, abstract interface to them. Hence, it enables users to execute data mining algorithms on different analytic platforms according to their needs. Moreover, it allows novice users without a profound programming background to execute data mining algorithms easily, solely by calling the algorithm with the required parameters via a REST API. Therefore, the proposed approach can be used from several programming languages. Furthermore, the approach can automatically benchmark several execution engines using different data mining algorithms and different datasets. The benchmark results are stored so that they can be used for automatic selection later on. The prototypical implementation has a loosely coupled and extensible architecture, which gives it the flexibility to add more execution engines and data mining algorithms later on.

The approach is evaluated on several analytic platforms, such as Spark, Mahout, and WEKA, along with several data mining algorithms from classification, clustering, and association rule discovery. The experimental results show that automatic selection of the analytic platform can save up to 98.45% of the execution time of the data mining algorithm in some cases. The proposed selection process achieves an accuracy of 87.00% in selecting the analytic platform with the best performance according to the runtime criterion. Selecting the most appropriate analytic platform causes a runtime overhead; however, this overhead is negligible compared to the actual execution time of the data mining algorithm.
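
The selection idea described in the abstract (look up stored benchmark results for similar datasets, then pick the platform that performs best under a user-defined criterion) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the record layout, the choice of dataset features (rows and attributes), the Euclidean similarity measure, the names `BenchmarkRecord` and `select_platform`, and all numbers in `EXAMPLE_RECORDS` are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class BenchmarkRecord:
    """One stored benchmark result (hypothetical layout, not from the thesis)."""
    platform: str
    algorithm: str
    dataset_features: tuple  # assumed features: (number of rows, number of attributes)
    metrics: dict            # assumed metric names, e.g. {"runtime_s": 12.3}


def distance(a, b):
    """Euclidean distance between two feature tuples.

    A real system would likely normalize the features first; this sketch does not.
    """
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_platform(records, algorithm, dataset_features, criterion="runtime_s"):
    """Return the platform expected to perform best on a new dataset.

    For each platform, the benchmark record with the most similar dataset is
    used as the performance estimate (the "similar datasets yield comparable
    performance" hypothesis). Lower metric values are treated as better.
    """
    nearest = {}  # platform -> (distance to nearest benchmarked dataset, metric value)
    for record in records:
        if record.algorithm != algorithm or criterion not in record.metrics:
            continue
        d = distance(record.dataset_features, dataset_features)
        if record.platform not in nearest or d < nearest[record.platform][0]:
            nearest[record.platform] = (d, record.metrics[criterion])
    if not nearest:
        raise LookupError(f"no benchmark results for {algorithm!r} and {criterion!r}")
    return min(nearest, key=lambda platform: nearest[platform][1])


# Made-up benchmark results for illustration only; these are not measured values.
EXAMPLE_RECORDS = [
    BenchmarkRecord("Spark", "kmeans", (1_000_000, 20), {"runtime_s": 40.0}),
    BenchmarkRecord("WEKA", "kmeans", (1_000_000, 20), {"runtime_s": 300.0}),
    BenchmarkRecord("Mahout", "kmeans", (1_000_000, 20), {"runtime_s": 90.0}),
    BenchmarkRecord("Spark", "kmeans", (1_000, 10), {"runtime_s": 15.0}),
    BenchmarkRecord("WEKA", "kmeans", (1_000, 10), {"runtime_s": 2.0}),
    BenchmarkRecord("Mahout", "kmeans", (1_000, 10), {"runtime_s": 30.0}),
]
```

With these invented records, calling `select_platform(EXAMPLE_RECORDS, "kmeans", (1_200, 12))` picks the platform whose nearest small-dataset benchmark has the lowest runtime. Other criteria, such as resource consumption, could be supported by storing additional metric keys and passing a different `criterion`.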

Full text and other links: Volltext (full text)
Department(s): University of Stuttgart, Institute of Parallel and Distributed Systems, Applications of Parallel and Distributed Systems
Supervisor(s): Schwarz, PD Dr. Holger; Fritz, Manuel
Entry date: June 5, 2019
   Publ. Computer Science