Journal article ART-2019-07

Fritz, Manuel; Behringer, Michael; Schwarz, Holger: Quality-driven early stopping for explorative cluster analysis for big data.
In: Software-Intensive Cyber-Physical Systems.
Universität Stuttgart, Fakultät Informatik, Elektrotechnik und Informationstechnik.
pp. 1-12, English.
Springer Berlin Heidelberg, February 6, 2019.
ISSN: 2524-8510; 2524-8529; DOI: 10.1007/s00450-019-00401-0.
Journal article.
CR classification: E.0 (Data General)
H.2.8 (Database Applications)
H.3.3 (Information Search and Retrieval)
Keywords: Clustering; Big Data; Early Stop; Convergence; Regression

Data analysis has become a critical success factor for companies in all areas. Hence, it is necessary to quickly gain knowledge from available datasets, which is becoming especially challenging in times of big data. Typical data mining tasks like cluster analysis are very time-consuming, even if they run in highly parallel environments like Spark clusters. To support data scientists in explorative data analysis processes, we need techniques to make data mining tasks even more efficient. To this end, we introduce a novel approach to stop clustering algorithms as early as possible while still achieving an adequate quality of the detected clusters. Our approach exploits the iterative nature of many clustering algorithms and uses a metric to decide after which iteration the mining task should stop. We present experimental results based on a Spark cluster using multiple huge datasets. The experiments show that our approach is able to accelerate the clustering by up to a factor of more than 800 by omitting many iterations that provide only little gain in quality. This way, we are able to find a good balance between the time required for data analysis and the quality of the analysis results.
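
The idea of stopping an iterative clustering algorithm once further iterations add little quality can be illustrated with a minimal Python sketch. This is not the authors' method: the abstract does not specify the metric, so the sketch assumes a simple stand-in rule, namely to stop k-means when the relative decrease in the sum of squared errors between two iterations falls below a threshold. The function name `kmeans_early_stop` and the parameter `min_rel_gain` are hypothetical, and the code runs on a single machine rather than on Spark.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    labels = []
    for pt in points:
        dists = [squared_distance(pt, ctr) for ctr in centers]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centers):
    """Recompute each center as the mean of its members."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [pt for pt, lab in zip(points, labels) if lab == k]
        if not members:
            new_centers.append(old)  # keep the center of an empty cluster
            continue
        dim = len(old)
        new_centers.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
        )
    return new_centers


def kmeans_early_stop(points, k, max_iter=100, min_rel_gain=0.01, seed=0):
    """k-means that stops once the relative cost reduction is small.

    Assumed stopping rule (not taken from the paper): stop when
    (previous_cost - cost) / previous_cost < min_rel_gain.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    prev_cost = None
    for iteration in range(1, max_iter + 1):
        labels = assign(points, centers)
        centers = update(points, labels, centers)
        cost = sum(squared_distance(pt, centers[lab]) for pt, lab in zip(points, labels))
        if prev_cost is not None:
            gain = (prev_cost - cost) / prev_cost if prev_cost > 0 else 0.0
            if gain < min_rel_gain:
                break  # further iterations would add little quality
        prev_cost = cost
    return centers, labels, iteration, cost


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    centers, labels, iterations, cost = kmeans_early_stop(data, k=2)
    print(iterations, labels, cost)
```

A larger `min_rel_gain` stops earlier and trades cluster quality for run time; a value of 0 makes the loop run until the cost no longer decreases or `max_iter` is reached.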

Full text and other links: Springer Link
Copyright: Springer Berlin Heidelberg
Department(s): Universität Stuttgart, Institut für Parallele und Verteilte Systeme, Anwendersoftware
Entry date: March 7, 2019