Master's Thesis MSTR-2017-13

Bibliographic data
Bettadapura Raghavendra, Shreyas: Relevance of the two adjusting screws in data analytics: data quality and optimization of algorithms.
Universität Stuttgart, Fakultät Informatik, Elektrotechnik und Informationstechnik, Master's Thesis No. 13 (2017).
97 pages, English.
Abstract

In the context of learning from data, the factors that affect the performance of a learning algorithm have traditionally been studied from two perspectives: data preprocessing and empirical work. We attempt to provide a middle ground with an approach that enables a systematic analysis of the interaction between the quality of the training data and the configuration applied to the learning algorithm. This is achieved through two concepts: a Data Quality Profile, which describes quality indicators of the dataset, and a Classification Configuration Profile, which describes the configuration parameters applied to the learning algorithm. Both profiles share the ability to view their properties distinctly and to represent variations in those properties equally, allowing for a systematic study. We demonstrate this through a prototypical implementation that considers the data quality indicators of missing values, label imbalance, and high cardinality, and evaluates them against the CART decision tree algorithm, configurable by its splitting criteria, early stopping criteria, and training data preprocessing operations. We observed a relationship between decreasing quality of the training data and deterioration in the performance of the algorithm. The flexibility of the approach allows it to be extended to other algorithms and to further quality indicators.
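
To make the two profiles more concrete, the following is a minimal Python sketch, not the thesis's implementation. All class names, field names, and indicator formulas here are assumptions chosen for illustration: missing values are measured as the share of empty feature cells, label imbalance as the majority-to-minority class ratio, and cardinality as the largest number of distinct values in any one feature. The configuration fields only mirror the three groups named in the abstract (splitting criterion, early stopping, preprocessing).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataQualityProfile:
    """Hypothetical container for the three quality indicators."""
    missing_ratio: float     # share of feature cells that are None
    imbalance_ratio: float   # majority class count / minority class count
    max_cardinality: int     # most distinct values found in a single feature


@dataclass(frozen=True)
class ClassificationConfigProfile:
    """Hypothetical container for decision tree configuration parameters."""
    split_criterion: str = "gini"     # splitting criterion
    max_depth: Optional[int] = None   # early stopping: depth limit
    min_samples_leaf: int = 1         # early stopping: leaf size limit
    impute_missing: bool = False      # training data preprocessing switch


def profile_data(rows, label_key):
    """Compute the assumed indicators for a list of dict-shaped records."""
    features = [key for key in rows[0] if key != label_key]

    # Missing values: fraction of feature cells holding None.
    cells = [row[f] for row in rows for f in features]
    missing_ratio = sum(value is None for value in cells) / len(cells)

    # Label imbalance: size of the largest class relative to the smallest.
    label_counts = Counter(row[label_key] for row in rows)
    imbalance_ratio = max(label_counts.values()) / min(label_counts.values())

    # Cardinality: distinct non-missing values per feature, take the maximum.
    max_cardinality = max(
        len({row[f] for row in rows if row[f] is not None}) for f in features
    )
    return DataQualityProfile(missing_ratio, imbalance_ratio, max_cardinality)


# Toy dataset, invented purely for illustration.
rows = [
    {"colour": "red", "size": 1, "label": "a"},
    {"colour": None, "size": 2, "label": "a"},
    {"colour": "blue", "size": 3, "label": "a"},
    {"colour": "green", "size": None, "label": "b"},
]
profile = profile_data(rows, "label")
config = ClassificationConfigProfile(max_depth=3, impute_missing=True)
```

Keeping each indicator and each parameter in its own field means one property can be varied while the others are held fixed, which is the kind of systematic comparison the abstract describes.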

Full text and other links
Full text
Department(s): Universität Stuttgart, Institut für Parallele und Verteilte Systeme, Anwendersoftware
Supervisor(s): Mitschang, Prof. Bernhard; Villanueva Zacarias, Alejandro Gabriel; Kiefer, Cornelia
Entry date: 28 May 2019