Bibliography | Schubert, Tim: Context-aware data validation for machine learning pipelines. University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 98 (2022). 78 pages, English.
|
Abstract | Machine learning now plays a key role in many applications. Self-learning algorithms are developed not only for industrial applications, e.g., production lines or fleet management, but also for the private sector, e.g., smart homes. The performance of these programs depends strongly on the training data provided, and a major challenge is preserving high data quality. Consequently, the demand for good data cleaning methods has been increasing over the past few years. While existing cleaning techniques can consider constraints and dependencies in data, they cannot exploit context information automatically. Thus, they usually fail to track shifts in the data distributions or the associated error profiles. To overcome these limitations, this thesis introduces a novel pipeline for automated tabular data cleaning powered by dynamic functional dependency rules extracted from a context model. This context model is a live-updating ontology that represents the current state of the environment from which the data originates. The proposed concept divides the pipeline into three main steps: (i) context modeling, (ii) dependency extraction, and (iii) data cleaning. As a proof of concept and for evaluation purposes, a prototype has been implemented. This prototype is evaluated on two datasets: an IoT dataset from a smart home use case and a commonly used benchmark dataset containing various metrics from hospitals in the US. The evaluation shows that the proposed concept and pipeline for data validation perform better than typical state-of-the-art error detection methods.
|