Master's Thesis MSTR-2021-74

Bibliographic data
Schmidt, Simone: Concepts towards an automated data pre-processing and preparation within data lakes.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering and Information Technology, Master's Thesis No. 74 (2021).
91 pages, English.
Abstract

The Internet of Things produces huge amounts of heterogeneous data. Fields such as Industry 4.0, smart city development, and the healthcare sector analyse this big data as a basis for many applications. With their central storage, where heterogeneous data is kept in its original format, data lakes allow data to be analysed for any use case. This schema-on-read approach leaves the transformation of data into an appropriate schema to the user. To achieve this, users need knowledge about the stored data, domain knowledge, and IT knowledge. However, the people who need the analysis results are often domain experts rather than IT experts. This work explores possibilities for assisting users in data preparation for novel use cases in data lakes. It examines why data pre-processing in data lakes is difficult and derives requirements for a user-assistance concept. It then extracts the steps a user takes when developing data preparation for a new use case in data lakes and reviews existing concepts in the literature for assisting users in those steps. The review finds that existing solutions provide sufficient assistance in data discovery. Support for technical realisation is almost sufficient, but assistance in choosing the right transformations is still lacking. Based on the lessons learned from the analysis of existing solutions, a new concept is developed. The concept is based on BARENTS, a data lake concept that enables the specification of data preparation in the form of ontologies and automatically performs the specified transformations. A new transformation recommender helps users choose transformations and create the ontology that specifies data preparation. A prototypical implementation of the concept demonstrates how users are assisted in specifying their data preparation needs. The concept is shown to fulfil the stated requirements and to enable flexible, user-friendly specification of data pre-processing needs within data lakes.

Full text and other links
Full text
Department(s): University of Stuttgart, Institute for Parallel and Distributed Systems, Application Software (Anwendersoftware)
Supervisor(s): Mitschang, Prof. Bernhard; Stach, Dr. Christoph; Eichler, Rebecca Kay
Entry date: February 16, 2022