Bibliography | Schmidt, Simone: Concepts towards an automated data pre-processing and preparation within data lakes. University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 74 (2021). 91 pages, English.
|
Abstract | The Internet of Things produces huge amounts of heterogeneous data. Fields like Industry 4.0, smart city development, or the healthcare sector analyse this big data as a basis for many applications. With their central storage, where heterogeneous data is stored in its original format, data lakes allow the analysis of data for any use case. This schema-on-read approach leaves the transformation of data into an appropriate schema to the user. To achieve this, users need knowledge about the stored data, domain knowledge, and IT knowledge. However, the people who need the analysis results are often domain experts rather than IT experts. This work explores possibilities for assisting users in data preparation for novel use cases in data lakes. Reasons for the difficulty of data pre-processing in data lakes are examined, and requirements for a user assistance concept are derived. The steps a user takes when developing data preparation for a new use case in data lakes are extracted, and existing concepts in the literature for assisting users in those steps are reviewed. It is found that existing solutions provide sufficient assistance in data discovery. The support for technical realisation is almost sufficient, but assistance in choosing the right transformations is still lacking. Based on the lessons learned from the analysis of existing solutions, a new concept is developed. The concept is based on BARENTS, a data lake concept that enables the specification of data preparation in the form of ontologies and automatically performs the specified transformations. A new transformation recommender helps users choose transformations and create the ontology that specifies data preparation. A prototypical implementation of the concept demonstrates how users are assisted in specifying their data preparation needs.
The concept is shown to fulfill the stated requirements and enables flexible, user-friendly specification of data pre-processing needs within data lakes.
|
Full text and other links | Full text
|
Department(s) | University of Stuttgart, Institute of Parallel and Distributed Systems, Applications of Parallel and Distributed Systems
|
Supervisor(s) | Mitschang, Prof. Bernhard; Stach, Dr. Christoph; Eichler, Rebecca Kay |
Entry date | February 16, 2022 |