Article in Journal ART-2023-05

Bibliography: Stach, Christoph; Eichler, Rebecca; Schmidt, Simone: A Recommender Approach to Enable Effective and Efficient Self-Service Analytics in Data Lakes.
In: Datenbank-Spektrum. Vol. 23(2).
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology.
pp. 123-132, English.
Springer Nature, June 14, 2023.
ISSN: 1618-2162; DOI: 10.1007/s13222-023-00443-4.
CR-Schema: H.2.7 (Database Administration)
E.2 (Data Storage Representations)
H.3.3 (Information Search and Retrieval)
H.2.8 (Database Applications)
Keywords: Data Lake; Data Preparation; Data Pre-Processing; Data Refinement; Recommender; Self-Service Analytics
Abstract

As a result of the paradigm shift away from rather rigid data warehouses to general-purpose data lakes, fully flexible self-service analytics is made possible. However, this also increases the complexity for domain experts who perform these analyses, since comprehensive data preparation tasks have to be implemented for each data access. For this reason, we developed BARENTS, a toolset that enables domain experts to specify data preparation tasks as ontology rules, which are then applied to the data involved. Although our evaluation of BARENTS showed that it is a valuable contribution to self-service analytics, a major drawback is that domain experts do not receive any semantic support when specifying the rules. In this paper, we therefore address how a recommender approach can provide additional support to domain experts by identifying supplementary datasets that might be relevant for their analyses or additional data processing steps to improve data refinement. This recommender operates on the set of data preparation rules specified in BARENTS, i.e., the accumulated knowledge of all domain experts is factored into the data preparation for each new analysis. Evaluation results indicate that such a recommender approach further contributes to the practicality of BARENTS and thus represents a step towards effective and efficient self-service analytics in data lakes.

Copyright: Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Contact: Send an e-mail to christoph.stach@ipvs.uni-stuttgart.de.
Department(s): University of Stuttgart, Institute of Parallel and Distributed Systems, Applications of Parallel and Distributed Systems
Entry date: August 2, 2023