Article in Journal ART-2021-03

Bibliography: Eichler, Rebecca; Giebler, Corinna; Gröger, Christoph; Schwarz, Holger; Mitschang, Bernhard: Modeling metadata in data lakes—A generic model.
In: Data & Knowledge Engineering. Vol. 136.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology.
pp. 1-17, English.
Elsevier, November 2021.
ISSN: 0169-023X; DOI: 10.1016/j.datak.2021.101931.
Article in Journal.
CR-Schema: H.2 (Database Management)
Keywords: Metadata management; Metadata model; Data lake; Data management; Data lake zones; Metadata classification
Abstract

Data contains important knowledge and has the potential to provide new insights. Due to new technological developments such as the Internet of Things, data is generated in increasing volumes. To deal with these data volumes and extract the data’s value, new concepts such as the data lake were created. The data lake is a data management platform designed to handle data at scale for analytical purposes. To prevent a data lake from becoming inoperable and turning into a data swamp, metadata management is needed. To store and handle metadata, a generic metadata model is required that can reflect the metadata of any potential metadata management use case, e.g., data versioning or data lineage. However, an evaluation of existing metadata models shows that none so far is sufficiently generic, because their design basis is unsuitable. In this work, we use a different design approach to build HANDLE, a generic metadata model for data lakes. The new metadata model supports the acquisition of metadata at varying levels of granularity and any metadata categorization, including the acquisition of both metadata that belongs to a specific data element and metadata that applies to a broader range of data. HANDLE supports the flexible integration of metadata and can reflect the same metadata in various ways according to the intended use. Furthermore, it is created for data lakes and therefore also supports data lake characteristics such as data lake zones. With these capabilities, HANDLE enables comprehensive metadata management in data lakes. HANDLE’s feasibility is shown through its application to an exemplary access use case and a prototypical implementation. By comparing HANDLE with existing models, we demonstrate that it can provide the same information as the other models while adding further capabilities needed for metadata management in data lakes.

Department(s): University of Stuttgart, Institute of Parallel and Distributed Systems, Applications of Parallel and Distributed Systems
Project(s): MetaMan
Entry date: December 1, 2021