Article in Proceedings INPROC-2023-02

Bibliography	Treder-Tschechlov, Dennis; Reimann, Peter; Schwarz, Holger; Mitschang, Bernhard: Approach to Synthetic Data Generation for Imbalanced Multi-class Problems with Heterogeneous Groups. In: Tagungsband der 20. Fachtagung Datenbanksysteme für Business, Technologie und Web (BTW 2023). University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology. Lecture Notes in Informatics (LNI), pp. 329-351, English. GI Gesellschaft für Informatik e.V. (GI), March 2023. Article in Proceedings (Conference Paper).
CR-Schema	H.2.8 (Database Applications)
Keywords	Machine learning; classification; data generation; real-world data characteristics
Abstract	To benchmark novel classification algorithms, these algorithms should be evaluated on data with characteristics that also appear in real-world use cases. Important data characteristics that often lead to challenges for classification approaches are multi-class imbalance and heterogeneous groups. Heterogeneous groups are sets of real-world entities, where the classification patterns may vary among different groups and where the groups are typically imbalanced in the data. Real-world data that comprise these characteristics are usually not publicly available, e.g., because they constitute sensitive patient information or due to privacy concerns. Further, the manifestations of the characteristics cannot be controlled specifically on real-world data. A more rigorous approach is to synthetically generate data such that different manifestations of the characteristics can be controlled as well. However, existing data generators are not able to generate data that feature both data characteristics, i.e., multi-class imbalance and heterogeneous groups. In this paper, we propose an approach that fills this gap, as it makes it possible to synthetically generate data that exhibit both characteristics. We make use of a taxonomy model that organizes real-world entities in domain-specific heterogeneous groups to generate data reflecting the characteristics of these groups. Further, we incorporate probability distributions to reflect the imbalances of multiple classes and groups from real-world use cases. The evaluation shows that our approach can generate data that feature the data characteristics multi-class imbalance and heterogeneous groups and that it allows different manifestations of these characteristics to be controlled.
Full text and other links	Link to the paper
Department(s)	University of Stuttgart, Institute of Parallel and Distributed Systems, Applications of Parallel and Distributed Systems
Project(s)	GSaME-NFG
Entry date	March 13, 2023
