Article in Proceedings INPROC-2023-02

Bibliographic data	Treder-Tschechlov, Dennis; Reimann, Peter; Schwarz, Holger; Mitschang, Bernhard: Approach to Synthetic Data Generation for Imbalanced Multi-class Problems with Heterogeneous Groups. In: Proceedings of the 20th Fachtagung Datenbanksysteme für Business, Technologie und Web (BTW 2019). Universität Stuttgart, Fakultät Informatik, Elektrotechnik und Informationstechnik. Lecture Notes in Informatics (LNI), pp. 329-351, in English. GI Gesellschaft für Informatik e.V. (GI), March 2023. Article in proceedings (conference paper).
CR classification	H.2.8 (Database Applications)
Keywords	Machine learning; classification; data generation; real-world data characteristics
Abstract	To benchmark novel classification algorithms, these algorithms should be evaluated on data with characteristics that also appear in real-world use cases. Important data characteristics that often lead to challenges for classification approaches are multi-class imbalance and heterogeneous groups. Heterogeneous groups are sets of real-world entities, where the classification patterns may vary among different groups and where the groups are typically imbalanced in the data. Real-world data that comprise these characteristics are usually not publicly available, e.g., because they constitute sensitive patient information or due to privacy concerns. Further, the manifestations of the characteristics cannot be controlled specifically on real-world data. A more rigorous approach is to synthetically generate data such that different manifestations of the characteristics can be controlled as well. However, existing data generators are not able to generate data that feature both data characteristics, i.e., multi-class imbalance and heterogeneous groups. In this paper, we propose an approach that fills this gap, as it allows data that exhibit both characteristics to be generated synthetically. We make use of a taxonomy model that organizes real-world entities in domain-specific heterogeneous groups to generate data reflecting the characteristics of these groups. Further, we incorporate probability distributions to reflect the imbalances of multiple classes and groups from real-world use cases. The evaluation shows that our approach can generate data that feature the data characteristics multi-class imbalance and heterogeneous groups and that it allows different manifestations of these characteristics to be controlled.
Full text and other links	Link to the paper
Department(s)	Universität Stuttgart, Institut für Parallele und Verteilte Systeme, Anwendersoftware
Project(s)	GSaME-NFG
Entry date	13 March 2023
