Master's thesis MSTR-2024-127

Bibliographic data
Bothmann, Jan: Benchmarking pre-trained language models for schema-agnostic entity resolution.
Universität Stuttgart, Faculty of Computer Science, Electrical Engineering and Information Technology, Master's thesis no. 127 (2024).
105 pages, in English.

Abstract

Data integration brings together data from different sources to create a unified view of the data. A vital aspect of this process is Entity Resolution (ER), which identifies elements that correspond to the same entity across multiple datasets. The complexity of ER tasks varies significantly, because data differ in their characteristics and degree of structure, both of which influence the difficulty of the task. In this thesis, we evaluate how current state-of-the-art ER systems perform on semi-structured data. To this end, we created several semi-structured ER benchmarks covering data from various domains. Additionally, to explore how data characteristics and other factors affect the performance of matching systems, we developed the Benchmark Creator. This tool allows us and other users to generate benchmarks whose data exhibit specific characteristics that may influence the complexity of the ER task. We evaluated Ditto, Sudowoodo, and the GPT-4o mini model on the newly created benchmarks. Our evaluation reveals that Ditto and GPT-4o mini can effectively perform schema-agnostic ER on semi-structured data.

Full text and other links
Full text
Department(s): Universität Stuttgart, Institute for Parallel and Distributed Systems, Data Engineering
Supervisor: Herschel, Prof. Melanie
Entry date: 11 July 2025